Apache Hudi

Open data lakehouse platform for ingesting, indexing, and managing data on cloud storage

Repository activity

Stars: 6.2k
Forks: 2.5k
Open Issues: 4.1k

License

Apache-2.0

Languages

Java
Scala
Jupyter Notebook

Get it: Website · GitHub

About Apache Hudi

Apache Hudi is an open data lakehouse platform built on a high-performance open table format. It ingests, indexes, stores, serves, transforms, and manages data on cloud storage in open formats, bringing transactions, upserts, deletes, and incremental processing to large datasets across multiple cloud environments.

Built-in ingestion tools cover Apache Spark and Apache Flink, with a Kafka Connect sink for external sources. Hudi adds timeline metadata, automatic file sizing, savepoints, and schema evolution, plus a scalable indexing subsystem with record-level and expression indexes. A single table can serve snapshot, incremental, change-data-capture (CDC), time-travel, and read-optimized queries.
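How several query types can be answered from one table is easier to see in a toy model. The sketch below is not Hudi's API or storage layout: `ToyTimelineTable` and its methods are hypothetical names, and it only assumes that each commit is an ordered instant on a timeline carrying a batch of upserts keyed by record key. Under that assumption, a snapshot query replays every commit, a time-travel query replays commits up to an instant, and an incremental query returns only what later commits changed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory model of a table with a commit timeline (illustrative only, not Hudi code). */
final class ToyTimelineTable {
    // Commit instant -> records (record key -> value) written by that commit.
    // A TreeMap keeps the instants in timeline order.
    private final TreeMap<Long, Map<String, String>> commits = new TreeMap<>();

    /** Records one batch of upserts under a strictly increasing commit instant. */
    void commit(long instant, Map<String, String> upserts) {
        if (!commits.isEmpty() && instant <= commits.lastKey()) {
            throw new IllegalArgumentException("commit instants must increase");
        }
        commits.put(instant, new LinkedHashMap<>(upserts));
    }

    /** Time-travel query: latest value per key as of the given instant (inclusive). */
    Map<String, String> asOf(long instant) {
        Map<String, String> view = new TreeMap<>();
        for (Map<String, String> batch : commits.headMap(instant, true).values()) {
            view.putAll(batch); // later commits overwrite earlier values for the same key
        }
        return view;
    }

    /** Snapshot query: latest value per key across all commits. */
    Map<String, String> snapshot() {
        return commits.isEmpty() ? new TreeMap<>() : asOf(commits.lastKey());
    }

    /** Incremental query: records written by commits strictly after the given instant. */
    Map<String, String> incrementalSince(long instant) {
        Map<String, String> changed = new TreeMap<>();
        for (Map<String, String> batch : commits.tailMap(instant, false).values()) {
            changed.putAll(batch);
        }
        return changed;
    }

    public static void main(String[] args) {
        ToyTimelineTable table = new ToyTimelineTable();
        table.commit(1L, Map.of("a", "v1", "b", "v1"));
        table.commit(2L, Map.of("b", "v2", "c", "v1"));
        System.out.println(table.snapshot());           // {a=v1, b=v2, c=v1}
        System.out.println(table.incrementalSince(1L)); // {b=v2, c=v1}
        System.out.println(table.asOf(1L));             // {a=v1, b=v1}
    }
}
```

The model leaves out deletes, file layout, indexing, and concurrency; its only purpose is to show why keeping the timeline, rather than just the latest state, lets downstream jobs read only the changes since their last run.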

Table services such as cleaning, clustering, compaction, and catalog sync run either inside the writers or independently. Catalog sync targets include Apache Hive Metastore, AWS Glue, Google BigQuery, and Apache XTable. Hudi is an Apache Software Foundation project under the Apache 2.0 license.

Key features

Built-in ingestion for Apache Spark and Apache Flink
Kafka Connect sink for external data sources
Snapshot, incremental, CDC, time-travel, and read-optimized queries
Atomic commits with rollback and restore support
Catalog sync with Hive Metastore, AWS Glue, BigQuery, and Apache XTable

Details

First released: 2016
Language: Java · Scala
Type: Open table format / lakehouse
Deployment: Apache Spark · Apache Flink
License: Apache 2.0
Governance: Apache Software Foundation