Open data lakehouse platform for ingesting, indexing, and managing data on cloud storage
Apache-2.0
- Java
- Scala
- Jupyter Notebook

About Apache Hudi
Apache Hudi is an open data lakehouse platform built on a high-performance open table format. It ingests, indexes, stores, serves, transforms, and manages data on cloud storage in open formats, bringing transactions, upserts, deletes, and incremental processing to large datasets across multiple cloud environments.
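To make the upsert capability concrete, here is a minimal sketch of the options a Spark job might pass when upserting into a Hudi table. The helper name `hudi_upsert_options`, the table name `trips`, and the column names are hypothetical. The option keys are the ones documented for the Hudi Spark datasource, but their names and defaults can differ between Hudi releases, so check them against the version you run.

```python
def hudi_upsert_options(table_name, record_key, precombine_field, partition_field):
    """Build DataFrameWriter options for an upsert into a Hudi table.

    Hypothetical helper for illustration; only the option keys come from Hudi.
    """
    return {
        "hoodie.table.name": table_name,
        # "upsert" updates rows whose record key already exists, inserts the rest.
        "hoodie.datasource.write.operation": "upsert",
        # Column that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Column used to pick the latest version when keys collide.
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Column used to lay out partitions on storage.
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


# Assumed example table and columns.
opts = hudi_upsert_options("trips", "trip_id", "updated_at", "city")

# With a Spark session that has the Hudi bundle on its classpath, the options
# would be applied roughly like this (not executed here):
#   df.write.format("hudi").options(**opts).mode("append").save("/tmp/hudi/trips")
```

Keeping the options in a plain dictionary means the same settings can be reused by batch and streaming writers.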
Built-in ingestion tools cover Apache Spark and Apache Flink, with a Kafka Connect sink for external sources. Hudi adds timeline metadata, automatic file sizing, savepoints, and schema evolution, plus a scalable indexing subsystem with record-level and expression indexes. A single table can serve snapshot, incremental, change-data-capture, time-travel, and read-optimized queries.
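The query types listed above are usually selected through read options on the Spark datasource. The sketch below maps each one to the options it would need. The helper `hudi_read_options` and the instant values are hypothetical. The option keys follow the Hudi Spark datasource documentation, but they may vary by release, and CDC reads also assume the table was written with change-data-capture enabled.

```python
def hudi_read_options(query_type="snapshot", begin_instant=None,
                      as_of_instant=None, cdc=False):
    """Build DataFrameReader options for one of Hudi's query types.

    Hypothetical helper for illustration; only the option keys come from Hudi.
    """
    allowed = {"snapshot", "incremental", "read_optimized"}
    if query_type not in allowed:
        raise ValueError(f"unknown query type: {query_type!r}")

    opts = {"hoodie.datasource.query.type": query_type}

    if query_type == "incremental":
        # Incremental reads return changes committed after a starting instant.
        if begin_instant is None:
            raise ValueError("incremental queries need begin_instant")
        opts["hoodie.datasource.read.begin.instanttime"] = begin_instant
        if cdc:
            # CDC is expressed as an incremental query in the "cdc" format.
            opts["hoodie.datasource.query.incremental.format"] = "cdc"
    elif cdc:
        raise ValueError("cdc is only valid with incremental queries")

    if as_of_instant is not None:
        # Time travel is a snapshot read pinned to an earlier instant.
        if query_type != "snapshot":
            raise ValueError("as_of_instant is only valid with snapshot queries")
        opts["as.of.instant"] = as_of_instant

    return opts


# Assumed instant values, in Hudi's timestamp-style instant format.
snapshot = hudi_read_options()
incremental = hudi_read_options("incremental", begin_instant="20240101000000000")
changes = hudi_read_options("incremental", begin_instant="20240101000000000", cdc=True)
time_travel = hudi_read_options(as_of_instant="20240101000000000")
read_optimized = hudi_read_options("read_optimized")

# Applied to a Spark session with the Hudi bundle available (not executed here):
#   spark.read.format("hudi").options(**incremental).load("/tmp/hudi/trips")
```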
Table services such as cleaning, clustering, compaction, and catalog sync run either inside the writers or independently. Catalog sync targets Apache Hive Metastore, AWS Glue, Google BigQuery, and Apache XTable. Hudi is an Apache Software Foundation project under the Apache 2.0 license.
Key features
- Built-in ingestion for Apache Spark and Apache Flink
- Kafka Connect sink for external data sources
- Snapshot, incremental, CDC, time-travel, and read-optimized queries
- Atomic commits with rollback and restore support
- Catalog sync with Hive Metastore, AWS Glue, BigQuery, and Apache XTable
Details
- First released: 2016
- Language: Java · Scala
- Type: Open table format / lakehouse
- Deployment: Apache Spark · Apache Flink
- License: Apache 2.0
- Governance: Apache Software Foundation
