Apache Hudi

Open data lakehouse platform for ingesting, indexing, and managing data on cloud storage

Repository activity
  • Stars: 6.2k
  • Forks: 2.5k
  • Open Issues: 4.1k
License

Apache-2.0

Languages
  • Java
  • Scala
  • Jupyter Notebook
About Apache Hudi

Apache Hudi is an open data lakehouse platform built on a high-performance open table format. It ingests, indexes, stores, serves, transforms, and manages data on cloud storage in open formats, bringing transactions, upserts, deletes, and incremental processing to large datasets across multiple cloud environments.
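As a rough illustration of what an upsert looks like from the writer's side, the sketch below assembles the kind of option map a Spark job might hand to Hudi. The class name, method name, and sample table and field names are hypothetical; the configuration keys are ones Hudi's Spark datasource has documented, but exact names and defaults can differ between releases, so check them against the version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds write options for an upsert into a Hudi table.
// Only plain Java collections are used, so this runs without Spark or Hudi.
final class HudiUpsertOptions {

    static Map<String, String> upsertOptions(String tableName,
                                             String recordKeyField,
                                             String orderingField,
                                             String partitionField) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.table.name", tableName);
        // "upsert" updates records whose key already exists and inserts the rest.
        opts.put("hoodie.datasource.write.operation", "upsert");
        // Field that uniquely identifies a record.
        opts.put("hoodie.datasource.write.recordkey.field", recordKeyField);
        // Field used to pick the latest version when two records share a key
        // (assumption: newer releases may prefer a differently named setting).
        opts.put("hoodie.datasource.write.precombine.field", orderingField);
        opts.put("hoodie.datasource.write.partitionpath.field", partitionField);
        return opts;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        upsertOptions("trips", "uuid", "ts", "city")
            .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With Spark and the Hudi bundle on the classpath, a map like this would typically be passed along the lines of `df.write().format("hudi").options(opts).mode(SaveMode.Append).save(basePath)`, where `df` and `basePath` are placeholders.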

Built-in ingestion tools cover Apache Spark and Apache Flink, with a Kafka Connect sink for external sources. Hudi adds timeline metadata, automatic file sizing, savepoints, and schema evolution, plus a scalable indexing subsystem with record-level and expression indexes. A single table can serve snapshot, incremental, change-data-capture, time-travel, and read-optimized queries.
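The query types listed above are usually selected through read options rather than separate tables. The following is a minimal sketch of how those options might be assembled for a Spark read; the class and method names are hypothetical, and while the keys follow Hudi's documented Spark datasource options, they should be verified against the release in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: one option map per query type, using plain Java only.
final class HudiQueryOptions {

    private static final String QUERY_TYPE = "hoodie.datasource.query.type";
    private static final String BEGIN_INSTANT = "hoodie.datasource.read.begin.instanttime";

    // Latest committed view of the table.
    static Map<String, String> snapshot() {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put(QUERY_TYPE, "snapshot");
        return opts;
    }

    // Reads only compacted base files on merge-on-read tables.
    static Map<String, String> readOptimized() {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put(QUERY_TYPE, "read_optimized");
        return opts;
    }

    // Records changed after the given commit instant.
    static Map<String, String> incremental(String beginInstant) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put(QUERY_TYPE, "incremental");
        opts.put(BEGIN_INSTANT, beginInstant);
        return opts;
    }

    // Incremental read that returns change-data-capture rows.
    static Map<String, String> cdc(String beginInstant) {
        Map<String, String> opts = incremental(beginInstant);
        opts.put("hoodie.datasource.query.incremental.format", "cdc");
        return opts;
    }

    // Table state as of a past instant.
    static Map<String, String> timeTravel(String asOfInstant) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("as.of.instant", asOfInstant);
        return opts;
    }

    public static void main(String[] args) {
        // The instant value is a made-up example timestamp.
        System.out.println(cdc("20240101000000"));
    }
}
```

With Spark available, such a map would typically be used as `spark.read().format("hudi").options(opts).load(basePath)`, where `spark` and `basePath` are placeholders.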

Table services such as cleaning, clustering, compaction, and catalog sync run either inline with the writers or as independent jobs. Catalog sync supports Apache Hive Metastore, AWS Glue, Google BigQuery, and Apache XTable. Hudi is an Apache Software Foundation project under the Apache 2.0 license.

Key features

  • Built-in ingestion for Apache Spark and Apache Flink
  • Kafka Connect sink for external data sources
  • Snapshot, incremental, CDC, time-travel, and read-optimized queries
  • Atomic commits with rollback and restore support
  • Catalog sync with Hive Metastore, AWS Glue, BigQuery, and Apache XTable

Details

First released
2016
Language
Java · Scala
Type
Open table format / lakehouse
Deployment
Apache Spark · Apache Flink
License
Apache 2.0
Governance
Apache Software Foundation