Apache Iceberg

Open table format that lets Spark, Trino, and Flink read the same huge analytic tables at once

Repository activity

Stars: 9k
Forks: 3.3k
Open Issues: 755

License

Apache-2.0

Languages

Java
Scala
Python

Get it: Website · GitHub · Docs · Spec

About Apache Iceberg

Apache Iceberg is an open table format for very large analytic datasets. It gives big data the reliability and behavior of SQL tables, so engines like Spark, Trino, Flink, Presto, Hive, and Impala can safely query and write the same tables at the same time.

Iceberg handles tables backed by Parquet, Avro, and ORC, reads Parquet into Arrow memory, and works with the Hive metastore. Engine connectors plug Iceberg into Spark, Flink, and Hive, so existing pipelines can adopt it without changing their underlying file storage.

The table format is stable and gains new capabilities with each release. This project is the Java implementation, and separate clients cover Go, Python, Rust, and C++ for teams working outside the JVM.

Key features

Concurrent reads and writes on the same tables from Spark, Trino, Flink, Presto, Hive, and Impala
Tables backed by Parquet, Avro, and ORC files
Reads Parquet data into Arrow memory
Hive metastore integration for table metadata
Engine connectors for Spark, Flink, and Hive

Details

First released: 2018
Type: Open table format
Language: Java
Storage: Parquet · Avro · ORC
Compatibility: Spark · Trino · Flink · Presto · Hive
Governance: Apache Software Foundation