Open Source Data Lake
The expensive trap in a data lake is not storage but format: once your tables are written in something only one engine can read, you have rebuilt warehouse lock-in on top of cheap object storage and lost the whole point. The open source projects here, from object stores and table formats to real-time OLAP engines, help you keep your data in open, engine-agnostic files, so you can query, swap compute, or add a new tool without rewriting a byte or asking a vendor for an export.

SeaweedFS
Horizontally scalable distributed storage for objects, files, and Iceberg tables

RustFS
Distributed object storage in Rust with S3 compatibility and OpenStack Swift support

Airbyte
Open-source ELT platform for moving data from APIs, databases, and files to warehouses, lakes, and AI apps

Apache Druid
High-performance real-time analytics database for fast ingest, ad hoc queries, and high concurrency

Citus
PostgreSQL extension that shards tables across a cluster for distributed SQL workloads

Apache Iceberg
Open table format that lets Spark, Trino, and Flink work with the same huge analytic tables at the same time

Delta Lake
Open table format that adds ACID transactions to lakehouse tables across Spark, Trino, and more

Apache Hudi
Open data lakehouse platform for ingesting, indexing, and managing data on cloud storage

Apache Pinot
Real-time distributed OLAP datastore for low-latency analytics on streaming and batch data