Metadata platform for data discovery, governance, and observability across the data stack
Apache-2.0
- Python
- Java
- TypeScript

About DataHub
DataHub is an open-source metadata platform for searching, discovering, and understanding data across a data ecosystem. Originally built at LinkedIn, it connects warehouses, lakes, BI tools, and pipelines into a unified metadata graph that powers data discovery, governance, and observability.
It ingests metadata through a push- or pull-based framework, with 80+ connectors for sources such as Snowflake, BigQuery, and dbt, and streams updates over Kafka so the graph stays current. Search, lineage, ownership, and dataset properties are exposed through the web UI, GraphQL and REST APIs, Python and Java SDKs, and a CLI. An MCP server lets AI assistants query metadata.
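As a rough illustration of the push path through the Python SDK, the sketch below builds a dataset URN and emits a description for it over REST. It assumes the `acryl-datahub` package is installed and a DataHub metadata service is reachable at `http://localhost:8080`; the server address, the dataset name `analytics.public.orders`, and the helper names `dataset_urn` and `push_description` are illustrative assumptions, not part of DataHub itself.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN of the form DataHub uses for datasets."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def push_description(
    urn: str,
    description: str,
    gms_server: str = "http://localhost:8080",  # assumed address, adjust as needed
) -> None:
    """Push a dataset-properties aspect for `urn` to a DataHub metadata service."""
    # Imported here so dataset_urn() stays usable without the SDK installed.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    emitter = DatahubRestEmitter(gms_server=gms_server)
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=urn,
            aspect=DatasetPropertiesClass(description=description),
        )
    )


if __name__ == "__main__":
    urn = dataset_urn("snowflake", "analytics.public.orders")  # hypothetical dataset
    print(urn)
    # Requires a running DataHub instance:
    # push_description(urn, "Orders fact table")
```

The SDK also ships its own URN builder, `make_dataset_urn` in `datahub.emitter.mce_builder`; the hand-written helper here only keeps the sketch readable without the package installed.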
DataHub can run locally via a Docker quickstart or on Kubernetes with Helm for production; a managed offering, DataHub Cloud, is also available. It suits data teams that want an open catalog with column-level lineage, governance policies, and quality checks. It is licensed under Apache-2.0.
Key features
- Unified metadata graph for search and discovery
- Column-level lineage and impact analysis
- 80+ connectors with push or pull ingestion
- Real-time metadata updates streamed over Kafka
- GraphQL and REST APIs, SDKs, CLI, and MCP server
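To make "column-level lineage and impact analysis" concrete, here is a toy sketch that treats lineage as a directed graph of columns and walks it breadth-first to find everything downstream of a changed column. This is not DataHub's implementation or API; the `DOWNSTREAM` mapping, the column names, and `impacted_columns` are all hypothetical and chosen only to illustrate the idea.

```python
from collections import deque

# Hypothetical column-level lineage: each key feeds the columns in its list.
DOWNSTREAM = {
    "raw.orders.amount": ["staging.orders.amount_usd"],
    "staging.orders.amount_usd": [
        "marts.revenue.daily_total",
        "marts.revenue.avg_order",
    ],
    "marts.revenue.daily_total": ["dashboard.revenue.total_tile"],
}


def impacted_columns(start, edges):
    """Return every column reachable downstream of `start`, breadth-first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        column = queue.popleft()
        for child in edges.get(column, []):
            if child not in seen:  # guard against cycles and repeated paths
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    for column in impacted_columns("raw.orders.amount", DOWNSTREAM):
        print(column)
```

In a real deployment the edges would come from the metadata graph (for example via the GraphQL API) rather than a hard-coded dictionary.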
Details
- First released: 2020
- Language: Python, Java, TypeScript
- Connectors: 80+ ingestion sources
- API: GraphQL, REST, SDKs, CLI, MCP
- Self-hosted: Docker or Kubernetes via Helm
- License: Apache-2.0
