DataHub

Metadata platform for data discovery, governance, and observability across the data stack

Repository activity

Stars: 12.1k
Forks: 3.5k
Open Issues: 911


License

Apache-2.0

Languages

Python
Java
TypeScript

Get it: Website, GitHub, Docs, Python Package

About DataHub

DataHub is an open source metadata platform for searching, discovering, and understanding data across a data ecosystem. Originally built at LinkedIn, it connects warehouses, lakes, BI tools, and pipelines into a unified metadata graph that powers data discovery, governance, and observability.

It ingests metadata through a push- or pull-based framework with 80+ connectors for sources such as Snowflake, BigQuery, and dbt, and streams updates over Kafka so the graph stays current. Search, lineage, ownership, and dataset properties are exposed through the web UI, GraphQL and REST APIs, Python and Java SDKs, and a CLI. An MCP server also lets AI assistants query metadata.
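As a rough illustration of the GraphQL access path mentioned above, the sketch below builds a dataset search request using only the Python standard library. The endpoint URL (a local instance on port 8080 with the path `/api/graphql`), the query shape, and the helper names `build_search_request` and `search_datasets` are assumptions made for this example; check them against the GraphQL schema of the deployment you are targeting.

```python
import json
import urllib.request

# Assumption: a locally running instance; adjust host, port, and path as needed.
GRAPHQL_URL = "http://localhost:8080/api/graphql"

# Assumption: a search query shaped like this is accepted by the server's schema.
SEARCH_QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_search_request(text, token=None, count=10):
    """Build (but do not send) a POST request that searches datasets for `text`."""
    payload = {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {"type": "DATASET", "query": text, "start": 0, "count": count}
        },
    }
    headers = {"Content-Type": "application/json"}
    if token:
        # Hypothetical personal access token, sent as a bearer token.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def search_datasets(text, token=None, count=10):
    """Send the search request and return the decoded JSON response."""
    request = build_search_request(text, token=token, count=count)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `search_datasets("orders")` against a running instance should return a JSON document whose matching entities carry URNs. Keeping request construction separate from sending makes the payload easy to inspect without a live server.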

DataHub can run locally via a Docker quickstart or in production on Kubernetes with Helm; a managed offering, DataHub Cloud, is also available. It suits data teams that want an open catalog with column-level lineage, governance policies, and quality checks. It is licensed under Apache-2.0.

Key features

Unified metadata graph for search and discovery
Column-level lineage and impact analysis
80+ connectors with push or pull ingestion
Real-time metadata updates streamed over Kafka
GraphQL and REST APIs, SDKs, CLI, and MCP server
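
To make the idea of column-level lineage and impact analysis concrete, here is a toy model, not DataHub's implementation: columns are nodes identified by URN-style strings, lineage edges point from upstream to downstream columns, and a breadth-first traversal lists everything affected by a change. The dataset names, column names, and helper functions are hypothetical and exist only for this sketch.

```python
from collections import deque


def dataset_urn(platform, name, env="PROD"):
    """Format a dataset identifier in the urn:li:dataset style."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def column_urn(ds_urn, field):
    """Format a column identifier that nests the dataset URN."""
    return f"urn:li:schemaField:({ds_urn},{field})"


def downstream_impact(edges, start):
    """Return every column reachable downstream of `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical three-hop pipeline: warehouse table -> dbt model -> report table.
raw = dataset_urn("snowflake", "shop.raw.orders")
staged = dataset_urn("dbt", "shop.staging.stg_orders")
report = dataset_urn("bigquery", "analytics.reports.revenue")

# Upstream column -> list of columns derived from it.
lineage = {
    column_urn(raw, "amount"): [column_urn(staged, "amount_usd")],
    column_urn(staged, "amount_usd"): [column_urn(report, "total_revenue")],
}

impacted = downstream_impact(lineage, column_urn(raw, "amount"))
```

Here `impacted` lists the staging column followed by the report column, which is the set a reviewer would want to inspect before changing `amount` in the raw table.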

Details

First released: 2020
Languages: Python, Java, TypeScript
Connectors: 80+ ingestion sources
API: GraphQL, REST, SDKs, CLI, MCP
Self-hosted: Docker or Kubernetes via Helm
License: Apache-2.0