Introduction
A data catalog is a centralized inventory of an organization’s data assets. It provides metadata management, data discovery, and governance capabilities essential for modern data teams. As organizations accumulate more data, a well-designed data catalog becomes critical for enabling self-service analytics and maintaining data governance.
This comprehensive guide covers data catalog architecture, implementation strategies, popular tools, and best practices. You’ll learn how to build a catalog that enables users to find, understand, and trust data while ensuring proper governance and security.
Data Catalog Fundamentals
What is a Data Catalog?
A data catalog serves as the single source of truth for data assets across an organization. It addresses the fundamental problem of data discovery: helping analysts, engineers, and business users find the data they need without asking around or searching through countless folders.
Modern data catalogs provide:
- Metadata Management: Technical, business, and operational metadata
- Data Discovery: Search, browse, and filter data assets
- Data Lineage: Track data flow from source to consumption
- Governance: Access control, data quality, and compliance
- Collaboration: User contributions, ratings, and documentation
# Data Catalog Core Concepts
"""
Data Catalog Components:
┌───────────────────────────────────────────────────────────────┐
│                         Data Catalog                          │
├───────────────────────────────────────────────────────────────┤
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐       │
│  │  Technical   │   │   Business   │   │ Operational  │       │
│  │   Metadata   │   │   Metadata   │   │   Metadata   │       │
│  ├──────────────┤   ├──────────────┤   ├──────────────┤       │
│  │ - Schema     │   │ - Definitions│   │ - Quality    │       │
│  │ - Types      │   │ - Owners     │   │ - Usage      │       │
│  │ - Relations  │   │ - Tags       │   │ - SLA        │       │
│  └──────────────┘   └──────────────┘   └──────────────┘       │
├───────────────────────────────────────────────────────────────┤
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐       │
│  │   Search &   │   │     Data     │   │    Access    │       │
│  │   Discovery  │   │    Lineage   │   │   Control    │       │
│  └──────────────┘   └──────────────┘   └──────────────┘       │
└───────────────────────────────────────────────────────────────┘
"""
Types of Metadata
Understanding different metadata types helps design comprehensive catalogs:
METADATA_TYPES = {
    "technical": {
        "description": "Technical information about data structures",
        "examples": [
            "Column names and data types",
            "Table schemas and relationships",
            "File formats and compression",
            "Storage location and partitioning",
            "Indexes and keys"
        ]
    },
    "business": {
        "description": "Business context and meaning",
        "examples": [
            "Business definitions",
            "Calculation formulas",
            "Business rules and constraints",
            "Department ownership",
            "Data sensitivity classification"
        ]
    },
    "operational": {
        "description": "Operational information about data",
        "examples": [
            "Data quality scores",
            "Last updated timestamps",
            "Update frequency (SLA)",
            "Usage statistics",
            "Processing costs"
        ]
    },
    "structural": {
        "description": "Information about data relationships",
        "examples": [
            "Table relationships (FK)",
            "Data lineage (upstream/downstream)",
            "Derived columns",
            "Dependencies"
        ]
    },
    "administrative": {
        "description": "Management and governance information",
        "examples": [
            "Data owners and stewards",
            "Access permissions",
            "Retention policies",
            "Compliance requirements",
            "Change history"
        ]
    }
}
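Completeness of these categories can itself be measured: how many of the expected fields are actually filled in for a given dataset? A minimal sketch, assuming illustrative field names per category (they are not a standard):

```python
# Hypothetical required fields per metadata category (illustrative, not a standard)
REQUIRED_FIELDS = {
    "technical": ["schema", "format", "location"],
    "business": ["business_definition", "owner", "tags"],
    "operational": ["quality_score", "last_updated", "update_frequency"],
}

def metadata_completeness(dataset: dict) -> dict:
    """Fraction of required fields that are filled in, per category."""
    return {
        category: sum(1 for f in fields if dataset.get(f)) / len(fields)
        for category, fields in REQUIRED_FIELDS.items()
    }

record = {"schema": ["id", "amount"], "format": "parquet", "owner": "finance"}
# technical: schema and format present, location missing -> 2/3
```

A score like this is a useful adoption metric on its own: it shows which datasets are discoverable in name only versus genuinely documented.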
Data Catalog Architecture
High-Level Architecture
# Data Catalog Architecture
"""
┌───────────────────────────────────────────────────────────────────┐
│                        Data Catalog System                        │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│   ┌─────────────┐      ┌─────────────┐      ┌─────────────┐       │
│   │   Ingest    │─────▶│    Store    │◀─────│   Query &   │       │
│   │   Layer     │      │    Layer    │      │   Search    │       │
│   └─────────────┘      └─────────────┘      └─────────────┘       │
│          │                    │                    │              │
│          ▼                    ▼                    ▼              │
│   ┌───────────────────────────────────────────────────────────┐   │
│   │                      Metadata Store                       │   │
│   │   ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐      │   │
│   │   │  Graph  │  │ Search  │  │Document │  │ Lineage │      │   │
│   │   │  Store  │  │  Index  │  │  Store  │  │  Store  │      │   │
│   │   └─────────┘  └─────────┘  └─────────┘  └─────────┘      │   │
│   └───────────────────────────────────────────────────────────┘   │
│                              │                                    │
│        ┌─────────────────────┼─────────────────────┐              │
│        ▼                     ▼                     ▼              │
│   ┌─────────────┐      ┌─────────────┐      ┌─────────────┐       │
│   │ Connectors  │      │     API     │      │     UI      │       │
│   │ (Sources)   │      │ (REST/SDK)  │      │  (Web App)  │       │
│   └─────────────┘      └─────────────┘      └─────────────┘       │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
Sources: Snowflake, BigQuery, PostgreSQL, S3, Kafka, Excel, etc.
"""
Metadata Storage
# Metadata storage options comparison
STORAGE_OPTIONS = {
    "relational": {
        "examples": ["PostgreSQL", "MySQL"],
        "pros": [
            "Mature technology",
            "ACID compliance",
            "SQL query support",
            "Easy integration"
        ],
        "cons": [
            "Limited graph support",
            "Harder to model lineage",
            "Not optimized for search"
        ],
        "best_for": "Structured metadata, access control"
    },
    "graph": {
        "examples": ["Neo4j", "Amazon Neptune"],
        "pros": [
            "Natural lineage modeling",
            "Relationship queries",
            "Flexible schema"
        ],
        "cons": [
            "Less mature ecosystem",
            "Steeper learning curve",
            "Scaling challenges"
        ],
        "best_for": "Lineage, relationships, impact analysis"
    },
    "search": {
        "examples": ["Elasticsearch", "OpenSearch"],
        "pros": [
            "Full-text search",
            "Fast queries",
            "Scalable"
        ],
        "cons": [
            "Not a primary store",
            "Eventual consistency",
            "Limited relationships"
        ],
        "best_for": "Discovery, search, indexing"
    },
    "document": {
        "examples": ["MongoDB", "DynamoDB"],
        "pros": [
            "Flexible schemas",
            "JSON support",
            "Easy updates"
        ],
        "cons": [
            "Limited querying",
            "Not great for relationships"
        ],
        "best_for": "Technical metadata, schemas"
    }
}
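In practice these options are combined rather than chosen exclusively: an authoritative primary store plus a search index kept in sync on every write. A minimal in-memory sketch of that write-through pattern, with plain dicts standing in for the real backends:

```python
class MetadataRepository:
    """Write-through pattern: the primary store is authoritative;
    the search index is updated on every save (in-memory stand-ins)."""

    def __init__(self):
        self.primary = {}       # stands in for PostgreSQL / MongoDB
        self.search_index = {}  # stands in for Elasticsearch

    def save(self, dataset_id: str, metadata: dict):
        self.primary[dataset_id] = metadata
        # Index only the fields users search on
        self.search_index[dataset_id] = {
            "name": metadata.get("name", ""),
            "tags": metadata.get("tags", []),
        }

    def search(self, term: str):
        return [
            ds_id for ds_id, doc in self.search_index.items()
            if term in doc["name"] or term in doc["tags"]
        ]
```

A production version would make the index update asynchronous (and tolerate the brief inconsistency noted in the comparison above), but the division of labor is the same.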
Implementation Approaches
Build Your Own
# Custom data catalog - core data model
from __future__ import annotations  # lets Dataset reference Column before it is defined

from dataclasses import dataclass, field
from typing import List
from datetime import datetime
import uuid


@dataclass
class Dataset:
    """Represents a dataset in the catalog."""
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    name: str = ""
    description: str = ""
    owner: str = ""
    department: str = ""

    # Technical metadata
    storage_type: str = ""  # table, file, stream, etc.
    location: str = ""
    format: str = ""  # parquet, csv, json, etc.
    schema: List[Column] = field(default_factory=list)

    # Business metadata
    business_definition: str = ""
    tags: List[str] = field(default_factory=list)
    classifications: List[str] = field(default_factory=list)

    # Operational metadata
    quality_score: float = 0.0
    last_updated: datetime = field(default_factory=datetime.utcnow)
    update_frequency: str = ""  # hourly, daily, weekly
    row_count: int = 0

    # Lineage
    upstream_datasets: List[str] = field(default_factory=list)
    downstream_datasets: List[str] = field(default_factory=list)

    # Record-keeping
    created_at: datetime = field(default_factory=datetime.utcnow)
    created_by: str = ""
    modified_at: datetime = field(default_factory=datetime.utcnow)
    modified_by: str = ""


@dataclass
class Column:
    """Represents a column/field in a dataset."""
    name: str = ""
    description: str = ""
    data_type: str = ""

    # Technical
    is_nullable: bool = True
    is_primary_key: bool = False
    is_foreign_key: bool = False
    default_value: str = ""

    # Business
    business_name: str = ""
    business_definition: str = ""

    # Quality
    completeness: float = 1.0  # % non-null
    uniqueness: float = 1.0   # % unique values


@dataclass
class DataLineage:
    """Represents a lineage edge between two datasets."""
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    source_id: str = ""
    target_id: str = ""
    transformation: str = ""  # SQL, code, etc.
    created_at: datetime = field(default_factory=datetime.utcnow)
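With lineage captured as source/target pairs like `DataLineage` above, impact analysis ("what breaks downstream if this table changes?") reduces to a graph traversal. A small sketch over `(source_id, target_id)` tuples; the dataset names are illustrative:

```python
from collections import defaultdict

def downstream_impact(edges, start):
    """All datasets transitively downstream of `start`,
    given lineage edges as (source_id, target_id) pairs."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

edges = [("raw.orders", "staging.orders"), ("staging.orders", "mart.revenue")]
# downstream_impact(edges, "raw.orders") -> {"staging.orders", "mart.revenue"}
```

This is also why graph stores appear in the storage comparison above: once lineage graphs grow to thousands of edges, these traversals become the dominant query pattern.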
Using Open Source Tools
# Apache Atlas - open source data catalog
# atlas_application.properties
configuration = """
# Atlas Server
atlas.server.http.port=21000
atlas.server.https.port=21443
# Graph Database
atlas.graph.storage.backend=hbase
atlas.graph.storage.hbase.table=apache_atlas_janus
# Search Index
atlas.search.index.backend=elasticsearch
# Authentication
atlas.authentication.method.kerberos=false
atlas.authentication.method.file=true
# Hook configurations for automatic metadata collection
atlas.hook.kafka.enabled=true
atlas.hook.kafka.bootstrap.servers=kafka:9092
"""
# Apache Atlas - Adding metadata via API
import requests
from requests.auth import HTTPBasicAuth


class AtlasClient:
    """Client for the Apache Atlas REST API."""

    def __init__(self, base_url: str, username: str, password: str):
        self.base_url = base_url
        self.auth = HTTPBasicAuth(username, password)
        self.headers = {'Content-Type': 'application/json'}

    def create_table(self, database: str, table: str, schema: list) -> dict:
        """Register a table in the catalog."""
        entity = {
            "entity": {
                "typeName": "hive_table",  # built-in Atlas type
                "attributes": {
                    "name": f"{database}.{table}",
                    "qualifiedName": f"{database}.{table}@prod",
                    "owner": "data_team",
                    "tableType": "EXTERNAL",
                    "viewOriginalText": "",
                    "viewExpandedText": "",
                    "columns": [
                        {
                            "typeName": "hive_column",
                            "attributes": {
                                "name": col["name"],
                                "type": col["type"],
                                "comment": col.get("description", ""),
                                "qualifiedName": f"{database}.{table}.{col['name']}@prod"
                            }
                        }
                        for col in schema
                    ]
                }
            }
        }
        response = requests.post(
            f"{self.base_url}/api/atlas/v2/entity",
            json=entity,
            auth=self.auth,
            headers=self.headers
        )
        return response.json()

    def search_by_name(self, query: str) -> list:
        """Search for entities by name."""
        response = requests.get(
            f"{self.base_url}/api/atlas/v2/search/basic",
            params={"query": query},
            auth=self.auth,
            headers=self.headers
        )
        return response.json().get("entities", [])

    def get_lineage(self, entity_id: str) -> dict:
        """Get lineage for an entity."""
        response = requests.get(
            f"{self.base_url}/api/atlas/v2/lineage/{entity_id}",
            auth=self.auth,
            headers=self.headers
        )
        return response.json()
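The nested column payload in `create_table` can be factored into a helper, which also makes it testable without a running Atlas server. A sketch using the same illustrative qualified-name convention (`db.table.column@env`):

```python
def build_column_entities(database: str, table: str, schema: list,
                          env: str = "prod") -> list:
    """Build Atlas column entity dicts from a simple schema list.
    The @env suffix convention mirrors the example above."""
    return [
        {
            "typeName": "hive_column",
            "attributes": {
                "name": col["name"],
                "type": col["type"],
                "comment": col.get("description", ""),
                "qualifiedName": f"{database}.{table}.{col['name']}@{env}",
            },
        }
        for col in schema
    ]

cols = build_column_entities("sales", "orders", [{"name": "id", "type": "bigint"}])
```

Qualified names must be unique per type in Atlas, so baking the environment into them is a common way to register the same table from dev, staging, and prod side by side.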
Data Discovery and Search
Implementing Search
# Elasticsearch-based search for data catalog
from elasticsearch import Elasticsearch
from typing import List, Dict, Optional


class CatalogSearch:
    """Search functionality for data catalog."""

    def __init__(self, es_client: Elasticsearch):
        self.es = es_client
        self.index = "data-catalog"

    def index_dataset(self, dataset: Dict):
        """Index a dataset for search."""
        document = {
            "name": dataset["name"],
            "description": dataset.get("description", ""),
            "owner": dataset.get("owner", ""),
            "department": dataset.get("department", ""),
            "tags": dataset.get("tags", []),
            "columns": [
                {
                    "name": col["name"],
                    "description": col.get("description", "")
                }
                for col in dataset.get("schema", [])
            ],
            "storage_type": dataset.get("storage_type", ""),
            "location": dataset.get("location", ""),
            "quality_score": dataset.get("quality_score", 0),
            "last_updated": dataset.get("last_updated")
        }
        self.es.index(index=self.index, id=dataset["id"], document=document)

    def search(self, query: str, filters: Optional[Dict] = None,
               size: int = 10) -> List[Dict]:
        """Search datasets."""
        # Build query: boost matches in name and tags over descriptions
        must = [
            {
                "multi_match": {
                    "query": query,
                    "fields": ["name^3", "description", "tags^2", "columns.name"],
                    "type": "best_fields",
                    "fuzziness": "AUTO"
                }
            }
        ]
        # Add filters
        if filters:
            filter_clauses = []
            if "department" in filters:
                filter_clauses.append({"term": {"department": filters["department"]}})
            if "owner" in filters:
                filter_clauses.append({"term": {"owner": filters["owner"]}})
            if "storage_type" in filters:
                filter_clauses.append({"term": {"storage_type": filters["storage_type"]}})
            if "min_quality" in filters:
                filter_clauses.append({"range": {"quality_score": {"gte": filters["min_quality"]}}})
            if filter_clauses:
                must.append({"bool": {"filter": filter_clauses}})
        search_body = {
            "query": {"bool": {"must": must}},
            "size": size,
            "highlight": {
                "fields": {
                    "name": {},
                    "description": {},
                    "columns.name": {}
                }
            }
        }
        response = self.es.search(index=self.index, body=search_body)
        return [
            {
                "id": hit["_id"],
                "score": hit["_score"],
                "source": hit["_source"],
                "highlights": hit.get("highlight", {})
            }
            for hit in response["hits"]["hits"]
        ]

    def suggest(self, prefix: str, size: int = 5) -> List[str]:
        """Autocomplete suggestions."""
        response = self.es.search(
            index=self.index,
            body={
                "query": {
                    "match_phrase_prefix": {
                        "name": {"query": prefix}
                    }
                },
                "size": size,
                "_source": ["name"]
            }
        )
        return [hit["_source"]["name"] for hit in response["hits"]["hits"]]
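The `^3` and `^2` suffixes in the `multi_match` fields are boosts: a match on `name` outweighs one on `tags`, which outweighs one on `description`. A toy in-memory analogue to illustrate the effect (the weights and sample documents are illustrative):

```python
def boosted_score(doc: dict, query: str) -> float:
    """Toy scorer mirroring the boosts in the multi_match query:
    a hit in name counts 3x, tags 2x, description 1x."""
    q = query.lower()
    score = 0.0
    if q in doc.get("name", "").lower():
        score += 3.0
    if any(q in t.lower() for t in doc.get("tags", [])):
        score += 2.0
    if q in doc.get("description", "").lower():
        score += 1.0
    return score

docs = [
    {"name": "orders", "tags": ["sales"], "description": "order facts"},
    {"name": "customers", "tags": ["orders"], "description": ""},
]
ranked = sorted(docs, key=lambda d: boosted_score(d, "orders"), reverse=True)
```

Tuning these weights is one of the highest-leverage knobs in catalog search: users almost always intend table names, not incidental mentions in descriptions.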
Data Governance Integration
Access Control
# Data governance - access control
from enum import Enum
from typing import Dict, Set


class SensitivityLevel(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"


class AccessPolicy:
    """Define access policies for data assets."""

    def __init__(self):
        self.policies = {}

    def add_policy(self, dataset_id: str, policy: Dict):
        """Add an access policy."""
        if dataset_id not in self.policies:
            self.policies[dataset_id] = []
        self.policies[dataset_id].append(policy)

    def check_access(self, user: str, dataset_id: str, action: str) -> bool:
        """Check if the user may perform the action on the dataset."""
        policies = self.policies.get(dataset_id, [])
        for policy in policies:
            # Grant only when conditions match and the action is allowed
            if self._check_policy_conditions(user, policy):
                if action in policy.get("allowed_actions", []):
                    return True
        return False

    def _check_policy_conditions(self, user: str, policy: Dict) -> bool:
        """Check if user matches policy conditions."""
        # Check roles
        if "required_roles" in policy:
            user_roles = self._get_user_roles(user)
            if not any(r in user_roles for r in policy["required_roles"]):
                return False
        # Check departments
        if "required_departments" in policy:
            user_dept = self._get_user_department(user)
            if user_dept not in policy["required_departments"]:
                return False
        # Check IP range
        if "ip_range" in policy:
            user_ip = self._get_user_ip(user)
            if not self._ip_in_range(user_ip, policy["ip_range"]):
                return False
        return True

    def _get_user_roles(self, user: str) -> Set[str]:
        # Implementation: fetch from identity provider
        return {"analyst"}

    def _get_user_department(self, user: str) -> str:
        # Implementation: fetch from identity provider
        return "engineering"

    def _get_user_ip(self, user: str) -> str:
        # Implementation: get from request context
        return "10.0.0.1"

    def _ip_in_range(self, ip: str, cidr: str) -> bool:
        # Implementation: check IP against CIDR
        return True


# Example policies
policy = AccessPolicy()

# Policy 1: Finance data - finance department only
policy.add_policy("revenue_dataset", {
    "description": "Finance team access to revenue data",
    "allowed_actions": ["read", "export"],
    "required_departments": ["finance", "executive"],
    "conditions": {
        "min_sensitivity": "confidential"
    }
})

# Policy 2: Customer PII - restricted access
policy.add_policy("customer_pii_dataset", {
    "description": "Restricted customer data",
    "allowed_actions": ["read"],
    "required_roles": ["data_scientist", "analyst"],
    "conditions": {
        "purpose": ["analytics", "reporting"],
        "requires_approval": True
    }
})
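Given policies registered as above, an access check is a policy lookup plus condition matching, denying by default. A condensed standalone sketch; the user directory here is a hard-coded stand-in for an identity provider:

```python
# Illustrative policy table and user directory (stand-ins for the
# AccessPolicy store and an identity provider)
POLICIES = {
    "revenue_dataset": [
        {"allowed_actions": ["read", "export"],
         "required_departments": ["finance", "executive"]},
    ],
}
USER_DEPARTMENTS = {"alice": "finance", "bob": "engineering"}

def check_access(user: str, dataset_id: str, action: str) -> bool:
    """Deny by default; grant when a policy's conditions and actions match."""
    for policy in POLICIES.get(dataset_id, []):
        dept_ok = ("required_departments" not in policy or
                   USER_DEPARTMENTS.get(user) in policy["required_departments"])
        if dept_ok and action in policy.get("allowed_actions", []):
            return True
    return False
```

The deny-by-default stance matters: a dataset with no policies at all should be unreachable, not wide open.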
Data Quality Integration
# Data quality scoring
from typing import Dict


class DataQualityScorer:
    """Calculate data quality scores."""

    def __init__(self):
        self.rules = []

    def add_rule(self, name: str, rule_type: str,
                 column: str, threshold: float):
        """Add a quality rule."""
        self.rules.append({
            "name": name,
            "type": rule_type,
            "column": column,
            "threshold": threshold
        })

    def calculate_score(self, dataset_id: str,
                        quality_results: Dict) -> Dict:
        """Calculate overall quality score."""
        total_rules = len(self.rules)
        passed_rules = sum(1 for r in self.rules
                           if quality_results.get(r["name"], False))
        score = (passed_rules / total_rules * 100) if total_rules > 0 else 100
        return {
            "dataset_id": dataset_id,
            "overall_score": score,
            "passed_rules": passed_rules,
            "total_rules": total_rules,
            "failed_rules": [
                r["name"] for r in self.rules
                if not quality_results.get(r["name"], False)
            ],
            "dimensions": self._calculate_dimensions(quality_results)
        }

    def _calculate_dimensions(self, results: Dict) -> Dict:
        """Calculate scores by quality dimension."""
        dimensions = {
            "completeness": [],
            "accuracy": [],
            "consistency": [],
            "timeliness": []
        }
        # Placeholder: a full implementation would map each rule to a
        # dimension and average its pass rate into the matching bucket.
        return dimensions


# Quality dimensions
QUALITY_DIMENSIONS = {
    "completeness": "Are all expected values present?",
    "accuracy": "Do values match reality?",
    "consistency": "Is data consistent across systems?",
    "timeliness": "Is data up-to-date?",
    "uniqueness": "Are there unwanted duplicates?",
    "validity": "Do values conform to expected formats?"
}
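These dimensions become actionable once each rule is mapped to one and pass rates are averaged per dimension, which is exactly the roll-up the `_calculate_dimensions` placeholder above leaves open. A sketch of that roll-up; the rule-to-dimension mapping is illustrative:

```python
def dimension_scores(rules: dict, results: dict) -> dict:
    """Average pass rate per quality dimension.
    `rules` maps rule name -> dimension; `results` maps rule name -> bool."""
    totals, passed = {}, {}
    for name, dim in rules.items():
        totals[dim] = totals.get(dim, 0) + 1
        passed[dim] = passed.get(dim, 0) + (1 if results.get(name) else 0)
    return {dim: passed[dim] / totals[dim] for dim in totals}

rules = {
    "order_id_not_null": "completeness",
    "amount_not_null": "completeness",
    "email_format": "validity",
    "no_duplicate_ids": "uniqueness",
}
results = {"order_id_not_null": True, "amount_not_null": False,
           "email_format": True, "no_duplicate_ids": False}
# -> completeness 0.5, validity 1.0, uniqueness 0.0
```

Per-dimension scores are usually more useful in the catalog UI than a single number, since "fresh but incomplete" and "complete but stale" call for different fixes.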
Popular Data Catalog Tools
Tool Comparison
# Data Catalog Tools Comparison
CATALOG_TOOLS = {
    "Amundsen": {
        "type": "Open Source",
        "provider": "Lyft (now community)",
        "strengths": [
            "Strong search with Elasticsearch",
            "Popular with data scientists",
            "Good Python integration",
            "Active community"
        ],
        "limitations": [
            "Requires significant setup",
            "Limited governance features",
            "Documentation can be sparse"
        ],
        "cloud_managed": False
    },
    "DataHub": {
        "type": "Open Source",
        "provider": "LinkedIn/Acryl Data",
        "strengths": [
            "Comprehensive metadata model",
            "Strong lineage support",
            "Graph-based discovery",
            "Active development"
        ],
        "limitations": [
            "Complex initial setup",
            "Steeper learning curve"
        ],
        "cloud_managed": ["Acryl"]
    },
    "Apache Atlas": {
        "type": "Open Source",
        "provider": "Apache",
        "strengths": [
            "Enterprise-grade",
            "Strong governance",
            "Hadoop ecosystem integration"
        ],
        "limitations": [
            "Complex setup",
            "Heavy for small teams",
            "UI needs improvement"
        ],
        "cloud_managed": ["Hortonworks", "Cloudera"]
    },
    "Alation": {
        "type": "Commercial",
        "provider": "Alation",
        "strengths": [
            "No-code search",
            "Strong governance",
            "Excellent business glossary",
            "Automated scanning"
        ],
        "limitations": [
            "Expensive",
            "Can be slow with large catalogs"
        ],
        "cloud_managed": True
    },
    "Collibra": {
        "type": "Commercial",
        "provider": "Collibra",
        "strengths": [
            "Enterprise-grade governance",
            "Strong workflow automation",
            "Excellent reporting"
        ],
        "limitations": [
            "Very expensive",
            "Complex configuration"
        ],
        "cloud_managed": True
    },
    "Atlan": {
        "type": "Modern SaaS",
        "provider": "Atlan",
        "strengths": [
            "Modern UX",
            "Slack integration",
            "Quick time-to-value",
            "Active development"
        ],
        "limitations": [
            "Newer product",
            "Less enterprise history"
        ],
        "cloud_managed": True
    }
}
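A comparison dict like this can drive a first-pass shortlist before deeper evaluation. A sketch; the criteria and the trimmed sample entries are illustrative, not a recommendation:

```python
def shortlist(tools: dict, open_source_only: bool = False,
              managed_required: bool = False) -> list:
    """Filter a tools dict like CATALOG_TOOLS by coarse criteria."""
    picks = []
    for name, info in tools.items():
        if open_source_only and info["type"] != "Open Source":
            continue
        # cloud_managed may be a bool or a list of vendors; both are truthy
        # when a managed offering exists
        if managed_required and not info["cloud_managed"]:
            continue
        picks.append(name)
    return sorted(picks)

sample = {
    "DataHub": {"type": "Open Source", "cloud_managed": ["Acryl"]},
    "Amundsen": {"type": "Open Source", "cloud_managed": False},
    "Alation": {"type": "Commercial", "cloud_managed": True},
}
```

The real decision, of course, hinges on the strengths and limitations above plus a proof of concept against your own sources.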
Best Practices
Implementation Checklist
# Data Catalog Implementation Checklist
IMPLEMENTATION_CHECKLIST = {
    "Phase 1: Foundation": [
        "Define catalog scope and objectives",
        "Identify key stakeholders and data owners",
        "Choose build vs buy approach",
        "Design metadata model",
        "Select technology stack"
    ],
    "Phase 2: Metadata Collection": [
        "Connect to primary data sources",
        "Implement automated metadata extraction",
        "Set up change data capture for metadata",
        "Create manual entry workflows",
        "Establish data ownership"
    ],
    "Phase 3: Discovery Features": [
        "Implement search functionality",
        "Build browsing interfaces",
        "Add data preview capabilities",
        "Create documentation templates",
        "Set up ratings and comments"
    ],
    "Phase 4: Governance": [
        "Define access control policies",
        "Implement data classification",
        "Set up data quality integration",
        "Create approval workflows",
        "Establish stewardship processes"
    ],
    "Phase 5: Adoption": [
        "Train data producers and consumers",
        "Create internal documentation",
        "Launch with high-value datasets",
        "Gather feedback iteratively",
        "Measure adoption metrics"
    ]
}

# Success Metrics
ADOPTION_METRICS = {
    "searches_per_day": "How often the catalog is used for discovery",
    "documentation_completed": "Percentage of datasets with full documentation",
    "owners_identified": "Percentage of assets with assigned owners",
    "quality_scores_populated": "Percentage of datasets with quality scores",
    "daily_active_users": "Number of unique daily users",
    "time_to_find_data": "Average time from search to finding relevant data"
}
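Several of these metrics can be computed directly from catalog contents rather than surveyed. A sketch over dataset dicts shaped like the catalog's records (the field names are illustrative):

```python
def adoption_metrics(datasets: list) -> dict:
    """Coverage-style metrics over a list of dataset dicts."""
    n = len(datasets) or 1  # avoid division by zero on an empty catalog
    return {
        "documentation_coverage":
            sum(1 for d in datasets if d.get("description")) / n,
        "owners_identified":
            sum(1 for d in datasets if d.get("owner")) / n,
        "quality_scores_populated":
            sum(1 for d in datasets if d.get("quality_score") is not None) / n,
    }

datasets = [
    {"description": "Orders fact table", "owner": "alice", "quality_score": 0.9},
    {"description": "", "owner": "", "quality_score": None},
]
```

Tracking these weekly from launch onward shows whether the catalog is actually filling in, not just accumulating empty entries.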
Conclusion
A well-implemented data catalog transforms how organizations use data. Key takeaways:
- Start with clear objectives: Define what problems the catalog should solve
- Automate metadata collection: Manual processes don’t scale
- Focus on adoption: The best catalog is one people actually use
- Integrate governance: Security and quality should be built-in
- Iterate based on feedback: Continuously improve based on user needs
Whether you build your own or use a commercial solution, investing in a data catalog pays dividends in data literacy, governance, and productivity.
Resources
- Amundsen Documentation
- DataHub Documentation
- Apache Atlas Documentation
- Data Catalog Best Practices - Google Cloud