Introduction: The Data Quality Crisis in Modern Pipelines
Your analytics team just discovered that customer revenue figures have been inflated by 15% for the past three months. The root cause? A null value handling bug in a transformation that nobody caught. By the time the issue surfaced, stakeholders had already made business decisions based on corrupted data.
This scenario plays out across organizations daily. As data pipelines grow more complex, spanning multiple sources, transformation layers, and downstream consumers, the surface area for quality issues expands exponentially. Traditional approaches to data quality, like manual spot-checks or reactive debugging, don’t scale.
The modern data stack demands a different approach: proactive, automated data observability built into the transformation pipeline itself. This is where Great Expectations and dbt converge to solve a critical problem that neither tool alone can fully address.
Why Data Quality Matters in Your Data Stack
Before diving into tools, let’s establish why this matters. Data quality failures don’t just create technical debt; they erode trust in your entire analytics function. When stakeholders can’t rely on your data, they stop using it. When they stop using it, your data platform becomes a cost center rather than a strategic asset.
The challenge is that data quality isn’t monolithic. It encompasses multiple dimensions:
- Completeness: Are all expected records present?
- Accuracy: Do values match their real-world counterparts?
- Consistency: Do values conform to defined formats and business rules?
- Timeliness: Is data available when needed?
- Uniqueness: Are there unexpected duplicates?
- Validity: Do values fall within acceptable ranges?
A robust data quality strategy must address all these dimensions across your entire pipeline, from raw ingestion through final analytics tables.
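Most of these dimensions can be measured mechanically. As a rough illustration in plain Python (no particular framework assumed, and the field names are only examples), here is how completeness, uniqueness, and validity might be scored for a batch of customer records:

```python
def quality_metrics(rows, required_field, id_field, valid_range, range_field):
    """Compute simple completeness, uniqueness, and validity scores
    for a batch of records, where each row is a dict."""
    total = len(rows)
    # Completeness: fraction of rows where the required field is populated
    complete = sum(1 for r in rows if r.get(required_field) is not None) / total
    # Uniqueness: fraction of distinct values in the identifier column
    unique = len({r[id_field] for r in rows}) / total
    # Validity: fraction of values inside the acceptable range
    lo, hi = valid_range
    valid = sum(1 for r in rows if lo <= r.get(range_field, lo - 1) <= hi) / total
    return {"completeness": complete, "uniqueness": unique, "validity": valid}

rows = [
    {"customer_id": 1, "email": "a@x.com", "account_age_days": 120},
    {"customer_id": 2, "email": None,      "account_age_days": 40},
    {"customer_id": 2, "email": "c@x.com", "account_age_days": -5},  # dup id, bad range
    {"customer_id": 4, "email": "d@x.com", "account_age_days": 900},
]
metrics = quality_metrics(rows, "email", "customer_id", (0, 36500), "account_age_days")
print(metrics)  # each dimension scores 0.75 for this batch
```

Timeliness and accuracy need external context (a clock and a source of truth, respectively), which is part of why a framework helps.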
Understanding Great Expectations: A Data Validation Framework
Great Expectations is a Python-based data validation framework that treats data quality as a first-class concern. Rather than burying quality checks in transformation code, Great Expectations makes them explicit, testable, and observable.
Core Concepts
Expectations are assertions about your data. They’re declarative statements about what your data should look like:
```yaml
# Example: Customer table expectations
- expect_table_row_count_to_be_between:
    min_value: 1000
    max_value: 10000000
- expect_column_values_to_not_be_null:
    column: customer_id
- expect_column_values_to_be_unique:
    column: email
- expect_column_values_to_match_regex:
    column: email
    regex: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
- expect_column_values_to_be_between:
    column: account_age_days
    min_value: 0
    max_value: 36500
```
Expectation Suites bundle related expectations into reusable, versioned collections. They serve as the contract for a dataset: the single source of truth for what quality looks like.
Validation Results capture the outcome of running expectations against actual data. They include not just pass/fail status, but detailed metrics: how many values failed, what the actual distribution was, and whether the failure was severe enough to warrant action.
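To make the shape of a validation result concrete, here is a minimal sketch in plain Python (not the Great Expectations API itself) that loosely mirrors its result format, including the `mostly` tolerance idea: pass/fail plus the metrics needed to judge severity:

```python
def validate_not_null(values, mostly=1.0):
    """Check a column for nulls and return a result dict that loosely
    mirrors a Great Expectations validation result: a success flag
    plus detailed metrics, not just pass/fail."""
    element_count = len(values)
    unexpected_count = sum(1 for v in values if v is None)
    return {
        "success": (1 - unexpected_count / element_count) >= mostly,
        "result": {
            "element_count": element_count,
            "unexpected_count": unexpected_count,
            "unexpected_percent": 100.0 * unexpected_count / element_count,
        },
    }

# 1 null out of 5 values: fails at mostly=1.0, passes at mostly=0.75
strict = validate_not_null([1, 2, None, 4, 5])
lenient = validate_not_null([1, 2, None, 4, 5], mostly=0.75)
print(strict["success"], lenient["success"])  # False True
```

The metrics are what make results actionable: a 0.1% null rate and a 40% null rate both "fail", but they warrant very different responses.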
Why Great Expectations Excels at Validation
Great Expectations provides several advantages over ad-hoc validation logic:
- Declarative, not imperative: You describe what should be true, not how to check it. This makes expectations readable and maintainable.
- Comprehensive expectation library: 100+ built-in expectations cover most common validation scenarios without custom code.
- Automatic documentation: Expectations generate human-readable documentation of your data contracts automatically.
- Flexible execution: Run validations in batch, streaming, or real-time contexts. Execute against Pandas DataFrames, Spark, SQL databases, or cloud data warehouses.
- Rich observability: Validation results include detailed metrics, not just pass/fail. You can track quality trends over time.
- Extensibility: Custom expectations let you encode domain-specific business logic.
Understanding dbt: Transformation with Built-in Testing
dbt (data build tool) has revolutionized how data teams approach transformations. It provides a framework for writing, testing, and documenting SQL transformations as code.
dbt’s Testing Capabilities
dbt includes native testing functionality designed to catch transformation bugs:
```sql
-- models/staging/stg_customers.sql
{{ config(materialized='table') }}

select
    customer_id,
    email,
    created_at,
    updated_at
from {{ source('raw', 'customers') }}
where deleted_at is null
```
```yaml
# models/staging/stg_customers.yml
version: 2

models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - unique
          - not_null
      - name: created_at
        tests:
          - not_null
```
dbt’s native tests are SQL-based and focus on structural integrity: uniqueness, non-nullness, referential integrity, and custom SQL assertions.
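Under the hood, each generic test compiles to a query that returns failing rows; the test passes when that query returns nothing. The `unique` test, for example, compiles to roughly the query below, sketched here against an in-memory SQLite table (the exact SQL dbt generates varies by adapter):

```python
import sqlite3

# Roughly the query dbt compiles for a `unique` test on customer_id:
# return every value that appears more than once.
UNIQUE_TEST_SQL = """
    select customer_id, count(*) as n
    from stg_customers
    group by customer_id
    having count(*) > 1
"""

conn = sqlite3.connect(":memory:")
conn.execute("create table stg_customers (customer_id integer, email text)")
conn.executemany(
    "insert into stg_customers values (?, ?)",
    [(1, "a@x.com"), (2, "b@x.com"), (2, "dup@x.com")],
)
failures = conn.execute(UNIQUE_TEST_SQL).fetchall()
print(failures)  # [(2, 2)] -> test fails: customer_id 2 appears twice
```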
The Strengths and Limitations of dbt Testing
Strengths:
- Integrated into the transformation workflow
- Version controlled alongside models
- Lightweight and fast to execute
- Excellent for catching structural issues
- Built-in documentation and lineage
Limitations:
- Tests are SQL-based, limiting expressiveness for complex validations
- No built-in support for statistical validation (distributions, outliers, anomalies)
- Limited observability: tests pass or fail, but don’t capture detailed metrics
- No automatic documentation of data contracts
- Difficult to implement cross-table or cross-model validations
- No native support for conditional expectations or threshold-based validation
The Complementary Relationship: Great Expectations + dbt
This is where the two tools become powerful together. They solve different problems in the data quality spectrum:
| Aspect | dbt Tests | Great Expectations |
|---|---|---|
| Focus | Structural integrity | Comprehensive validation |
| Execution | SQL-based | Python-based, multi-backend |
| Observability | Pass/fail | Detailed metrics and trends |
| Documentation | Model-level | Data contract-level |
| Complexity | Simple rules | Complex business logic |
| Statistical | Limited | Extensive |
| Extensibility | Custom SQL | Custom Python |
A Practical Integration Pattern
The ideal approach uses both tools in a complementary workflow:
1. dbt handles structural validation: Use dbt’s native tests to catch schema issues, nullness violations, and uniqueness problems. These are fast, lightweight checks that should pass before any data leaves the transformation layer.
2. Great Expectations handles semantic validation: After dbt tests pass, Great Expectations validates business logic, statistical properties, and data contracts. This includes range checks, pattern matching, distribution analysis, and cross-table relationships.
3. Great Expectations provides observability: Capture detailed validation metrics that feed into monitoring dashboards and alerting systems.
Here’s what this looks like in practice:
```
Raw Data
   ↓
[dbt: Structural Tests]
   ↓ (pass)
Transformed Data
   ↓
[Great Expectations: Semantic Validation]
   ↓ (pass)
Analytics-Ready Data
   ↓
[Monitoring & Alerting]
```
Integration Points: The dbt-expectations Package
The dbt-expectations package (a community-maintained dbt package from Calogica) bridges these tools by porting many of Great Expectations’ validations to dbt macros that run as native dbt tests.
Using dbt-expectations
Install the package in your `packages.yml`:

```yaml
packages:
  - package: calogica/dbt_expectations
    version: 0.8.0
```
Then use Great Expectations-style tests in your dbt YAML:
```yaml
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - dbt_expectations.expect_column_values_to_be_unique
          - dbt_expectations.expect_column_values_to_not_be_null
      - name: order_amount
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 1000000
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: '^\d+\.\d{2}$'
      - name: customer_id
        tests:
          - dbt_expectations.expect_column_values_to_be_in_set:
              value_set: [1, 2, 3, 4, 5]
  - name: stg_customers
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1000
          max_value: 10000000
```
When to Use dbt-expectations vs. Standalone Great Expectations
Use dbt-expectations when:
- You want validation logic version-controlled with your dbt project
- Your validation needs are primarily structural or pattern-based
- You prefer YAML configuration over Python
- Your team is already comfortable with dbt
Use standalone Great Expectations when:
- You need advanced statistical validation (anomaly detection, distribution analysis)
- You’re validating data outside the dbt transformation pipeline
- You need rich observability and detailed validation metrics
- You’re building a centralized data quality platform
- You need to validate data in real-time or streaming contexts
Practical Use Cases: Where Combined Validation Shines
Use Case 1: Multi-Layer Transformation Pipeline
Consider a typical analytics pipeline:
Raw Events → Staging Layer → Intermediate Layer → Mart Layer
dbt’s role: Structural validation at each layer
- Staging: Ensure raw events have required fields
- Intermediate: Verify joins didn’t create duplicates
- Mart: Confirm aggregations have expected row counts
Great Expectations’ role: Semantic validation at critical points
- After staging: Validate event timestamps are within reasonable ranges
- After intermediate: Check that aggregated metrics fall within historical bounds
- After mart: Ensure final metrics match business definitions
Use Case 2: Detecting Data Drift
Your customer dimension table has been stable for months. Then, suddenly, the average customer age drops by 10 years. This could indicate:
- A data quality issue in the source system
- A change in customer acquisition strategy
- A bug in the transformation logic
dbt’s tests won’t catch this: the data is still structurally valid. Great Expectations can:
```yaml
# Expectation suite for the customer dimension
expectations:
  - expect_column_mean_to_be_between:
      column: age
      min_value: 35
      max_value: 55
  - expect_column_quantile_values_to_be_between:
      column: age
      quantile_ranges:
        quantiles: [0.5]          # median
        value_ranges: [[40, 50]]
```
When the mean age drops to 25, Great Expectations flags it immediately, triggering investigation before downstream consumers see corrupted data.
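The underlying check is simple even without the framework: compare the batch statistic against historical bounds. A plain-Python sketch of the mean and median checks above (the bounds are the same illustrative ones as in the suite):

```python
from statistics import mean, median

def check_drift(ages, mean_bounds=(35, 55), median_bounds=(40, 50)):
    """Flag a batch whose mean or median falls outside historical bounds,
    mirroring the expectation suite above in plain Python."""
    results = {
        "mean_ok": mean_bounds[0] <= mean(ages) <= mean_bounds[1],
        "median_ok": median_bounds[0] <= median(ages) <= median_bounds[1],
    }
    results["ok"] = all(results.values())
    return results

stable_batch = [38, 42, 45, 47, 51]   # mean 44.6, median 45 -> passes
drifted_batch = [19, 22, 24, 27, 31]  # mean 24.6, median 24 -> fails
print(check_drift(stable_batch)["ok"], check_drift(drifted_batch)["ok"])  # True False
```

The value of the framework is everything around this check: versioned bounds, historical tracking, and alerting when the check fires.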
Use Case 3: Cross-Table Referential Integrity
You have orders that reference customers. dbt can test that every order has a valid customer_id:
```yaml
columns:
  - name: customer_id
    tests:
      - relationships:
          to: ref('dim_customers')
          field: customer_id
```
But what if you need to validate that orders only reference active customers? Or that the order date is after the customer’s creation date? These require Great Expectations’ flexibility:
```python
# Custom cross-table check, sketched with SQLAlchemy; in Great Expectations
# this logic would live inside a custom expectation class
from sqlalchemy import text

def expect_orders_reference_active_customers(engine):
    """Orders should only reference customers with active status."""
    query = text("""
        select count(*) as invalid_count
        from orders o
        left join customers c on o.customer_id = c.customer_id
        where c.status != 'active' or c.customer_id is null
    """)
    with engine.connect() as conn:
        invalid_count = conn.execute(query).scalar()
    return invalid_count == 0
```
Best Practices for Implementation
1. Define Clear Data Contracts
Before writing any tests, document what your data should look like. Create a data contract that specifies:
- Required columns and their types
- Acceptable value ranges
- Uniqueness constraints
- Business rule constraints
- SLA requirements (freshness, completeness)
```yaml
# data_contracts/fct_orders.yml
contract:
  name: fct_orders
  owner: analytics_team
  sla:
    freshness: 24 hours
    completeness: 99.9%
  columns:
    order_id:
      type: integer
      constraints:
        - unique
        - not_null
    order_amount:
      type: decimal(10,2)
      constraints:
        - not_null
        - between: [0, 1000000]
    order_date:
      type: date
      constraints:
        - not_null
        - after: 2020-01-01
```
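The SLA portion of a contract like this can be enforced mechanically. A sketch of a freshness check, assuming the contract’s `freshness: 24 hours` field has already been parsed into a `timedelta` (the contract format here is illustrative, not a standard):

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_updated, max_age=timedelta(hours=24), now=None):
    """Return True if the table was refreshed within its SLA window."""
    now = now or datetime.now(timezone.utc)
    return (now - last_updated) <= max_age

# Fixed "now" so the example is deterministic
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh = check_freshness(datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc), now=now)
stale = check_freshness(datetime(2023, 12, 30, 9, 0, tzinfo=timezone.utc), now=now)
print(fresh, stale)  # True False
```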
2. Layer Your Validations
Implement validation at multiple layers:
Layer 1: Source validation (Great Expectations)
- Validate raw data immediately after ingestion
- Catch upstream data quality issues early
- Prevent bad data from entering your pipeline
Layer 2: Structural validation (dbt tests)
- Validate transformation logic
- Ensure joins and aggregations work correctly
- Catch schema mismatches
Layer 3: Semantic validation (Great Expectations)
- Validate business logic
- Check statistical properties
- Ensure data meets business requirements
Layer 4: Observability (Great Expectations + monitoring)
- Track validation metrics over time
- Alert on quality degradation
- Feed metrics into dashboards
3. Make Validation Observable
Don’t just pass or fail tests; capture metrics that tell a story:
```python
# Instead of just checking that values are between 0 and 1,000,000,
# also capture the distribution
expectations = [
    {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {"column": "order_amount", "min_value": 0, "max_value": 1000000},
    },
    {
        "expectation_type": "expect_column_mean_to_be_between",
        "kwargs": {"column": "order_amount", "min_value": 50, "max_value": 500},
    },
    {
        "expectation_type": "expect_column_quantile_values_to_be_between",
        "kwargs": {
            "column": "order_amount",
            "quantile_ranges": {"quantiles": [0.95], "value_ranges": [[800, 950]]},
        },
    },
]
```
This gives you visibility into not just whether data is valid, but how it’s behaving.
4. Version Your Expectations
Treat expectation suites like code. Version them, review changes, and maintain a history:
```
expectations/
├── fct_orders/
│   ├── v1.0.json   (initial expectations)
│   ├── v1.1.json   (added order_date validation)
│   └── v2.0.json   (updated amount ranges based on new product line)
└── dim_customers/
    └── v1.0.json
```
When expectations change, you have a record of why and when.
5. Automate Validation in Your Pipeline
Integrate validation into your orchestration tool (Airflow, dbt Cloud, Prefect, etc.):
```python
# Airflow DAG example
from airflow import DAG
from airflow.operators.bash import BashOperator
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

with DAG('data_quality_pipeline') as dag:
    # Run dbt transformations
    dbt_run = BashOperator(
        task_id='dbt_run',
        bash_command='dbt run',
    )

    # Run dbt tests
    dbt_test = BashOperator(
        task_id='dbt_test',
        bash_command='dbt test',
    )

    # Run Great Expectations validation
    ge_validation = GreatExpectationsOperator(
        task_id='ge_validation',
        checkpoint_name='fct_orders_checkpoint',
    )

    dbt_run >> dbt_test >> ge_validation
```
6. Handle Validation Failures Gracefully
Not all validation failures should stop your pipeline. Implement a tiered response:
Critical failures (stop the pipeline):
- Null values in required columns
- Duplicate primary keys
- Missing entire tables
Warning failures (log and alert, but continue):
- Values outside expected ranges
- Unusual distributions
- Unexpected data drift
Informational (log only):
- Metrics for trend analysis
- Performance statistics
```python
# Example: Tiered validation response
validation_result = run_expectations(data, expectation_suite)

critical_failures = [r for r in validation_result if r['severity'] == 'critical']
warning_failures = [r for r in validation_result if r['severity'] == 'warning']

if critical_failures:
    raise DataQualityException(f"Critical validation failures: {critical_failures}")

if warning_failures:
    logger.warning(f"Data quality warnings: {warning_failures}")
    send_alert(warning_failures)

log_metrics(validation_result)
```
dbt Native Tests vs. Great Expectations: When to Use Each
dbt Native Tests Excel At:
- Uniqueness checks: Ensuring primary keys are unique
- Not-null checks: Verifying required fields are populated
- Referential integrity: Validating foreign key relationships
- Custom SQL assertions: Simple business logic in SQL
- Integration with dbt workflow: Tests run as part of `dbt test`
Example:
```yaml
columns:
  - name: order_id
    tests:
      - unique
  - name: customer_id
    tests:
      - not_null
      - relationships:
          to: ref('dim_customers')
          field: customer_id
```
Great Expectations Excels At:
- Statistical validation: Distribution analysis, outlier detection
- Pattern matching: Regex validation, format checking
- Range validation: Numeric bounds, date ranges
- Anomaly detection: Identifying unusual patterns
- Cross-dataset validation: Comparing metrics across tables
- Observability: Detailed metrics and trend tracking
- Documentation: Automatic data contract generation
Example:
```python
expectations = [
    {
        "expectation_type": "expect_column_values_to_match_regex",
        "kwargs": {"column": "email", "regex": r"^[^@]+@[^@]+\.[^@]+$"},
    },
    {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {"column": "age", "min_value": 18, "max_value": 120},
    },
    {
        "expectation_type": "expect_column_kl_divergence_to_be_less_than",
        "kwargs": {
            "column": "status",
            # Expected categorical distribution (weights are illustrative)
            "partition_object": {"values": ["active", "inactive"], "weights": [0.7, 0.3]},
            "threshold": 0.1,
        },
    },
]
```
Building Data Observability: The Complete Picture
Data observability goes beyond validation. It’s about understanding the health and behavior of your data in production.
The Three Pillars of Data Observability
1. Freshness: Is data current?
- dbt: Track model run times and completion status
- Great Expectations: Validate that data was updated within expected timeframes
2. Distribution: Is data behaving normally?
- Great Expectations: Track statistical properties over time
- Identify when distributions shift unexpectedly
3. Lineage: Where did this data come from?
- dbt: Automatic lineage tracking through `ref()` and `source()`
- Great Expectations: Validation results linked to specific models and sources
Implementing Observability
Create a monitoring dashboard that combines signals from both tools:
```
Data Quality Dashboard
======================

dbt Test Results
├── stg_customers: 5/5 passed
├── fct_orders: 8/8 passed
└── dim_products: 3/3 passed

Great Expectations Validation
├── fct_orders: 47/50 expectations passed
│   └── Warning: order_amount distribution shifted
├── dim_customers: 32/32 expectations passed
└── stg_events: 28/30 expectations passed
    └── Failure: 2 null values in required field

Data Freshness
├── fct_orders: updated 2 hours ago
├── dim_customers: updated 1 hour ago
└── stg_events: updated 15 minutes ago

Quality Trends (Last 30 Days)
├── Validation pass rate: 98.5%
├── Average validation time: 45s
└── Critical failures: 0
```
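The headline numbers in a dashboard like this are straightforward roll-ups of validation results. A sketch of the aggregation, assuming each result has been reduced to a (suite name, expectations passed, expectations total) tuple:

```python
def summarize(results):
    """Roll per-suite validation results up into dashboard headline numbers."""
    passed = sum(p for _, p, _ in results)
    total = sum(t for _, _, t in results)
    return {
        "pass_rate": round(100.0 * passed / total, 1),
        "failing_suites": [name for name, p, t in results if p < t],
    }

results = [
    ("fct_orders", 47, 50),
    ("dim_customers", 32, 32),
    ("stg_events", 28, 30),
]
print(summarize(results))
# {'pass_rate': 95.5, 'failing_suites': ['fct_orders', 'stg_events']}
```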
Implementation Roadmap
Phase 1: Foundation (Weeks 1-2)
- Define data contracts for critical tables
- Implement dbt native tests for structural validation
- Set up basic Great Expectations suite for one key table
- Integrate validation into your orchestration tool
Phase 2: Expansion (Weeks 3-4)
- Extend Great Expectations to all critical tables
- Implement statistical validation and anomaly detection
- Create monitoring dashboard
- Set up alerting for validation failures
Phase 3: Maturity (Weeks 5-8)
- Implement cross-table validation logic
- Build automated remediation workflows
- Establish SLAs for data quality
- Create self-service validation tools for analysts
Phase 4: Optimization (Ongoing)
- Continuously refine expectations based on production data
- Implement machine learning-based anomaly detection
- Expand observability to include cost and performance metrics
- Build data quality scorecards by domain
Common Pitfalls to Avoid
1. Over-Testing
Not every column needs a test. Focus on:
- Columns used in critical business metrics
- Columns that frequently have quality issues
- Columns with strict business rules
2. Brittle Expectations
Avoid expectations that are too strict and fail on legitimate data variations:
```python
# Too strict: will fail on normal variations
expect_column_mean_to_be_between(column='order_amount', min_value=250, max_value=250)

# Better: allows reasonable variation
expect_column_mean_to_be_between(
    column='order_amount',
    min_value=200,
    max_value=300,
)
```
3. Ignoring Validation Results
Validation is only valuable if you act on it. Establish clear processes for:
- Investigating failures
- Updating expectations when business rules change
- Communicating issues to stakeholders
4. Treating Validation as One-Time
Data quality is not a project; it’s a practice. Continuously:
- Review and update expectations
- Monitor validation trends
- Adjust thresholds based on production data
Conclusion: Building Trust in Your Data
The combination of Great Expectations and dbt creates a comprehensive data quality framework that addresses the full spectrum of validation needs. dbt handles the structural foundation, ensuring your transformations are logically sound. Great Expectations adds semantic validation and observability, ensuring your data meets business requirements and behaves as expected.
More importantly, this combination enables a cultural shift: from reactive debugging to proactive quality assurance. Your team stops asking “Is this data broken?” and starts asking “How do we know this data is good?”
That shift, from reactive to proactive, from manual to automated, from opaque to observable, is what transforms data quality from a pain point into a competitive advantage.
Getting Started
- Audit your current pipeline: Identify critical tables and their quality requirements
- Define expectations: Document what good data looks like
- Implement dbt tests: Start with structural validation
- Add Great Expectations: Layer in semantic validation
- Automate and monitor: Integrate into your orchestration and create dashboards
- Iterate: Continuously refine based on production experience
Your data quality journey starts with a single expectation. Make it count.