Data Quality Control
Data quality is enforced at every layer, from YAML definition through ingestion to continuous monitoring.
Four-Layer Validation
Layer 1: Schema (Pydantic + JSON Schema)
↓
Layer 2: Ingestion (Pandera)
↓
Layer 3: Warehouse (dbt tests)
↓
Layer 4: Monitoring (Elementary)
Layer 1: Schema Validation
All YAML configs are validated at load time:
- Pydantic models: Type validation for metadata objects (feed providers, CDM entities, project configs)
- JSON Schema: Structural validation of entities and relationships
- Naming conventions: symbols (^[A-Z0-9]{2,20}$), table names (^[a-z][a-z0-9_]{0,62}$)
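The naming conventions above can be checked with ordinary regular expressions. The following is a minimal sketch using only Python's standard library; the function names `is_valid_symbol` and `is_valid_table_name` are illustrative assumptions, not part of the project's API, and the real checks may live inside the Pydantic models.

```python
import re

# Patterns copied from the naming conventions above.
SYMBOL_RE = re.compile(r"^[A-Z0-9]{2,20}$")
TABLE_NAME_RE = re.compile(r"^[a-z][a-z0-9_]{0,62}$")


def is_valid_symbol(symbol: str) -> bool:
    """Return True if `symbol` is 2 to 20 uppercase letters or digits."""
    # fullmatch also rejects a trailing newline, which `$` alone would allow.
    return SYMBOL_RE.fullmatch(symbol) is not None


def is_valid_table_name(name: str) -> bool:
    """Return True if `name` starts with a lowercase letter and is at most 63 chars."""
    return TABLE_NAME_RE.fullmatch(name) is not None
```

For example, `is_valid_symbol("BTCUSDT")` passes, while `is_valid_symbol("btcusdt")` and `is_valid_table_name("1trades")` are rejected.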
Layer 2: Ingestion-Time Validation (Pandera)
DataFrame-level checks as data flows through the pipeline:
- Schema conformance at read time
- Column type enforcement
- Custom check functions per column
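Pandera expresses these checks as declarative schema objects. The sketch below uses plain Python rather than Pandera's API, purely to illustrate the three kinds of check listed above; `SCHEMA`, `validate_rows`, and `is_positive_number` are hypothetical names, and the column rules are borrowed from the feed example later in this section.

```python
from typing import Any, Callable


def is_positive_number(text: str) -> bool:
    """Custom check: the string parses as a number greater than zero."""
    try:
        return float(text) > 0
    except ValueError:
        return False


# Hypothetical column spec: expected Python type plus custom check functions.
SCHEMA: dict[str, tuple[type, list[Callable[[Any], bool]]]] = {
    "symbol": (str, [lambda v: v.isupper()]),
    "price": (str, [is_positive_number]),
    "event_time": (int, [lambda v: v >= 1609459200000]),
}


def validate_rows(rows: list[dict[str, Any]]) -> list[str]:
    """Return human-readable violations; an empty list means the batch passes."""
    errors: list[str] = []
    for i, row in enumerate(rows):
        # Schema conformance: every expected column must be present.
        missing = SCHEMA.keys() - row.keys()
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        for name, (expected_type, checks) in SCHEMA.items():
            value = row[name]
            # Column type enforcement.
            if not isinstance(value, expected_type):
                errors.append(f"row {i}: {name} expected {expected_type.__name__}")
                continue
            # Custom check functions per column.
            for check in checks:
                if not check(value):
                    errors.append(f"row {i}: {name} failed a custom check")
    return errors
```

A caller would refuse to write the batch whenever `validate_rows` returns a non-empty list, which matches the blocking behavior described for this layer.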
Layer 3: dbt Tests (Warehouse-Level)
Every feed provider definition includes declarative tests that become dbt test cases:
Column tests:
attributes:
  symbol:
    dtype: string
    tests:
      - not_null
      - accepted_values:
          values: [BTCUSDT, ETHUSDT, BNBUSDT, ADAUSDT, XRPUSDT, SOLUSDT]
  price:
    dtype: string
    tests:
      - not_null
      - dbt_utils.accepted_range: { min_value: 0, inclusive: false }
      - dbt_utils.expression_is_true: { expression: "price > 0" }
  event_time:
    dtype: bigint
    tests:
      - not_null
      - dbt_utils.accepted_range:
          min_value: 1609459200000
          max_value: "{{ var('current_timestamp_ms') }}"
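The bounds on event_time are epoch milliseconds. As a quick check, the sketch below converts the lower bound to a UTC date and builds a value for the `current_timestamp_ms` variable. How DataInfra actually supplies that variable is not stated here, so `current_timestamp_ms_vars` is a hypothetical helper.

```python
import json
import time
from datetime import datetime, timezone

MIN_EVENT_TIME_MS = 1609459200000  # lower bound from the event_time test

# Epoch milliseconds -> timezone-aware UTC datetime.
lower_bound = datetime.fromtimestamp(MIN_EVENT_TIME_MS / 1000, tz=timezone.utc)
print(lower_bound.isoformat())  # 2021-01-01T00:00:00+00:00


def current_timestamp_ms_vars() -> str:
    """Build a JSON string that could be passed to dbt's --vars option."""
    return json.dumps({"current_timestamp_ms": int(time.time() * 1000)})
```

The returned string has the form `{"current_timestamp_ms": <now in ms>}` and could be supplied as `dbt test --vars '<that string>'`, so the upper bound is evaluated at run time rather than hard-coded.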
Table tests:
tests:
  - dbt_utils.unique_combination_of_columns:
      combination_of_columns: [trade_time, symbol, trade_id]
  - dbt_utils.recency:
      datepart: hour
      field: event_time
      interval: 24
  - dbt_utils.expression_is_true:
      expression: "event_time <= received_time"
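To make "declarative tests become dbt test cases" concrete, here is one plausible mapping from the feed provider layout to the `columns:` list that dbt reads from a model properties file. This is a sketch under assumptions: the function name `to_dbt_columns` and the model name `trades` are hypothetical, and the actual translation in DataInfra may differ.

```python
from typing import Any


def to_dbt_columns(attributes: dict[str, dict[str, Any]]) -> list[dict[str, Any]]:
    """Map {column: {dtype, tests}} to dbt's `columns:` list layout."""
    return [
        {
            "name": name,
            "data_type": spec["dtype"],
            "tests": list(spec.get("tests", [])),  # copy so the input is not shared
        }
        for name, spec in attributes.items()
    ]


# A trimmed version of the feed definition shown above.
attributes = {
    "symbol": {"dtype": "string", "tests": ["not_null"]},
    "event_time": {
        "dtype": "bigint",
        "tests": [
            "not_null",
            {"dbt_utils.accepted_range": {"min_value": 1609459200000}},
        ],
    },
}

# Hypothetical model entry, ready to be serialized to YAML.
model = {"name": "trades", "columns": to_dbt_columns(attributes)}
```

Keeping the mapping this thin means the feed YAML stays the single source of truth and the dbt properties file is a generated artifact.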
Layer 4: Elementary (Data Observability)
Elementary consumes the dbt artifacts that DataInfra produces and uses them for continuous monitoring:
dbt run → dbt test → elementary run-operation → elementary report
| Capability | Description |
|---|---|
| Anomaly Detection | Auto-detect volume spikes, null rate changes, freshness deviations |
| Schema Change Alerts | Alert on new/removed columns or type changes |
| Test Results Dashboard | Centralized view of all dbt test results across runs |
| Column-Level Lineage | Visual dependency graph from sources → staging → CDM |
| Freshness Monitoring | Track data freshness SLAs per source per partition |
| Alerting | Slack/Email notifications on test failures and anomalies |
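One common way to flag a volume spike is to compare the latest row count with recent history using a z-score. The sketch below illustrates that general idea only; it is not Elementary's implementation, and the function name and the default threshold of 3 are assumptions.

```python
from statistics import mean, stdev


def is_volume_anomaly(history: list[int], latest: int, threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > threshold
```

With a history of `[1000, 1020, 980, 1010, 990]` rows per run, a new count of 1005 is within the normal range, while 5000 would be flagged.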
Quality Control Levels
Quality control behavior is configured per test in the feed provider YAML. Tests can be set to:
- error: Pipeline stops on violations
- warning: Violations logged, pipeline continues
- silent: Violations recorded but not surfaced
Tests that fail at ingestion time (Pandera layer) block data from being written. Tests at the warehouse layer (dbt) run after the data is loaded, and each test can be assigned its own severity.
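The three levels can be pictured as a small dispatch on each test result. This is a minimal sketch, assuming a runner that receives a violation count per test; `Severity`, `PipelineHalted`, and `handle_result` are hypothetical names, not the project's actual interface.

```python
from enum import Enum


class Severity(Enum):
    ERROR = "error"
    WARNING = "warning"
    SILENT = "silent"


class PipelineHalted(Exception):
    """Raised when an error-level test reports violations."""


def handle_result(
    test_name: str,
    violations: int,
    severity: Severity,
    recorded: list[tuple[str, int]],
    surfaced: list[str],
) -> None:
    """Apply the configured severity to one test result."""
    if violations == 0:
        return
    # Every level records the violation, including silent.
    recorded.append((test_name, violations))
    if severity is Severity.ERROR:
        raise PipelineHalted(f"{test_name}: {violations} violations")
    if severity is Severity.WARNING:
        surfaced.append(f"WARNING {test_name}: {violations} violations")
    # SILENT: recorded above, nothing is surfaced and the pipeline continues.
```

A warning-level failure therefore appears in both `recorded` and `surfaced`, a silent one only in `recorded`, and an error-level one stops the run by raising `PipelineHalted`.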