Data Quality Control

Data quality is enforced at every layer, from YAML definition through ingestion to continuous monitoring.


Four-Layer Validation

  • Layer 1: Schema (Pydantic + JSON Schema)
  • Layer 2: Ingestion (Pandera)
  • Layer 3: Warehouse (dbt tests)
  • Layer 4: Monitoring (Elementary)

Layer 1: Schema Validation

All YAML configs are validated at load time:

  • Pydantic models: Type validation for metadata objects (feed providers, CDM entities, project configs)
  • JSON Schema: Structural validation of entities and relationships
  • Naming conventions: Symbol (^[A-Z0-9]{2,20}$), table names (^[a-z][a-z0-9_]{0,62}$)
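
The naming conventions above can be checked with nothing more than the standard library. The sketch below is illustrative only: the function names are hypothetical and are not part of DataInfra, and only the two regular expressions come from the conventions listed above.

```python
import re

# Patterns copied verbatim from the naming conventions above.
SYMBOL_RE = re.compile(r"^[A-Z0-9]{2,20}$")
TABLE_NAME_RE = re.compile(r"^[a-z][a-z0-9_]{0,62}$")


def is_valid_symbol(value: str) -> bool:
    """True if value is 2 to 20 uppercase letters or digits (hypothetical helper)."""
    # fullmatch is used because `$` alone also matches just before a trailing newline.
    return SYMBOL_RE.fullmatch(value) is not None


def is_valid_table_name(value: str) -> bool:
    """True if value starts with a lowercase letter and has at most 63 characters."""
    return TABLE_NAME_RE.fullmatch(value) is not None
```

For example, `is_valid_symbol("BTCUSDT")` is true, while `is_valid_table_name("1trades")` is false because the name starts with a digit.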

Layer 2: Ingestion-Time Validation (Pandera)

DataFrame-level checks as data flows through the pipeline:

  • Schema conformance at read time
  • Column type enforcement
  • Custom check functions per column
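
To make the idea of per-column checks concrete, here is a plain-Python stand-in. It deliberately does not use Pandera, where the same rules would be declared as column checks on a DataFrame schema. The `RULES` table, the helper names, and the specific checks are assumptions chosen for illustration; only the column names and dtypes are taken from the feed definition shown in the Layer 3 section.

```python
from typing import Any, Callable


def positive_decimal_string(value: str) -> bool:
    """Custom check: the string parses as a number greater than zero."""
    try:
        return float(value) > 0
    except ValueError:
        return False


# Hypothetical rules: column name -> (expected Python type, custom check function).
RULES: dict[str, tuple[type, Callable[[Any], bool]]] = {
    "symbol": (str, lambda v: 2 <= len(v) <= 20),
    "price": (str, positive_decimal_string),
    "event_time": (int, lambda v: v > 0),
}


def validate_rows(rows: list[dict[str, Any]]) -> list[str]:
    """Collect every violation instead of stopping at the first one."""
    errors = []
    for i, row in enumerate(rows):
        for column, (expected_type, check) in RULES.items():
            if column not in row:
                errors.append(f"row {i}: missing column {column!r}")
            elif not isinstance(row[column], expected_type):
                errors.append(f"row {i}: {column!r} is not {expected_type.__name__}")
            elif not check(row[column]):
                errors.append(f"row {i}: {column!r} failed its check")
    return errors
```

A caller that wants to block the write, as the ingestion layer does, would refuse to persist the batch whenever `validate_rows` returns a non-empty list.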

Layer 3: dbt Tests (Warehouse-Level)

Every feed provider definition includes declarative tests that become dbt test cases:

Column tests:

attributes:
  symbol:
    dtype: string
    tests:
      - not_null
      - dbt_utils.accepted_values:
          values: [BTCUSDT, ETHUSDT, BNBUSDT, ADAUSDT, XRPUSDT, SOLUSDT]
  price:
    dtype: string
    tests:
      - not_null
      - dbt_utils.accepted_range: { min_value: 0, inclusive: false }
      - dbt_utils.expression_is_true: { expression: "price > 0" }
  event_time:
    dtype: bigint
    tests:
      - not_null
      - dbt_utils.accepted_range:
          min_value: 1609459200000
          max_value: "{{ var('current_timestamp_ms') }}"
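
The column tests above run as SQL in the warehouse, but their intent can be sketched in Python. The following is an approximation for illustration, not how dbt evaluates them: `in_range` and `failing_rows` are hypothetical helpers, and `now_ms` stands in for the `current_timestamp_ms` variable. The lower bound 1609459200000 is 2021-01-01T00:00:00Z expressed in milliseconds.

```python
from datetime import datetime, timezone

ACCEPTED_SYMBOLS = {"BTCUSDT", "ETHUSDT", "BNBUSDT", "ADAUSDT", "XRPUSDT", "SOLUSDT"}

# Lower bound for event_time: 2021-01-01T00:00:00Z in epoch milliseconds.
MIN_EVENT_TIME_MS = int(datetime(2021, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)


def in_range(value, min_value=None, max_value=None, inclusive=True):
    """Approximate an accepted-range test for a single non-null value."""
    if min_value is not None:
        if value < min_value or (not inclusive and value == min_value):
            return False
    if max_value is not None:
        if value > max_value or (not inclusive and value == max_value):
            return False
    return True


def failing_rows(rows, now_ms):
    """Return the indexes of rows that would fail at least one column test."""
    failed = []
    for i, row in enumerate(rows):
        ok = (
            row.get("symbol") in ACCEPTED_SYMBOLS          # not_null + accepted values
            and row.get("price") is not None               # not_null
            and in_range(float(row["price"]), min_value=0, inclusive=False)
            and row.get("event_time") is not None          # not_null
            and in_range(row["event_time"], MIN_EVENT_TIME_MS, now_ms)
        )
        if not ok:
            failed.append(i)
    return failed
```

Because `price` is declared as a string, the sketch converts it with `float` before comparing; a non-numeric price would raise `ValueError` here rather than being reported as a failing row.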

Table tests:

tests:
  - dbt_utils.unique_combination_of_columns:
      combination_of_columns: [trade_time, symbol, trade_id]
  - dbt_utils.recency:
      datepart: hour
      field: event_time
      interval: 24
  - dbt_utils.expression_is_true:
      expression: "event_time >= received_time"
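
As a rough illustration of what the first two table tests assert, the sketch below checks composite-key uniqueness and recency over in-memory rows. The helper names are hypothetical, `event_time` is assumed to be epoch milliseconds as in the column tests, and treating an empty table as stale is an assumption of this sketch rather than documented dbt behavior.

```python
from collections import Counter

MS_PER_HOUR = 3_600_000


def duplicate_keys(rows, columns):
    """Return each column-value combination that appears more than once."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def is_recent(rows, field, now_ms, max_age_hours):
    """True if the newest value of `field` is at most max_age_hours old."""
    if not rows:
        return False  # assumption: an empty table counts as stale
    newest = max(row[field] for row in rows)
    return now_ms - newest <= max_age_hours * MS_PER_HOUR
```

With the configuration above, the uniqueness check would be called with `["trade_time", "symbol", "trade_id"]` and the recency check with `field="event_time"` and `max_age_hours=24`.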

Layer 4: Elementary (Data Observability)

For continuous monitoring, Elementary integrates with the dbt artifacts that DataInfra produces:

dbt run → dbt test → elementary run-operation → elementary report

Capability               Description
Anomaly Detection        Auto-detect volume spikes, null rate changes, freshness deviations
Schema Change Alerts     Alert on new/removed columns or type changes
Test Results Dashboard   Centralized view of all dbt test results across runs
Column-Level Lineage     Visual dependency graph from sources → staging → CDM
Freshness Monitoring     Track data freshness SLAs per source per partition
Alerting                 Slack/Email notifications on test failures and anomalies

Quality Control Level

Quality control behavior is configured per test in the feed provider YAML. Tests can be set to:

  • error: Pipeline stops on violations
  • warning: Violations logged, pipeline continues
  • silent: Violations recorded but not surfaced

Tests that fail at ingestion time (Pandera layer) block data from being written. Tests at the warehouse layer (dbt) run post-load, and their severity is configured per test.
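
One way to picture the three severity levels is as a small dispatch on the configured value. This is a minimal sketch under stated assumptions: `Severity`, `QualityViolation`, and `handle_violation` are hypothetical names, and recording every violation (including errors and warnings) is an assumption made here so the three branches are easy to compare.

```python
import logging
from enum import Enum

logger = logging.getLogger("quality")


class Severity(Enum):
    ERROR = "error"
    WARNING = "warning"
    SILENT = "silent"


class QualityViolation(Exception):
    """Raised to stop the pipeline when an error-level test fails."""


def handle_violation(test_name: str, severity: Severity, recorded: list) -> None:
    """Apply the configured behavior for one failed test."""
    # Assumption: every violation is recorded, whatever its severity.
    recorded.append((test_name, severity.value))
    if severity is Severity.ERROR:
        raise QualityViolation(f"{test_name} failed")       # pipeline stops
    if severity is Severity.WARNING:
        logger.warning("%s failed; continuing", test_name)   # logged, pipeline continues
    # SILENT: recorded above, nothing surfaced
```

A string read from the YAML can be converted with `Severity("warning")`, which raises `ValueError` for any value outside the three levels.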