Data Quality Tests
Data quality is enforced at four layers: Pydantic schema validation at load time, Pandera DataFrame checks during ingestion, dbt tests at the warehouse level, and Elementary for continuous observability. This reference focuses on the YAML-declared dbt test layer — the primary interface users configure.
Test Structure
Tests are declared in DataTypeSchema using the TestSuite model — column_tests for per-column checks and table_tests for cross-column/table-wide checks.
data_types:
  cdm_trades:
    name: trades
    stream: "trade"
    unique_key: [symbol, trade_time, trade_id]
    # Table-level tests
    tests:
      table_tests:
        - dbt_utils.unique_combination_of_columns:
            combination_of_columns: [symbol, trade_time, trade_id]
        - dbt_utils.recency:
            datepart: hour
            field: event_time
            interval: 24
    # Column-level tests
    schema:
      symbol:
        dtype: string
        tests:
          - not_null
          - dbt_utils.accepted_values:
              values: [BTCUSDT, ETHUSDT, BNBUSDT, ADAUSDT, XRPUSDT, SOLUSDT]
      price:
        dtype: decimal
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: false
      event_time:
        dtype: timestamp
        tests:
          - not_null
    field_mappings:
      - target: trade_id
        source: t
        transformation: "cast_safe(t, bigint)"
      - target: price
        source: p
        transformation: "cast_safe(p, decimal)"
      - target: event_time
        source: "E"
        transformation: "timestamp_ms(cast_safe(E, bigint))"
        is_time_filter_field: true
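The declarations above mix two entry shapes: bare names such as not_null, and single-key mappings that carry parameters. As a rough illustration of how a loader could normalize both, here is a minimal Python sketch. DeclaredTest and parse_test_entry are hypothetical names used only for this example, and standard-library dataclasses stand in for the Pydantic-based TestSuite model, whose actual fields may differ.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DeclaredTest:
    """One declared test: a name plus optional parameters (hypothetical model)."""
    name: str
    params: dict[str, Any] = field(default_factory=dict)


def parse_test_entry(entry: "str | dict[str, Any]") -> DeclaredTest:
    """Normalize one YAML list item into a DeclaredTest.

    A bare string such as "not_null" has no parameters; a single-key
    mapping such as {"dbt_utils.recency": {...}} carries them.
    """
    if isinstance(entry, str):
        return DeclaredTest(name=entry)
    if isinstance(entry, dict) and len(entry) == 1:
        (name, params), = entry.items()
        return DeclaredTest(name=name, params=dict(params or {}))
    raise ValueError(f"unsupported test entry: {entry!r}")


# Entries as they would look after loading the price column's tests from YAML.
price_tests = [
    "not_null",
    {"dbt_utils.accepted_range": {"min_value": 0, "inclusive": False}},
]
parsed = [parse_test_entry(entry) for entry in price_tests]
```

Rejecting anything other than a string or a single-key mapping keeps malformed declarations from silently turning into tests with no parameters.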
Column-Level Tests
not_null
Ensures a column contains no NULL values.
tests:
  - not_null

Parameters: None

Level: Column
dbt_utils.accepted_values
Ensures a column only contains values from a predefined list.
tests:
  - dbt_utils.accepted_values:
      values: [BTCUSDT, ETHUSDT, BNBUSDT]

| Parameter | Type | Required | Description |
|---|---|---|---|
| values | list | Yes | Allowed values |

Level: Column
dbt_utils.accepted_range
Ensures numeric column values fall within a specified range.
tests:
  - dbt_utils.accepted_range:
      min_value: 0
      inclusive: false

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| min_value | float | No | none | Minimum allowed value |
| max_value | float or str | No | none | Maximum value (supports dbt {{ var() }}) |
| inclusive | bool | No | true | Whether endpoints are inclusive |

Level: Column
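The interaction between min_value, max_value, and inclusive is easiest to see at the endpoints. The following plain-Python sketch mirrors the semantics described in the table; in_accepted_range is a hypothetical helper written for illustration, not the SQL that the dbt test compiles to.

```python
from typing import Optional


def in_accepted_range(
    value: float,
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
    inclusive: bool = True,
) -> bool:
    """Return True if value satisfies whichever bounds are configured."""
    if min_value is not None:
        # With inclusive=False the endpoint itself is rejected.
        if value < min_value or (not inclusive and value == min_value):
            return False
    if max_value is not None:
        if value > max_value or (not inclusive and value == max_value):
            return False
    return True


# A price of exactly 0 passes the default inclusive check
# but fails the exclusive configuration used for price above.
zero_inclusive = in_accepted_range(0.0, min_value=0)
zero_exclusive = in_accepted_range(0.0, min_value=0, inclusive=False)
```

This is why the price examples set inclusive: false: with the default, a zero price would pass.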
dbt_utils.not_accepted_values
Inverse of accepted_values — ensures a column does NOT contain specific values. Useful for excluding known bad data.
tests:
  - dbt_utils.not_accepted_values:
      values: [0, -1, -999]

| Parameter | Type | Required | Description |
|---|---|---|---|
| values | list | Yes | Forbidden values |

Level: Column
dbt_expectations.expect_column_values_to_match_regex
Validates column values against a regex pattern. Useful for symbol formats, trade ID patterns, etc.
tests:
  - dbt_expectations.expect_column_values_to_match_regex:
      regex: "^[A-Z0-9]{5,12}$"

| Parameter | Type | Required | Description |
|---|---|---|---|
| regex | str | Yes | Regex pattern to match |
| row_condition | str | No | Conditional filter on rows |

Level: Column
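Before committing a pattern to YAML, it can help to try it against a few sample values. The sketch below does that with Python's re module for the symbol pattern shown above. The sample symbols are made up for illustration, and warehouse regex dialects can differ from Python's for more complex patterns, so treat this as a quick sanity check only.

```python
import re

# Same pattern as in the YAML example: 5 to 12 uppercase letters or digits.
SYMBOL_PATTERN = re.compile(r"^[A-Z0-9]{5,12}$")


def non_matching(values: list[str]) -> list[str]:
    """Return the values the pattern would flag as failures."""
    return [v for v in values if SYMBOL_PATTERN.fullmatch(v) is None]


# Hypothetical sample values, not taken from any real feed.
samples = ["BTCUSDT", "btcusdt", "BTC", "ETH-USDT", "1000SHIBUSDT"]
rejected = non_matching(samples)
```

Here the lowercase symbol, the three-character symbol, and the hyphenated symbol are rejected, while the 12-character symbol sits exactly at the upper length bound and passes.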
dbt_expectations.expect_column_values_to_be_increasing
Ensures column values are monotonically increasing — critical for timestamp sequences in time-series market data.
tests:
  - dbt_expectations.expect_column_values_to_be_increasing:
      group_by: [symbol]
      strictly: false

| Parameter | Type | Required | Description |
|---|---|---|---|
| group_by | list[str] | No | Group columns for partitioning |
| strictly | bool | No | Enforce strict inequality |

Level: Column
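To illustrate how group_by and strictly change the outcome, here is a small plain-Python sketch of the check. find_order_violations is a hypothetical helper, and it assumes rows are examined in the order given; the actual test runs as SQL in the warehouse and determines row order there.

```python
from typing import Any, Sequence


def find_order_violations(
    rows: Sequence[dict[str, Any]],
    column: str,
    group_by: Sequence[str] = (),
    strictly: bool = False,
) -> list[dict[str, Any]]:
    """Return rows whose value does not increase relative to the previous
    row of the same group."""
    last_seen: dict[tuple, Any] = {}
    violations = []
    for row in rows:
        key = tuple(row[g] for g in group_by)
        value = row[column]
        if key in last_seen:
            prev = last_seen[key]
            ok = value > prev if strictly else value >= prev
            if not ok:
                violations.append(row)
        last_seen[key] = value
    return violations


# Interleaved symbols: each symbol is checked against its own history.
trades = [
    {"symbol": "BTCUSDT", "event_time": 1},
    {"symbol": "ETHUSDT", "event_time": 5},
    {"symbol": "BTCUSDT", "event_time": 1},  # repeated timestamp
    {"symbol": "ETHUSDT", "event_time": 4},  # goes backwards
]
loose = find_order_violations(trades, "event_time", ["symbol"], strictly=False)
strict = find_order_violations(trades, "event_time", ["symbol"], strictly=True)
```

With strictly: false, the repeated BTCUSDT timestamp is tolerated and only the backwards ETHUSDT row is flagged. With strictly: true, both rows are flagged, which matters for feeds where several trades can share one timestamp.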
Table-Level Tests
dbt_utils.unique_combination_of_columns
Ensures a set of columns forms a unique key across the entire table.
table_tests:
  - dbt_utils.unique_combination_of_columns:
      combination_of_columns: [symbol, trade_time, trade_id]

| Parameter | Type | Required | Description |
|---|---|---|---|
| combination_of_columns | list[str] | Yes | Columns that together must be unique |

Level: Table
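The uniqueness check can be pictured as counting rows per key tuple and reporting every tuple that occurs more than once. The sketch below does this in plain Python; duplicate_keys is a hypothetical helper for illustration, whereas the real test performs the equivalent grouping in SQL.

```python
from collections import Counter
from typing import Any, Sequence


def duplicate_keys(
    rows: Sequence[dict[str, Any]], combination_of_columns: Sequence[str]
) -> list[tuple]:
    """Return each key tuple that appears in more than one row."""
    counts = Counter(
        tuple(row[c] for c in combination_of_columns) for row in rows
    )
    return [key for key, n in counts.items() if n > 1]


# The first two rows share the full key; the third differs only in trade_id.
trades = [
    {"symbol": "BTCUSDT", "trade_time": 100, "trade_id": 1},
    {"symbol": "BTCUSDT", "trade_time": 100, "trade_id": 1},
    {"symbol": "BTCUSDT", "trade_time": 100, "trade_id": 2},
]
dupes = duplicate_keys(trades, ["symbol", "trade_time", "trade_id"])
```

Dropping trade_id from the column list would flag all three rows as one duplicated key, which shows why the combination has to match the declared unique_key.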
dbt_utils.recency
Ensures data is fresh — the most recent field value is within interval dateparts of now.
table_tests:
  - dbt_utils.recency:
      datepart: hour
      field: event_time
      interval: 24

| Parameter | Type | Required | Description |
|---|---|---|---|
| datepart | str | Yes | Time unit (hour, day, minute) |
| field | str | Yes | Timestamp column to check |
| interval | int | Yes | Maximum allowed age |

Level: Table
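The recency rule reduces to comparing the age of the newest timestamp against datepart times interval. The following sketch expresses that with Python's datetime module; is_fresh is a hypothetical helper, and it covers only the three dateparts listed in the table.

```python
from datetime import datetime, timedelta, timezone

# Map the datepart names from the table to timedelta keyword arguments.
_TIMEDELTA_KWARG = {"minute": "minutes", "hour": "hours", "day": "days"}


def is_fresh(latest: datetime, datepart: str, interval: int, now: datetime) -> bool:
    """Return True if the newest value is at most `interval` dateparts old."""
    max_age = timedelta(**{_TIMEDELTA_KWARG[datepart]: interval})
    return now - latest <= max_age


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
# Newest event 23 hours old: within the 24-hour window.
age_23h = is_fresh(datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc), "hour", 24, now)
# Newest event 25 hours old: outside the window.
age_25h = is_fresh(datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc), "hour", 24, now)
```

Passing now explicitly keeps the check deterministic in tests; a scheduled run would use the current time instead.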
dbt_utils.expression_is_true
Validates that a SQL expression evaluates to true for all rows.
table_tests:
  # Timestamp ordering
  - dbt_utils.expression_is_true:
      expression: "event_time >= received_time"
  # Price sanity
  - dbt_utils.expression_is_true:
      expression: "price > 0"

| Parameter | Type | Required | Description |
|---|---|---|---|
| expression | str | Yes | SQL expression that must be true |

Level: Column or Table
dbt_utils.relationships
Ensures referential integrity — values in a column exist in a reference table. Critical for symbol validation against a master symbol list.
table_tests:
  - dbt_utils.relationships:
      field: symbol
      to: ref('symbol_reference')
      to_field: symbol

| Parameter | Type | Required | Description |
|---|---|---|---|
| field | str | Yes | Column to validate |
| to | ref | Yes | Reference table |
| to_field | str | Yes | Reference column |

Level: Table
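Referential integrity amounts to a set difference: every non-null value in the checked column must appear in the reference column. Here is a minimal Python sketch of that idea. orphaned_values is a hypothetical helper, and skipping nulls is an assumption modelled on how dbt's relationships test behaves; null handling belongs to not_null.

```python
from typing import Iterable, Optional


def orphaned_values(
    child_values: Iterable[Optional[str]], parent_values: Iterable[str]
) -> list[str]:
    """Return distinct child values that are missing from the reference column."""
    parents = set(parent_values)
    missing = {v for v in child_values if v is not None and v not in parents}
    return sorted(missing)


# Hypothetical master list and trade symbols.
symbol_reference = ["BTCUSDT", "ETHUSDT"]
trade_symbols = ["BTCUSDT", "DOGEUSDT", None, "ETHUSDT", "DOGEUSDT"]
orphans = orphaned_values(trade_symbols, symbol_reference)
```

Returning the distinct missing values, rather than a pass/fail flag, makes it easier to see which symbols need to be added to the master list.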
dbt_utils.at_least_one
Ensures the table contains at least one row — catches empty partitions or missing data days.
table_tests:
  - dbt_utils.at_least_one

Parameters: None

Level: Table
dbt_utils.sequential_values
Detects gaps in sequential columns. Useful for trade sequence IDs, bar indices.
table_tests:
  - dbt_utils.sequential_values:
      field: trade_sequence
      group_by: [symbol]
      max_gap: 5

| Parameter | Type | Required | Description |
|---|---|---|---|
| field | str | Yes | Sequential column to check |
| group_by | list[str] | No | Partition columns |
| max_gap | int | No | Maximum allowed gap |

Level: Table
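A gap check can be sketched as sorting each group's values and comparing neighbours. The example below assumes that max_gap means the largest permitted difference between two consecutive values in a group, and that it defaults to 1; both are assumptions made for this illustration, since the reference does not spell them out. find_gaps is a hypothetical helper.

```python
from collections import defaultdict
from typing import Any, Sequence


def find_gaps(
    rows: Sequence[dict[str, Any]],
    field: str,
    group_by: Sequence[str] = (),
    max_gap: int = 1,
) -> list[tuple]:
    """Return (group_key, previous, current) for each step larger than max_gap."""
    groups: dict[tuple, list[int]] = defaultdict(list)
    for row in rows:
        groups[tuple(row[g] for g in group_by)].append(row[field])

    gaps = []
    for key, values in groups.items():
        ordered = sorted(values)
        for prev, cur in zip(ordered, ordered[1:]):
            if cur - prev > max_gap:
                gaps.append((key, prev, cur))
    return gaps


trades = [
    {"symbol": "BTCUSDT", "trade_sequence": 1},
    {"symbol": "BTCUSDT", "trade_sequence": 2},
    {"symbol": "BTCUSDT", "trade_sequence": 3},
    {"symbol": "BTCUSDT", "trade_sequence": 10},  # jump of 7
    {"symbol": "ETHUSDT", "trade_sequence": 1},
    {"symbol": "ETHUSDT", "trade_sequence": 4},   # jump of 3
]
tolerant = find_gaps(trades, "trade_sequence", ["symbol"], max_gap=5)
strict = find_gaps(trades, "trade_sequence", ["symbol"], max_gap=1)
```

Under this reading, max_gap: 5 tolerates the ETHUSDT jump of 3 but still reports the BTCUSDT jump of 7, while max_gap: 1 reports both.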
dbt_utils.mutually_exclusive_ranges
Ensures no overlapping ranges for validity periods — prevents duplicate symbol mappings or bar windows.
table_tests:
  - dbt_utils.mutually_exclusive_ranges:
      lower_bound_column: valid_from
      upper_bound_column: valid_to
      group_by: [symbol]

| Parameter | Type | Required | Description |
|---|---|---|---|
| lower_bound_column | str | Yes | Start of range |
| upper_bound_column | str | Yes | End of range |
| group_by | list[str] | No | Partition columns |

Level: Table
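Overlap detection can be sketched by sorting each group's ranges by lower bound and comparing every range with the next one. The example below treats ranges as half-open, so a range may start exactly where the previous one ends; that choice is an assumption for this illustration, and find_overlaps is a hypothetical helper rather than the SQL the test compiles to.

```python
from collections import defaultdict
from datetime import date
from typing import Any, Sequence


def find_overlaps(
    rows: Sequence[dict[str, Any]],
    lower_bound_column: str,
    upper_bound_column: str,
    group_by: Sequence[str] = (),
) -> list[tuple]:
    """Return (group_key, range_a, range_b) for adjacent ranges that overlap.

    Any overlap within a group produces at least one overlapping adjacent
    pair after sorting, so an empty result means the group is clean.
    """
    groups: dict[tuple, list[tuple]] = defaultdict(list)
    for row in rows:
        key = tuple(row[g] for g in group_by)
        groups[key].append((row[lower_bound_column], row[upper_bound_column]))

    overlaps = []
    for key, ranges in groups.items():
        ordered = sorted(ranges)
        for (lo1, hi1), (lo2, hi2) in zip(ordered, ordered[1:]):
            if lo2 < hi1:  # next range starts before the previous one ends
                overlaps.append((key, (lo1, hi1), (lo2, hi2)))
    return overlaps


mappings = [
    {"symbol": "BTCUSDT", "valid_from": date(2024, 1, 1), "valid_to": date(2024, 2, 1)},
    {"symbol": "BTCUSDT", "valid_from": date(2024, 1, 15), "valid_to": date(2024, 3, 1)},
    {"symbol": "ETHUSDT", "valid_from": date(2024, 1, 1), "valid_to": date(2024, 2, 1)},
    {"symbol": "ETHUSDT", "valid_from": date(2024, 2, 1), "valid_to": date(2024, 3, 1)},
]
overlaps = find_overlaps(mappings, "valid_from", "valid_to", group_by=["symbol"])
```

In this data the two BTCUSDT periods overlap by about two weeks and are reported, while the ETHUSDT periods merely touch on 2024-02-01 and pass.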
dbt_utils.cardinality_equality
Ensures row count consistency between source and CDM staging.
table_tests:
  - dbt_utils.cardinality_equality:
      field: cdm_id
      to: ref('source_trades')
      to_field: trade_id

| Parameter | Type | Required | Description |
|---|---|---|---|
| field | str | Yes | Column in this table |
| to | ref | Yes | Comparison table |
| to_field | str | Yes | Comparison column |

Level: Table
dbt_expectations.expect_column_pair_values_A_to_be_greater_than_B
Ensures ask > bid, event_time >= received_time, or any ordered column pair.
table_tests:
  - dbt_expectations.expect_column_pair_values_A_to_be_greater_than_B:
      column_A: ask_price
      column_B: bid_price

| Parameter | Type | Required | Description |
|---|---|---|---|
| column_A | str | Yes | Column that should be greater |
| column_B | str | Yes | Column that should be smaller |
| row_condition | str | No | Conditional filter |

Level: Table
Recommended Test Profiles
High-Frequency Trade Data
tests:
  column_tests:
    symbol:
      - not_null
      - dbt_utils.accepted_values:
          values: [BTCUSDT, ETHUSDT]
    price:
      - not_null
      - dbt_utils.accepted_range: { min_value: 0, inclusive: false }
    size:
      - not_null
      - dbt_utils.accepted_range: { min_value: 0, inclusive: false }
    event_time:
      - not_null
      - dbt_expectations.expect_column_values_to_match_regex:
          regex: "^[0-9]{13}$"
    trade_sequence:
      - not_null
  table_tests:
    - dbt_utils.unique_combination_of_columns:
        combination_of_columns: [symbol, trade_time, trade_id]
    - dbt_utils.recency: { datepart: hour, field: event_time, interval: 24 }
    - dbt_utils.relationships: { field: symbol, to: ref('symbol_reference'), to_field: symbol }
    - dbt_utils.at_least_one
    - dbt_utils.sequential_values: { field: trade_sequence, group_by: [symbol] }
Order Book Snapshots
tests:
  column_tests:
    symbol:
      - not_null
    best_bid_price:
      - not_null
      - dbt_utils.accepted_range: { min_value: 0, inclusive: false }
    best_ask_price:
      - not_null
      - dbt_utils.accepted_range: { min_value: 0, inclusive: false }
    snapshot_time:
      - not_null
  table_tests:
    - dbt_utils.unique_combination_of_columns:
        combination_of_columns: [symbol, snapshot_time]
    - dbt_expectations.expect_column_pair_values_A_to_be_greater_than_B:
        column_A: best_ask_price
        column_B: best_bid_price
    - dbt_utils.expression_is_true:
        expression: "best_bid_size > 0 AND best_ask_size > 0"
Quality Control Level
Test severity is configured per test in the feed provider YAML. Behavior on failure:
| Level | Behavior |
|---|---|
| error | Pipeline stops on test failure |
| warning | Violations logged; pipeline continues |
| silent | Violations recorded internally; no user-facing alert |
Tests at the ingestion layer (Pandera) block data from being written. Tests at the warehouse layer (dbt) run after the load, and their severity is configured per test.
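The three levels can be read as a small dispatch on the outcome of each test. The sketch below is one possible interpretation in Python. handle_result and QualityCheckError are hypothetical names, and the choice to record violations at every level before acting on them is an assumption made for this example.

```python
import logging

logger = logging.getLogger("data_quality")

VALID_LEVELS = {"error", "warning", "silent"}


class QualityCheckError(RuntimeError):
    """Raised to stop the pipeline when an error-level test fails."""


def handle_result(test_name: str, failures: int, level: str, recorded: list) -> None:
    """Apply the configured level to one test outcome."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    if failures == 0:
        return
    # Assumption: every level keeps an internal record of the violation.
    recorded.append((test_name, failures))
    if level == "error":
        raise QualityCheckError(f"{test_name}: {failures} failing rows")
    if level == "warning":
        logger.warning("%s: %d failing rows", test_name, failures)
    # "silent": recorded only, nothing is surfaced to the user.


recorded: list = []
handle_result("not_null_price", 3, "silent", recorded)
handle_result("recency_event_time", 1, "warning", recorded)
```

Only the error level raises, so a caller that wraps each test in handle_result continues past warning and silent failures and stops at the first error.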
Four-Layer Validation Architecture
Layer 1: Schema (Pydantic)
↓
Layer 2: Ingestion (Pandera)
↓
Layer 3: Warehouse (dbt tests) ← this reference
↓
Layer 4: Monitoring (Elementary)
Layer 1: Schema Validation
- Pydantic models validate all YAML configs at load time
- Naming conventions enforced: symbols, table names, field names
- Type safety: DataType enum, AttributeConstraint ranges
Layer 2: Ingestion-Time (Pandera)
- DataFrame-level schema conformance at read time
- Column type enforcement with coercion policies
- Custom check functions per column defined in code
Layer 3: Warehouse (dbt tests)
- All YAML-declared tests are compiled into dbt test cases
- Generated by the dbt generator as part of the dbt project
- Tests run via dbt test against the actual warehouse data
Layer 4: Elementary (Observability)
- Auto-detects anomalies in volume, null rates, and freshness
- Alerts on schema changes (new/removed columns, type changes)
- Central dashboard for test results across runs
- Column-level lineage visualization (sources → staging → CDM)
- Slack/Email notifications on failures
Test Generation Pipeline
YAML Test Declarations
↓
dbt Generator (SourcesGenerator / ProcessingGenerator)
↓
dbt .yml test configurations
↓
dbt test (runs against warehouse)
↓
elementary run-operation (collects results)
↓
elementary report (dashboard + alerts)
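The second and third steps of this pipeline turn the declared tests into the structure of a dbt properties file. As a rough sketch of that translation, the Python below builds a dictionary in the standard dbt layout (a models list whose entries carry model-level tests and a columns list). build_model_entry is a hypothetical function; the real SourcesGenerator and ProcessingGenerator output is not shown in this reference and may differ.

```python
from typing import Any


def build_model_entry(
    model_name: str,
    column_tests: dict[str, list],
    table_tests: list,
) -> dict[str, Any]:
    """Arrange declared tests in the shape of one dbt model entry."""
    return {
        "name": model_name,
        "tests": list(table_tests),  # model-level tests
        "columns": [
            {"name": column, "tests": list(tests)}
            for column, tests in column_tests.items()
        ],
    }


# Declarations as they would look after loading the YAML.
declared_column_tests = {
    "symbol": ["not_null"],
    "price": [
        "not_null",
        {"dbt_utils.accepted_range": {"min_value": 0, "inclusive": False}},
    ],
}
declared_table_tests = [
    {"dbt_utils.recency": {"datepart": "hour", "field": "event_time", "interval": 24}},
]

schema_yml = {
    "version": 2,
    "models": [
        build_model_entry("cdm_trades", declared_column_tests, declared_table_tests)
    ],
}
```

Serializing schema_yml with a YAML library would give a file that dbt test can pick up, assuming the test names and parameters are ones the installed dbt packages recognize.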