Data Quality Tests

Data quality is enforced at four layers: Pydantic schema validation at load time, Pandera DataFrame checks during ingestion, dbt tests at the warehouse level, and Elementary for continuous observability. This reference focuses on the YAML-declared dbt test layer — the primary interface users configure.


Test Structure

Tests are declared in DataTypeSchema using the TestSuite model — column_tests for per-column checks and table_tests for cross-column/table-wide checks.

data_types:
  cdm_trades:
    name: trades
    stream: "trade"
    unique_key: [symbol, trade_time, trade_id]

    # Table-level tests
    tests:
      table_tests:
        - dbt_utils.unique_combination_of_columns:
            combination_of_columns: [symbol, trade_time, trade_id]
        - dbt_utils.recency:
            datepart: hour
            field: event_time
            interval: 24

    # Column-level tests
    schema:
      symbol:
        dtype: string
        tests:
          - not_null
          - dbt_utils.accepted_values:
              values: [BTCUSDT, ETHUSDT, BNBUSDT, ADAUSDT, XRPUSDT, SOLUSDT]
      price:
        dtype: decimal
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: false
      event_time:
        dtype: timestamp
        tests:
          - not_null

    field_mappings:
      - target: trade_id
        source: t
        transformation: "cast_safe(t, bigint)"
      - target: price
        source: p
        transformation: "cast_safe(p, decimal)"
      - target: event_time
        source: "E"
        transformation: "timestamp_ms(cast_safe(E, bigint))"
        is_time_filter_field: true

Column-Level Tests

not_null

Ensures a column contains no NULL values.

tests:
  - not_null

| Parameter | None |
| Level | Column |


dbt_utils.accepted_values

Ensures a column only contains values from a predefined list.

tests:
  - dbt_utils.accepted_values:
      values: [BTCUSDT, ETHUSDT, BNBUSDT]

| Parameter | Type | Required | Description |
|---|---|---|---|
| values | list | Yes | Allowed values |

| Level | Column |


dbt_utils.accepted_range

Ensures numeric column values fall within a specified range.

tests:
  - dbt_utils.accepted_range:
      min_value: 0
      inclusive: false

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| min_value | float | No | | Minimum allowed value |
| max_value | float or str | No | | Maximum value (supports dbt {{ var() }}) |
| inclusive | bool | No | true | Whether endpoints are inclusive |

| Level | Column |
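The effect of inclusive on the endpoints can be illustrated with a small Python sketch. This is not the SQL that dbt runs; the function name in_accepted_range and the choice to let NULLs pass (leaving them to not_null) are assumptions made for illustration.

```python
from typing import Optional


def in_accepted_range(value: Optional[float],
                      min_value: Optional[float] = None,
                      max_value: Optional[float] = None,
                      inclusive: bool = True) -> bool:
    """Return True when value passes the range check."""
    if value is None:
        # Assumption: NULLs are not flagged here; not_null covers them.
        return True
    if min_value is not None:
        if value < min_value or (not inclusive and value == min_value):
            return False
    if max_value is not None:
        if value > max_value or (not inclusive and value == max_value):
            return False
    return True


prices = [101.5, 0.0, -3.2, None]
failures = [p for p in prices
            if not in_accepted_range(p, min_value=0, inclusive=False)]
print(failures)  # [0.0, -3.2]
```

With inclusive: false, a price of exactly 0 fails; with the default inclusive: true it would pass.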


dbt_utils.not_accepted_values

Inverse of accepted_values — ensures a column does NOT contain specific values. Useful for excluding known bad data.

tests:
  - dbt_utils.not_accepted_values:
      values: [0, -1, -999]

| Parameter | Type | Required | Description |
|---|---|---|---|
| values | list | Yes | Forbidden values |

| Level | Column |


dbt_expectations.expect_column_values_to_match_regex

Validates column values against a regex pattern. Useful for symbol formats, trade ID patterns, etc.

tests:
  - dbt_expectations.expect_column_values_to_match_regex:
      regex: "^[A-Z0-9]{5,12}$"

| Parameter | Type | Required | Description |
|---|---|---|---|
| regex | str | Yes | Regex pattern to match |
| row_condition | str | No | Conditional filter on rows |

| Level | Column |
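Before committing a pattern to YAML, it can help to try it against sample values. The sketch below uses Python's re module with the pattern from the example above; the sample symbols and the helper name failing_values are illustrative, and warehouse regex dialects can differ from Python's in edge cases.

```python
import re

# Pattern copied from the YAML example: 5 to 12 uppercase letters or digits.
SYMBOL_PATTERN = re.compile(r"^[A-Z0-9]{5,12}$")


def failing_values(values):
    """Return the non-null values that do not match the pattern."""
    return [v for v in values
            if v is not None and not SYMBOL_PATTERN.fullmatch(v)]


candidates = ["BTCUSDT", "ETHUSDT", "btcusdt", "BTC", "BTC-USDT", "1000SHIBUSDT"]
print(failing_values(candidates))  # ['btcusdt', 'BTC', 'BTC-USDT']
```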


dbt_expectations.expect_column_values_to_be_increasing

Ensures column values are monotonically increasing — critical for timestamp sequences in time-series market data.

tests:
  - dbt_expectations.expect_column_values_to_be_increasing:
      group_by: [symbol]
      strictly: false

| Parameter | Type | Required | Description |
|---|---|---|---|
| group_by | list[str] | No | Group columns for partitioning |
| strictly | bool | No | Enforce strict inequality |

| Level | Column |
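The difference between strictly: false and strictly: true is easiest to see on ties. The following Python sketch checks ordering per group over rows that are assumed to already be in their natural sort order; the function name and sample rows are hypothetical and do not reflect how the warehouse test is implemented.

```python
def increasing_violations(rows, column, group_by=(), strictly=False):
    """Return (previous, current) row pairs that break the ordering."""
    last_seen = {}
    violations = []
    for row in rows:  # rows are assumed to arrive in their sort order
        key = tuple(row[g] for g in group_by)
        prev = last_seen.get(key)
        if prev is not None:
            if strictly:
                bad = row[column] <= prev[column]  # ties are violations
            else:
                bad = row[column] < prev[column]   # ties are allowed
            if bad:
                violations.append((prev, row))
        last_seen[key] = row
    return violations


rows = [
    {"symbol": "BTCUSDT", "event_time": 1},
    {"symbol": "ETHUSDT", "event_time": 5},
    {"symbol": "BTCUSDT", "event_time": 1},  # tie within BTCUSDT
    {"symbol": "ETHUSDT", "event_time": 4},  # decrease within ETHUSDT
]
non_strict = increasing_violations(rows, "event_time", group_by=["symbol"])
strict = increasing_violations(rows, "event_time", group_by=["symbol"], strictly=True)
print(len(non_strict), len(strict))  # 1 2
```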


Table-Level Tests

dbt_utils.unique_combination_of_columns

Ensures a set of columns forms a unique key across the entire table.

table_tests:
  - dbt_utils.unique_combination_of_columns:
      combination_of_columns: [symbol, trade_time, trade_id]

| Parameter | Type | Required | Description |
|---|---|---|---|
| combination_of_columns | list[str] | Yes | Columns that together must be unique |

| Level | Table |
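Conceptually, the check groups rows by the listed columns and flags any group with more than one row. A minimal Python sketch of that idea, with illustrative sample data and a hypothetical helper name:

```python
from collections import Counter


def duplicate_keys(rows, combination_of_columns):
    """Return each composite key that occurs more than once, with its count."""
    counts = Counter(
        tuple(row[c] for c in combination_of_columns) for row in rows
    )
    return {key: n for key, n in counts.items() if n > 1}


trades = [
    {"symbol": "BTCUSDT", "trade_time": 1700000000000, "trade_id": 1},
    {"symbol": "BTCUSDT", "trade_time": 1700000000000, "trade_id": 2},
    {"symbol": "BTCUSDT", "trade_time": 1700000000000, "trade_id": 1},
]
print(duplicate_keys(trades, ["symbol", "trade_time", "trade_id"]))
# {('BTCUSDT', 1700000000000, 1): 2}
```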


dbt_utils.recency

Ensures data is fresh: the most recent value of field must be no older than interval units of datepart, measured from the current time.

table_tests:
  - dbt_utils.recency:
      datepart: hour
      field: event_time
      interval: 24

| Parameter | Type | Required | Description |
|---|---|---|---|
| datepart | str | Yes | Time unit (hour, day, minute) |
| field | str | Yes | Timestamp column to check |
| interval | int | Yes | Maximum allowed age |

| Level | Table |
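The freshness rule amounts to comparing the newest timestamp with a cutoff of "now minus interval". The Python sketch below shows that comparison; the helper name is_fresh, the fixed now value, and the decision to treat an empty table as stale are assumptions of this sketch rather than documented behavior.

```python
from datetime import datetime, timedelta, timezone

# Map the datepart names from the table above to timedelta keyword arguments.
_UNITS = {"minute": "minutes", "hour": "hours", "day": "days"}


def is_fresh(timestamps, datepart, interval, now=None):
    """Return True when the newest timestamp is within the allowed age."""
    now = now or datetime.now(timezone.utc)
    if not timestamps:
        # Assumption of this sketch: no rows means the data is not fresh.
        return False
    threshold = now - timedelta(**{_UNITS[datepart]: interval})
    return max(timestamps) >= threshold


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
event_times = [
    datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc),
]
print(is_fresh(event_times, "hour", 24, now=now))  # True
print(is_fresh(event_times, "hour", 12, now=now))  # False
```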


dbt_utils.expression_is_true

Validates that a SQL expression evaluates to true for all rows.

# Timestamp ordering
table_tests:
  - dbt_utils.expression_is_true:
      expression: "event_time >= received_time"

# Price sanity
table_tests:
  - dbt_utils.expression_is_true:
      expression: "price > 0"

| Parameter | Type | Required | Description |
|---|---|---|---|
| expression | str | Yes | SQL expression that must be true |

| Level | Column or Table |


dbt_utils.relationships

Ensures referential integrity — values in a column exist in a reference table. Critical for symbol validation against a master symbol list.

table_tests:
  - dbt_utils.relationships:
      field: symbol
      to: ref('symbol_reference')
      to_field: symbol

| Parameter | Type | Required | Description |
|---|---|---|---|
| field | str | Yes | Column to validate |
| to | ref | Yes | Reference table |
| to_field | str | Yes | Reference column |

| Level | Table |
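Referential integrity reduces to a set-membership check: every non-null value in the child column must appear in the reference column. A small Python sketch of that idea follows; the helper name orphaned_values and the sample rows are hypothetical.

```python
def orphaned_values(child_rows, field, parent_rows, to_field):
    """Return child values that have no match in the reference column."""
    parent_values = {row[to_field] for row in parent_rows}
    return sorted({
        row[field] for row in child_rows
        if row[field] is not None and row[field] not in parent_values
    })


symbol_reference = [{"symbol": "BTCUSDT"}, {"symbol": "ETHUSDT"}]
trades = [
    {"symbol": "BTCUSDT"},
    {"symbol": "DOGEUSDT"},  # not in the master list
    {"symbol": None},        # NULLs are skipped in this sketch
]
print(orphaned_values(trades, "symbol", symbol_reference, "symbol"))
# ['DOGEUSDT']
```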


dbt_utils.at_least_one

Ensures the table contains at least one row — catches empty partitions or missing data days.

table_tests:
  - dbt_utils.at_least_one

| Parameter | None |
| Level | Table |


dbt_utils.sequential_values

Detects gaps in sequential columns. Useful for trade sequence IDs and bar indices.

table_tests:
  - dbt_utils.sequential_values:
      field: trade_sequence
      group_by: [symbol]
      max_gap: 5

| Parameter | Type | Required | Description |
|---|---|---|---|
| field | str | Yes | Sequential column to check |
| group_by | list[str] | No | Partition columns |
| max_gap | int | No | Maximum allowed gap |

| Level | Table |
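As a rough illustration of gap detection, the Python sketch below sorts each group's sequence values and reports consecutive pairs whose difference exceeds max_gap. It assumes max_gap means the largest tolerated difference between neighbouring values, which is an interpretation of the parameter table rather than a confirmed definition; the function name and data are illustrative.

```python
from collections import defaultdict


def sequence_gaps(rows, field, group_by=(), max_gap=1):
    """Return (group_key, previous, current) for each gap above max_gap."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[g] for g in group_by)].append(row[field])
    gaps = []
    for key, values in groups.items():
        ordered = sorted(values)
        # Compare each value with its successor inside the same group.
        for prev, cur in zip(ordered, ordered[1:]):
            if cur - prev > max_gap:
                gaps.append((key, prev, cur))
    return gaps


rows = [
    {"symbol": "BTCUSDT", "trade_sequence": 1},
    {"symbol": "BTCUSDT", "trade_sequence": 2},
    {"symbol": "BTCUSDT", "trade_sequence": 3},
    {"symbol": "BTCUSDT", "trade_sequence": 10},  # jump of 7
    {"symbol": "ETHUSDT", "trade_sequence": 7},
    {"symbol": "ETHUSDT", "trade_sequence": 9},   # jump of 2
]
print(sequence_gaps(rows, "trade_sequence", group_by=["symbol"], max_gap=5))
# [(('BTCUSDT',), 3, 10)]
```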


dbt_utils.mutually_exclusive_ranges

Ensures no overlapping ranges for validity periods — prevents duplicate symbol mappings or bar windows.

table_tests:
  - dbt_utils.mutually_exclusive_ranges:
      lower_bound_column: valid_from
      upper_bound_column: valid_to
      group_by: [symbol]

| Parameter | Type | Required | Description |
|---|---|---|---|
| lower_bound_column | str | Yes | Start of range |
| upper_bound_column | str | Yes | End of range |
| group_by | list[str] | No | Partition columns |

| Level | Table |
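Overlap detection can be sketched in Python by sorting each group's ranges by their lower bound and comparing neighbours: if any two ranges in a group overlap, at least one neighbouring pair does. The sketch assumes ranges may touch (one range's valid_to equal to the next range's valid_from); the function name and sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date


def overlapping_ranges(rows, lower_bound_column, upper_bound_column, group_by=()):
    """Return neighbouring (previous, current) rows whose ranges overlap."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[g] for g in group_by)].append(row)
    overlaps = []
    for group_rows in groups.values():
        ordered = sorted(group_rows, key=lambda r: r[lower_bound_column])
        for prev, cur in zip(ordered, ordered[1:]):
            # Touching endpoints are allowed; starting earlier is an overlap.
            if cur[lower_bound_column] < prev[upper_bound_column]:
                overlaps.append((prev, cur))
    return overlaps


mappings = [
    {"symbol": "BTCUSDT", "valid_from": date(2024, 1, 1), "valid_to": date(2024, 2, 1)},
    {"symbol": "BTCUSDT", "valid_from": date(2024, 1, 15), "valid_to": date(2024, 3, 1)},
    {"symbol": "ETHUSDT", "valid_from": date(2024, 1, 1), "valid_to": date(2024, 2, 1)},
    {"symbol": "ETHUSDT", "valid_from": date(2024, 2, 1), "valid_to": date(2024, 3, 1)},
]
found = overlapping_ranges(mappings, "valid_from", "valid_to", group_by=["symbol"])
print(len(found))              # 1
print(found[0][0]["symbol"])   # BTCUSDT
```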


dbt_utils.cardinality_equality

Ensures row count consistency between source and CDM staging.

table_tests:
  - dbt_utils.cardinality_equality:
      field: cdm_id
      to: ref('source_trades')
      to_field: trade_id

| Parameter | Type | Required | Description |
|---|---|---|---|
| field | str | Yes | Column in this table |
| to | ref | Yes | Comparison table |
| to_field | str | Yes | Comparison column |

| Level | Table |


dbt_expectations.expect_column_pair_values_A_to_be_greater_than_B

Ensures ask > bid, event_time >= received_time, or any ordered column pair.

table_tests:
  - dbt_expectations.expect_column_pair_values_A_to_be_greater_than_B:
      column_A: ask_price
      column_B: bid_price

| Parameter | Type | Required | Description |
|---|---|---|---|
| column_A | str | Yes | Column that should be greater |
| column_B | str | Yes | Column that should be smaller |
| row_condition | str | No | Conditional filter |

| Level | Table |


High-Frequency Trade Data

tests:
  column_tests:
    symbol:
      - not_null
      - dbt_utils.accepted_values:
          values: [BTCUSDT, ETHUSDT]
    price:
      - not_null
      - dbt_utils.accepted_range: { min_value: 0, inclusive: false }
    size:
      - not_null
      - dbt_utils.accepted_range: { min_value: 0, inclusive: false }
    event_time:
      - not_null
      - dbt_expectations.expect_column_values_to_match_regex:
          regex: "^[0-9]{13}$"
    trade_sequence:
      - not_null
  table_tests:
    - dbt_utils.unique_combination_of_columns:
        combination_of_columns: [symbol, trade_time, trade_id]
    - dbt_utils.recency: { datepart: hour, field: event_time, interval: 24 }
    - dbt_utils.relationships: { field: symbol, to: ref('symbol_reference'), to_field: symbol }
    - dbt_utils.at_least_one
    - dbt_utils.sequential_values: { field: trade_sequence, group_by: [symbol] }

Order Book Snapshots

tests:
  column_tests:
    symbol:
      - not_null
    best_bid_price:
      - not_null
      - dbt_utils.accepted_range: { min_value: 0, inclusive: false }
    best_ask_price:
      - not_null
      - dbt_utils.accepted_range: { min_value: 0, inclusive: false }
    snapshot_time:
      - not_null
  table_tests:
    - dbt_utils.unique_combination_of_columns:
        combination_of_columns: [symbol, snapshot_time]
    - dbt_expectations.expect_column_pair_values_A_to_be_greater_than_B:
        column_A: best_ask_price
        column_B: best_bid_price
    - dbt_utils.expression_is_true:
        expression: "best_bid_size > 0 AND best_ask_size > 0"

Quality Control Level

Test severity is configured per test in the feed provider YAML. Behavior on failure:

| Level | Behavior |
|---|---|
| error | Pipeline stops on test failure |
| warning | Violations logged; pipeline continues |
| silent | Violations recorded internally; no user-facing alert |

Tests at the ingestion layer (Pandera) block data from being written. Tests at the warehouse layer (dbt) run post-load, and their severity is configured per test.
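The three levels can be read as a simple dispatch on failure. The Python sketch below mirrors the table above; the Severity enum, PipelineStopped exception, and handle_failure function are hypothetical names introduced for illustration and are not part of the documented API.

```python
from enum import Enum


class Severity(Enum):
    ERROR = "error"
    WARNING = "warning"
    SILENT = "silent"


class PipelineStopped(Exception):
    """Raised when an error-severity test fails."""


def handle_failure(test_name, severity, user_log, internal_record):
    """Apply the failure behavior for the given severity level."""
    if severity is Severity.ERROR:
        raise PipelineStopped(f"{test_name} failed")
    if severity is Severity.WARNING:
        user_log.append(f"WARNING: {test_name} failed")  # pipeline continues
    else:  # Severity.SILENT: record only, no user-facing alert
        internal_record.append(test_name)


user_log, internal_record = [], []
handle_failure("price.accepted_range", Severity.WARNING, user_log, internal_record)
handle_failure("symbol.match_regex", Severity.SILENT, user_log, internal_record)
try:
    handle_failure("event_time.not_null", Severity.ERROR, user_log, internal_record)
except PipelineStopped as exc:
    print(exc)          # event_time.not_null failed
print(user_log)         # ['WARNING: price.accepted_range failed']
print(internal_record)  # ['symbol.match_regex']
```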


Four-Layer Validation Architecture

Layer 1: Schema (Pydantic)
↓
Layer 2: Ingestion (Pandera)
↓
Layer 3: Warehouse (dbt tests) ← this reference
↓
Layer 4: Monitoring (Elementary)

Layer 1: Schema Validation

  • Pydantic models validate all YAML configs at load time
  • Naming conventions enforced: symbols, table names, field names
  • Type safety: DataType enum, AttributeConstraint ranges

Layer 2: Ingestion-Time (Pandera)

  • DataFrame-level schema conformance at read time
  • Column type enforcement with coercion policies
  • Custom check functions per column defined in code

Layer 3: Warehouse (dbt tests)

  • All YAML-declared tests are compiled into dbt test cases
  • Generated by the dbt generator as part of the dbt project
  • Tests run via dbt test against the actual warehouse data
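To make the compilation step concrete, the following Python sketch turns a TestSuite-shaped dictionary into the nested structure that dbt reads from a model .yml file (model-level tests plus per-column tests). The actual generators are not shown in this reference, so the function to_dbt_model_entry and its output layout are assumptions; note also that newer dbt versions accept data_tests as the key name.

```python
import json


def to_dbt_model_entry(model_name, test_suite):
    """Build one dbt 'models' entry from column_tests and table_tests."""
    return {
        "name": model_name,
        # Table-wide checks attach to the model itself.
        "tests": list(test_suite.get("table_tests", [])),
        # Per-column checks attach to each column entry.
        "columns": [
            {"name": column, "tests": list(tests)}
            for column, tests in test_suite.get("column_tests", {}).items()
        ],
    }


suite = {
    "column_tests": {
        "symbol": ["not_null"],
        "price": [
            "not_null",
            {"dbt_utils.accepted_range": {"min_value": 0, "inclusive": False}},
        ],
    },
    "table_tests": [
        {"dbt_utils.unique_combination_of_columns": {
            "combination_of_columns": ["symbol", "trade_time", "trade_id"]}},
    ],
}
schema_yml = {"version": 2, "models": [to_dbt_model_entry("cdm_trades", suite)]}
print(json.dumps(schema_yml, indent=2))
```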

Layer 4: Elementary (Observability)

  • Auto-detects anomalies in volume, null rates, and freshness
  • Alerts on schema changes (new/removed columns, type changes)
  • Central dashboard for test results across runs
  • Column-level lineage visualization (sources → staging → CDM)
  • Slack/Email notifications on failures

Test Generation Pipeline

YAML Test Declarations
↓
dbt Generator (SourcesGenerator / ProcessingGenerator)
↓
dbt .yml test configurations
↓
dbt test (runs against warehouse)
↓
elementary run-operation (collects results)
↓
elementary report (dashboard + alerts)