Data Quality Tests

Data quality is enforced at four layers: Pydantic schema validation at load time, Pandera DataFrame checks during ingestion, dbt tests at the warehouse level, and Elementary for continuous observability. This reference focuses on the YAML-declared dbt test layer — the primary interface users configure.

Test Structure

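Tests are declared in `DataTypeSchema` using the `TestSuite` model: `column_tests` holds per-column checks and `table_tests` holds cross-column and table-wide checks. In the example below, table-level tests sit under `tests.table_tests`, while column-level tests are attached to each column's `tests` key under `schema`.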

data_types:
  cdm_trades:
    name: trades
    stream: "trade"
    unique_key: [symbol, trade_time, trade_id]

    # Table-level tests
    tests:
      table_tests:
        - dbt_utils.unique_combination_of_columns:
            combination_of_columns: [symbol, trade_time, trade_id]
        - dbt_utils.recency:
            datepart: hour
            field: event_time
            interval: 24

    # Column-level tests
    schema:
      symbol:
        dtype: string
        tests:
          - not_null
          - dbt_utils.accepted_values:
              values: [BTCUSDT, ETHUSDT, BNBUSDT, ADAUSDT, XRPUSDT, SOLUSDT]
      price:
        dtype: decimal
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: false
      event_time:
        dtype: timestamp
        tests:
          - not_null

    field_mappings:
      - target: trade_id
        source: t
        transformation: "cast_safe(t, bigint)"
      - target: price
        source: p
        transformation: "cast_safe(p, decimal)"
      - target: event_time
        source: "E"
        transformation: "timestamp_ms(cast_safe(E, bigint))"
        is_time_filter_field: true
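
To make the two-level structure concrete, the sketch below models it with plain dataclasses. It is an illustration only: the real `TestSuite` is a Pydantic model whose exact fields are not shown in this reference, so `DeclaredTest`, `parse_test`, and the field layout here are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Union


@dataclass
class DeclaredTest:
    """One declared test: a name plus optional keyword arguments (hypothetical)."""
    name: str
    kwargs: dict = field(default_factory=dict)


@dataclass
class TestSuite:
    """Hypothetical stand-in for the TestSuite model described above."""
    column_tests: dict = field(default_factory=dict)  # column name -> list of tests
    table_tests: list = field(default_factory=list)


def parse_test(entry: Union[str, dict]) -> DeclaredTest:
    """Accept both YAML spellings: a bare name or a one-key mapping."""
    if isinstance(entry, str):
        return DeclaredTest(name=entry)
    [(name, kwargs)] = entry.items()  # exactly one key is expected
    return DeclaredTest(name=name, kwargs=dict(kwargs or {}))


suite = TestSuite(
    column_tests={
        "price": [
            parse_test("not_null"),
            parse_test({"dbt_utils.accepted_range": {"min_value": 0, "inclusive": False}}),
        ]
    },
    table_tests=[
        parse_test({"dbt_utils.recency": {"datepart": "hour", "field": "event_time", "interval": 24}})
    ],
)
```

Both spellings used in the YAML (a bare test name such as `not_null`, and a test name mapped to its parameters) end up in the same shape, which is what lets column and table tests share one list format.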

Column-Level Tests

`not_null`

Ensures a column contains no NULL values.

tests:
  - not_null

`dbt_utils.accepted_values`

Ensures a column only contains values from a predefined list.

tests:
  - dbt_utils.accepted_values:
      values: [BTCUSDT, ETHUSDT, BNBUSDT]

| Parameter | Type | Required | Description |
|---|---|---|---|
| `values` | `list` | Yes | Allowed values |

Level: Column

`dbt_utils.accepted_range`

Ensures numeric column values fall within a specified range.

tests:
  - dbt_utils.accepted_range:
      min_value: 0
      inclusive: false

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `min_value` | `float` | No | (none) | Minimum allowed value |
| `max_value` | `float \| str` | No | (none) | Maximum value (supports dbt `{{ var() }}`) |
| `inclusive` | `bool` | No | `true` | Whether endpoints are inclusive |

Level: Column
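
The effect of `inclusive` is easiest to see on the boundary value itself. The function below is a minimal Python sketch of the row-level rule, not the SQL that dbt actually runs; `in_accepted_range` is a hypothetical helper name.

```python
from typing import Optional


def in_accepted_range(value: float,
                      min_value: Optional[float] = None,
                      max_value: Optional[float] = None,
                      inclusive: bool = True) -> bool:
    """Return True if `value` satisfies the configured bounds."""
    if min_value is not None:
        # With inclusive=False the bound itself is rejected.
        if value < min_value or (not inclusive and value == min_value):
            return False
    if max_value is not None:
        if value > max_value or (not inclusive and value == max_value):
            return False
    return True


# A price of exactly 0 passes the default check but fails the strict one.
zero_passes_inclusive = in_accepted_range(0, min_value=0)
zero_passes_exclusive = in_accepted_range(0, min_value=0, inclusive=False)
```

This is why the price examples in this reference set `inclusive: false` together with `min_value: 0`: a zero price should be flagged, not accepted.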

`dbt_utils.not_accepted_values`

The inverse of `accepted_values`: ensures a column does not contain any of the listed values. Useful for catching known bad data such as sentinel codes.

tests:
  - dbt_utils.not_accepted_values:
      values: [0, -1, -999]

| Parameter | Type | Required | Description |
|---|---|---|---|
| `values` | `list` | Yes | Forbidden values |

Level: Column

`dbt_expectations.expect_column_values_to_match_regex`

Validates column values against a regex pattern. Useful for symbol formats, trade ID patterns, etc.

tests:
  - dbt_expectations.expect_column_values_to_match_regex:
      regex: "^[A-Z0-9]{5,12}$"

| Parameter | Type | Required | Description |
|---|---|---|---|
| `regex` | `str` | Yes | Regex pattern to match |
| `row_condition` | `str` | No | Conditional filter on rows |

Level: Column
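
Before committing a pattern to YAML, it can help to try it on a few sample values. The sketch below uses Python's `re` module as a stand-in; the warehouse's regex dialect may differ in details, and `failing_values` is a hypothetical helper.

```python
import re

# Same pattern as the YAML example above: 5 to 12 upper-case letters or digits.
SYMBOL_PATTERN = re.compile(r"^[A-Z0-9]{5,12}$")


def failing_values(values, pattern=SYMBOL_PATTERN):
    """Return the non-null values that do not match the pattern."""
    return [v for v in values if v is not None and not pattern.fullmatch(v)]


samples = ["BTCUSDT", "btcusdt", "BTC", "ETH-USDT", "1000SHIBUSDT", None]
rejected = failing_values(samples)
```

Here `rejected` contains the lower-case, too-short, and hyphenated symbols, while `BTCUSDT` and the 12-character `1000SHIBUSDT` pass.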

`dbt_expectations.expect_column_values_to_be_increasing`

Ensures column values are monotonically increasing — critical for timestamp sequences in time-series market data.

tests:
  - dbt_expectations.expect_column_values_to_be_increasing:
      group_by: [symbol]
      strictly: false

| Parameter | Type | Required | Description |
|---|---|---|---|
| `group_by` | `list[str]` | No | Group columns for partitioning |
| `strictly` | `bool` | No | Enforce strict inequality |

Level: Column
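
The interaction between `group_by` and `strictly` can be sketched in a few lines of Python. This is an illustration of the rule, assuming rows arrive in the order the check should follow; it is not how dbt_expectations implements the test, and `increasing_violations` is a hypothetical name.

```python
from operator import ge, gt


def increasing_violations(rows, column, group_by=(), strictly=False):
    """Return rows whose `column` value breaks the ordering within its group."""
    # strictly=True requires current > previous; otherwise ties are allowed.
    is_ordered = gt if strictly else ge
    last_seen = {}
    bad = []
    for row in rows:
        key = tuple(row[g] for g in group_by)
        if key in last_seen and not is_ordered(row[column], last_seen[key]):
            bad.append(row)
        last_seen[key] = row[column]
    return bad


rows = [
    {"symbol": "BTCUSDT", "event_time": 1},
    {"symbol": "ETHUSDT", "event_time": 5},
    {"symbol": "BTCUSDT", "event_time": 1},  # tie within BTCUSDT
    {"symbol": "ETHUSDT", "event_time": 4},  # goes backwards within ETHUSDT
]
per_symbol = increasing_violations(rows, "event_time", group_by=["symbol"])
per_symbol_strict = increasing_violations(rows, "event_time", group_by=["symbol"], strictly=True)
```

With `strictly: false`, only the ETHUSDT row that goes backwards is flagged; with `strictly: true`, the repeated BTCUSDT timestamp is flagged as well. Without `group_by`, interleaved symbols would be compared against each other, which is rarely what a multi-symbol feed wants.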

Table-Level Tests

`dbt_utils.unique_combination_of_columns`

Ensures a set of columns forms a unique key across the entire table.

table_tests:
  - dbt_utils.unique_combination_of_columns:
      combination_of_columns: [symbol, trade_time, trade_id]

| Parameter | Type | Required | Description |
|---|---|---|---|
| `combination_of_columns` | `list[str]` | Yes | Columns that together must be unique |

Level: Table
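
As a rough picture of what this test looks for, the Python sketch below counts each key tuple and reports the ones that occur more than once. The helper name `duplicate_keys` and the sample rows are illustrative assumptions.

```python
from collections import Counter


def duplicate_keys(rows, combination_of_columns):
    """Return key tuples that appear on more than one row."""
    counts = Counter(
        tuple(row[c] for c in combination_of_columns) for row in rows
    )
    return [key for key, n in counts.items() if n > 1]


rows = [
    {"symbol": "BTCUSDT", "trade_time": 100, "trade_id": 1},
    {"symbol": "BTCUSDT", "trade_time": 100, "trade_id": 2},
    {"symbol": "BTCUSDT", "trade_time": 100, "trade_id": 1},  # duplicate key
    {"symbol": "ETHUSDT", "trade_time": 100, "trade_id": 1},
]
dupes = duplicate_keys(rows, ["symbol", "trade_time", "trade_id"])
```

Individual columns may repeat freely; only a repeat of the full combination counts as a violation.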

`dbt_utils.recency`

Ensures data is fresh: the most recent value of `field` must be no older than `interval` units of `datepart` (for example, 24 hours) relative to the current time.

table_tests:
  - dbt_utils.recency:
      datepart: hour
      field: event_time
      interval: 24

| Parameter | Type | Required | Description |
|---|---|---|---|
| `datepart` | `str` | Yes | Time unit (`hour`, `day`, `minute`) |
| `field` | `str` | Yes | Timestamp column to check |
| `interval` | `int` | Yes | Maximum allowed age |

Level: Table
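
The three parameters combine into a single threshold comparison. The following Python sketch shows that arithmetic under the assumption that timestamps are timezone-aware; `is_recent` is a hypothetical helper, and the real test performs the equivalent comparison in SQL.

```python
from datetime import datetime, timedelta, timezone

# Map the dateparts listed above onto timedelta keyword arguments.
_UNITS = {"minute": "minutes", "hour": "hours", "day": "days"}


def is_recent(latest, datepart, interval, now=None):
    """Return True if `latest` is within `interval` dateparts of `now`."""
    now = now or datetime.now(timezone.utc)
    threshold = now - timedelta(**{_UNITS[datepart]: interval})
    return latest >= threshold


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh = is_recent(datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc), "hour", 24, now)
stale = is_recent(datetime(2024, 1, 1, 11, 59, tzinfo=timezone.utc), "hour", 24, now)
```

With `datepart: hour` and `interval: 24`, a latest `event_time` exactly 24 hours old still passes in this sketch, while one a minute older fails.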

`dbt_utils.expression_is_true`

Validates that a SQL expression evaluates to true for all rows.

# Timestamp ordering
table_tests:
  - dbt_utils.expression_is_true:
      expression: "event_time >= received_time"

# Price sanity
table_tests:
  - dbt_utils.expression_is_true:
      expression: "price > 0"

| Parameter | Type | Required | Description |
|---|---|---|---|
| `expression` | `str` | Yes | SQL expression that must be true |

Level: Column or Table

`dbt_utils.relationships`

Ensures referential integrity — values in a column exist in a reference table. Critical for symbol validation against a master symbol list.

table_tests:
  - dbt_utils.relationships:
      field: symbol
      to: ref('symbol_reference')
      to_field: symbol

| Parameter | Type | Required | Description |
|---|---|---|---|
| `field` | `str` | Yes | Column to validate |
| `to` | `ref` | Yes | Reference table |
| `to_field` | `str` | Yes | Reference column |

Level: Table
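
Conceptually this is a set-membership check. The Python sketch below illustrates it on in-memory lists; `orphaned_values` is a hypothetical name, and skipping NULLs is an assumption that mirrors how dbt's relationship checks usually behave (pair the test with `not_null` if NULL symbols should also fail).

```python
def orphaned_values(child_values, parent_values):
    """Return distinct non-null child values missing from the reference column."""
    parents = set(parent_values)
    return sorted({v for v in child_values if v is not None and v not in parents})


trade_symbols = ["BTCUSDT", "DOGEUSDT", None, "ETHUSDT", "DOGEUSDT"]
reference_symbols = ["BTCUSDT", "ETHUSDT"]
orphans = orphaned_values(trade_symbols, reference_symbols)
```

Any symbol reported here exists in the trades table but not in `symbol_reference`.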

`dbt_utils.at_least_one`

Ensures the table contains at least one row — catches empty partitions or missing data days.

table_tests:
  - dbt_utils.at_least_one

`dbt_utils.sequential_values`

Detects gaps in sequential columns. Useful for trade sequence IDs and bar indices.

table_tests:
  - dbt_utils.sequential_values:
      field: trade_sequence
      group_by: [symbol]
      max_gap: 5

| Parameter | Type | Required | Description |
|---|---|---|---|
| `field` | `str` | Yes | Sequential column to check |
| `group_by` | `list[str]` | No | Partition columns |
| `max_gap` | `int` | No | Maximum allowed gap |

Level: Table
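
One plausible reading of `max_gap` is that consecutive values within a group may differ by at most that amount. The Python sketch below encodes that interpretation; it is an assumption, since this reference does not define the parameter more precisely, and `sequence_gaps` is a hypothetical helper.

```python
from collections import defaultdict


def sequence_gaps(rows, field, group_by=(), max_gap=1):
    """Return (group, previous, current) for each jump larger than max_gap."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[g] for g in group_by)].append(row[field])
    gaps = []
    for key, values in groups.items():
        values.sort()
        # Compare each value with its predecessor inside the same group.
        for prev, curr in zip(values, values[1:]):
            if curr - prev > max_gap:
                gaps.append((key, prev, curr))
    return gaps


rows = [
    {"symbol": "BTCUSDT", "trade_sequence": 1},
    {"symbol": "BTCUSDT", "trade_sequence": 2},
    {"symbol": "BTCUSDT", "trade_sequence": 3},
    {"symbol": "BTCUSDT", "trade_sequence": 10},  # jump of 7
    {"symbol": "ETHUSDT", "trade_sequence": 1},
    {"symbol": "ETHUSDT", "trade_sequence": 2},
]
gaps = sequence_gaps(rows, "trade_sequence", group_by=["symbol"], max_gap=5)
```

Under this reading, `max_gap: 5` tolerates a handful of dropped messages per symbol but flags the jump from 3 to 10.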

`dbt_utils.mutually_exclusive_ranges`

Ensures no overlapping ranges for validity periods — prevents duplicate symbol mappings or bar windows.

table_tests:
  - dbt_utils.mutually_exclusive_ranges:
      lower_bound_column: valid_from
      upper_bound_column: valid_to
      group_by: [symbol]

| Parameter | Type | Required | Description |
|---|---|---|---|
| `lower_bound_column` | `str` | Yes | Start of range |
| `upper_bound_column` | `str` | Yes | End of range |
| `group_by` | `list[str]` | No | Partition columns |

Level: Table
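
Overlap detection reduces to sorting each group by its lower bound and comparing neighbours. The Python sketch below shows the idea; it assumes half-open ranges, so a range that starts exactly where the previous one ends is not an overlap, and `overlapping_ranges` is a hypothetical helper.

```python
from collections import defaultdict


def overlapping_ranges(rows, lower_bound_column, upper_bound_column, group_by=()):
    """Return adjacent (previous, current) rows whose ranges overlap.

    Checking neighbours after sorting is enough to detect whether any
    overlap exists in a group, though it does not list every overlapping pair.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[g] for g in group_by)].append(row)
    overlaps = []
    for members in groups.values():
        members.sort(key=lambda r: r[lower_bound_column])
        for prev, curr in zip(members, members[1:]):
            if curr[lower_bound_column] < prev[upper_bound_column]:
                overlaps.append((prev, curr))
    return overlaps


rows = [
    {"symbol": "BTCUSDT", "valid_from": 1, "valid_to": 5},
    {"symbol": "BTCUSDT", "valid_from": 5, "valid_to": 9},  # touches, no overlap
    {"symbol": "ETHUSDT", "valid_from": 1, "valid_to": 6},
    {"symbol": "ETHUSDT", "valid_from": 4, "valid_to": 8},  # overlaps the previous row
]
overlaps = overlapping_ranges(rows, "valid_from", "valid_to", group_by=["symbol"])
```

Grouping by `symbol` matters here: the BTCUSDT and ETHUSDT ranges cover the same period, but that is expected and only overlaps within one symbol are reported.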

`dbt_utils.cardinality_equality`

Ensures row count consistency between source and CDM staging by comparing the cardinality of a column in this table with that of a column in a comparison table.

table_tests:
  - dbt_utils.cardinality_equality:
      field: cdm_id
      to: ref('source_trades')
      to_field: trade_id

| Parameter | Type | Required | Description |
|---|---|---|---|
| `field` | `str` | Yes | Column in this table |
| `to` | `ref` | Yes | Comparison table |
| `to_field` | `str` | Yes | Comparison column |

Level: Table
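
A simple way to picture the comparison is to count how often each value appears on both sides and report any value whose counts differ. The Python sketch below does that on in-memory lists; `cardinality_mismatches` is a hypothetical helper, and the exact SQL used by the underlying test may differ.

```python
from collections import Counter


def cardinality_mismatches(values, other_values):
    """Return {value: (count_here, count_there)} for values whose counts differ."""
    here, there = Counter(values), Counter(other_values)
    return {
        key: (here[key], there[key])
        for key in here.keys() | there.keys()
        if here[key] != there[key]
    }


cdm_ids = [1, 2, 2, 3]       # trade 2 was duplicated during staging
source_trade_ids = [1, 2, 3, 4]  # trade 4 never arrived
mismatches = cardinality_mismatches(cdm_ids, source_trade_ids)
```

Both failure modes show up: a duplicated row and a dropped row each produce a value whose counts disagree between the two tables.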

`dbt_expectations.expect_column_pair_values_A_to_be_greater_than_B`

Ensures ask > bid, event_time >= received_time, or any ordered column pair.

table_tests:
  - dbt_expectations.expect_column_pair_values_A_to_be_greater_than_B:
      column_A: ask_price
      column_B: bid_price

| Parameter | Type | Required | Description |
|---|---|---|---|
| `column_A` | `str` | Yes | Column that should be greater |
| `column_B` | `str` | Yes | Column that should be smaller |
| `row_condition` | `str` | No | Conditional filter |

Level: Table

Recommended Test Profiles

High-Frequency Trade Data

tests:
  column_tests:
    symbol:
      - not_null
      - dbt_utils.accepted_values:
          values: [BTCUSDT, ETHUSDT]
    price:
      - not_null
      - dbt_utils.accepted_range: { min_value: 0, inclusive: false }
    size:
      - not_null
      - dbt_utils.accepted_range: { min_value: 0, inclusive: false }
    event_time:
      - not_null
      - dbt_expectations.expect_column_values_to_match_regex:
          regex: "^[0-9]{13}$"
    trade_sequence:
      - not_null
  table_tests:
    - dbt_utils.unique_combination_of_columns:
        combination_of_columns: [symbol, trade_time, trade_id]
    - dbt_utils.recency: { datepart: hour, field: event_time, interval: 24 }
    - dbt_utils.relationships: { field: symbol, to: ref('symbol_reference'), to_field: symbol }
    - dbt_utils.at_least_one
    - dbt_utils.sequential_values: { field: trade_sequence, group_by: [symbol] }

Order Book Snapshots

tests:
  column_tests:
    symbol:
      - not_null
    best_bid_price:
      - not_null
      - dbt_utils.accepted_range: { min_value: 0, inclusive: false }
    best_ask_price:
      - not_null
      - dbt_utils.accepted_range: { min_value: 0, inclusive: false }
    snapshot_time:
      - not_null
  table_tests:
    - dbt_utils.unique_combination_of_columns:
        combination_of_columns: [symbol, snapshot_time]
    - dbt_expectations.expect_column_pair_values_A_to_be_greater_than_B:
        column_A: best_ask_price
        column_B: best_bid_price
    - dbt_utils.expression_is_true:
        expression: "best_bid_size > 0 AND best_ask_size > 0"

Quality Control Level

Test severity is configured per test in the feed provider YAML. Behavior on failure:

| Level | Behavior |
|---|---|
| `error` | Pipeline stops on test failure |
| `warning` | Violations logged; pipeline continues |
| `silent` | Violations recorded internally; no user-facing alert |

Tests at the ingestion layer (Pandera) block data from being written. Tests at the warehouse layer (dbt) run post-load, and their severity can be configured per test.
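
The three levels can be read as three branches of a failure handler. The Python sketch below is a hypothetical illustration of that dispatch, not the pipeline's actual code: `Severity`, `DataQualityError`, and `handle_failure` are invented names, and recording every violation regardless of level is an assumption.

```python
import logging
from enum import Enum

logger = logging.getLogger("data_quality")


class Severity(Enum):
    ERROR = "error"
    WARNING = "warning"
    SILENT = "silent"


class DataQualityError(RuntimeError):
    """Raised to stop the pipeline when an error-level test fails."""


def handle_failure(test_name, failed_rows, severity, recorded):
    """Apply the configured behavior to one failed test."""
    # Assumption: every violation is recorded, whatever its level.
    recorded.append((test_name, failed_rows, severity.value))
    if severity is Severity.ERROR:
        raise DataQualityError(f"{test_name}: {failed_rows} failing rows")
    if severity is Severity.WARNING:
        logger.warning("%s: %d failing rows", test_name, failed_rows)
    # SILENT: nothing is surfaced to the user; the pipeline continues.


recorded = []
handle_failure("not_null", 2, Severity.SILENT, recorded)
handle_failure("dbt_utils.recency", 1, Severity.WARNING, recorded)
```

Only an `error`-level failure raises, so a caller that wraps test execution in this handler continues past `warning` and `silent` results.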

Four-Layer Validation Architecture

Layer 1: Schema (Pydantic)
   ↓
Layer 2: Ingestion (Pandera)
   ↓
Layer 3: Warehouse (dbt tests)        ← this reference
   ↓
Layer 4: Monitoring (Elementary)

Layer 1: Schema Validation

Pydantic models validate all YAML configs at load time
Naming conventions enforced: symbols, table names, field names
Type safety — DataType enum, AttributeConstraint ranges

Layer 2: Ingestion-Time (Pandera)

DataFrame-level schema conformance at read time
Column type enforcement with coercion policies
Custom check functions per column defined in code

Layer 3: Warehouse (dbt tests)

All YAML-declared tests are compiled into dbt test cases
Generated by the dbt generator as part of the dbt project
Tests run via dbt test against the actual warehouse data

Layer 4: Elementary (Observability)

Auto-detects anomalies in volume, null rates, and freshness
Alerts on schema changes (new/removed columns, type changes)
Central dashboard for test results across runs
Column-level lineage visualization (sources → staging → CDM)
Slack/Email notifications on failures

Test Generation Pipeline

YAML Test Declarations
    ↓
dbt Generator (SourcesGenerator / ProcessingGenerator)
    ↓
dbt .yml test configurations
    ↓
dbt test (runs against warehouse)
    ↓
elementary run-operation (collects results)
    ↓
elementary report (dashboard + alerts)
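
The second step of the pipeline, turning YAML declarations into dbt test configuration, can be sketched as a plain dictionary transformation. The function below is a hypothetical simplification and does not reproduce `SourcesGenerator` or `ProcessingGenerator`; it assumes the common dbt properties layout with model-level and column-level `tests` keys (newer dbt versions also accept `data_tests`).

```python
import json


def build_dbt_properties(model_name, column_tests, table_tests):
    """Map a declared test suite onto a dbt-style properties structure."""
    return {
        "version": 2,
        "models": [
            {
                "name": model_name,
                # Table-level tests attach to the model itself.
                "tests": list(table_tests),
                # Column-level tests attach to each named column.
                "columns": [
                    {"name": column, "tests": list(tests)}
                    for column, tests in column_tests.items()
                ],
            }
        ],
    }


properties = build_dbt_properties(
    "cdm_trades",
    column_tests={
        "symbol": ["not_null"],
        "price": ["not_null", {"dbt_utils.accepted_range": {"min_value": 0, "inclusive": False}}],
    },
    table_tests=[
        {"dbt_utils.recency": {"datepart": "hour", "field": "event_time", "interval": 24}}
    ],
)
print(json.dumps(properties, indent=2))
```

Serializing the resulting structure to YAML would give a file that `dbt test` can pick up; JSON is printed here only to keep the sketch free of third-party dependencies.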
