Common Data Model (CDM)

What is the CDM

The Common Data Model is the universal data contract that every QuantFlow pipeline stage reads from and writes to. It standardizes market data — trades, order books, bars, features, labels — into a single set of table schemas with consistent column names, types, partitioning, and uniqueness constraints.

No matter which exchange the data came from, which vendor delivered it, or which engine stores it (DuckDB, BigQuery, Snowflake, OpenLakehouse, DolphinDB), the CDM surface looks the same. A feature written against cdm_trade_enriched.trade_price works identically whether the underlying data came from Databento equities, Binance futures, or Bybit perpetuals.

Why a CDM

Market data arrives fragmented. Each feed provider has its own schema, field names, timestamp conventions, and data types. Without a CDM, every downstream system — state reconstruction, feature computation, label generation, model training — must handle provider-specific quirks individually. This produces:

  • Duplicate transformation logic — the same renaming, casting, and enrichment repeated across research notebooks, production pipelines, and backtesting frameworks
  • Silent schema drift — a provider changes a field name or type; the pipeline breaks, or worse, produces subtly wrong results
  • Engine lock-in — transformations written in engine-specific SQL make switching backends a rewrite, not a config change

The CDM solves this by defining schemas once in YAML. Feed providers declare field mappings from their native format to CDM columns using engine-agnostic QFSQL expressions. The dbt generator translates those mappings into engine-native SQL. The result: one schema definition yields correct DDL and transformations on every engine.
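The idea of a declarative field mapping can be sketched in plain Python. This is only an illustration: the mapping dictionary, the native field names (`s`, `p`, `q`, `m`), and the `to_cdm` helper are hypothetical, and the real mappings are QFSQL expressions in YAML rather than Python callables.

```python
from decimal import Decimal

# Hypothetical mapping for one provider: CDM column -> (native field, cast).
# The native field names below are invented for illustration.
EXAMPLE_TRADE_MAPPING = {
    "symbol": ("s", str),
    "price": ("p", Decimal),
    "size": ("q", Decimal),
    "is_buyer_maker": ("m", bool),
}


def to_cdm(native_row: dict, mapping: dict) -> dict:
    """Rename and cast one provider row into CDM column names."""
    return {col: cast(native_row[field]) for col, (field, cast) in mapping.items()}


native = {"s": "BTCUSDT", "p": "64250.10", "q": "0.005", "m": True}
cdm_row = to_cdm(native, EXAMPLE_TRADE_MAPPING)
```

Because the mapping is data rather than code, a second provider would only need a second dictionary, not a second transformation function.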

High-Level Structure

CDM tables fall into four categories, produced at successive pipeline stages:

Source Data (exchange APIs, vendor files)
               │
               ▼  [Ingestion + dbt]
┌──────────────────────────────┐
│ Base Tables                  │
│ Raw data, normalized         │
│ cdm_trades, cdm_lob_l1,      │
│ cdm_lob_l2                   │
└──────────────┬───────────────┘
               │
               ▼  [State Engine]
┌──────────────────────────────┐
│ Enriched Tables              │
│ Trades + L1 context          │
│ cdm_trade_enriched           │
└──────────────┬───────────────┘
               │
               ▼  [State Engine]
┌──────────────────────────────┐
│ Bar Tables                   │
│ 9 bar types, common OHLCV    │
│ time, tick, volume, dollar,  │
│ imbalance, run, volatility,  │
│ dollar_imbalance, cusum      │
└──────────────┬───────────────┘
               │
     ┌─────────┴─────────┐
     ▼                   ▼  [Label / Feature Engines]
┌───────────────┐  ┌───────────────┐
│ Output Tables │  │ Output Tables │
│ labels        │  │ ft_features   │
└───────────────┘  └───────────────┘

| Category | Produced By | Tables |
| --- | --- | --- |
| Base | Ingestion + dbt | cdm_trades, cdm_lob_l1, cdm_lob_l2 |
| Enriched | State Engine | cdm_trade_enriched |
| Bar | State Engine | 9 bar tables, one per bar type |
| Output | Label Engine, Feature Engine | labels, ft_features |

Each table definition lives as a YAML file in the quantflow metadata package. At project load time, Pydantic validates every schema. At pipeline runtime, tables are created on-demand from metadata — no manual DDL, no migration scripts. Partitioning, clustering, and uniqueness rules are engine-applied from the same definition.
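As a rough illustration of what load-time validation can catch, the sketch below checks that partition, cluster, and unique-key columns actually exist in the table. It deliberately uses a standard-library dataclass instead of Pydantic so that it runs without dependencies; `TableSpec` and its fields are hypothetical names, not the QuantFlow metadata classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSpec:
    """Hypothetical stand-in for one CDM table definition loaded from YAML."""
    name: str
    columns: dict        # column name -> type string
    partition_by: str
    cluster_by: tuple
    unique_key: tuple

    def __post_init__(self):
        # Every column referenced by a physical-layout rule must be declared.
        for col in (self.partition_by, *self.cluster_by, *self.unique_key):
            if col not in self.columns:
                raise ValueError(f"{self.name}: unknown column {col!r}")


trades = TableSpec(
    name="cdm_trades",
    columns={
        "venue": "string", "symbol": "string", "sequence_id": "bigint",
        "event_time": "timestamp", "price": "decimal(20,8)", "size": "decimal(20,8)",
    },
    partition_by="event_time",
    cluster_by=("venue", "symbol"),
    unique_key=("venue", "symbol", "sequence_id", "event_time"),
)
```

A definition that clusters on a misspelled column fails at construction time, which mirrors the benefit of validating schemas before any pipeline stage runs.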

Base Tables

Raw market data normalized into CDM schema by ingestion and dbt.

cdm_trades

Individual trade records from exchanges and data providers.

| Column | Type | Description |
| --- | --- | --- |
| venue | string | Trading venue/exchange |
| symbol | string | Instrument symbol |
| sequence_id | bigint | Unique sequence identifier per event |
| event_time | timestamp | Trade execution timestamp |
| price | decimal(20,8) | Trade execution price |
| size | decimal(20,8) | Trade size |
| is_buyer_maker | boolean | Whether buyer was the maker |
| order_type | string | Associated order type |
| received_time | timestamp | Record received timestamp |
| processed_time | timestamp | System processing timestamp |

| Property | Value |
| --- | --- |
| Partition | event_time (hour) |
| Cluster | venue, symbol |
| Unique Key | venue, symbol, sequence_id, event_time |
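To make the partition and uniqueness rules concrete, here is a minimal in-memory sketch. The "last row wins" policy for duplicate keys is an assumption made for illustration; how each engine resolves duplicates is not specified here.

```python
from datetime import datetime, timezone

UNIQUE_KEY = ("venue", "symbol", "sequence_id", "event_time")


def hour_partition(ts: datetime) -> datetime:
    """Truncate a timestamp to its hourly partition boundary."""
    return ts.replace(minute=0, second=0, microsecond=0)


def upsert(rows):
    """Keep one row per unique key; a later duplicate overwrites an earlier one."""
    table = {}
    for row in rows:
        table[tuple(row[c] for c in UNIQUE_KEY)] = row
    return list(table.values())


t = datetime(2024, 1, 2, 13, 45, 7, tzinfo=timezone.utc)
rows = [
    {"venue": "binance", "symbol": "BTCUSDT", "sequence_id": 1, "event_time": t, "price": "100.0"},
    {"venue": "binance", "symbol": "BTCUSDT", "sequence_id": 1, "event_time": t, "price": "100.5"},
    {"venue": "binance", "symbol": "BTCUSDT", "sequence_id": 2, "event_time": t, "price": "101.0"},
]
deduped = upsert(rows)
```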

cdm_lob_l1

Top-of-book (level 1) order book snapshots — best bid and ask at each update.

| Column | Type | Description |
| --- | --- | --- |
| action | string | MBP-1 action type (B=bid, A=ask, T=trade) |
| venue | string | Trading venue/exchange |
| symbol | string | Instrument symbol |
| event_time | timestamp | Snapshot timestamp |
| best_bid_price | decimal(20,8) | Best bid price |
| best_bid_size | decimal(20,8) | Best bid size |
| best_ask_price | decimal(20,8) | Best ask price |
| best_ask_size | decimal(20,8) | Best ask size |
| spread | decimal(20,8) | Bid-ask spread (calculated) |
| mid_price | decimal(20,8) | Mid price (calculated) |
| weighted_mid_price | decimal(20,8) | Size-weighted mid price |
| order_book_depth | decimal(20,8) | Total depth at best levels |
| sequence_id | bigint | Sequence identifier |
| processed_time | timestamp | System processing timestamp |

| Property | Value |
| --- | --- |
| Partition | event_time (hour) |
| Cluster | venue, symbol |
| Unique Key | venue, symbol, sequence_id, event_time, action |
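The calculated columns can be sketched from the four top-of-book values. The spread and mid price follow their usual definitions; the weighted mid formula below (each price weighted by the opposite side's size) is a common convention and an assumption here, since the exact formula is not given above.

```python
from decimal import Decimal


def l1_derived(bid_px: Decimal, bid_sz: Decimal, ask_px: Decimal, ask_sz: Decimal) -> dict:
    """Compute spread, mid_price and weighted_mid_price from best bid/ask."""
    spread = ask_px - bid_px
    mid = (bid_px + ask_px) / 2
    total = bid_sz + ask_sz
    # Assumed convention: weight each price by the size on the opposite side.
    # Falls back to the plain mid when both sizes are zero.
    weighted_mid = (bid_px * ask_sz + ask_px * bid_sz) / total if total else mid
    return {"spread": spread, "mid_price": mid, "weighted_mid_price": weighted_mid}


quote = l1_derived(Decimal("100.00"), Decimal("3"), Decimal("100.10"), Decimal("1"))
```

With more size resting on the bid than on the ask, the weighted mid sits above the plain mid, closer to the ask.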

cdm_lob_l2

Full-depth LOB snapshots with both L1 top-of-book and L2 depth arrays, plus aggregated metrics. The book levels pass through from the raw feed data, and the State Engine enriches them with derived analytics.

| Column | Type | Description |
| --- | --- | --- |
| venue | string | Trading venue |
| symbol | string | Instrument symbol |
| event_time | timestamp | Snapshot timestamp |
| best_bid_price | decimal(20,8) | Best bid (L1) |
| best_bid_size | decimal(20,8) | Best bid size (L1) |
| best_ask_price | decimal(20,8) | Best ask (L1) |
| best_ask_size | decimal(20,8) | Best ask size (L1) |
| spread | decimal(20,8) | Bid-ask spread |
| mid_price | decimal(20,8) | Mid price |
| weighted_mid_price | decimal(20,8) | Size-weighted mid price |
| bids | array<struct> | Bid levels: {level, price, size, order_count} |
| asks | array<struct> | Ask levels: {level, price, size, order_count} |
| total_bid_depth | decimal(20,8) | Sum of all bid sizes |
| total_ask_depth | decimal(20,8) | Sum of all ask sizes |
| depth_imbalance | decimal(20,8) | (total_bid − total_ask) / (total_bid + total_ask) |
| vwap_bid | decimal(20,8) | Volume-weighted average bid price |
| vwap_ask | decimal(20,8) | Volume-weighted average ask price |
| sequence_id | bigint | Sequence identifier |
| processed_time | timestamp | System processing timestamp |

| Property | Value |
| --- | --- |
| Partition | event_time (hour) |
| Cluster | venue, symbol |
| Unique Key | venue, symbol, event_time |
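The aggregated metrics follow directly from the depth arrays. The sketch below implements the depth imbalance formula from the table and a size-weighted average price per side; the `depth_metrics` helper and the handling of an empty side (returning `None`) are assumptions for illustration.

```python
from decimal import Decimal


def depth_metrics(bids: list, asks: list) -> dict:
    """Aggregate L2 levels ({level, price, size, order_count}) into summary columns."""

    def total(levels):
        return sum((lvl["size"] for lvl in levels), Decimal(0))

    def vwap(levels):
        depth = total(levels)
        if not depth:
            return None  # empty side: no meaningful average price
        notional = sum((lvl["price"] * lvl["size"] for lvl in levels), Decimal(0))
        return notional / depth

    total_bid, total_ask = total(bids), total(asks)
    denom = total_bid + total_ask
    return {
        "total_bid_depth": total_bid,
        "total_ask_depth": total_ask,
        "depth_imbalance": (total_bid - total_ask) / denom if denom else Decimal(0),
        "vwap_bid": vwap(bids),
        "vwap_ask": vwap(asks),
    }


bids = [
    {"level": 1, "price": Decimal("100"), "size": Decimal("3"), "order_count": 2},
    {"level": 2, "price": Decimal("99"), "size": Decimal("3"), "order_count": 1},
]
asks = [
    {"level": 1, "price": Decimal("101"), "size": Decimal("1"), "order_count": 1},
    {"level": 2, "price": Decimal("102"), "size": Decimal("1"), "order_count": 1},
]
metrics = depth_metrics(bids, asks)
```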

Enriched Tables

Produced by the State Engine — L1 context joined to every trade.

cdm_trade_enriched

Every trade enriched with the L1 order book state at trade time. This is the primary source table for most feature computations.

| Column | Type | Description |
| --- | --- | --- |
| symbol | string | Instrument symbol |
| venue | string | Trading venue |
| event_time | timestamp | Trade execution timestamp |
| sequence_id | bigint | Unique trade identifier |
| trade_price | decimal(20,8) | Trade execution price |
| trade_size | decimal(20,8) | Trade size |
| trade_direction | integer | +1 buy, -1 sell |
| best_bid_price | decimal(20,8) | Best bid at trade time |
| best_ask_price | decimal(20,8) | Best ask at trade time |
| best_bid_size | decimal(20,8) | Best bid size at trade time |
| best_ask_size | decimal(20,8) | Best ask size at trade time |
| mid_price | decimal(20,8) | Mid price at trade time |
| spread | decimal(20,8) | Bid-ask spread at trade time |
| effective_spread | decimal(20,8) | 2 · abs(trade_price − mid_price) |
| micro_price | decimal(20,8) | Size-weighted micro price |
| book_imbalance | decimal(20,8) | (bid_sz − ask_sz) / (bid_sz + ask_sz) |
| p_buy | decimal(10,8) | Probability trade was buyer-initiated |
| signed_volume | decimal(20,8) | size · (2·p_buy − 1), continuous signed volume |
| buy_volume | decimal(20,8) | Volume when direction = buy |
| sell_volume | decimal(20,8) | Volume when direction = sell |
| ret | decimal(20,8) | Trade-to-trade price return |
| log_return | decimal(20,8) | Log return between consecutive trades |
| sign_method | string | Direction inference method (DSIDE, QUOTE_INFER, LEE_READY) |
| sign_confidence | string | Confidence level (HIGH, MEDIUM, LOW, UNKNOWN) |
| is_inferred | boolean | Whether direction was inferred |
| trade_count | integer | Trade counter (1 per trade, for aggregation) |
| orderbook_update_count | integer | Order book update counter |
| processed_time | timestamp | System processing timestamp |

| Property | Value |
| --- | --- |
| Partition | event_time (hour) |
| Cluster | venue, symbol |
| Unique Key | venue, symbol, sequence_id, event_time |
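A few of the derived columns can be sketched for a single trade. The `signed_volume` and `book_imbalance` formulas come from the table; treating `ret` as a simple return against the previous trade price is an assumption, and the `enrich` helper is hypothetical. Floats are used for brevity, whereas the stored columns are decimals.

```python
import math


def enrich(trade_price, trade_size, p_buy, bid_sz, ask_sz, prev_price):
    """Derive per-trade columns from the trade, the L1 sizes and the prior trade price."""
    return {
        # Continuous signed volume: +size when p_buy = 1, -size when p_buy = 0.
        "signed_volume": trade_size * (2 * p_buy - 1),
        "book_imbalance": (bid_sz - ask_sz) / (bid_sz + ask_sz),
        # Assumed definition: simple return versus the previous trade.
        "ret": trade_price / prev_price - 1,
        "log_return": math.log(trade_price / prev_price),
    }


row = enrich(trade_price=101.0, trade_size=2.0, p_buy=0.75,
             bid_sz=3.0, ask_sz=1.0, prev_price=100.0)
```

A trade with `p_buy = 0.5` contributes zero signed volume, which is why the continuous form degrades gracefully when the direction is uncertain.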

Bar Tables

Nine bar types produced by the State Engine. All bar tables share a common OHLCV schema with bar-specific metadata columns.

Common bar columns:

| Column | Type | Description |
| --- | --- | --- |
| venue | string | Trading venue |
| symbol | string | Instrument symbol |
| start_time | timestamp | Bar start timestamp |
| end_time | timestamp | Bar end timestamp |
| bar_type | string | Bar type identifier |
| resolution | string | Bar resolution string |
| param_set_id | string | Parameter set identifier |
| open | decimal(20,8) | Opening price |
| high | decimal(20,8) | Highest price |
| low | decimal(20,8) | Lowest price |
| close | decimal(20,8) | Closing price |
| volume | decimal(20,8) | Total volume |
| dollar_volume | decimal(20,8) | Total dollar volume |
| trade_count | bigint | Number of trades |
| vwap | decimal(20,8) | Volume-weighted average price |
| avg_trade_size | decimal(20,8) | Average trade size |
| bar_index | bigint | Sequential bar index |
| processed_time | timestamp | System processing timestamp |

| Property | Value |
| --- | --- |
| Partition | start_time (day) |
| Cluster | venue, symbol, bar_type, param_set_id |
| Unique Key | venue, symbol, bar_type, param_set_id, start_time |

Bar Types

| Table | Trigger | Typical Use |
| --- | --- | --- |
| cdm_time_bars | Fixed interval (e.g. 1 min) | Default clock, ML training |
| cdm_tick_bars | N trades | Fastest clock, execution features |
| cdm_volume_bars | Volume threshold | Volume-standardized bars |
| cdm_dollar_bars | Dollar volume threshold | Robust to price changes |
| cdm_imbalance_bars | Signed volume imbalance (k) | Information-driven sampling |
| cdm_run_bars | Consecutive same-direction trades (w) | Sequential run detection |
| cdm_volatility_bars | EWMA variance threshold | Volatility-regime sampling |
| cdm_dollar_imbalance_bars | Signed dollar imbalance (k) | Dollar-weighted information bars |
| cdm_cusum_bars | CUSUM filter threshold | Structural break detection |
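The simplest trigger, a fixed number of trades, shows how the common OHLCV columns are filled. This is a minimal sketch and not the State Engine's implementation: the `tick_bars` helper is hypothetical, trades are plain `(price, size)` tuples, and an incomplete trailing group is dropped by assumption.

```python
def tick_bars(trades, n):
    """Build one bar per n consecutive trades from (price, size) tuples."""
    bars = []
    # Step through complete groups of n trades; the partial tail is ignored.
    for i in range(0, len(trades) - n + 1, n):
        chunk = trades[i:i + n]
        prices = [p for p, _ in chunk]
        volume = sum(s for _, s in chunk)
        dollar_volume = sum(p * s for p, s in chunk)
        bars.append({
            "bar_index": len(bars),
            "open": prices[0],
            "high": max(prices),
            "low": min(prices),
            "close": prices[-1],
            "volume": volume,
            "dollar_volume": dollar_volume,
            "trade_count": n,
            "vwap": dollar_volume / volume if volume else None,
            "avg_trade_size": volume / n,
        })
    return bars


trades = [(100.0, 1.0), (101.0, 2.0), (99.0, 1.0), (102.0, 1.0), (103.0, 1.0)]
bars = tick_bars(trades, 2)
```

The other bar types differ only in the closing condition (accumulated volume, dollar volume, imbalance, and so on); the aggregation into OHLCV columns stays the same.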

Output Tables

labels

Label engine output — supervised learning targets computed from bars.

| Column | Type | Description |
| --- | --- | --- |
| venue | string | Trading venue |
| symbol | string | Instrument symbol |
| timestamp | timestamp | Label timestamp |
| label_name | string | Label type (e.g. triple_barrier, fixed_horizon_return) |
| label_value | bigint | Label value (-1, 0, +1) |
| horizon | bigint | Look-forward horizon in periods |
| run_id | string | Label computation run identifier |
| forward_return | decimal(20,8) | Forward return used for label determination |
| barrier_hit | bigint | Which barrier hit first (1=upper, -1=lower, 0=expiry) |
| cusum_value | decimal(20,8) | CUSUM cumulative value (trend scanning) |

| Property | Value |
| --- | --- |
| Partition | timestamp (day) |
| Cluster | symbol, label_name |
| Unique Key | symbol, timestamp, label_name, run_id |
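To show how `label_value`, `barrier_hit`, and `forward_return` relate, here is a simplified triple-barrier sketch over bar closes. It is not the Label Engine's algorithm: the symmetric fixed-width barriers, the choice to label expiry as 0, and the `triple_barrier` helper itself are assumptions for illustration.

```python
def triple_barrier(closes, start, horizon, upper, lower):
    """Return (label_value, barrier_hit, forward_return) for the bar at index start.

    upper and lower are positive return thresholds; horizon is counted in bars.
    """
    entry = closes[start]
    end = min(start + horizon, len(closes) - 1)
    for i in range(start + 1, end + 1):
        r = closes[i] / entry - 1
        if r >= upper:
            return 1, 1, r      # upper barrier touched first
        if r <= -lower:
            return -1, -1, r    # lower barrier touched first
    # Neither barrier was touched within the horizon: expiry.
    return 0, 0, closes[end] / entry - 1


up = triple_barrier([100.0, 100.5, 102.5, 99.0], start=0, horizon=3, upper=0.02, lower=0.02)
down = triple_barrier([100.0, 99.5, 97.0], start=0, horizon=3, upper=0.02, lower=0.02)
flat = triple_barrier([100.0, 100.5, 100.2, 100.1], start=0, horizon=3, upper=0.02, lower=0.02)
```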

ft_features

Feature engine output — computed features in key-value format.

| Column | Type | Description |
| --- | --- | --- |
| symbol | string | Instrument symbol |
| feature_name | string | Unique feature identifier |
| feature_version | string | Feature version (e.g. v1.0) |
| feature_id | string | Unique feature instance ID |
| timestamp | timestamp | Feature calculation timestamp |
| feature_namespace | string | Category (signal, execution, quality, regime, stability, technical) |
| feature_params | array<struct> | Calculation parameters |
| feature_value_numeric | decimal | Scalar feature value |
| feature_value_json | json | Complex feature values (vectors/tensors) |
| feature_value_array | array<numeric> | Time series/vector feature values |
| confidence_score | decimal(3,2) | Feature confidence (0.0–1.0) |
| pipeline_name | string | Generating pipeline name |
| status | string | Calculation status (success, error, partial) |
| error_message | string | Error details if failed |
| processed_time | timestamp | Processing timestamp |

| Property | Value |
| --- | --- |
| Partition | timestamp (day) |
| Cluster | symbol, feature_namespace |
| Unique Key | feature_id |
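Because features are stored one value per row, consumers typically pivot them into one vector per symbol and timestamp before training. The sketch below shows one way to do that for scalar values; the `pivot` helper, the feature names, and the decision to skip rows whose status is not `success` are assumptions for illustration.

```python
from collections import defaultdict


def pivot(rows):
    """Turn key-value feature rows into {(symbol, timestamp): {feature_name: value}}."""
    wide = defaultdict(dict)
    for r in rows:
        if r["status"] != "success":
            continue  # assumed policy: ignore failed or partial calculations
        wide[(r["symbol"], r["timestamp"])][r["feature_name"]] = r["feature_value_numeric"]
    return dict(wide)


ts = "2024-01-02T00:00:00Z"
rows = [
    {"symbol": "BTCUSDT", "timestamp": ts, "feature_name": "example_ofi",
     "feature_value_numeric": 0.12, "status": "success"},
    {"symbol": "BTCUSDT", "timestamp": ts, "feature_name": "example_rv",
     "feature_value_numeric": 0.03, "status": "success"},
    {"symbol": "BTCUSDT", "timestamp": ts, "feature_name": "example_broken",
     "feature_value_numeric": None, "status": "error"},
]
wide = pivot(rows)
```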

Table Creation & Lifecycle

CDM entities are defined declaratively in YAML and validated via Pydantic at project load time. Tables are created on-demand from metadata — no manual DDL.

```python
from quantflow.metadata import load_metadata

# Load and validate every CDM entity definition in the project directory.
meta = load_metadata(project_dir=".")
# Tables are created on-demand from the CDM entity definitions.
# Partitioning, clustering, and unique constraints are applied per engine.
```

The lifecycle flows:

YAML Definition → Pydantic Validation → On-Demand Table Creation → Incremental Loading

  • Ingestion + dbt produce cdm_trades and cdm_lob_l1 from raw provider data
  • State Engine produces cdm_trade_enriched, cdm_lob_l2, and all 9 bar tables
  • Label Engine produces labels from bar tables
  • Feature Engine produces ft_features from enriched trades, bars, and LOB snapshots
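The producer relationships above imply a build order, which can be sketched with the standard library's topological sorter. The dependency map is an interpretation of the list and diagram, not QuantFlow configuration: `bar_tables` is a stand-in name for the nine bar tables, and treating cdm_lob_l2 as the LOB snapshot input to features is an assumption.

```python
from graphlib import TopologicalSorter

# table -> set of tables it is computed from (interpretation for illustration)
DEPENDS_ON = {
    "cdm_trade_enriched": {"cdm_trades", "cdm_lob_l1"},
    "bar_tables": {"cdm_trade_enriched"},
    "labels": {"bar_tables"},
    "ft_features": {"cdm_trade_enriched", "bar_tables", "cdm_lob_l2"},
}

# static_order() yields every table after all of its dependencies.
build_order = list(TopologicalSorter(DEPENDS_ON).static_order())
```

Any order the sorter returns places base tables first and the output tables last; the relative order of independent tables is not significant.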