Common Data Model (CDM)

What is the CDM

The Common Data Model is the universal data contract that every QuantFlow pipeline stage reads from and writes to. It standardizes market data — trades, order books, bars, features, labels — into a single set of table schemas with consistent column names, types, partitioning, and uniqueness constraints.

No matter which exchange the data came from, which vendor delivered it, or which engine stores it (DuckDB, BigQuery, Snowflake, OpenLakehouse, DolphinDB), the CDM surface looks the same. A feature written against cdm_trade_enriched.trade_price works identically whether the underlying data came from Databento equities, Binance futures, or Bybit perpetuals.

Why a CDM

Market data arrives fragmented. Each feed provider has its own schema, field names, timestamp conventions, and data types. Without a CDM, every downstream system — state reconstruction, feature computation, label generation, model training — must handle provider-specific quirks individually. This produces:

Duplicate transformation logic — the same renaming, casting, and enrichment repeated across research notebooks, production pipelines, and backtesting frameworks
Silent schema drift — a provider changes a field name or type; the pipeline breaks, or worse, produces subtly wrong results
Engine lock-in — transformations written in engine-specific SQL make switching backends a rewrite, not a config change

The CDM solves this by defining schemas once in YAML. Feed providers declare field mappings from their native format to CDM columns using engine-agnostic QFSQL expressions. The dbt generator translates those mappings into engine-native SQL. The result: one schema definition yields correct DDL and transformations on every engine.

High-Level Structure

CDM tables fall into four categories, each produced at a different pipeline stage:

Source Data (exchange APIs, vendor files)
    │
    ▼  [Ingestion + dbt]
┌──────────────────────────────┐
│       Base Tables             │
│  Raw data, normalized         │
│  cdm_trades, cdm_lob_l1,      │
│  cdm_lob_l2                   │
└──────────────┬───────────────┘
               │
               ▼  [State Engine]
┌──────────────────────────────┐
│     Enriched Tables           │
│  Trades + L1 context          │
│  cdm_trade_enriched           │
└──────────────┬───────────────┘
               │
               ▼  [State Engine]
┌──────────────────────────────┐
│       Bar Tables              │
│  9 bar types, common OHLCV    │
│  time, tick, volume, dollar,  │
│  imbalance, run, volatility,  │
│  dollar_imbalance, cusum      │
└──────────────┬───────────────┘
               │
     ┌─────────┴─────────┐
     ▼                   ▼  [Label / Feature Engines]
┌──────────────┐  ┌──────────────┐
│ Output Tables │  │ Output Tables │
│ labels        │  │ ft_features   │
└──────────────┘  └──────────────┘

Category	Produced By	Tables
Base	Ingestion + dbt	`cdm_trades`, `cdm_lob_l1`, `cdm_lob_l2`
Enriched	State Engine	`cdm_trade_enriched`
Bar	State Engine	9 bar tables — one per bar type
Output	Label Engine, Feature Engine	`labels`, `ft_features`

Each table definition lives as a YAML file in the quantflow metadata package. At project load time, Pydantic validates every schema. At pipeline runtime, tables are created on-demand from metadata — no manual DDL, no migration scripts. Partitioning, clustering, and uniqueness rules are engine-applied from the same definition.

Base Tables

Raw market data normalized into CDM schema by ingestion and dbt.

`cdm_trades`

Individual trade records from exchanges and data providers.

Column	Type	Description
`venue`	`string`	Trading venue/exchange
`symbol`	`string`	Instrument symbol
`sequence_id`	`bigint`	Unique sequence identifier per event
`event_time`	`timestamp`	Trade execution timestamp
`price`	`decimal(20,8)`	Trade execution price
`size`	`decimal(20,8)`	Trade size
`is_buyer_maker`	`boolean`	Whether buyer was the maker
`order_type`	`string`	Associated order type
`received_time`	`timestamp`	Record received timestamp
`processed_time`	`timestamp`	System processing timestamp

Property	Value
Partition	`event_time` (hour)
Cluster	`venue`, `symbol`
Unique Key	`venue`, `symbol`, `sequence_id`, `event_time`

`cdm_lob_l1`

Top-of-book (level 1) order book snapshots — best bid and ask at each update.

Column	Type	Description
`action`	`string`	MBP-1 action type (B=bid, A=ask, T=trade)
`venue`	`string`	Trading venue/exchange
`symbol`	`string`	Instrument symbol
`event_time`	`timestamp`	Snapshot timestamp
`best_bid_price`	`decimal(20,8)`	Best bid price
`best_bid_size`	`decimal(20,8)`	Best bid size
`best_ask_price`	`decimal(20,8)`	Best ask price
`best_ask_size`	`decimal(20,8)`	Best ask size
`spread`	`decimal(20,8)`	Bid-ask spread (calculated)
`mid_price`	`decimal(20,8)`	Mid price (calculated)
`weighted_mid_price`	`decimal(20,8)`	Size-weighted mid price
`order_book_depth`	`decimal(20,8)`	Total depth at best levels
`sequence_id`	`bigint`	Sequence identifier
`processed_time`	`timestamp`	System processing timestamp

Property	Value
Partition	`event_time` (hour)
Cluster	`venue`, `symbol`
Unique Key	`venue`, `symbol`, `sequence_id`, `event_time`, `action`

`cdm_lob_l2`

Full-depth LOB snapshots with both L1 top-of-book and L2 depth arrays, plus aggregated metrics. Pass-through from raw feed data, enriched with derived analytics by the State Engine.

Column	Type	Description
`venue`	`string`	Trading venue
`symbol`	`string`	Instrument symbol
`event_time`	`timestamp`	Snapshot timestamp
`best_bid_price`	`decimal(20,8)`	Best bid (L1)
`best_bid_size`	`decimal(20,8)`	Best bid size (L1)
`best_ask_price`	`decimal(20,8)`	Best ask (L1)
`best_ask_size`	`decimal(20,8)`	Best ask size (L1)
`spread`	`decimal(20,8)`	Bid-ask spread
`mid_price`	`decimal(20,8)`	Mid price
`weighted_mid_price`	`decimal(20,8)`	Size-weighted mid price
`bids`	`array<struct>`	Bid levels: `{level, price, size, order_count}`
`asks`	`array<struct>`	Ask levels: `{level, price, size, order_count}`
`total_bid_depth`	`decimal(20,8)`	Sum of all bid sizes
`total_ask_depth`	`decimal(20,8)`	Sum of all ask sizes
`depth_imbalance`	`decimal(20,8)`	(total_bid − total_ask) / (total_bid + total_ask)
`vwap_bid`	`decimal(20,8)`	Volume-weighted average bid price
`vwap_ask`	`decimal(20,8)`	Volume-weighted average ask price
`sequence_id`	`bigint`	Sequence identifier
`processed_time`	`timestamp`	System processing timestamp

Property	Value
Partition	`event_time` (hour)
Cluster	`venue`, `symbol`
Unique Key	`venue`, `symbol`, `event_time`

Enriched Tables

Produced by the State Engine — L1 context joined to every trade.

`cdm_trade_enriched`

Every trade enriched with the L1 order book state at trade time. This is the primary source table for most feature computations.

Column	Type	Description
`symbol`	`string`	Instrument symbol
`venue`	`string`	Trading venue
`event_time`	`timestamp`	Trade execution timestamp
`sequence_id`	`bigint`	Unique trade identifier
`trade_price`	`decimal(20,8)`	Trade execution price
`trade_size`	`decimal(20,8)`	Trade size
`trade_direction`	`integer`	+1 buy, -1 sell
`best_bid_price`	`decimal(20,8)`	Best bid at trade time
`best_ask_price`	`decimal(20,8)`	Best ask at trade time
`best_bid_size`	`decimal(20,8)`	Best bid size at trade time
`best_ask_size`	`decimal(20,8)`	Best ask size at trade time
`mid_price`	`decimal(20,8)`	Mid price at trade time
`spread`	`decimal(20,8)`	Bid-ask spread at trade time
`effective_spread`	`decimal(20,8)`	2 ·
`micro_price`	`decimal(20,8)`	Size-weighted micro price
`book_imbalance`	`decimal(20,8)`	(bid_sz − ask_sz) / (bid_sz + ask_sz)
`p_buy`	`decimal(10,8)`	Probability trade was buyer-initiated
`signed_volume`	`decimal(20,8)`	size · (2·p_buy − 1) — continuous signed volume
`buy_volume`	`decimal(20,8)`	Volume when direction = buy
`sell_volume`	`decimal(20,8)`	Volume when direction = sell
`ret`	`decimal(20,8)`	Trade-to-trade price return
`log_return`	`decimal(20,8)`	Log return between consecutive trades
`sign_method`	`string`	Direction inference method (DSIDE, QUOTE_INFER, LEE_READY)
`sign_confidence`	`string`	Confidence level (HIGH, MEDIUM, LOW, UNKNOWN)
`is_inferred`	`boolean`	Whether direction was inferred
`trade_count`	`integer`	Trade counter (1 per trade, for aggregation)
`orderbook_update_count`	`integer`	Order book update counter
`processed_time`	`timestamp`	System processing timestamp

Property	Value
Partition	`event_time` (hour)
Cluster	`venue`, `symbol`
Unique Key	`venue`, `symbol`, `sequence_id`, `event_time`

Bar Tables

Nine bar types produced by the State Engine. All bar tables share a common OHLCV schema with bar-specific metadata columns.

Common bar columns:

Column	Type	Description
`venue`	`string`	Trading venue
`symbol`	`string`	Instrument symbol
`start_time`	`timestamp`	Bar start timestamp
`end_time`	`timestamp`	Bar end timestamp
`bar_type`	`string`	Bar type identifier
`resolution`	`string`	Bar resolution string
`param_set_id`	`string`	Parameter set identifier
`open`	`decimal(20,8)`	Opening price
`high`	`decimal(20,8)`	Highest price
`low`	`decimal(20,8)`	Lowest price
`close`	`decimal(20,8)`	Closing price
`volume`	`decimal(20,8)`	Total volume
`dollar_volume`	`decimal(20,8)`	Total dollar volume
`trade_count`	`bigint`	Number of trades
`vwap`	`decimal(20,8)`	Volume-weighted average price
`avg_trade_size`	`decimal(20,8)`	Average trade size
`bar_index`	`bigint`	Sequential bar index
`processed_time`	`timestamp`	System processing timestamp

Property	Value
Partition	`start_time` (day)
Cluster	`venue`, `symbol`, `bar_type`, `param_set_id`
Unique Key	`venue`, `symbol`, `bar_type`, `param_set_id`, `start_time`

Bar Types

Table	Trigger	Typical Use
`cdm_time_bars`	Fixed interval (e.g. 1 min)	Default clock, ML training
`cdm_tick_bars`	N trades	Fastest clock, execution features
`cdm_volume_bars`	Volume threshold	Volume-standardized bars
`cdm_dollar_bars`	Dollar volume threshold	Robust to price changes
`cdm_imbalance_bars`	Signed volume imbalance (k)	Information-driven sampling
`cdm_run_bars`	Consecutive same-direction trades (w)	Sequential run detection
`cdm_volatility_bars`	EWMA variance threshold	Volatility-regime sampling
`cdm_dollar_imbalance_bars`	Signed dollar imbalance (k)	Dollar-weighted information bars
`cdm_cusum_bars`	CUSUM filter threshold	Structural break detection

Output Tables

`labels`

Label engine output — supervised learning targets computed from bars.

Column	Type	Description
`venue`	`string`	Trading venue
`symbol`	`string`	Instrument symbol
`timestamp`	`timestamp`	Label timestamp
`label_name`	`string`	Label type (e.g. `triple_barrier`, `fixed_horizon_return`)
`label_value`	`bigint`	Label value (-1, 0, +1)
`horizon`	`bigint`	Look-forward horizon in periods
`run_id`	`string`	Label computation run identifier
`forward_return`	`decimal(20,8)`	Forward return used for label determination
`barrier_hit`	`bigint`	Which barrier hit first (1=upper, -1=lower, 0=expiry)
`cusum_value`	`decimal(20,8)`	CUSUM cumulative value (trend scanning)

Property	Value
Partition	`timestamp` (day)
Cluster	`symbol`, `label_name`
Unique Key	`symbol`, `timestamp`, `label_name`, `run_id`

`ft_features`

Feature engine output — computed features in key-value format.

Column	Type	Description
`symbol`	`string`	Instrument symbol
`feature_name`	`string`	Unique feature identifier
`feature_version`	`string`	Feature version (e.g. `v1.0`)
`feature_id`	`string`	Unique feature instance ID
`timestamp`	`timestamp`	Feature calculation timestamp
`feature_namespace`	`string`	Category (signal, execution, quality, regime, stability, technical)
`feature_params`	`array<struct>`	Calculation parameters
`feature_value_numeric`	`decimal`	Scalar feature value
`feature_value_json`	`json`	Complex feature values (vectors/tensors)
`feature_value_array`	`array<numeric>`	Time series/vector feature values
`confidence_score`	`decimal(3,2)`	Feature confidence (0.0–1.0)
`pipeline_name`	`string`	Generating pipeline name
`status`	`string`	Calculation status (success, error, partial)
`error_message`	`string`	Error details if failed
`processed_time`	`timestamp`	Processing timestamp

Property	Value
Partition	`timestamp` (day)
Cluster	`symbol`, `feature_namespace`
Unique Key	`feature_id`

Table Creation & Lifecycle

CDM entities are defined declaratively in YAML and validated via Pydantic at project load time. Tables are created on-demand from metadata — no manual DDL.

from quantflow.metadata import load_metadata

meta = load_metadata(project_dir=".")
# Tables created on-demand from CDM entity definitions
# Partitioning, clustering, and unique constraints applied per-engine

The lifecycle flows:

YAML Definition  →  Pydantic Validation  →  On-Demand Table Creation  →  Incremental Loading

Ingestion + dbt produce cdm_trades and cdm_lob_l1 from raw provider data
State Engine produces cdm_trade_enriched, cdm_lob_l2, and all 9 bar tables
Label Engine produces labels from bar tables
Feature Engine produces ft_features from enriched trades, bars, and LOB snapshots

What is the CDM​

Why a CDM​

High-Level Structure​

Base Tables​

cdm_trades​

cdm_lob_l1​

cdm_lob_l2​

Enriched Tables​

cdm_trade_enriched​

Bar Tables​

Bar Types​

Output Tables​

labels​

ft_features​

Table Creation & Lifecycle​