Common Data Model (CDM)
What is the CDM
The Common Data Model is the universal data contract that every QuantFlow pipeline stage reads from and writes to. It standardizes market data — trades, order books, bars, features, labels — into a single set of table schemas with consistent column names, types, partitioning, and uniqueness constraints.
No matter which exchange the data came from, which vendor delivered it, or which engine stores it (DuckDB, BigQuery, Snowflake, OpenLakehouse, DolphinDB), the CDM surface looks the same. A feature written against cdm_trade_enriched.trade_price works identically whether the underlying data came from Databento equities, Binance futures, or Bybit perpetuals.
Why a CDM
Market data arrives fragmented. Each feed provider has its own schema, field names, timestamp conventions, and data types. Without a CDM, every downstream system — state reconstruction, feature computation, label generation, model training — must handle provider-specific quirks individually. This produces:
- Duplicate transformation logic — the same renaming, casting, and enrichment repeated across research notebooks, production pipelines, and backtesting frameworks
- Silent schema drift — a provider changes a field name or type; the pipeline breaks, or worse, produces subtly wrong results
- Engine lock-in — transformations written in engine-specific SQL make switching backends a rewrite, not a config change
The CDM solves this by defining schemas once in YAML. Feed providers declare field mappings from their native format to CDM columns using engine-agnostic QFSQL expressions. The dbt generator translates those mappings into engine-native SQL. The result: one schema definition yields correct DDL and transformations on every engine.
High-Level Structure
CDM tables fall into four categories, each produced at a different pipeline stage:
Source Data (exchange APIs, vendor files)
│
▼ [Ingestion + dbt]
┌──────────────────────────────┐
│ Base Tables │
│ Raw data, normalized │
│ cdm_trades, cdm_lob_l1, │
│ cdm_lob_l2 │
└──────────────┬───────────────┘
│
▼ [State Engine]
┌──────────────────────────────┐
│ Enriched Tables │
│ Trades + L1 context │
│ cdm_trade_enriched │
└──────────────┬───────────────┘
│
▼ [State Engine]
┌──────────────────────────────┐
│ Bar Tables │
│ 9 bar types, common OHLCV │
│ time, tick, volume, dollar, │
│ imbalance, run, volatility, │
│ dollar_imbalance, cusum │
└──────────────┬───────────────┘
│
┌─────────┴─────────┐
▼ ▼ [Label / Feature Engines]
┌──────────────┐ ┌──────────────┐
│ Output Tables │ │ Output Tables │
│ labels │ │ ft_features │
└──────────────┘ └──────────────┘
| Category | Produced By | Tables |
|---|---|---|
| Base | Ingestion + dbt | cdm_trades, cdm_lob_l1, cdm_lob_l2 |
| Enriched | State Engine | cdm_trade_enriched |
| Bar | State Engine | 9 bar tables — one per bar type |
| Output | Label Engine, Feature Engine | labels, ft_features |
Each table definition lives as a YAML file in the quantflow metadata package. At project load time, Pydantic validates every schema. At pipeline runtime, tables are created on-demand from metadata — no manual DDL, no migration scripts. Partitioning, clustering, and uniqueness rules are engine-applied from the same definition.
Base Tables
Raw market data normalized into CDM schema by ingestion and dbt.
cdm_trades
Individual trade records from exchanges and data providers.
| Column | Type | Description |
|---|---|---|
venue | string | Trading venue/exchange |
symbol | string | Instrument symbol |
sequence_id | bigint | Unique sequence identifier per event |
event_time | timestamp | Trade execution timestamp |
price | decimal(20,8) | Trade execution price |
size | decimal(20,8) | Trade size |
is_buyer_maker | boolean | Whether buyer was the maker |
order_type | string | Associated order type |
received_time | timestamp | Record received timestamp |
processed_time | timestamp | System processing timestamp |
| Property | Value |
|---|---|
| Partition | event_time (hour) |
| Cluster | venue, symbol |
| Unique Key | venue, symbol, sequence_id, event_time |
cdm_lob_l1
Top-of-book (level 1) order book snapshots — best bid and ask at each update.
| Column | Type | Description |
|---|---|---|
action | string | MBP-1 action type (B=bid, A=ask, T=trade) |
venue | string | Trading venue/exchange |
symbol | string | Instrument symbol |
event_time | timestamp | Snapshot timestamp |
best_bid_price | decimal(20,8) | Best bid price |
best_bid_size | decimal(20,8) | Best bid size |
best_ask_price | decimal(20,8) | Best ask price |
best_ask_size | decimal(20,8) | Best ask size |
spread | decimal(20,8) | Bid-ask spread (calculated) |
mid_price | decimal(20,8) | Mid price (calculated) |
weighted_mid_price | decimal(20,8) | Size-weighted mid price |
order_book_depth | decimal(20,8) | Total depth at best levels |
sequence_id | bigint | Sequence identifier |
processed_time | timestamp | System processing timestamp |
| Property | Value |
|---|---|
| Partition | event_time (hour) |
| Cluster | venue, symbol |
| Unique Key | venue, symbol, sequence_id, event_time, action |
cdm_lob_l2
Full-depth LOB snapshots with both L1 top-of-book and L2 depth arrays, plus aggregated metrics. Pass-through from raw feed data, enriched with derived analytics by the State Engine.
| Column | Type | Description |
|---|---|---|
venue | string | Trading venue |
symbol | string | Instrument symbol |
event_time | timestamp | Snapshot timestamp |
best_bid_price | decimal(20,8) | Best bid (L1) |
best_bid_size | decimal(20,8) | Best bid size (L1) |
best_ask_price | decimal(20,8) | Best ask (L1) |
best_ask_size | decimal(20,8) | Best ask size (L1) |
spread | decimal(20,8) | Bid-ask spread |
mid_price | decimal(20,8) | Mid price |
weighted_mid_price | decimal(20,8) | Size-weighted mid price |
bids | array<struct> | Bid levels: {level, price, size, order_count} |
asks | array<struct> | Ask levels: {level, price, size, order_count} |
total_bid_depth | decimal(20,8) | Sum of all bid sizes |
total_ask_depth | decimal(20,8) | Sum of all ask sizes |
depth_imbalance | decimal(20,8) | (total_bid − total_ask) / (total_bid + total_ask) |
vwap_bid | decimal(20,8) | Volume-weighted average bid price |
vwap_ask | decimal(20,8) | Volume-weighted average ask price |
sequence_id | bigint | Sequence identifier |
processed_time | timestamp | System processing timestamp |
| Property | Value |
|---|---|
| Partition | event_time (hour) |
| Cluster | venue, symbol |
| Unique Key | venue, symbol, event_time |
Enriched Tables
Produced by the State Engine — L1 context joined to every trade.
cdm_trade_enriched
Every trade enriched with the L1 order book state at trade time. This is the primary source table for most feature computations.
| Column | Type | Description |
|---|---|---|
symbol | string | Instrument symbol |
venue | string | Trading venue |
event_time | timestamp | Trade execution timestamp |
sequence_id | bigint | Unique trade identifier |
trade_price | decimal(20,8) | Trade execution price |
trade_size | decimal(20,8) | Trade size |
trade_direction | integer | +1 buy, -1 sell |
best_bid_price | decimal(20,8) | Best bid at trade time |
best_ask_price | decimal(20,8) | Best ask at trade time |
best_bid_size | decimal(20,8) | Best bid size at trade time |
best_ask_size | decimal(20,8) | Best ask size at trade time |
mid_price | decimal(20,8) | Mid price at trade time |
spread | decimal(20,8) | Bid-ask spread at trade time |
effective_spread | decimal(20,8) | 2 · |
micro_price | decimal(20,8) | Size-weighted micro price |
book_imbalance | decimal(20,8) | (bid_sz − ask_sz) / (bid_sz + ask_sz) |
p_buy | decimal(10,8) | Probability trade was buyer-initiated |
signed_volume | decimal(20,8) | size · (2·p_buy − 1) — continuous signed volume |
buy_volume | decimal(20,8) | Volume when direction = buy |
sell_volume | decimal(20,8) | Volume when direction = sell |
ret | decimal(20,8) | Trade-to-trade price return |
log_return | decimal(20,8) | Log return between consecutive trades |
sign_method | string | Direction inference method (DSIDE, QUOTE_INFER, LEE_READY) |
sign_confidence | string | Confidence level (HIGH, MEDIUM, LOW, UNKNOWN) |
is_inferred | boolean | Whether direction was inferred |
trade_count | integer | Trade counter (1 per trade, for aggregation) |
orderbook_update_count | integer | Order book update counter |
processed_time | timestamp | System processing timestamp |
| Property | Value |
|---|---|
| Partition | event_time (hour) |
| Cluster | venue, symbol |
| Unique Key | venue, symbol, sequence_id, event_time |
Bar Tables
Nine bar types produced by the State Engine. All bar tables share a common OHLCV schema with bar-specific metadata columns.
Common bar columns:
| Column | Type | Description |
|---|---|---|
venue | string | Trading venue |
symbol | string | Instrument symbol |
start_time | timestamp | Bar start timestamp |
end_time | timestamp | Bar end timestamp |
bar_type | string | Bar type identifier |
resolution | string | Bar resolution string |
param_set_id | string | Parameter set identifier |
open | decimal(20,8) | Opening price |
high | decimal(20,8) | Highest price |
low | decimal(20,8) | Lowest price |
close | decimal(20,8) | Closing price |
volume | decimal(20,8) | Total volume |
dollar_volume | decimal(20,8) | Total dollar volume |
trade_count | bigint | Number of trades |
vwap | decimal(20,8) | Volume-weighted average price |
avg_trade_size | decimal(20,8) | Average trade size |
bar_index | bigint | Sequential bar index |
processed_time | timestamp | System processing timestamp |
| Property | Value |
|---|---|
| Partition | start_time (day) |
| Cluster | venue, symbol, bar_type, param_set_id |
| Unique Key | venue, symbol, bar_type, param_set_id, start_time |
Bar Types
| Table | Trigger | Typical Use |
|---|---|---|
cdm_time_bars | Fixed interval (e.g. 1 min) | Default clock, ML training |
cdm_tick_bars | N trades | Fastest clock, execution features |
cdm_volume_bars | Volume threshold | Volume-standardized bars |
cdm_dollar_bars | Dollar volume threshold | Robust to price changes |
cdm_imbalance_bars | Signed volume imbalance (k) | Information-driven sampling |
cdm_run_bars | Consecutive same-direction trades (w) | Sequential run detection |
cdm_volatility_bars | EWMA variance threshold | Volatility-regime sampling |
cdm_dollar_imbalance_bars | Signed dollar imbalance (k) | Dollar-weighted information bars |
cdm_cusum_bars | CUSUM filter threshold | Structural break detection |
Output Tables
labels
Label engine output — supervised learning targets computed from bars.
| Column | Type | Description |
|---|---|---|
venue | string | Trading venue |
symbol | string | Instrument symbol |
timestamp | timestamp | Label timestamp |
label_name | string | Label type (e.g. triple_barrier, fixed_horizon_return) |
label_value | bigint | Label value (-1, 0, +1) |
horizon | bigint | Look-forward horizon in periods |
run_id | string | Label computation run identifier |
forward_return | decimal(20,8) | Forward return used for label determination |
barrier_hit | bigint | Which barrier hit first (1=upper, -1=lower, 0=expiry) |
cusum_value | decimal(20,8) | CUSUM cumulative value (trend scanning) |
| Property | Value |
|---|---|
| Partition | timestamp (day) |
| Cluster | symbol, label_name |
| Unique Key | symbol, timestamp, label_name, run_id |
ft_features
Feature engine output — computed features in key-value format.
| Column | Type | Description |
|---|---|---|
symbol | string | Instrument symbol |
feature_name | string | Unique feature identifier |
feature_version | string | Feature version (e.g. v1.0) |
feature_id | string | Unique feature instance ID |
timestamp | timestamp | Feature calculation timestamp |
feature_namespace | string | Category (signal, execution, quality, regime, stability, technical) |
feature_params | array<struct> | Calculation parameters |
feature_value_numeric | decimal | Scalar feature value |
feature_value_json | json | Complex feature values (vectors/tensors) |
feature_value_array | array<numeric> | Time series/vector feature values |
confidence_score | decimal(3,2) | Feature confidence (0.0–1.0) |
pipeline_name | string | Generating pipeline name |
status | string | Calculation status (success, error, partial) |
error_message | string | Error details if failed |
processed_time | timestamp | Processing timestamp |
| Property | Value |
|---|---|
| Partition | timestamp (day) |
| Cluster | symbol, feature_namespace |
| Unique Key | feature_id |
Table Creation & Lifecycle
CDM entities are defined declaratively in YAML and validated via Pydantic at project load time. Tables are created on-demand from metadata — no manual DDL.
from quantflow.metadata import load_metadata
meta = load_metadata(project_dir=".")
# Tables created on-demand from CDM entity definitions
# Partitioning, clustering, and unique constraints applied per-engine
The lifecycle flows:
YAML Definition → Pydantic Validation → On-Demand Table Creation → Incremental Loading
- Ingestion + dbt produce
cdm_tradesandcdm_lob_l1from raw provider data - State Engine produces
cdm_trade_enriched,cdm_lob_l2, and all 9 bar tables - Label Engine produces
labelsfrom bar tables - Feature Engine produces
ft_featuresfrom enriched trades, bars, and LOB snapshots