Overview
MarketState sits between DataInfra and FeatureDAG in the QuantFlow pipeline. It transforms normalized CDM data into structured, analysis-ready market representations — bars, order book states, and labeled datasets. MarketState does NOT compute features; it constructs the canonical market states that FeatureDAG consumes.
Core Responsibility
Construct canonical market representations — bars, order book snapshots, and labeled events — from validated CDM data. Every downstream feature and model is built on MarketState output.
Position in QuantFlow
DataInfra → MarketState → FeatureDAG → Research / Trading
DataInfra delivers clean, validated CDM data. MarketState structures it into bars, order book states, and labels. FeatureDAG computes features on top of those structured states.
Architecture
MarketState has two engines:
| Engine | Role | Runtime |
|---|---|---|
| State Engine | Constructs bars, enriched trades, and LOB snapshots from CDM data | Batch (Numba kernel) + Streaming (DolphinDB) |
| Label Engine | Generates supervised learning labels from bar data | Batch (Polars); labels persisted to warehouse |
Raw CDM Data (trades, LOB)
        │
        ▼
State Engine ──► cdm_trade_enriched (28 columns: trades + L1 context)
             ──► cdm_lob_l2 (full-depth snapshots with derived metrics)
             ──► cdm_{type}_bars (9 bar types, each a separate table)
        │
        ▼
Label Engine ──► cdm_labels (6 labeling methods)
        │
        ▼
Structured Datasets → FeatureDAG
Components
State Engine
A single-pass, Numba-accelerated kernel processes interleaved trades and LOB updates in one fused loop. It produces enriched trades with L1 analytics, full-depth LOB snapshots, and 9 bar types across the activity-sampled and information-driven families. The engine runs either in batch (the Numba kernel over historical data) or in streaming (a DolphinDB reactive state engine).
→ State Engine overview · Bar types catalog · LOB book, trade enrichment & snapshots
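To make the fused-loop idea concrete, here is a minimal pure-Python sketch, not the actual Numba kernel. It walks one time-ordered event stream, keeps the current L1 quote, signs each trade against the prevailing mid (one common reading of quote-based signing), and accumulates dollar bars in the same pass. The names `Event` and `run_state_loop`, the field names, and the tiny output schema are all assumptions for illustration; the real cdm_trade_enriched table has 28 columns.

```python
from dataclasses import dataclass


@dataclass
class Event:
    ts: int              # event timestamp
    kind: str            # "quote" or "trade"
    price: float = 0.0   # trade price (trades only)
    size: float = 0.0    # trade size (trades only)
    bid: float = 0.0     # best bid (quotes only)
    ask: float = 0.0     # best ask (quotes only)


def run_state_loop(events, dollar_threshold):
    """Single pass over time-ordered events; returns (enriched_trades, bars)."""
    bid = ask = None     # current L1 state, unknown until the first quote
    enriched, bars = [], []
    bar = None           # dollar bar under construction
    for ev in events:
        if ev.kind == "quote":
            bid, ask = ev.bid, ev.ask
            continue
        # Sign the trade against the prevailing mid: +1 buy, -1 sell, 0 unknown.
        if bid is None:
            mid, direction = None, 0
        else:
            mid = (bid + ask) / 2
            direction = (ev.price > mid) - (ev.price < mid)
        enriched.append({
            "ts": ev.ts, "price": ev.price, "size": ev.size,
            "mid": mid, "direction": direction,
            "signed_volume": direction * ev.size,
        })
        # Update the open dollar bar with the same trade, in the same loop.
        if bar is None:
            bar = {"start_ts": ev.ts, "open": ev.price, "high": ev.price,
                   "low": ev.price, "volume": 0.0, "dollar": 0.0}
        bar["high"] = max(bar["high"], ev.price)
        bar["low"] = min(bar["low"], ev.price)
        bar["close"] = ev.price
        bar["end_ts"] = ev.ts
        bar["volume"] += ev.size
        bar["dollar"] += ev.price * ev.size
        if bar["dollar"] >= dollar_threshold:
            bars.append(bar)   # threshold reached: close the bar
            bar = None
    return enriched, bars
```

Because quotes and trades share one loop, each trade sees exactly the book state that preceded it, and no second pass is needed to join trades to quotes before bars are built.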
Label Engine
Six registered labeling methods — triple barrier, fixed horizon return, trend scanning, quantile binning, time-series sign, and meta-labeling — dispatched via a decorator-based registry. Labels are a batch-only concern, persisted to cdm_labels for model training.
→ Label Engine overview · Labeling methods in detail
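A decorator-based registry of this kind can be sketched in a few lines. The code below is an illustration under assumptions, not the Label Engine's actual API: `LABEL_REGISTRY`, `register_label`, and `compute_label` are hypothetical names, and the fixed horizon return is implemented on plain Python lists rather than Polars.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical registry mapping a label type name to its implementation.
LABEL_REGISTRY: Dict[str, Callable] = {}


def register_label(name: str):
    """Decorator that records a labeling function under `name`."""
    def decorator(fn):
        if name in LABEL_REGISTRY:
            raise ValueError(f"label method {name!r} already registered")
        LABEL_REGISTRY[name] = fn
        return fn
    return decorator


@register_label("fixed_horizon_return")
def fixed_horizon_return(close: List[float], horizon: int) -> List[Optional[float]]:
    """Forward simple return over `horizon` bars; None where the future is unknown."""
    n = len(close)
    return [
        close[i + horizon] / close[i] - 1.0 if i + horizon < n else None
        for i in range(n)
    ]


def compute_label(label_type: str, **kwargs):
    """Dispatch to the registered method for `label_type`."""
    try:
        fn = LABEL_REGISTRY[label_type]
    except KeyError:
        raise ValueError(f"unknown label type: {label_type!r}") from None
    return fn(**kwargs)
```

With this pattern, adding a seventh method means writing one decorated function; the dispatcher and any config-driven caller stay unchanged.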
Configuration
The State Engine and Label Engine are configured in quantflow_project.yml:
state_engine:
  enabled: true
  micro_batch_size: 200000
  bar_groups:
    - name: liquid_equities
      symbols: [SPY, QQQ, NVDA]
      bars:
        - type: time
          interval_minutes: 1
        - type: dollar
          threshold: 100000.0
        - type: tick
          count: 200
        - type: imbalance
          k: 100.0
        - type: run
          window: 10
        - type: volatility
          threshold: 0.0001
        - type: dollar_imbalance
          k: 100000.0
        - type: cusum
          threshold: 0.5
  snapshots:
    period_seconds: 60.0
    depth_levels: 10
  trade_signing:
    method: quote
label_engine:
  enabled: true
  historical_label_engine: polars
  labels:
    - name: triple_barrier_20_2pct
      type: triple_barrier
      parameters:
        horizon: 20
        upper_barrier: 0.02
        lower_barrier: 0.02
        vertical_barrier: 20
      inputs:
        close: close
        high: high
        low: low
      bar_types: [time_1m]
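As a rough guide to what the triple_barrier parameters above might control, the following sketch labels each bar by whichever barrier is touched first. It is an interpretation under stated assumptions, not the engine's implementation: barriers are taken as fractional moves from the bar's close, touches are detected on the following bars' high and low, `vertical_barrier` is read as a maximum number of bars, a bar that touches both barriers is left at 0, and `horizon` is ignored because its relation to `vertical_barrier` is not specified here. The function name and signature are hypothetical.

```python
def triple_barrier(close, high, low, upper_barrier, lower_barrier, vertical_barrier):
    """Return one label per bar: +1, -1, 0, or None when the window is incomplete."""
    n = len(close)
    labels = []
    for i in range(n):
        if i + vertical_barrier >= n:
            labels.append(None)   # window would run past the available bars
            continue
        up = close[i] * (1.0 + upper_barrier)
        down = close[i] * (1.0 - lower_barrier)
        label = 0                 # default: vertical barrier reached first
        for j in range(i + 1, i + vertical_barrier + 1):
            hit_up = high[j] >= up
            hit_down = low[j] <= down
            if hit_up and hit_down:
                break             # ambiguous bar: keep 0 (assumption)
            if hit_up:
                label = 1
                break
            if hit_down:
                label = -1
                break
        labels.append(label)
    return labels
```

With `upper_barrier: 0.02` and `lower_barrier: 0.02`, a bar closing at 100 would be labeled +1 if a later bar's high reaches 102 before any low reaches 98 within the window.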
| Mode | What Runs |
|---|---|
| Batch (Research) | State Engine via Ray with daily sharding, Label Engine via LabelEngine.run() |
| Streaming (Trading) | State Engine inside DolphinDB — continuous bar formation; labels not applicable |
Outputs
| Output | Table | Description |
|---|---|---|
| Enriched trades | cdm_trade_enriched | Trades with L1 context — 28 columns including direction, signed volume, micro-price |
| LOB snapshots | cdm_lob_l2 | Full-depth L2 snapshots with L1, depth arrays, and derived metrics |
| Bar tables | cdm_{type}_bars | 9 bar types, each a separate table with OHLCV + bar metadata |
| Labels | cdm_labels | Normalized label table — symbol, timestamp, label_name, label_value, run_id |
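The normalized (long) layout of cdm_labels means each row carries a single label value, identified by its name and run. A small sketch of that shape follows; the `LabelRow` class, the `to_label_rows` helper, the column types, and the choice to skip unlabeled bars are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelRow:
    symbol: str
    timestamp: int
    label_name: str
    label_value: float
    run_id: str


def to_label_rows(symbol, timestamps, label_name, values, run_id):
    """Emit one row per labeled bar, skipping bars whose label is None."""
    return [
        LabelRow(symbol, ts, label_name, float(v), run_id)
        for ts, v in zip(timestamps, values)
        if v is not None
    ]
```

Storing labels this way lets several labeling methods and several runs share one table, since a new method adds rows under a new label_name instead of a new column.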