
Overview

MarketState sits between DataInfra and FeatureDAG in the QuantFlow pipeline. It transforms normalized CDM data into structured, analysis-ready market representations — bars, order book states, and labeled datasets. MarketState does NOT compute features; it constructs the canonical market states that FeatureDAG consumes.


Core Responsibility

Construct canonical market representations — bars, order book snapshots, and labeled events — from validated CDM data. Every downstream feature and model is built on MarketState output.


Position in QuantFlow

DataInfra → MarketState → FeatureDAG → Research / Trading

DataInfra delivers clean, validated CDM data. MarketState structures it into bars, order book states, and labels. FeatureDAG computes features on top of those structured states.


Architecture

MarketState has two engines:

| Engine | Role | Runtime |
| --- | --- | --- |
| State Engine | Constructs bars, enriched trades, and LOB snapshots from CDM data | Batch (Numba kernel) + Streaming (DolphinDB) |
| Label Engine | Generates supervised learning labels from bar data | Batch (Polars); labels persisted to warehouse |


Raw CDM Data (trades, LOB)
        │
        ▼
State Engine ──► cdm_trade_enriched (28 columns: trades + L1 context)
             ──► cdm_lob_l2 (full-depth snapshots with derived metrics)
             ──► cdm_{type}_bars (9 bar types, each a separate table)
        │
        ▼
Label Engine ──► cdm_labels (6 labeling methods)
        │
        ▼
Structured Datasets → FeatureDAG

Components

State Engine

A single-pass, Numba-accelerated kernel processes interleaved trades and LOB updates in one fused loop. It produces enriched trades with L1 analytics, full-depth LOB snapshots, and 9 bar types across the activity-sampled and information-driven families. It runs in batch mode (the Numba kernel over historical data) or in streaming mode (a DolphinDB reactive state engine).

State Engine overview · Bar types catalog · LOB book, trade enrichment & snapshots
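
The fused loop described above can be pictured with a small pure-Python sketch. It is not the QuantFlow kernel: the event encoding, the `Bar` class, and `build_dollar_bars` are hypothetical names chosen for illustration, and trade signing is assumed to compare the trade price with the prevailing mid-quote. The point is only that one pass over time-ordered events can maintain L1 state, enrich each trade, and accumulate a dollar bar together.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical event encoding for this sketch (not the CDM schema):
#   ("Q", ts, bid, ask)    -> L1 quote update
#   ("T", ts, price, size) -> trade print
Event = Tuple[str, int, float, float]


@dataclass
class Bar:
    open: float
    high: float
    low: float
    close: float
    volume: float
    signed_volume: float
    end_ts: int


def build_dollar_bars(events: List[Event], threshold: float):
    """Single pass: track L1, sign each trade, and emit dollar bars."""
    bid: Optional[float] = None
    ask: Optional[float] = None
    enriched = []            # (ts, price, size, direction, mid)
    bars: List[Bar] = []
    cur: Optional[Bar] = None
    dollars = 0.0            # dollar value traded in the open bar

    for kind, ts, a, b in events:
        if kind == "Q":
            bid, ask = a, b  # update L1 state and move on
            continue

        price, size = a, b
        mid = (bid + ask) / 2.0 if bid is not None and ask is not None else None
        # Assumed quote rule: above mid = buy (+1), below = sell (-1), else 0.
        if mid is None or price == mid:
            direction = 0
        else:
            direction = 1 if price > mid else -1
        enriched.append((ts, price, size, direction, mid))

        if cur is None:
            cur = Bar(price, price, price, price, 0.0, 0.0, ts)
        cur.high = max(cur.high, price)
        cur.low = min(cur.low, price)
        cur.close = price
        cur.volume += size
        cur.signed_volume += direction * size
        cur.end_ts = ts

        dollars += price * size
        if dollars >= threshold:  # close the bar once enough value has traded
            bars.append(cur)
            cur = None
            dollars = 0.0

    return enriched, bars


events = [
    ("Q", 1, 99.0, 101.0),
    ("T", 2, 101.0, 10.0),
    ("T", 3, 99.0, 5.0),
    ("Q", 4, 100.0, 102.0),
    ("T", 5, 102.0, 10.0),
]
enriched, bars = build_dollar_bars(events, threshold=1500.0)
```

Because the loop uses only scalars and simple branching, the same structure is the kind of code that a Numba `@njit` kernel over typed arrays could express; the list-and-dataclass form here is chosen for readability.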

Label Engine

Six registered labeling methods (triple barrier, fixed horizon return, trend scanning, quantile binning, time-series sign, and meta-labeling) are dispatched via a decorator-based registry. Labels are a batch-only concern and are persisted to cdm_labels for model training.

Label Engine overview · Labeling methods in detail
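
A decorator-based registry of the kind mentioned above might look like the following sketch. The names `LABEL_REGISTRY`, `register_label`, and `compute_label` are hypothetical and are not taken from the QuantFlow code base; the fixed horizon return shown is the textbook definition, `close[t + h] / close[t] - 1`, and may differ from the production method in details such as how the tail is handled.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical registry: maps a label type name to the function that computes it.
LABEL_REGISTRY: Dict[str, Callable[..., List[Optional[float]]]] = {}


def register_label(name: str):
    """Decorator that records a labeling function under `name`."""
    def decorator(fn):
        if name in LABEL_REGISTRY:
            raise ValueError(f"label type already registered: {name!r}")
        LABEL_REGISTRY[name] = fn
        return fn
    return decorator


@register_label("fixed_horizon_return")
def fixed_horizon_return(close: List[float], horizon: int) -> List[Optional[float]]:
    """Forward return over `horizon` bars; None where the future bar is missing."""
    n = len(close)
    return [
        close[i + horizon] / close[i] - 1.0 if i + horizon < n else None
        for i in range(n)
    ]


def compute_label(label_type: str, **kwargs) -> List[Optional[float]]:
    """Dispatch to the registered function for `label_type`."""
    try:
        fn = LABEL_REGISTRY[label_type]
    except KeyError:
        raise ValueError(f"unknown label type: {label_type!r}") from None
    return fn(**kwargs)


labels = compute_label("fixed_horizon_return", close=[100.0, 110.0, 121.0], horizon=1)
```

With this pattern, adding a method means writing one decorated function; the dispatcher and any configuration that refers to the method by its type name stay unchanged.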


Configuration

The State Engine and the Label Engine are both configured in quantflow_project.yml:

state_engine:
  enabled: true
  micro_batch_size: 200000
  bar_groups:
    - name: liquid_equities
      symbols: [SPY, QQQ, NVDA]
      bars:
        - type: time
          interval_minutes: 1
        - type: dollar
          threshold: 100000.0
        - type: tick
          count: 200
        - type: imbalance
          k: 100.0
        - type: run
          window: 10
        - type: volatility
          threshold: 0.0001
        - type: dollar_imbalance
          k: 100000.0
        - type: cusum
          threshold: 0.5
  snapshots:
    period_seconds: 60.0
    depth_levels: 10
  trade_signing:
    method: quote

label_engine:
  enabled: true
  historical_label_engine: polars
  labels:
    - name: triple_barrier_20_2pct
      type: triple_barrier
      parameters:
        horizon: 20
        upper_barrier: 0.02
        lower_barrier: 0.02
        vertical_barrier: 20
      inputs:
        close: close
        high: high
        low: low
      bar_types: [time_1m]


| Mode | What Runs |
| --- | --- |
| Batch (Research) | State Engine via Ray with daily sharding, Label Engine via LabelEngine.run() |
| Streaming (Trading) | State Engine inside DolphinDB (continuous bar formation); labels not applicable |
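
To make the triple_barrier parameters in the configuration more concrete, here is a minimal sketch of one common reading of the method. It is written under several assumptions that this page does not confirm: barriers are fractional returns relative to the entry bar's close, `vertical_barrier` is the maximum number of bars to look ahead, the `horizon` parameter is ignored because its role is not specified here, and a bar that touches both barriers is resolved pessimistically as a lower-barrier hit. The function name `triple_barrier` is illustrative only.

```python
from typing import List, Optional


def triple_barrier(
    close: List[float],
    high: List[float],
    low: List[float],
    upper_barrier: float,
    lower_barrier: float,
    vertical_barrier: int,
) -> List[Optional[int]]:
    """Return +1 / -1 / 0 per bar, or None when the outcome is not yet known."""
    n = len(close)
    labels: List[Optional[int]] = []
    for i in range(n):
        up = close[i] * (1.0 + upper_barrier)   # profit-taking level
        dn = close[i] * (1.0 - lower_barrier)   # stop-loss level
        last = min(i + vertical_barrier, n - 1)
        label: Optional[int] = None
        for j in range(i + 1, last + 1):
            # Assumption: the lower barrier is checked first, so a bar that
            # touches both levels is labeled -1.
            if low[j] <= dn:
                label = -1
                break
            if high[j] >= up:
                label = 1
                break
        # No horizontal barrier hit: label 0 only if the full window was seen.
        if label is None and i + vertical_barrier <= n - 1:
            label = 0
        labels.append(label)
    return labels


close = [100.0, 100.0, 100.0, 100.0, 100.0]
high = [100.0, 103.0, 100.0, 100.0, 100.0]
low = [100.0, 99.5, 97.0, 100.0, 100.0]
tb = triple_barrier(close, high, low, 0.02, 0.02, vertical_barrier=2)
```

Using high and low rather than close alone lets the sketch detect intrabar barrier touches, which is presumably why the configuration maps all three price inputs.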

Outputs

| Output | Table | Description |
| --- | --- | --- |
| Enriched trades | cdm_trade_enriched | Trades with L1 context: 28 columns including direction, signed volume, micro-price |
| LOB snapshots | cdm_lob_l2 | Full-depth L2 snapshots with L1, depth arrays, and derived metrics |
| Bar tables | cdm_{type}_bars | 9 bar types, each a separate table with OHLCV + bar metadata |
| Labels | cdm_labels | Normalized label table: symbol, timestamp, label_name, label_value, run_id |
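
Because cdm_labels is normalized (one row per symbol, timestamp, and label name), a consumer typically has to filter by run and pivot to one row per observation before training. The sketch below shows that reshaping with plain dictionaries; the helper `labels_to_wide` and the sample rows are hypothetical, and only the five column names come from the table above.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def labels_to_wide(
    rows: List[Dict[str, Any]], run_id: str
) -> Dict[Tuple[str, int], Dict[str, float]]:
    """Pivot long label rows of one run into {(symbol, timestamp): {label_name: value}}."""
    wide: Dict[Tuple[str, int], Dict[str, float]] = defaultdict(dict)
    for r in rows:
        if r["run_id"] != run_id:
            continue  # keep a single labeling run
        wide[(r["symbol"], r["timestamp"])][r["label_name"]] = r["label_value"]
    return dict(wide)


# Made-up sample rows, for illustration only.
rows = [
    {"symbol": "SPY", "timestamp": 1, "label_name": "triple_barrier_20_2pct",
     "label_value": 1.0, "run_id": "run_a"},
    {"symbol": "SPY", "timestamp": 1, "label_name": "fixed_horizon_5",
     "label_value": 0.003, "run_id": "run_a"},
    {"symbol": "SPY", "timestamp": 1, "label_name": "triple_barrier_20_2pct",
     "label_value": -1.0, "run_id": "run_b"},
]
wide = labels_to_wide(rows, run_id="run_a")
```

Keeping run_id in the long table allows several labeling runs to coexist; the filter above selects one of them so that each label name appears at most once per observation.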