
State Engine

The State Engine transforms raw market events (trades and LOB updates) into enriched CDM tables. It has two execution modes: batch, a Numba JIT-compiled replay over historical data that produces all outputs in a single fused kernel pass, and streaming, a DolphinDB reactive state engine for continuous real-time processing.


Two Execution Paths

| | Batch (Research) | Streaming (Live Trading) |
|---|---|---|
| Engine | Numba fused kernel + Python orchestration | DolphinDB reactive state engine |
| Data source | Historical CDM tables (SQL → Arrow → NumPy) | Real-time stream tables (WebSocket) |
| Processing | Micro-batch replay (200k events per batch) | Continuous, event by event |
| Output | CDM tables via StateOutputWriter | Shared stream tables, pub/sub to downstream |

Both paths share the same configuration and produce the same output schema.


Batch Path: The Fused Kernel

Raw market data arrives as an interleaved stream of trades and LOB updates. Before features or labels can be computed, this must be transformed into structured CDM tables with derived analytics: trade signs, effective spreads, micro-price, book imbalance, bar OHLCV, and event-triggered boundaries.

The State Engine solves this with a single fused Numba kernel — one pass over the data produces all outputs simultaneously:

trades + LOB updates
│
▼
[fused_kernel: @njit(cache=True)]
│
├──► cdm_trade_enriched (28 columns)
├──► cdm_lob_l2 (full-depth snapshots)
├──► quotes (L1 per trade)
├──► lob_l1 (L1 per LOB event, optional)
└──► bars (11 types, each a separate output)
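
To make the derived analytics on an enriched trade more concrete, here is a minimal sketch of how a trade sign, effective spread, micro-price, and book imbalance could be computed from the L1 quote in force at the trade. These are common textbook definitions, not necessarily the formulas the engine uses, and `enrich_trade` and its field names are hypothetical.

```python
def enrich_trade(trade_px, bid_px, ask_px, bid_sz, ask_sz):
    """Derive L1 analytics for one trade (textbook definitions, assumed)."""
    mid = 0.5 * (bid_px + ask_px)
    # Quote rule: above mid is a buy (+1), below mid is a sell (-1), at mid is 0.
    sign = int(trade_px > mid) - int(trade_px < mid)
    # Effective spread: twice the signed distance of the trade from the mid.
    effective_spread = 2.0 * sign * (trade_px - mid)
    depth = bid_sz + ask_sz
    # Micro-price: each side's price weighted by the opposite side's size.
    micro_price = (bid_px * ask_sz + ask_px * bid_sz) / depth
    # Imbalance in [-1, 1]; positive means more size resting on the bid.
    imbalance = (bid_sz - ask_sz) / depth
    return {
        "sign": sign,
        "effective_spread": effective_spread,
        "micro_price": micro_price,
        "imbalance": imbalance,
    }


row = enrich_trade(trade_px=100.5, bid_px=100.0, ask_px=100.5, bid_sz=300, ask_sz=100)
```

With a heavier bid than ask, the micro-price (100.375) sits above the mid (100.25), which is the intended behavior of the size-weighted definition.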

A critical design choice: all internal prices are represented as integer ticks (price / tick_size). This avoids floating-point comparison issues in the order book and the bar triggers. Ticks are converted back to float prices when the output DataFrames are built.
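
The tick representation can be illustrated with a short sketch. The helper names `to_ticks` and `to_prices` are hypothetical, and rounding to the nearest tick is an assumption about how the division is made safe; the source only states that prices are stored as price / tick_size.

```python
import numpy as np


def to_ticks(prices, tick_size):
    """Convert float prices to integer ticks, rounding to the nearest tick."""
    # Rounding (rather than truncating) absorbs the small error that
    # floating-point division can leave on either side of an integer.
    ratio = np.asarray(prices, dtype=np.float64) / tick_size
    return np.rint(ratio).astype(np.int64)


def to_prices(ticks, tick_size):
    """Convert integer ticks back to float prices for output."""
    return np.asarray(ticks, dtype=np.int64) * tick_size


# Two float prices that should be equal but are not, bit for bit.
a = 0.1 + 0.2
b = 0.3
floats_equal = a == b

# In tick space the same two prices compare equal.
ticks_a = to_ticks([a], 0.1)
ticks_b = to_ticks([b], 0.1)
ticks_equal = bool(ticks_a[0] == ticks_b[0])

# Round trip for a typical quote.
bid_ticks = to_ticks([100.03], 0.01)
bid_price = to_prices(bid_ticks, 0.01)
```

Integer ticks also work directly as array indices, which is what makes a direct-indexed price map possible.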


Batch Path: Architecture

Python side — configuration, data loading via StateEngineReader, array allocation, and post-processing. I/O-bound work that benefits from Python's ergonomics.

Numba side — the hot event loop. Sparse LOB maintenance, trade enrichment, snapshot emission, and all 11 bar types. Everything runs inside fused_kernel, compiled to native code on first invocation and cached to disk.

Input: SQL → Arrow → NumPy

  1. StateEngineReader queries source databases per (venue, symbol) pair using engine-specific SQL generators (DuckDB, Trino, BigQuery, Snowflake, Databricks).
  2. Trades and LOB data are merged and sorted by (time ASC, event_type ASC) — LOB updates precede trades at identical timestamps. Trades get sequential trade_idx; LOB events get -1.
  3. Arrow RecordBatches (default 200,000 events) are converted to NumPy arrays via zero-copy where possible.
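
The merge and sort in step 2 can be sketched with NumPy as follows. The event-type codes and the function name `merge_events` are assumptions made for illustration; the only properties taken from the text are the (time ASC, event_type ASC) order, LOB updates preceding trades at equal timestamps, and the trade_idx convention.

```python
import numpy as np

# Hypothetical event-type codes, chosen so that LOB updates sort before trades.
EVENT_LOB = 0
EVENT_TRADE = 1


def merge_events(lob_times, trade_times):
    """Interleave LOB and trade timestamps in (time ASC, event_type ASC) order."""
    times = np.concatenate([lob_times, trade_times])
    event_type = np.concatenate([
        np.full(len(lob_times), EVENT_LOB, dtype=np.int8),
        np.full(len(trade_times), EVENT_TRADE, dtype=np.int8),
    ])
    # np.lexsort uses the LAST key as the primary key, so time comes last.
    order = np.lexsort((event_type, times))
    times, event_type = times[order], event_type[order]

    # Trades get a sequential trade_idx; LOB events get -1.
    trade_idx = np.full(len(times), -1, dtype=np.int64)
    is_trade = event_type == EVENT_TRADE
    trade_idx[is_trade] = np.arange(is_trade.sum())
    return times, event_type, trade_idx


lob_times = np.array([100, 200, 300], dtype=np.int64)
trade_times = np.array([200, 250], dtype=np.int64)
times, event_type, trade_idx = merge_events(lob_times, trade_times)
```

At timestamp 200 the LOB update lands before the trade, so the book is already current when the trade is processed.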

Processing

StateEngine.process(batch) allocates pre-sized output arrays and calls fused_kernel. Inside the loop:

  • LOB update → upsert_level on the sparse book (ADD/DELETE/MODIFY), with eager best bid/ask maintenance
  • Trade → emit quote, enrich trade with L1 data, check snapshot triggers, update all 11 bar accumulators
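
The shape of this loop can be sketched in plain Python. This is a toy illustration under stated assumptions, not the engine's kernel: the event codes, the `replay` function, the `upsert_level` signature, and the 1,000-tick price map are all hypothetical. It is written without `@njit` so it runs without Numba installed, and it covers only the book update and quote emission.

```python
import numpy as np

N_TICKS = 1_000            # toy price-map size, far smaller than a real one
BID, ASK = 0, 1
EVENT_LOB, EVENT_TRADE = 0, 1
NO_BID, NO_ASK = -1, N_TICKS   # sentinels for an empty side


def upsert_level(qty, best, side, tick, size):
    """Set the resting size at `tick` (0 deletes) and keep the best price current."""
    qty[side, tick] = size
    if side == BID:
        if size > 0 and tick > best[BID]:
            best[BID] = tick
        elif size == 0 and tick == best[BID]:
            t = tick - 1
            while t >= 0 and qty[BID, t] == 0:   # scan down to the next live bid
                t -= 1
            best[BID] = t                        # -1 (NO_BID) if the side is empty
    else:
        if size > 0 and tick < best[ASK]:
            best[ASK] = tick
        elif size == 0 and tick == best[ASK]:
            t = tick + 1
            while t < N_TICKS and qty[ASK, t] == 0:   # scan up to the next live ask
                t += 1
            best[ASK] = t                        # N_TICKS (NO_ASK) if the side is empty


def replay(event_type, side, tick, size):
    """One pass over the events: update the book on LOB events, emit a quote per trade."""
    qty = np.zeros((2, N_TICKS), dtype=np.int64)    # direct-indexed price map
    best = np.array([NO_BID, NO_ASK], dtype=np.int64)
    n_trades = int((event_type == EVENT_TRADE).sum())
    quotes = np.empty((n_trades, 2), dtype=np.int64)  # pre-sized output array
    k = 0
    for i in range(len(event_type)):
        if event_type[i] == EVENT_LOB:
            upsert_level(qty, best, side[i], tick[i], size[i])
        else:
            quotes[k, 0] = best[BID]
            quotes[k, 1] = best[ASK]
            k += 1
    return quotes


# bid 100 x5, ask 102 x3, trade, delete ask 102, ask 103 x4, trade
event_type = np.array([0, 0, 1, 0, 0, 1])
side = np.array([BID, ASK, ASK, ASK, ASK, ASK])
tick = np.array([100, 102, 102, 102, 103, 103])
size = np.array([5, 3, 1, 0, 4, 1])
quotes = replay(event_type, side, tick, size)
```

Because the best bid and ask are refreshed inside `upsert_level`, each trade reads the current L1 quote in constant time instead of searching the book.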

Output: NumPy → DataFrame → CDM Tables

After the kernel returns, DataFrames are built with tick→float conversion, deduplication, and param_set_id generation. StateOutputWriter dispatches each output to the target engine.


Configuration: Three-Tier Resolution

| Tier | Source | Purpose |
|---|---|---|
| 1 (Fallbacks) | FALLBACK_CONFIG (hardcoded) | Engine always has valid config |
| 2 (Metadata) | state_engine config block in quantflow_project.yml | Primary user-facing control |
| 3 (Overrides) | Dict passed to StateEngine constructor | Programmatic experiments, CLI flags |

Merge priority: overrides > metadata > fallbacks. StateEngineConfig.from_metadata() handles the resolution.
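
The merge priority can be illustrated with a small sketch. The configuration keys, the contents of `FALLBACK_CONFIG`, and the `resolve_config` helper are hypothetical, and the recursive (deep) merge is an assumption; the source does not say whether nested blocks are merged key by key or replaced whole.

```python
from copy import deepcopy

# Hypothetical fallback values; the real FALLBACK_CONFIG is not shown in this page.
FALLBACK_CONFIG = {
    "batch_size": 200_000,
    "bars": {"time": {"enabled": True, "interval_s": 60}},
}


def deep_merge(base, patch):
    """Return a copy of `base` with `patch` applied; nested dicts merge key by key."""
    out = deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = deepcopy(value)
    return out


def resolve_config(metadata=None, overrides=None):
    """Apply the three tiers in order: fallbacks, then metadata, then overrides."""
    config = deep_merge(FALLBACK_CONFIG, metadata or {})
    return deep_merge(config, overrides or {})


metadata = {"bars": {"time": {"interval_s": 300}}}   # e.g. from the YAML block
overrides = {"batch_size": 50_000}                   # e.g. from a CLI flag
config = resolve_config(metadata, overrides)
```

Here the override wins for `batch_size`, the metadata wins for `interval_s`, and `enabled` falls through from the fallbacks, so every key always resolves to a value.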


Key Design Decisions

  1. Single fused Numba kernel — all output families in one pass, guaranteeing LOB state consistency per trade
  2. Integer tick representation — eliminates floating-point comparison issues, enables direct-indexed price map
  3. Sparse LOB with O(1) lookup — 20M-entry price map for direct-indexed level access
  4. LOB-before-trade ordering — reader enforces sort order so the book is up-to-date when a trade arrives
  5. Python/Numba split — maximum throughput in Numba, maximum ergonomics in Python
  6. Three-tier configuration — sensible defaults always present, YAML for control, inline overrides for experimentation
