System Overview

QuantFlow is a quantitative financial intelligence infrastructure platform that transforms raw market data into microstructure-aware features and trading signals. A unified declarative pipeline — from ingestion through CDM normalization, market state reconstruction, feature compilation, and execution — runs consistently across batch (research) and streaming (trading) from the same YAML definitions.

Architecture

DataInfra → MarketState → FeatureDAG → Execution Layer
                  │
            ┌─────┴─────┐
            ▼           ▼
      State Engine  Label Engine

Component	Role	Key Subsystems
DataInfra	Ingest, normalize, validate	CDM, feed providers, dbt generator, engine adapter, data quality
MarketState	Structure market representations	State Engine (bars, snapshots), Label Engine (triple barrier, trend scanning)
FeatureDAG	Compile formulas into executable graphs	AST compiler, IR DAG, lowering, execution
Execution Layer	Run feature pipelines in target environments	Batch, Streaming, mode polymorphism

DataInfra

The engine-agnostic data foundation. DataInfra ingests raw market data from exchange feeds and vendors, normalizes it into the Common Data Model (CDM), and auto-generates production-ready dbt pipelines with embedded quality controls.

Ingestion — Declarative feed provider YAML: connectors, field mappings, QFSQL transforms, quality tests. No custom ingestion code.
CDM — Unified schema contract consumed by MarketState and FeatureDAG. Every downstream stage operates on consistent, validated inputs.
dbt Generator — Six sub-generators produce staging models, CDM union models, engine-specific SQL macros, and connection profiles. Zero manual SQL.
Engine Layer — DBEngine interface backed by an engine registry. DuckDB for local dev, OpenLakehouse (S3 + Iceberg + Trino) for production, DolphinDB for real-time streaming — same API, same Arrow-based data flow.
Data Quality — Four-layer validation: Pydantic schema, Pandera ingestion, dbt warehouse tests, Elementary monitoring with alerting.

→ DataInfra overview

MarketState

Sits between DataInfra and FeatureDAG. Transforms normalized CDM data into structured, analysis-ready market representations — bars, order book states, and labeled datasets. MarketState does NOT compute features; it constructs the canonical market states that FeatureDAG consumes.

State Engine constructs bars (time, tick, volume, dollar, imbalance, run, volatility, dollar_imbalance, CUSUM), enriched trades with L1 analytics, and LOB snapshots. Runs in batch via single-pass Numba-accelerated replay, or streaming via DolphinDB reactive state engine.

Label Engine generates supervised learning labels across five paradigms: triple barrier, fixed horizon return, trend scanning, quantile binning, and time-series sign. Registry-based dispatch with backend-agnostic I/O. Labels are batch-only — used for model training.

→ MarketState overview

FeatureDAG

A compiler for quantitative finance features. Formula strings in YAML compile through a 4-stage pipeline — AST compiler, IR DAG, lowering, execution — producing Polars (batch) and DolphinDB (streaming) results from the same YAML.

Layer	Role
AST Compiler	Parses formula strings via Python `ast`, dispatches ~40 built-in functions to IR nodes
IR DAG	Frozen, validated DAG (rustworkx) with 50+ compile-time schema contracts
Lowering	Translates IR into `pl.Expr` (Polars) or DolphinDB DSL strings via decorator-based registry
Execution	Runs lowered expressions — Polars lazy pipeline (batch) or DolphinDB stream engines (streaming)

→ FeatureDAG overview · Formula Language Reference

Execution Layer

How feature DAGs become running computations. Two execution paths consume the same FeatureInstance objects — the divergence happens only at expression generation. No duplicate implementations.

Aspect	Batch (Polars)	Streaming (DolphinDB)
Runtime	Python process	DolphinDB cluster
Trigger	Explicit `run()`	Continuous data arrival
Lifecycle	Run-once per symbol	Deploy-once, run continuously
Primary use	Research, batch scoring	Live trading, real-time signals

Mode polymorphism — both backends support three computation modes configured per feature:

tick — per-event computation for HFT signals and execution algorithms
bar — per-boundary computation for bar-native strategies
tick_to_bar — tick-level features projected onto a bar grid via temporal aggregation (mean, std, last, sum)

The lowering registry dispatches (Primitive, op) pairs to backend-specific functions. Adding a new engine means implementing the backend protocol and registering lowering functions — no IR changes needed.

→ Execution Layer overview

Cross-Cutting Properties

Single definition, dual execution — Same YAML compiles to both batch and streaming. No duplicate implementations, no research/production drift.
Deterministic and reproducible — Hash-based feature versioning. No hidden state, no implicit transforms. Full audit trail.
Extensible at every layer — Custom feed providers, feature types, labeling methods, and execution backends. All via declarative config or registry-based plugin patterns.
Data quality built in — Schema validation, anomaly detection, and freshness monitoring are first-class concerns, not afterthoughts.

Architecture​

DataInfra​

MarketState​

FeatureDAG​

Execution Layer​

Cross-Cutting Properties​