System Overview
QuantFlow is a quantitative financial intelligence infrastructure platform that transforms raw market data into microstructure-aware features and trading signals. A unified declarative pipeline — from ingestion through CDM normalization, market state reconstruction, feature compilation, and execution — runs consistently across batch (research) and streaming (trading) from the same YAML definitions.
Architecture
DataInfra → MarketState → FeatureDAG → Execution Layer
│
┌─────┴─────┐
▼ ▼
State Engine Label Engine
| Component | Role | Key Subsystems |
|---|---|---|
| DataInfra | Ingest, normalize, validate | CDM, feed providers, dbt generator, engine adapter, data quality |
| MarketState | Structure market representations | State Engine (bars, snapshots), Label Engine (triple barrier, trend scanning) |
| FeatureDAG | Compile formulas into executable graphs | AST compiler, IR DAG, lowering, execution |
| Execution Layer | Run feature pipelines in target environments | Batch, Streaming, mode polymorphism |
DataInfra
The engine-agnostic data foundation. DataInfra ingests raw market data from exchange feeds and vendors, normalizes it into the Common Data Model (CDM), and auto-generates production-ready dbt pipelines with embedded quality controls.
- Ingestion — Declarative feed provider YAML: connectors, field mappings, QFSQL transforms, quality tests. No custom ingestion code.
- CDM — Unified schema contract consumed by MarketState and FeatureDAG. Every downstream stage operates on consistent, validated inputs.
- dbt Generator — Six sub-generators produce staging models, CDM union models, engine-specific SQL macros, and connection profiles. Zero manual SQL.
- Engine Layer —
DBEngineinterface backed by an engine registry. DuckDB for local dev, OpenLakehouse (S3 + Iceberg + Trino) for production, DolphinDB for real-time streaming — same API, same Arrow-based data flow. - Data Quality — Four-layer validation: Pydantic schema, Pandera ingestion, dbt warehouse tests, Elementary monitoring with alerting.
MarketState
Sits between DataInfra and FeatureDAG. Transforms normalized CDM data into structured, analysis-ready market representations — bars, order book states, and labeled datasets. MarketState does NOT compute features; it constructs the canonical market states that FeatureDAG consumes.
State Engine constructs bars (time, tick, volume, dollar, imbalance, run, volatility, dollar_imbalance, CUSUM), enriched trades with L1 analytics, and LOB snapshots. Runs in batch via single-pass Numba-accelerated replay, or streaming via DolphinDB reactive state engine.
Label Engine generates supervised learning labels across five paradigms: triple barrier, fixed horizon return, trend scanning, quantile binning, and time-series sign. Registry-based dispatch with backend-agnostic I/O. Labels are batch-only — used for model training.
FeatureDAG
A compiler for quantitative finance features. Formula strings in YAML compile through a 4-stage pipeline — AST compiler, IR DAG, lowering, execution — producing Polars (batch) and DolphinDB (streaming) results from the same YAML.
| Layer | Role |
|---|---|
| AST Compiler | Parses formula strings via Python ast, dispatches ~40 built-in functions to IR nodes |
| IR DAG | Frozen, validated DAG (rustworkx) with 50+ compile-time schema contracts |
| Lowering | Translates IR into pl.Expr (Polars) or DolphinDB DSL strings via decorator-based registry |
| Execution | Runs lowered expressions — Polars lazy pipeline (batch) or DolphinDB stream engines (streaming) |
→ FeatureDAG overview · Formula Language Reference
Execution Layer
How feature DAGs become running computations. Two execution paths consume the same FeatureInstance objects — the divergence happens only at expression generation. No duplicate implementations.
| Aspect | Batch (Polars) | Streaming (DolphinDB) |
|---|---|---|
| Runtime | Python process | DolphinDB cluster |
| Trigger | Explicit run() | Continuous data arrival |
| Lifecycle | Run-once per symbol | Deploy-once, run continuously |
| Primary use | Research, batch scoring | Live trading, real-time signals |
Mode polymorphism — both backends support three computation modes configured per feature:
- tick — per-event computation for HFT signals and execution algorithms
- bar — per-boundary computation for bar-native strategies
- tick_to_bar — tick-level features projected onto a bar grid via temporal aggregation (mean, std, last, sum)
The lowering registry dispatches (Primitive, op) pairs to backend-specific functions. Adding a new engine means implementing the backend protocol and registering lowering functions — no IR changes needed.
Cross-Cutting Properties
- Single definition, dual execution — Same YAML compiles to both batch and streaming. No duplicate implementations, no research/production drift.
- Deterministic and reproducible — Hash-based feature versioning. No hidden state, no implicit transforms. Full audit trail.
- Extensible at every layer — Custom feed providers, feature types, labeling methods, and execution backends. All via declarative config or registry-based plugin patterns.
- Data quality built in — Schema validation, anomaly detection, and freshness monitoring are first-class concerns, not afterthoughts.