
System Overview

QuantFlow is a quantitative financial intelligence infrastructure platform that transforms raw market data into microstructure-aware features and trading signals. A unified declarative pipeline — from ingestion through CDM normalization, market state reconstruction, feature compilation, and execution — runs consistently across batch (research) and streaming (trading) from the same YAML definitions.


Architecture

DataInfra → MarketState → FeatureDAG → Execution Layer
          ┌──────┴──────┐
          ▼             ▼
    State Engine  Label Engine

Component | Role | Key Subsystems
DataInfra | Ingest, normalize, validate | CDM, feed providers, dbt generator, engine adapter, data quality
MarketState | Structure market representations | State Engine (bars, snapshots), Label Engine (triple barrier, trend scanning)
FeatureDAG | Compile formulas into executable graphs | AST compiler, IR DAG, lowering, execution
Execution Layer | Run feature pipelines in target environments | Batch, Streaming, mode polymorphism

DataInfra

The engine-agnostic data foundation. DataInfra ingests raw market data from exchange feeds and vendors, normalizes it into the Common Data Model (CDM), and auto-generates production-ready dbt pipelines with embedded quality controls.

  • Ingestion — Declarative feed provider YAML: connectors, field mappings, QFSQL transforms, quality tests. No custom ingestion code.
  • CDM — Unified schema contract consumed by MarketState and FeatureDAG. Every downstream stage operates on consistent, validated inputs.
  • dbt Generator — Six sub-generators produce staging models, CDM union models, engine-specific SQL macros, and connection profiles. Zero manual SQL.
  • Engine Layer — DBEngine interface backed by an engine registry: DuckDB for local dev, OpenLakehouse (S3 + Iceberg + Trino) for production, and DolphinDB for real-time streaming, all behind the same API and the same Arrow-based data flow.
  • Data Quality — Four-layer validation: Pydantic schema, Pandera ingestion, dbt warehouse tests, Elementary monitoring with alerting.
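
The engine registry idea above can be sketched in a few lines. This is a minimal illustration, not QuantFlow's actual code: the `execute` method, the `register_engine` and `get_engine` helpers, and the in-memory stand-in engine are all hypothetical names chosen for the example, and the real DBEngine surface is not shown on this page.

```python
from typing import Callable, Dict, List, Protocol


class DBEngine(Protocol):
    """Hypothetical minimal engine interface (assumed for illustration)."""

    name: str

    def execute(self, sql: str) -> List[dict]:
        ...


# Maps an engine name to a zero-argument factory that builds the engine.
_ENGINES: Dict[str, Callable[[], DBEngine]] = {}


def register_engine(name: str):
    """Decorator that records an engine factory under the given name."""
    def decorator(factory):
        _ENGINES[name] = factory
        return factory
    return decorator


def get_engine(name: str) -> DBEngine:
    """Look up and instantiate an engine; unknown names fail loudly."""
    try:
        return _ENGINES[name]()
    except KeyError:
        raise ValueError(
            f"unknown engine {name!r}; registered: {sorted(_ENGINES)}"
        ) from None


@register_engine("duckdb")
class InMemoryStandIn:
    """Stand-in for a real DuckDB adapter; it only echoes the query."""

    name = "duckdb"

    def execute(self, sql: str) -> List[dict]:
        return [{"engine": self.name, "sql": sql}]


rows = get_engine("duckdb").execute("select 1")
```

Callers depend only on the interface and the registry name, so switching the target engine would be a configuration change rather than a code change.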

DataInfra overview


MarketState

Sits between DataInfra and FeatureDAG. Transforms normalized CDM data into structured, analysis-ready market representations — bars, order book states, and labeled datasets. MarketState does NOT compute features; it constructs the canonical market states that FeatureDAG consumes.

State Engine constructs bars (time, tick, volume, dollar, imbalance, run, volatility, dollar_imbalance, CUSUM), enriched trades with L1 analytics, and LOB snapshots. It runs in batch via single-pass Numba-accelerated replay, or in streaming via the DolphinDB reactive state engine.
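
To make the bar types concrete, here is a minimal pure-Python sketch of single-pass volume bar construction. It is an illustration under stated assumptions rather than the State Engine's implementation: the `Bar` fields and the `volume_bars` function are hypothetical, trades are never split across bars, and a trailing partial bar is dropped.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Bar:
    open: float
    high: float
    low: float
    close: float
    volume: float


def volume_bars(trades: Iterable[Tuple[float, float]], threshold: float) -> List[Bar]:
    """Close a bar each time accumulated traded volume reaches `threshold`.

    `trades` is an iterable of (price, size) pairs in time order.
    """
    bars: List[Bar] = []
    current: Optional[Bar] = None
    for price, size in trades:
        if current is None:
            # The first trade of a new bar sets open/high/low/close.
            current = Bar(price, price, price, price, 0.0)
        current.high = max(current.high, price)
        current.low = min(current.low, price)
        current.close = price
        current.volume += size
        if current.volume >= threshold:
            bars.append(current)
            current = None  # the next trade starts a fresh bar
    return bars  # any unfinished bar is intentionally discarded


trades = [(100.0, 5), (101.0, 3), (99.5, 4), (100.5, 10), (100.2, 1)]
bars = volume_bars(trades, threshold=10)
# Two complete bars: the first closes at 99.5 with volume 12,
# the second is the single 10-lot trade at 100.5.
```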

Label Engine generates supervised learning labels across five paradigms: triple barrier, fixed horizon return, trend scanning, quantile binning, and time-series sign. Registry-based dispatch with backend-agnostic I/O. Labels are batch-only — used for model training.
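
As an illustration of one of these paradigms, the sketch below labels each observation with the triple barrier method and exposes it through a small registry. The function names, parameters, and registry layout are assumptions made for the example. This variant assigns 0 when the vertical (time) barrier is hit first; other variants use the sign of the return at the horizon instead.

```python
from typing import Callable, Dict, List, Sequence


def triple_barrier_labels(
    prices: Sequence[float], upper: float, lower: float, horizon: int
) -> List[int]:
    """Label each entry point +1, -1, or 0.

    +1: return reaches +upper within `horizon` steps
    -1: return reaches -lower within `horizon` steps
     0: neither barrier is touched before the horizon (or data) ends
    """
    n = len(prices)
    labels: List[int] = []
    for i in range(n):
        entry = prices[i]
        label = 0
        last = min(i + horizon, n - 1)
        for j in range(i + 1, last + 1):
            ret = prices[j] / entry - 1.0
            if ret >= upper:
                label = 1
                break
            if ret <= -lower:
                label = -1
                break
        labels.append(label)
    return labels


# Hypothetical registry: a method name maps to a labeling function.
LABELERS: Dict[str, Callable[..., List[int]]] = {
    "triple_barrier": triple_barrier_labels,
}


def make_labels(method: str, prices: Sequence[float], **params) -> List[int]:
    """Dispatch to a registered labeler by name."""
    return LABELERS[method](prices, **params)


labels = make_labels(
    "triple_barrier",
    [100.0, 101.0, 103.0, 99.0, 96.0, 97.0],
    upper=0.02, lower=0.02, horizon=2,
)
# labels == [1, 0, -1, -1, 0, 0]
```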

MarketState overview


FeatureDAG

A compiler for quantitative finance features. Formula strings in YAML compile through a 4-stage pipeline — AST compiler, IR DAG, lowering, execution — producing Polars (batch) and DolphinDB (streaming) results from the same YAML.

Layer | Role
AST Compiler | Parses formula strings via Python ast, dispatches ~40 built-in functions to IR nodes
IR DAG | Frozen, validated DAG (rustworkx) with 50+ compile-time schema contracts
Lowering | Translates IR into pl.Expr (Polars) or DolphinDB DSL strings via decorator-based registry
Execution | Runs lowered expressions: Polars lazy pipeline (batch) or DolphinDB stream engines (streaming)
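
The first stage can be sketched with the standard-library ast module. The IR node classes (`Col`, `Const`, `Call`), the two whitelisted function names, and the example formula are hypothetical; the actual formula language and its ~40 built-ins are documented in the Formula Language Reference, not here.

```python
import ast
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Call:
    op: str
    args: Tuple["Node", ...]


Node = Union[Col, Const, Call]

_BINOPS = {ast.Add: "add", ast.Sub: "sub", ast.Mult: "mul", ast.Div: "div"}
_FUNCTIONS = {"rolling_mean", "log"}  # illustrative whitelist, not the real set


def compile_formula(formula: str) -> Node:
    """Parse a formula string and convert it to frozen IR nodes."""
    tree = ast.parse(formula, mode="eval")
    return _visit(tree.body)


def _visit(node: ast.AST) -> Node:
    if isinstance(node, ast.Name):
        return Col(node.id)
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return Const(float(node.value))
    if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
        return Call(_BINOPS[type(node.op)], (_visit(node.left), _visit(node.right)))
    if (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id in _FUNCTIONS
        and not node.keywords
    ):
        return Call(node.func.id, tuple(_visit(a) for a in node.args))
    # Anything outside the whitelist is rejected at compile time.
    raise ValueError(f"unsupported syntax: {ast.dump(node)}")


ir = compile_formula("(ask - bid) / rolling_mean(mid, 20)")
```

Because only whitelisted node types are converted, constructs such as attribute access or unknown function calls fail at compile time instead of reaching a backend.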

FeatureDAG overview · Formula Language Reference


Execution Layer

How feature DAGs become running computations. Two execution paths consume the same FeatureInstance objects — the divergence happens only at expression generation. No duplicate implementations.

Aspect | Batch (Polars) | Streaming (DolphinDB)
Runtime | Python process | DolphinDB cluster
Trigger | Explicit run() | Continuous data arrival
Lifecycle | Run-once per symbol | Deploy-once, run continuously
Primary use | Research, batch scoring | Live trading, real-time signals

Mode polymorphism — both backends support three computation modes configured per feature:

  • tick — per-event computation for HFT signals and execution algorithms
  • bar — per-boundary computation for bar-native strategies
  • tick_to_bar — tick-level features projected onto a bar grid via temporal aggregation (mean, std, last, sum)
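
The tick_to_bar mode can be pictured as a group-by over a fixed time grid. The sketch below is a dependency-free illustration, not the platform's implementation: it assumes integer timestamps in seconds, ticks in time order, bars labeled by their start time, and a hypothetical `tick_to_bar` helper.

```python
import statistics
from typing import Callable, Dict, List, Sequence, Tuple

# The four aggregations named above; std of a single tick is defined as 0.0 here.
AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "mean": statistics.fmean,
    "std": lambda xs: statistics.stdev(xs) if len(xs) > 1 else 0.0,
    "last": lambda xs: xs[-1],
    "sum": sum,
}


def tick_to_bar(
    ticks: Sequence[Tuple[int, float]], bar_seconds: int, agg: str
) -> Dict[int, float]:
    """Project tick-level feature values onto a fixed time-bar grid."""
    buckets: Dict[int, List[float]] = {}
    for ts, value in ticks:
        bar_start = ts - ts % bar_seconds  # floor to the bar boundary
        buckets.setdefault(bar_start, []).append(value)
    fn = AGGREGATORS[agg]
    return {start: fn(values) for start, values in sorted(buckets.items())}


ticks = [(0, 1.0), (20, 3.0), (59, 2.0), (60, 10.0), (125, 4.0)]
by_mean = tick_to_bar(ticks, bar_seconds=60, agg="mean")
by_last = tick_to_bar(ticks, bar_seconds=60, agg="last")
# by_mean == {0: 2.0, 60: 10.0, 120: 4.0}
# by_last == {0: 2.0, 60: 10.0, 120: 4.0}
```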

The lowering registry dispatches (Primitive, op) pairs to backend-specific functions. Adding a new engine means implementing the backend protocol and registering lowering functions — no IR changes needed.
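
A decorator-based registry of this kind might look like the sketch below. It is a simplified illustration: the `lowers` decorator, the `Rolling`/`mean` pair, and the registry key layout are assumptions, and both backends emit plain strings here to stay dependency-free, whereas the source states that the Polars path produces pl.Expr objects.

```python
from typing import Callable, Dict, Tuple

# Key: (backend, primitive, op). Value: function that emits backend code.
_LOWERINGS: Dict[Tuple[str, str, str], Callable[..., str]] = {}


def lowers(backend: str, primitive: str, op: str):
    """Register a lowering function for one (primitive, op) pair on one backend."""
    def decorator(fn: Callable[..., str]) -> Callable[..., str]:
        _LOWERINGS[(backend, primitive, op)] = fn
        return fn
    return decorator


@lowers("polars", "Rolling", "mean")
def _polars_rolling_mean(column: str, window: int) -> str:
    # Emitted as source text for illustration only.
    return f'pl.col("{column}").rolling_mean({window})'


@lowers("dolphindb", "Rolling", "mean")
def _dolphindb_rolling_mean(column: str, window: int) -> str:
    # mavg is DolphinDB's moving-average function; treat the exact DSL as an assumption.
    return f"mavg({column}, {window})"


def lower(backend: str, primitive: str, op: str, *args) -> str:
    """Dispatch one IR operation to the lowering registered for `backend`."""
    try:
        fn = _LOWERINGS[(backend, primitive, op)]
    except KeyError:
        raise NotImplementedError(
            f"no lowering for {primitive}.{op} on backend {backend!r}"
        ) from None
    return fn(*args)


batch_expr = lower("polars", "Rolling", "mean", "close", 20)
stream_expr = lower("dolphindb", "Rolling", "mean", "close", 20)
```

Supporting another backend in this sketch means registering more functions under a new backend name; the IR and the `lower` dispatcher stay untouched.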

Execution Layer overview


Cross-Cutting Properties

  • Single definition, dual execution — Same YAML compiles to both batch and streaming. No duplicate implementations, no research/production drift.
  • Deterministic and reproducible — Hash-based feature versioning. No hidden state, no implicit transforms. Full audit trail.
  • Extensible at every layer — Custom feed providers, feature types, labeling methods, and execution backends. All via declarative config or registry-based plugin patterns.
  • Data quality built in — Schema validation, anomaly detection, and freshness monitoring are first-class concerns, not afterthoughts.
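
Hash-based versioning can be sketched as hashing a canonical serialization of the feature definition. The field names, the choice of SHA-256, and the 12-character truncation below are assumptions for illustration; the page does not specify how QuantFlow derives its version hashes.

```python
import hashlib
import json


def feature_version(definition: dict) -> str:
    """Derive a short, deterministic version id from a feature definition.

    Keys are sorted and whitespace is fixed, so two definitions with the
    same content always hash to the same id regardless of key order.
    """
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical feature definitions, as they might look once parsed from YAML.
v1 = feature_version({"name": "spread_z", "formula": "(ask - bid) / mid", "mode": "tick"})
v1_reordered = feature_version({"mode": "tick", "formula": "(ask - bid) / mid", "name": "spread_z"})
v2 = feature_version({"name": "spread_z", "formula": "(ask - bid) / bid", "mode": "tick"})
# v1 == v1_reordered, while any change to the formula yields a different id (v2).
```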