QuantFlow FeatureDAG

Declarative YAML → IR DAG → Polars (batch) + DolphinDB (streaming)

🔗 Where It Fits

DataInfra → MarketState → FeatureDAG → Research / Trading

📋 Overview

FeatureDAG is a compiler for quantitative finance features. It eliminates the research-to-production gap by treating financial features as declarative computation graphs, not hand-coded functions. A single YAML specification compiles through a 4-stage pipeline and produces both batch (Polars) and streaming (DolphinDB) execution from the same definition.

Quant researchers write features in notebooks with pandas and one-off scripts. Production engineers then rewrite them for streaming pipelines. The two versions diverge: results drift, bugs creep in, and every new feature requires a full rewrite cycle. FeatureDAG closes this gap by compiling both runtimes from one definition.

[Formula YAML] → [AST Compiler] → [IR DAG] → [Lowering] → [Execution]
                                               ┌───────────────┴───────────────┐
                                        Polars (batch)               DolphinDB (streaming)

🏗️ 4-Stage Compilation Pipeline

Stage 1

AST Compiler

Formula string → IR nodes. Parses Python AST, dispatches ~40 built-in functions.
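A minimal sketch of what this stage could look like, using only Python's standard ast module. The IRNode class, the BUILTINS table, and the three function names in it are assumptions made for illustration; FeatureDAG's actual node types and its ~40 built-ins are not shown in this document.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class IRNode:
    """Hypothetical IR node: an operation name plus its arguments."""
    op: str
    args: tuple = ()


# Hypothetical dispatch table: function name -> expected argument count.
BUILTINS = {"mid_price": 2, "rolling_vol": 2, "ema": 2}


def _visit(node: ast.AST) -> IRNode:
    # Bare names are treated as input columns.
    if isinstance(node, ast.Name):
        return IRNode("column", (node.id,))
    # Literals such as window lengths become constant nodes.
    if isinstance(node, ast.Constant):
        return IRNode("const", (node.value,))
    # Function calls are dispatched through the built-in table.
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        name = node.func.id
        if name not in BUILTINS:
            raise ValueError(f"unknown function: {name}")
        if len(node.args) != BUILTINS[name]:
            raise ValueError(f"{name} expects {BUILTINS[name]} arguments")
        return IRNode(name, tuple(_visit(arg) for arg in node.args))
    raise ValueError(f"unsupported syntax: {type(node).__name__}")


def compile_formula(formula: str) -> IRNode:
    """Parse a formula string as a Python expression and emit IR nodes."""
    tree = ast.parse(formula, mode="eval")
    return _visit(tree.body)


ir = compile_formula("rolling_vol(mid_price(bid, ask), 20)")
```

Because the formula is parsed rather than evaluated, unknown functions and unsupported syntax are rejected at compile time instead of failing inside a backend.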

Stage 2

IR DAG

Frozen IR nodes. 50+ schema contracts. Column aliasing. Rustworkx DAG.
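The idea of frozen nodes arranged in a DAG can be sketched as follows. This example substitutes the standard library's graphlib for rustworkx so that it runs without dependencies; the Node class and the feature names are hypothetical, and schema contracts and column aliasing are left out.

```python
from dataclasses import dataclass, FrozenInstanceError
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Node:
    """Hypothetical frozen IR node; attributes cannot change after creation."""
    name: str
    op: str
    inputs: tuple[str, ...] = ()


nodes = [
    Node("vol_20", "rolling_vol", ("mid",)),
    Node("mid", "mid_price", ("bid", "ask")),
    Node("bid", "source"),
    Node("ask", "source"),
]


def execution_order(nodes: list[Node]) -> list[str]:
    """Return node names ordered so every node comes after its inputs."""
    graph = {node.name: set(node.inputs) for node in nodes}
    # static_order() raises graphlib.CycleError if the graph is not a DAG.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(nodes)
```

Freezing the nodes means a later pass cannot mutate a node that other nodes already depend on; any rewrite has to produce a new node.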

Stage 3

Lowering

Backend protocol. Decorator-based registry. 30+ agnostic ops.
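A decorator-based registry of this kind might look like the sketch below. The names lowers, lower, and _LOWERINGS are hypothetical, and the lowering functions return expression strings rather than live Polars or DolphinDB objects so the example stays dependency-free. The emitted strings use Polars' rolling_mean and DolphinDB's mavg only as illustrations.

```python
from typing import Callable

# (backend, op) -> function that renders an engine-specific expression.
_LOWERINGS: dict[tuple[str, str], Callable[..., str]] = {}


def lowers(backend: str, op: str):
    """Decorator that registers a lowering function for one backend and op."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        _LOWERINGS[(backend, op)] = fn
        return fn
    return register


@lowers("polars", "rolling_mean")
def _polars_rolling_mean(col: str, window: int) -> str:
    return f'pl.col("{col}").rolling_mean(window_size={window})'


@lowers("dolphindb", "rolling_mean")
def _dolphindb_rolling_mean(col: str, window: int) -> str:
    return f"mavg({col}, {window})"


def lower(backend: str, op: str, *args) -> str:
    """Look up the registered lowering and apply it to the op's arguments."""
    try:
        fn = _LOWERINGS[(backend, op)]
    except KeyError:
        raise NotImplementedError(f"{op} has no {backend} lowering") from None
    return fn(*args)


batch_expr = lower("polars", "rolling_mean", "mid", 20)
stream_expr = lower("dolphindb", "rolling_mean", "mid", 20)
```

With this layout, supporting a new op means adding one decorated function per backend, and a missing lowering surfaces as an explicit error at compile time.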

Stage 4

Execution

Polars lazy pipeline (batch) + DolphinDB stream engines (streaming).

🧩 6 Computation Primitives

Every feature computation falls into one of six primitives. The compiler understands the semantics of each and optimizes accordingly.

Primitive | Nature
SOURCE    | Data ingestion from CDM tables
TRANSFORM | Stateless row-wise mapping (mid_price, trade_sign, depth_imbalance)
WINDOW    | Rolling/windowed aggregation (rolling_vol, rolling_corr, pct_change, lag)
STATE     | Recursive computations with memory (ema, decay_accum, rolling_zscore)
SINK      | Marks output columns: final feature values written to storage
EVENT     | Bar trigger generation for information-driven sampling
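As a rough illustration, the six primitives could be modelled as an enum, with each built-in function tagged by the primitive it belongs to. The Primitive enum, the PRIMITIVE_OF mapping, and needs_history are hypothetical names; the function-to-primitive assignments are copied from the table above.

```python
from enum import Enum


class Primitive(Enum):
    SOURCE = "source"
    TRANSFORM = "transform"
    WINDOW = "window"
    STATE = "state"
    SINK = "sink"
    EVENT = "event"


# Example functions and their primitives, as listed in the table.
PRIMITIVE_OF = {
    "mid_price": Primitive.TRANSFORM,
    "trade_sign": Primitive.TRANSFORM,
    "depth_imbalance": Primitive.TRANSFORM,
    "rolling_vol": Primitive.WINDOW,
    "rolling_corr": Primitive.WINDOW,
    "pct_change": Primitive.WINDOW,
    "lag": Primitive.WINDOW,
    "ema": Primitive.STATE,
    "decay_accum": Primitive.STATE,
    "rolling_zscore": Primitive.STATE,
}


def needs_history(fn_name: str) -> bool:
    """WINDOW and STATE ops depend on earlier rows; TRANSFORM ops do not."""
    return PRIMITIVE_OF[fn_name] in {Primitive.WINDOW, Primitive.STATE}
```

A compiler can use a classification like this to decide, for example, which operations are safe to evaluate row by row and which need buffered or carried state.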

🔄 Batch + Streaming Consistency

Same feature definitions, two execution models. The two paths diverge only at the expression-generation layer, where lowering emits engine-specific expressions.

         | Batch                  | Streaming
Runtime  | Polars (Python)        | DolphinDB cluster
Data     | Static Arrow tables    | Unbounded stream tables
Trigger  | Explicit run()         | Continuous, on data arrival
Grouping | Sequential per-feature | Consolidated engines (~60% fewer)
Use      | Research               | Live trading, real-time signals

Same definitions, two runtimes, no duplicate implementations.
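One way to check this kind of consistency is a parity test: compute a feature over a full history in one pass, replay the same data one row at a time through an incremental engine, and compare the outputs. The sketch below does this in pure Python for a rolling mean. Both implementations are hypothetical stand-ins for the Polars and DolphinDB paths, not FeatureDAG code.

```python
from collections import deque


def rolling_mean_batch(values: list[float], window: int) -> list:
    """Batch model: the whole series is available up front."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


class RollingMeanStream:
    """Streaming model: one value arrives at a time; state is a bounded buffer."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.buf = deque(maxlen=window)

    def update(self, x: float):
        self.buf.append(x)
        if len(self.buf) < self.window:
            return None
        return sum(self.buf) / self.window


prices = [100.0, 101.0, 99.5, 102.0, 103.5]
batch = rolling_mean_batch(prices, 3)

engine = RollingMeanStream(3)
stream = [engine.update(p) for p in prices]

# Parity check: both execution models must agree row for row.
assert batch == stream
```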

💡 Key Design Decisions

  • Feature definitions are declarative, not imperative: YAML specs compile to IR, with no hand-coded computation logic in the pipeline.
  • Single IR, multiple backends: the IR is backend-agnostic. Lowering functions produce engine-specific expressions for Polars and DolphinDB from the same IR nodes.
  • Feature-level error isolation: a misconfigured feature is logged and skipped while the rest of the pipeline continues, which is critical for research iteration speed.
  • LazyFrame chaining in batch: Polars expressions are chained via successive with_columns calls, letting the query optimizer fuse operations and minimize allocations.
  • Consolidated engine deployment in streaming: multiple features sharing the same input table deploy as a single DolphinDB engine, reducing engine count by ~60%.
  • Expression folding in streaming: intermediate computation steps are inlined into terminal expressions via regex substitution, so deployed engines only see final outputs.
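The expression-folding decision in the last bullet can be illustrated with a short sketch. The fold function, the step names, and the expression strings are assumptions for illustration, and the sketch assumes the step definitions are acyclic. The real folding pass may differ.

```python
import re


def fold(steps: dict[str, str], terminals: list[str]) -> dict[str, str]:
    """Inline intermediate expressions into each terminal expression."""
    intermediates = {k: v for k, v in steps.items() if k not in terminals}
    folded = {}
    for name in terminals:
        expr = steps[name]
        changed = True
        # Repeat until no intermediate name remains (handles chained steps).
        while changed:
            changed = False
            for inter, body in intermediates.items():
                # Word boundaries keep "mid" from matching inside "mid_price".
                pattern = rf"\b{re.escape(inter)}\b"
                new_expr = re.sub(pattern, lambda m, b=body: f"({b})", expr)
                if new_expr != expr:
                    expr, changed = new_expr, True
        folded[name] = expr
    return folded


steps = {
    "mid": "(bid + ask) / 2",
    "ret": "mid / prev(mid) - 1",
    "vol_20": "mstd(ret, 20)",
}
folded = fold(steps, terminals=["vol_20"])
```

After folding, only vol_20 remains, and its expression refers to the raw input columns bid and ask rather than to mid or ret.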

📚 Standard Feature Library

FeatureDAG ships with 133 FeatureTypes across 6 dimensions (Signal, Execution, Quality, Regime, Stability, Technical).

Each feature is documented with its inputs, parameters, computation type, and economic meaning. Features are consumed directly by the compiler; no manual implementation is needed.

Browse the Standard Feature Library →

📖 Design Docs