Overview

FeatureDAG is a compiler for quantitative finance features. Formula strings declared in YAML compile through a four-stage pipeline (AST compiler, IR DAG, lowering, execution) and produce batch (Polars) and streaming (DolphinDB) results from the same specification.

Why

In a typical quant organization, the same features, such as order flow imbalance (OFI), rolling z-scores, VPIN, and momentum signals, are written and maintained independently by multiple teams. Researchers author them in notebooks, production engineers rewrite them for the live pipeline, and new hires rediscover them from scratch. The result:

Duplicated effort — teams across the organization rebuild the same core logic, with no shared library or single source of truth
Divergent implementations — the same feature computed in research and production can produce different numbers because each version encodes subtly different logic
Quality blind spots — without centralized validation, bugs in feature code go undetected; a researcher's notebook error might surface only months later when a model underperforms
Notebook → production gap — the research version of a feature doesn't match what runs in production; every new feature requires a full rewrite cycle to bridge this gap

FeatureDAG addresses this by treating features as declarative computation graphs rather than hand-coded functions spread across notebooks and scripts. A formula string in YAML is parsed with Python's `ast` module and compiled into an Intermediate Representation (IR), which is then lowered to backend-specific expressions. Same formula, same IR, two backends: one source of truth for the entire organization.

[Formula YAML] → [AST Compiler] → [IR DAG] → [Lowering] → [Execution]
                                                                 │
                                                      ┌──────────┼──────────┐
                                                      ▼                     ▼
                                               Polars (batch)    DolphinDB (streaming)
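
The first two stages can be illustrated with a minimal sketch. It assumes a formula is a valid Python expression, and it is not FeatureDAG's actual implementation: the `IRNode` fields, the op names, and the two built-ins (`rolling_mean`, `rolling_std`) are hypothetical stand-ins for the real node schema and the ~40 built-in functions.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class IRNode:
    """Hypothetical immutable IR node: an op, its input nodes, and literal params."""
    op: str
    inputs: tuple = ()
    params: tuple = ()


# Assumed dispatch tables; the real compiler's built-ins and op names may differ.
BUILTINS = {"rolling_mean": "ROLLING_MEAN", "rolling_std": "ROLLING_STD"}
BINOPS = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL", ast.Div: "DIV"}


def compile_formula(formula: str) -> IRNode:
    """Parse the formula as a Python expression and convert its syntax tree to IR."""
    tree = ast.parse(formula, mode="eval")
    return _visit(tree.body)


def _visit(node: ast.AST) -> IRNode:
    if isinstance(node, ast.Name):
        # A bare name refers to an input column.
        return IRNode("SOURCE", params=(node.id,))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return IRNode("CONST", params=(node.value,))
    if isinstance(node, ast.BinOp) and type(node.op) in BINOPS:
        return IRNode(BINOPS[type(node.op)], (_visit(node.left), _visit(node.right)))
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        if node.func.id not in BUILTINS:
            raise ValueError(f"unknown function: {node.func.id}")
        return IRNode(BUILTINS[node.func.id], tuple(_visit(a) for a in node.args))
    raise ValueError(f"unsupported syntax: {ast.dump(node)}")


# Rolling z-score of a column named `close` over a 20-row window.
zscore = compile_formula("(close - rolling_mean(close, 20)) / rolling_std(close, 20)")
print(zscore.op)  # DIV
```

Because the dataclass is frozen, assigning to `zscore.op` raises an error, which mirrors the immutability the IR relies on. Unknown functions and unsupported syntax are rejected at compile time rather than at execution.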

How It Works

Stage	Role
1. AST Compiler	Parses formula strings via Python's `ast` module, walks the syntax tree, dispatches ~40 built-in functions to IR nodes
2. IR DAG	Frozen, validated DAG nodes (rustworkx) with 50+ compile-time schema contracts — catches type errors before execution
3. Lowering	Translates IR into backend expressions: `pl.Expr` objects (Polars) or DolphinDB DSL strings, dispatched via a decorator-based registry
4. Execution	Runs the lowered expressions — Polars lazy DataFrame pipeline (batch) or DolphinDB stream engines (streaming)
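
As a rough illustration of stage 3, a decorator-based registry can map each IR op to a lowering rule. The sketch below lowers to DSL-style strings only. The `lowers` decorator, the `IRNode` shape, and the op names are assumptions made for this example, and the emitted strings are illustrative rather than FeatureDAG's real output.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class IRNode:
    """Hypothetical immutable IR node (same shape assumed throughout this page)."""
    op: str
    inputs: tuple = ()
    params: tuple = ()


# Registry: IR op name -> function that lowers a node of that op to a DSL string.
_RULES: Dict[str, Callable[[IRNode], str]] = {}


def lowers(op: str):
    """Decorator that registers a lowering rule for one IR op."""
    def register(fn: Callable[[IRNode], str]) -> Callable[[IRNode], str]:
        _RULES[op] = fn
        return fn
    return register


def lower(node: IRNode) -> str:
    """Look up the rule for the node's op and apply it, recursing through inputs."""
    try:
        rule = _RULES[node.op]
    except KeyError:
        raise NotImplementedError(f"no lowering rule for {node.op}") from None
    return rule(node)


@lowers("SOURCE")
def _source(node: IRNode) -> str:
    return str(node.params[0])


@lowers("CONST")
def _const(node: IRNode) -> str:
    return repr(node.params[0])


@lowers("SUB")
def _sub(node: IRNode) -> str:
    left, right = node.inputs
    return f"({lower(left)} - {lower(right)})"


@lowers("ROLLING_MEAN")
def _rolling_mean(node: IRNode) -> str:
    series, window = node.inputs
    # mavg is DolphinDB's moving average; the exact emitted form is illustrative.
    return f"mavg({lower(series)}, {lower(window)})"


close = IRNode("SOURCE", params=("close",))
window = IRNode("CONST", params=(20,))
demeaned = IRNode("SUB", (close, IRNode("ROLLING_MEAN", (close, window))))
dsl = lower(demeaned)
print(dsl)  # (close - mavg(close, 20))
```

A second backend would keep its own registry with the same op keys but return `pl.Expr` objects instead of strings, so adding an op means adding one rule per backend without touching the compiler.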

→ Type System · AST Compiler · IR DAG · Lowering · Execution

Integration

state_engine                    label_engine
    │                                │
    │ (enriched CDM tables)          │ (target labels)
    ▼                                ▼
        feature_engine
        │
        │ (computed features)
        ▼
    sinks (warehouse, Kafka, stream tables)

FeatureDAG, shown as feature_engine above, sits downstream of MarketState. It consumes enriched CDM tables and target labels, and writes computed features to the sinks.

state_engine — Produces enriched CDM tables consumed as SOURCE inputs
label_engine — Generates target labels for supervised learning features
batch_runner — End-to-end batch pipeline: load data → compile formulas → build IR DAG → lower → execute → write to sinks
streaming/dolphindb — Deploys features as DolphinDB streaming pipelines for live trading

Two Execution Paths

Aspect	Batch (Polars)	Streaming (DolphinDB)
Runtime	Python process	DolphinDB cluster
Data model	Static Arrow tables	Unbounded stream tables
Trigger	Explicit `run()`	Continuous — data arrival
Expression model	`pl.Expr` objects (lazy)	DolphinDB DSL strings
Primary use	Research	Live trading, real-time signals

Both paths consume the same feature definitions — no duplicate implementations, no research-production drift.

Key Design Decisions

Template → Instance separation — One FeatureType blueprint yields many FeatureInstance parameterizations
Schema contracts over runtime errors — 50+ OP_CONTRACTS validate column types at DAG construction time
Frozen IRNode — Immutable after construction; no accidental mutation during optimization passes
Protocol-based backends — Python Protocol (structural subtyping), no inheritance required
Per-feature error isolation — One broken feature doesn't block the entire pipeline run
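
Two of these decisions, Protocol-based backends and per-feature error isolation, can be sketched together. Everything here is hypothetical: the `Backend` protocol, `ToyBackend`, and `run_all` are illustrative names, and the `eval` call is only a stand-in for the real compile, lower, and execute steps.

```python
from typing import Dict, Protocol, Tuple


class Backend(Protocol):
    """Structural interface: any object with a matching execute() qualifies."""

    def execute(self, feature: str, formula: str) -> float: ...


class ToyBackend:
    """Does not inherit from Backend; it only matches its shape."""

    def execute(self, feature: str, formula: str) -> float:
        row = {"bid": 99.0, "ask": 101.0}
        # Stand-in for compile + lower + execute, with builtins disabled.
        return float(eval(formula, {"__builtins__": {}}, row))


def run_all(
    backend: Backend, features: Dict[str, str]
) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Execute every feature; a failure is recorded for that feature only."""
    results: Dict[str, float] = {}
    errors: Dict[str, str] = {}
    for name, formula in features.items():
        try:
            results[name] = backend.execute(name, formula)
        except Exception as exc:
            errors[name] = f"{type(exc).__name__}: {exc}"
    return results, errors


results, errors = run_all(
    ToyBackend(),
    {
        "mid": "(bid + ask) / 2",
        "spread": "ask - bid",
        "broken": "ask / volume",  # `volume` is not available, so this one fails
    },
)
print(results)  # {'mid': 100.0, 'spread': 2.0}
print(sorted(errors))  # ['broken']
```

The broken feature ends up in `errors` while the other two still produce values, and `ToyBackend` satisfies `Backend` without subclassing it, which is the point of structural subtyping.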
