
Overview

FeatureDAG is a compiler for quantitative finance features. Formula strings defined in YAML pass through a four-stage pipeline (AST compiler, IR DAG, lowering, and execution) that produces batch (Polars) and streaming (DolphinDB) results from the same specification.


Why

In a typical quant organization, the same features — OFI, rolling z-score, VPIN, momentum signals — are written and maintained independently by multiple teams. Each researcher authors them in notebooks. Each production engineer rewrites them for the live pipeline. Each new hire rediscovers them from scratch. The result:

  • Duplicated effort — teams across the organization rebuild the same core logic, with no shared library or single source of truth
  • Divergent implementations — the same feature computed in research and production can produce different numbers because each version encodes subtly different logic
  • Quality blind spots — without centralized validation, bugs in feature code go undetected; a researcher's notebook error might surface only months later when a model underperforms
  • Notebook → production gap — the research version of a feature doesn't match what runs in production; every new feature requires a full rewrite cycle to bridge this gap

FeatureDAG addresses this by treating features as declarative computation graphs rather than hand-coded functions scattered across notebooks and scripts. A formula string in YAML is parsed with Python's ast module and compiled into an Intermediate Representation (IR), which is then lowered to backend-specific expressions. Same formula, same IR, two backends: one source of truth for the entire organization.

[Formula YAML] → [AST Compiler] → [IR DAG] → [Lowering] → [Execution]
                                                  │
                                       ┌──────────┴──────────┐
                                       ▼                     ▼
                                Polars (batch)     DolphinDB (streaming)

How It Works

| Stage | Role |
|---|---|
| 1. AST Compiler | Parses formula strings via Python's ast module, walks the syntax tree, and dispatches ~40 built-in functions to IR nodes |
| 2. IR DAG | Frozen, validated DAG nodes (rustworkx) with 50+ compile-time schema contracts that catch type errors before execution |
| 3. Lowering | Translates IR into backend expressions: pl.Expr objects (Polars) or DolphinDB DSL strings, dispatched via a decorator-based registry |
| 4. Execution | Runs the lowered expressions on a Polars lazy DataFrame pipeline (batch) or DolphinDB stream engines (streaming) |
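As a rough illustration of stages 1 and 2, the sketch below parses a formula with Python's ast module and maps it to immutable IR nodes. The IRNode fields, the op names, and the two-entry BUILTINS table are assumptions made for this example, not FeatureDAG's actual definitions; the real compiler dispatches about 40 built-ins and validates nodes against schema contracts.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class IRNode:
    """Illustrative immutable IR node (field layout is an assumption)."""
    op: str
    inputs: tuple = ()   # child IRNodes
    params: tuple = ()   # literal parameters such as window sizes


# Hypothetical dispatch tables: formula syntax -> IR op name.
BUILTINS = {"rolling_mean": "ROLLING_MEAN", "rolling_std": "ROLLING_STD"}
BINOPS = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL", ast.Div: "DIV"}


def compile_formula(formula: str) -> IRNode:
    """Parse a formula string as a single expression and convert it to IR."""
    tree = ast.parse(formula, mode="eval")
    return _visit(tree.body)


def _visit(node: ast.AST) -> IRNode:
    if isinstance(node, ast.Name):
        # A bare name refers to an input column.
        return IRNode("SOURCE", params=(node.id,))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return IRNode("LITERAL", params=(node.value,))
    if isinstance(node, ast.BinOp) and type(node.op) in BINOPS:
        return IRNode(BINOPS[type(node.op)], (_visit(node.left), _visit(node.right)))
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        name = node.func.id
        if name not in BUILTINS:
            raise ValueError(f"unknown function: {name}")
        if not node.args:
            raise ValueError(f"{name} needs at least one argument")
        # Convention for this sketch: first argument is an expression,
        # the remaining arguments must be literals.
        first, *rest = node.args
        params = tuple(ast.literal_eval(arg) for arg in rest)
        return IRNode(BUILTINS[name], (_visit(first),), params)
    raise ValueError(f"unsupported syntax: {ast.dump(node)}")


zscore = compile_formula("(price - rolling_mean(price, 20)) / rolling_std(price, 20)")
print(zscore.op)             # DIV
print(zscore.inputs[1].op)   # ROLLING_STD
```

Because the nodes are frozen dataclasses, two structurally identical sub-expressions compare equal, which is one plausible way a DAG builder could deduplicate shared inputs.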

Type System · AST Compiler · IR DAG · Lowering · Execution


Integration

state_engine                  label_engine
     │                             │
     │ (enriched CDM tables)       │ (target labels)
     ▼                             ▼
               feature_engine
                     │
                     │ (computed features)
                     ▼
     sinks (warehouse, Kafka, stream tables)

FeatureDAG sits downstream of MarketState. It consumes enriched CDM tables and target labels, and outputs computed features.

  • state_engine — Produces enriched CDM tables consumed as SOURCE inputs
  • label_engine — Generates target labels for supervised learning features
  • batch_runner — End-to-end batch pipeline: load data → compile formulas → build IR DAG → lower → execute → write to sinks
  • streaming/dolphindb — Deploys features as DolphinDB streaming pipelines for live trading

Two Execution Paths

| Aspect | Batch (Polars) | Streaming (DolphinDB) |
|---|---|---|
| Runtime | Python process | DolphinDB cluster |
| Data model | Static Arrow tables | Unbounded stream tables |
| Trigger | Explicit run() | Continuous (on data arrival) |
| Expression model | pl.Expr objects (lazy) | DolphinDB DSL strings |
| Primary use | Research | Live trading, real-time signals |

Both paths consume the same feature definitions — no duplicate implementations, no research-production drift.
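One way to picture how a single definition reaches both backends is a decorator-based registry keyed by backend and op. The sketch below is an assumption-laden stand-in, not FeatureDAG's registry: the IRNode shape, the `lowers` decorator, and the op names are hypothetical, and to stay dependency-free the Polars rules emit source text for an expression rather than a real pl.Expr object. The DolphinDB rule assumes `mavg(X, window)` as the moving-average function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class IRNode:
    """Illustrative IR node (field layout is an assumption)."""
    op: str
    inputs: tuple = ()
    params: tuple = ()


# (backend, op) -> lowering rule; populated by the decorator below.
_LOWERINGS: dict[tuple[str, str], Callable[[IRNode, list[str]], str]] = {}


def lowers(backend: str, op: str):
    """Register a lowering rule for one op on one backend."""
    def register(fn):
        _LOWERINGS[(backend, op)] = fn
        return fn
    return register


def lower(node: IRNode, backend: str) -> str:
    """Lower children first, then apply the rule registered for this op."""
    args = [lower(child, backend) for child in node.inputs]
    try:
        rule = _LOWERINGS[(backend, node.op)]
    except KeyError:
        raise NotImplementedError(f"{backend} cannot lower {node.op}") from None
    return rule(node, args)


@lowers("polars", "SOURCE")
def polars_source(node, args):
    return f'pl.col("{node.params[0]}")'


@lowers("polars", "ROLLING_MEAN")
def polars_rolling_mean(node, args):
    return f"{args[0]}.rolling_mean(window_size={node.params[0]})"


@lowers("dolphindb", "SOURCE")
def ddb_source(node, args):
    return node.params[0]


@lowers("dolphindb", "ROLLING_MEAN")
def ddb_rolling_mean(node, args):
    return f"mavg({args[0]}, {node.params[0]})"


# Subtraction reads the same in both targets, so one rule is registered twice.
@lowers("polars", "SUB")
@lowers("dolphindb", "SUB")
def lower_sub(node, args):
    return f"({args[0]} - {args[1]})"


price = IRNode("SOURCE", params=("price",))
demeaned = IRNode("SUB", (price, IRNode("ROLLING_MEAN", (price,), (20,))))

print(lower(demeaned, "polars"))
print(lower(demeaned, "dolphindb"))
```

The IR tree is built once and never inspected for its target; only the registry lookup differs per backend, which is what keeps the batch and streaming results tied to the same definition.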


Key Design Decisions

  1. Template → Instance separation — One FeatureType blueprint yields many FeatureInstance parameterizations
  2. Schema contracts over runtime errors — 50+ OP_CONTRACTS validate column types at DAG construction time
  3. Frozen IRNode — Immutable after construction; no accidental mutation during optimization passes
  4. Protocol-based backends — Python Protocol (structural subtyping), no inheritance required
  5. Per-feature error isolation — One broken feature doesn't block the entire pipeline run
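Decisions 2, 3, and 5 can be sketched together in a few lines. Everything here is hypothetical: the shape of the OP_CONTRACTS entries, the Node fields, and the build_all helper are invented for illustration, and the real contracts presumably cover far more than a single input dtype.

```python
from dataclasses import dataclass


class ContractError(TypeError):
    """Raised when a node's input does not satisfy its op contract."""


# Hypothetical contracts: op name -> dtype required of the input column.
OP_CONTRACTS = {"rolling_mean": "float", "count_distinct": "str"}


@dataclass(frozen=True)
class Node:
    """Frozen node that checks its contract at construction time."""
    op: str
    column: str
    dtype: str

    def __post_init__(self):
        expected = OP_CONTRACTS.get(self.op)
        if expected is None:
            raise ContractError(f"unknown op: {self.op}")
        if self.dtype != expected:
            raise ContractError(
                f"{self.op} expects {expected}, got {self.dtype} for {self.column!r}"
            )


def build_all(specs: dict[str, tuple[str, str, str]]):
    """Build every feature; collect failures instead of aborting the run."""
    built, failed = {}, {}
    for name, (op, column, dtype) in specs.items():
        try:
            built[name] = Node(op, column, dtype)
        except ContractError as exc:
            failed[name] = str(exc)
    return built, failed


built, failed = build_all({
    "px_mean": ("rolling_mean", "price", "float"),
    "sym_mean": ("rolling_mean", "symbol", "str"),  # violates the contract
})
print(sorted(built))   # ['px_mean']
print(failed)
```

The invalid feature is reported with its contract violation while the valid one is still built, and the resulting node cannot be reassigned afterwards because the dataclass is frozen.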

See Also