Skip to main content

Overview

How feature DAGs become running computations: the Polars batch path for research, and the DolphinDB streaming path for live production. Same feature definitions, two execution models.


Architecture

The execution layer sits below the type system and IR compiler. It receives resolved FeatureInstance objects and produces computed feature values at the destination — Arrow tables for batch, shared stream tables for streaming.

FeatureInstance[]

┌───────────────┴───────────────┐
▼ ▼
Batch Path (Polars) Streaming Path (DolphinDB)
│ │
▼ ▼
Warehouse tables DolphinDB stream tables

Key principle: feature definitions (YAML → FeatureType → FeatureInstance) are identical across both paths. The divergence happens only at the expression-generation layer.

Engine-Agnostic by Design

Polars and DolphinDB are the current backends — they are not the only possible backends. The execution layer is built on a backend protocol (structural subtyping, ~20 methods) and a lowering registry (decorator-based dispatch). Adding a new engine means:

  1. Implement the BaseBackend protocol for the target engine
  2. Register lowering functions via @register(Primitive, op) for engine-specific operations
  3. Backend-agnostic operations (~30 in base.py) come free — they call only protocol methods

Same IR, same semantics, any backend. The compiler doesn't care whether it's targeting a warehouse, a stream processor, or a GPU runtime.


Batch vs. Streaming

AspectBatch (Polars)Streaming (DolphinDB)
RuntimePython process (single machine)DolphinDB cluster (distributed)
Data modelStatic Arrow tablesUnbounded stream tables
Execution triggerExplicit run() callContinuous — data arrival triggers computation
Expression modelpl.Expr objects (lazy)DolphinDB DSL strings (deployed as engines)
Feature groupingSequential per-featureConsolidated engines by input table
Array handlingmap_elements Python UDFsHandler functions (DolphinDB script)
Intermediate stateIn-memory LazyFrame columnsShared stream tables between engines
Error handlingFeature-level try/except, skip on failureEngine-level monitoring, re-deployment
BackfillingNatural — run over any date rangeRequires historical replay path
Primary useResearch, batch scoringLive trading, real-time signal generation

Both paths consume the same FeatureInstance objects — single definition, dual runtime.



Mode Polymorphism

Both backends support three computation modes, configured per feature:

ModeWhen Features ComputeDDB ImplementationPolars Implementation
tickPer-event (every tick)createReactiveStateEngineDirect column expressions
barPer-bar (OHLCV boundary)createTimeSeriesEngineFiltered by bar_type
tick_to_barTick features → bar aggregateTwo-phase: tick engines + bar createTimeSeriesEngineFeature DAG + temporal join + group-by agg

tick: Lowest latency. Features update on every market data event. Used for HFT signals, real-time spread monitoring, and execution algorithms that need per-event updates.

bar: Features compute on bar boundaries. Used for bar-native strategies (OHLCV pattern recognition, bar-level momentum) where tick granularity is unnecessary and would create noise.

tick_to_bar: The most common intraday mode. Features are computed at tick resolution (capturing microstructure dynamics), then projected onto a bar grid via temporal aggregation with per-feature aggregation methods (mean, std, last, sum, min, max). This captures tick-level information while producing bar-aligned outputs suitable for ML model training.


Shared IR + Lowering Registry

Both execution paths share the same compilation infrastructure:

Feature definitions

FeatureType definitions (shared metadata)

QuantflowIRBuilder (shared IR DAG via rustworkx)

Lowering Registry (@register dispatch)
/ \
/ \
DolphinDB Polars
streaming batch

Why this matters: A feature's formula is compiled to an IR DAG once. The lowering registry dispatches the same (Primitive, op) pairs to backend-specific functions. @register(Primitive.WINDOW, "rolling_mean") registers the Polars implementation. @register(Primitive.WINDOW, "rolling_mean", backend="dolphindb") registers the DolphinDB implementation. resolve_lowering(node, backend_name) selects the correct one at runtime. Adding a new backend means registering lowering functions — no IR changes needed.


Deployment Lifecycle

Both paths follow a similar lifecycle, adapted to their runtime models:

Batch (Polars) — run-once lifecycle:

  1. Load project metadata (features, engine config)
  2. Read source CDM tables from warehouse per symbol
  3. Merge tables on time/symbol keys
  4. Compile formulas → IR DAG via QuantflowIRBuilder
  5. Execute DAG via PolarsDAGExecutor (lazy, collect() at end)
  6. Apply post-processing: transforms → bar alignment (if tick_to_bar) → normalization
  7. Write results to warehouse sinks
  8. Next symbol — cleared memory, fresh computation

Streaming (DolphinDB) — deploy-and-forget lifecycle:

  1. Load project metadata (features, engine config)
  2. Connect to DolphinDB cluster (pool, retry)
  3. Deploy ingest (create stream tables, start data sources)
  4. Deploy process (state engine: bars, snapshots, trade enrichment)
  5. Deploy features (IR DAG → layer-based topological engine deployment)
  6. Deploy feature output tables (dimension-aware consolidation)
  7. Python disconnects — pipeline runs autonomously inside DolphinDB
  8. Monitor via getStreamEngineStat(), tear down in reverse order on stop

The batch path is run-once-per-symbol. The streaming path is deploy-once-run-continuously. This fundamental difference shapes every design decision in the execution layer.


Batch Execution (Polars) · Streaming Execution (DolphinDB)