Skip to main content

DolphinDB

DolphinDB is a high-performance distributed time-series database and stream processing engine. QuantFlow uses it as the streaming execution backend — all real-time feature computation runs inside the DolphinDB cluster, with Python acting only as the deployment and monitoring control plane.


Why DolphinDB

QuantFlow evaluated multiple streaming engines and chose DolphinDB for several reasons that align with the demands of production quantitative finance:

Native time-series engine. DolphinDB is purpose-built for tick-level financial data. Its storage engine, query language, and stream processing primitives are designed for time-series workloads — not retrofitted from a general-purpose database. Concepts like createTimeSeriesEngine, createReactiveStateEngine, asof join, and context by are first-class language constructs, not add-on plugins.

Stream-table architecture. DolphinDB's shared stream tables enable a clean pub/sub pipeline: an upstream engine writes to a stream table, a downstream engine subscribes to it. Stages are decoupled — each can be deployed, monitored, and torn down independently. This maps directly to QuantFlow's ingest → process → feature pipeline stages.

Columnar execution with vectorized operations. Like Polars (the batch backend), DolphinDB operates on columns, not rows. Rolling window functions (mavg, msum, mstd, mcorr), arithmetic, and conditional logic are all vectorized. A single expression processes millions of rows in milliseconds.

In-process deployment — no Python in the hot path. Once deployed, the entire pipeline runs inside the DolphinDB server process. Python generates deployment scripts (DSL strings), sends them to the cluster, and disconnects. The streaming pipeline continues running autonomously. This avoids the GIL, serialization overhead, and network round-trips that would come with Python-in-the-loop.

Distributed scale. DolphinDB clusters can scale horizontally across nodes, with automatic data partitioning and distributed query execution. A single cluster can handle thousands of instruments with tick-level updates.


Key Concepts

Stream Tables

A stream table is a special table type that acts as a pub/sub channel. Publishers append rows; subscribers receive them asynchronously. Stream tables are the only communication mechanism between QuantFlow pipeline stages — no REST calls, no message queues, no shared memory.

# Created via Python (deployment script)
share streamTable(100000:0, `symbol`venue`event_time`price`volume,
[SYMBOL, SYMBOL, TIMESTAMP, DOUBLE, DOUBLE]) as trades

Reactive State Engine

A createReactiveStateEngine processes events one at a time, applying metric expressions to each arriving row and emitting results to an output table. This is the primary engine type for tick-mode feature computation.

createReactiveStateEngine(
name="ofi_engine",
metrics=[<cumsum((deltas(best_bid_size)-deltas(best_ask_size)), 0)>],
dummyTable=trades,
outputTable=ofi_output,
keyColumn=`symbol`venue)

Key constraint (Community Edition): metrics cannot use @state functions or perform arithmetic on array vector elements. QuantFlow works around this with handler functions (see below) and consolidated deployment strategies.

Time Series Engine

A createTimeSeriesEngine aggregates events over fixed time windows, emitting one row per window. Used for bar-mode and tick_to_bar Phase B feature computation.

createTimeSeriesEngine(
name="bar_engine",
windowSize=60000, # 60-second bars
metrics=[<first(price)>, <last(price)>, <max(price)>, <min(price)>, <sum(volume)>],
dummyTable=trades,
outputTable=bars_output,
keyColumn=`symbol`venue)

Handler Functions

When engine metric expressions can't express a computation (array operations on Community Edition, complex state machines), DolphinDB handler functions fill the gap. A handler is a DolphinDB function that subscribes to a stream table, iterates over arriving batches row-by-row, and appends results to an output table.

def handler(msg):
output = array(DOUBLE, msg.size())
for i in 0..(msg.size()-1):
# Parse JSON, compute feature, store result
bids = fromStdJson(msg.bids[i])
asks = fromStdJson(msg.asks[i])
output[i] = (sum(bids) - sum(asks)) / (sum(bids) + sum(asks) + 1e-8)
append!(output_table, output)

QuantFlow's consolidated LOB handler parses the bids/asks JSON once per row and computes all LOB features (depth_imbalance, book_pressure, weighted_imbalance) in a single loop, avoiding redundant JSON parsing.


How QuantFlow Uses DolphinDB

DSL Generation, Not Data Processing

QuantFlow never processes streaming data in Python. Instead, it generates DolphinDB DSL scripts as Python strings and executes them via conn.run(). Every deployment — table creation, engine creation, subscription, handler definition — is a generated script. This is possible because DolphinDB's entire streaming topology is defined declaratively in its scripting language.

Deployment Pattern

Python (QuantFlow) DolphinDB Cluster
───────────────── ─────────────────
1. Generate CREATE scripts

conn.run(script) ─────────────────────→ 2. Execute scripts
3. Create stream tables
4. Create engines
5. Subscribe engines
6. Generate handler scripts

conn.run(script) ─────────────────────→ 7. Deploy handlers
8. Python disconnects
9. Pipeline runs autonomously
- Data flows through stages
- Features compute continuously
- Output tables fill
10. Monitor (periodic)

conn.run("getStreamEngineStat()") ──→ 11. Return health metrics

Community Edition Considerations

QuantFlow targets DolphinDB Community Edition (free, widely available), which has two notable constraints:

  1. Array arithmetic in engine metrics: Array vector element access inside createReactiveStateEngine metrics does not support arithmetic operations. The handler pattern works around this (see Streaming Feature Engine).

  2. No @state functions in metrics: Stateful accumulations require handler-based or multi-engine workarounds for complex state machines.

Enterprise Edition lifts both constraints, enabling more concise engine definitions.


Performance at Scale

Quantitative trading generates data volumes that break conventional databases. A single liquid instrument produces ~10M ticks/day across trades and order book updates. A 1,000-instrument universe generates 10B+ events daily. At this scale, every millisecond of query latency compounds across thousands of features, hundreds of instruments, and continuous real-time evaluation.

DolphinDB was built for exactly this problem.

Architectural Speed

DolphinDB achieves its performance through several architectural decisions that align perfectly with financial time-series workloads:

C++ native core. The entire engine — storage, query execution, stream processing, and the scripting language runtime — is implemented in C++ with no garbage collection. SIMD vectorization accelerates aggregation operations on modern CPUs. There is no JVM warmup, no GC pause, no interpreted-language overhead in the data path.

Columnar storage and execution. Data is stored and processed column-by-column, not row-by-row. A rolling_mean(close, 20) query reads only the close column, touches only the relevant rows, and applies a vectorized window function. This is the same execution model as Polars (QuantFlow's batch backend) and Apache Arrow — zero impedance mismatch between batch research and streaming production.

In-process streaming. Every engine, handler, and subscription runs inside a single DolphinDB process. Data flows through the pipeline stages via in-memory stream tables — no network hops, no serialization, no context switches between stages. A tick arriving at the ingest stage reaches the feature output stage in microseconds.

Native time-series primitives. Functions like mavg, mstd, mcorr, asof join, and context by are implemented as native C++ operators, not SQL wrappers. They operate directly on compressed columnar data with memory access patterns optimized for sequential scans.

The Scale of Real-World Trading

QuantFlow's streaming pipeline on DolphinDB handles workloads that would be infeasible with Python-in-the-loop architectures:

ScaleWhat It Means
100+ instrumentsSimultaneous tick-by-tick feature computation across an entire liquid universe
10M+ events/day per instrumentEvery trade, every quote update, every order book change processed individually
50+ features per instrumentFull feature deployment — signal, quality, regime, stability, execution features
Sub-millisecond latencyFrom market data arrival to feature output update
24/7 continuous operationNo batch windows, no restart cycles — true continuous streaming for crypto markets

Real-World Performance

Independent benchmarks and user reports consistently place DolphinDB ahead of general-purpose time-series databases and on par with or exceeding specialized financial databases:

  • Ingestion: 10M+ rows/second sustained ingest on commodity hardware
  • Aggregation queries: 100M+ rows/second for group by symbol aggregation over tick data
  • Rolling window functions: 10-50x faster than Python/Pandas, 3-10x faster than general-purpose columnar databases (ClickHouse, DuckDB) on financial time-series patterns
  • Streaming latency: Single-digit milliseconds end-to-end from ingest to feature output

The performance gap widens as data volumes grow — DolphinDB's vectorized execution and columnar compression scale linearly with data size while row-based and interpreted approaches degrade super-linearly.

Learn More

  • DolphinDB Official Benchmarks — detailed performance comparisons across ingestion, query, and streaming workloads
  • DolphinDB YouTube Channel — technical deep dives, benchmarking demos, and performance optimization guides
  • STAC-M3 Benchmarks — the independent industry standard for financial time-series database performance (search for DolphinDB results)
  • DolphinDB Blog — architecture deep dives, customer case studies, and performance tuning guides

Cross-Backend Consistency

DolphinDB and Polars (the batch backend) consume the same IR DAG — no duplicate feature definitions. Cross-backend consistency is validated through:

  • Shared lowering registry: @register(Primitive.WINDOW, "rolling_mean", backend="dolphindb") and @register(Primitive.WINDOW, "rolling_mean") (Polars default) register backend-specific implementations for the same (primitive, op) key.
  • Stable numerical methods: The DDB rolling_std uses a two-pass stable formula (sqrt(mavg((x - mavg(x,w))^2, w) * w/(w-1))) to match Polars' ddof=1 behavior and avoid catastrophic cancellation on large-magnitude inputs.
  • Identical IR: Both backends receive the same IRNode DAG from QuantflowIRBuilder, ensuring structural equivalence before lowering.

Streaming Overview · Streaming Feature Engine