Engine-Agnostic Data Layer

The same pipeline definition works across all database backends without code changes. Choose the right engine for your use case, then switch by changing one line of config.

DBEngine Interface

Every engine implements the same abstract interface. Pipelines never import engine-specific code.

API	Method	Description
Read	`read_arrow(table, columns, filters)` → `pa.Table`	Parameterized queries with range/equality filters
Read	`read_sql(sql)` → `pa.Table`	Arbitrary SQL execution
Write	`write_arrow(table, data, mode, unique_key)`	`append`, `overwrite`, or `delete_insert` (upsert)
Schema	`create_table(table, schema)`	DDL from Arrow schema with full type mapping
Validate	`validate()` → `bool`	Connectivity health check

Engine Registry

Engines are registered by name and created via a single factory call:

from quantflow.io import DBRegistry

DBRegistry.register("duckdb", DuckDBEngine)
DBRegistry.register("openlakehouse", OpenLakehouseEngine)
DBRegistry.register("s3_iceberg", OpenLakehouseEngine)   # backward compat alias
DBRegistry.register("trino", OpenLakehouseEngine)         # backward compat alias
DBRegistry.register("bigquery", BigQueryEngine)
DBRegistry.register("snowflake", SnowflakeEngine)
DBRegistry.register("databricks", DatabricksEngine)

# Factory usage — same call, any engine
engine = DBRegistry.create("duckdb", config)
engine = DBRegistry.create("openlakehouse", config)
engine = DBRegistry.create("bigquery", config)

Engine Options

DuckDB (Embedded)

An embedded columnar database that runs in-process — no server, no daemon, no network configuration. The default engine for local development.

Pros	Cons
Zero setup — `pip install` is all you need	Single-machine only, no distributed queries
Full pipeline runs on a laptop	Not suitable for multi-user production workloads
File-based persistence, easy to version and share	Limited concurrency
Partition-aware scans on Parquet files

When to use: Development, research, backtesting, single-machine batch pipelines. Start here — switching to another engine later is a config change.

OpenLakehouse (S3 + Parquet + Iceberg + Trino)

A unified backend that combines S3-compatible object storage, Apache Parquet files, Apache Iceberg table format, and Trino for federated SQL queries. All three registry names (openlakehouse, s3_iceberg, trino) resolve to the same engine.

Component	Role
S3 / Object Store	Persistent, cost-effective storage for HFT-scale data
Apache Parquet	Columnar format — compressed, predicate-pushdown, partition-aware
Apache Iceberg	Table format — ACID transactions, time-travel, schema evolution, hidden partitioning
Trino	Federated SQL — joins across Iceberg catalogs and external sources

The write path: Ray workers write Parquet to staging → commit_files() registers files in the Iceberg catalog → Trino queries the committed table. The read path: Trino resolves the Iceberg snapshot, applies partition pruning, and scans only the relevant Parquet files.

Pros	Cons
Cost-effective — S3 storage is cheap, compute is separate	Requires infrastructure: S3 bucket, Iceberg catalog (JDBC), Trino cluster
Iceberg ACID — safe concurrent writes, time-travel queries	More operational complexity than DuckDB
Trino federated queries — join CDM tables with external data sources	Cold-start query latency higher than embedded engines
Scales to billions of HFT events

When to use: Production data plane, HFT-scale storage, multi-user research, any workload where data must outlive a single machine. The standard production target.

→ OpenLakehouse — full engine documentation, storage layout, and write flow details

BigQuery

Google Cloud's serverless data warehouse. Fully managed, no infrastructure to operate.

Pros	Cons
Serverless — no cluster management, auto-scaling	GCP-only, vendor lock-in
BI Engine — sub-second response on dashboards	Query costs scale with data scanned
Native partitioning and clustering	Limited per-query execution time
Built-in ML (BigQuery ML)

When to use: Organizations already on GCP, workloads that benefit from serverless auto-scaling, BI dashboarding on top of QuantFlow data.

Snowflake

Multi-cloud data warehouse with strong governance, data sharing, and time-travel features.

Pros	Cons
Multi-cloud (AWS, Azure, GCP)	Warehouse sizing requires active management
Time-travel and zero-copy cloning	Compute and storage costs are coupled in standard edition
Secure data sharing across organizations	Cold-start for suspended warehouses
Strong RBAC and governance

When to use: Enterprise environments with governance requirements, multi-cloud strategies, organizations that already use Snowflake for analytics.

Databricks

Lakehouse platform combining data engineering, SQL analytics, and ML on a single platform.

Pros	Cons
Unified platform — SQL, Python, ML in one workspace	Requires cluster management
Unity Catalog for governance	More complex than pure SQL warehouses
Optimized for Spark/ML workloads
Delta Lake / Iceberg interoperability

When to use: Teams doing both data engineering and ML, organizations standardized on Databricks, workloads that combine SQL analytics with Spark-based feature engineering.

DolphinDB (Streaming)

A native timeseries engine for real-time, event-driven computation. Not a general-purpose storage engine — it is the compute engine for streaming pipelines.

Capability	Detail
Stream tables	Shared in-memory ingestion tables with pub/sub semantics
Reactive state engines	Stateful streaming computation triggered on every incoming tick
Sub-ms queries	Native timeseries engine optimized for HFT workloads
Unified query	Same SQL for both historical and live data

Pros	Cons
Sub-ms latency on streaming data	Requires a DolphinDB server (Community Edition available)
Native timeseries primitives — no emulation	Not a general-purpose data warehouse
Unified historical + real-time query surface	Limited ecosystem compared to cloud warehouses

When to use: Live trading, real-time signal generation, streaming feature computation. Deploys alongside a storage engine — DuckDB for batch research, OpenLakehouse for historical data, DolphinDB for the live path.

Engine Selection Guide

Scenario	Recommended Engine
Getting started, local development	DuckDB
Single-machine batch research	DuckDB
Production data lake, HFT-scale storage	OpenLakehouse
Multi-user research on shared data	OpenLakehouse
Serverless analytics, BI dashboards	BigQuery
Enterprise governance, data sharing	Snowflake
Combined data engineering + ML	Databricks
Live trading, real-time signals	DolphinDB

QFSQL: Engine-Agnostic Expressions

QFSQL (QuantFlow SQL) is the expression language that makes engine portability possible. Write QFSQL in field mappings, quality tests, and feature definitions — each engine adapter translates it to native SQL.

QFSQL expression  →  Engine adapter  →  Native SQL

QFSQL	DuckDB	BigQuery	Snowflake
`cast_safe(price, numeric)`	`CAST(price AS DOUBLE)`	`SAFE_CAST(price AS NUMERIC)`	`TRY_CAST(price AS NUMBER)`
`current_timestamp()`	`now()`	`CURRENT_TIMESTAMP()`	`CURRENT_TIMESTAMP()`
`timestamp_ms(ts)`	`TO_TIMESTAMP(ts / 1000)`	`TIMESTAMP_MILLIS(ts)`	`TO_TIMESTAMP(ts, 3)`

See the dbt Generator for the full 60-function catalog and the QFSQL Reference for the complete language specification.

Configuration

Each engine is configured in .local_config.yml:

engine:
  # Embedded — zero setup
  - name: duckdb
    database: "./data/{{project_name}}/quantflow.duckdb"

  # Lakehouse — production data plane
  - name: openlakehouse
    s3:
      bucket: "quantflow-data"
      region: "eu-central-003"
      endpoint: "https://s3.eu-central-003.backblazeb2.com"
    iceberg:
      catalog_type: "jdbc"
      catalog_uri: "jdbc:postgresql://localhost:5432/iceberg"
    trino:
      host: "localhost"
      port: 7010
      catalog: "quantflow_canonical"

  # Cloud warehouse
  - name: bigquery
    project: "my-quant-project"
    dataset: "quantflow_data"
    location: "US"

  # Streaming compute
  - name: dolphindb
    host: "localhost"
    port: 8848
    database: "quantflow_db"
    auth: password
    key:
      username: "admin"
      password: "123456"

Set the active engine in quantflow_project.yml:

data_processing:
  historical_data_engine: duckdb         # local dev
  # historical_data_engine: openlakehouse  # production
  # historical_data_engine: bigquery       # cloud analytics

DBEngine Interface​

Engine Registry​

Engine Options​

DuckDB (Embedded)​

OpenLakehouse (S3 + Parquet + Iceberg + Trino)​

BigQuery​

Snowflake​

Databricks​

DolphinDB (Streaming)​

Engine Selection Guide​

QFSQL: Engine-Agnostic Expressions​

Configuration​