Skip to main content

Engine-Agnostic Data Layer

The same pipeline definition works across all database backends without code changes. Choose the right engine for your use case, then switch by changing one line of config.


DBEngine Interface

Every engine implements the same abstract interface. Pipelines never import engine-specific code.

APIMethodDescription
Readread_arrow(table, columns, filters)pa.TableParameterized queries with range/equality filters
Readread_sql(sql)pa.TableArbitrary SQL execution
Writewrite_arrow(table, data, mode, unique_key)append, overwrite, or delete_insert (upsert)
Schemacreate_table(table, schema)DDL from Arrow schema with full type mapping
Validatevalidate()boolConnectivity health check

Engine Registry

Engines are registered by name and created via a single factory call:

from quantflow.io import DBRegistry

DBRegistry.register("duckdb", DuckDBEngine)
DBRegistry.register("openlakehouse", OpenLakehouseEngine)
DBRegistry.register("s3_iceberg", OpenLakehouseEngine) # backward compat alias
DBRegistry.register("trino", OpenLakehouseEngine) # backward compat alias
DBRegistry.register("bigquery", BigQueryEngine)
DBRegistry.register("snowflake", SnowflakeEngine)
DBRegistry.register("databricks", DatabricksEngine)

# Factory usage — same call, any engine
engine = DBRegistry.create("duckdb", config)
engine = DBRegistry.create("openlakehouse", config)
engine = DBRegistry.create("bigquery", config)

Engine Options

DuckDB (Embedded)

An embedded columnar database that runs in-process — no server, no daemon, no network configuration. The default engine for local development.

ProsCons
Zero setup — pip install is all you needSingle-machine only, no distributed queries
Full pipeline runs on a laptopNot suitable for multi-user production workloads
File-based persistence, easy to version and shareLimited concurrency
Partition-aware scans on Parquet files

When to use: Development, research, backtesting, single-machine batch pipelines. Start here — switching to another engine later is a config change.


OpenLakehouse (S3 + Parquet + Iceberg + Trino)

A unified backend that combines S3-compatible object storage, Apache Parquet files, Apache Iceberg table format, and Trino for federated SQL queries. All three registry names (openlakehouse, s3_iceberg, trino) resolve to the same engine.

ComponentRole
S3 / Object StorePersistent, cost-effective storage for HFT-scale data
Apache ParquetColumnar format — compressed, predicate-pushdown, partition-aware
Apache IcebergTable format — ACID transactions, time-travel, schema evolution, hidden partitioning
TrinoFederated SQL — joins across Iceberg catalogs and external sources

The write path: Ray workers write Parquet to staging → commit_files() registers files in the Iceberg catalog → Trino queries the committed table. The read path: Trino resolves the Iceberg snapshot, applies partition pruning, and scans only the relevant Parquet files.

ProsCons
Cost-effective — S3 storage is cheap, compute is separateRequires infrastructure: S3 bucket, Iceberg catalog (JDBC), Trino cluster
Iceberg ACID — safe concurrent writes, time-travel queriesMore operational complexity than DuckDB
Trino federated queries — join CDM tables with external data sourcesCold-start query latency higher than embedded engines
Scales to billions of HFT events

When to use: Production data plane, HFT-scale storage, multi-user research, any workload where data must outlive a single machine. The standard production target.

OpenLakehouse — full engine documentation, storage layout, and write flow details


BigQuery

Google Cloud's serverless data warehouse. Fully managed, no infrastructure to operate.

ProsCons
Serverless — no cluster management, auto-scalingGCP-only, vendor lock-in
BI Engine — sub-second response on dashboardsQuery costs scale with data scanned
Native partitioning and clusteringLimited per-query execution time
Built-in ML (BigQuery ML)

When to use: Organizations already on GCP, workloads that benefit from serverless auto-scaling, BI dashboarding on top of QuantFlow data.


Snowflake

Multi-cloud data warehouse with strong governance, data sharing, and time-travel features.

ProsCons
Multi-cloud (AWS, Azure, GCP)Warehouse sizing requires active management
Time-travel and zero-copy cloningCompute and storage costs are coupled in standard edition
Secure data sharing across organizationsCold-start for suspended warehouses
Strong RBAC and governance

When to use: Enterprise environments with governance requirements, multi-cloud strategies, organizations that already use Snowflake for analytics.


Databricks

Lakehouse platform combining data engineering, SQL analytics, and ML on a single platform.

ProsCons
Unified platform — SQL, Python, ML in one workspaceRequires cluster management
Unity Catalog for governanceMore complex than pure SQL warehouses
Optimized for Spark/ML workloads
Delta Lake / Iceberg interoperability

When to use: Teams doing both data engineering and ML, organizations standardized on Databricks, workloads that combine SQL analytics with Spark-based feature engineering.


DolphinDB (Streaming)

A native timeseries engine for real-time, event-driven computation. Not a general-purpose storage engine — it is the compute engine for streaming pipelines.

CapabilityDetail
Stream tablesShared in-memory ingestion tables with pub/sub semantics
Reactive state enginesStateful streaming computation triggered on every incoming tick
Sub-ms queriesNative timeseries engine optimized for HFT workloads
Unified querySame SQL for both historical and live data
ProsCons
Sub-ms latency on streaming dataRequires a DolphinDB server (Community Edition available)
Native timeseries primitives — no emulationNot a general-purpose data warehouse
Unified historical + real-time query surfaceLimited ecosystem compared to cloud warehouses

When to use: Live trading, real-time signal generation, streaming feature computation. Deploys alongside a storage engine — DuckDB for batch research, OpenLakehouse for historical data, DolphinDB for the live path.


Engine Selection Guide

ScenarioRecommended Engine
Getting started, local developmentDuckDB
Single-machine batch researchDuckDB
Production data lake, HFT-scale storageOpenLakehouse
Multi-user research on shared dataOpenLakehouse
Serverless analytics, BI dashboardsBigQuery
Enterprise governance, data sharingSnowflake
Combined data engineering + MLDatabricks
Live trading, real-time signalsDolphinDB

QFSQL: Engine-Agnostic Expressions

QFSQL (QuantFlow SQL) is the expression language that makes engine portability possible. Write QFSQL in field mappings, quality tests, and feature definitions — each engine adapter translates it to native SQL.

QFSQL expression → Engine adapter → Native SQL
QFSQLDuckDBBigQuerySnowflake
cast_safe(price, numeric)CAST(price AS DOUBLE)SAFE_CAST(price AS NUMERIC)TRY_CAST(price AS NUMBER)
current_timestamp()now()CURRENT_TIMESTAMP()CURRENT_TIMESTAMP()
timestamp_ms(ts)TO_TIMESTAMP(ts / 1000)TIMESTAMP_MILLIS(ts)TO_TIMESTAMP(ts, 3)

See the dbt Generator for the full 60-function catalog and the QFSQL Reference for the complete language specification.


Configuration

Each engine is configured in .local_config.yml:

engine:
# Embedded — zero setup
- name: duckdb
database: "./data/{{project_name}}/quantflow.duckdb"

# Lakehouse — production data plane
- name: openlakehouse
s3:
bucket: "quantflow-data"
region: "eu-central-003"
endpoint: "https://s3.eu-central-003.backblazeb2.com"
iceberg:
catalog_type: "jdbc"
catalog_uri: "jdbc:postgresql://localhost:5432/iceberg"
trino:
host: "localhost"
port: 7010
catalog: "quantflow_canonical"

# Cloud warehouse
- name: bigquery
project: "my-quant-project"
dataset: "quantflow_data"
location: "US"

# Streaming compute
- name: dolphindb
host: "localhost"
port: 8848
database: "quantflow_db"
auth: password
key:
username: "admin"
password: "123456"

Set the active engine in quantflow_project.yml:

data_processing:
historical_data_engine: duckdb # local dev
# historical_data_engine: openlakehouse # production
# historical_data_engine: bigquery # cloud analytics