Engine-Agnostic Data Layer
The same pipeline definition works across all database backends without code changes. Choose the right engine for your use case, then switch by changing one line of config.
DBEngine Interface
Every engine implements the same abstract interface. Pipelines never import engine-specific code.
| API | Method | Description |
|---|---|---|
| Read | read_arrow(table, columns, filters) → pa.Table | Parameterized queries with range/equality filters |
| Read | read_sql(sql) → pa.Table | Arbitrary SQL execution |
| Write | write_arrow(table, data, mode, unique_key) | append, overwrite, or delete_insert (upsert) |
| Schema | create_table(table, schema) | DDL from Arrow schema with full type mapping |
| Validate | validate() → bool | Connectivity health check |
Engine Registry
Engines are registered by name and created via a single factory call:
from quantflow.io import DBRegistry
DBRegistry.register("duckdb", DuckDBEngine)
DBRegistry.register("openlakehouse", OpenLakehouseEngine)
DBRegistry.register("s3_iceberg", OpenLakehouseEngine) # backward compat alias
DBRegistry.register("trino", OpenLakehouseEngine) # backward compat alias
DBRegistry.register("bigquery", BigQueryEngine)
DBRegistry.register("snowflake", SnowflakeEngine)
DBRegistry.register("databricks", DatabricksEngine)
# Factory usage — same call, any engine
engine = DBRegistry.create("duckdb", config)
engine = DBRegistry.create("openlakehouse", config)
engine = DBRegistry.create("bigquery", config)
Engine Options
DuckDB (Embedded)
An embedded columnar database that runs in-process — no server, no daemon, no network configuration. The default engine for local development.
| Pros | Cons |
|---|---|
Zero setup — pip install is all you need | Single-machine only, no distributed queries |
| Full pipeline runs on a laptop | Not suitable for multi-user production workloads |
| File-based persistence, easy to version and share | Limited concurrency |
| Partition-aware scans on Parquet files |
When to use: Development, research, backtesting, single-machine batch pipelines. Start here — switching to another engine later is a config change.
OpenLakehouse (S3 + Parquet + Iceberg + Trino)
A unified backend that combines S3-compatible object storage, Apache Parquet files, Apache Iceberg table format, and Trino for federated SQL queries. All three registry names (openlakehouse, s3_iceberg, trino) resolve to the same engine.
| Component | Role |
|---|---|
| S3 / Object Store | Persistent, cost-effective storage for HFT-scale data |
| Apache Parquet | Columnar format — compressed, predicate-pushdown, partition-aware |
| Apache Iceberg | Table format — ACID transactions, time-travel, schema evolution, hidden partitioning |
| Trino | Federated SQL — joins across Iceberg catalogs and external sources |
The write path: Ray workers write Parquet to staging → commit_files() registers files in the Iceberg catalog → Trino queries the committed table. The read path: Trino resolves the Iceberg snapshot, applies partition pruning, and scans only the relevant Parquet files.
| Pros | Cons |
|---|---|
| Cost-effective — S3 storage is cheap, compute is separate | Requires infrastructure: S3 bucket, Iceberg catalog (JDBC), Trino cluster |
| Iceberg ACID — safe concurrent writes, time-travel queries | More operational complexity than DuckDB |
| Trino federated queries — join CDM tables with external data sources | Cold-start query latency higher than embedded engines |
| Scales to billions of HFT events |
When to use: Production data plane, HFT-scale storage, multi-user research, any workload where data must outlive a single machine. The standard production target.
→ OpenLakehouse — full engine documentation, storage layout, and write flow details
BigQuery
Google Cloud's serverless data warehouse. Fully managed, no infrastructure to operate.
| Pros | Cons |
|---|---|
| Serverless — no cluster management, auto-scaling | GCP-only, vendor lock-in |
| BI Engine — sub-second response on dashboards | Query costs scale with data scanned |
| Native partitioning and clustering | Limited per-query execution time |
| Built-in ML (BigQuery ML) |
When to use: Organizations already on GCP, workloads that benefit from serverless auto-scaling, BI dashboarding on top of QuantFlow data.
Snowflake
Multi-cloud data warehouse with strong governance, data sharing, and time-travel features.
| Pros | Cons |
|---|---|
| Multi-cloud (AWS, Azure, GCP) | Warehouse sizing requires active management |
| Time-travel and zero-copy cloning | Compute and storage costs are coupled in standard edition |
| Secure data sharing across organizations | Cold-start for suspended warehouses |
| Strong RBAC and governance |
When to use: Enterprise environments with governance requirements, multi-cloud strategies, organizations that already use Snowflake for analytics.
Databricks
Lakehouse platform combining data engineering, SQL analytics, and ML on a single platform.
| Pros | Cons |
|---|---|
| Unified platform — SQL, Python, ML in one workspace | Requires cluster management |
| Unity Catalog for governance | More complex than pure SQL warehouses |
| Optimized for Spark/ML workloads | |
| Delta Lake / Iceberg interoperability |
When to use: Teams doing both data engineering and ML, organizations standardized on Databricks, workloads that combine SQL analytics with Spark-based feature engineering.
DolphinDB (Streaming)
A native timeseries engine for real-time, event-driven computation. Not a general-purpose storage engine — it is the compute engine for streaming pipelines.
| Capability | Detail |
|---|---|
| Stream tables | Shared in-memory ingestion tables with pub/sub semantics |
| Reactive state engines | Stateful streaming computation triggered on every incoming tick |
| Sub-ms queries | Native timeseries engine optimized for HFT workloads |
| Unified query | Same SQL for both historical and live data |
| Pros | Cons |
|---|---|
| Sub-ms latency on streaming data | Requires a DolphinDB server (Community Edition available) |
| Native timeseries primitives — no emulation | Not a general-purpose data warehouse |
| Unified historical + real-time query surface | Limited ecosystem compared to cloud warehouses |
When to use: Live trading, real-time signal generation, streaming feature computation. Deploys alongside a storage engine — DuckDB for batch research, OpenLakehouse for historical data, DolphinDB for the live path.
Engine Selection Guide
| Scenario | Recommended Engine |
|---|---|
| Getting started, local development | DuckDB |
| Single-machine batch research | DuckDB |
| Production data lake, HFT-scale storage | OpenLakehouse |
| Multi-user research on shared data | OpenLakehouse |
| Serverless analytics, BI dashboards | BigQuery |
| Enterprise governance, data sharing | Snowflake |
| Combined data engineering + ML | Databricks |
| Live trading, real-time signals | DolphinDB |
QFSQL: Engine-Agnostic Expressions
QFSQL (QuantFlow SQL) is the expression language that makes engine portability possible. Write QFSQL in field mappings, quality tests, and feature definitions — each engine adapter translates it to native SQL.
QFSQL expression → Engine adapter → Native SQL
| QFSQL | DuckDB | BigQuery | Snowflake |
|---|---|---|---|
cast_safe(price, numeric) | CAST(price AS DOUBLE) | SAFE_CAST(price AS NUMERIC) | TRY_CAST(price AS NUMBER) |
current_timestamp() | now() | CURRENT_TIMESTAMP() | CURRENT_TIMESTAMP() |
timestamp_ms(ts) | TO_TIMESTAMP(ts / 1000) | TIMESTAMP_MILLIS(ts) | TO_TIMESTAMP(ts, 3) |
See the dbt Generator for the full 60-function catalog and the QFSQL Reference for the complete language specification.
Configuration
Each engine is configured in .local_config.yml:
engine:
# Embedded — zero setup
- name: duckdb
database: "./data/{{project_name}}/quantflow.duckdb"
# Lakehouse — production data plane
- name: openlakehouse
s3:
bucket: "quantflow-data"
region: "eu-central-003"
endpoint: "https://s3.eu-central-003.backblazeb2.com"
iceberg:
catalog_type: "jdbc"
catalog_uri: "jdbc:postgresql://localhost:5432/iceberg"
trino:
host: "localhost"
port: 7010
catalog: "quantflow_canonical"
# Cloud warehouse
- name: bigquery
project: "my-quant-project"
dataset: "quantflow_data"
location: "US"
# Streaming compute
- name: dolphindb
host: "localhost"
port: 8848
database: "quantflow_db"
auth: password
key:
username: "admin"
password: "123456"
Set the active engine in quantflow_project.yml:
data_processing:
historical_data_engine: duckdb # local dev
# historical_data_engine: openlakehouse # production
# historical_data_engine: bigquery # cloud analytics