QuantFlow DataInfra

Metadata-driven data infrastructure for quantitative finance

DataInfra → MarketState → FeatureDAG → Execution

DataInfra is the engine-agnostic, metadata-driven data infrastructure layer of QuantFlow. It provides a unified system for ingesting market data from multiple sources, normalizing it into a Common Data Model (CDM), generating production-ready dbt pipelines, and enforcing data quality controls.

Without a CDM, each dataset requires custom transformation logic. Raw tick data, order book snapshots, and reference data arrive in incompatible schemas. DataInfra eliminates this fragmentation by enforcing a unified schema across all sources — ensuring every downstream component operates on consistent, validated inputs.

The Metadata Module is the central metadata registry that defines and governs all data assets:

Common Data Model (CDM) — standardized financial entities (trade_tick, lob_incremental, lob_snapshot, etc.) defined declaratively in YAML
Schema registry — column types, partitioning, clustering, and unique keys per entity
Lineage tracking — source-to-destination field-level lineage
Pydantic validation — all metadata validated at project load time, preventing config drift

One metadata module. All engines. No schema duplication across environments.
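To make load-time validation concrete, here is a minimal sketch of how an entity definition might be checked once its YAML has been parsed into a dictionary. It uses standard-library dataclasses as a stand-in for Pydantic so that it runs without dependencies. The `Column`, `Entity`, and `load_entity` names, the set of allowed types, and the example column names are all assumptions made for illustration, not the actual DataInfra models.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical set of column types; the real schema registry may differ.
ALLOWED_TYPES = {"string", "int64", "float64", "timestamp"}


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str

    def __post_init__(self):
        # Reject unknown types as soon as the definition is loaded.
        if self.dtype not in ALLOWED_TYPES:
            raise ValueError(f"column {self.name!r}: unknown type {self.dtype!r}")


@dataclass(frozen=True)
class Entity:
    name: str
    columns: Tuple[Column, ...]
    unique_key: Tuple[str, ...] = ()
    partition_by: Optional[str] = None

    def __post_init__(self):
        names = {c.name for c in self.columns}
        # Unique keys and the partition column must refer to declared columns.
        missing = [k for k in self.unique_key if k not in names]
        if missing:
            raise ValueError(f"{self.name}: unique_key refers to unknown columns {missing}")
        if self.partition_by is not None and self.partition_by not in names:
            raise ValueError(f"{self.name}: partition_by {self.partition_by!r} is not a column")


def load_entity(raw: dict) -> Entity:
    """Build an Entity from a parsed YAML mapping, validating it on load."""
    columns = tuple(Column(**c) for c in raw["columns"])
    return Entity(
        name=raw["name"],
        columns=columns,
        unique_key=tuple(raw.get("unique_key", ())),
        partition_by=raw.get("partition_by"),
    )


# Example definition (column names are illustrative only).
trade_tick = load_entity({
    "name": "trade_tick",
    "columns": [
        {"name": "ts_event", "dtype": "timestamp"},
        {"name": "symbol", "dtype": "string"},
        {"name": "price", "dtype": "float64"},
    ],
    "unique_key": ["ts_event", "symbol"],
    "partition_by": "ts_event",
})
```

A definition whose `unique_key` names an undeclared column raises `ValueError` at load time, which is the kind of config drift this validation is meant to catch early.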

Feed Providers are external data sources, configured declaratively with field mappings and quality tests:

Databento Reader — Historical market data via Databento API
HTTP Reader — REST APIs with retry, pagination, rate limiting
Field Mappings — Raw columns → CDM columns via QFSQL expressions
Quality Tests — Uniqueness, null checks, range validation per field

Providers are configured once in YAML. New venues or data sources are a configuration change, not a code change.
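The sketch below illustrates the idea of field mappings and per-field quality tests in plain Python. QFSQL syntax is not shown in this document, so ordinary Python callables stand in for the mapping expressions. The raw field names (`ts`, `sym`, `px`, `qty`), the CDM column names, and the helper functions are hypothetical.

```python
# Hypothetical mapping from CDM column to an expression over the raw record.
# In DataInfra these would be QFSQL expressions declared in YAML.
FIELD_MAPPINGS = {
    "ts_event": lambda r: int(r["ts"]),
    "symbol": lambda r: r["sym"].upper(),
    "price": lambda r: float(r["px"]),
    "size": lambda r: int(r["qty"]),
}


def to_cdm(raw_rows):
    """Apply every field mapping to every raw record."""
    return [{col: expr(row) for col, expr in FIELD_MAPPINGS.items()} for row in raw_rows]


def check_not_null(rows, column):
    """Return the indices of rows where the column is None."""
    return [i for i, r in enumerate(rows) if r[column] is None]


def check_unique(rows, columns):
    """Return the indices of rows that repeat an earlier key."""
    seen, dupes = set(), []
    for i, r in enumerate(rows):
        key = tuple(r[c] for c in columns)
        if key in seen:
            dupes.append(i)
        seen.add(key)
    return dupes


def check_range(rows, column, low=None, high=None):
    """Return the indices of rows whose value falls outside [low, high]."""
    bad = []
    for i, r in enumerate(rows):
        value = r[column]
        if (low is not None and value < low) or (high is not None and value > high):
            bad.append(i)
    return bad


raw = [
    {"ts": "1", "sym": "esz4", "px": "5000.25", "qty": "2"},
    {"ts": "1", "sym": "esz4", "px": "5000.25", "qty": "2"},  # duplicate key
    {"ts": "2", "sym": "esz4", "px": "-1", "qty": "1"},        # negative price
]
rows = to_cdm(raw)
duplicate_rows = check_unique(rows, ["ts_event", "symbol"])
out_of_range_rows = check_range(rows, "price", low=0)
```

Because the mappings and checks are plain data plus small functions, onboarding a new venue in this sketch means adding another mapping dictionary rather than new pipeline code.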

Source → Connector → Processor → Writer → QuantFlow Raw Zone

Stage	What It Does	Options
Reader	Download raw data from external source	DatabentoReader, HTTPReader
Processor	Transform and validate	Decompressor (gzip/snappy/zstd)
Writer	Persist to target engine	DuckDB, OpenLakehouse, BigQuery, Snowflake, Databricks
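The three stages in the table can be sketched as a chain of iterators. This is an illustrative outline only: `run_pipeline`, `gzip_decompressor`, and `ListWriter` are hypothetical names, an in-memory list stands in for a target engine, and only gzip is shown because it is available in the Python standard library.

```python
import gzip
from typing import Callable, Iterable, Iterator, List


def gzip_decompressor(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Processor stage: decompress each gzip-compressed payload."""
    for chunk in chunks:
        yield gzip.decompress(chunk)


class ListWriter:
    """Writer stage: an in-memory stand-in for a target engine."""

    def __init__(self) -> None:
        self.rows: List[bytes] = []

    def write(self, chunks: Iterable[bytes]) -> int:
        count = 0
        for chunk in chunks:
            self.rows.append(chunk)
            count += 1
        return count


def run_pipeline(
    reader: Iterable[bytes],
    processors: List[Callable[[Iterable[bytes]], Iterator[bytes]]],
    writer: ListWriter,
) -> int:
    """Feed the reader output through each processor, then into the writer."""
    stream: Iterable[bytes] = reader
    for processor in processors:
        stream = processor(stream)
    return writer.write(stream)


# A list of compressed payloads stands in for a reader here.
payloads = [gzip.compress(b"ESZ4,5000.25,2\n"), gzip.compress(b"ESZ4,5000.50,1\n")]
writer = ListWriter()
written = run_pipeline(iter(payloads), [gzip_decompressor], writer)
```

Keeping each stage as an independent callable is what would allow a reader or writer to be swapped without touching the others.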

QuantFlow Raw Zone → Staging → Intermediate → Mart

Layer	What It Does
Staging	Raw source → typed, validated staging tables with field-level tests
Intermediate	Cross-source joins, deduplication, enrichment, business logic
Mart	Analysis-ready tables for feature computation and downstream consumption

Models are auto-generated from CDM definitions and follow dbt best practices. Zero manual SQL.
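As a rough illustration of generating a staging model from metadata, the sketch below renders a dbt-style `select` that casts raw columns to typed CDM columns. The `render_staging_model` function and its arguments are assumptions for this example, and the actual generator's output format is not documented here. Only the `{{ source(...) }}` Jinja call is standard dbt.

```python
def render_staging_model(source_name, table_name, columns):
    """Render staging SQL from (cdm_column, raw_column, sql_type) triples."""
    select_list = ",\n".join(
        f"    cast({raw} as {sql_type}) as {cdm}" for cdm, raw, sql_type in columns
    )
    # Doubled braces in the f-string emit literal {{ and }} for dbt's Jinja.
    return (
        "select\n"
        f"{select_list}\n"
        f"from {{{{ source('{source_name}', '{table_name}') }}}}\n"
    )


# Illustrative column triples; real names and types would come from the CDM.
staging_sql = render_staging_model(
    "raw_zone",
    "trades",
    [
        ("ts_event", "ts", "timestamp"),
        ("symbol", "sym", "varchar"),
        ("price", "px", "double"),
    ],
)
print(staging_sql)
```

Since the SQL is derived entirely from the column triples, a change to an entity definition would propagate to the model on the next generation run rather than requiring a hand edit.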

Supported engines: OpenLakehouse, DolphinDB, BigQuery, Snowflake, Databricks, DuckDB.

Four-layer validation ensures data integrity at every stage, powered by Elementary for dbt-native monitoring and anomaly detection:

Schema Validation — Pydantic models enforce type safety at parse time
Field-level Tests — Per-column uniqueness, null checks, range constraints
Cross-table Integrity — Referential consistency across CDM entities
Business Rules — Domain-specific assertions (e.g., bid ≤ ask, positive volume)

Test failures are reported with row-level diagnostics. Configurable severity: warn, error, or abort. Elementary provides dashboards, alerts, and lineage-aware monitoring out of the box.
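The business-rule layer and the severity levels can be sketched as follows. The meaning given to each severity here is an assumption: `warn` and `error` failures are collected with row-level detail, while `abort` stops validation immediately. The `Rule`, `AbortValidation`, and `validate` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    name: str
    predicate: Callable[[Dict], bool]  # returns True when the row passes
    severity: str                      # "warn", "error", or "abort"


class AbortValidation(Exception):
    """Raised when a rule with severity 'abort' fails."""


def validate(rows: List[Dict], rules: List[Rule]) -> List[Dict]:
    """Check every row against every rule and return row-level failures."""
    failures = []
    for index, row in enumerate(rows):
        for rule in rules:
            if rule.predicate(row):
                continue
            failure = {
                "row": index,
                "rule": rule.name,
                "severity": rule.severity,
                "values": row,
            }
            if rule.severity == "abort":
                raise AbortValidation(failure)
            failures.append(failure)
    return failures


rules = [
    Rule("bid_le_ask", lambda r: r["bid"] <= r["ask"], "error"),
    Rule("positive_volume", lambda r: r["volume"] > 0, "warn"),
]
quotes = [
    {"bid": 100.0, "ask": 100.5, "volume": 10},
    {"bid": 101.0, "ask": 100.5, "volume": 0},  # crossed quote, zero volume
]
failures = validate(quotes, rules)
```

Each failure record carries the row index and the offending values, which corresponds to the row-level diagnostics described above.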

DataInfra Overview — Full CDM specification, feed provider YAML reference, dbt generation, metadata governance
Ingestion Pipeline — Connectors, processors, writers, and multi-source configuration
dbt Model Generator — Auto-generated staging, intermediate, and mart models
Data Quality — Four-layer validation architecture with Elementary integration
Metadata Specifications — Pydantic model reference: ProjectConfig, FeedProvider, CDMEntity, EngineConfig
