Overview
DataInfra is the engine-agnostic, metadata-driven data infrastructure layer of QuantFlow. It provides a unified system for ingesting market data from multiple sources, normalizing it into a Common Data Model (CDM), generating production-ready dbt pipelines, and enforcing data quality controls.
Whether you run a local embedded engine during development, a cloud warehouse in production, or a real-time engine for streaming, the same CDM schemas, field mappings, and quality rules apply. DataInfra's engine adapter system translates QFSQL expressions and generates engine-native dbt models automatically. Switching engines means changing one line of config, with no pipeline rewrites.
Core Responsibility
Turn fragmented, multi-source market data into a unified, queryable, and validated financial data layer — on any engine, at any scale.
Position in QuantFlow System
DataInfra → MarketState → FeatureDAG → Research / Trading
DataInfra turns raw market data into a clean, validated, and standardized financial data foundation. Everything downstream depends on it.
Components
Common Data Model
The shared data contract across the entire pipeline. DataInfra defines and enforces the CDM — MarketState and FeatureDAG both read from and write to it, ensuring every stage operates on consistent, validated inputs.
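As a rough illustration of what a shared data contract can look like, the sketch below defines a record type that validates itself at construction time. It uses only the Python standard library; the `CdmBar` name, its fields, and its rules are assumptions made for illustration and are not the actual CDM schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CdmBar:
    """Hypothetical CDM record; field names and rules are illustrative only."""

    symbol: str
    ts: datetime
    close: float
    volume: float

    def __post_init__(self) -> None:
        # Enforce the contract when the record is built, so every
        # downstream stage only ever sees validated values.
        if not self.symbol:
            raise ValueError("symbol must be non-empty")
        if self.ts.tzinfo is None:
            raise ValueError("ts must be timezone-aware")
        if self.close <= 0:
            raise ValueError("close must be positive")
        if self.volume < 0:
            raise ValueError("volume must be non-negative")


# Placeholder values, not real market data.
bar = CdmBar("TEST", datetime(2024, 1, 2, tzinfo=timezone.utc), 100.0, 10.0)
```

Freezing the dataclass keeps records immutable once validated, which is one simple way to stop a later stage from silently breaking the contract.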
Ingestion & Feed Providers
Configurable pipeline of connectors, processors, and writers. External data sources are defined declaratively in YAML with field mappings, QFSQL transformations, and quality tests — no custom ingestion code.
→ Ingestion pipeline & feed provider config
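To make the idea of declarative field mappings concrete, here is a minimal sketch of how a mapping from source fields to CDM fields might be applied to one raw record. The config is shown as the dict a YAML loader would produce; the keys (`provider`, `target`, `field_mappings`) and all field names are hypothetical, not the real feed provider schema.

```python
from typing import Any

# Hypothetical feed provider definition. Keys and field names are
# illustrative assumptions, not the actual DataInfra YAML layout.
FEED_CONFIG: dict[str, Any] = {
    "provider": "example_vendor",
    "target": "cdm_bar",
    "field_mappings": {"ticker": "symbol", "px_last": "close", "vol": "volume"},
}


def apply_field_mappings(raw: dict[str, Any], config: dict[str, Any]) -> dict[str, Any]:
    """Rename source fields to CDM fields, failing loudly on missing inputs."""
    mappings: dict[str, str] = config["field_mappings"]
    missing = [src for src in mappings if src not in raw]
    if missing:
        raise KeyError(f"missing source fields: {missing}")
    # Source fields without a mapping are dropped from the output.
    return {cdm: raw[src] for src, cdm in mappings.items()}


row = apply_field_mappings(
    {"ticker": "TEST", "px_last": 100.0, "vol": 10, "vendor_flag": "x"},
    FEED_CONFIG,
)
```

Because the mapping lives in data rather than code, adding a new source in this sketch means adding a new config entry, with no change to `apply_field_mappings`.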
dbt Generator
Auto-generates complete dbt projects from metadata: staging models, CDM union models, engine-specific SQL macros, and connection profiles. Six sub-generators produce a production-ready project with zero manual SQL.
→ Generator architecture & QFSQL translation
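The following sketch shows one way a staging model could be rendered from metadata. It is not the DataInfra generator: the `render_staging_model` function and its arguments are assumptions, and it covers only column renaming. The one real dbt feature it relies on is the `source()` Jinja macro.

```python
def render_staging_model(source: str, table: str, mappings: dict[str, str]) -> str:
    """Render dbt staging SQL that renames source columns to CDM names."""
    select_list = ",\n".join(
        f"    {src} as {cdm}" for src, cdm in mappings.items()
    )
    # Quadruple braces in the f-string emit the literal {{ }} that dbt's
    # Jinja layer expects around the source() macro.
    return (
        "select\n"
        f"{select_list}\n"
        f"from {{{{ source('{source}', '{table}') }}}}\n"
    )


# Hypothetical source and column names.
sql = render_staging_model(
    "example_vendor", "bars", {"ticker": "symbol", "px_last": "close"}
)
print(sql)
```

A real generator would also need to quote identifiers and emit engine-specific expressions; this sketch leaves both out.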
Engine-Agnostic Data Layer
A common DBEngine interface backed by an engine registry. DuckDB serves local development, OpenLakehouse (S3 + Iceberg + Trino) serves production, and DolphinDB serves real-time HFT, all behind the same API and the same Arrow-based data flow.
→ Engine interface & registry · OpenLakehouse
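A minimal sketch of the interface-plus-registry pattern described above, assuming a decorator-based registry. Only the `DBEngine` name comes from the text; the `dialect` method, the `register_engine` and `get_engine` helpers, and the `engine` config key are hypothetical.

```python
from abc import ABC, abstractmethod


class DBEngine(ABC):
    """Sketch of a common engine interface; the real methods are not shown here."""

    @abstractmethod
    def dialect(self) -> str:
        """Return the SQL dialect this engine expects."""


_REGISTRY: dict[str, type[DBEngine]] = {}


def register_engine(name: str):
    """Class decorator that records an engine under a config-friendly name."""
    def decorator(cls: type[DBEngine]) -> type[DBEngine]:
        _REGISTRY[name] = cls
        return cls
    return decorator


@register_engine("duckdb")
class DuckDBEngine(DBEngine):
    def dialect(self) -> str:
        return "duckdb"


@register_engine("openlakehouse")
class OpenLakehouseEngine(DBEngine):
    def dialect(self) -> str:
        # Assumed: queries go through Trino in the OpenLakehouse stack.
        return "trino"


def get_engine(config: dict[str, str]) -> DBEngine:
    """Look up and instantiate the engine named in the config."""
    name = config["engine"]
    if name not in _REGISTRY:
        raise ValueError(f"unknown engine: {name!r}")
    return _REGISTRY[name]()


# Switching engines is a one-value change in the config.
engine = get_engine({"engine": "duckdb"})
```

Callers depend only on `DBEngine`, so in this sketch moving from `duckdb` to `openlakehouse` touches the config value and nothing else.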
Pipeline Orchestration
Dagster-powered batch pipeline with automatic asset discovery, job definitions, and execution observability. Five stages (ingest, dbt, state engine, label engine, feature engine) each run as an isolated Dagster asset group with full lineage tracking.
→ Dagster pipeline orchestration
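The stage ordering can be pictured as a small dependency graph. The sketch below uses Python's standard `graphlib` module rather than Dagster, purely to show the idea. The stage names come from the text; the strictly linear dependencies between them are an assumption, since the real asset graph may let some stages run in parallel.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
# The linear chain is assumed for illustration.
STAGE_DEPENDENCIES: dict[str, set[str]] = {
    "ingest": set(),
    "dbt": {"ingest"},
    "state_engine": {"dbt"},
    "label_engine": {"state_engine"},
    "feature_engine": {"label_engine"},
}


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return stages in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(STAGE_DEPENDENCIES)
```

An orchestrator such as Dagster derives this kind of ordering from declared asset dependencies and adds scheduling, retries, and lineage on top.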
Data Quality Control
Four-layer validation: schema (Pydantic), ingestion (Pandera), warehouse (dbt tests), and monitoring (Elementary). Declarative tests in feed provider YAML become automated dbt test cases with alerting.
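As a sketch of how declarative tests might become dbt test cases, the function below converts per-field test declarations into the structure of a dbt properties file. The input layout and the function names are assumptions. `not_null` and `accepted_values` are built-in dbt generic tests; recent dbt versions also accept `data_tests` in place of the `tests` key used here.

```python
from typing import Any


def to_dbt_columns(field_tests: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Turn {column: [tests]} into dbt's list-of-columns layout."""
    return [{"name": col, "tests": list(tests)} for col, tests in field_tests.items()]


def to_dbt_schema(model: str, field_tests: dict[str, list[Any]]) -> dict[str, Any]:
    """Build the dict that would be dumped to a dbt schema YAML file."""
    return {
        "version": 2,
        "models": [{"name": model, "columns": to_dbt_columns(field_tests)}],
    }


# Hypothetical declarations, as they might appear in a feed provider config.
declared = {
    "symbol": ["not_null"],
    "side": ["not_null", {"accepted_values": {"values": ["buy", "sell"]}}],
}

schema = to_dbt_schema("stg_example_vendor__trades", declared)
```

Serializing `schema` with a YAML library would give a file dbt can pick up during `dbt test`; that step is left out to keep the sketch dependency-free.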