
Overview

DataInfra is the engine-agnostic, metadata-driven data infrastructure layer of QuantFlow. It provides a unified system for ingesting market data from multiple sources, normalizing it into a Common Data Model (CDM), generating production-ready dbt pipelines, and enforcing data quality controls.

Whether you run a local embedded engine during development, a cloud warehouse in production, or a real-time engine for streaming, the same CDM schemas, field mappings, and quality rules apply. DataInfra's engine adapter system translates QFSQL expressions and generates engine-native dbt models automatically. Switching engines means changing one line of config, with no pipeline rewrites.

Core Responsibility

Turn fragmented, multi-source market data into a unified, queryable, and validated financial data layer — on any engine, at any scale.


Position in QuantFlow System

DataInfra → MarketState → FeatureDAG → Research / Trading

DataInfra turns raw market data into a clean, validated, and standardized financial data foundation. Everything downstream depends on it.


Components

Common Data Model

The shared data contract across the entire pipeline. DataInfra defines and enforces the CDM — MarketState and FeatureDAG both read from and write to it, ensuring every stage operates on consistent, validated inputs.

CDM schema definition

Ingestion & Feed Providers

Configurable pipeline of connectors, processors, and writers. External data sources are defined declaratively in YAML with field mappings, QFSQL transformations, and quality tests — no custom ingestion code.

Ingestion pipeline & feed provider config
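
To make the idea of declarative field mappings concrete, here is a minimal sketch of how a mapping could be applied to a raw record. The mapping layout, the source field names (`sym`, `px`, `ts`), and the CDM field names are all hypothetical. The actual feed provider YAML schema is not shown on this page, so the mapping is written as the dict it might become after loading.

```python
from typing import Any, Callable

# Hypothetical feed-provider mapping, as it might look after loading the YAML.
# Keys are assumed CDM field names; values name the source field and a cast.
FIELD_MAPPING: dict[str, dict[str, Any]] = {
    "symbol": {"source": "sym", "cast": str},
    "price": {"source": "px", "cast": float},
    "event_time": {"source": "ts", "cast": int},
}


def map_record(raw: dict[str, Any], mapping: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Rename and cast source fields into the (assumed) CDM field names."""
    out: dict[str, Any] = {}
    for cdm_field, rule in mapping.items():
        source = rule["source"]
        if source not in raw:
            raise KeyError(f"missing source field {source!r} for CDM field {cdm_field!r}")
        # Fall back to the identity function when no cast is declared.
        cast: Callable[[Any], Any] = rule.get("cast", lambda value: value)
        out[cdm_field] = cast(raw[source])
    return out


# Unmapped source fields (here "venue") are simply dropped.
raw_tick = {"sym": "AAPL", "px": "189.25", "ts": "1700000000", "venue": "XNAS"}
cdm_row = map_record(raw_tick, FIELD_MAPPING)
print(cdm_row)
```

Because the mapping is plain data, adding a new source in this sketch means adding another mapping rather than another function.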

dbt Generator

Auto-generates complete dbt projects from metadata: staging models, CDM union models, engine-specific SQL macros, and connection profiles. Six sub-generators produce a production-ready project with zero manual SQL.

Generator architecture & QFSQL translation
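
As a rough illustration of metadata-driven model generation, the sketch below renders a dbt staging model from a column mapping. The function name, the source and table names, and the column names are hypothetical, and the real sub-generators are not described here. Only the `{{ source(...) }}` Jinja call is standard dbt.

```python
def render_staging_model(source_name: str, table_name: str, columns: dict[str, str]) -> str:
    """Render a dbt staging model that renames source columns to CDM names.

    `columns` maps CDM column name -> source column name (both hypothetical).
    """
    select_lines = [f"    {src} as {cdm}" for cdm, src in columns.items()]
    select_list = ",\n".join(select_lines)
    # `source()` is dbt's standard Jinja function for referencing a declared source.
    from_clause = "{{ source('" + source_name + "', '" + table_name + "') }}"
    return f"select\n{select_list}\nfrom {from_clause}\n"


sql = render_staging_model(
    "vendor_a",
    "raw_trades",
    {"symbol": "sym", "price": "px", "event_time": "ts"},
)
print(sql)
```

A real generator would also need to handle quoting, type casts, and engine-specific macros, which this sketch leaves out.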

Engine-Agnostic Data Layer

A common DBEngine interface backed by an engine registry. DuckDB for local dev, OpenLakehouse (S3 + Iceberg + Trino) for production, DolphinDB for real-time HFT — same API, same Arrow-based data flow.

Engine interface & registry · OpenLakehouse
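
One way to picture a common interface backed by a registry is the sketch below. It is not the actual `DBEngine` definition: the `dialect()` method, the `register_engine` decorator, and the `get_engine` helper are assumptions made for illustration, and the Arrow-based data flow is not modeled.

```python
from abc import ABC, abstractmethod
from typing import Callable


class DBEngine(ABC):
    """Hypothetical shape of the common engine interface."""

    @abstractmethod
    def dialect(self) -> str:
        """Return the SQL dialect this engine speaks."""


# Registry mapping a config name to an engine class.
_REGISTRY: dict[str, type[DBEngine]] = {}


def register_engine(name: str) -> Callable[[type[DBEngine]], type[DBEngine]]:
    """Class decorator that records an engine under `name`."""
    def decorator(cls: type[DBEngine]) -> type[DBEngine]:
        _REGISTRY[name] = cls
        return cls
    return decorator


@register_engine("duckdb")
class DuckDBEngine(DBEngine):
    def dialect(self) -> str:
        return "duckdb"


@register_engine("openlakehouse")
class OpenLakehouseEngine(DBEngine):
    def dialect(self) -> str:
        # The page describes OpenLakehouse as S3 + Iceberg + Trino.
        return "trino"


def get_engine(config: dict[str, str]) -> DBEngine:
    """Instantiate the engine named by the (assumed) `engine` config key."""
    cls = _REGISTRY.get(config["engine"])
    if cls is None:
        raise ValueError(f"unknown engine: {config['engine']!r}")
    return cls()


# Switching engines is a one-value change in this sketch.
engine = get_engine({"engine": "duckdb"})
print(engine.dialect())
```

Callers depend only on the abstract class, so a new engine registers itself without any change to calling code.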

Pipeline Orchestration

Dagster-powered batch pipeline with automatic asset discovery, job definitions, and execution observability. The five stages (ingest, dbt, state engine, label engine, feature engine) each run as an isolated Dagster asset group with full lineage tracking.

Dagster pipeline orchestration
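
The stage ordering can be pictured as a small dependency graph. The sketch below uses plain Python (`graphlib` from the standard library), not Dagster, and it assumes the five stages form a strict linear chain in the order listed. The actual asset dependencies may differ.

```python
from graphlib import TopologicalSorter

# Assumed dependencies: each stage depends only on the one listed before it.
# Keys are stages; values are the stages that must finish first.
STAGE_DEPENDENCIES: dict[str, set[str]] = {
    "ingest": set(),
    "dbt": {"ingest"},
    "state_engine": {"dbt"},
    "label_engine": {"state_engine"},
    "feature_engine": {"label_engine"},
}

# A topological sort yields an order in which every stage runs after its inputs.
execution_order = list(TopologicalSorter(STAGE_DEPENDENCIES).static_order())
print(execution_order)
```

An orchestrator such as Dagster derives this kind of ordering from declared asset dependencies, which is also what makes lineage tracking possible.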

Data Quality Control

Four-layer validation: schema (Pydantic), ingestion (Pandera), warehouse (dbt tests), and monitoring (Elementary). Declarative tests in feed provider YAML become automated dbt test cases with alerting.

Validation layers & monitoring
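
To illustrate how declarative tests might become dbt test cases, the sketch below converts a per-column test declaration into the `columns` section of a dbt `schema.yml` model entry. The input layout, the model name, and the column names are hypothetical. `not_null` and `accepted_values` are built-in dbt generic tests.

```python
import json
from typing import Any


def to_dbt_columns(declared_tests: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Convert {column: [tests]} into the `columns` list of a dbt model entry."""
    return [{"name": col, "tests": list(tests)} for col, tests in declared_tests.items()]


# Hypothetical test declarations, as they might look after loading feed provider YAML.
declared = {
    "symbol": ["not_null"],
    "price": ["not_null"],
    "side": [{"accepted_values": {"values": ["buy", "sell"]}}],
}

# Structure of a dbt schema.yml file, built here as a dict rather than YAML text.
# Newer dbt versions also accept `data_tests` in place of `tests`.
schema = {
    "version": 2,
    "models": [
        {"name": "stg_vendor_a_trades", "columns": to_dbt_columns(declared)},
    ],
}
print(json.dumps(schema, indent=2))
```

Serializing this dict to YAML would give a file that `dbt test` can run. The alerting and Elementary monitoring layers are outside the scope of this sketch.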