QuantFlow DataInfra

Metadata-driven data infrastructure for quantitative finance

🔗 Where It Fits

DataInfra → MarketState → FeatureDAG → Execution

📋 Overview

DataInfra is the engine-agnostic, metadata-driven data infrastructure layer of QuantFlow. It provides a unified system for ingesting market data from multiple sources, normalizing it into a Common Data Model (CDM), generating production-ready dbt pipelines, and enforcing data quality controls.

Without a CDM, each dataset requires custom transformation logic. Raw tick data, order book snapshots, and reference data arrive in incompatible schemas. DataInfra eliminates this fragmentation by enforcing a unified schema across all sources — ensuring every downstream component operates on consistent, validated inputs.

📐 Metadata Module

The central metadata registry that defines and governs all data assets:

  • Common Data Model (CDM) — standardized financial entities (trade_tick, lob_incremental, lob_snapshot, etc.) defined declaratively in YAML
  • Schema registry — column types, partitioning, clustering, and unique keys per entity
  • Lineage tracking — source-to-destination field-level lineage
  • Pydantic validation — all metadata validated at project load time, preventing config drift

One metadata module. All engines. No schema duplication across environments.
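
As a rough illustration of load-time validation, the sketch below checks a CDM entity definition before anything downstream can use it. It uses standard-library dataclasses as a stand-in for the Pydantic models mentioned above. The `Column` and `Entity` classes, their fields, and the `ALLOWED_TYPES` set are hypothetical and are not DataInfra's actual schema.

```python
from dataclasses import dataclass

# Hypothetical set of allowed column types; the real registry may differ.
ALLOWED_TYPES = {"string", "int64", "float64", "timestamp"}


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str

    def __post_init__(self):
        # Reject unknown types as soon as the metadata is loaded.
        if self.dtype not in ALLOWED_TYPES:
            raise ValueError(f"{self.name}: unknown type {self.dtype!r}")


@dataclass(frozen=True)
class Entity:
    name: str
    columns: tuple
    unique_key: tuple = ()

    def __post_init__(self):
        names = [c.name for c in self.columns]
        if len(names) != len(set(names)):
            raise ValueError(f"{self.name}: duplicate column names")
        missing = [k for k in self.unique_key if k not in names]
        if missing:
            raise ValueError(f"{self.name}: unique key uses unknown columns {missing}")


def load_entity(raw: dict) -> Entity:
    """Build an Entity from a parsed YAML-like dict, failing fast on bad metadata."""
    cols = tuple(Column(**c) for c in raw["columns"])
    return Entity(raw["name"], cols, tuple(raw.get("unique_key", ())))
```

Calling `load_entity` with a unique key that names a column missing from the entity raises `ValueError` immediately, which is the kind of early failure that load-time validation is meant to provide.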

🔌 Feed Providers

External data sources configured declaratively with field mappings and quality tests:

  • Databento Reader — Historical market data via Databento API
  • HTTP Reader — REST APIs with retry, pagination, rate limiting
  • Field Mappings — Raw columns → CDM columns via QFSQL expressions
  • Quality Tests — Uniqueness, null checks, range validation per field

Providers are configured once in YAML. New venues or data sources are a configuration change, not a code change.
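
To make the idea of declarative mappings and per-field tests concrete, here is a minimal sketch. It treats each field mapping as a plain column rename; real mappings are QFSQL expressions, whose syntax is not shown here. The `PROVIDER` dictionary, its keys, and the raw column names are hypothetical, and only null and range checks are illustrated.

```python
# Hypothetical provider config, shaped like a parsed YAML document.
PROVIDER = {
    "name": "example_feed",
    "field_mappings": {"ts_event": "event_time", "px": "price", "sz": "size"},
    "quality_tests": {
        "event_time": {"not_null": True},
        "price": {"not_null": True, "min": 0.0},
    },
}


def map_row(raw_row: dict, mappings: dict) -> dict:
    """Rename raw columns to CDM columns; unmapped raw columns are dropped."""
    return {cdm: raw_row.get(raw) for raw, cdm in mappings.items()}


def check_row(row: dict, tests: dict) -> list:
    """Return a list of human-readable failures for one mapped row."""
    failures = []
    for column, rules in tests.items():
        value = row.get(column)
        if rules.get("not_null") and value is None:
            failures.append(f"{column}: null value")
        elif value is not None and "min" in rules and value < rules["min"]:
            failures.append(f"{column}: {value} below minimum {rules['min']}")
    return failures
```

Under this layout, supporting a new source means adding another dictionary like `PROVIDER`, while `map_row` and `check_row` stay unchanged.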

📥 Ingestion Pipeline

Source → Connector → Processor → Writer → Quantflow Raw Zone

Stage     | What It Does                           | Options
Reader    | Download raw data from external source | DatabentoReader, HTTPReader
Processor | Transform and validate                 | Decompressor (gzip/snappy/zstd)
Writer    | Persist to target engine               | DuckDB, OpenLakehouse, BigQuery, Snowflake, Databricks
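
The three stages can be pictured as small objects chained together. The sketch below is one possible shape, not DataInfra's actual interfaces: `BytesReader`, `GzipDecompressor`, `ListWriter`, and `run_pipeline` are hypothetical names, the reader returns a fixed payload instead of calling an external API, and the writer appends to a list instead of a target engine.

```python
import gzip


class BytesReader:
    """Stand-in reader that returns a fixed payload instead of calling an API."""

    def __init__(self, payload: bytes):
        self.payload = payload

    def read(self) -> bytes:
        return self.payload


class GzipDecompressor:
    """Processor stage: decompress a gzip payload."""

    def process(self, data: bytes) -> bytes:
        return gzip.decompress(data)


class ListWriter:
    """Stand-in writer that appends lines to a list rather than an engine."""

    def __init__(self):
        self.rows = []

    def write(self, data: bytes) -> int:
        lines = data.decode("utf-8").splitlines()
        self.rows.extend(lines)
        return len(lines)


def run_pipeline(reader, processors, writer) -> int:
    """Read once, apply each processor in order, then write; return rows written."""
    data = reader.read()
    for processor in processors:
        data = processor.process(data)
    return writer.write(data)
```

Because each stage only depends on the bytes handed to it, swapping the writer for a different engine would not require changes to the reader or processors in this arrangement.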

🏗️ dbt Transformation Pipeline

Quantflow Raw Zone → Staging → Intermediate → Mart

Layer        | What It Does
Staging      | Raw source → typed, validated staging tables with field-level tests
Intermediate | Cross-source joins, deduplication, enrichment, business logic
Mart         | Analysis-ready tables for feature computation and downstream consumption

Auto-generated from CDM definitions. Follows dbt best practices. Zero manual SQL.
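
As a hedged illustration of what generating a staging model from metadata could look like, the sketch below renders dbt-style SQL from a column-to-type mapping. The function `render_staging_model`, the source name, and the type names are hypothetical; only the `{{ source(...) }}` call is standard dbt Jinja. The actual generator's output is not shown in this document.

```python
def render_staging_model(source_name: str, table: str, columns: dict) -> str:
    """Render a dbt staging model that casts each raw column to its declared type."""
    select_list = ",\n".join(
        f"    cast({name} as {dtype}) as {name}" for name, dtype in columns.items()
    )
    return (
        "select\n"
        f"{select_list}\n"
        # Doubled braces emit literal {{ }} so dbt can resolve the source at compile time.
        f"from {{{{ source('{source_name}', '{table}') }}}}\n"
    )


if __name__ == "__main__":
    print(render_staging_model(
        "raw", "trade_tick", {"event_time": "timestamp", "price": "double"}
    ))
```

Since the SQL is derived entirely from the column mapping, a change to the entity definition would flow into the staging model on the next generation run without anyone editing SQL by hand.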

🔌 Engine Agnostic

  • OpenLakehouse
  • DolphinDB
  • BigQuery
  • Snowflake
  • Databricks
  • DuckDB

✅ Data Quality

Four-layer validation, powered by Elementary for dbt-native monitoring and anomaly detection, ensures data integrity at every stage:

  1. Schema Validation — Pydantic models enforce type safety at parse time
  2. Field-level Tests — Per-column uniqueness, null checks, range constraints
  3. Cross-table Integrity — Referential consistency across CDM entities
  4. Business Rules — Domain-specific assertions (e.g., bid ≤ ask, positive volume)

Test failures are reported with row-level diagnostics. Configurable severity: warn, error, or abort. Elementary provides dashboards, alerts, and lineage-aware monitoring out of the box.
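
The business-rule layer and its severities can be sketched as follows. This is an assumption-laden example: the `Rule` class, the `evaluate` function, and the `AbortRun` exception are hypothetical, and the interpretation of the severities (warn and error are recorded, abort stops the run) is a guess at one reasonable behavior rather than documented semantics.

```python
from dataclasses import dataclass
from typing import Callable

SEVERITIES = ("warn", "error", "abort")


class AbortRun(Exception):
    """Raised when a rule with severity 'abort' fails (assumed behavior)."""


@dataclass(frozen=True)
class Rule:
    name: str
    predicate: Callable[[dict], bool]  # returns True when the row passes
    severity: str = "error"

    def __post_init__(self):
        if self.severity not in SEVERITIES:
            raise ValueError(f"{self.name}: unknown severity {self.severity!r}")


# The two example assertions named above, expressed as row-level predicates.
RULES = [
    Rule("bid_le_ask", lambda r: r["bid"] <= r["ask"], "abort"),
    Rule("positive_volume", lambda r: r["volume"] > 0, "warn"),
]


def evaluate(rows, rules):
    """Return (row_index, rule_name, severity) per failure; raise on an abort-level failure."""
    failures = []
    for i, row in enumerate(rows):
        for rule in rules:
            if not rule.predicate(row):
                failures.append((i, rule.name, rule.severity))
                if rule.severity == "abort":
                    raise AbortRun(f"row {i} failed {rule.name}")
    return failures
```

Returning the row index alongside the rule name mirrors the row-level diagnostics described above, so a failed check can be traced back to the offending record.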

📚 Design Docs

  • DataInfra Overview — Full CDM specification, feed provider YAML reference, dbt generation, metadata governance
  • Ingestion Pipeline — Connectors, processors, writers, and multi-source configuration
  • dbt Model Generator — Auto-generated staging, intermediate, and mart models
  • Data Quality — Four-layer validation architecture with Elementary integration
  • Metadata Specifications — Pydantic model reference: ProjectConfig, FeedProvider, CDMEntity, EngineConfig