Label Engine
Converts CDM bar data into supervised ML training labels. Six labeling methods (triple barrier, fixed horizon return, trend scanning, quantile binning, time-series sign, and meta-labeling) are exposed through a decorator-based registry with backend-agnostic I/O.
Problem & Approach
Feature computation answers "what does the market look like?" — supervised learning also needs "what should we predict?" The Label Engine transforms price data into target variables that downstream ML models learn to forecast.
| Method | Forward-Looking | Label Space | Primary Use Case |
|---|---|---|---|
| Triple Barrier | Yes (path scan) | +1, 0, -1 | Regime-adaptive take-profit/stop-loss |
| Fixed Horizon | Yes (shift) | Regression or discrete | Classic time-series forecasting |
| Trend Scanning | No (CUSUM) | +1, 0, -1 | Regime detection at observation time |
| Quantile Label | Yes (cross-sectional) | +1, 0, -1 | Relative-strength ranking across assets |
| Time-Series Sign | Yes (shift) | +1, 0, -1 | Direction classifier with noise band |
| Meta Labeling | Yes (requires primary model) | 0, 1 | Secondary model that predicts whether the primary model's label is correct |
All methods return a LabelResult — a named container with the result DataFrame, forward horizon (for training alignment), and output column list.
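As an illustration of the path-scan idea behind the triple-barrier row, here is a minimal pure-Python sketch. It is not the engine's implementation: the function name, the `BarrierLabel` container, and the tie-breaking rule (a bar that touches both barriers counts as a stop-loss) are assumptions made for this example. The parameter names mirror the `upper_barrier`, `lower_barrier`, and `vertical_barrier` keys used in the configuration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class BarrierLabel:
    label: int      # +1 upper barrier hit, -1 lower barrier hit, 0 vertical barrier (timeout)
    bars_held: int  # number of bars scanned before the label was decided


def triple_barrier(
    close: Sequence[float],
    high: Sequence[float],
    low: Sequence[float],
    upper_barrier: float = 0.02,
    lower_barrier: float = 0.02,
    vertical_barrier: int = 20,
) -> List[Optional[BarrierLabel]]:
    """Label each bar by the first barrier its forward price path touches."""
    n = len(close)
    labels: List[Optional[BarrierLabel]] = []
    for i in range(n):
        # Without a full forward window the label is undefined.
        if i + vertical_barrier >= n:
            labels.append(None)
            continue
        take_profit = close[i] * (1.0 + upper_barrier)
        stop_loss = close[i] * (1.0 - lower_barrier)
        result = BarrierLabel(label=0, bars_held=vertical_barrier)  # default: timeout
        for k in range(1, vertical_barrier + 1):
            j = i + k
            if low[j] <= stop_loss:
                # Assumption: if both barriers are touched in one bar, be conservative.
                result = BarrierLabel(label=-1, bars_held=k)
                break
            if high[j] >= take_profit:
                result = BarrierLabel(label=1, bars_held=k)
                break
        labels.append(result)
    return labels
```

The trailing `vertical_barrier` bars get no label because their forward window is incomplete, which is why a forward horizon has to travel with the result for training alignment.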
Architecture
Registry: A module-level dict maps method names to callables. Each method is a plain function decorated with @register("method_name"). Importing label_engine.methods triggers all decorators. New method = one function + one decorator.
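The registry pattern described above can be sketched in a few lines. This is a minimal sketch, not the project's code: `_REGISTRY`, `get_method`, the duplicate-name check, and the toy `time_series_sign` signature are assumptions for illustration; only the `@register("method_name")` decorator shape comes from the description.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Module-level mapping from method name to callable (name is hypothetical).
_REGISTRY: Dict[str, Callable] = {}


def register(name: str) -> Callable[[Callable], Callable]:
    """Decorator factory: store the decorated function under `name`."""
    def decorator(fn: Callable) -> Callable:
        if name in _REGISTRY:
            # Assumption: duplicate names are treated as an error.
            raise ValueError(f"label method {name!r} is already registered")
        _REGISTRY[name] = fn
        return fn  # the function itself is returned unchanged
    return decorator


def get_method(name: str) -> Callable:
    """Look up a registered method, with a readable error for unknown names."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise KeyError(
            f"unknown label type {name!r}; registered: {sorted(_REGISTRY)}"
        ) from None


@register("time_series_sign")
def time_series_sign(
    prices: Sequence[float], horizon: int = 1, threshold: float = 0.0
) -> List[Optional[int]]:
    """Sign of the forward return, with a noise band of +/- threshold."""
    labels: List[Optional[int]] = []
    for i in range(len(prices)):
        if i + horizon >= len(prices):
            labels.append(None)  # no forward price available
            continue
        ret = prices[i + horizon] / prices[i] - 1.0
        labels.append(1 if ret > threshold else -1 if ret < -threshold else 0)
    return labels
```

Because registration happens at import time, importing the module that defines the decorated functions is enough to populate the mapping.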
Orchestrator (LabelEngine): The single entry point. Reads LabelEngineConfig from project YAML, loads required CDM tables via CDMReader, dispatches each label definition to its registered method, collects LabelResult objects, and optionally persists via LabelWriter.
Configuration (LabelEngineConfig): A Pydantic model with two fields: historical_label_engine (default: "polars") and a labels list. Each entry in labels is a LabelDefinition.
Data Flow
Project YAML → LabelEngineConfig (Pydantic)
│
▼
CDM tables → CDMReader → {table_name: pl.DataFrame}
│
▼
dispatch to registered method
cdm_data, params, inputs → method()
│
▼
LabelResult (name, df, horizon, output_cols)
│
▼
LabelWriter.write_labels() → cdm_labels
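The dispatch step in the flow above might look roughly like the loop below. This is a sketch under stated assumptions: the real engine passes Polars DataFrames, while this example uses lists of dicts to stay dependency-free; `run_labels`, the method return shape, and the `cdm_time_bars` table name are hypothetical. The `LabelResult` fields follow the diagram (name, df, horizon, output_cols).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class LabelResult:
    name: str
    df: Any                 # a Polars DataFrame in the real engine; row dicts here
    horizon: int            # forward horizon, used later for training alignment
    output_cols: List[str]


def fixed_horizon_return(cdm_data, params, inputs):
    """Toy method: simple forward return over `horizon` bars."""
    bars = cdm_data["cdm_time_bars"]           # hypothetical table name
    price_col = inputs.get("price", "close")   # logical name -> actual column
    horizon = params.get("horizon", 1)
    rows = []
    for i in range(len(bars) - horizon):
        now, later = bars[i][price_col], bars[i + horizon][price_col]
        rows.append({
            "timestamp": bars[i]["start_time"],
            "forward_return": later / now - 1.0,
        })
    return rows, horizon, ["forward_return"]


def run_labels(
    definitions: List[Dict[str, Any]],
    methods: Dict[str, Callable],
    cdm_data: Dict[str, Any],
) -> List[LabelResult]:
    """Dispatch each label definition to its method and collect the results."""
    results = []
    for definition in definitions:
        method = methods[definition["type"]]
        rows, horizon, cols = method(
            cdm_data,
            definition.get("parameters", {}),
            definition.get("inputs", {}),
        )
        results.append(LabelResult(definition["name"], rows, horizon, cols))
    return results
```

Note how the method reads its price column through `inputs` rather than a hard-coded name, so the same function works if the CDM column is renamed.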
Input — CDMReader
Wraps the engine (DBEngine) to load CDM tables as Polars DataFrames. Knows the canonical time column per CDM table:
| Table | Time Column |
|---|---|
| cdm_trades, cdm_trade_enriched, cdm_lob_l2 | event_time |
| cdm_{type}_bars | start_time |
| ft_features | timestamp |
Supports symbol filtering, time-range filtering, column subsets, and custom sort ordering.
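To make the time-column lookup and filtering concrete, here is a small in-memory sketch. It does not use DBEngine or Polars; `time_column`, `filter_rows`, the half-open `[start, end)` range, and the default `(symbol, time)` sort are assumptions chosen for this example. Only the table-to-column mapping comes from the table above.

```python
import re
from typing import Any, Dict, Iterable, List, Optional, Sequence

# Fixed-name tables and their canonical time columns.
TIME_COLUMNS = {
    "cdm_trades": "event_time",
    "cdm_trade_enriched": "event_time",
    "cdm_lob_l2": "event_time",
    "ft_features": "timestamp",
}
# Bar tables follow the cdm_{type}_bars naming pattern.
_BAR_TABLE = re.compile(r"^cdm_\w+_bars$")


def time_column(table: str) -> str:
    """Return the canonical time column for a CDM table."""
    if table in TIME_COLUMNS:
        return TIME_COLUMNS[table]
    if _BAR_TABLE.match(table):
        return "start_time"
    raise KeyError(f"no known time column for table {table!r}")


def filter_rows(
    rows: Iterable[Dict[str, Any]],
    table: str,
    symbols: Optional[Sequence[str]] = None,
    start: Any = None,
    end: Any = None,
    columns: Optional[Sequence[str]] = None,
) -> List[Dict[str, Any]]:
    """Filter by symbol and time range, sort, then project columns."""
    tcol = time_column(table)
    kept = [
        row for row in rows
        if (symbols is None or row["symbol"] in symbols)
        and (start is None or row[tcol] >= start)
        and (end is None or row[tcol] < end)   # assumption: end is exclusive
    ]
    # Sort before projecting so the sort keys are still available.
    kept.sort(key=lambda r: (r["symbol"], r[tcol]))
    if columns is None:
        return kept
    return [{c: row[c] for c in columns} for row in kept]
```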
Output — LabelWriter
Labels are persisted to the cdm_labels table: symbol, timestamp, label_name, label_value (Int8), horizon, run_id, plus method-specific columns (forward_return, barrier_hit, cusum_value). The normalized schema enables cross-label querying, run provenance tracking, and incremental computation.
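The normalized (long) layout can be illustrated with a short sketch that turns per-bar label records into cdm_labels-shaped rows. The helper name `to_label_rows`, the decision to skip unlabeled bars, and the explicit Int8 range check are assumptions for this example; the column names follow the schema listed above.

```python
from typing import Any, Dict, Iterable, List, Sequence


def to_label_rows(
    label_name: str,
    horizon: int,
    labeled: Iterable[Dict[str, Any]],
    run_id: str,
    extra_cols: Sequence[str] = (),
) -> List[Dict[str, Any]]:
    """Convert per-bar label records into rows of the normalized schema."""
    rows = []
    for rec in labeled:
        value = rec["label_value"]
        if value is None:
            continue  # assumption: bars without a label are not persisted
        if not -128 <= value <= 127:
            raise ValueError(f"label_value {value} does not fit in Int8")
        row = {
            "symbol": rec["symbol"],
            "timestamp": rec["timestamp"],
            "label_name": label_name,
            "label_value": int(value),
            "horizon": horizon,
            "run_id": run_id,
        }
        # Method-specific columns such as forward_return or barrier_hit.
        for col in extra_cols:
            row[col] = rec.get(col)
        rows.append(row)
    return rows


def labels_for(rows: List[Dict[str, Any]], symbol: str, timestamp: Any) -> Dict[str, int]:
    """Cross-label query: every label value recorded for one (symbol, timestamp)."""
    return {
        r["label_name"]: r["label_value"]
        for r in rows
        if r["symbol"] == symbol and r["timestamp"] == timestamp
    }
```

Because every label shares one row shape, rows from different methods can be concatenated and filtered by `label_name` or `run_id` without schema changes.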
Configuration
label_engine:
  enabled: true
  historical_label_engine: polars
  labels:
    - name: triple_barrier_20_2pct
      type: triple_barrier
      description: Triple-barrier with 2% barriers, 20-period max horizon
      parameters:
        horizon: 20
        upper_barrier: 0.02
        lower_barrier: 0.02
        vertical_barrier: 20
      inputs:
        close: close
        high: high
        low: low
      bar_types: [time_1m]
    - name: fwd_return_5
      type: fixed_horizon_return
      parameters:
        horizon: 5
        return_type: simple
        binning:
          method: quantile
          n_bins: 3
      inputs:
        price: close
Each LabelDefinition has:
| Field | Required | Description |
|---|---|---|
| name | Yes | Unique identifier; stored as label_name in output |
| type | Yes | Must match a @register name |
| parameters | No | Method-specific (horizon, barriers, thresholds, binning) |
| inputs | No | Column name mappings (logical → actual CDM column) |
| dependencies | No | CDM tables required (defaults: bar tables) |
| bar_types | No | Filter to specific bar types |
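The field table above maps naturally onto a typed config object. The sketch below uses standard-library dataclasses instead of Pydantic so it runs without dependencies; the `from_raw` constructor, the uniqueness check on `name`, the optional `description` field (taken from the YAML example), and the `enabled` default are assumptions. In a Pydantic model, the same list-or-dict coercion would typically live in a validator that runs before field validation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union


@dataclass
class LabelDefinition:
    name: str                                   # required, unique
    type: str                                   # required, must match a registered method
    parameters: Dict[str, Any] = field(default_factory=dict)
    inputs: Dict[str, str] = field(default_factory=dict)
    dependencies: Optional[List[str]] = None    # None -> fall back to bar tables
    bar_types: Optional[List[str]] = None       # None -> no bar-type filter
    description: str = ""


@dataclass
class LabelEngineConfig:
    labels: List[LabelDefinition] = field(default_factory=list)
    historical_label_engine: str = "polars"
    enabled: bool = True                        # assumed default

    @classmethod
    def from_raw(cls, raw: Union[List[dict], Dict[str, Any]]) -> "LabelEngineConfig":
        """Accept either a bare list of labels or a {labels: [...]} mapping."""
        if isinstance(raw, list):
            raw = {"labels": raw}
        labels = [LabelDefinition(**item) for item in raw.get("labels", [])]
        names = [label.name for label in labels]
        if len(names) != len(set(names)):
            raise ValueError("label names must be unique")
        return cls(
            labels=labels,
            historical_label_engine=raw.get("historical_label_engine", "polars"),
            enabled=raw.get("enabled", True),
        )
```

A missing `name` or `type` raises a `TypeError` from the dataclass constructor, which is a rough stand-in for the validation errors a Pydantic model would report.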
Key Design Decisions
- Registry-based dispatch: `@register` makes methods self-registering. Add a method with one function + one decorator.
- Column indirection via `inputs` dict: methods never hard-code column names; `inputs: {close: close}` mappings decouple logic from schema.
- Normalized label storage: the `(symbol, timestamp, label_name, label_value, horizon, run_id)` schema supports cross-label querying and run provenance.
- Horizon-aware training alignment: `export_training_data` shifts label timestamps back by horizon so `feature(t)` aligns with `label(t + horizon)`, preventing look-ahead bias.
- Pydantic config: `LabelEngineConfig` accepts both bare list and `{labels: [...]}` dict forms.
- Backend-agnostic I/O: depends only on the `DBEngine` abstract interface, so it works across all engines.
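The horizon-aware alignment decision can be illustrated with a toy join. This sketch rests on one interpretation that the text does not spell out: each label is assumed to be keyed by the bar index at which its outcome becomes known (observation index plus horizon), and the export shifts that key back by `horizon` before joining with features. The function name and the integer bar indices are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def align_features_and_labels(
    features: List[Tuple[int, Any]],
    labels: List[Tuple[int, int]],
    horizon: int,
) -> List[Tuple[int, Any, int]]:
    """Pair feature(t) with the label resolved at t + horizon.

    features: (bar_index, feature_value) observed at bar_index
    labels:   (resolved_index, label) where resolved_index = t + horizon
    """
    # Shift each label back to the bar at which the prediction would be made.
    label_at_observation: Dict[int, int] = {
        resolved_index - horizon: label for resolved_index, label in labels
    }
    # Inner join: feature rows without a resolved label are dropped.
    return [
        (t, x, label_at_observation[t])
        for t, x in features
        if t in label_at_observation
    ]
```

With four feature rows and a horizon of 2, only the first two rows receive labels; the last two are dropped because their outcomes are not yet known, which is the property that keeps future information out of the training set.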