Label Engine

Converts CDM bar data into supervised ML training labels. Six labeling methods (triple barrier, fixed horizon return, trend scanning, quantile binning, time-series sign, and meta-labeling) are exposed through a decorator-based registry with backend-agnostic I/O.


Problem & Approach

Feature computation answers "what does the market look like?" — supervised learning also needs "what should we predict?" The Label Engine transforms price data into target variables that downstream ML models learn to forecast.

| Method | Forward-Looking | Label Space | Primary Use Case |
|---|---|---|---|
| Triple Barrier | Yes (path scan) | +1, 0, -1 | Regime-adaptive take-profit/stop-loss |
| Fixed Horizon | Yes (shift) | Regression or discrete | Classic time-series forecasting |
| Trend Scanning | No (CUSUM) | +1, 0, -1 | Regime detection at observation time |
| Quantile Label | Yes (cross-sectional) | +1, 0, -1 | Relative-strength ranking across assets |
| Time-Series Sign | Yes (shift) | +1, 0, -1 | Direction classifier with noise band |
| Meta Labeling | Yes (requires primary model) | 0, 1 | Secondary model of primary label correctness |

All methods return a LabelResult — a named container with the result DataFrame, forward horizon (for training alignment), and output column list.
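
To make the "path scan" idea behind the triple-barrier row concrete, here is a minimal pure-Python sketch. It is not the engine's implementation: it scans close prices only (the real method also maps high and low inputs), it labels a bar only when the full forward window is available, and the function name triple_barrier_labels is hypothetical. Parameter names follow the configuration example on this page.

```python
from typing import List, Optional


def triple_barrier_labels(
    close: List[float],
    upper_barrier: float = 0.02,
    lower_barrier: float = 0.02,
    vertical_barrier: int = 20,
) -> List[Optional[int]]:
    """Return +1 / -1 / 0 per bar, or None when the forward window is incomplete.

    +1: the upper (take-profit) barrier is touched first.
    -1: the lower (stop-loss) barrier is touched first.
     0: neither is touched before the vertical (time) barrier.
    """
    n = len(close)
    labels: List[Optional[int]] = []
    for i in range(n):
        end = i + vertical_barrier
        if end >= n:
            # Assumption of this sketch: skip bars without a full forward window.
            labels.append(None)
            continue
        label = 0
        for j in range(i + 1, end + 1):
            ret = close[j] / close[i] - 1.0  # simple return from the entry bar
            if ret >= upper_barrier:
                label = 1
                break
            if ret <= -lower_barrier:
                label = -1
                break
        labels.append(label)
    return labels


prices = [100.0, 101.0, 103.0, 102.0, 99.0, 97.0, 98.0, 100.0]
print(triple_barrier_labels(prices, 0.02, 0.02, vertical_barrier=3))
```

Because each label depends on up to vertical_barrier future bars, the trailing bars cannot be labeled; this is the reason the forward horizon is carried alongside the labels for training alignment.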


Architecture

Registry: A module-level dict maps method names to callables. Each method is a plain function decorated with @register("method_name"). Importing label_engine.methods triggers all decorators. New method = one function + one decorator.

Orchestrator (LabelEngine): The single entry point. Reads LabelEngineConfig from project YAML, loads required CDM tables via CDMReader, dispatches each label definition to its registered method, collects LabelResult objects, and optionally persists via LabelWriter.

Configuration (LabelEngineConfig): A Pydantic model with a historical_label_engine field (default: "polars") and a labels list. Each entry in the list is a LabelDefinition.


Data Flow

Project YAML → LabelEngineConfig (Pydantic)
        ↓
CDM tables → CDMReader → {table_name: pl.DataFrame}
        ↓
dispatch to registered method: (cdm_data, params, inputs) → method()
        ↓
LabelResult (name, df, horizon, output_cols)
        ↓
LabelWriter.write_labels() → cdm_labels

Input — CDMReader

Wraps the engine (DBEngine) to load CDM tables as Polars DataFrames. Knows the canonical time column per CDM table:

| Table | Time Column |
|---|---|
| cdm_trades, cdm_trade_enriched, cdm_lob_l2 | event_time |
| cdm_{type}_bars | start_time |
| ft_features | timestamp |

Supports symbol filtering, time-range filtering, column subsets, and custom sort ordering.
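
The table-to-time-column lookup could be expressed along these lines. This is a sketch only: the function name time_column_for and the pattern used to recognize bar tables are assumptions, and the real CDMReader may resolve the column differently.

```python
import re

# Fixed tables and their canonical time column, as listed above.
_TIME_COLUMNS = {
    "cdm_trades": "event_time",
    "cdm_trade_enriched": "event_time",
    "cdm_lob_l2": "event_time",
    "ft_features": "timestamp",
}

# Assumed pattern for the cdm_{type}_bars family of tables.
_BAR_TABLE = re.compile(r"cdm_.+_bars")


def time_column_for(table: str) -> str:
    """Return the canonical time column for a CDM table name."""
    if table in _TIME_COLUMNS:
        return _TIME_COLUMNS[table]
    if _BAR_TABLE.fullmatch(table):
        return "start_time"
    raise KeyError(f"no canonical time column known for table {table!r}")
```

Centralizing the lookup means time-range filters and sort ordering can be applied uniformly without each caller knowing which column a given table uses.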

Output — LabelWriter

Labels are persisted to the cdm_labels table: symbol, timestamp, label_name, label_value (Int8), horizon, run_id, plus method-specific columns (forward_return, barrier_hit, cusum_value). The normalized schema enables cross-label querying, run provenance tracking, and incremental computation.


Configuration

label_engine:
  enabled: true
  historical_label_engine: polars
  labels:
    - name: triple_barrier_20_2pct
      type: triple_barrier
      description: Triple-barrier with 2% barriers, 20-period max horizon
      parameters:
        horizon: 20
        upper_barrier: 0.02
        lower_barrier: 0.02
        vertical_barrier: 20
      inputs:
        close: close
        high: high
        low: low
      bar_types: [time_1m]

    - name: fwd_return_5
      type: fixed_horizon_return
      parameters:
        horizon: 5
        return_type: simple
        binning:
          method: quantile
          n_bins: 3
      inputs:
        price: close

Each LabelDefinition has:

| Field | Required | Description |
|---|---|---|
| name | Yes | Unique identifier, stored as label_name in output |
| type | Yes | Must match a @register name |
| parameters | No | Method-specific (horizon, barriers, thresholds, binning) |
| inputs | No | Column name mappings (logical → actual CDM column) |
| dependencies | No | CDM tables required (defaults: bar tables) |
| bar_types | No | Filter to specific bar types |
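
As a rough illustration of how these fields might be validated, the sketch below uses standard-library dataclasses as a stand-in for the Pydantic models. The helper parse_label_definitions is hypothetical, the description field is included only because it appears in the YAML example, and the duplicate-name check is an assumption that follows from name being described as a unique identifier. The sketch accepts either a bare list of definitions or a dict with a labels key.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union


@dataclass
class LabelDefinition:
    name: str                                   # required, unique
    type: str                                   # required, must match a registered method
    parameters: Dict[str, Any] = field(default_factory=dict)
    inputs: Dict[str, str] = field(default_factory=dict)
    dependencies: Optional[List[str]] = None    # None -> engine default (bar tables)
    bar_types: Optional[List[str]] = None       # None -> no bar-type filter
    description: Optional[str] = None


def parse_label_definitions(
    raw: Union[List[Dict[str, Any]], Dict[str, Any]]
) -> List[LabelDefinition]:
    """Accept either a bare list or a {labels: [...]} mapping."""
    entries = raw.get("labels", []) if isinstance(raw, dict) else raw
    definitions = [LabelDefinition(**entry) for entry in entries]
    seen = set()
    for d in definitions:
        if d.name in seen:
            raise ValueError(f"duplicate label name: {d.name!r}")
        seen.add(d.name)
    return definitions


bare = [{"name": "fwd_return_5", "type": "fixed_horizon_return",
         "parameters": {"horizon": 5}, "inputs": {"price": "close"}}]
wrapped = {"enabled": True, "labels": bare}
assert parse_label_definitions(bare) == parse_label_definitions(wrapped)
```

A missing name or type raises a TypeError from the dataclass constructor here; a Pydantic model would report a validation error instead.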

Key Design Decisions

  1. Registry-based dispatch: @register makes methods self-registering. Add a method with one function + one decorator.
  2. Column indirection via inputs dict: methods never hard-code column names; inputs: {close: close} mappings decouple logic from schema.
  3. Normalized label storage: the (symbol, timestamp, label_name, label_value, horizon, run_id) schema supports cross-label querying and run provenance.
  4. Horizon-aware training alignment: export_training_data shifts label timestamps back by horizon so that feature(t) aligns with label(t + horizon), preventing look-ahead bias.
  5. Pydantic config: LabelEngineConfig accepts both the bare-list form and the {labels: [...]} dict form.
  6. Backend-agnostic I/O: the engine depends only on the DBEngine abstract interface and works across all engines.
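
The horizon-aware alignment in item 4 can be illustrated with a small sketch. It assumes integer bar indices instead of timestamps, and it assumes each label is keyed by the bar at which its outcome becomes known (t + horizon); align_for_training is a hypothetical name, not the signature of export_training_data.

```python
from typing import Any, Dict, List


def align_for_training(
    features: Dict[int, Dict[str, Any]],
    labels: Dict[int, int],
    horizon: int,
) -> List[Dict[str, Any]]:
    """Pair feature(t) with label(t + horizon).

    features: bar index t -> feature values observed at t
    labels:   bar index at which the outcome is known -> label value
    """
    rows = []
    for resolved_at, value in sorted(labels.items()):
        observed_at = resolved_at - horizon  # shift the label back by the horizon
        feat = features.get(observed_at)
        if feat is None:
            continue  # no features at the observation bar, so drop the row
        rows.append({"t": observed_at, **feat, "label": value})
    return rows


features = {0: {"mom": 0.1}, 1: {"mom": -0.2}, 2: {"mom": 0.3}}
labels = {2: 1, 3: -1, 4: 0}
for row in align_for_training(features, labels, horizon=2):
    print(row)
```

Each training row therefore contains only information available at bar t on the feature side, while the target reflects what happened over the following horizon bars.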