Skip to main content

AST Compiler

The AST compiler transforms formula strings into IR DAG nodes. It uses Python's ast module to parse mathematical expressions, then walks the syntax tree to produce a list of IRNode objects — each representing one computation step. This is the bridge between human-readable formulas and the backend-agnostic IR.


Why Formulas

Before the formula compiler, FeatureTypes were defined with explicit steps: lists — manual DAG construction in YAML. The formula compiler replaces this with:

# Old: explicit steps
steps:
- primitive: TRANSFORM
op: sub
params: { a: best_bid_size, b: best_ask_size }
- primitive: WINDOW
op: diff
params: { input: bid_ask_diff, window: 1 }

# New: single formula string
formula: "diff(best_bid_size - best_ask_size, $window)"

The formula is more concise, less error-prone, and allows the compiler to optimize the resulting DAG. A single formula line replaces dozens of YAML lines.


Formula Language Reference

Arithmetic

OperatorPrimitiveExample
+TRANSFORM.adda + b
-TRANSFORM.suba - b
*TRANSFORM.mula * b
/TRANSFORM.diva / b
**TRANSFORM.powa ** 2

Unary Functions (TRANSFORM)

FunctionDescription
sign(x)Sign: -1, 0, or +1
abs(x)Absolute value
sqrt(x)Square root
log(x)Natural logarithm
tanh(x)Hyperbolic tangent (bounds to [-1, 1])
clip(x, lo, hi)Clamp to [lo, hi]
round(x), floor(x), ceil(x)Rounding

Comparisons (TRANSFORM)

OperatorDescription
a > b, a < bGreater-than, less-than
a >= b, a <= bGreater/less-or-equal
a == b, a != bEquality, inequality

Conditional

if(condition, true_value, false_value)

Compiles to TRANSFORM.conditional — evaluated row-by-row.

Window Functions (WINDOW)

Window functions operate over a fixed-size trailing window of rows.

FunctionDescriptionExtra Params
rolling_mean(x, window)Rolling average
rolling_std(x, window)Rolling standard deviation
rolling_sum(x, window)Rolling sum
rolling_min(x, window)Rolling minimum
rolling_max(x, window)Rolling maximum
rolling_median(x, window)Rolling median
rolling_skew(x, window)Rolling skewness
rolling_kurt(x, window)Rolling kurtosis
rolling_zscore(x, window)Rolling z-score: (x - μ) / σ
rolling_corr(x, y, window)Rolling correlation between two series
zscore(x, window)Alias for rolling_zscore

Lag / Diff (WINDOW)

FunctionDescription
diff(x, n)Difference from n periods ago: x[t] - x[t-n]
lag(x, n)Value n periods ago: x[t-n]
shift(x, n)Shift forward by n periods
pct_change(x, n)Percentage change: (x[t] - x[t-n]) / x[t-n]

Autocorrelation (WINDOW)

FunctionDescription
autocorr(x, lag)Positive autocorrelation at given lag
neg_autocorr(x, lag)Negative autocorrelation (for mean-reversion signals)

Entropy (WINDOW)

FunctionDescription
shannon_entropy(x, window)Shannon entropy over rolling window
sample_entropy(x, window)Sample entropy (pattern regularity)

Special Window Functions

FunctionDescription
rolling_percentile(x, window, q)q-th percentile over rolling window
historical_var(x, window, q)Historical Value-at-Risk (quantile)
consecutive_count(x, window)Count of consecutive above-threshold values
percentile_rank(x, window)Percentile rank of current value in rolling distribution
time_under_water(x, window)Duration since last peak
linear_regression_slope(x, window)Slope of linear regression over window
half_life(x, window)Estimated half-life of mean reversion
adx(high, low, close, window)Wilder's ADX trend strength indicator

State Accumulators (STATE)

State functions maintain cumulative memory across all prior rows — no fixed window.

FunctionDescription
ema(x, alpha)Exponential moving average: S[t] = α·x[t] + (1-α)·S[t-1]
decay_accum(x, decay)Decaying accumulator: S[t] = decay·S[t-1] + x[t]
cumsum(x, init)Cumulative sum starting from init

Utility

FunctionDescription
rsi_from_rs(rs)Convert Relative Strength to RSI: 100 - 100/(1+rs)
min(a, b), max(a, b)Two-argument min/max (decomposed to compare + conditional)

Column References vs Parameters

In a formula, bare names refer to CDM columns. Names prefixed with $ refer to configurable parameters from the FeatureType's parameters: block.

# FeatureType definition
name: ofi
required_inputs:
- best_bid_size # ← column reference in formula
- best_ask_size # ← column reference in formula
parameters:
window:
type: integer
default: 1
formula: "cumsum((diff(best_bid_size, $window) - diff(best_ask_size, $window)), 0)"
# ^^^ columns ^^^^^^ parameter ^^^^^^ parameter

During compilation, $window is resolved to its concrete value (e.g., 1) from the parameter defaults or feature overrides. Column names pass through as column references in the IR.

Python keywords that conflict with formula identifiers (e.g., return) are handled with a _kw_ prefix: _kw_return_ before parsing, restored to return after.


Compilation Pipeline

Step 1: Parse

Python's ast.parse() converts the formula string into an AST:

"sign(OFI) * abs(momentum)"


BinOp(op=Mult,
left= Call(func=Name('sign'), args=[Name('OFI')]),
right=Call(func=Name('abs'), args=[Name('momentum')]))

Step 2: Walk

FormulaIRCompiler._walk() recursively traverses the AST, producing IRNodes at each node:

AST NodeIRNode Produced
ast.ConstantTRANSFORM.lit — literal value
ast.NameTRANSFORM.col (if column) or TRANSFORM.lit (if $param)
ast.AttributeTRANSFORM.col with flattened dotted name (a.b.c)
ast.BinOpTRANSFORM.{add, sub, mul, div, pow}
ast.UnaryOpTRANSFORM.neg
ast.CompareTRANSFORM.{gt, lt, eq, neq, lte, gte}
ast.IfExpTRANSFORM.conditional
ast.CallDispatched to handler by function name

Step 3: Dispatch

Function calls (ast.Call) are dispatched by category using frozenset lookups from builtins.py:

if func_name in _UNARY_TRANSFORM_FUNCS: # sign, abs, sqrt, log, tanh
return make_transform_node(func_name, args[0])
elif func_name in _UNARY_WINDOW_FUNCS: # rolling_mean, rolling_std, ...
return make_window_node(func_name, args[0], args[1])
elif func_name in _STATE_FUNCS: # ema, decay_accum, cumsum
return make_state_node(func_name, args[0], args[1])
elif func_name in _AUTOCORR_FUNCS: # autocorr, neg_autocorr
return make_autocorr_node(func_name, args[0], args[1])
...

Step 4: Output

The compiler returns (nodes, root_id) — the full list of IRNodes and the ID of the root (final) node. The IR builder then assembles these into the full DAG.


Example: Full Compilation Walkthrough

Formula: "cumsum((diff(best_bid_size, $window) - diff(best_ask_size, $window)), 0)"

With window = 5:

Parse:
Call(func='cumsum', args=[
BinOp(op=Sub,
left= Call(func='diff', args=[Name('best_bid_size'), Constant(5)]),
right=Call(func='diff', args=[Name('best_ask_size'), Constant(5)])),
Constant(0)
])

Compiled IRNodes:
1. TRANSFORM.col → best_bid_size
2. WINDOW.diff → best_bid_size[t] - best_bid_size[t-5]
3. TRANSFORM.col → best_ask_size
4. WINDOW.diff → best_ask_size[t] - best_ask_size[t-5]
5. TRANSFORM.sub → diff_bid - diff_ask
6. STATE.cumsum → running total from 0

DAG edges:
1 → 2 (diff depends on col)
3 → 4 (diff depends on col)
2 → 5 (sub depends on both diffs)
4 → 5
5 → 6 (cumsum depends on sub)

The resulting DAG has 6 nodes. The compiler handles all dependency wiring automatically — no manual inputs: declarations needed.


Utility Functions

The ir/ast/utils.py module provides high-level entry points:

FunctionPurpose
formula_to_ir(formula, feature_columns, param_names)Compile formula → (nodes, root_id)
formula_to_steps(formula, output_name, param_names)Compile formula → List[ComputationStep] (for legacy interop)
extract_formula_variables(formula, param_names)Extract column references (excluding known functions and params)
ir_nodes_to_steps(nodes)Convert IRNode list → ComputationStep list

File Reference

FilePurpose
ir/ast/compiler.pyFormulaIRCompiler — main AST → IR compilation
ir/ast/handlers.pyFormulaHandlersMixin — per-function handler dispatch
ir/ast/builtins.pyFunction classification frozensets + helpers
ir/ast/utils.pyHigh-level formula_to_ir / formula_to_steps API