Optimizing Compiler: Pass Design
Date: 2026-04-04 Parent: ../index.md Related: primitive-catalog.md, backend-contract.md, ../architecture/tenferro-crates.md
I. Purpose
This document specifies the optimization passes that run while lowering computegraph CompiledProgram values into tenferro’s execution IR.
The current compiler entry points are:
compile_std_to_exec()compile_semiring_to_exec()
Both lower directly to ExecProgram. Passes run on ExecProgram in place; there is no intermediate StableHloProgram / StableHloOp layer anymore.
Before any pass runs, lowering also computes:
ExecInstruction::dtypeExecInstruction::output_shapesExecInstruction::last_use
The new output_shapes field is the key enabler for future DotDecomposer work: the compiler can now reason about reshapes and contraction outputs without reconstructing shapes from scratch downstream.
II. Current Status
Active passes
| Pass | Status | Purpose |
|---|---|---|
DotDimensionSorter |
active | Sort contracting dimensions on ExecOp::DotGeneral to stabilize canonical order |
TransposeFolding |
active | Fold producer ExecOp::Transpose into downstream ExecOp::DotGeneral dimension numbers |
DotDecomposer |
active | Canonicalize ExecOp::DotGeneral into [M?, K, B...] × [K, N?, B...] → [M?, N?, B...] by emitting explicit Transpose/Reshape wrappers |
DeadCodeElimination |
active | Drop instructions whose outputs are never consumed by a program output or later instruction (preserving side-effecting ValidateNonsingular) |
Removed pass
| Pass | Status | Reason |
|---|---|---|
ReductionSimplification |
removed | The pass was deleted during the 1-layer IR refactor; it is not part of the current compiler pipeline |
No-op category removed by IR cleanup
Structured linalg operations no longer travel through string CustomCall targets, so there is no separate “linalg passthrough” pass in the current pipeline. Svd, Qr, Lu, Eigh, Eig, TriangularSolve, and ValidateNonsingular are first-class ExecOp variants.
III. Pass Specifications
III.1 DotDimensionSorter
Scope: ExecProgram
Purpose: Normalize DotGeneral contracting dimensions into a stable sorted order when they are already consecutive up to permutation.
Why it still matters: TransposeFolding and future DotDecomposer logic both behave more predictably when contracting dimensions already have a canonical ordering.
When to fire:
ExecInstruction.opisExecOp::DotGeneral(config)- one side’s contracting dimensions are consecutive if sorted
- that side’s contracting dimensions are currently unsorted
Algorithm:
PROCEDURE SortDotDimensions(program):
FOR each instruction IN program.instructions:
IF instruction.op is not DotGeneral:
CONTINUE
lhs = instruction.op.lhs_contracting_dims
rhs = instruction.op.rhs_contracting_dims
IF lhs is consecutive_if_sorted AND lhs is not sorted:
perm = argsort(lhs)
lhs = apply_perm(lhs, perm)
rhs = apply_perm(rhs, perm)
ELSE IF rhs is consecutive_if_sorted AND rhs is not sorted:
perm = argsort(rhs)
lhs = apply_perm(lhs, perm)
rhs = apply_perm(rhs, perm)
Example:
Before:
lhs_contracting = [3, 2]
rhs_contracting = [2, 1]
After:
lhs_contracting = [2, 3]
rhs_contracting = [1, 2]
III.2 TransposeFolding
Scope: ExecProgram
Purpose: Remove producer Transpose instructions that feed directly into DotGeneral by rewriting the dot dimension numbers and bypassing the transpose output slot.
When to fire:
- the consumer instruction is
ExecOp::DotGeneral - the selected operand is produced by
ExecOp::Transpose - the permutation is valid for the operand rank
- the foldability checks pass
Current foldability rule: batch dimensions must remain fixed, and the operand must expose exactly one contracting dimension after the rewrite.
Algorithm:
PROCEDURE FoldTransposeIntoDot(program):
REPEAT until fixed point:
changed = false
FOR each DotGeneral instruction IN program.instructions:
FOR operand_idx IN {0, 1}:
producer = instruction producer for operand_idx
IF producer is not Transpose:
CONTINUE
perm = producer.permutation
IF transpose is foldable for this operand:
rewrite DotGeneral dimension numbers with perm
replace operand slot with the transpose input slot
changed = true
IF changed is false:
BREAK
Example:
Before:
t0 = Transpose(A, [1, 0])
t1 = DotGeneral(t0, B)
After:
t1 = DotGeneral(A, B) with rewritten dimension numbers
The pass runs to a fixed point so later DotGeneral instructions can absorb transposes exposed by earlier folds.
III.3 DotDecomposer
Scope: ExecProgram
Goal: Canonicalize arbitrary ExecOp::DotGeneral contractions into a shape directly consumable by canonical batched GEMM kernels. Runs after TransposeFolding, so pre-existing foldable transposes are already absorbed when decomposition kicks in.
Canonical form (tenferro column-major with batch trailing, per AGENTS.md):
LHS shape: [M?, K, B0, B1, ...]
lhs_contracting_dims = [|free_L|']
lhs_batch_dims = [|free_L|' + 1, ..., |free_L|' + nb]
|free_L|' = 1 if the original LHS had any free dim, else 0.
RHS shape: [K, N?, B0, B1, ...]
rhs_contracting_dims = [0]
rhs_batch_dims = [|free_R|' + 1, ..., |free_R|' + nb]
|free_R|' = 1 if the original RHS had any free dim, else 0.
Output shape (from shape_infer::dot_general_shape): [M?, N?, B0, B1, ...] with 0-or-1 M dim, 0-or-1 N dim, and the batch sizes trailing.
When to fire: any ExecOp::DotGeneral whose (config, lhs_rank, rhs_rank) is not already in canonical form.
Algorithm per non-canonical DotGeneral (inputs: LHS and RHS slots with shapes in slot_shapes):
- LHS Transpose to target order
free_L ++ contracting_L ++ batch_L. Skip if already identity. - LHS Reshape to
[M?, K, B...]by merging multiple free/contracting dims. Skip if|free_L| ≤ 1and|contracting_L| == 1. - RHS Transpose to target order
contracting_R ++ free_R ++ batch_R. Skip if already identity. - RHS Reshape to
[K, N?, B...]. Same skip rule as LHS. - Canonical
DotGeneralwith the dim-number positions described above. - Output Reshape — the XLA Step 5 that the pre-#729 skeleton was missing — restores the original output shape when either side had multiple free dims. It takes the canonical DotGeneral output and the original LHS and RHS slots as shape-providing inputs, so the dynamic free-dim sizes can be recovered at runtime.
Skipping the output Reshape (step 6) is what caused the historical downstream-Permute bug: downstream consumers would observe the merged rank instead of the expected unmerged rank, and subsequent Transpose/Reshape users would fall out of bounds. With the Reshape in place, downstream ops see the original output shape and no further rewriting is required.
Example (single non-canonical case):
Before:
DotGeneral lhs=[a, b, M, K1, K2] rhs=[a, b, K1, K2, N]
lhs_batch=[0, 1] rhs_batch=[0, 1]
lhs_cont=[3, 4] rhs_cont=[2, 3]
After:
Transpose(lhs, perm=[2, 3, 4, 0, 1]) -> [M, K1, K2, a, b]
Reshape(...) -> [M, K1*K2, a, b]
Transpose(rhs, perm=[2, 3, 4, 0, 1]) -> [K1, K2, N, a, b]
Reshape(...) -> [K1*K2, N, a, b]
DotGeneral(canonical) -> [M, N, a, b]
Interaction with TransposeFolding: TransposeFolding runs first and may leave dead Transpose producers (its folded-away producer is bypassed but not removed). The subsequent DeadCodeElimination reclaims that runtime cost. For programs where the user already wrote a canonical Transpose → DotGeneral, the fold+decomp combination is net-identity (the fold absorbs, the decomp re-emits), so the compiled result is equivalent to the input up to dead-code cleanup.
III.4 DeadCodeElimination
Scope: ExecProgram
Purpose: Remove instructions whose outputs are not consumed by any downstream instruction or program output. Preserves instructions with observable side effects (currently only ExecOp::ValidateNonsingular, which errors on singular matrices).
When to fire: always. In practice, instructions die after DotDecomposer re-canonicalizes a DotGeneral whose upstream Transpose was already folded into dim-numbers by TransposeFolding.
III.5 ReductionSimplification
ReductionSimplification no longer exists in the compiler. Historical notes about hoisting independent reductions remain useful background, but there is no implementation, no tests, and no pass-order slot for it in the current code.
IV. Pass Order
The current lowering pipeline is:
CompiledProgram<StdTensorOp or SemiringOp<_>>
│
│ lower directly to ExecProgram
│ infer dtype + output_shapes per instruction
▼
ExecProgram
│
│ 1. DotDimensionSorter
│ 2. TransposeFolding (fixed point)
│ 3. DotDecomposer
│ 4. DeadCodeElimination
│ 5. populate_last_use
▼
Ready for execution
DotDecomposer runs after TransposeFolding so that any pre-existing foldable transpose is absorbed into the DotGeneral dim numbers before the decomposition step canonicalizes the result. DeadCodeElimination runs last among the optimizations to remove bypassed instructions (typically Transpose producers the fold left behind) before populate_last_use records the final liveness map.
ReductionSimplification is intentionally absent because the pass was deleted.
V. Comparison to the Previous Execution Engine
| Previous behavior | Current status |
|---|---|
| Contracting-dimension normalization during execution planning | Moved into DotDimensionSorter on ExecProgram |
| Lazy transpose absorption around GEMM | Moved into TransposeFolding on ExecProgram |
Multi-step DotGeneral canonicalization |
Implemented in DotDecomposer (#729) with the full XLA-style algorithm, including the output Reshape step |
| Reduction hoisting before contraction | No dedicated pass today |
Linalg passthrough via string CustomCall |
Removed; linalg ops are structured ExecOp variants |
VI. Testing Expectations
The pass test suite should focus on:
- direct
ExecProgramfixtures - exact rewrites of
ExecOp::DotGeneraldimension metadata - fixed-point behavior for
TransposeFolding - preservation of output slots and
output_shapes
Regression tests live in tenferro/tests/compiler_passes.rs.