Backend Architecture
Date: 2026-05-28
Repos: tenferro-rs
Related: ../architecture/ad-pipeline.md, primitive-catalog.md, ../reference/stablehlo-primitives.md, ../reference/jax-primitives.md
I. Overview
All computation, primal and derivative, flows through the same execution pipeline.
Execution path
MaterializedGraph
│
│ compile
▼
CompiledProgram<StdTensorOp>
│
│ compile_std_to_exec()
│ - lower StdTensorOp -> ExecOp
│ - infer dtype and output_shapes
│ - run DotDimensionSorter + TransposeFolding + DotDecomposer
│ - run DeadCodeElimination
▼
ExecProgram
│
│ GraphExecutor::eval_exec_ir()
│ - owns backend runtime cache
│ - owns extension runtime registry/cache
│ - routes through segmented dispatch
▼
internal exec dispatch
│
├── unsegmented core-only path
└── segmented path
│
▼
TensorBackend / BackendSession dispatch
There is no in-process StableHloProgram / StableHloOp layer. The current runtime contract is centered on ExecProgram.
The optional XLA path is a peer executor over a compiled GraphProgram lowering view, not a native backend. tenferro-xla may inspect immutable lowering metadata, emit StableHLO, and load PJRT plugins at runtime, but it does not implement TensorBackend and does not participate in GraphExecutor<B> dispatch.
II. Execution IR
ExecProgram is the single in-process execution IR for standard tensor programs and registered extension runtimes.
pub struct ExecProgram {
    pub instructions: Vec<ExecInstruction>,
    pub input_slots: Vec<usize>,
    pub output_slots: Vec<usize>,
    pub n_slots: usize,
}
pub struct ExecInstruction {
    pub op: ExecOp,
    pub input_slots: Vec<usize>,
    pub output_slots: Vec<usize>,
    pub dtype: DType,
    pub output_shapes: Vec<Vec<DimExpr>>,
    pub output_extents: Vec<Vec<ShapeExtent<DimExpr>>>,
    pub last_use: Vec<bool>,
}
Core guarantees
- Each instruction is SSA over slots: outputs are written once.
- dtype is a representative instruction dtype used by legacy single-output paths. For multi-output extensions, per-output slot metadata is the authoritative dtype source.
- output_shapes contains one symbolic shape per output slot.
- output_extents contains one extent vector per output slot.
- last_use is populated after lowering and is used for buffer reclamation.
- Multi-output extension instructions write directly to multiple output slots and may use mixed output dtypes.
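The last_use guarantee can be illustrated with a minimal sketch. It assumes one flag per input slot, which the struct definition does not state, and it uses a trimmed-down hypothetical instruction type (InstSketch) and function name (populate_last_use) rather than the real tenferro types.

```rust
/// Hypothetical, trimmed-down instruction: slot wiring only, no op or metadata.
#[derive(Debug)]
pub struct InstSketch {
    pub input_slots: Vec<usize>,
    pub output_slots: Vec<usize>,
    /// Assumption: one flag per entry of `input_slots`.
    pub last_use: Vec<bool>,
}

/// Marks, for every instruction, which inputs are read for the final time.
pub fn populate_last_use(insts: &mut [InstSketch], program_outputs: &[usize], n_slots: usize) {
    // `live[s]` is true when slot `s` is still needed later in the program.
    // Program outputs must survive to the end, so they start out live.
    let mut live = vec![false; n_slots];
    for &slot in program_outputs {
        live[slot] = true;
    }
    // Walking backwards, the first read we meet is the last read in program order.
    for inst in insts.iter_mut().rev() {
        inst.last_use = vec![false; inst.input_slots.len()];
        for (i, &slot) in inst.input_slots.iter().enumerate() {
            if !live[slot] {
                inst.last_use[i] = true;
                live[slot] = true;
            }
        }
    }
}
```

A backward walk keeps the analysis linear in program size and needs no separate use-count table. If an instruction reads the same slot twice, only the first operand position is flagged, so a consumer reclaims the buffer once.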
ExecOp vocabulary
The execution IR keeps StableHLO-aligned naming where it remains useful, but the ops are the real runtime contract:
- Elementwise: Add, Multiply, Negate, Conj, Divide, Abs, Sign, Maximum, Minimum, Compare, Select, Clamp, Exp, Log, Sin, Cos, Tanh, Sqrt, Rsqrt, Pow, Expm1, Log1p
- Structural: Transpose, Reshape, BroadcastInDim, Convert, ExtractDiag, EmbedDiag, Tril, Triu
- Reductions: ReduceSum, ReduceProd, ReduceMax, ReduceMin
- Indexing / shape: Gather, GatherDynamicSliceSizes, Scatter, Slice, DynamicSlice, DynamicUpdateSlice, Pad, Concatenate, Reverse, ShapeOf, DynamicTruncate, PadToMatch
- Contraction: DotGeneral
- Extension boundary: Extension
- Constants: Constant
String-based CustomCall dispatch has been removed. Structured linalg variants are first-class extension operations owned by tenferro-linalg, not core ExecOp variants.
III. Lowering Contract
StdTensorOp lowering
compile_std_to_exec() consumes:
- CompiledProgram<StdTensorOp>
- input dtypes
- input shapes as Vec<DimExpr>
For each computegraph instruction it:
- infers output dtype, shape, and extent metadata; extension instructions use infer_extension_output_meta() for one (dtype, shape) pair per output slot
- resolves output extents
- lowers StdTensorOp to ExecOp
- records output slot dtype/shape/extent metadata
- runs the compiler passes on the resulting ExecProgram
- populates last_use
Current pass pipeline
The active optimizer passes are:
- DotDimensionSorter
- TransposeFolding
- DotDecomposer
- DeadCodeElimination
ReductionSimplification was deleted and is not part of the current backend contract.
For pass algorithms and rationale, see optimizer-passes.md.
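As a rough illustration of the final pass in that list, the sketch below shows what dead-code elimination over a slot-based IR could look like. It is a generic backward liveness sweep, not the algorithm documented in optimizer-passes.md; InstSketch and eliminate_dead are hypothetical names.

```rust
/// Hypothetical instruction carrying only slot wiring.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InstSketch {
    pub inputs: Vec<usize>,
    pub outputs: Vec<usize>,
}

/// Drops instructions whose outputs never reach a program output.
pub fn eliminate_dead(
    insts: Vec<InstSketch>,
    program_outputs: &[usize],
    n_slots: usize,
) -> Vec<InstSketch> {
    // Seed liveness with the slots the caller will read back.
    let mut live = vec![false; n_slots];
    for &slot in program_outputs {
        live[slot] = true;
    }
    // Backward sweep: an instruction is kept if any of its outputs is live,
    // and keeping it makes all of its inputs live in turn.
    let mut keep = vec![false; insts.len()];
    for (i, inst) in insts.iter().enumerate().rev() {
        if inst.outputs.iter().any(|&slot| live[slot]) {
            keep[i] = true;
            for &slot in &inst.inputs {
                live[slot] = true;
            }
        }
    }
    insts
        .into_iter()
        .zip(keep)
        .filter(|(_, kept)| *kept)
        .map(|(inst, _)| inst)
        .collect()
}
```

Because each slot is written once, a single backward sweep is enough; no fixed-point iteration is needed for straight-line programs.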
IV. Dispatch Categories
Execution is divided into three instruction categories.
Backend-session instructions
These run through BackendSession inside TensorBackend::with_backend_session(). They are the operations eligible for grouped segmented execution and, when supported by the backend, elementwise fusion planning.
Examples:
- elementwise ops
- structural ops such as Transpose, Reshape, BroadcastInDim
- reductions such as ReduceSum, ReduceProd, ReduceMax, ReduceMin
- indexing ops such as Gather, GatherDynamicSliceSizes, Scatter, Slice, DynamicSlice, DynamicUpdateSlice, Pad, Concatenate, Reverse
The helper that executes one such instruction is execute_backend_op().
Host instructions
These are handled without calling backend kernels:
- ShapeOf
- DynamicTruncate
- PadToMatch
- Constant
GatherDynamicSliceSizes resolves its symbolic slice_sizes against concrete runtime tensor shapes in the execution layer, then calls the backend through the normal concrete Gather path.
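The resolution step for symbolic slice sizes might look like the sketch below. The real DimExpr type is not shown in this document, so DimSketch, its three variants, and both function names are hypothetical stand-ins chosen only to show the idea of evaluating symbolic dimensions against concrete runtime shapes.

```rust
/// Hypothetical stand-in for a symbolic dimension expression.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DimSketch {
    /// A dimension known at compile time.
    Const(usize),
    /// "Axis `axis` of runtime input number `input`".
    InputDim { input: usize, axis: usize },
    /// Sum of two symbolic dimensions.
    Sum(Box<DimSketch>, Box<DimSketch>),
}

/// Evaluates one symbolic dimension; `None` if it refers to a missing
/// input or axis, or if the sum overflows.
pub fn resolve_dim(dim: &DimSketch, runtime_shapes: &[Vec<usize>]) -> Option<usize> {
    match dim {
        DimSketch::Const(n) => Some(*n),
        DimSketch::InputDim { input, axis } => {
            runtime_shapes.get(*input)?.get(*axis).copied()
        }
        DimSketch::Sum(a, b) => {
            resolve_dim(a, runtime_shapes)?.checked_add(resolve_dim(b, runtime_shapes)?)
        }
    }
}

/// Resolves every entry of a symbolic `slice_sizes` list, failing as a whole
/// if any single entry cannot be resolved.
pub fn resolve_slice_sizes(
    sizes: &[DimSketch],
    runtime_shapes: &[Vec<usize>],
) -> Option<Vec<usize>> {
    sizes.iter().map(|d| resolve_dim(d, runtime_shapes)).collect()
}
```

Once every entry is a plain usize, the instruction carries the same information as a concrete Gather, which is why the backend only needs the concrete path.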
Constant uses TensorBackend::upload_host_tensor() so device-specific execution still receives correctly placed tensors without implicit transfer of user-supplied inputs.
FFI / boundary instructions
These stay as single-instruction boundaries in segmented execution:
- DotGeneral
- Extension
DotGeneral dispatches through TensorBackend. Extension dispatch routes through the registered ExtensionRuntime for the operation family; linalg, einsum, and FFT register those runtimes from their owning crates.
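The three categories can be summarized as a classification function. The sketch below uses a hypothetical OpSketch enum with only a handful of the ExecOp names, plus hypothetical Category and categorize names; the real code may organize this differently.

```rust
/// Hypothetical subset of the op vocabulary, enough to show each category.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OpSketch {
    Add,
    Exp,
    Transpose,
    Reshape,
    ReduceSum,
    Gather,
    Slice,
    ShapeOf,
    DynamicTruncate,
    PadToMatch,
    Constant,
    DotGeneral,
    Extension,
}

/// The three dispatch categories described in this section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Category {
    BackendSession,
    Host,
    Boundary,
}

/// Maps an op to its dispatch category.
pub fn categorize(op: OpSketch) -> Category {
    use OpSketch::*;
    // No wildcard arm: adding a variant forces an explicit category decision.
    match op {
        ShapeOf | DynamicTruncate | PadToMatch | Constant => Category::Host,
        DotGeneral | Extension => Category::Boundary,
        Add | Exp | Transpose | Reshape | ReduceSum | Gather | Slice => {
            Category::BackendSession
        }
    }
}
```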
V. Segmented vs. Unsegmented Execution
GraphExecutor::eval_exec_ir() is the public execution entry point for an ExecProgram. It carries the backend cache, extension runtime registry, and extension runtime cache required to preserve dispatch invariants.
The segmented internal path groups fusible backend instructions:
ExecProgram
│
▼
segment_exec_program()
│
├── fused backend-session segments
├── single-instruction FFI segments
└── single-instruction host segments
Segmented execution exists to:
- reuse one backend execution session across consecutive backend ops
- enable elementwise fusion planning where the backend supports it
- preserve the same observable behavior as unsegmented execution
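The grouping rule can be sketched as a single pass over per-instruction categories. This is an illustrative approximation of what segment_exec_program() is described as doing, not its actual implementation; Category, Segment, and segment are hypothetical names, and the sketch ignores any additional fusibility checks the real code may apply.

```rust
use std::ops::Range;

/// Hypothetical per-instruction dispatch category.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Category {
    BackendSession,
    Host,
    Boundary,
}

/// Hypothetical segment description, expressed as instruction indices.
#[derive(Debug, PartialEq, Eq)]
pub enum Segment {
    /// A run of consecutive backend-session instructions.
    Fused(Range<usize>),
    /// A single FFI / boundary instruction.
    Boundary(usize),
    /// A single host instruction.
    Host(usize),
}

/// Groups consecutive backend-session instructions; everything else stays alone.
pub fn segment(categories: &[Category]) -> Vec<Segment> {
    let mut segments = Vec::new();
    // Start index of the backend-session run currently being collected, if any.
    let mut run_start: Option<usize> = None;
    for (i, cat) in categories.iter().enumerate() {
        match cat {
            Category::BackendSession => {
                if run_start.is_none() {
                    run_start = Some(i);
                }
            }
            other => {
                // A non-fusible instruction closes the open run first.
                if let Some(start) = run_start.take() {
                    segments.push(Segment::Fused(start..i));
                }
                segments.push(match other {
                    Category::Host => Segment::Host(i),
                    _ => Segment::Boundary(i),
                });
            }
        }
    }
    if let Some(start) = run_start {
        segments.push(Segment::Fused(start..categories.len()));
    }
    segments
}
```

Segments preserve instruction order, so executing them in sequence visits the same instructions in the same order as the unsegmented path.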
The unsegmented internal path evaluates one instruction at a time and is used for parity checks and narrow owner-scoped extension-runtime composition. It is not a general public execution surface. Extension instructions must run through a registered ExtensionRuntime; missing runtime registration is an error, not a fallback to ExtensionOp::eager_execute().
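The "missing registration is an error" rule can be sketched as a registry lookup that returns a Result instead of falling back. All names here (ExtensionRuntimeSketch, RegistrySketch, DispatchError, DoubleRuntime) and the slice-based signature are hypothetical; the real ExtensionRuntime interface is not shown in this document.

```rust
use std::collections::HashMap;

/// Hypothetical runtime interface for one extension family.
pub trait ExtensionRuntimeSketch {
    fn execute(&self, inputs: &[f64]) -> Vec<f64>;
}

#[derive(Debug, PartialEq, Eq)]
pub enum DispatchError {
    /// No runtime was registered for the named operation family.
    MissingRuntime(String),
}

/// Hypothetical registry keyed by operation family name.
#[derive(Default)]
pub struct RegistrySketch {
    runtimes: HashMap<String, Box<dyn ExtensionRuntimeSketch>>,
}

impl RegistrySketch {
    pub fn register(&mut self, family: &str, runtime: Box<dyn ExtensionRuntimeSketch>) {
        self.runtimes.insert(family.to_string(), runtime);
    }

    /// Dispatches to the registered runtime; never falls back to eager execution.
    pub fn dispatch(&self, family: &str, inputs: &[f64]) -> Result<Vec<f64>, DispatchError> {
        match self.runtimes.get(family) {
            Some(runtime) => Ok(runtime.execute(inputs)),
            None => Err(DispatchError::MissingRuntime(family.to_string())),
        }
    }
}

/// Toy runtime used only to exercise the registry: doubles every element.
pub struct DoubleRuntime;

impl ExtensionRuntimeSketch for DoubleRuntime {
    fn execute(&self, inputs: &[f64]) -> Vec<f64> {
        inputs.iter().map(|x| x * 2.0).collect()
    }
}
```

Returning an error keeps a misconfigured executor from silently taking a different code path than the one the owning crate registered.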
The engine uses last_use metadata to reclaim buffers via BackendSession::reclaim_buffer() or TensorBackend::reclaim_buffer().
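A minimal sketch of last_use-driven reclamation follows. It assumes one flag per input and models a buffer as an optional Vec<f64> per slot; the real engine calls reclaim_buffer() on the backend or session instead of dropping a Vec. OpSketch, InstSketch, and run are hypothetical names.

```rust
/// Hypothetical two-op vocabulary, enough to build a small program.
#[derive(Debug, Clone, Copy)]
pub enum OpSketch {
    Add,
    Mul,
}

/// Hypothetical binary instruction with per-input last-use flags.
pub struct InstSketch {
    pub op: OpSketch,
    pub inputs: [usize; 2],
    pub output: usize,
    pub last_use: [bool; 2],
}

/// Runs the program and returns the output buffer plus the number of
/// buffers released before the end of execution.
pub fn run(
    insts: &[InstSketch],
    inputs: Vec<Vec<f64>>,
    n_slots: usize,
    output: usize,
) -> (Vec<f64>, usize) {
    let mut slots: Vec<Option<Vec<f64>>> = vec![None; n_slots];
    for (i, buf) in inputs.into_iter().enumerate() {
        slots[i] = Some(buf);
    }
    let mut reclaimed = 0;
    for inst in insts {
        let result: Vec<f64> = {
            let a = slots[inst.inputs[0]].as_ref().expect("input slot must be live");
            let b = slots[inst.inputs[1]].as_ref().expect("input slot must be live");
            a.iter()
                .zip(b)
                .map(|(x, y)| match inst.op {
                    OpSketch::Add => x + y,
                    OpSketch::Mul => x * y,
                })
                .collect()
        };
        slots[inst.output] = Some(result);
        // Release every input whose final read was this instruction.
        for (&slot, &last) in inst.inputs.iter().zip(&inst.last_use) {
            if last && slots[slot].take().is_some() {
                reclaimed += 1;
            }
        }
    }
    (slots[output].take().expect("output slot must be live"), reclaimed)
}
```

For simplicity the sketch also releases caller-supplied input buffers; whether the real engine does so is not stated here.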
VI. Backend Traits
TensorBackend
TensorBackend is the full standard-algebra backend surface in crates/tenferro-tensor/src/backend.rs.
It includes:
- elementwise arithmetic and analytic ops
- structural ops
- reductions
- dot_general
- indexing ops
- with_backend_session
- download_to_host
- upload_host_tensor
- reclaim_buffer
BackendSession is the session-local companion trait used by grouped backend execution. Backends may override with_backend_session() to install one shared execution scope, for example a CPU thread-pool context.
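The session pattern can be sketched as a scoped callback: the backend sets up one execution scope, hands a session object to a closure, and tears the scope down when the closure returns. The traits and signatures below (SessionSketch, BackendSketch, with_session) are hypothetical simplifications, not the actual definitions in crates/tenferro-tensor/src/backend.rs.

```rust
/// Hypothetical session-local interface with a single op.
pub trait SessionSketch {
    fn add(&mut self, lhs: &[f64], rhs: &[f64]) -> Vec<f64>;
}

/// Hypothetical backend interface exposing only the scoped-session entry point.
pub trait BackendSketch {
    fn with_session<R>(&mut self, f: impl FnOnce(&mut dyn SessionSketch) -> R) -> R;
}

/// Toy CPU backend that counts how many scopes and ops it has seen.
pub struct CpuSketch {
    pub scopes_opened: usize,
    pub ops_run: usize,
}

struct CpuSession<'a> {
    ops_run: &'a mut usize,
}

impl SessionSketch for CpuSession<'_> {
    fn add(&mut self, lhs: &[f64], rhs: &[f64]) -> Vec<f64> {
        *self.ops_run += 1;
        lhs.iter().zip(rhs).map(|(a, b)| a + b).collect()
    }
}

impl BackendSketch for CpuSketch {
    fn with_session<R>(&mut self, f: impl FnOnce(&mut dyn SessionSketch) -> R) -> R {
        // A real backend could install a thread-pool context here, once per group.
        self.scopes_opened += 1;
        let mut session = CpuSession { ops_run: &mut self.ops_run };
        f(&mut session)
    }
}
```

Taking a closure lets the backend bound the lifetime of the shared scope: several ops run inside one with_session call and share whatever the backend installed.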
Custom operation families do not add a second backend trait. They lower to ExecOp::Extension and dispatch through their registered ExtensionRuntime. The owning extension crate is responsible for deciding whether that runtime uses the active TensorBackend, a provider-specific library, or an internal implementation.
VII. Layout and Device Contract
Layout
All runtime tensors are dense contiguous column-major tensors. The backend contract does not expose arbitrary stride-aware dispatch in ExecProgram.
This means:
- ExecProgram does not encode layout transforms as a separate concern
- backends receive dense tensors and can assume column-major layout
- compile-time shape reasoning is symbolic, but runtime storage layout is not
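For reference, dense contiguous column-major storage means the first axis varies fastest in memory. The helper names below (column_major_strides, linear_offset) are illustrative and not taken from the tenferro codebase.

```rust
/// Element strides for a dense column-major tensor: the stride of axis `k`
/// is the product of the extents of all axes before it.
pub fn column_major_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = Vec::with_capacity(shape.len());
    let mut stride = 1;
    for &extent in shape {
        strides.push(stride);
        stride *= extent;
    }
    strides
}

/// Linear element offset of a multi-index, or `None` if the index has the
/// wrong rank or is out of bounds.
pub fn linear_offset(index: &[usize], shape: &[usize]) -> Option<usize> {
    if index.len() != shape.len() || index.iter().zip(shape).any(|(i, n)| i >= n) {
        return None;
    }
    let offset = index
        .iter()
        .zip(column_major_strides(shape))
        .map(|(i, s)| i * s)
        .sum::<usize>();
    Some(offset)
}
```

For a 2x3 matrix the strides are [1, 2], so element (1, 2) sits at offset 1*1 + 2*2 = 5, the last of the six elements.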
Device transfer
tenferro does not perform implicit CPU<->GPU transfer for user-visible backend ops. Tensors must already be on the correct device for the backend call.
The execution engine handles only two internal conveniences:
- Constant can be auto-uploaded through upload_host_tensor()
- host-only metadata/scalar operations can inspect or materialize tiny host values as part of execution
Unsupported backend operations must return an error rather than silently falling back across devices.
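The error-instead-of-fallback rule can be sketched as explicit checks at the top of a backend op. Device, TensorSketch, BackendError, and BackendSketch are hypothetical types, and the supports_sqrt flag is an invented stand-in for whatever capability check a real backend performs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Device {
    Cpu,
    Gpu(u32),
}

/// Hypothetical tensor: placement lives on the value, not in the IR.
#[derive(Debug, PartialEq)]
pub struct TensorSketch {
    pub device: Device,
    pub data: Vec<f64>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BackendError {
    /// The input lives on a different device than the backend.
    WrongDevice { expected: Device, found: Device },
    /// The backend has no kernel for this op.
    Unsupported { op: &'static str, device: Device },
}

pub struct BackendSketch {
    pub device: Device,
    /// Invented capability flag, used only to exercise the error path.
    pub supports_sqrt: bool,
}

impl BackendSketch {
    pub fn sqrt(&self, input: &TensorSketch) -> Result<TensorSketch, BackendError> {
        // No implicit transfer: a misplaced input is reported, not copied.
        if input.device != self.device {
            return Err(BackendError::WrongDevice {
                expected: self.device,
                found: input.device,
            });
        }
        // No cross-device fallback: a missing kernel is reported as well.
        if !self.supports_sqrt {
            return Err(BackendError::Unsupported { op: "sqrt", device: self.device });
        }
        Ok(TensorSketch {
            device: self.device,
            data: input.data.iter().map(|x| x.sqrt()).collect(),
        })
    }
}
```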
Placement
ExecProgram is placement-agnostic. Device placement lives on runtime Tensor values, not in the compiled IR.
VIII. Relation to StableHLO
StableHLO is now a reference vocabulary, not an in-process IR layer.
What remains true:
- many StdTensorOp / ExecOp names intentionally align with StableHLO
- StableHLO and JAX documentation are still useful semantic references
- future external serialization could target StableHLO if the project adds such a backend later
What is no longer true:
- there is no StableHloProgram
- there is no StableHloOp
- there is no lower_to_stablehlo() step in the live execution pipeline
- there is no GetTupleElement-style tuple indexing in runtime IR
IX. File Ownership
The current implementation is split across:
- crates/tenferro-runtime/src/compiler/mod.rs
- crates/tenferro-runtime/src/shape_infer.rs
- crates/tenferro-runtime/src/exec.rs
- crates/tenferro-runtime/src/segment.rs
- crates/tenferro-runtime/src/graph/executor.rs
- crates/tenferro-tensor/src/backend.rs
Those files are the source of truth for the live backend contract. This document is intentionally a high-level summary of that code.