tenferro-rs Internal Design
Repo: tenferro-rs Parent: ../index.md Related: computegraph.md, chainrules.md, tidu.md, ../spec/backend-contract.md, ../spec/primitive-catalog.md
I. Purpose
This document defines the internal crate structure and type design of tenferro-rs. The key design driver is that all computation is graph-based: every operation (einsum, linalg, elementwise) produces nodes in a Fragment<Op>, and execution is always lazy through materialize_merge -> compile -> eval.
II. Architecture Migration
Removed crates
The previous architecture organized around eager execution families and tape-based AD. These were replaced by the graph + fragment model:
| Previous crate | Current | Reason |
|---|---|---|
| internal/ad-core | deleted | Fragment replaces tape |
| internal/ad-ops | → tenferro-ops PrimitiveOp impl | AD rules live on StdTensorOp |
| internal/ad-linalg | → tenferro-ops PrimitiveOp impl | AD rules in ops/ad/linalg.rs |
| internal/ad-surface | → tidu-rs differentiate/transpose | External crate |
| internal/frontend-core | → tenferro TracedTensor | Lazy, not eager |
| internal/runtime | → tenferro Engine | |
| tenferro-dynamic-compute | deleted | Always graph |
| tenferro-tensor-compute | → tenferro-ops | |
| tenferro-linalg-prims | → tenferro-ops | No need to separate |
| tenferro-capi | deferred | Phase 4+ |
| extension/* | deferred | |
Retained crates
| Previous crate | Current crate | Notes |
|---|---|---|
| tenferro-device | tenferro-device | Mostly unchanged |
| tenferro-algebra | tenferro-algebra | Mostly unchanged |
| tenferro-tensor | tenferro-tensor | Simplified |
| tenferro-prims | tenferro-ops | Rewritten: single StdTensorOp enum |
| tenferro-einsum | tenferro-einsum | Rewritten: graph builder |
| tenferro-linalg | → tenferro-ops + tenferro | AD rules → tenferro-ops, LAPACK kernels → tenferro backend |
| tenferro (facade) | tenferro | TracedTensor, Engine, backends |
29 crates → 6 crates (plus 3 external: computegraph-rs, chainrules-rs, tidu-rs).
III. Crate Dependency Graph
```
tenferro-device
      |
tenferro-algebra
      |
tenferro-tensor ──── tensor runtime crate (data types, kernels, TensorBackend, backends)
      |
tenferro-ops ─────── computegraph-rs (GraphOp, Fragment)
      |              chainrules-rs (PrimitiveOp)
      |
      ├── tenferro-einsum (SemiringOps → Fragment construction)
      |
tenferro ──────────── tidu-rs (differentiate, transpose)
(TracedTensor, Engine, backends)
```
IV. Two Op Types
The fundamental design constraint is that GraphOp::Operand is an associated type, so a single Op type can only serve one Operand type. Since standard algebra (Tensor) and custom algebras (TropicalTensor, etc.) have different Operand types, tenferro provides two Op types:
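This constraint can be made concrete with a hedged sketch (simplified trait, not the real computegraph-rs signatures; all `*Sketch` names are hypothetical):

```rust
// Illustrative sketch (not the real computegraph-rs API): `Operand` is an
// associated type, so each Op type commits to exactly one operand type.
struct Tensor;         // stand-in for the standard-algebra operand
struct TropicalTensor; // stand-in for a custom-algebra operand

trait GraphOpSketch {
    type Operand; // one operand type per Op type
    fn name(&self) -> &'static str;
}

enum StdOpSketch { Add, DotGeneral }
enum TropicalOpSketch { Dot }

impl GraphOpSketch for StdOpSketch {
    type Operand = Tensor; // StdTensorOp analogue: serves Tensor only
    fn name(&self) -> &'static str { "std" }
}

impl GraphOpSketch for TropicalOpSketch {
    type Operand = TropicalTensor; // SemiringOp<T> analogue
    fn name(&self) -> &'static str { "tropical" }
}
```

Because `Operand` is fixed per implementation, a single enum cannot serve both `Tensor` and `TropicalTensor` graphs, which is what forces the two-Op-type design.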
StdTensorOp — standard algebra, full vocabulary, AD-capable
StdTensorOp is a flat enum whose variants mostly mirror StableHLO ops 1:1 (documented exceptions: composite lowerings like Conj, multi-output linalg ops like Svd). It implements GraphOp (only — not EvalGraphOp), PrimitiveOp, and SemiringOps. There is no GraphOp::eval; all execution flows through the backend pipeline.
Canonical definition: spec/primitive-catalog.md (Section IV – Tenferro IR Vocabulary). AD trait (PrimitiveOp): spec/ad-contract.md.
SemiringOp<T> — custom algebra, semiring subset, no AD
SemiringOp<T> is a generic wrapper around SemiringOpKind that implements GraphOp only (not EvalGraphOp). It delegates algebraic ops to free functions in host_ops (dispatched through TensorBackend) and structural ops to algebra-independent free functions. PrimitiveOp is not implemented: there is no AD for custom algebras.
Canonical definition: spec/primitive-catalog.md (Section IV).
Users extend tenferro by implementing SemiringBackend<Alg> (algebraic ops + kernel dispatch) for their algebra type, then use SemiringOp<Alg> as the op type. Structural ops (transpose, reshape, broadcast_in_dim) are provided automatically by the execution engine.
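As an example of the scalar-algebra half of this extension point, a max-plus (tropical) algebra might look like the following. `SemiringSketch` is a simplified stand-in, since the canonical Semiring/SemiringBackend signatures live in spec/backend-contract.md:

```rust
// Hedged sketch: `SemiringSketch` stands in for the real Semiring trait.
// It shows the four methods a new scalar algebra provides.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Tropical(f64); // max-plus semiring: "add" = max, "mul" = +

trait SemiringSketch: Copy {
    fn zero() -> Self;
    fn one() -> Self;
    fn add(a: Self, b: Self) -> Self;
    fn mul(a: Self, b: Self) -> Self;
}

impl SemiringSketch for Tropical {
    fn zero() -> Self { Tropical(f64::NEG_INFINITY) } // identity of max
    fn one() -> Self { Tropical(0.0) }                // identity of +
    fn add(a: Self, b: Self) -> Self { Tropical(a.0.max(b.0)) }
    fn mul(a: Self, b: Self) -> Self { Tropical(a.0 + b.0) }
}
```

With such a type in place, the kernel-dispatch half (`SemiringBackend<Alg>`) only needs a GEMM over these scalars; structural ops come for free from the engine.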
VI. SemiringOps Trait — Generic Einsum
SemiringOps bridges both StdTensorOp and SemiringOp<T> so that einsum Fragment construction is algebra-agnostic. Both op types implement it.
Canonical definition: spec/primitive-catalog.md.
Einsum is algebra-agnostic:
```rust
fn build_einsum_fragment<Op: SemiringOps>(
    builder: &mut FragmentBuilder<Op>,
    path: &ContractionPath,
    inputs: &[ValRef<Op>],
) -> LocalValId {
    // Constructs DotGeneral, Transpose, Reshape, etc. nodes
    // Does not know which algebra is in use
}
```

The contraction path optimization is also algebra-agnostic (it only depends on shapes and subscripts):
```rust
fn optimize_contraction_path(
    subscripts: &Subscripts,
    shapes: &[&[usize]],
) -> ContractionPath;
```

VII. Einsum: N-ary to Graph
N-ary einsum is decomposed into a graph of binary operations:
```
einsum("ij,jk,kl->il", A, B, C)
        |
        | optimize_contraction_path (shape-based, algebra-agnostic)
        v
ContractionPath: [(A,B) -> T, (T,C) -> result]
        |
        | build_einsum_fragment<Op: SemiringOps>
        v
Fragment<Op>:
  t0 = DotGeneral(A, B, {contract=[j]})   // "ij,jk->ik"
  t1 = DotGeneral(t0, C, {contract=[k]})  // "ik,kl->il"
```
Each binary contraction step may insert Transpose, Reshape, or BroadcastInDim nodes as needed to align axes for DotGeneral.
For standard algebra, the resulting Fragment<StdTensorOp> can be differentiated and transposed by tidu-rs. For custom algebras, Fragment<SemiringOp<T>> goes directly to materialize_merge -> compile -> eval.
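To see why the decomposition is sound, here is a plain-loop sketch (illustrative only; standard algebra, row-major f64 buffers, no relation to the real kernels) showing that the three-operand chain reduces to two pairwise contractions:

```rust
// Illustrative: einsum("ij,jk,kl->il", A, B, C) == matmul(matmul(A, B), C),
// i.e. the chain is exactly the two DotGeneral steps in the ContractionPath.
fn matmul(a: &[f64], b: &[f64], m: usize, k: usize, n: usize) -> Vec<f64> {
    // a is m x k, b is k x n, result is m x n (row-major)
    let mut out = vec![0.0; m * n];
    for i in 0..m {
        for j in 0..k {
            for l in 0..n {
                out[i * n + l] += a[i * k + j] * b[j * n + l];
            }
        }
    }
    out
}
```

Each DotGeneral node in the fragment corresponds to one such pairwise step; the path optimizer only chooses the pairing order.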
VIII. Backend Architecture — Single Execution IR
Design principle
All execution flows through a single in-process execution IR:
```
CompiledProgram<StdTensorOp>
        │
        │ compile_std_to_exec()
        │   - infer per-instruction dtype + output_shapes
        │   - run DotDimensionSorter + TransposeFolding
        ↓
ExecProgram / ExecOp / ExecInstruction
        │
        ├── eval_exec_ir() / segmented dispatch → TensorBackend
        └── eval_semiring_ir() → SemiringBackend + shared structural helpers
```
compile_std_to_exec() and compile_semiring_to_exec() lower directly from computegraph’s CompiledProgram into ExecProgram. There is no in-process StableHloProgram / StableHloOp layer anymore. StableHLO remains a useful semantic reference for naming and op design, but it is not part of the current runtime pipeline.
For custom algebras (SemiringOp<T>), the same direct lowering applies: compile_semiring_to_exec() produces ExecProgram, and eval_semiring_ir() dispatches algebra-dependent ops through SemiringBackend<Alg> while reusing the same structural helpers and pass pipeline.
Execution IR
ExecInstruction carries the ExecOp, slot wiring, inferred dtype, inferred output_shapes, and liveness metadata (last_use). The compiler lowers StdTensorOp and SemiringOpKind almost 1:1 into ExecOp, with structured linalg variants (Svd, Qr, Lu, Eigh, Eig, TriangularSolve, ValidateNonsingular) represented directly instead of via stringly-typed custom calls.
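A non-normative sketch of that instruction shape (field types are simplified stand-ins; the canonical definition is in spec/backend-contract.md):

```rust
// Non-normative sketch of the execution-IR instruction described above.
#[derive(Debug)]
enum ExecOpSketch {
    DotGeneral,                     // dimension numbers elided
    Transpose { perm: Vec<usize> },
    Svd,                            // structured linalg variant, not a custom call
}

#[derive(Debug)]
struct ExecInstructionSketch {
    op: ExecOpSketch,
    inputs: Vec<usize>,             // slot wiring
    dtype: &'static str,            // inferred per-instruction dtype (simplified)
    output_shapes: Vec<Vec<usize>>, // inferred output shapes
    last_use: bool,                 // liveness metadata
}
```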
Pass pipeline
The optimizing compiler now runs directly on ExecProgram. The active passes are DotDimensionSorter and TransposeFolding; DotDecomposer is deferred to issue #729 now that per-instruction shape tracking is available.
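The idea behind TransposeFolding can be sketched as permutation composition (a simplified model; the real pass and its applicability conditions are in spec/optimizer-passes.md):

```rust
// Simplified model of TransposeFolding: Transpose(Transpose(x, inner), outer)
// folds into a single Transpose(x, composed), where output axis i of the
// folded op reads input axis inner[outer[i]].
fn compose_perms(inner: &[usize], outer: &[usize]) -> Vec<usize> {
    outer.iter().map(|&i| inner[i]).collect()
}
```

When the composed permutation is the identity, both Transpose nodes disappear entirely.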
Canonical ExecOp definition and pass list: spec/backend-contract.md. Pass algorithms: spec/optimizer-passes.md.
Generic execution engine
The generic engine interprets ExecProgram by dispatching each instruction to TensorBackend methods, standard kernels, or common infrastructure, depending on the dispatch category.
(illustrative, non-normative – see spec/backend-contract.md for canonical definition)
Backend trait
TensorBackend (defined in tenferro-tensor) is the single backend trait that encapsulates standard tensor kernel dispatch. TensorExec is the session-scoped companion trait used for batches of backend ops inside one execution context. CpuBackend lives in tenferro-tensor and implements TensorBackend; the CubeCL GPU backend is partial and feature-gated.
Canonical trait signatures: spec/backend-contract.md.
Standard and custom algebra backends
The standard backend path is CompiledProgram<StdTensorOp> -> compile_std_to_exec() -> ExecProgram -> eval_exec_ir(). Custom algebra backends implement SemiringBackend<Alg> and follow the analogous compile_semiring_to_exec() -> eval_semiring_ir() path.
Engine<B: TensorBackend> is the top-level entry point that orchestrates lowering + compilation + execution. TensorBackend (in tenferro-tensor) is the kernel-level trait that backend authors implement to provide kernels.
See spec/backend-contract.md for the canonical trait signatures.
Backend dispatch in Engine
```rust
struct Engine<B: TensorBackend> {
    backend: B,
    compile_cache: CompileCache,
    einsum_cache: EinsumCache,
}
```

For custom algebras, users construct their own evaluation pipeline:
```rust
let path = optimize_contraction_path(&subscripts, &shapes);
let fragment = build_einsum_fragment::<TropicalOp>(&mut builder, &path, &inputs);
let view = resolve(vec![fragment]);
let graph = materialize_merge(&view, &outputs);
let prog = compile(&graph);

// Choose backend
let mut backend = TropicalGpuBackend::new(cuda_ctx);
let result = backend.eval_program(&prog, &input_tensors);
```

IX. TracedTensor and Engine
TracedTensor is the user-facing lazy type for standard algebra:
```rust
struct TracedTensor {
    id: TracedTensorId,
    rank: usize,
    dtype: DType,
    fragment: Arc<Fragment<StdTensorOp>>,
    val: LocalValId,
    data: Option<Tensor>,
    // ... internal fields omitted
}
```

Key operations:
```rust
impl TracedTensor {
    /// Create from concrete data
    fn from(tensor: Tensor) -> Self;

    /// Lazy evaluation (single output, no intermediate sharing)
    fn eval<B: TensorBackend>(&mut self, engine: &mut Engine<B>) -> Result<&Tensor>;

    /// VJP: differentiate → transpose (via tidu-rs), still lazy
    fn grad(&self, wrt: &TracedTensor) -> TracedTensor;

    /// JVP: differentiate only (via tidu-rs), still lazy
    fn jvp(&self, wrt: &TracedTensor, tangent: &TracedTensor) -> TracedTensor;
}

/// Evaluate multiple outputs together.
/// All fragments are resolved into one MaterializedGraph, so shared
/// intermediate nodes (primal values needed by both output and gradient)
/// are computed only once via GlobalValKey deduplication.
fn eval_all<B: TensorBackend>(
    engine: &mut Engine<B>,
    outputs: &mut [&mut TracedTensor],
) -> Result<Vec<Tensor>>;
```

eval_all is the recommended API when primal outputs and their derivatives are needed together. Single-output eval is a convenience wrapper.
For custom algebras, users work with Fragment<SemiringOp<T>> and CompiledProgram<SemiringOp<T>> directly through the computegraph-rs API, without TracedTensor.
X. User Extension Points
| Goal | What to implement |
|---|---|
| New scalar algebra for einsum (CPU) | Semiring (4 methods) + SemiringBackend<Alg>::gemm (1 method) |
| Custom GPU backend for custom algebra | impl SemiringBackend<Alg> for MyGpuBackend (gemm + overrides) |
| Custom standard backend | impl TensorBackend for MyBackend |
| AD for custom algebra | Define own Op enum, impl PrimitiveOp (advanced) |
The minimal extension path (CPU, e.g., tropical semiring):
- Define the algebra type; impl `Semiring` — `zero()`, `one()`, `add()`, `mul()`
- `impl SemiringBackend<MyAlgebra> for CpuBackend` — only `gemm()` is required (e.g., call `tropical_gemm`); `batched_gemm`, `add`, `mul`, `reduce_sum` have defaults using strided-kernel + the `Semiring` trait
- Use `SemiringOp<MyAlgebra>` as the Op type — einsum + compile + eval work immediately via `eval_semiring_ir`
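A hypothetical `tropical_gemm` body for the second step (names, layout, and signature are illustrative, not the canonical SemiringBackend trait):

```rust
// Illustrative max-plus GEMM over row-major f64 buffers: "zero" is -inf
// (identity of max), "mul" is +, "add" is max. A real impl would take the
// crate's typed tensor views instead of raw slices.
fn tropical_gemm(a: &[f64], b: &[f64], m: usize, k: usize, n: usize) -> Vec<f64> {
    let mut out = vec![f64::NEG_INFINITY; m * n];
    for i in 0..m {
        for p in 0..k {
            for j in 0..n {
                let cand = a[i * k + p] + b[p * n + j];
                if cand > out[i * n + j] {
                    out[i * n + j] = cand;
                }
            }
        }
    }
    out
}
```

This single kernel is enough for einsum over the tropical algebra, since the engine supplies all structural ops and the defaults cover the remaining semiring ops.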
Adding a GPU backend for custom algebra:
- `impl SemiringBackend<MyAlgebra> for MyGpuBackend` — provide `gemm()` with GPU kernels; override `batched_gemm`, `add`, `mul`, `reduce_sum` if optimized GPU versions exist
- Use the same `CompiledProgram<SemiringOp<MyAlgebra>>` — graph construction and compilation are backend-agnostic
XI. Backend Traits
The Operand trait has been removed from computegraph-rs entirely.
TensorBackend – standard algebra, full op set
TensorBackend (defined in tenferro-tensor) covers all ops for standard algebra. Operates on Tensor (type-erased). CpuBackend implements this trait. CudaBackend is a partial stub (feature-gated).
Canonical definition: spec/backend-contract.md.
SemiringBackend<Alg: Semiring> – custom algebra, semiring ops only
SemiringBackend<Alg> (defined in tenferro-tensor) covers semiring ops for custom algebra. Operates on TypedTensor<Alg::Scalar> (typed). User provides only gemm() (single GEMM); batched_gemm, add, mul, reduce_sum have default implementations using strided-kernel + Semiring trait methods.
The two traits are independent (no supertrait relationship).
Canonical definition: spec/backend-contract.md.
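The single-required-method pattern described above can be sketched as a trait with a default implementation (simplified signatures; `SemiringSketch`, `SemiringBackendSketch`, `MaxPlus`, and `CpuSketch` are all hypothetical names, not the canonical traits):

```rust
// Simplified sketch of the default-method pattern: `gemm` is the only
// required method; `add` has a default built from the Semiring trait.
trait SemiringSketch: Copy {
    fn add(a: Self, b: Self) -> Self;
}

trait SemiringBackendSketch<A: SemiringSketch> {
    /// Required: the backend's single GEMM kernel (shape handling elided).
    fn gemm(&self, a: &[A], b: &[A]) -> Vec<A>;

    /// Default: elementwise add via the Semiring trait, shared by backends.
    fn add(&self, a: &[A], b: &[A]) -> Vec<A> {
        a.iter().zip(b).map(|(&x, &y)| A::add(x, y)).collect()
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
struct MaxPlus(f64);

impl SemiringSketch for MaxPlus {
    fn add(a: Self, b: Self) -> Self { MaxPlus(a.0.max(b.0)) }
}

struct CpuSketch;

impl SemiringBackendSketch<MaxPlus> for CpuSketch {
    // Stub kernel: a real impl would perform a max-plus GEMM here.
    fn gemm(&self, a: &[MaxPlus], _b: &[MaxPlus]) -> Vec<MaxPlus> {
        a.to_vec()
    }
}
```

A backend then overrides the defaulted methods only when it has faster kernels, which is exactly the override story the GPU extension path relies on.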
Structural ops – algebra-independent free functions
transpose, reshape, broadcast_in_dim, extract_diagonal, embed_diagonal are free functions in tenferro-tensor::cpu::structural. They are the same for all algebras. For standard algebra, TensorBackend methods delegate to these. For custom algebra, eval_semiring_ir calls them directly.
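To illustrate why these functions can be algebra-independent, here is a minimal 2-D transpose (illustrative; the real implementations in tenferro-tensor::cpu::structural are strided and n-dimensional). It only moves elements by index bookkeeping and never invokes any algebra operation:

```rust
// Structural ops touch layout, not values: transpose works for any element
// type T, so one implementation serves Tensor and TypedTensor<Alg::Scalar>.
fn transpose_2d<T: Copy>(data: &[T], rows: usize, cols: usize) -> Vec<T> {
    let mut out = Vec::with_capacity(data.len());
    for c in 0..cols {
        for r in 0..rows {
            out.push(data[r * cols + c]); // read column-by-column
        }
    }
    out
}
```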
XII. Per-Crate Contents
tenferro-device
Defines the placement vocabulary and shared runtime errors. Placement contains memory_kind plus resident_device, while ComputeDevice remains a separate notion for execution. Public memory kinds follow JAX/XLA-style names: Device, PinnedHost, UnpinnedHost, and Other(String).
tenferro-algebra
Provides SemiringAlgebra trait, StandardAlgebra, scalar type constraints.
tenferro-tensor
Tensor runtime crate. No AD-related code. Organized by backend target.
- `types.rs` — `TypedTensor<T>` (contiguous-only, no strides), `Tensor` enum, `Buffer<T>`, `DType`, `Placement`, `MemoryKind`, `ComputeDevice`
- `config.rs` — `DotGeneralConfig`, `CompareDir`, `GatherConfig`, `ScatterConfig`, `SliceConfig`, `PadConfig` (moved from tenferro-ops to avoid a dependency cycle)
- `backend.rs` — `TensorBackend` trait, `SemiringBackend<Alg>` trait
- `cpu/` — CPU backend:
  - `backend.rs` — `CpuBackend: impl TensorBackend`
  - `elementwise.rs` — strided-kernel: add, mul, neg, conj, div, abs, exp, log, …
  - `reduction.rs` — strided-kernel: reduce_sum, reduce_prod, reduce_max, reduce_min
  - `structural.rs` — strided-kernel: transpose, broadcast_in_dim, extract_diagonal; dedicated: reshape (metadata only), embed_diagonal
  - `indexing.rs` — gather, scatter, slice, pad, concatenate, reverse
  - `gemm/` — `faer_gemm.rs` (cpu-faer), `blas_gemm.rs` (cpu-blas)
  - `linalg/` — `faer_linalg.rs` (cpu-faer), `lapack_linalg.rs` (cpu-blas)
- `cuda/` — CUDA backend (feature-gated)
- `rocm/` — ROCm backend (feature-gated, future)
No naive CPU loop fallbacks. All CPU kernels use strided-kernel (elementwise, reduction, structural), faer or BLAS (GEMM), faer or LAPACK (linalg). Exactly one of cpu-faer or cpu-blas must be enabled (compile_error! enforced).
tenferro-ops
The core crate:
- `SemiringOpKind` enum (shared vocabulary, used only in `SemiringOp<T>`)
- `SemiringOps` trait
- `SemiringOp<T>` generic wrapper + `impl GraphOp` (graph construction only, no eval — execution is dispatched through `TensorBackend`)
- `StdTensorOp` enum — flat, most variants mirror a StableHLO op 1:1 (documented exceptions: `Conj`, multi-output linalg)
- `impl GraphOp for StdTensorOp` (graph construction only, no eval)
- `impl PrimitiveOp for StdTensorOp` (linearize + transpose_rule)
- `impl SemiringOps for StdTensorOp` — maps to flat variants directly
- `TensorInputKey` + `impl ADKey`
Depends on: computegraph-rs, chainrules-rs, tenferro-tensor.
tenferro-einsum
Graph builder for N-ary einsum:
- `Subscripts` parsing and validation
- `ContractionPath` optimization
- `build_einsum_fragment<Op: SemiringOps>` (algebra-agnostic)
Depends on: computegraph-rs, tenferro-ops.
tenferro
Top-level facade:
- `TracedTensor` (lazy graph-aware wrapper)
- `Engine` (compilation cache, backend dispatch via `TensorBackend` from tenferro-tensor, einsum cache)
- Public API: `einsum()`, `grad()`, `jvp()`, `eval()`, `eval_all()`
- `compile_std_to_exec()` (`CompiledProgram<StdTensorOp>` → `ExecProgram`)
- `compile_semiring_to_exec()` (`CompiledProgram<SemiringOp<T>>` → `ExecProgram`)
- Optimizing compiler passes on `ExecProgram`: `DotDimensionSorter`, `TransposeFolding`; `DotDecomposer` deferred to issue #729
- `ExecProgram`, `ExecOp`, `ExecInstruction`
- Generic execution engine: `eval_exec_ir()` — interprets `ExecProgram`, dispatches to `TensorBackend` methods (from tenferro-tensor)
- Generic semiring execution engine: `eval_semiring_ir()` — interprets `ExecProgram`, dispatches to `SemiringBackend<Alg>` plus shared structural helpers
- Standard backend: `CpuBackend` (in tenferro-tensor) — `ExecProgram` → generic engine → faer/BLAS/LAPACK
Depends on: all of the above + tidu-rs.
XIII. Implementation Status
Phases 1–3 (scalar fragment AD, tensor primitives + einsum, linalg + backends) are implemented and tested. Current work focuses on:
- Custom algebra end-to-end (tropical semiring)
- GPU backend expansion (CUDA kernels)
- C-API (FFI for Julia/Python)
- Logical-DAG-aware checkpoint scheduling
- Operator fusion in compiled IR