Primitive Catalog
Date: 2026-04-04 Parent: ../index.md Related: backend-contract.md, tensor-semantics.md, ../reference/stablehlo-primitives.md, ../reference/jax-primitives.md
I. Purpose
This document answers the question:
What exactly counts as a “primitive” or “instruction” in the current design at each level of the IR hierarchy, and what does each op mean?
Normative status: this file is the source of truth for the primitive and instruction-set vocabulary that tenferro is expected to implement at all levels. If another design document has a shorter summary and the two disagree, this file wins.
The design docs use “primitive” and “instruction” in several nearby but different senses. For readability, this document separates them explicitly:
| Layer | Example | Meaning |
|---|---|---|
| Surface API | `einsum`, `sum`, `mean`, `grad`, `svd()` | what users call |
| Tenferro IR | `DotGeneral`, `ReduceSum`, `BroadcastInDim` | what may appear as `StdTensorOp` / `SemiringOp<T>` nodes in a `Fragment`; fragment construction, AD, einsum decomposition |
| StableHLO IR | `StableHloOp` variants | serializable to StableHLO MLIR for standard algebra; the single cut point between graph/AD and backends. The XLA backend takes this directly (standard algebra only) |
| Execution IR | StableHLO ops + `BatchedGemm` − `DotGeneral` | output of the optimizing compiler; input to faer / custom backends |
| Backend kernel | BLAS GEMM, cuSOLVER SVD, IREE module, faer routine | how an instruction is executed |
The 3 IR layers
```
┌─────────────────────────────────────────┐
│              Tenferro IR                │
│      (StdTensorOp / SemiringOp<T>)      │
│   Fragment construction, AD, einsum     │
└──────────────┬──────────────────────────┘
               │ lower_to_stablehlo()
┌──────────────▼──────────────────────────┐
│             StableHLO IR                │
│            (StableHloOp)                │
│        Serializable cut point           │
│    XLA backend takes this directly      │
└──────────────┬──────────────────────────┘
               │ optimizing compiler
┌──────────────▼──────────────────────────┐
│             Execution IR                │
│              (ExecOp)                   │
│  faer / custom backends execute this    │
│     stride-aware engine dispatch        │
└─────────────────────────────────────────┘
```
This document uses three orthogonal classifications:
- backend-facing execution architecture (Section III): the 2-level IR (StableHLO IR and Execution IR), optimizing compiler, backend traits, and execution engine
- Tenferro IR vocabulary (Section V): what the graph / AD stack talks about at the Tenferro IR level
- standard arithmetic extensions (Section VI): ops available only for ordinary dense numeric types
Important distinctions:
- `differentiate`, `transpose`, `resolve`, and `compile` are transforms, not primitives.
- `einsum` is surface syntax, not a final persistent Tenferro IR primitive. It is lowered into Tenferro IR primitives such as `DotGeneral`, `Mul`, `Transpose`, `BroadcastInDim`, and `ReduceSum`.
- High-level linalg ops such as `SVD` and `Solve` may remain explicit Tenferro IR primitives because they are meaningful semantic units, even though their derivative rules emit lower-level primitives.
- `StdTensorOp` is flat (no `SemiringOpKind` wrapping). Most variants map 1:1 to a StableHLO op. Documented exceptions include composite lowerings (e.g., `Conj` -> 4 ops: `real` + `imag` + `negate` + `complex`) and multi-output linalg ops (e.g., `Svd` -> `custom_call` + `get_tuple_element` x N).
- `Tensor` allows arbitrary strides at the user level. Input pre-processing happens at `eval()` time: contiguous data (including permuted-contiguous views from `permute()` or `.t()`) is passed as-is with zero copy, preserving strides; only truly non-contiguous data (memory gaps from slicing) is physically copied to a contiguous buffer. No StableHLO ops are inserted for input normalization: the StableHLO program is layout-independent. The execution engine is stride-aware and handles permuted inputs via BLAS trans flags at dispatch time.
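To make the `eval()`-time rule concrete, here is an illustrative sketch (not the tenferro implementation; `is_permuted_contiguous` is a hypothetical helper, and strides are assumed to be in elements) of the check that separates zero-copy permuted-contiguous views from truly non-contiguous data:

```rust
/// Sketch: decide whether (shape, strides) describe a permuted-contiguous
/// view (zero-copy at eval() time) or truly non-contiguous data (must be
/// copied to a contiguous buffer). Hypothetical helper, strides in elements.
fn is_permuted_contiguous(shape: &[usize], strides: &[usize]) -> bool {
    // Sort axes by descending stride; a permuted-contiguous tensor then
    // looks exactly like a dense layout with no gaps.
    let mut axes: Vec<usize> = (0..shape.len()).collect();
    axes.sort_by(|&a, &b| strides[b].cmp(&strides[a]));
    let mut expected = 1usize;
    for &ax in axes.iter().rev() {
        if shape[ax] != 1 && strides[ax] != expected {
            return false; // gap in memory, e.g. a view produced by slicing
        }
        expected *= shape[ax];
    }
    true
}

fn main() {
    // Contiguous [2, 3]: strides [3, 1] -> zero copy.
    assert!(is_permuted_contiguous(&[2, 3], &[3, 1]));
    // Permuted view of the same buffer: strides [1, 3] -> still zero copy.
    assert!(is_permuted_contiguous(&[3, 2], &[1, 3]));
    // Column slice of a [2, 3] buffer: shape [2, 2], strides [3, 1] has a gap.
    assert!(!is_permuted_contiguous(&[2, 2], &[3, 1]));
}
```

The design choice this illustrates: permutation only reorders strides without creating gaps, so it never forces a copy; slicing creates gaps, so it does.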
Responsibility boundary:
- `chainrules-rs` owns the `PrimitiveOp` contract
- `tidu-rs` owns generic AD transforms that call `linearize` and `transpose_rule`
- tenferro owns the concrete per-op derivative rules
So this directory keeps the primitive vocabulary and cross-crate architecture, but not a standalone per-op transpose-rule manual. Detailed formulas are a downstream tenferro design/implementation concern.
II. Reading the Tables
Shape notation
- `x: [b, m, n]` means `x` is a rank-3 tensor with shape `(b, m, n)`.
- Scalars are rank-0 tensors and are written as `[]`.
- Batch dimensions are written explicitly; nothing is implied by position alone.
No implicit broadcasting
Elementwise primitives such as `Add` and `Mul` do not silently broadcast. If shapes differ, the graph must contain an explicit `BroadcastInDim`.
Transpose vs AD transpose
`Transpose(perm)` is the tensor operation "permute axes". It is unrelated to the AD transform `transpose(linear_fragment)`.
Multi-output primitives
Some primitives produce multiple outputs:
- `SVD(A) -> (U, S, Vt)`
- `QR(A) -> (Q, R)`
The output ordering must be part of the primitive definition because GlobalValKey includes output_slot.
Column-major (Fortran) convention
Engine-produced intermediates and outputs use column-major (Fortran) ordering. This is the convention for all data produced by the execution engine. Input tensors may be contiguous with arbitrary axis ordering; the engine inspects strides and adjusts dispatch accordingly (e.g., BLAS trans flags for transposed inputs).
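As an illustration of stride-aware dispatch (a sketch under stated assumptions, not the engine's actual code; `gemm_operand_dispatch` and `Trans` are hypothetical names, strides in elements), a rank-2 GEMM operand under the column-major convention can be routed to a BLAS trans flag like this:

```rust
/// Hypothetical sketch: pick the BLAS trans flag and leading dimension for a
/// rank-2 operand under the column-major convention. Strides in elements.
#[derive(Debug, PartialEq)]
enum Trans {
    NoTrans, // data already column-major: strides == [1, nrows]
    Trans,   // row-major view (e.g. from .t()): strides == [ncols, 1]
}

fn gemm_operand_dispatch(shape: [usize; 2], strides: [usize; 2]) -> Option<(Trans, usize)> {
    let [rows, cols] = shape;
    if strides == [1, rows] {
        Some((Trans::NoTrans, rows)) // lda = rows of the stored matrix
    } else if strides == [cols, 1] {
        Some((Trans::Trans, cols)) // lda of the underlying column-major buffer
    } else {
        None // neither layout: copy to a contiguous buffer before dispatch
    }
}

fn main() {
    // 4x3 column-major matrix: dispatch with trans = 'N', lda = 4.
    assert_eq!(gemm_operand_dispatch([4, 3], [1, 4]), Some((Trans::NoTrans, 4)));
    // The same buffer seen through .t(): a 3x4 view, dispatch with trans = 'T'.
    assert_eq!(gemm_operand_dispatch([3, 4], [4, 1]), Some((Trans::Trans, 4)));
    // A gapped view (e.g. from slicing): no direct dispatch.
    assert_eq!(gemm_operand_dispatch([2, 2], [3, 1]), None);
}
```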
What tenferro is expected to implement
From this document’s point of view, the implementation target is:
- implement the 2-level IR architecture and backend traits defined below
- implement the AD-closed graph core defined below
- implement the "Standard arithmetic only" primitives when tenferro claims standard dense numeric support
- treat control-flow primitives as future work, not part of the initial required set
III. Relationship to Backend Execution
The backend pipeline, Execution IR dispatch categories, backend trait signatures (SemiringCore, SemiringFastPath), generic execution engine, buffer lifecycle, and memory layout are owned by backend-contract.md.
Key relationships:
- StableHLO IR uses the Tenferro IR ops from Section V. For standard algebra, it is serializable to StableHLO MLIR. For custom algebra, it has the same structure but semiring semantics (Add = ⊕, Mul = ⊗); the XLA path is not available.
- Execution IR = StableHLO ops + `BatchedGemm` − `DotGeneral`.
- Add/Mul dispatch is algebra-dependent (see backend-contract.md).
- Custom algebra minimum: `batched_gemm` + `reduce_sum` via `SemiringCore`.
- Optimizing compiler: see optimizer-passes.md.
IV. Core Traits (canonical signatures)
GraphOp
GraphOp is the operation node trait. computegraph-rs is fully generic over it and never references specific primitives.
```rust
trait GraphOp: Clone + Debug + Hash + Eq + Send + Sync + 'static {
    type Operand: Operand;
    type Context;
    type InputKey: Clone + Debug + Hash + Eq + Send + Sync + 'static;

    fn n_inputs(&self) -> usize;
    fn n_outputs(&self) -> usize;
    fn eval(&self, ctx: &mut Self::Context, inputs: &[&Self::Operand]) -> Vec<Self::Operand>;
}
```
Operand
Operand is the runtime value type. Defined in computegraph-rs/src/traits.rs. Contains both algebraic and structural methods — computegraph-rs is a tensor computation graph engine, not a fully generic DAG engine.
```rust
pub trait Operand: Clone + Send + Sync + 'static {
    fn zero(shape: &[usize]) -> Self;
    fn one(shape: &[usize]) -> Self;
    fn reshape(&self, shape: &[usize]) -> Self;
    fn broadcast_in_dim(&self, shape: &[usize], dims: &[usize]) -> Self;
    fn add(&self, other: &Self) -> Self;
    fn multiply(&self, other: &Self) -> Self;
    fn reduce_sum(&self, axes: &[usize]) -> Self;
    fn dot_general(
        &self, other: &Self,
        lhs_contracting: &[usize], rhs_contracting: &[usize],
        lhs_batch: &[usize], rhs_batch: &[usize],
    ) -> Self;
    fn conj(&self) -> Self;
}
```
`TensorData` (see tensor-semantics.md) provides additional buffer-access methods (shape, strides, data) needed by the execution engine's common infrastructure.
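To show how the two traits divide responsibility (operation nodes carry no arithmetic of their own and delegate everything to operand methods), here is a deliberately simplified toy sketch. `ToyOp` and `VecOperand` are hypothetical illustration names, with no `Context`, `InputKey`, or shape handling; this is not the crate API:

```rust
/// Toy operand: a flat buffer with a couple of Operand-style methods.
#[derive(Clone, Debug, PartialEq)]
struct VecOperand(Vec<f64>);

impl VecOperand {
    fn add(&self, other: &Self) -> Self {
        VecOperand(self.0.iter().zip(&other.0).map(|(a, b)| a + b).collect())
    }
    fn reduce_sum_all(&self) -> Self {
        VecOperand(vec![self.0.iter().sum()]) // reduce over all axes
    }
}

/// Toy operation node: one variant per primitive; eval() only delegates.
#[derive(Clone, Debug)]
enum ToyOp {
    Add,
    ReduceSumAll,
}

impl ToyOp {
    fn n_inputs(&self) -> usize {
        match self {
            ToyOp::Add => 2,
            ToyOp::ReduceSumAll => 1,
        }
    }
    fn eval(&self, inputs: &[&VecOperand]) -> Vec<VecOperand> {
        assert_eq!(inputs.len(), self.n_inputs());
        match self {
            ToyOp::Add => vec![inputs[0].add(inputs[1])],
            ToyOp::ReduceSumAll => vec![inputs[0].reduce_sum_all()],
        }
    }
}

fn main() {
    let x = VecOperand(vec![1.0, 2.0]);
    let y = VecOperand(vec![3.0, 4.0]);
    let sum = ToyOp::Add.eval(&[&x, &y]);
    assert_eq!(sum[0], VecOperand(vec![4.0, 6.0]));
    let total = ToyOp::ReduceSumAll.eval(&[&sum[0]]);
    assert_eq!(total[0], VecOperand(vec![10.0]));
}
```

The point mirrored from the real design: a graph engine built over such a node trait never needs to know which primitives exist, only how many inputs and outputs each node declares.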
V. Tenferro IR Vocabulary
This section is about the graph-level vocabulary that computegraph-rs, chainrules-rs, tidu-rs, and tenferro’s StdTensorOp layer talk about.
These ops define the Tenferro IR. Each op maps to a StableHLO op when lowered via lower_to_stablehlo(). The XLA backend takes the StableHLO IR directly; other backends lower the StableHLO IR through the optimizing compiler to produce Execution IR.
AD-closed graph core
These are the tensor primitives needed to express:
- scalar and tensor JVP/VJP rules
- explicit broadcasting and reshaping
- general contractions (including repeated-index patterns like trace/diagonal)
- reverse-mode accumulation without hidden fan-out
Every op in this table is expected to implement PrimitiveOp directly (for StdTensorOp). The set is AD-closed: linearize and transpose_rule of any op in this table emit only ops from this table.
Implementation note: the boundary between this core set and the "Standard arithmetic extensions" (Section VI) is primarily an implementation priority guide, not a formal algebraic boundary. The full `StdTensorOp` set (core + extensions) is also AD-closed. The core set is distinguished by two properties: (1) it is AD-closed on its own, and (2) all ops in it are well-defined for arbitrary semiring algebras (not just standard arithmetic), making them available to `SemiringOp<T>`.
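A concrete instance of the closure property: the transpose rule of `ReduceSum` is `BroadcastInDim` back along the reduced axes, and both are core ops, so differentiation never leaves the core set. A numeric sketch for rank-2, axis-0 reduction (hypothetical helper names, row-major flat storage for brevity; not tenferro code):

```rust
/// y = ReduceSum(x, axes=[0]) for a rows x cols input, row-major flat storage.
fn reduce_sum_axis0(x: &[f64], rows: usize, cols: usize) -> Vec<f64> {
    let mut y = vec![0.0; cols];
    for r in 0..rows {
        for c in 0..cols {
            y[c] += x[r * cols + c];
        }
    }
    y
}

/// Transpose rule of that reduction: BroadcastInDim of the cotangent ybar
/// back along the reduced axis 0. Still a core op: the set is AD-closed.
fn broadcast_in_dim_axis0(ybar: &[f64], rows: usize) -> Vec<f64> {
    let mut xbar = Vec::with_capacity(rows * ybar.len());
    for _ in 0..rows {
        xbar.extend_from_slice(ybar);
    }
    xbar
}

fn main() {
    // x = [[1, 2], [3, 4]] reduced over axis 0 gives [4, 6].
    assert_eq!(reduce_sum_axis0(&[1.0, 2.0, 3.0, 4.0], 2, 2), vec![4.0, 6.0]);
    // The cotangent on y is replicated along the reduced axis.
    assert_eq!(
        broadcast_in_dim_axis0(&[0.5, 0.25], 2),
        vec![0.5, 0.25, 0.5, 0.25]
    );
}
```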
Algebraic ops
| Primitive | Signature | Definition | Notes |
|---|---|---|---|
| `Add` | `x0: S, x1: S -> y: S` | `y[i] = x0[i] + x1[i]` | Same shape on both inputs; no hidden broadcasting. Maps to ⊕ for custom algebras. |
| `Mul` | `x0: S, x1: S -> y: S` | `y[i] = x0[i] * x1[i]` | Elementwise multiply; same-shape contract. Maps to ⊗ for custom algebras. |
| `Neg` | `x: S -> y: S` | `y[i] = -x[i]` | Unary elementwise. Standard algebra only (semirings lack additive inverses). |
| `Conj` | `x: S -> y: S` | `y[i] = conj(x[i])` | Identity on real dtypes, conjugation on complex dtypes. Standard algebra only. |
| `DotGeneral(config)` | `lhs: A, rhs: B -> out: C` | General tensor contraction over explicit batch axes and contracting axes | Canonical contraction primitive; uses ⊕ and ⊗ for custom algebras. Config defined below. |
| `ReduceSum(axes)` | `x: [d0, ..., dn-1] -> y` | `y` is formed by summing `x` over the listed axes | Uses ⊕ for custom algebras. Rank drops unless a later op restores it. |
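To illustrate "uses ⊕ and ⊗ for custom algebras": a sketch of `DotGeneral` as an `ij,jk->ik` contraction in the (max, +) tropical semiring, where the usual sum-of-products becomes max-of-sums. `tropical_matmul` is a hypothetical illustration, not the engine's kernel:

```rust
/// DotGeneral over a custom algebra: here ⊕ = max and ⊗ = + (tropical
/// semiring). The contraction loop is unchanged; only the two semiring
/// operations and the ⊕-identity differ from ordinary matmul.
fn tropical_matmul(a: &[f64], b: &[f64], m: usize, k: usize, n: usize) -> Vec<f64> {
    let mut out = vec![f64::NEG_INFINITY; m * n]; // ⊕-identity
    for i in 0..m {
        for j in 0..k {
            for l in 0..n {
                let prod = a[i * k + j] + b[j * n + l]; // ⊗
                let acc = &mut out[i * n + l];
                *acc = acc.max(prod); // ⊕
            }
        }
    }
    out
}

fn main() {
    // a = [[1, 2], [3, 4]], b = [[0, 1], [2, 0]];
    // out[i][l] = max_j (a[i][j] + b[j][l]).
    let out = tropical_matmul(&[1.0, 2.0, 3.0, 4.0], &[0.0, 1.0, 2.0, 0.0], 2, 2, 2);
    assert_eq!(out, vec![4.0, 2.0, 6.0, 4.0]);
}
```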
Structural ops
These ops rearrange or select elements without any arithmetic. They are well-defined for all algebras and handled by common infrastructure at the backend level (not part of the custom backend contract).
| Primitive | Signature | Definition | Notes |
|---|---|---|---|
| `Transpose(perm)` | `x: [d0, ..., dn-1] -> y: [d_perm[0], ..., d_perm[n-1]]` | Reorder axes according to `perm` | Pure axis permutation |
| `Reshape(shape)` | `x: [d0, ..., dn-1] -> y: shape` | Reinterpret the element sequence with a new shape | Total element count must stay unchanged. In the IR all tensors are logically dense column-major, so there is no stride ambiguity. |
| `BroadcastInDim(shape, dims)` | `x: [a0, ..., ak-1] -> y: shape` | Place input axis `j` into output axis `dims[j]`, repeating along the others | Makes all broadcast semantics explicit |
| `Gather` | `x: S -> y: S'` | Read values from `x` at positions specified by an index tensor | Needed for repeated-index einsum patterns (trace, diagonal extraction). Pure index-based read; no arithmetic. |
| `Scatter` | `updates: S, x: S' -> y: S'` | Write or accumulate values into `y` at positions specified by an index tensor | Transpose of `Gather`. Accumulation uses ⊕. Needed for AD of `Gather` and for `embed_diag`. |
DotGeneral config
```rust
struct DotGeneralConfig {
    lhs_contracting_dims: Vec<usize>,
    rhs_contracting_dims: Vec<usize>,
    lhs_batch_dims: Vec<usize>,
    rhs_batch_dims: Vec<usize>,
}
```
Contracting dims are summed over (inner product). Batch dims are preserved in the output. Remaining dims appear in the output in lhs-then-rhs order. This matches StableHLO's `dot_general` dimension numbers.
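The output-shape rule (batch dims first, then remaining lhs dims, then remaining rhs dims) can be sketched as follows. `dot_general_out_shape` is a hypothetical helper mirroring the config above, with no validation that batch/contracting sizes actually match:

```rust
/// Sketch of DotGeneral output-shape inference: batch dims (in lhs_batch
/// order), then non-batch non-contracting lhs dims, then the same for rhs.
fn dot_general_out_shape(
    lhs: &[usize], rhs: &[usize],
    lhs_contracting: &[usize], rhs_contracting: &[usize],
    lhs_batch: &[usize], rhs_batch: &[usize],
) -> Vec<usize> {
    // Batch dims are preserved; rhs batch dims must pair with lhs batch dims.
    let mut out: Vec<usize> = lhs_batch.iter().map(|&d| lhs[d]).collect();
    let _ = rhs_batch;
    // Remaining lhs dims, then remaining rhs dims, in original order.
    out.extend(
        (0..lhs.len())
            .filter(|d| !lhs_batch.contains(d) && !lhs_contracting.contains(d))
            .map(|d| lhs[d]),
    );
    out.extend(
        (0..rhs.len())
            .filter(|d| !rhs_batch.contains(d) && !rhs_contracting.contains(d))
            .map(|d| rhs[d]),
    );
    out
}

fn main() {
    // Ordinary matmul ij,jk->ik: contract lhs dim 1 with rhs dim 0.
    assert_eq!(
        dot_general_out_shape(&[2, 3], &[3, 4], &[1], &[0], &[], &[]),
        vec![2, 4]
    );
    // Batched matmul bij,bjk->bik: dim 0 is a batch dim on both sides.
    assert_eq!(
        dot_general_out_shape(&[5, 2, 3], &[5, 3, 4], &[2], &[1], &[0], &[0]),
        vec![5, 2, 4]
    );
}
```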
Concrete examples
| Primitive | Example |
|---|---|
| `DotGeneral` | `ij,jk->ik` (ordinary matrix multiply) |
| `BroadcastInDim` | `[n] -> [b, n]` with `dims=[1]` |
| `ReduceSum` | `[b, m, n] -> [b, n]` with `axes=[1]` |
| `Transpose` | `[b, m, n] -> [m, b, n]` with `perm=[1, 0, 2]` |
Trace, diagonal, and their AD helpers
Decision: Trace/Diag/AntiTrace/AntiDiag are not dedicated Tenferro IR primitives. They are lowered to existing Tenferro IR ops (which map to StableHLO):
| Surface op | Tenferro IR lowering |
|---|---|
| `trace(A)` | einsum `ii->` = diagonal extraction (`Gather` pattern) + `ReduceSum` |
| `diag(A)` / `extract_diag(A)` | einsum `ii->i` = `Gather` pattern |
| `embed_diag(v)` | einsum `i->ii` = `Scatter` / `BroadcastInDim` + `Mul` with identity mask |
| `AntiTrace` (AD helper) | `Scatter` + `BroadcastInDim` in transpose rules |
| `AntiDiag` (AD helper) | `Scatter` in transpose rules |
This keeps the Tenferro IR vocabulary aligned with StableHLO (which has no Trace/Diag ops) and avoids adding non-standard primitives. The einsum engine already handles repeated-index patterns (`ii->i`, `ii->`) internally via diagonal extraction (the existing dispatch.rs diagonal plan).
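The trace lowering can be made concrete with a flat-index sketch: the `Gather` pattern reads `x[i*n + i]` (diagonal extraction, `ii->i`), and `ReduceSum` then collapses the result (`ii->`). `extract_diag` and `trace` below are hypothetical helpers, not the einsum engine:

```rust
/// trace(A) as Gather + ReduceSum on a flat row-major n x n buffer.
fn extract_diag(a: &[f64], n: usize) -> Vec<f64> {
    (0..n).map(|i| a[i * n + i]).collect() // Gather at index pairs [i, i]
}

fn trace(a: &[f64], n: usize) -> f64 {
    extract_diag(a, n).iter().sum() // ReduceSum over the single remaining axis
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0]; // [[1, 2], [3, 4]]
    assert_eq!(extract_diag(&a, 2), vec![1.0, 4.0]);
    assert_eq!(trace(&a, 2), 5.0);
}
```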
VI. Standard Arithmetic Only
These primitives are available only for the ordinary dense numeric setting (real/complex standard arithmetic). They are not assumed to exist for generic semirings such as tropical algebra.
This section should be kept as close as practical to the official StableHLO op set, so that tenferro’s Tenferro IR primitives lower cleanly to StableHLO. See ../reference/stablehlo-primitives.md for the StableHLO-facing reference and ../reference/jax-primitives.md for the JAX-side reference point.
Elementwise arithmetic, comparison, and selection
| Primitive | Definition | Notes |
|---|---|---|
| `Div` | `y[i] = x0[i] / x1[i]` | Canonical division op |
| `Abs` | `y[i] = abs(x[i])` | Real magnitude or complex modulus, depending on dtype contract |
| `Sign` | `y[i] = sign(x[i])` | Often used in stabilization logic |
| `Maximum` | `y[i] = max(x0[i], x1[i])` | Ordered real comparison |
| `Minimum` | `y[i] = min(x0[i], x1[i])` | Ordered real comparison |
| `Compare(dir)` | Produce a predicate/mask tensor from an elementwise comparison | `dir` is one of `eq`, `lt`, `le`, `gt`, `ge` |
| `Select` | `y[i] = pred[i] ? on_true[i] : on_false[i]` | Canonical conditional elementwise choice |
| `Clamp` | `y[i] = min(max(x[i], lower[i]), upper[i])` | Canonical clipping primitive |
Analytic elementwise primitives
| Primitive | Definition |
|---|---|
| `Exp` | `exp(x)` |
| `Log` | `log(x)` |
| `Sin` | `sin(x)` |
| `Cos` | `cos(x)` |
| `Tanh` | `tanh(x)` |
| `Sqrt` | `sqrt(x)` |
| `Rsqrt` | `1 / sqrt(x)` |
| `Pow` | `x^y` |
| `Expm1` | `exp(x) - 1` |
| `Log1p` | `log(1 + x)` |
The table above is the canonical analytic seed set. Additional analytic ops may be added later, but they are not part of the current required list unless this document is updated.
Indexing and structural data movement
Gather and Scatter are in the AD-closed graph core (Section V) because they are needed for repeated-index einsum patterns and are well-defined for all algebras. The remaining indexing ops are standard-arithmetic only:
| Primitive | Definition | Notes |
|---|---|---|
| `Slice` | Read a static rectangular subregion | Start/limit/stride known in the op |
| `DynamicSlice` | Read a slice whose start index is data-dependent | Dynamic counterpart of `Slice` |
| `Pad` | Extend a tensor with edge/interior padding values | Needed for transposes of slicing-like ops |
| `Concatenate` | Join tensors along one axis | Rank-preserving shape change |
| `Reverse` | Reverse the order of elements along selected axes | Useful for convolutions and sequence models |
Additional reductions
| Primitive | Definition |
|---|---|
| `ReduceProd` | Multiply values over the listed axes |
| `ReduceMax` | Max over the listed axes |
| `ReduceMin` | Min over the listed axes |
ReduceSum stays in the AD-closed graph core because it is essential both for primal tensor code and for transpose rules.
Linalg primitives
| Primitive | Outputs | Definition | StableHLO lowering |
|---|---|---|---|
| `Cholesky` | `(L)` or `(U)` | Cholesky factorization of a positive-definite matrix | Direct StableHLO op (`stablehlo.cholesky`) |
| `SVD` | `(U, S, Vt)` | Thin singular value decomposition `A = U diag(S) Vt` | `stablehlo.custom_call` |
| `QR` | `(Q, R)` | Thin QR factorization `A = Q R` | `stablehlo.custom_call` |
| `Eigh` | `(eigenvalues, eigenvectors)` | Hermitian / symmetric eigendecomposition | `stablehlo.custom_call` |
| `Solve` | `(X)` | Solve `A X = B` for `X` | `stablehlo.custom_call` |
Cholesky has a direct StableHLO op. All other linalg primitives lower to stablehlo.custom_call with appropriate target names (matching JAX/XLA conventions for LAPACK/cuSOLVER dispatch).
Regardless of lowering path, derivative rules for all linalg ops emit graph primitives that satisfy PrimitiveOp closure.
Future control-flow primitives
| Primitive | Definition |
|---|---|
| `Cond` | Branch between two subcomputations based on a predicate |
| `While` | Loop while a condition remains true |
| `Scan` | Structured loop with carried state and stacked outputs |
These are intentionally future-facing and are not required for the initial vertical slice.
VII. StableHLO Alignment
When there is a choice, the Tenferro IR vocabulary should prefer the StableHLO-style name and semantics:
| Preferred | Instead of |
|---|---|
| `DotGeneral` | `einsum` or `dot` as a primitive |
| `BroadcastInDim` | implicit broadcasting or a generic `broadcast` primitive |
| `Compare(dir)` + `Select` | surface names like `greater`, `greater_equal`, `where` |
| `ReduceSum` / `ReduceMax` / … | opaque reduction primitives whose combiner is not explicit |
The goal is not to copy StableHLO mechanically. The goal is to ensure that the Standard arithmetic only part of tenferro’s Tenferro IR vocabulary has an obvious, low-friction lowering path to StableHLO, because the StableHLO IR is the single cut point for all backends.
See also ../reference/stablehlo-primitives.md and ../reference/jax-primitives.md.
VIII. Frontend Sugar and Canonical Lowering
Many familiar user-level ops are better treated as aliases or composites rather than as distinct graph primitives.
Constants and literals
Constants (scalar or tensor literals) are not Tenferro IR primitives. They enter the graph as Fragment input nodes with attached data (TracedTensor::from(Tensor::from_vec(...))). At StableHLO lowering, these become stablehlo.constant ops. Canonical lowerings that reference literal values (e.g., 1 / n in mean, 1 in reciprocal) construct these as Fragment inputs.
Lowering table
| Surface op | Tenferro IR form |
|---|---|
| `einsum(...)` | contraction planning + `DotGeneral`/`Mul`/`Transpose`/`Reshape`/`BroadcastInDim`/`ReduceSum` |
| `sum(x, axes)` | `ReduceSum(x, axes)` |
| `mean(x, axes)` | `ReduceSum(x, axes)` followed by `Mul(result, constant(1/n))` |
| `sub(x, y)` | `Add(x, Neg(y))` |
| `square(x)` | `Mul(x, x)` |
| `reciprocal(x)` | `Div(constant(1), x)` |
| `where(pred, a, b)` | `Select(pred, a, b)` |
| `greater(x, y)` | `Compare(dir=gt)` |
| `greater_equal(x, y)` | `Compare(dir=ge)` |
| `clamp_min(x, lo)` / `clamp_max(x, hi)` | special cases of `Clamp` |
| `trace(x)` | diagonal extraction (`Gather` pattern) + `ReduceSum` |
| `diag(x)` / `extract_diag(x)` | diagonal extraction (`Gather` pattern) |
This is useful for two reasons:
- the Tenferro IR vocabulary stays smaller and easier to reason about
- AD closure becomes easier to verify because fewer primitive rules are truly fundamental
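For instance, the `mean` lowering (a `ReduceSum` followed by a `Mul` whose `1/n` literal enters as a constant Fragment input, never a dedicated `Mean` primitive) behaves like this sketch. Rank-2, reduce over axis 1, row-major flat storage; `reduce_sum_axis1` and `mean_axis1` are hypothetical helpers:

```rust
/// ReduceSum over axis 1 of a rows x cols row-major buffer.
fn reduce_sum_axis1(x: &[f64], rows: usize, cols: usize) -> Vec<f64> {
    (0..rows)
        .map(|r| x[r * cols..(r + 1) * cols].iter().sum())
        .collect()
}

/// mean(x, axes=[1]) = ReduceSum(x, axes=[1]) then Mul with constant(1/n).
fn mean_axis1(x: &[f64], rows: usize, cols: usize) -> Vec<f64> {
    let inv_n = 1.0 / cols as f64; // constant(1/n): a literal input, not an op
    reduce_sum_axis1(x, rows, cols)
        .into_iter()
        .map(|s| s * inv_n) // Mul(result, constant(1/n))
        .collect()
}

fn main() {
    let x = [1.0, 2.0, 3.0, 5.0, 7.0, 9.0]; // 2x3: rows [1,2,3] and [5,7,9]
    assert_eq!(mean_axis1(&x, 2, 3), vec![2.0, 7.0]);
}
```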
IX. Implementation Note
This catalog defines the semantic vocabulary at all IR levels that the graph, AD stack, and execution engine talk about.
- The Tenferro IR (Section V) defines the graph-level vocabulary for fragment construction, AD, and einsum decomposition.
- The StableHLO IR (Section III.1) is the single cut point interface between the graph/AD world and execution.
- The Execution IR (Section III.3) is the interface between the optimizing compiler and backend kernels.
- Backend traits (Section III.4) define what a custom backend must implement.
The boundary is deliberate: graph-level concerns (AD closure, shape inference) live above the StableHLO IR cut point; execution concerns (memory layout, kernel dispatch, fast paths) live below it.