High-level einsum with N-ary contraction tree optimization.
This crate provides Einstein summation notation for tenferro_tensor::Tensor
values. It supports:
- String notation: "ij,jk->ik" (NumPy/PyTorch compatible)
- Parenthesized notation: "ij,(jk,kl)->il" respects user-specified contraction order via NestedEinsum (OMEinsum.jl-compatible)
- Integer label notation: omeinsum-rs compatible, using u32 labels
- N-ary contraction: automatic or manual optimization of pairwise contraction order via ContractionTree
- Binary primitive: public two-input einsum APIs (einsum_binary*) for composing explicit contraction paths in higher layers
- Accumulating variants: einsum_into, einsum_with_subscripts_into, and einsum_with_plan_into write into a pre-allocated output buffer with BLAS-style alpha/beta scaling, avoiding allocation in hot loops
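The alpha/beta convention used by the accumulating variants follows BLAS GEMM semantics: the buffer is updated as out = alpha * result + beta * out. A minimal scalar sketch of that update rule (illustrative only, not the crate's implementation):

```rust
// BLAS-style accumulation: out = alpha * contribution + beta * out.
// beta = 0.0 overwrites the buffer; beta = 1.0 accumulates into it.
fn axpby(alpha: f64, contribution: f64, beta: f64, out: &mut f64) {
    *out = alpha * contribution + beta * *out;
}

fn main() {
    let mut c = 10.0_f64;
    axpby(1.0, 5.0, 0.0, &mut c); // overwrite: c = 5.0
    assert_eq!(c, 5.0);
    axpby(2.0, 5.0, 1.0, &mut c); // accumulate: c = 2*5 + 5 = 15.0
    assert_eq!(c, 15.0);
    println!("c = {c}");
}
```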
§Backend dispatch
The backend is passed explicitly as a type parameter
Backend: EinsumBackend<Alg> together with a mutable context
BackendContext<Alg, Backend>. This follows the Rust idiom of explicit
ownership and mutability (no global or thread-local state).
The backend contract is the semiring core plus optional semiring fast paths.
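The idea of a semiring core can be sketched in a few lines: a contraction only needs an addition, a multiplication, and the additive identity, so swapping the algebra swaps the meaning of the whole contraction. The sketch below is illustrative only; the names PlusTimes and MaxPlus are hypothetical and the crate's actual EinsumBackend<Alg> trait differs.

```rust
// Illustrative semiring abstraction: `add`, `mul`, and the additive identity
// are all a contraction kernel needs from the algebra.
trait Semiring {
    type T: Copy;
    fn zero() -> Self::T;
    fn add(a: Self::T, b: Self::T) -> Self::T;
    fn mul(a: Self::T, b: Self::T) -> Self::T;
}

struct PlusTimes; // ordinary arithmetic (+, *)
impl Semiring for PlusTimes {
    type T = f64;
    fn zero() -> f64 { 0.0 }
    fn add(a: f64, b: f64) -> f64 { a + b }
    fn mul(a: f64, b: f64) -> f64 { a * b }
}

struct MaxPlus; // tropical semiring (max, +)
impl Semiring for MaxPlus {
    type T = f64;
    fn zero() -> f64 { f64::NEG_INFINITY }
    fn add(a: f64, b: f64) -> f64 { a.max(b) }
    fn mul(a: f64, b: f64) -> f64 { a + b }
}

// The inner loop of any contraction, generic over the algebra.
fn dot<S: Semiring>(x: &[S::T], y: &[S::T]) -> S::T {
    x.iter().zip(y).fold(S::zero(), |acc, (&a, &b)| S::add(acc, S::mul(a, b)))
}

fn main() {
    let (x, y) = ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]);
    assert_eq!(dot::<PlusTimes>(&x, &y), 32.0); // 4 + 10 + 18
    assert_eq!(dot::<MaxPlus>(&x, &y), 9.0);    // max(1+4, 2+5, 3+6)
    println!("ok");
}
```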
§Examples
§Common operations
use tenferro_algebra::Standard;
use tenferro_einsum::einsum;
use tenferro_tensor::{Tensor, MemoryOrder};
use tenferro_device::LogicalMemorySpace;
use tenferro_prims::{CpuBackend, CpuContext};
let col = MemoryOrder::ColumnMajor;
let mut ctx = CpuContext::new(4);
let a = Tensor::<f64>::from_slice(&[1.0, 2.0, 3.0, 4.0], &[2, 2], col).unwrap();
let b = Tensor::<f64>::from_slice(&[5.0, 6.0, 7.0, 8.0], &[2, 2], col).unwrap();
let v = Tensor::<f64>::from_slice(&[1.0, 2.0, 3.0], &[3], col).unwrap();
// Matrix multiplication: C = A @ B
let c = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "ij,jk->ik", &[&a, &b], None).unwrap();
// Trace: tr(A)
let tr = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "ii->", &[&a], None).unwrap();
// Outer product: v_i * v_j -> M_{ij}
let outer =
einsum::<Standard<f64>, CpuBackend>(&mut ctx, "i,j->ij", &[&v, &v], None).unwrap();
// Dot product: v . v
let dot = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "i,i->", &[&v, &v], None).unwrap();
// Matrix-vector product: A @ v
let mv = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "ij,j->i", &[&a, &v], None).unwrap();
// Diagonal embedding: vector -> diagonal matrix
// v = [1, 2, 3] -> [[1,0,0],[0,2,0],[0,0,3]]
let diag = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "i->ii", &[&v], None).unwrap();
assert_eq!(diag.dims(), &[3, 3]);
// Diagonal extraction: matrix -> diagonal vector
let d = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "ii->i", &[&a], None).unwrap();
// Higher-order diagonal: 3D tensor with repeated index
// Creates T_{iii} from v_i
let t = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "i->iii", &[&v], None).unwrap();
assert_eq!(t.dims(), &[3, 3, 3]);
// Consuming variant: operands are moved (buffer reuse not yet implemented)
use tenferro_einsum::einsum_owned;
let x = Tensor::<f64>::from_slice(&[1.0, 2.0, 3.0, 4.0], &[2, 2], col).unwrap();
let y = Tensor::<f64>::from_slice(&[5.0, 6.0, 7.0, 8.0], &[2, 2], col).unwrap();
let z =
einsum_owned::<Standard<f64>, CpuBackend>(&mut ctx, "ij,jk->ik", vec![x, y], None)
    .unwrap();
§Batch operations
use tenferro_algebra::Standard;
// Batched GEMM: 10 independent matrix multiplications in one call
// A: (batch=10, m=3, k=4), B: (batch=10, k=4, n=5) -> C: (batch=10, m=3, n=5)
let a = Tensor::<f64>::zeros(&[10, 3, 4], LogicalMemorySpace::MainMemory, col).unwrap();
let b = Tensor::<f64>::zeros(&[10, 4, 5], LogicalMemorySpace::MainMemory, col).unwrap();
let c =
einsum::<Standard<f64>, CpuBackend>(&mut ctx, "bij,bjk->bik", &[&a, &b], None).unwrap();
assert_eq!(c.dims(), &[10, 3, 5]);
// Multiple batch dimensions: (batch1=2, batch2=3, m, k) x (batch1=2, batch2=3, k, n)
let a = Tensor::<f64>::zeros(&[2, 3, 4, 5], LogicalMemorySpace::MainMemory, col).unwrap();
let b = Tensor::<f64>::zeros(&[2, 3, 5, 6], LogicalMemorySpace::MainMemory, col).unwrap();
let c = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "abij,abjk->abik", &[&a, &b], None)
.unwrap();
assert_eq!(c.dims(), &[2, 3, 4, 6]);
// Broadcast batch: A has batch dim, B is shared across batch
// A: (batch=10, m=3, k=4), B: (k=4, n=5) -> C: (batch=10, m=3, n=5)
let a = Tensor::<f64>::zeros(&[10, 3, 4], LogicalMemorySpace::MainMemory, col).unwrap();
let b = Tensor::<f64>::zeros(&[4, 5], LogicalMemorySpace::MainMemory, col).unwrap();
let c = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "bij,jk->bik", &[&a, &b], None)
.unwrap();
assert_eq!(c.dims(), &[10, 3, 5]);
§Ellipsis notation for batch dimensions
NumPy/PyTorch/JAX-style ellipsis notation (...) is fully supported for
batch dimensions, allowing generic code that works with any number of batch
dimensions (resolves issue #529).
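The way "..." resolves can be sketched as follows (illustrative only, not the crate's parser): the explicit labels bind to the trailing dimensions of each operand, and the ellipsis absorbs whatever leading dimensions remain.

```rust
// Resolve "...ij" against an operand of the given rank: named labels bind to
// the trailing dims; "..." absorbs the leading ones as batch dimensions.
fn split_batch(spec: &str, rank: usize) -> Option<usize> {
    let named = spec.trim_start_matches("...").len();
    rank.checked_sub(named) // number of batch dims, None if rank is too small
}

fn main() {
    assert_eq!(split_batch("...ij", 2), Some(0)); // plain matrix, no batch dims
    assert_eq!(split_batch("...ij", 3), Some(1)); // one batch dim
    assert_eq!(split_batch("...ij", 4), Some(2)); // two batch dims
    assert_eq!(split_batch("...ij", 1), None);    // rank too small: an error
    println!("ok");
}
```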
use tenferro_algebra::Standard;
// Batched matrix multiply with ellipsis: works with any number of batch dims
// 1 batch dim: A[2,3,4] @ B[2,4,5] -> C[2,3,5]
let a = Tensor::<f64>::zeros(&[2, 3, 4], LogicalMemorySpace::MainMemory, col).unwrap();
let b = Tensor::<f64>::zeros(&[2, 4, 5], LogicalMemorySpace::MainMemory, col).unwrap();
let c = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "...ij,...jk->...ik", &[&a, &b], None)
.unwrap();
assert_eq!(c.dims(), &[2, 3, 5]);
// 2 batch dims: A[2,3,4,5] @ B[2,3,5,6] -> C[2,3,4,6]
let a = Tensor::<f64>::zeros(&[2, 3, 4, 5], LogicalMemorySpace::MainMemory, col).unwrap();
let b = Tensor::<f64>::zeros(&[2, 3, 5, 6], LogicalMemorySpace::MainMemory, col).unwrap();
let c = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "...ij,...jk->...ik", &[&a, &b], None)
.unwrap();
assert_eq!(c.dims(), &[2, 3, 4, 6]);
// No batch dims: A[3,4] @ B[4,5] -> C[3,5]
let a = Tensor::<f64>::zeros(&[3, 4], LogicalMemorySpace::MainMemory, col).unwrap();
let b = Tensor::<f64>::zeros(&[4, 5], LogicalMemorySpace::MainMemory, col).unwrap();
let c = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "...ij,...jk->...ik", &[&a, &b], None)
.unwrap();
assert_eq!(c.dims(), &[3, 5]);
§Integer label notation
use tenferro_algebra::Standard;
use tenferro_einsum::{einsum_with_subscripts, Subscripts};
// Same as "ij,jk->ik" but with integer labels
// Useful when indices exceed 52 (a-z, A-Z) or are computed programmatically
let subs = Subscripts::new(&[&[0, 1], &[1, 2]], &[0, 2]);
let c = einsum_with_subscripts::<Standard<f64>, CpuBackend>(&mut ctx, &subs, &[&a, &b], None)
    .unwrap();
§Contraction order control
use tenferro_algebra::Standard;
use tenferro_einsum::{einsum_with_plan, Subscripts};
// Three matrices: D = A @ B @ C
// Parentheses specify: contract B*C first, then A*(BC)
let d = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "ij,(jk,kl)->il", &[&a, &b, &c], None)
.unwrap();
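The savings from a good contraction order can be large. A back-of-envelope multiply-add count for a three-matrix chain, with shapes deliberately skewed to exaggerate the effect (pure Rust, illustrative only):

```rust
// Multiply-add count of one GEMM of shape (m x k) * (k x n) is m*k*n.
fn gemm_cost(m: usize, k: usize, n: usize) -> usize { m * k * n }

fn main() {
    // Chain A(100x2) * B(2x100) * C(100x2).
    // Left-to-right builds a 100x100 intermediate:
    let left_first = gemm_cost(100, 2, 100) + gemm_cost(100, 100, 2);
    // Contracting B*C first keeps the intermediate at 2x2:
    let right_first = gemm_cost(2, 100, 2) + gemm_cost(100, 2, 2);
    assert_eq!(left_first, 40_000);
    assert_eq!(right_first, 800);
    println!("{left_first} vs {right_first} multiply-adds");
}
```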
// Or use ContractionTree for programmatic control
use tenferro_einsum::ContractionTree;
let subs = Subscripts::new(&[&[0, 1], &[1, 2], &[2, 3]], &[0, 3]);
let tree = ContractionTree::from_pairs(
&subs,
&[&[3, 4], &[4, 5], &[5, 6]],
&[(1, 2), (0, 3)], // contract B*C first, then A*(BC)
).unwrap();
let d = einsum_with_plan::<Standard<f64>, CpuBackend>(&mut ctx, &tree, &[&a, &b, &c], None)
    .unwrap();
§Accumulating into a pre-allocated output
use tenferro_algebra::Standard;
use tenferro_einsum::{einsum_with_plan_into, ContractionTree, Subscripts};
use tenferro_tensor::{Tensor, MemoryOrder};
use tenferro_device::LogicalMemorySpace;
use tenferro_prims::{CpuBackend, CpuContext};
let col = MemoryOrder::ColumnMajor;
let mut ctx = CpuContext::new(4);
let subs = Subscripts::new(&[&[0, 1], &[1, 2]], &[0, 2]);
let tree = ContractionTree::optimize(&subs, &[&[3, 4], &[4, 5]]).unwrap();
let a = Tensor::<f64>::zeros(&[3, 4], LogicalMemorySpace::MainMemory, col).unwrap();
let b = Tensor::<f64>::zeros(&[4, 5], LogicalMemorySpace::MainMemory, col).unwrap();
let mut c = Tensor::<f64>::zeros(&[3, 5], LogicalMemorySpace::MainMemory, col).unwrap();
// Hot loop: reuse output buffer, zero allocation per iteration
for _ in 0..1000 {
// C = 1.0 * (A @ B) + 0.0 * C (overwrite)
einsum_with_plan_into::<Standard<f64>, CpuBackend>(
&mut ctx, &tree, &[&a, &b], 1.0, 0.0, &mut c, None,
).unwrap();
}
§GPU async chaining (deferred evaluation)
Status: Not yet implemented. GPU backends do not exist yet. The examples below are aspirational design targets, not working code.
GPU einsum operations return immediately. The result tensor carries a
CompletionEvent that tracks the
pending accelerator work. Passing this tensor to another einsum chains
via GPU stream dependencies — no CPU synchronization until data is
accessed from the host.
- wait(): explicitly blocks until computation completes
- dims(), strides(): implicitly call wait()
- For CPU tensors, event is always None (zero overhead)
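The intended wait() semantics resemble a one-shot completion event. A minimal standard-library sketch of that pattern (illustrative only; CompletionEvent itself is not yet implemented and its API may differ):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// One-shot completion event: a worker signals, consumers block in wait().
struct Event {
    done: Mutex<bool>,
    cv: Condvar,
}

impl Event {
    fn new() -> Arc<Self> {
        Arc::new(Event { done: Mutex::new(false), cv: Condvar::new() })
    }
    fn signal(&self) {
        *self.done.lock().unwrap() = true;
        self.cv.notify_all();
    }
    fn wait(&self) {
        let mut done = self.done.lock().unwrap();
        while !*done {
            done = self.cv.wait(done).unwrap();
        }
    }
}

fn main() {
    let event = Event::new();
    let producer = Arc::clone(&event);
    let handle = thread::spawn(move || {
        // Stand-in for asynchronous accelerator work.
        producer.signal();
    });
    event.wait(); // blocks until the "device" signals completion
    handle.join().unwrap();
    println!("done");
}
```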
use tenferro_algebra::Standard;
use tenferro_einsum::einsum;
use tenferro_tensor::{Tensor, MemoryOrder};
use tenferro_device::LogicalMemorySpace;
use tenferro_prims::CudaBackend; // future
// In production, obtain memory spaces via BackendRegistry (future API).
let gpu_mem = LogicalMemorySpace::GpuMemory { device_id: 0 };
let col = MemoryOrder::ColumnMajor;
let mut gpu_ctx = /* CudaContext from BackendRegistry */;
let a = Tensor::<f64>::zeros(&[3, 4], gpu_mem, col).unwrap();
let b = Tensor::<f64>::zeros(&[4, 5], gpu_mem, col).unwrap();
// Both einsum calls submit work to the GPU and return immediately.
// The second call detects c's pending event and chains on the stream.
let c = einsum::<Standard<f64>, CudaBackend>(&mut gpu_ctx, "ij,jk->ik", &[&a, &b], None)
.unwrap();
let d = einsum::<Standard<f64>, CudaBackend>(&mut gpu_ctx, "ij,jk->ik", &[&c, &b], None)
.unwrap();
// wait() blocks until GPU computation completes
d.wait();
§Specifying a compute device
Status: Not yet implemented. See GPU note above.
use tenferro_algebra::Standard;
use tenferro_einsum::einsum;
use tenferro_tensor::{Tensor, MemoryOrder};
use tenferro_device::{LogicalMemorySpace, ComputeDevice};
use tenferro_prims::CudaBackend; // future
let col = MemoryOrder::ColumnMajor;
// In production, obtain memory spaces via BackendRegistry (future API).
let gpu_mem = LogicalMemorySpace::GpuMemory { device_id: 0 };
let mut gpu_ctx = /* CudaContext from BackendRegistry */;
let mut a = Tensor::<f64>::zeros(&[3, 4], gpu_mem, col).unwrap();
let mut b = Tensor::<f64>::zeros(&[4, 5], gpu_mem, col).unwrap();
// Pin tensors to CUDA device 1 (overrides automatic device selection).
// This works when CUDA device 1 can access GpuMemory { device_id: 0 }
// (e.g., same physical GPU or NVLink-connected peer).
// If the device cannot access the memory space, einsum returns
// Err(NoCompatibleComputeDevice). In that case, transfer explicitly:
// let a = a.to_memory_space_async(GpuMemory { device_id: 1 }).unwrap();
a.set_preferred_compute_device(Some(ComputeDevice::Cuda { device_id: 1 }));
b.set_preferred_compute_device(Some(ComputeDevice::Cuda { device_id: 1 }));
// einsum dispatches to the specified CUDA device
let c = einsum::<Standard<f64>, CudaBackend>(&mut gpu_ctx, "ij,jk->ik", &[&a, &b], None)
.unwrap();
// Clear override — revert to automatic device selection
// a.set_preferred_compute_device(None);
Structs§
- ContractionOptimizerOptions: Public options for automatic contraction-path optimization.
- ContractionTree: Contraction tree determining pairwise contraction order for N-ary einsum.
- Subscripts: Einsum subscripts using integer labels (omeinsum-rs compatible).
Enums§
- NestedEinsum: Recursive einsum tree that preserves parenthesized grouping.
Traits§
- EinsumBackend: Public backend contract accepted by tenferro-einsum.
Functions§
- einsum: Execute einsum using string notation.
- einsum_binary: Execute a binary einsum from string notation.
- einsum_binary_into: Execute a binary einsum and accumulate into an existing output buffer.
- einsum_binary_with_subscripts: Execute a binary einsum from pre-parsed subscripts.
- einsum_binary_with_subscripts_into: Execute a binary einsum from pre-parsed subscripts, accumulating into output.
- einsum_frule: Forward-mode rule (frule) for einsum without building a global tape.
- einsum_into: Execute einsum using string notation, accumulating into an existing output.
- einsum_owned: Execute einsum using string notation, consuming the input tensors.
- einsum_rrule: Reverse-mode rule (rrule) for einsum without building a global tape.
- einsum_with_path: Execute N-ary einsum with an explicit pairwise contraction path.
- einsum_with_path_into: Execute N-ary einsum with an explicit pairwise contraction path, accumulating into an existing output tensor.
- einsum_with_plan: Execute einsum with a pre-optimized ContractionTree.
- einsum_with_plan_into: Execute einsum with a pre-optimized ContractionTree, accumulating into an existing output.
- einsum_with_plan_owned: Execute einsum with a pre-optimized ContractionTree, consuming the input tensors.
- einsum_with_subscripts: Execute einsum with pre-built Subscripts.
- einsum_with_subscripts_into: Execute einsum with pre-built Subscripts, accumulating into an existing output.
- einsum_with_subscripts_owned: Execute einsum with pre-built Subscripts, consuming the input tensors.
Type Aliases§
- BackendContext: Context type used by public einsum APIs for a backend/algebra pair.