Crate tenferro_einsum


High-level einsum with N-ary contraction tree optimization.

This crate provides Einstein summation notation for tenferro_tensor::Tensor values. It supports:

  • String notation: "ij,jk->ik" (NumPy/PyTorch compatible)
  • Parenthesized notation: "ij,(jk,kl)->il" respects user-specified contraction order via NestedEinsum (OMEinsum.jl-compatible)
  • Integer label notation: omeinsum-rs compatible, using u32 labels
  • N-ary contraction: Automatic or manual optimization of pairwise contraction order via ContractionTree
  • Binary primitive: Public two-input einsum APIs (einsum_binary*) for composing explicit contraction paths in higher layers
  • Accumulating variants: einsum_into, einsum_with_subscripts_into, einsum_with_plan_into write into a pre-allocated output buffer with BLAS-style alpha/beta scaling, avoiding allocation in hot loops
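As a toy illustration of how the parenthesized notation relates to a contraction tree, the grouping in a spec like "ij,(jk,kl)->il" can be modeled as a small recursive enum (stand-in types for illustration only, not this crate's NestedEinsum or ContractionTree):

```rust
// Toy sketch: leaves are operand indices, internal nodes contract
// their two children pairwise.
#[derive(Debug)]
enum Node {
    Leaf(usize), // index into the operand list
    Contract(Box<Node>, Box<Node>),
}

// Walk the tree post-order, recording the pairwise evaluation steps.
fn eval_order(node: &Node, out: &mut Vec<String>) -> String {
    match node {
        Node::Leaf(i) => format!("T{i}"),
        Node::Contract(l, r) => {
            let (a, b) = (eval_order(l, out), eval_order(r, out));
            let name = format!("({a}*{b})");
            out.push(name.clone());
            name
        }
    }
}

fn main() {
    // "ij,(jk,kl)->il": contract operands 1 and 2 first, then operand 0.
    let tree = Node::Contract(
        Box::new(Node::Leaf(0)),
        Box::new(Node::Contract(Box::new(Node::Leaf(1)), Box::new(Node::Leaf(2)))),
    );
    let mut steps = Vec::new();
    eval_order(&tree, &mut steps);
    assert_eq!(steps, vec!["(T1*T2)", "(T0*(T1*T2))"]);
}
```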

§Backend dispatch

The backend is passed explicitly as a type parameter Backend: EinsumBackend<Alg> together with a mutable context BackendContext<Alg, Backend>. This follows the Rust idiom of explicit ownership and mutability: there is no global or thread-local state. The backend contract is the semiring core plus optional semiring fast paths.
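The shape of this pattern can be sketched with stand-in types (a toy mirror of the idea, not this crate's actual EinsumBackend trait or BackendContext alias): the backend is a type parameter and its context is threaded through by `&mut`.

```rust
// Toy backend trait: the real contract lives in tenferro-einsum/tenferro-prims.
trait ToyBackend {
    type Ctx;
    fn contract(ctx: &mut Self::Ctx, a: f64, b: f64) -> f64;
}

struct Cpu;
struct CpuCtx {
    calls: usize,
}

impl ToyBackend for Cpu {
    type Ctx = CpuCtx;
    fn contract(ctx: &mut CpuCtx, a: f64, b: f64) -> f64 {
        ctx.calls += 1; // mutable context: counters, scratch buffers, streams, ...
        a * b
    }
}

// Generic code names the backend explicitly, mirroring
// einsum::<Standard<f64>, CpuBackend>(&mut ctx, ...).
fn dot<B: ToyBackend>(ctx: &mut B::Ctx, xs: &[f64], ys: &[f64]) -> f64 {
    xs.iter().zip(ys).map(|(x, y)| B::contract(ctx, *x, *y)).sum()
}

fn main() {
    let mut ctx = CpuCtx { calls: 0 };
    assert_eq!(dot::<Cpu>(&mut ctx, &[1.0, 2.0], &[3.0, 4.0]), 11.0);
    assert_eq!(ctx.calls, 2);
}
```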

§Examples

§Common operations

use tenferro_algebra::Standard;
use tenferro_einsum::einsum;
use tenferro_tensor::{Tensor, MemoryOrder};
use tenferro_device::LogicalMemorySpace;
use tenferro_prims::{CpuBackend, CpuContext};

let col = MemoryOrder::ColumnMajor;
let mut ctx = CpuContext::new(4);

let a = Tensor::<f64>::from_slice(&[1.0, 2.0, 3.0, 4.0], &[2, 2], col).unwrap();
let b = Tensor::<f64>::from_slice(&[5.0, 6.0, 7.0, 8.0], &[2, 2], col).unwrap();
let v = Tensor::<f64>::from_slice(&[1.0, 2.0, 3.0], &[3], col).unwrap();

// Matrix multiplication: C = A @ B
let c = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "ij,jk->ik", &[&a, &b], None).unwrap();

// Trace: tr(A)
let tr = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "ii->", &[&a], None).unwrap();

// Outer product: v_i * v_j -> M_{ij}
let outer =
    einsum::<Standard<f64>, CpuBackend>(&mut ctx, "i,j->ij", &[&v, &v], None).unwrap();

// Dot product: v . v
let dot = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "i,i->", &[&v, &v], None).unwrap();

// Matrix-vector product: A @ v
let mv = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "ij,j->i", &[&a, &v], None).unwrap();

// Diagonal embedding: vector -> diagonal matrix
// v = [1, 2, 3] -> [[1,0,0],[0,2,0],[0,0,3]]
let diag = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "i->ii", &[&v], None).unwrap();
assert_eq!(diag.dims(), &[3, 3]);

// Diagonal extraction: matrix -> diagonal vector
let d = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "ii->i", &[&a], None).unwrap();

// Higher-order diagonal: 3D tensor with repeated index
// Creates T_{iii} from v_i
let t = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "i->iii", &[&v], None).unwrap();
assert_eq!(t.dims(), &[3, 3, 3]);

// Consuming variant: operands are moved (buffer reuse not yet implemented)
use tenferro_einsum::einsum_owned;
let x = Tensor::<f64>::from_slice(&[1.0, 2.0, 3.0, 4.0], &[2, 2], col).unwrap();
let y = Tensor::<f64>::from_slice(&[5.0, 6.0, 7.0, 8.0], &[2, 2], col).unwrap();
let z =
    einsum_owned::<Standard<f64>, CpuBackend>(&mut ctx, "ij,jk->ik", vec![x, y], None)
        .unwrap();

§Batch operations

use tenferro_algebra::Standard;
// Batched GEMM: 10 independent matrix multiplications in one call
// A: (batch=10, m=3, k=4), B: (batch=10, k=4, n=5) -> C: (batch=10, m=3, n=5)
let a = Tensor::<f64>::zeros(&[10, 3, 4], LogicalMemorySpace::MainMemory, col).unwrap();
let b = Tensor::<f64>::zeros(&[10, 4, 5], LogicalMemorySpace::MainMemory, col).unwrap();
let c =
    einsum::<Standard<f64>, CpuBackend>(&mut ctx, "bij,bjk->bik", &[&a, &b], None).unwrap();
assert_eq!(c.dims(), &[10, 3, 5]);

// Multiple batch dimensions: (batch1=2, batch2=3, m, k) x (batch1=2, batch2=3, k, n)
let a = Tensor::<f64>::zeros(&[2, 3, 4, 5], LogicalMemorySpace::MainMemory, col).unwrap();
let b = Tensor::<f64>::zeros(&[2, 3, 5, 6], LogicalMemorySpace::MainMemory, col).unwrap();
let c = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "abij,abjk->abik", &[&a, &b], None)
    .unwrap();
assert_eq!(c.dims(), &[2, 3, 4, 6]);

// Broadcast batch: A has batch dim, B is shared across batch
// A: (batch=10, m=3, k=4), B: (k=4, n=5) -> C: (batch=10, m=3, n=5)
let a = Tensor::<f64>::zeros(&[10, 3, 4], LogicalMemorySpace::MainMemory, col).unwrap();
let b = Tensor::<f64>::zeros(&[4, 5], LogicalMemorySpace::MainMemory, col).unwrap();
let c = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "bij,jk->bik", &[&a, &b], None)
    .unwrap();
assert_eq!(c.dims(), &[10, 3, 5]);

§Ellipsis notation for batch dimensions

NumPy/PyTorch/JAX-style ellipsis notation (...) is fully supported, allowing generic code that works with any number of batch dimensions (resolves issue #529).

use tenferro_algebra::Standard;

// Batched matrix multiply with ellipsis: works with any number of batch dims
// 1 batch dim: A[2,3,4] @ B[2,4,5] -> C[2,3,5]
let a = Tensor::<f64>::zeros(&[2, 3, 4], LogicalMemorySpace::MainMemory, col).unwrap();
let b = Tensor::<f64>::zeros(&[2, 4, 5], LogicalMemorySpace::MainMemory, col).unwrap();
let c = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "...ij,...jk->...ik", &[&a, &b], None)
    .unwrap();
assert_eq!(c.dims(), &[2, 3, 5]);

// 2 batch dims: A[2,3,4,5] @ B[2,3,5,6] -> C[2,3,4,6]
let a = Tensor::<f64>::zeros(&[2, 3, 4, 5], LogicalMemorySpace::MainMemory, col).unwrap();
let b = Tensor::<f64>::zeros(&[2, 3, 5, 6], LogicalMemorySpace::MainMemory, col).unwrap();
let c = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "...ij,...jk->...ik", &[&a, &b], None)
    .unwrap();
assert_eq!(c.dims(), &[2, 3, 4, 6]);

// No batch dims: A[3,4] @ B[4,5] -> C[3,5]
let a = Tensor::<f64>::zeros(&[3, 4], LogicalMemorySpace::MainMemory, col).unwrap();
let b = Tensor::<f64>::zeros(&[4, 5], LogicalMemorySpace::MainMemory, col).unwrap();
let c = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "...ij,...jk->...ik", &[&a, &b], None)
    .unwrap();
assert_eq!(c.dims(), &[3, 5]);
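The resolution rule behind these examples can be sketched in a few lines (assumed logic for illustration, not this crate's actual parser): each operand's ellipsis absorbs the leading dims not covered by its explicit labels, and the batch labels are aligned rightmost so broadcasting works.

```rust
// Expand "..." in one operand's spec into explicit placeholder labels,
// given the operand's rank. Batch labels are drawn from the end of the
// uppercase alphabet so they stay rightmost-aligned across operands.
fn expand_ellipsis(spec: &str, rank: usize) -> String {
    let explicit = spec.chars().filter(|c| *c != '.').count();
    let batch = rank - explicit; // dims absorbed by the ellipsis
    let batch_labels: String = (0..batch)
        .map(|i| (b'A' + (26 - batch + i) as u8) as char)
        .collect();
    spec.replace("...", &batch_labels)
}

fn main() {
    // 2 batch dims: "...ij" on a rank-4 operand
    assert_eq!(expand_ellipsis("...ij", 4), "YZij");
    // 0 batch dims: the ellipsis simply vanishes
    assert_eq!(expand_ellipsis("...ij", 2), "ij");
}
```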

§Integer label notation

use tenferro_algebra::Standard;
use tenferro_einsum::{einsum_with_subscripts, Subscripts};

// Same as "ij,jk->ik" but with integer labels.
// Useful when indices exceed 52 (a-z, A-Z) or are computed programmatically.
let a = Tensor::<f64>::zeros(&[3, 4], LogicalMemorySpace::MainMemory, col).unwrap();
let b = Tensor::<f64>::zeros(&[4, 5], LogicalMemorySpace::MainMemory, col).unwrap();
let subs = Subscripts::new(&[&[0, 1], &[1, 2]], &[0, 2]);
let c = einsum_with_subscripts::<Standard<f64>, CpuBackend>(&mut ctx, &subs, &[&a, &b], None)
    .unwrap();

§Contraction order control

use tenferro_algebra::Standard;
use tenferro_einsum::{einsum_with_plan, Subscripts};

// Three matrices: D = A @ B @ C, with A: (3,4), B: (4,5), C: (5,6)
let a = Tensor::<f64>::zeros(&[3, 4], LogicalMemorySpace::MainMemory, col).unwrap();
let b = Tensor::<f64>::zeros(&[4, 5], LogicalMemorySpace::MainMemory, col).unwrap();
let c = Tensor::<f64>::zeros(&[5, 6], LogicalMemorySpace::MainMemory, col).unwrap();

// Parentheses specify: contract B*C first, then A*(BC)
let d = einsum::<Standard<f64>, CpuBackend>(&mut ctx, "ij,(jk,kl)->il", &[&a, &b, &c], None)
    .unwrap();

// Or use ContractionTree for programmatic control
use tenferro_einsum::ContractionTree;
let subs = Subscripts::new(&[&[0, 1], &[1, 2], &[2, 3]], &[0, 3]);
let tree = ContractionTree::from_pairs(
    &subs,
    &[&[3, 4], &[4, 5], &[5, 6]],
    &[(1, 2), (0, 3)],  // B*C first (avoids large intermediate)
).unwrap();
let d = einsum_with_plan::<Standard<f64>, CpuBackend>(&mut ctx, &tree, &[&a, &b, &c], None)
    .unwrap();
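A quick way to see why the pairwise order matters is to count floating-point operations for the two orders of the A @ B @ C chain (an (m,k) x (k,n) matmul costs about 2*m*k*n flops; with the small dims used here the gap is modest, but with skewed dims it is dramatic, which is what the optimizer weighs):

```rust
// Flop count of a dense (m,k) x (k,n) matrix multiply.
fn matmul_flops(m: usize, k: usize, n: usize) -> usize {
    2 * m * k * n
}

fn main() {
    // A: (3,4), B: (4,5), C: (5,6)
    let (m, k1, k2, n) = (3, 4, 5, 6);

    // (A*B)*C: intermediate (3,5), then (3,5) x (5,6)
    let left = matmul_flops(m, k1, k2) + matmul_flops(m, k2, n);
    // A*(B*C): intermediate (4,6), then (3,4) x (4,6)
    let right = matmul_flops(k1, k2, n) + matmul_flops(m, k1, n);

    assert_eq!(left, 120 + 180);
    assert_eq!(right, 240 + 144);
    // ContractionTree::optimize searches over such costs automatically.
}
```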

§Accumulating into a pre-allocated output

use tenferro_algebra::Standard;
use tenferro_einsum::{einsum_with_plan_into, ContractionTree, Subscripts};
use tenferro_tensor::{Tensor, MemoryOrder};
use tenferro_device::LogicalMemorySpace;
use tenferro_prims::{CpuBackend, CpuContext};

let col = MemoryOrder::ColumnMajor;
let mut ctx = CpuContext::new(4);
let subs = Subscripts::new(&[&[0, 1], &[1, 2]], &[0, 2]);
let tree = ContractionTree::optimize(&subs, &[&[3, 4], &[4, 5]]).unwrap();
let a = Tensor::<f64>::zeros(&[3, 4], LogicalMemorySpace::MainMemory, col).unwrap();
let b = Tensor::<f64>::zeros(&[4, 5], LogicalMemorySpace::MainMemory, col).unwrap();
let mut c = Tensor::<f64>::zeros(&[3, 5], LogicalMemorySpace::MainMemory, col).unwrap();

// Hot loop: reuse output buffer, zero allocation per iteration
for _ in 0..1000 {
    // C = 1.0 * (A @ B) + 0.0 * C  (overwrite)
    einsum_with_plan_into::<Standard<f64>, CpuBackend>(
        &mut ctx, &tree, &[&a, &b], 1.0, 0.0, &mut c, None,
    ).unwrap();
}
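The alpha/beta convention follows BLAS: out = alpha * result + beta * out_prev, so beta = 0.0 overwrites and beta = 1.0 accumulates. A minimal elementwise sketch of these semantics (stand-in helper, not this crate's API):

```rust
// out[i] = alpha * contraction[i] + beta * out[i], the BLAS-style update
// applied by the *_into einsum variants.
fn axpby_into(alpha: f64, contraction: &[f64], beta: f64, out: &mut [f64]) {
    for (o, c) in out.iter_mut().zip(contraction) {
        *o = alpha * c + beta * *o;
    }
}

fn main() {
    let fresh = [10.0, 20.0];
    let mut out = [1.0, 2.0];

    axpby_into(1.0, &fresh, 0.0, &mut out); // beta = 0.0: overwrite
    assert_eq!(out, [10.0, 20.0]);

    axpby_into(0.5, &fresh, 1.0, &mut out); // beta = 1.0: accumulate
    assert_eq!(out, [15.0, 30.0]);
}
```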

§GPU async chaining (deferred evaluation)

Status: Not yet implemented. GPU backends do not exist yet. The examples below are aspirational design targets, not working code.

GPU einsum operations return immediately. The result tensor carries a CompletionEvent that tracks the pending accelerator work. Passing this tensor to another einsum chains via GPU stream dependencies — no CPU synchronization until data is accessed from the host.

  • wait() — explicitly blocks until computation completes
  • dims(), strides() — implicitly call wait()
  • For CPU tensors, event is always None (zero overhead)

use tenferro_algebra::Standard;
use tenferro_einsum::einsum;
use tenferro_tensor::{Tensor, MemoryOrder};
use tenferro_device::LogicalMemorySpace;
use tenferro_prims::CudaBackend; // future

// In production, obtain memory spaces via BackendRegistry (future API).
let gpu_mem = LogicalMemorySpace::GpuMemory { device_id: 0 };
let col = MemoryOrder::ColumnMajor;
let mut gpu_ctx = /* CudaContext from BackendRegistry */;

let a = Tensor::<f64>::zeros(&[3, 4], gpu_mem, col).unwrap();
let b = Tensor::<f64>::zeros(&[4, 5], gpu_mem, col).unwrap();

// Both einsum calls submit work to the GPU and return immediately.
// The second call detects c's pending event and chains on the stream.
let c = einsum::<Standard<f64>, CudaBackend>(&mut gpu_ctx, "ij,jk->ik", &[&a, &b], None)
    .unwrap();
let d = einsum::<Standard<f64>, CudaBackend>(&mut gpu_ctx, "ij,jk->ik", &[&c, &b], None)
    .unwrap();

// wait() blocks until GPU computation completes
d.wait();
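The event-carrying design can be modeled with a toy wrapper (an illustration of the intended semantics under the assumptions above, not the real tensor or event API):

```rust
// Toy model: a tensor optionally carries a pending completion event.
// Chained device ops would propagate the event; host-side accessors
// synchronize on it implicitly.
struct Event {
    id: u32,
}

struct DeviceTensor {
    data: Vec<f64>,
    pending: Option<Event>, // None for CPU tensors: zero overhead
}

impl DeviceTensor {
    fn wait(&mut self) {
        if let Some(e) = self.pending.take() {
            // A real backend would block on the stream/event here.
            println!("synchronized on event {}", e.id);
        }
    }

    fn dims(&mut self) -> usize {
        self.wait(); // host-side accessors call wait() implicitly
        self.data.len()
    }
}

fn main() {
    let mut t = DeviceTensor {
        data: vec![0.0; 6],
        pending: Some(Event { id: 7 }),
    };
    assert_eq!(t.dims(), 6); // first access synchronizes...
    assert!(t.pending.is_none()); // ...and clears the event
}
```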

§Specifying a compute device

Status: Not yet implemented. See GPU note above.

use tenferro_algebra::Standard;
use tenferro_einsum::einsum;
use tenferro_tensor::{Tensor, MemoryOrder};
use tenferro_device::{LogicalMemorySpace, ComputeDevice};
use tenferro_prims::CudaBackend; // future

let col = MemoryOrder::ColumnMajor;
// In production, obtain memory spaces via BackendRegistry (future API).
let gpu_mem = LogicalMemorySpace::GpuMemory { device_id: 0 };
let mut gpu_ctx = /* CudaContext from BackendRegistry */;

let mut a = Tensor::<f64>::zeros(&[3, 4], gpu_mem, col).unwrap();
let mut b = Tensor::<f64>::zeros(&[4, 5], gpu_mem, col).unwrap();

// Pin tensors to CUDA device 1 (overrides automatic device selection).
// This works when CUDA device 1 can access GpuMemory { device_id: 0 }
// (e.g., same physical GPU or NVLink-connected peer).
// If the device cannot access the memory space, einsum returns
// Err(NoCompatibleComputeDevice). In that case, transfer explicitly:
//   let a = a.to_memory_space_async(GpuMemory { device_id: 1 }).unwrap();
a.set_preferred_compute_device(Some(ComputeDevice::Cuda { device_id: 1 }));
b.set_preferred_compute_device(Some(ComputeDevice::Cuda { device_id: 1 }));

// einsum dispatches to the specified CUDA device
let c = einsum::<Standard<f64>, CudaBackend>(&mut gpu_ctx, "ij,jk->ik", &[&a, &b], None)
    .unwrap();

// Clear override — revert to automatic device selection
// a.set_preferred_compute_device(None);

Structs§

ContractionOptimizerOptions
Public options for automatic contraction-path optimization.
ContractionTree
Contraction tree determining pairwise contraction order for N-ary einsum.
Subscripts
Einsum subscripts using integer labels (omeinsum-rs compatible).

Enums§

NestedEinsum
Recursive einsum tree that preserves parenthesized grouping.

Traits§

EinsumBackend
Public backend contract accepted by tenferro-einsum.

Functions§

einsum
Execute einsum using string notation.
einsum_binary
Execute a binary einsum from string notation.
einsum_binary_into
Execute a binary einsum and accumulate into an existing output buffer.
einsum_binary_with_subscripts
Execute a binary einsum from pre-parsed subscripts.
einsum_binary_with_subscripts_into
Execute a binary einsum from pre-parsed subscripts, accumulating into output.
einsum_frule
Forward-mode rule (frule) for einsum without building a global tape.
einsum_into
Execute einsum using string notation, accumulating into an existing output.
einsum_owned
Execute einsum using string notation, consuming the input tensors.
einsum_rrule
Reverse-mode rule (rrule) for einsum without building a global tape.
einsum_with_path
Execute N-ary einsum with an explicit pairwise contraction path.
einsum_with_path_into
Execute N-ary einsum with an explicit pairwise contraction path, accumulating into an existing output tensor.
einsum_with_plan
Execute einsum with a pre-optimized ContractionTree.
einsum_with_plan_into
Execute einsum with a pre-optimized ContractionTree, accumulating into an existing output.
einsum_with_plan_owned
Execute einsum with a pre-optimized ContractionTree, consuming the input tensors.
einsum_with_subscripts
Execute einsum with pre-built Subscripts.
einsum_with_subscripts_into
Execute einsum with pre-built Subscripts, accumulating into an existing output.
einsum_with_subscripts_owned
Execute einsum with pre-built Subscripts, consuming the input tensors.

Type Aliases§

BackendContext
Context type used by public einsum APIs for a backend/algebra pair.