Core Concepts

tenferro separates tensor data, execution timing, automatic differentiation, and device location. That separation is the main design point. Users can stay in typed tensor code for ordinary numeric work without autodiff; move to eager execution for PyTorch-like forward-and-backward training loops or the functional grad, vjp, and jvp transforms; or move to traced graphs for compiled transform workflows with compile/run reuse.

The Three Axes

Axis	What it controls	User-facing choices
Data layer	The value you pass around	`TypedTensor<T>`, `Tensor`, `EagerTensor`, `TracedTensor`
Execution model	When operations run	Direct, eager, traced compile/run
Backend/device	Where operations run	`CpuBackend` or `tenferro_gpu::CudaBackend`

CUDA is not a separate tensor type. The same concrete, eager, and traced APIs can run supported operations on CUDA tensors when data is explicitly uploaded to a CUDA backend.

Tensor Layers

Layer	Role	Good fit
`TypedTensor<T>`	Concrete tensor with compile-time scalar type	Most typed numeric code, typed host data, typed linalg/einsum
`Tensor`	Concrete tensor with runtime dtype	Dynamic dtype workflows, backend dispatch, CPU/CUDA values
`EagerTensor`	Concrete tensor in an eager runtime, with optional gradient tracking	Immediate forward execution; `backward()` on scalar losses and `EagerRuntime` functional transforms when tracked
`TracedTensor`	Graph-building tensor	Compiled `grad`, `vjp`, and `jvp` on traced graphs, graph optimization, repeated execution

Use the smallest layer that matches the job. AD is optional, and many projects should never leave the concrete tensor APIs.

Memory Model

tenferro stores dense tensors as contiguous column-major buffers. The leftmost dimension varies fastest in memory. This matches Fortran, Julia, MATLAB, and LAPACK-oriented workflows, and makes trailing batch axes natural for batched linear algebra and contractions.

use tenferro_runtime::{Tensor, TypedTensor};

fn main() {
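    // Shape [2, 3] in column-major order: the buffer lists column 0, then column 1,
    // then column 2, so this is the matrix [[1, 2, 3], [4, 5, 6]].
    let typed =
        TypedTensor::<f64>::from_vec_col_major(vec![2, 3], vec![1.0, 4.0, 2.0, 5.0, 3.0, 6.0])
            .unwrap();
    // Reading back returns the buffer in the same column-major order.
    assert_eq!(typed.as_slice().unwrap(), &[1.0, 4.0, 2.0, 5.0, 3.0, 6.0]);

    // The same data through the runtime-dtype layer.
    let dynamic =
        Tensor::from_vec_col_major(vec![2, 3], vec![1.0_f64, 4.0, 2.0, 5.0, 3.0, 6.0]).unwrap();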
    assert_eq!(
        dynamic.as_slice::<f64>().unwrap(),
        typed.as_slice().unwrap()
    );
}

Use from_vec_col_major for tensor construction. Data copied from PyTorch, NumPy, JAX, or C-style examples is typically row-major and must be explicitly reordered into column-major order at the boundary before entering tenferro.
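
The boundary reordering described above can be sketched in plain Rust for the two-dimensional case. The helper names col_major_offset and row_major_to_col_major are hypothetical and are not part of tenferro; the sketch only assumes the layout rule stated in this section, that the leftmost index varies fastest.

```rust
/// Offset of element (i, j) in a column-major buffer with `rows` rows.
/// Hypothetical helper for illustration; not a tenferro API.
fn col_major_offset(i: usize, j: usize, rows: usize) -> usize {
    i + j * rows
}

/// Reorder a row-major `rows x cols` buffer into column-major order.
/// Hypothetical helper for illustration; not a tenferro API.
fn row_major_to_col_major<T: Copy>(data: &[T], rows: usize, cols: usize) -> Vec<T> {
    assert_eq!(data.len(), rows * cols, "buffer length must equal rows * cols");
    let mut out = Vec::with_capacity(data.len());
    // Fill the output column by column, with the row index varying fastest.
    for j in 0..cols {
        for i in 0..rows {
            // `i * cols + j` is the row-major offset of element (i, j).
            out.push(data[i * cols + j]);
        }
    }
    out
}
```

For the matrix [[1, 2, 3], [4, 5, 6]], the row-major buffer [1, 2, 3, 4, 5, 6] becomes [1, 4, 2, 5, 3, 6], which is the buffer passed to from_vec_col_major in the example above. Higher-rank tensors need the same idea applied to every axis.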

Read values back with as_slice::<T>() on Tensor, or as_slice() on TypedTensor<T>. Both return a Result<&[T]> rather than an Option, so a dtype mismatch surfaces as an Err.

Direct Tensor Execution

Tensor operations run immediately through an explicit backend. Use this for ordinary tensor computation without autodiff when runtime dtype is useful.

use tenferro_cpu::CpuBackend;
use tenferro_runtime::{Tensor, TensorOpsExt};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut backend = CpuBackend::new();

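    // Column-major buffers: a is 2x3 = [[1, 3, 5], [2, 4, 6]],
    // b is 3x2 = [[1, 4], [2, 5], [3, 6]].
    let a = Tensor::from_vec_col_major(vec![2, 3], vec![1.0_f64, 2.0, 3.0, 4.0, 5.0, 6.0])?;
    let b = Tensor::from_vec_col_major(vec![3, 2], vec![1.0_f64, 2.0, 3.0, 4.0, 5.0, 6.0])?;

    // The backend is passed explicitly and the operation runs immediately.
    let c = a.matmul(&b, &mut backend)?;
    assert_eq!(c.shape(), &[2, 2]);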

    Ok(())
}
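
The example above only checks the result shape. As a way to reason about the values, the following plain-Rust reference multiplies two column-major buffers by hand. It is a minimal sketch, not tenferro's implementation, and matmul_col_major is a hypothetical name; the expected numbers follow only from the column-major convention described in Memory Model.

```rust
/// Reference product C (m x n) = A (m x k) * B (k x n) on column-major buffers.
/// Hypothetical helper for illustration; not how tenferro computes matmul.
fn matmul_col_major(a: &[f64], b: &[f64], m: usize, k: usize, n: usize) -> Vec<f64> {
    assert_eq!(a.len(), m * k, "a must hold m * k elements");
    assert_eq!(b.len(), k * n, "b must hold k * n elements");
    let mut c = vec![0.0; m * n];
    for j in 0..n {
        for p in 0..k {
            // Element (p, j) of B lives at offset p + j * k.
            let b_pj = b[p + j * k];
            for i in 0..m {
                // Element (i, p) of A lives at offset i + p * m;
                // element (i, j) of C lives at offset i + j * m.
                c[i + j * m] += a[i + p * m] * b_pj;
            }
        }
    }
    c
}
```

With the two buffers from the example above (m = 2, k = 3, n = 2), this reference yields the column-major buffer [22, 28, 49, 64], that is, the matrix [[22, 49], [28, 64]].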

TypedTensor<T> is the compile-time scalar type layer. It is useful when the project already knows it is working with f64, f32, complex values, or another supported scalar type.

Einsum is provided by the tenferro-einsum standard extension. Traced code uses GraphCompilerEinsumExt and registers tenferro_einsum::register_runtime on the executor.

Eager Execution And Autodiff

EagerTensor wraps concrete values in an EagerRuntime. Each operation computes or submits immediately and returns a concrete tensor handle. CPU values are host-readable; CUDA-resident values require explicit download before host inspection. If a tensor is a tracked variable, eager operations also record reverse-mode state so a scalar loss can call backward() and accumulate gradients.

When the derivative itself should be returned as an eager tensor instead of accumulated into a gradient slot, call EagerRuntime::grad, EagerRuntime::vjp, or EagerRuntime::jvp. These functional eager transforms can be composed for HVP-style workflows. Use TracedTensor when the derivative workflow should be compiled, optimized as a graph, or reused across runs.

use tenferro_ad::{EagerRuntime, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
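    let ctx = EagerRuntime::new();
    // A tracked variable: eager operations on x also record reverse-mode state.
    let x = ctx.variable_from(Tensor::from_vec_col_major(vec![2], vec![1.0_f64, 2.0])?)?;
    // loss = sum(x * x), reduced over axis 0 to a scalar.
    let loss = x.mul(&x)?.reduce_sum(&[0])?;
    loss.backward()?;

    // d(loss)/dx = 2x, so x = [1, 2] gives the gradient [2, 4].
    assert_eq!(x.grad()?.unwrap().as_slice::<f64>().unwrap(), &[2.0, 4.0]);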

    Ok(())
}

Here Tensor is tenferro_ad’s re-export of tenferro_runtime::Tensor, the same type as the concrete tensor used in the sections above, so values move between the eager and direct APIs without conversion.
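
The gradient asserted in the eager example can be cross-checked without any autodiff machinery. The sketch below uses central finite differences in plain Rust on the same loss, sum(x * x). The names loss and central_difference_grad are hypothetical, and the step size h is an arbitrary choice for illustration; none of this is tenferro API.

```rust
/// The same scalar loss as the eager example: sum of squares.
fn loss(x: &[f64]) -> f64 {
    x.iter().map(|v| v * v).sum()
}

/// Approximate the gradient of `f` at `x` with central differences.
/// Hypothetical helper for illustration; not a tenferro API.
fn central_difference_grad(f: impl Fn(&[f64]) -> f64, x: &[f64], h: f64) -> Vec<f64> {
    let mut probe = x.to_vec();
    let mut grad = Vec::with_capacity(x.len());
    for i in 0..x.len() {
        let original = probe[i];
        probe[i] = original + h;
        let up = f(&probe);
        probe[i] = original - h;
        let down = f(&probe);
        probe[i] = original; // restore before moving to the next coordinate
        grad.push((up - down) / (2.0 * h));
    }
    grad
}
```

Calling central_difference_grad(loss, &[1.0, 2.0], 1e-5) returns values within rounding error of [2.0, 4.0], matching the analytic gradient 2x. A check like this is useful in tests, while backward() remains the practical route for real workloads.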

Traced Graph Execution

TracedTensor operations are lazy: they build a graph instead of computing values. A GraphCompiler lowers that graph into a reusable program, and a GraphExecutor<B> runs the program on a backend.

use tenferro_cpu::CpuBackend;
use tenferro_runtime::{GraphCompiler, GraphExecutor, TracedTensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Building the expression only records graph nodes; nothing is computed yet.
    let a = TracedTensor::from_vec_col_major(vec![2], vec![1.0_f64, 2.0])?;
    let b = TracedTensor::from_vec_col_major(vec![2], vec![3.0_f64, 4.0])?;
    let sum = (&a + &b)?;

    // Compile the graph into a reusable program, then run it on a backend.
    let mut compiler = GraphCompiler::new();
    let program = compiler.compile(&sum)?;
    let mut executor = GraphExecutor::new(CpuBackend::new());
    let result = executor.run(&program)?;

    // [1, 2] + [3, 4] = [4, 6]
    assert_eq!(result.as_slice::<f64>().unwrap(), &[4.0, 6.0]);

    Ok(())
}
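
The "build now, run later" idea behind traced execution can be pictured with a toy model in plain Rust. This is a conceptual sketch only: Node, constant, add, and evaluate are hypothetical names, and nothing here reflects how tenferro represents, compiles, or optimizes its graphs.

```rust
/// Toy expression graph. Hypothetical type for illustration; not tenferro's graph.
#[derive(Debug)]
enum Node {
    Constant(Vec<f64>),
    Add(Box<Node>, Box<Node>),
}

/// Record a constant leaf.
fn constant(values: Vec<f64>) -> Node {
    Node::Constant(values)
}

/// Record an addition. No arithmetic happens here; only the graph grows.
fn add(lhs: Node, rhs: Node) -> Node {
    Node::Add(Box::new(lhs), Box::new(rhs))
}

/// Walk the recorded graph and compute its value. The same graph can be
/// evaluated any number of times.
fn evaluate(node: &Node) -> Vec<f64> {
    match node {
        Node::Constant(values) => values.clone(),
        Node::Add(lhs, rhs) => {
            let (l, r) = (evaluate(lhs), evaluate(rhs));
            assert_eq!(l.len(), r.len(), "operands must have the same length");
            l.iter().zip(&r).map(|(x, y)| x + y).collect()
        }
    }
}
```

Building add(constant(vec![1.0, 2.0]), constant(vec![3.0, 4.0])) performs no addition; evaluate on that graph returns [4.0, 6.0] each time it is called. In the traced API, the separate compile step is where lowering into a reusable program takes place, which this toy model skips.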

Traced mode is the right API when grad, vjp, jvp, or HVP-style composition should run on traced graphs with symbolic inputs, graph optimization, and repeated execution. Core primitive AD rules are available by default. Extension operation families that provide AD rules, such as tenferro-linalg, require enabling that crate’s autodiff feature and registering the extension rule set with with_extension_rules.