Devices and GPU

tenferro follows the PyTorch convention: no implicit CPU/GPU transfer. A tensor must already live on the device required by the backend operation.

CUDA and WebGPU are backend/device choices, not separate tensor types. The same concrete, eager, and traced APIs can run supported GPU operations when tensors are explicitly uploaded to the selected provider and the executor/backend uses the same provider.

CUDA support targets NVIDIA CUDA. WebGPU support is experimental and currently focused on explicit transfer plus limited dot_general/einsum coverage. AMD/ROCm is not a supported execution path yet.

The optional XLA/PJRT path is separate from these tensor backends. It lowers static-shaped traced programs to StableHLO in tenferro-xla and loads PJRT plugins at runtime. See XLA and PJRT.

Provider Matrix

| Provider | Status | Feature | Notes |
|---|---|---|---|
| CPU | Supported | default CPU provider features | Host execution |
| CUDA | Supported | cuda | NVIDIA CUDA through CubeCL-CUDA plus CUDA libraries |
| WebGPU | Experimental | webgpu | Explicit transfer and limited dot_general/einsum coverage |
| ROCm | Not supported for execution | rocm (reserved) | Future compile-only substrate; no runtime quickstart |

Transfer Model

| Boundary | What happens |
|---|---|
| CPU tensor to CUDA backend | Upload first with tenferro_gpu::upload_tensor |
| CPU tensor to WebGPU backend | Upload first with tenferro_gpu::upload_webgpu_tensor |
| CUDA tensor to CUDA backend | Runs on CUDA for supported op/dtype combinations |
| WebGPU tensor to WebGPU backend | Runs on WebGPU for supported op/dtype combinations |
| CUDA tensor to CPU backend | Result-returning CPU backend ops fail; download first |
| WebGPU tensor to CPU backend | Result-returning CPU backend ops fail; download first |
| GPU tensor to host inspection | Direct host slice APIs panic; download first |
| Unsupported CUDA op or dtype | Error, not silent CPU fallback |
| Unsupported WebGPU op or dtype | Error, not silent CPU fallback |

Keep tensors on one GPU provider across a GPU workload. Download only when the host needs to inspect values or hand data to CPU-only code.

View compaction follows the same rule. A CUDA backend may copy a CUDA view into compact CUDA memory, a WebGPU backend may copy a WebGPU view into compact WebGPU memory, and host code may copy a host view into compact host memory, but tenferro does not use that copy as a hidden CPU/GPU transfer.
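The policy in the table can be modelled in a few lines of plain Rust. The sketch below is not tenferro code: Device, ToyTensor, DeviceMismatch, transfer, and add_on are hypothetical names chosen for illustration. It only mirrors the rule stated above, namely that an operand on the wrong device produces an error and that an explicit transfer call is the only way data changes device.

```rust
use std::error::Error;
use std::fmt;

/// Where a toy tensor's buffer lives. Hypothetical stand-in for a provider.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Device {
    Cpu,
    Cuda,
    WebGpu,
}

/// Hypothetical tensor: a flat buffer tagged with its device.
#[derive(Clone, Debug, PartialEq)]
pub struct ToyTensor {
    pub device: Device,
    pub data: Vec<f64>,
}

/// Error returned instead of performing a hidden transfer.
#[derive(Debug, PartialEq, Eq)]
pub struct DeviceMismatch {
    pub backend: Device,
    pub tensor: Device,
}

impl fmt::Display for DeviceMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "tensor on {:?} passed to {:?} backend; transfer it explicitly first",
            self.tensor, self.backend
        )
    }
}

impl Error for DeviceMismatch {}

/// Explicit transfer: the only way a buffer changes device in this model.
pub fn transfer(t: &ToyTensor, to: Device) -> ToyTensor {
    ToyTensor { device: to, data: t.data.clone() }
}

/// Elementwise add on `backend`. Both operands must already live there.
pub fn add_on(
    backend: Device,
    a: &ToyTensor,
    b: &ToyTensor,
) -> Result<ToyTensor, DeviceMismatch> {
    for t in [a, b] {
        if t.device != backend {
            return Err(DeviceMismatch { backend, tensor: t.device });
        }
    }
    let data = a.data.iter().zip(&b.data).map(|(x, y)| x + y).collect();
    Ok(ToyTensor { device: backend, data })
}
```

In this model, calling add_on(Device::Cuda, ...) with two Device::Cpu tensors returns a DeviceMismatch, while the same call succeeds after both operands pass through transfer(_, Device::Cuda). The design choice is that the caller always sees where a copy happens.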

Eager GPU Synchronization

Eager GPU execution submits work immediately and returns a provider-resident Tensor handle. Normal kernel launches do not imply host synchronization after every op. Subsequent GPU ops can consume the returned handle on the same backend stream or queue.

The host waits when a value is downloaded or otherwise inspected on the host. Some library-backed operations also synchronize internally when they must read device-side status.

Use EagerRuntime::synchronize() when code needs an explicit host-side barrier for the eager runtime. CPU runtimes return immediately; CUDA runtimes wait for the current backend stream, and WebGPU runtimes wait for the WebGPU queue. Direct GPU backend code can call the provider runtime’s synchronize() through backend.runtime().
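The submit-then-barrier behaviour described above can be illustrated without a GPU. The following sketch uses only the Rust standard library and is an analogy, not tenferro's implementation: ToyQueue is a hypothetical in-order work queue backed by a worker thread, submit returns without waiting, and synchronize blocks the caller until every previously submitted job has run.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// A unit of queued work, or a fence that signals the host when reached.
enum Command {
    Run(Box<dyn FnOnce() + Send>),
    Fence(Sender<()>),
}

/// Hypothetical in-order work queue standing in for a GPU stream or queue.
pub struct ToyQueue {
    tx: Sender<Command>,
}

impl ToyQueue {
    pub fn new() -> ToyQueue {
        let (tx, rx) = channel::<Command>();
        thread::spawn(move || {
            // Commands run strictly in submission order, like ops on one stream.
            // The loop ends when the ToyQueue (and its sender) is dropped.
            for cmd in rx {
                match cmd {
                    Command::Run(job) => job(),
                    Command::Fence(done) => {
                        let _ = done.send(());
                    }
                }
            }
        });
        ToyQueue { tx }
    }

    /// Enqueue work and return immediately; the caller does not wait.
    pub fn submit<F>(&self, job: F)
    where
        F: FnOnce() + Send + 'static,
    {
        self.tx
            .send(Command::Run(Box::new(job)))
            .expect("worker thread stopped");
    }

    /// Host-side barrier: block until everything submitted so far has run.
    pub fn synchronize(&self) {
        let (done_tx, done_rx) = channel();
        self.tx
            .send(Command::Fence(done_tx))
            .expect("worker thread stopped");
        done_rx.recv().expect("worker thread stopped");
    }
}
```

Because the fence travels through the same ordered channel as the jobs, it is reached only after all earlier jobs have finished. That is the property an explicit barrier provides in this model; a download behaves like a submit followed by a barrier.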

For a time-axis diagram, see Execution Models.

CUDA Quickstart

use tenferro_gpu::{download_tensor, upload_tensor, CudaBackend};
use tenferro_tensor::{Tensor, TensorElementwise};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Exit quietly on machines without a usable GPU.
    if !tenferro_gpu::gpu_available() {
        return Ok(());
    }

    // Create the CUDA backend and two small host tensors.
    let mut backend = CudaBackend::new(0)?;
    let cpu_a = Tensor::from_vec_col_major(vec![2], vec![1.0_f64, 2.0])?;
    let cpu_b = Tensor::from_vec_col_major(vec![2], vec![3.0_f64, 4.0])?;

    // Transfers are explicit: upload, compute on the GPU, then download.
    let gpu_a = upload_tensor(backend.runtime(), &cpu_a)?;
    let gpu_b = upload_tensor(backend.runtime(), &cpu_b)?;
    let gpu_c = backend.add(&gpu_a, &gpu_b)?;
    let cpu_c = download_tensor(backend.runtime(), &gpu_c)?;

    // Host slice access is only valid on the downloaded tensor.
    assert_eq!(cpu_c.as_slice::<f64>().unwrap(), &[4.0, 6.0]);
    Ok(())
}

Compile-check the example without requiring a GPU:

cargo check -p tenferro-gpu --features cuda --example cuda_quickstart

Run it on a configured CUDA machine:

CUDA_PATH=/usr/local/cuda-12.8 \
LD_LIBRARY_PATH=$CUDA_PATH/lib64:$LD_LIBRARY_PATH \
  cargo run -p tenferro-gpu --features cuda --example cuda_quickstart

The example downloads the result back to CPU and asserts the expected values.

Use the installed CUDA root on your machine. If several roots exist, inspect them first:

ls -d /usr/local/cuda*

If CUDA libraries or cuTENSOR are outside the standard dynamic-linker paths, set:

export CUDA_PATH=/usr/local/cuda-12.8
export LD_LIBRARY_PATH=$CUDA_PATH/lib64:/usr/lib/x86_64-linux-gnu/libcutensor/12:$LD_LIBRARY_PATH
export TENFERRO_CUTENSOR_PATH=/usr/lib/x86_64-linux-gnu/libcutensor/12/libcutensor.so.2
export TENFERRO_CUSOLVER_PATH=$CUDA_PATH/lib64/libcusolver.so.12
export TENFERRO_CUBLAS_PATH=$CUDA_PATH/lib64/libcublas.so.12
export CUBECL_DEBUG_LOG=0
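Before running a CUDA example, it can help to confirm that the override variables point at real files. The sketch below is a standalone pre-flight check using only the Rust standard library; it is not how tenferro itself resolves these variables, and OVERRIDE_VARS, OverrideStatus, classify, and check_overrides are hypothetical names. It assumes each variable holds the full path to a shared-library file, as in the exports above.

```rust
use std::env;
use std::path::PathBuf;

/// Override variables named in this guide.
pub const OVERRIDE_VARS: [&str; 3] = [
    "TENFERRO_CUTENSOR_PATH",
    "TENFERRO_CUSOLVER_PATH",
    "TENFERRO_CUBLAS_PATH",
];

/// Outcome of checking one override variable.
#[derive(Debug, PartialEq, Eq)]
pub enum OverrideStatus {
    /// The variable is not set (or is not valid Unicode).
    Unset,
    /// The variable points at an existing file (symlinks are followed).
    Found(PathBuf),
    /// The variable is set but no file exists at that path.
    Missing(PathBuf),
}

/// Classify the value of one variable without touching the environment.
pub fn classify(value: Option<String>) -> OverrideStatus {
    match value {
        None => OverrideStatus::Unset,
        Some(v) => {
            let path = PathBuf::from(v);
            if path.is_file() {
                OverrideStatus::Found(path)
            } else {
                OverrideStatus::Missing(path)
            }
        }
    }
}

/// Read every override variable from the process environment and classify it.
pub fn check_overrides() -> Vec<(&'static str, OverrideStatus)> {
    OVERRIDE_VARS
        .iter()
        .map(|name| (*name, classify(env::var(name).ok())))
        .collect()
}
```

Printing the result of check_overrides() with {:?} before constructing a backend gives a quick view of which overrides are unset and which point at missing files.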

For XLA/PJRT GPU verification, add the PJRT plugin path separately:

export TENFERRO_PJRT_GPU_PLUGIN=/path/to/pjrt_c_api_gpu_plugin.so

WebGPU Quickstart

WebGPU is experimental and currently useful for explicit transfer plus F32/C32 dot_general and einsum paths. For a WebGPU-only binary, disable default features on tenferro-gpu unless the same crate also needs the default CPU provider stack:

[dependencies]
tenferro-gpu = { version = "...", default-features = false, features = ["webgpu"] }
tenferro-tensor = "..."

For a local scratch crate inside the checkout, use matching path dependencies and add an empty [workspace] table as described in Getting Started.

use tenferro_gpu::{
    download_webgpu_tensor, upload_webgpu_tensor, webgpu_available, WebGpuBackend,
};
use tenferro_tensor::{DotGeneralConfig, Tensor, TensorDot};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    if !webgpu_available() {
        return Ok(());
    }

    let mut backend = WebGpuBackend::new_default()?;
    // Host tensors are built from column-major data: lhs is 2x3, rhs is 3x2.
    let lhs = Tensor::from_vec_col_major(vec![2, 3], vec![1.0_f32, 2.0, 3.0, 4.0, 5.0, 6.0])?;
    let rhs = Tensor::from_vec_col_major(vec![3, 2], vec![1.0_f32, 2.0, 3.0, 4.0, 5.0, 6.0])?;
    // Contract lhs dimension 1 with rhs dimension 0 (a plain matrix product).
    let config = DotGeneralConfig {
        lhs_contracting_dims: vec![1],
        rhs_contracting_dims: vec![0],
        lhs_batch_dims: vec![],
        rhs_batch_dims: vec![],
    };

    // Transfers are explicit: upload, compute on WebGPU, then download.
    let gpu_lhs = upload_webgpu_tensor(backend.runtime(), &lhs)?;
    let gpu_rhs = upload_webgpu_tensor(backend.runtime(), &rhs)?;
    let gpu_out = backend.dot_general(&gpu_lhs, &gpu_rhs, &config)?;
    let out = download_webgpu_tensor(backend.runtime(), &gpu_out)?;

    assert_eq!(out.shape(), &[2, 2]);
    assert_eq!(out.as_slice::<f32>().unwrap(), &[22.0, 28.0, 49.0, 64.0]);
    // Explicit host-side barrier on the WebGPU queue.
    backend.runtime().synchronize()?;
    Ok(())
}

DotGeneralConfig uses StableHLO-style dimension-number fields: lhs_contracting_dims, rhs_contracting_dims, lhs_batch_dims, and rhs_batch_dims. Direct backend code can use the lower-level backend.runtime().synchronize() barrier; eager code can use EagerRuntime::synchronize().
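To see where the expected values in the quickstart come from, the contraction can be reproduced on the host. The sketch below is a plain-Rust reference, independent of tenferro: col_major_index and matmul_col_major are hypothetical helper names. It contracts dimension 1 of the left operand with dimension 0 of the right operand over column-major buffers, which is the contraction the quickstart config requests.

```rust
/// Column-major offset of element (row, col) in a matrix with `rows` rows.
pub fn col_major_index(rows: usize, row: usize, col: usize) -> usize {
    row + rows * col
}

/// Reference product of an m x k and a k x n matrix, all column-major.
/// Contracts lhs dimension 1 with rhs dimension 0.
pub fn matmul_col_major(m: usize, k: usize, n: usize, lhs: &[f32], rhs: &[f32]) -> Vec<f32> {
    assert_eq!(lhs.len(), m * k);
    assert_eq!(rhs.len(), k * n);
    let mut out = vec![0.0_f32; m * n];
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0_f32;
            for p in 0..k {
                acc += lhs[col_major_index(m, i, p)] * rhs[col_major_index(k, p, j)];
            }
            out[col_major_index(m, i, j)] = acc;
        }
    }
    out
}
```

With the quickstart data, the column-major buffer [1, 2, 3, 4, 5, 6] is the 2x3 matrix with rows (1, 3, 5) and (2, 4, 6) on the left and the 3x2 matrix with columns (1, 2, 3) and (4, 5, 6) on the right. matmul_col_major(2, 3, 2, ...) then yields [22, 28, 49, 64] in column-major order, matching the assertion in the example.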

CUDA Across Tensor Layers

| Tensor model | How CUDA fits |
|---|---|
| TypedTensor<T, R> | Fixed-dtype runtime tensor with optional compile-time rank; host access still requires explicit download for CUDA buffers. |
| Tensor | Main concrete CUDA value for backend execution |
| EagerTensor | Wraps CUDA-resident Tensor values when using an EagerRuntime with CudaBackend |
| TracedTensor | Graphs can be executed by GraphExecutor<CudaBackend> for supported ops |

CUDA coverage is about backend dispatch. It is not the same as automatic differentiation (AD) coverage.

Coverage

The CUDA backend uses the same concrete, eager, and traced tensor APIs as the CPU backend. The table below describes the current CUDA backend dispatch coverage for CUDA-resident Tensor values. It is not an autodiff coverage table.

Legend:

  • F32, F64, I32, I64, Bool, C32, and C64 are the current public Tensor dtypes.
  • Listed dtypes have CUDA implementations for that operation.
  • Missing dtypes or rows marked “No CUDA implementation” return an error rather than silently falling back to CPU.

| Operation or family | CUDA dtype support | Notes |
|---|---|---|
| Allocation, upload, download | F32, F64, I32, I64, Bool, C32, C64 | Explicit CPU/GPU transfer only |
| add, mul, div | F32, F64, C32, C64 | Same dtype inputs only; integer and Bool arithmetic are not implemented |
| neg | F32, F64, C32, C64 | Integer and Bool negation are not implemented |
| conj | F32, F64, C32, C64 | Real floating dtypes are identity; integer and Bool inputs are not implemented |
| abs, sign | F32, F64 | Complex, integer, and Bool inputs are not implemented |
| maximum, minimum, compare, select, clamp | F32, F64 | Complex ordering is not defined; compare returns a Bool tensor and select takes a Bool predicate |
| exp, log, sin, cos, tanh, sqrt, rsqrt, expm1, log1p | F32, F64 | Complex analytic kernels are not implemented |
| pow | F32, F64 | Same dtype inputs only |
| reshape | F32, F64, I32, I64, Bool, C32, C64 | Metadata-only shape change |
| transpose, broadcast_in_dim, extract_diagonal, embed_diagonal, tril, triu | F32, F64, I32, I64, C32, C64 | Structural tensor operations; Bool is not implemented |
| checked convert, explicit cast | F32, F64, C32, C64 among those dtypes; I32, I64, and Bool identity only | convert applies the public checked conversion contract before backend dispatch; cast is explicit dtype projection. Conversion to or from integer or Bool dtypes is not implemented on CUDA except identity |
| reduce_sum, reduce_prod | F32, F64, I32, I64, C32, C64 | Multi-axis reductions are composed from single-axis kernels; Bool is not implemented |
| reduce_max, reduce_min | F32, F64 | Complex ordering is not defined; integer and Bool min/max are not implemented |
| dot_general | F32, F64, C32, C64 | cuTENSOR-backed contraction; same dtype inputs only |
| gather | operand F32, F64, I32, C32, C64; indices F32, F64, I32, or I64 | Complex and Bool index tensors and I64 and Bool operands are not implemented |
| scatter | operand/update F32, F64, C32, C64; indices F32, F64, I32, or I64 | Add-scatter semantics; complex and Bool index tensors and integer/Bool operands are not implemented |
| slice, pad, concatenate, reverse | F32, F64, I32, I64, C32, C64 | Dense structural/indexing operations; Bool is not implemented |
| dynamic_slice | input F32, F64, I32, C32, C64; starts F32, F64, I32, or I64 | Complex and Bool start tensors and I64 and Bool inputs are not implemented |
| dynamic_update_slice | No CUDA implementation | Returns an error |
| cholesky, triangular_solve, lu, svd, qr, eigh, solve | F32, F64, C32, C64 | cuSOLVER/cuBLAS-backed; integer and Bool dtypes are not implemented |
| full_piv_lu, full_piv_lu_solve | No CUDA implementation | Returns an error |
| General eig | No CUDA implementation | cuSOLVER does not provide LAPACK dgeev-style general eigendecomposition; download to CPU explicitly |
| AMD/ROCm | No supported execution backend | ROCm remains reserved for future loader-backed work |
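Because unsupported combinations return errors rather than falling back, callers may want to check a dtype before uploading data. The sketch below hand-transcribes a few rows of the table above into a lookup. It is an illustration only: DType, cuda_dtypes, and check_cuda are hypothetical names, the lookup is not queried from the library, and it would need updating whenever the table changes.

```rust
/// Local stand-in for the public Tensor dtypes listed in the legend.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DType {
    F32,
    F64,
    I32,
    I64,
    Bool,
    C32,
    C64,
}

const ALL: &[DType] = &[
    DType::F32, DType::F64, DType::I32, DType::I64,
    DType::Bool, DType::C32, DType::C64,
];
const FLOAT_COMPLEX: &[DType] = &[DType::F32, DType::F64, DType::C32, DType::C64];
const REAL_FLOAT: &[DType] = &[DType::F32, DType::F64];
const NONE: &[DType] = &[];

/// Dtypes the table lists for a handful of operations.
/// Returns None for operations this sketch does not transcribe.
pub fn cuda_dtypes(op: &str) -> Option<&'static [DType]> {
    match op {
        "reshape" => Some(ALL),
        "add" | "mul" | "div" | "neg" | "conj" | "dot_general" => Some(FLOAT_COMPLEX),
        "abs" | "sign" | "pow" | "reduce_max" | "reduce_min" => Some(REAL_FLOAT),
        "dynamic_update_slice" | "full_piv_lu" | "full_piv_lu_solve" => Some(NONE),
        _ => None,
    }
}

/// Ok if the table lists `dtype` for `op`, otherwise a descriptive error.
pub fn check_cuda(op: &str, dtype: DType) -> Result<(), String> {
    match cuda_dtypes(op) {
        Some(list) if list.contains(&dtype) => Ok(()),
        Some(_) => Err(format!("{} has no CUDA implementation for {:?}", op, dtype)),
        None => Err(format!("{} is not covered by this sketch", op)),
    }
}
```

For example, check_cuda("add", DType::I32) returns an error, which mirrors the table entry stating that integer arithmetic is not implemented, while check_cuda("reshape", DType::Bool) succeeds.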

WebGPU Coverage

The WebGPU backend is experimental. It uses the same tensor APIs as CUDA and CPU, but its operation coverage is intentionally much narrower. Unsupported rows return explicit errors and do not fall back to CPU.

| Operation or family | WebGPU dtype support | Notes |
|---|---|---|
| Allocation, upload, download | F32, F64, I32, I64, Bool, C32, C64 | Explicit CPU/WebGPU transfer only |
| dot_general, dot_general_with_conj | F32, C32 | CubeK-backed BGEMM planner; C32 conjugation is handled by the CubeK complex GEMM API. Supports rank-2, batched, and same-device packed operand layouts covered by tests |
| Binary einsum lowering to dot_general | F32, C32 | Eager F32/C32 and traced F32 paths are covered when inputs are explicitly uploaded to WebGPU |
| dot_general with zero contracting size | No WebGPU implementation | Returns an error until CubeK behavior is validated |
| dot_general for F64, C64 | No WebGPU implementation | Returns an error; no CPU fallback |
| Elementwise and analytic ops | No WebGPU implementation | Returns an error |
| Reductions | No WebGPU implementation | Returns an error |
| Structural/indexing ops beyond transfer-owned allocation metadata | No WebGPU implementation | Returns an error |
| Linalg | No WebGPU implementation | Returns an error |
| ROCm | No supported execution backend | No ROCm quickstart is provided |
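The table mentions that binary einsum expressions lower to dot_general. As a rough illustration of that idea, the sketch below derives dimension numbers from a two-operand spec such as "ij,jk->ik". It is not tenferro's lowering, which may also handle transposes, output ordering, and other cases. DotDims, has_repeat, and binary_einsum_dims are hypothetical names; only the four field names are taken from DotGeneralConfig. The rule assumed here is the usual one: a label shared by both inputs is a batch dimension if it survives in the output and a contracting dimension if it does not.

```rust
/// Dimension numbers derived from a two-operand einsum spec.
/// Field names mirror DotGeneralConfig; the struct is a local stand-in.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct DotDims {
    pub lhs_contracting_dims: Vec<usize>,
    pub rhs_contracting_dims: Vec<usize>,
    pub lhs_batch_dims: Vec<usize>,
    pub rhs_batch_dims: Vec<usize>,
}

/// True if any label occurs more than once in one operand.
fn has_repeat(labels: &[char]) -> bool {
    labels
        .iter()
        .enumerate()
        .any(|(i, c)| labels[..i].contains(c))
}

/// Classify the labels shared by both inputs of a spec such as "ij,jk->ik".
/// Returns None for specs outside this simple form: a missing "->",
/// an operand count other than two, or repeated labels within an operand.
/// Whitespace is not handled.
pub fn binary_einsum_dims(spec: &str) -> Option<DotDims> {
    let (inputs, output) = spec.split_once("->")?;
    let (lhs, rhs) = inputs.split_once(',')?;
    if rhs.contains(',') {
        return None;
    }
    let lhs: Vec<char> = lhs.chars().collect();
    let rhs: Vec<char> = rhs.chars().collect();
    if has_repeat(&lhs) || has_repeat(&rhs) {
        return None;
    }

    let mut dims = DotDims::default();
    for (i, label) in lhs.iter().enumerate() {
        // Labels that appear in only one operand are free dimensions.
        if let Some(j) = rhs.iter().position(|c| c == label) {
            if output.contains(*label) {
                dims.lhs_batch_dims.push(i);
                dims.rhs_batch_dims.push(j);
            } else {
                dims.lhs_contracting_dims.push(i);
                dims.rhs_contracting_dims.push(j);
            }
        }
    }
    Some(dims)
}
```

For "ij,jk->ik" this yields lhs_contracting_dims [1] and rhs_contracting_dims [0] with empty batch lists, the same values used in the WebGPU quickstart. For "bij,bjk->bik" the shared label b becomes batch dimension 0 on both sides.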

If cuTENSOR, cuSOLVER, or cuBLAS is installed outside the normal dynamic-linker paths, set TENFERRO_CUTENSOR_PATH, TENFERRO_CUSOLVER_PATH, or TENFERRO_CUBLAS_PATH, respectively, to the path of the shared library, as in the CUDA Quickstart exports.