Execution Session Architecture
Overview
BackendSession is the execution-time primitive surface. Ops run within a backend-owned execution scope when the backend has one, such as a GPU runtime or the CPU backend’s reusable buffer scope. Individual ops must not re-enter the same backend scope.
TensorBackend::with_backend_session creates the scope. GraphExecutor owns backend cache and extension runtime state, then routes an ExecProgram through segmented execution. Consecutive backend-session instructions may run inside one backend session.
    GraphExecutor::eval_exec_ir(program, inputs)
    └── segmented execution
        └── fused backend segment
            └── backend.with_backend_session(|exec| {
                    for inst in segment {
                        exec.transpose(...)
                        exec.reclaim_buffer(...)
                    }
                })
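The grouping step in the diagram above can be pictured as a single pass that collects consecutive session-capable instructions into one fused segment. The following is only a minimal sketch: `Inst`, `Segment`, and `split_segments` are hypothetical names invented for illustration, and the split into "session" and "other" instructions is an assumption, not the actual ExecProgram instruction set.

```rust
/// Hypothetical instruction kinds; the real ExecProgram instructions are richer.
#[derive(Debug, Clone, PartialEq)]
enum Inst {
    /// An op that can run on the BackendSession surface.
    Session(&'static str),
    /// An op assumed (for illustration) to run outside a backend session.
    Other(&'static str),
}

/// Hypothetical result of segmentation.
#[derive(Debug, PartialEq)]
enum Segment {
    /// Consecutive session instructions that can share one backend session.
    Fused(Vec<Inst>),
    /// A single instruction executed on its own.
    Single(Inst),
}

/// Groups each maximal run of consecutive `Inst::Session` items into one
/// `Segment::Fused`; every other instruction becomes its own segment.
fn split_segments(program: &[Inst]) -> Vec<Segment> {
    let mut segments = Vec::new();
    let mut run: Vec<Inst> = Vec::new();
    for inst in program {
        match inst {
            Inst::Session(_) => run.push(inst.clone()),
            Inst::Other(_) => {
                // A non-session instruction closes the current fused run.
                if !run.is_empty() {
                    segments.push(Segment::Fused(std::mem::take(&mut run)));
                }
                segments.push(Segment::Single(inst.clone()));
            }
        }
    }
    if !run.is_empty() {
        segments.push(Segment::Fused(run));
    }
    segments
}
```

Under these assumptions, each `Segment::Fused` would correspond to one `with_backend_session` call, with the loop over its instructions running inside the closure.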
Why Sessions
Without sessions, each backend method independently prepares its execution state and scratch-buffer access. For an N-ary einsum with hundreds of small GEMM steps, repeating that setup for every instruction can dominate the total runtime.
Sessions amortize that setup by creating one BackendSession for a fused backend segment instead of one per instruction.
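The amortization argument can be made concrete with a toy model that only counts how often setup happens. This is a sketch under stated assumptions: `CountingBackend`, `CountingSession`, `gemm_step`, and `compare_setups` are hypothetical stand-ins, and the counter merely represents whatever per-entry work a real backend performs.

```rust
/// Toy backend that counts how often execution state is prepared.
/// All names here are hypothetical stand-ins, not tenferro types.
struct CountingBackend {
    setups: usize,
}

/// Toy session handle: ops run against already-prepared state.
struct CountingSession {
    ops_run: usize,
}

impl CountingSession {
    /// An op inside a session: no setup, just the work itself.
    fn gemm_step(&mut self) {
        self.ops_run += 1;
    }
}

impl CountingBackend {
    /// Session entry: prepares execution state once for the whole closure.
    fn with_session<R>(&mut self, f: impl FnOnce(&mut CountingSession) -> R) -> R {
        self.setups += 1; // stands in for pool entry, scratch-buffer access, etc.
        let mut session = CountingSession { ops_run: 0 };
        f(&mut session)
    }

    /// Standalone op: pays the setup cost on every call.
    fn gemm_step(&mut self) {
        self.with_session(|s| s.gemm_step());
    }
}

/// Returns (setups with per-op entry, setups with one session, ops run in the session).
fn compare_setups(steps: usize) -> (usize, usize, usize) {
    let mut per_op = CountingBackend { setups: 0 };
    for _ in 0..steps {
        per_op.gemm_step();
    }

    let mut fused = CountingBackend { setups: 0 };
    let ran = fused.with_session(|s| {
        for _ in 0..steps {
            s.gemm_step();
        }
        s.ops_run
    });

    (per_op.setups, fused.setups, ran)
}
```

In this model, 300 steps cost 300 setups when each op enters on its own and a single setup when they share one session. How much that saves in practice depends on how expensive the real setup is relative to each kernel.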
Backend Mapping
CPU (faer)
CpuContext stores the requested CPU thread count and owns the Rayon pool used by tenferro-owned multi-threaded CPU work. CpuContext::install runs the closure on that owned pool for multi-thread contexts and inline for one-thread contexts. faer-backed kernels use Par::Seq for one thread and Par::rayon(0) otherwise, so faer joins the already-entered CpuContext pool.
    use std::sync::Arc;

    impl TensorBackend for CpuBackend {
        fn with_backend_session<R: Send>(
            &mut self,
            f: impl FnOnce(&mut dyn BackendSession) -> R + Send,
        ) -> R {
            // Move the buffer pool out so the session can borrow it mutably.
            let mut buffers = std::mem::take(&mut self.buffers);
            let ctx = Arc::clone(&self.ctx);
            // Enter the CpuContext once for the whole session.
            let result = ctx.install(|| {
                let mut session = CpuExecSession { ctx: &ctx, buffers: &mut buffers };
                f(&mut session)
            });
            // Return the buffer pool to the backend for reuse.
            self.buffers = buffers;
            result
        }
    }

CpuExecSession implements BackendSession by calling kernel functions directly after the session has entered CpuContext. Individual ops should not re-enter the pool.
CubeCL/CUDA
CudaBackend is the current CUDA GPU backend. It uses CubeCL/CubeCL-CUDA and runtime-loaded CUDA libraries from crates/tenferro-gpu/src/cubecl/.
Today CudaBackend does not define a separate exec-session struct. It uses the default TensorBackend::with_backend_session adapter, so each BackendSession call forwards to the backend method. The backend method launches CubeCL kernels or calls the relevant cuTENSOR/cuSOLVER/cuBLAS wrapper against the backend’s CudaRuntime.
| CPU concept | CubeCL/CUDA concept |
|---|---|
| CpuContext (thread count and Rayon pool) | CudaRuntime (CUDA device/client) |
| Par::rayon(0) / Par::Seq | CubeCL launch through the stored runtime |
| BufferPool (host Vec<T>) | CubeCL device buffers plus upload/download helpers |
| faer/rayon CPU work | kernel launch on stream |
| per-step session setup overhead | per-kernel launch/runtime dispatch overhead |
Future CubeCL work may introduce a dedicated GPU exec session if there is a measurable benefit from binding temporary workspace, stream state, or device buffer pooling across an entire compiled program. Any such session should extend the existing CudaBackend; it should not introduce a second CUDA backend type alongside it.
Default (no-op)
Backends that don’t need session batching use the default implementation, which wraps the backend itself as a BackendSession via BackendSessionAdapter.
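The forwarding pattern behind such a default can be sketched with heavily simplified traits. This is an illustration under assumptions, not the tenferro definitions: the traits carry a single hypothetical op `scale` in place of the real op set, the `Send` bounds of the real signature are omitted, and `ToyBackend` exists only to exercise the default.

```rust
/// Simplified stand-in for the session surface; `scale` is a hypothetical op.
trait BackendSession {
    fn scale(&mut self, x: &[f64], k: f64) -> Vec<f64>;
}

/// Adapter that exposes a backend as a session by forwarding each call.
struct BackendSessionAdapter<'a, B: TensorBackend> {
    backend: &'a mut B,
}

impl<B: TensorBackend> BackendSession for BackendSessionAdapter<'_, B> {
    fn scale(&mut self, x: &[f64], k: f64) -> Vec<f64> {
        // No session state of its own: every call goes to the backend method.
        self.backend.scale(x, k)
    }
}

/// Simplified stand-in for the backend trait.
trait TensorBackend {
    fn scale(&mut self, x: &[f64], k: f64) -> Vec<f64>;

    /// Default: no batching, the backend itself serves as the session.
    fn with_backend_session<R>(
        &mut self,
        f: impl FnOnce(&mut dyn BackendSession) -> R,
    ) -> R
    where
        Self: Sized,
    {
        let mut adapter = BackendSessionAdapter { backend: self };
        f(&mut adapter)
    }
}

/// Toy backend that relies on the default session implementation.
struct ToyBackend {
    calls: usize,
}

impl TensorBackend for ToyBackend {
    fn scale(&mut self, x: &[f64], k: f64) -> Vec<f64> {
        self.calls += 1;
        x.iter().map(|v| v * k).collect()
    }
}
```

With this shape, a backend opts into batching only by overriding `with_backend_session`; code written against `BackendSession` does not need to know which path it is on.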
Trait Relationship
    TensorBackend              - factory: creates sessions, owns long-lived state
        with_backend_session() - creates execution scope
        dot_general()          - standalone op (with per-op context entry)
        ...
    BackendSession             - session surface: ops without context re-entry
        dot_general()          - op within session (no install/set_device)
        reclaim_buffer()       - return buffer to pool within session
        ...
TensorBackend methods remain for use outside eval_exec_ir (e.g., standalone tensor operations, linalg solve multi-step logic). BackendSession is used only by the eval loop.