Dot-General Fixed Overhead

This note records the current interpretation of the crates/tenferro-tensor/benches/dot_general_overhead.rs benchmark. It is a developer-facing performance note for small, repeated contractions such as MPS inner products.

Benchmark Shape

The benchmark evaluates the inner product of a length-32 complex MPS using two dot_general_with_conj calls per site. It compares three execution styles:

Variant	What it measures
`fresh_backend_cache`	Calls `backend.dot_general_with_conj` for every local contraction.
`persistent_backend_cache`	Reuses the backend runtime cache but still calls the backend method for every contraction.
`single_exec_session`	Opens one `backend.with_backend_session_cached` around the whole MPS contraction and calls `BackendSession` methods inside it.

Representative one-thread timings from the May 2026 investigation:

`L = 32`, `d = 2`, `Complex64`	`chi = 4`	`chi = 8`	`chi = 16`	`chi = 32`	`chi = 64`
`fresh_backend_cache`	~445 us	~446 us	~445 us	~1.53 ms	~8.69 ms
`persistent_backend_cache`	~421 us	~434 us	~444 us	~1.52 ms	~8.67 ms
`single_exec_session`	~45.8 us	~61.2 us	~190 us	~1.12 ms	~8.12 ms

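Exact numbers vary by machine and allocator state. The important signal is the scaling difference between per-op backend calls and a single execution session: from chi = 4 to chi = 16 the per-op variants stay flat at roughly 0.42 to 0.45 ms, while the single-session time grows with chi from about 46 us to about 190 us.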

Interpretation

For small contractions, the dominant overhead is not GEMM analysis cache misses, hash-map-heavy label handling, or contraction path search. The main fixed cost is opening the full backend wrapper path for each local contraction:

backend.dot_general_with_conj(...)
  -> backend.with_backend_session(...)
       -> dtype dispatch
            -> Tensor wrapper / enum dispatch
                 -> CpuExecSession::dot_general_with_conj(...)
                      -> GEMM planning / execution

When this sequence is repeated for every tiny local contraction, the combined cost of backend.with_backend_session, dtype dispatch, and the wrapper path dominates. Current CPU execution enters the tenferro-owned CpuContext Rayon pool for both backend calls and sessions, so use one-thread measurements when isolating fixed wrapper/session overhead from parallel scheduling. Reusing a persistent runtime cache helps only slightly in this benchmark (about 5% at chi = 4 and well under 1% at chi = 16), which indicates that GEMM analysis caching is not the principal bottleneck for low bond dimensions.

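For larger bond dimensions the actual GEMM work becomes dominant, so the relative gap between fresh_backend_cache and single_exec_session shrinks: from roughly 9.7x at chi = 4 to about 1.07x at chi = 64. The absolute difference does not shrink (about 0.4 ms at chi = 4 versus about 0.57 ms at chi = 64); it is simply a smaller fraction of the total.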
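The table can be reduced to two derived quantities: the speedup of single_exec_session over fresh_backend_cache, and the time difference spread over the individual calls. The sketch below is only arithmetic on the representative timings. It assumes 64 calls per inner product (32 sites times two calls per site) and that the whole difference between the two variants is per-call fixed cost. The function names are illustrative and not part of tenferro.

```rust
// Timings in microseconds, copied from the table above (L = 32, d = 2, Complex64).
const CHI: [u32; 5] = [4, 8, 16, 32, 64];
const FRESH_US: [f64; 5] = [445.0, 446.0, 445.0, 1530.0, 8690.0];
const SESSION_US: [f64; 5] = [45.8, 61.2, 190.0, 1120.0, 8120.0];

// Assumption: 32 sites x 2 `dot_general_with_conj` calls per site.
const CALLS: f64 = 64.0;

/// Ratio of per-op time to single-session time.
fn speedup(fresh_us: f64, session_us: f64) -> f64 {
    fresh_us / session_us
}

/// Time difference spread evenly over all calls, in microseconds per call.
/// Assumption: the whole difference is fixed per-call overhead.
fn per_call_gap_us(fresh_us: f64, session_us: f64) -> f64 {
    (fresh_us - session_us) / CALLS
}

fn main() {
    for i in 0..CHI.len() {
        println!(
            "chi = {:>2}: speedup {:.2}x, gap {:.2} us/call",
            CHI[i],
            speedup(FRESH_US[i], SESSION_US[i]),
            per_call_gap_us(FRESH_US[i], SESSION_US[i]),
        );
    }
}
```

With these inputs the speedup falls from about 9.7x at chi = 4 to about 1.07x at chi = 64, while the per-call difference stays in the range of roughly 4 to 9 us. Under the stated assumptions, that is the behavior expected of a fixed cost per backend call, not of a cost that scales with the contraction size.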

Design Consequences

Prefer executing many small contractions inside one BackendSession when the caller naturally owns a loop or compiled program.
Do not add ad hoc MPS-specific fast paths to bypass the public backend API. The general abstraction to improve is an execution-session surface that can be reused by tensor-network code and compiled graph execution alike.
Further SmallVec/hash-map tuning in contraction metadata is unlikely to move this benchmark unless profiling shows that the workload has shifted away from backend/session fixed costs.
A future public or crate-private batched eager execution API should make session lifetime explicit, so callers can avoid opening one backend session per small contraction without reaching into backend internals.
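The difference between the per-op path and an explicit session lifetime can be illustrated with a self-contained mock. MockBackend, MockSession, with_session, and the scalar contract method below are hypothetical stand-ins, not the tenferro API; the mock only counts how many sessions are opened, and does not model dtype dispatch or GEMM planning.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a backend; counts how often a session is opened.
struct MockBackend {
    sessions_opened: Cell<usize>,
}

/// Hypothetical stand-in for an execution session borrowed from the backend.
struct MockSession<'a> {
    _backend: &'a MockBackend,
}

impl MockBackend {
    fn new() -> Self {
        Self { sessions_opened: Cell::new(0) }
    }

    /// Opens one session, runs `f` inside it, and drops it when `f` returns.
    fn with_session<R>(&self, f: impl FnOnce(&mut MockSession<'_>) -> R) -> R {
        self.sessions_opened.set(self.sessions_opened.get() + 1);
        let mut session = MockSession { _backend: self };
        f(&mut session)
    }

    /// Per-op convenience path: pays the session setup on every call.
    fn contract(&self, a: f64, b: f64) -> f64 {
        self.with_session(|s| s.contract(a, b))
    }
}

impl MockSession<'_> {
    /// Placeholder for a local contraction; here just a scalar product.
    fn contract(&mut self, a: f64, b: f64) -> f64 {
        a * b
    }
}

/// Chain of `2 * sites` contractions through the per-op path.
fn chain_per_op(backend: &MockBackend, sites: usize) -> f64 {
    let mut acc = 1.0;
    for _ in 0..sites {
        acc = backend.contract(acc, 1.0);
        acc = backend.contract(acc, 1.0);
    }
    acc
}

/// Same chain, with one session wrapped around the whole loop.
fn chain_single_session(backend: &MockBackend, sites: usize) -> f64 {
    backend.with_session(|s| {
        let mut acc = 1.0;
        for _ in 0..sites {
            acc = s.contract(acc, 1.0);
            acc = s.contract(acc, 1.0);
        }
        acc
    })
}

fn main() {
    let per_op = MockBackend::new();
    chain_per_op(&per_op, 32);
    let single = MockBackend::new();
    chain_single_session(&single, 32);
    println!("per-op sessions opened: {}", per_op.sessions_opened.get());
    println!("single-session sessions opened: {}", single.sessions_opened.get());
}
```

For 32 sites the per-op chain opens 64 sessions and the single-session chain opens one. The closure-scoped form keeps the session lifetime visible at the call site, which is the property the design notes above ask of a future batched eager API.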

Reproduction

Run with one CPU worker and one BLAS worker to avoid hiding fixed costs behind parallel scheduling noise:

RAYON_NUM_THREADS=1 \
VECLIB_MAXIMUM_THREADS=1 \
OPENBLAS_NUM_THREADS=1 \
OMP_NUM_THREADS=1 \
MKL_NUM_THREADS=1 \
cargo bench -p tenferro-tensor --bench dot_general_overhead