Module hptt

HPTT-faithful cache-efficient tensor permutation.

Based on the algorithm described in HPTT (High-Performance Tensor Transpose) by Paul Springer, Tong Su, and Paolo Bientinesi.
Original C++ implementation: https://github.com/springer13/hptt, licensed under BSD-3-Clause. See THIRD-PARTY-LICENSES for details.

Implements the key techniques from HPTT:

Bilateral dimension fusion (fuse dims contiguous in both src and dst)
2D micro-kernel transpose (4×4 scalar for f64, 8×8 for f32)
Macro-kernel: BLOCK × BLOCK tile via grid of micro-kernel calls
Recursive ComputeNode loop nest (only stride-1 dims get blocked)
ConstStride1 fast path when src and dst stride-1 dims coincide
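
As an illustration of the first technique in the list above, bilateral dimension fusion can be sketched as follows. This is a minimal standalone sketch, not this module's implementation: the function name `fuse_dims` and the `(size, src_stride, dst_stride)` tuple layout are assumptions made for the example. Two dimensions are merged only when the outer one's stride equals the inner one's stride times the inner size in both the source and the destination layout.

```rust
/// Hypothetical helper (not part of this module's API).
/// Each dimension is `(size, src_stride, dst_stride)`, strides in elements.
/// Returns the fused dimensions, ordered by ascending source stride.
fn fuse_dims(dims: &[(usize, usize, usize)]) -> Vec<(usize, usize, usize)> {
    // Size-1 dimensions never affect addressing, so drop them first.
    let mut sorted: Vec<(usize, usize, usize)> =
        dims.iter().copied().filter(|d| d.0 != 1).collect();
    // Walk dimensions from innermost to outermost in the source layout.
    sorted.sort_by_key(|d| d.1);

    let mut fused: Vec<(usize, usize, usize)> = Vec::new();
    for (size, src_stride, dst_stride) in sorted {
        if let Some(last) = fused.last_mut() {
            // Contiguous in BOTH layouts: merge into the previous dimension.
            if src_stride == last.1 * last.0 && dst_stride == last.2 * last.0 {
                last.0 *= size;
                continue;
            }
        }
        fused.push((size, src_stride, dst_stride));
    }
    fused
}
```

For a row-major source of shape [2, 3, 4] permuted to shape [4, 2, 3], the per-dimension description is `[(2, 12, 3), (3, 4, 1), (4, 1, 6)]`. The first two dimensions stay adjacent in both layouts, so the sketch reduces the problem to a plain 2D transpose: `[(4, 1, 6), (6, 4, 1)]`.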

Structs§

PermutePlan: Complete permutation plan.

Functions§

build_permute_plan: Build a permutation plan using bilateral fusion and HPTT-style blocking.
execute_permute_blocked (unsafe): Execute the permutation plan (single-threaded).
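
The HPTT-style blocking mentioned above (a macro-kernel tile processed as a grid of micro-kernel calls) can be illustrated on a plain 2D transpose. The sketch below is standalone and does not use or reproduce this module's functions: `transpose_blocked` and `micro_kernel` are hypothetical names, and `BLOCK = 16` is an assumed value, since the module's actual block size is not stated here. The 4×4 scalar micro-kernel for f64 follows the module description.

```rust
const MICRO: usize = 4; // micro-kernel edge for f64, per the module description
const BLOCK: usize = 16; // ASSUMED macro-tile edge, chosen only for this sketch

/// Transpose one MICRO x MICRO tile; `lds`/`ldd` are the row strides of src/dst.
fn micro_kernel(src: &[f64], lds: usize, dst: &mut [f64], ldd: usize) {
    for i in 0..MICRO {
        for j in 0..MICRO {
            dst[j * ldd + i] = src[i * lds + j];
        }
    }
}

/// Transpose a row-major `rows x cols` matrix into a row-major `cols x rows` one.
fn transpose_blocked(src: &[f64], rows: usize, cols: usize, dst: &mut [f64]) {
    assert_eq!(src.len(), rows * cols);
    assert_eq!(dst.len(), rows * cols);
    for i0 in (0..rows).step_by(BLOCK) {
        for j0 in (0..cols).step_by(BLOCK) {
            let i_end = (i0 + BLOCK).min(rows);
            let j_end = (j0 + BLOCK).min(cols);
            // Macro-kernel: cover the tile with full micro-kernel calls.
            let mut i = i0;
            while i + MICRO <= i_end {
                let mut j = j0;
                while j + MICRO <= j_end {
                    micro_kernel(&src[i * cols + j..], cols, &mut dst[j * rows + i..], rows);
                    j += MICRO;
                }
                // Leftover columns that do not fill a micro-tile.
                for ii in i..i + MICRO {
                    for jj in j..j_end {
                        dst[jj * rows + ii] = src[ii * cols + jj];
                    }
                }
                i += MICRO;
            }
            // Leftover rows that do not fill a micro-tile.
            for ii in i..i_end {
                for jj in j0..j_end {
                    dst[jj * rows + ii] = src[ii * cols + jj];
                }
            }
        }
    }
}
```

Working tile by tile keeps both the reads and the writes within a small region of memory, which is the cache-efficiency argument behind blocking; edge tiles fall back to a scalar loop.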
