HPTT-faithful cache-efficient tensor permutation.
Based on the algorithm described in HPTT (High-Performance Tensor Transpose) by Paul Springer, Tong Su, and Paolo Bientinesi.
Original C++ implementation: https://github.com/springer13/hptt
Licensed under BSD-3-Clause. See THIRD-PARTY-LICENSES for details.
Implements the key techniques from HPTT:
- Bilateral dimension fusion (fuse dims contiguous in both src and dst)
- 2D micro-kernel transpose (4×4 scalar for f64, 8×8 for f32)
- Macro-kernel: BLOCK × BLOCK tile via grid of micro-kernel calls
- Recursive ComputeNode loop nest (only stride-1 dims get blocked)
- ConstStride1 fast path when src and dst stride-1 dims coincide
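The first three techniques in the list above can be illustrated with a small, self-contained sketch. It is not this module's implementation: `Dim`, `fuse_bilateral`, `micro_kernel`, `macro_kernel`, `transpose`, and the `BLOCK = 16` value are all hypothetical names and choices made for illustration. It assumes row-major layout, strides counted in elements, and dimensions listed outermost-first in source order. Only the 4×4 scalar f64 micro-kernel case is shown.

```rust
/// One logical dimension as seen from both tensors (strides in elements).
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Dim {
    size: usize,
    src_stride: usize,
    dst_stride: usize,
}

/// Bilateral fusion: merge neighbouring dims when the outer stride equals
/// inner size * inner stride in BOTH the source and the destination.
fn fuse_bilateral(dims: &[Dim]) -> Vec<Dim> {
    let mut fused: Vec<Dim> = Vec::new();
    for &d in dims {
        let mergeable = fused.last().map_or(false, |o| {
            o.src_stride == d.size * d.src_stride && o.dst_stride == d.size * d.dst_stride
        });
        if mergeable {
            // The fused dim keeps the inner strides and the product of sizes.
            let outer = fused.last_mut().unwrap();
            outer.size *= d.size;
            outer.src_stride = d.src_stride;
            outer.dst_stride = d.dst_stride;
        } else {
            fused.push(d);
        }
    }
    fused
}

const MICRO: usize = 4; // micro-kernel edge (the 4x4 f64 case)
const BLOCK: usize = 16; // macro-kernel edge, a multiple of MICRO (assumed value)

/// Micro-kernel: dst[j][i] = src[i][j] for one MICRO x MICRO tile.
/// `src` and `dst` start at the tile's top-left element.
fn micro_kernel(src: &[f64], lda: usize, dst: &mut [f64], ldb: usize) {
    for i in 0..MICRO {
        for j in 0..MICRO {
            dst[j * ldb + i] = src[i * lda + j];
        }
    }
}

/// Macro-kernel: one BLOCK x BLOCK tile as a grid of micro-kernel calls.
fn macro_kernel(src: &[f64], lda: usize, dst: &mut [f64], ldb: usize) {
    for i in (0..BLOCK).step_by(MICRO) {
        for j in (0..BLOCK).step_by(MICRO) {
            micro_kernel(&src[i * lda + j..], lda, &mut dst[j * ldb + i..], ldb);
        }
    }
}

/// Transpose a row-major rows x cols matrix into a row-major cols x rows matrix.
fn transpose(src: &[f64], rows: usize, cols: usize) -> Vec<f64> {
    let mut dst = vec![0.0; rows * cols];
    let (full_r, full_c) = (rows - rows % BLOCK, cols - cols % BLOCK);
    // Region covered by whole BLOCK x BLOCK tiles.
    for i in (0..full_r).step_by(BLOCK) {
        for j in (0..full_c).step_by(BLOCK) {
            macro_kernel(&src[i * cols + j..], cols, &mut dst[j * rows + i..], rows);
        }
    }
    // Remainder: elements outside the full tiles, copied one by one.
    for i in 0..rows {
        for j in 0..cols {
            if i >= full_r || j >= full_c {
                dst[j * rows + i] = src[i * cols + j];
            }
        }
    }
    dst
}
```

As a usage example, a tensor of shape [2, 3, 4, 5] permuted to [4, 5, 2, 3] has source strides [60, 20, 5, 1] and, in source dimension order, destination strides [3, 1, 30, 6]. `fuse_bilateral` collapses these four dimensions into two, of sizes 6 and 20, so the whole permutation reduces to a 6×20 matrix transpose that `transpose` can handle.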
Structs
- PermutePlan - Complete permutation plan.
Functions
- build_permute_plan - Build a permutation plan using bilateral fusion and HPTT-style blocking.
- execute_permute_blocked ⚠ - Execute the permutation plan (single-threaded).