Cache-efficient tensor permutation and transpose routines.
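As a rough illustration of what "cache-efficient" means for a transpose, the sketch below tiles a 2-D row-major matrix into square blocks so that reads and writes both stay within a small working set. It is a generic blocked transpose, not this crate's implementation; the function name `transpose_blocked` and its `block` parameter are hypothetical and do not come from the crate's API.

```rust
/// Transpose a row-major `rows x cols` matrix `src` into `dst`
/// (row-major `cols x rows`), visiting it in `block x block` tiles.
/// Hypothetical helper for illustration only; not part of the crate.
fn transpose_blocked(src: &[f64], dst: &mut [f64], rows: usize, cols: usize, block: usize) {
    assert_eq!(src.len(), rows * cols);
    assert_eq!(dst.len(), rows * cols);
    assert!(block > 0, "block size must be positive");

    // Outer loops walk over tile origins.
    for i0 in (0..rows).step_by(block) {
        for j0 in (0..cols).step_by(block) {
            // Clamp the tile at the matrix edge.
            let i_end = (i0 + block).min(rows);
            let j_end = (j0 + block).min(cols);
            // Inner loops copy one tile, element by element.
            for i in i0..i_end {
                for j in j0..j_end {
                    dst[j * rows + i] = src[i * cols + j];
                }
            }
        }
    }
}
```

A real implementation would pick the tile size from cache parameters and generalize to arbitrary strides and ranks; the fixed `block` argument here only shows the loop structure.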
Modules§
- block - Block size computation ported from Strided.jl
- copy - Copy/permutation operations on strided views.
- fuse - Dimension fusion logic ported from Strided.jl/src/mapreduce.jl
- hptt - HPTT-faithful cache-efficient tensor permutation.
- kernel - Block-based iteration engine for strided permutation operations.
- order - Loop ordering algorithm ported from Strided.jl
Structs§
Constants§
- BLOCK_MEMORY_SIZE
- CACHE_LINE_SIZE
- SMALL_TENSOR_THRESHOLD - Maximum total elements for the small tensor fast path.
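A threshold on total element count suggests a simple dispatch: multiply the dimension sizes and compare against the limit. The sketch below shows that check under stated assumptions. The name `use_small_path` is hypothetical, the threshold is passed as a parameter because the constant's actual value is not given here, and the inclusive comparison is a guess based on the word "maximum".

```rust
/// Decide whether a tensor with the given dimension sizes would qualify
/// for a small-tensor fast path. Hypothetical helper, not the crate's API.
/// `threshold` stands in for a constant such as SMALL_TENSOR_THRESHOLD.
fn use_small_path(dims: &[usize], threshold: usize) -> bool {
    // Total number of elements is the product of all dimension sizes.
    // An empty `dims` slice (a scalar) yields a product of 1.
    let total: usize = dims.iter().product();
    // Assumption: the threshold is inclusive.
    total <= threshold
}
```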
Functions§
- build_plan_fused - Build an execution plan with dimension fusion.
- build_plan_fused_small - Simplified plan for small tensors that fit in L1 cache.
- compress_dims - Remove size-1 dimensions from fused dims and all corresponding strides.
- compute_costs - Compute the minimum stride cost for each dimension.
- compute_importance - Compute the “importance” of each dimension for loop ordering.
- compute_order - Compute the optimal iteration order for dimensions.
- copy_into - Copy elements from source to destination: dest[i] = src[i].
- copy_into_col_major - Copy elements to a col-major destination.
- for_each_inner_block_preordered - Iterate over blocks with pre-ordered dimensions and initial offsets.
- fuse_dims - Fuse contiguous dimensions across multiple arrays.
- sort_by_importance - Get the permutation that sorts by importance (descending).
- total_len - Utility: total number of elements.
- try_fuse_group - Try to fuse a contiguous dimension group into a single (total_size, innermost_stride).
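The ideas behind compress_dims, fuse_dims, and sort_by_importance can be sketched for a single array as follows. This is a standalone illustration, not the crate's code: the function names and signatures below are hypothetical, strides are assumed to be in elements, and fusion assumes column-major layout, where dimension i+1 is contiguous with dimension i when its stride equals `stride[i] * dims[i]`. The crate itself fuses across multiple arrays, which this sketch does not attempt.

```rust
/// Drop size-1 dimensions together with their strides.
/// Hypothetical helper; not the crate's `compress_dims`.
fn drop_unit_dims(dims: &[usize], strides: &[isize]) -> (Vec<usize>, Vec<isize>) {
    dims.iter()
        .zip(strides)
        .filter(|(d, _)| **d != 1)
        .map(|(d, s)| (*d, *s))
        .unzip()
}

/// Merge adjacent dimensions that are contiguous in memory
/// (column-major assumption: stride[i + 1] == stride[i] * dims[i]).
/// Hypothetical helper; not the crate's `fuse_dims`.
fn fuse_contiguous(dims: &[usize], strides: &[isize]) -> (Vec<usize>, Vec<isize>) {
    let mut out_dims: Vec<usize> = Vec::new();
    let mut out_strides: Vec<isize> = Vec::new();
    for (&d, &s) in dims.iter().zip(strides) {
        let n = out_dims.len();
        if n > 0 && s == out_strides[n - 1] * out_dims[n - 1] as isize {
            // Contiguous with the previous group: grow its size,
            // keep its (innermost) stride.
            out_dims[n - 1] *= d;
        } else {
            // Start a new group.
            out_dims.push(d);
            out_strides.push(s);
        }
    }
    (out_dims, out_strides)
}

/// Return the permutation of indices that sorts `importance` in
/// descending order. The sort is stable, so ties keep their original order.
/// Hypothetical helper; not the crate's `sort_by_importance`.
fn argsort_desc(importance: &[u64]) -> Vec<usize> {
    let mut perm: Vec<usize> = (0..importance.len()).collect();
    perm.sort_by(|&a, &b| importance[b].cmp(&importance[a]));
    perm
}
```

For dims `[4, 1, 5, 6]` with strides `[1, 4, 4, 40]`, dropping the unit dimension and then fusing gives dims `[20, 6]` with strides `[1, 40]`: the first two remaining dimensions are contiguous and merge, while the last has a gap and stays separate.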