Module perm

Cache-efficient tensor permutation and transpose routines.
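The sketch below illustrates why blocking helps. It transposes a row-major matrix one tile at a time, so each tile's reads and writes touch only a few cache lines. This is a generic cache-blocking example, not this module's kernel. The tile edge `TILE` and the function `transpose_blocked` are hypothetical.

```rust
/// Hypothetical tile edge for this sketch; not a constant from this crate.
const TILE: usize = 4;

/// Transpose row-major `src` (rows x cols) into row-major `dst` (cols x rows),
/// one TILE x TILE tile at a time so a tile's reads and writes tend to stay in cache.
fn transpose_blocked(src: &[f64], dst: &mut [f64], rows: usize, cols: usize) {
    assert_eq!(src.len(), rows * cols);
    assert_eq!(dst.len(), rows * cols);
    for ib in (0..rows).step_by(TILE) {
        for jb in (0..cols).step_by(TILE) {
            // Clamp the tile at the matrix edges.
            let (i_end, j_end) = ((ib + TILE).min(rows), (jb + TILE).min(cols));
            for i in ib..i_end {
                for j in jb..j_end {
                    // Element (i, j) of src becomes element (j, i) of dst.
                    dst[j * rows + i] = src[i * cols + j];
                }
            }
        }
    }
}

fn main() {
    let (rows, cols) = (3, 5);
    let src: Vec<f64> = (0..rows * cols).map(|x| x as f64).collect();
    let mut dst = vec![0.0; rows * cols];
    transpose_blocked(&src, &mut dst, rows, cols);
    // The first row of dst is the first column of src.
    println!("{:?}", &dst[..rows]);
}
```

The tile loops only change the visiting order. The element mapping `dst[j * rows + i] = src[i * cols + j]` is the same as in a naive transpose.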

Modules

block
Block size computation ported from Strided.jl.
copy
Copy/permutation operations on strided views.
fuse
Dimension fusion logic ported from Strided.jl/src/mapreduce.jl.
hptt
Cache-efficient tensor permutation that closely follows HPTT.
kernel
Block-based iteration engine for strided permutation operations.
order
Loop ordering algorithm ported from Strided.jl.
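The `block` module's description says only that block sizes are computed as in Strided.jl. A minimal sketch of the general idea, assuming a fixed byte budget per block, is to keep halving the largest block extent until one block fits. The halving rule is a simplification chosen for illustration. `block_sizes` and its parameters are hypothetical, not this crate's API.

```rust
/// Shrink block extents until one block's footprint fits in `budget_bytes`.
/// Simplified stand-in: each step halves (rounding up) the largest extent.
fn block_sizes(dims: &[usize], elem_bytes: usize, budget_bytes: usize) -> Vec<usize> {
    let mut blocks = dims.to_vec();
    while blocks.iter().product::<usize>() * elem_bytes > budget_bytes {
        // Pick the largest extent (the last one on ties).
        let Some((k, &largest)) = blocks.iter().enumerate().max_by_key(|&(_, &b)| b) else {
            break; // no dimensions at all
        };
        if largest <= 1 {
            break; // nothing left to split; the budget cannot be met
        }
        blocks[k] = (largest + 1) / 2;
    }
    blocks
}

fn main() {
    // 64 x 64 f64 elements with a 4096-byte budget per block.
    println!("{:?}", block_sizes(&[64, 64], 8, 4096));
}
```

Halving keeps the blocks close to square, which is a common choice when both the source and destination have a unit-stride dimension that the block must cover.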

Structs

KernelPlan

Constants

BLOCK_MEMORY_SIZE
CACHE_LINE_SIZE
SMALL_TENSOR_THRESHOLD
Maximum total elements for the small tensor fast path.
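The small tensor fast path can be pictured as a dispatch on total element count. This page does not give the value of `SMALL_TENSOR_THRESHOLD`. In the sketch below, the threshold 1024, the `Path` enum, and the helpers `element_count` and `choose_path` are all hypothetical.

```rust
/// Hypothetical threshold for illustration only.
const SMALL_THRESHOLD: usize = 1024;

/// Total number of elements of a shape; an empty shape (a scalar) has one element.
fn element_count(dims: &[usize]) -> usize {
    dims.iter().product()
}

#[derive(Debug, PartialEq)]
enum Path {
    /// Everything fits comfortably in L1, so skip blocking and reordering overhead.
    Small,
    /// Larger tensors go through fusion, loop ordering, and blocking.
    Blocked,
}

fn choose_path(dims: &[usize]) -> Path {
    if element_count(dims) <= SMALL_THRESHOLD {
        Path::Small
    } else {
        Path::Blocked
    }
}

fn main() {
    println!("{:?} {:?}", choose_path(&[8, 8, 8]), choose_path(&[64, 64]));
}
```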

Functions

build_plan_fused
Build an execution plan with dimension fusion.
build_plan_fused_small
Simplified plan for small tensors that fit in L1 cache.
compress_dims
Remove size-1 dimensions from fused dims and all corresponding strides.
compute_costs
Compute the minimum stride cost for each dimension.
compute_importance
Compute the “importance” of each dimension for loop ordering.
compute_order
Compute the optimal iteration order for dimensions.
copy_into
Copy elements from source to destination: dest[i] = src[i].
copy_into_col_major
Copy elements to a column-major destination.
for_each_inner_block_preordered
Iterate over blocks with pre-ordered dimensions and initial offsets.
fuse_dims
Fuse contiguous dimensions across multiple arrays.
sort_by_importance
Get the permutation that sorts dimensions by importance (descending).
total_len
Utility: total number of elements.
try_fuse_group
Try to fuse a contiguous dimension group into a single (total_size, innermost_stride) pair.
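Dimension fusion, as described for `fuse_dims`, `try_fuse_group`, and `compress_dims`, can be sketched as follows. The sketch assumes column-major order (dimension 0 innermost) and strides counted in elements. Two adjacent dimensions are merged only when, in every array, the outer stride equals the inner group's length times its stride. `Fused` and `fuse` are hypothetical names. Unlike the order the descriptions imply, this sketch drops size-1 dimensions while fusing rather than after.

```rust
/// One shared (fused) shape plus one stride vector per array.
#[derive(Debug, PartialEq)]
struct Fused {
    dims: Vec<usize>,
    strides: Vec<Vec<isize>>,
}

/// Merge adjacent dimensions that are contiguous in every array.
/// Assumes dimension 0 is innermost and strides are counted in elements.
fn fuse(dims: &[usize], strides: &[Vec<isize>]) -> Fused {
    let mut out = Fused {
        dims: Vec::new(),
        strides: vec![Vec::new(); strides.len()],
    };
    for (k, &len) in dims.iter().enumerate() {
        if len == 1 {
            // A size-1 dimension never moves the pointer, so its stride is irrelevant.
            continue;
        }
        // Dimension k can join the previous group if, in every array, its stride
        // equals (group length) * (group stride), i.e. it starts where the group ends.
        let can_fuse = match out.dims.last() {
            None => false,
            Some(&prev_len) => strides
                .iter()
                .zip(&out.strides)
                .all(|(s, o)| s[k] == prev_len as isize * *o.last().unwrap()),
        };
        if can_fuse {
            // Grow the group; its innermost stride stays the same.
            *out.dims.last_mut().unwrap() *= len;
        } else {
            out.dims.push(len);
            for (s, o) in strides.iter().zip(out.strides.iter_mut()) {
                o.push(s[k]);
            }
        }
    }
    out
}

fn main() {
    // Shape (2, 3, 4): column-major source, destination storing dim 2 innermost.
    let f = fuse(&[2, 3, 4], &[vec![1, 2, 6], vec![4, 8, 1]]);
    println!("{:?}", f);
}
```

In the example, dimensions 0 and 1 are contiguous in both arrays and fuse into one dimension of length 6. Dimension 2 has stride 1 in the destination, not 24, so it stays separate.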
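A simplified version of the loop-ordering step takes each dimension's cost to be its smallest absolute stride over all arrays. The cheapest dimension then becomes the innermost loop. The crate splits this work across `compute_costs`, `compute_importance`, `compute_order`, and `sort_by_importance`. This sketch collapses it into one ascending sort by cost and does not reproduce their importance formula. `min_stride_costs` and `loop_order` are hypothetical names.

```rust
/// Cost of each dimension: its smallest absolute stride over all arrays.
fn min_stride_costs(strides: &[Vec<isize>]) -> Vec<usize> {
    let ndims = strides.first().map_or(0, |s| s.len());
    (0..ndims)
        .map(|k| strides.iter().map(|s| s[k].unsigned_abs()).min().unwrap())
        .collect()
}

/// Dimension indices ordered cheapest first, i.e. innermost loop first.
fn loop_order(strides: &[Vec<isize>]) -> Vec<usize> {
    let costs = min_stride_costs(strides);
    let mut order: Vec<usize> = (0..costs.len()).collect();
    // sort_by_key is stable, so ties keep their original relative order.
    order.sort_by_key(|&k| costs[k]);
    order
}

fn main() {
    // Destination strides first, then source strides, for a 3-D permutation.
    println!("{:?}", loop_order(&[vec![4, 8, 1], vec![1, 2, 6]]));
}
```

A pure minimum-stride rule treats the source and destination alike. An importance score can instead weight one array (for example, the destination) more heavily when the two disagree.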
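At its core, a strided copy such as `copy_into` (`dest[i] = src[i]`) visits every multi-index and maps it to a flat offset in each buffer. The sketch below uses an odometer-style counter with dimension 0 innermost. It updates both offsets incrementally instead of recomputing them from the index. `strided_copy` and its parameter list are hypothetical, and the sketch omits the blocking, fusion, and loop ordering described by the other items on this page.

```rust
/// Copy a `dims`-shaped region: the element at multi-index `idx` lives at
/// `off + sum(idx[k] * strides[k])` in each flat buffer. Dimension 0 is innermost.
fn strided_copy<T: Copy>(
    dims: &[usize],
    dst: &mut [T], dst_off: isize, dst_strides: &[isize],
    src: &[T], src_off: isize, src_strides: &[isize],
) {
    let total: usize = dims.iter().product();
    if total == 0 {
        return;
    }
    let mut idx = vec![0usize; dims.len()];
    let (mut d, mut s) = (dst_off, src_off);
    for _ in 0..total {
        dst[d as usize] = src[s as usize];
        // Odometer increment: bump the innermost index and carry outward.
        for k in 0..dims.len() {
            idx[k] += 1;
            d += dst_strides[k];
            s += src_strides[k];
            if idx[k] < dims[k] {
                break;
            }
            // Dimension k wrapped: undo its full extent and carry into k + 1.
            d -= dst_strides[k] * dims[k] as isize;
            s -= src_strides[k] * dims[k] as isize;
            idx[k] = 0;
        }
    }
}

fn main() {
    // Column-major 2 x 3 source: element (i, j) sits at i + 2 * j.
    let src: Vec<f64> = (0..6).map(|x| x as f64).collect();
    // Row-major destination for the same logical shape: (i, j) sits at 3 * i + j.
    let mut dst = vec![0.0; 6];
    strided_copy(&[2, 3], &mut dst, 0, &[3, 1], &src, 0, &[1, 2]);
    println!("{:?}", dst);
}
```

With a column-major source and a row-major destination, the call converts the storage order of a 2 x 3 matrix, which is a 2-D permutation.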