# Dynamic And Symbolic Shape Metadata

**Status:** current design and implementation note for issue #829

**Related:** `../spec/optimizer-passes.md`, `../spec/ad-contract.md`, `../spec/primitive-catalog.md`, `../spec/backend-contract.md`

## Purpose

This note defines the shape-metadata contract needed for dimensions that are not plain constants or input-axis sizes. The immediate triggers are:

- `DynamicTruncate(input, size_scalar, axis)`, whose output extent depends on a runtime scalar tensor value.
- `transpose_scatter`, which needs inverse gather `slice_sizes` derived from symbolic update-window dimensions.

Both expose the same root problem: current metadata can describe concrete sizes and symbolic arithmetic over tensor axis sizes, but it cannot say whether the result is exact, conservative, or derived from runtime tensor values.

This document is the current shape contract for #829 work.

## Current Model

The current system has two related expression forms:

- `DimExpr`: op-local expressions over the current op's input shapes.
- `SymDim`: value-side symbolic expressions used by traced graph construction and AD metadata.

This is enough for expressions such as:

```text
output_dim = input0.axis(0) * input1.axis(2)
```

It is not enough for:

```text
output_dim = clamp_runtime_scalar(input1, 0, input0.axis(axis))
```

Nor is it enough to distinguish these two claims:

```text
the output axis is exactly n
the output axis is at most n
```

The second distinction matters because compiler passes may use metadata to emit runtime `Reshape` or `BroadcastInDim` parameters. An upper bound is useful for some safety checks, but it is not a legal replacement for an exact dimension.

## Design Goals

- Preserve the fast static-shape path for existing concrete programs.
- Make rank metadata exact even when some extents are dynamic.
- Distinguish exact extents from upper-bound or unknown extents.
- Keep backend kernels concrete: backend configs receive resolved `usize` sizes, not unresolved symbolic expressions.
- Let graph and compiler layers carry symbolic config values until they can be resolved at execution time.
- Avoid new AD construction panics for resolvable symbolic shape metadata.
- Avoid implementing full dynamic shape polymorphism in the first pass.

## Non-Goals

- Do not replace every shape expression user in one PR.
- Do not require all backend kernels to accept dynamic shape parameters.
- Do not introduce a constraint solver.
- Do not add scatter-only or `DynamicTruncate`-only hacks that bypass shared metadata invariants.
- Do not change user-facing tensor operation semantics beyond replacing inaccurate metadata and panics with explicit behavior.

## Recommended Approach

Use a two-layer model:

```text
Value shape metadata
  rank: exact
  extents: Vec

ShapeExtent
  Exact(ExtentExpr)
  UpperBound(ExtentExpr)
  Unknown

ExtentExpr
  Const(usize)
  InputAxis { input_idx, axis }
  RuntimeScalar { input_idx, semantics }
  Add/Sub/Mul/FloorDiv/Min/Max(...)
```

The concrete implementation stores the extent vector directly in `TensorMeta` and `ExecInstruction::output_extents`; there is no separate public shape-metadata wrapper.

The important split is semantic:

- `ExtentExpr` says how a size would be computed.
- `ShapeExtent` says what guarantee the expression provides.

For current static programs, every extent remains `Exact(Const(...))` or `Exact(InputAxis { ... })`. The new states are only needed where current code already has inaccurate metadata or panic behavior.

### `Exact`

An exact extent may be used to construct runtime shape parameters. For example, a `Reshape` target may be built from exact expressions because execution can resolve them from the concrete input tensors.

Exact does not mean compile-time constant. It means the expression denotes the true runtime size.

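
To make the split between `ExtentExpr` and `ShapeExtent` concrete, here is a minimal sketch of how an exact extent might be resolved against concrete input shapes. The type layout, the reduced set of expression variants, and the `eval` and `resolve_exact` helpers are assumptions for illustration, not the actual `TensorMeta` representation.

```rust
/// How a size would be computed (illustrative subset of the forms listed above).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ExtentExpr {
    Const(usize),
    InputAxis { input_idx: usize, axis: usize },
    Mul(Box<ExtentExpr>, Box<ExtentExpr>),
    Min(Box<ExtentExpr>, Box<ExtentExpr>),
}

/// What guarantee the expression provides.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ShapeExtent {
    Exact(ExtentExpr),
    UpperBound(ExtentExpr),
    Unknown,
}

impl ExtentExpr {
    /// Evaluate against concrete input shapes.
    /// Returns `None` for an out-of-range input or axis, or on overflow.
    fn eval(&self, input_shapes: &[Vec<usize>]) -> Option<usize> {
        match self {
            ExtentExpr::Const(n) => Some(*n),
            ExtentExpr::InputAxis { input_idx, axis } => {
                input_shapes.get(*input_idx)?.get(*axis).copied()
            }
            ExtentExpr::Mul(a, b) => {
                a.eval(input_shapes)?.checked_mul(b.eval(input_shapes)?)
            }
            ExtentExpr::Min(a, b) => {
                Some(a.eval(input_shapes)?.min(b.eval(input_shapes)?))
            }
        }
    }
}

/// Hypothetical helper: only `Exact` extents may become runtime shape parameters.
fn resolve_exact(extent: &ShapeExtent, input_shapes: &[Vec<usize>]) -> Option<usize> {
    match extent {
        ShapeExtent::Exact(expr) => expr.eval(input_shapes),
        ShapeExtent::UpperBound(_) | ShapeExtent::Unknown => None,
    }
}
```

With input shapes `[2, 3]` and `[4, 5, 6]`, the expression `input0.axis(0) * input1.axis(2)` wrapped in `Exact` resolves to a concrete size, while the same expression wrapped in `UpperBound` is refused even though it can be evaluated.
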
### `UpperBound`

An upper-bound extent means:

```text
actual_runtime_extent <= expression_value
```

It may be used for conservative reasoning, diagnostics, allocation guards, or skip decisions. It must not be used as if it were the true output shape.

For the first implementation pass, `DynamicTruncate` can use:

```text
axis != truncated_axis: Exact(input.axis(axis))
axis == truncated_axis: UpperBound(input.axis(axis))
```

This immediately fixes the false exactness without requiring runtime scalar expressions to be threaded through every compiler path.

### `Unknown`

Unknown means no useful extent expression is available. It should be rare. Code that sees `Unknown` must either avoid shape-sensitive rewrites or return a structured unsupported-dynamic-shape error.

## Value-Side Metadata Boundary

Shape metadata belongs to values, not to operation payloads. Operation payloads should carry only structural identity and output requirements that are part of the op's semantics. Input-shape snapshots used for AD, validation, or replay belong in value metadata.

| Payload kind | Owner |
|---|---|
| Structural parameters such as axes, permutation order, or contraction dims | Op payload |
| Required output shapes supplied by the user or frontend | Op payload as exact shape expressions |
| Input shape snapshots, inferred output shape facts, and guardable metadata | Value metadata |

`ShapeGuardContext` is the normative AD-facing metadata surface. Builder and emitter helpers may provide convenience accessors, but they must read from the same metadata store and record the same guards. AD rules must not recover shape facts by inspecting unrelated op payloads or assuming concrete extents from earlier graph-building phases.

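
As a reference point for the `DynamicTruncate` discussion that follows, the first-pass rule from the `UpperBound` section can be sketched as value-side metadata. The `dynamic_truncate_extents` and `all_exact` functions and the reduced enums are hypothetical names for illustration; the real shape-inference entry point may look different.

```rust
/// Illustrative subset: only input-axis expressions are needed here.
#[derive(Debug, Clone, PartialEq)]
enum ExtentExpr {
    InputAxis { input_idx: usize, axis: usize },
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ShapeExtent {
    Exact(ExtentExpr),
    UpperBound(ExtentExpr),
    Unknown,
}

/// Hypothetical first-pass metadata for `DynamicTruncate`:
/// every axis mirrors the input axis, but the truncated axis is only a bound.
fn dynamic_truncate_extents(rank: usize, truncated_axis: usize) -> Vec<ShapeExtent> {
    (0..rank)
        .map(|axis| {
            let expr = ExtentExpr::InputAxis { input_idx: 0, axis };
            if axis == truncated_axis {
                ShapeExtent::UpperBound(expr)
            } else {
                ShapeExtent::Exact(expr)
            }
        })
        .collect()
}

/// Consumers that build runtime shape parameters need every extent to be exact.
fn all_exact(extents: &[ShapeExtent]) -> bool {
    extents.iter().all(|e| matches!(e, ShapeExtent::Exact(_)))
}
```

The rank stays exact (the vector always has `rank` entries), while `all_exact` returns `false` for the truncated output, which is the signal a shape-sensitive rewrite would use to back off.
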
## Runtime Scalar Dimensions

The long-term exact representation for `DynamicTruncate` is a runtime scalar dimension expression:

```text
Exact(Min(
  RuntimeScalar { input_idx: 1, semantics: DynamicTruncateSize },
  InputAxis { input_idx: 0, axis }
))
```

The `semantics` tag is required because converting a scalar tensor into a dimension is not a generic numeric cast. `DynamicTruncate` currently accepts specific scalar dtypes and applies operation-specific rounding and clamping rules. Those rules must stay explicit.

This should be a second implementation stage. The first stage only needs to stop reporting an upper bound as exact.

## Symbolic Operation Configs

Backend-facing configs should remain concrete. For example, `tenferro_tensor::GatherConfig` can continue to carry `slice_sizes: Vec` because backend kernels execute on concrete tensors.

Graph-facing configs need a symbolic form where shape-derived sizes can appear. The current implementation uses `StdTensorOp::GatherDynamicSliceSizes` and the matching `ExecOp::GatherDynamicSliceSizes`:

```text
GatherDynamicSliceSizes
  offset_dims: Vec
  collapsed_slice_dims: Vec
  start_index_map: Vec
  index_vector_dim: usize
  slice_sizes: Vec
```

Lowering from graph config to backend config resolves symbolic slice sizes against concrete runtime inputs immediately before dispatch. Backends still see the existing concrete `GatherConfig`.

This keeps the layering clean:

- AD and compiler code may express symbolic config sizes.
- Execution resolves them at the backend boundary.
- Backend kernels stay optimized for concrete sizes.

## Scatter Transpose Policy

`transpose_scatter` builds inverse gather `slice_sizes` from the primal updates shape:

1. If all required update-window extents are concrete, keep emitting the existing concrete inverse gather.
2. If an extent is symbolic, emit `GatherDynamicSliceSizes` and add the updates tensor as a non-differentiable shape-source input.

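
The dynamic path in item 2 depends on resolving symbolic `slice_sizes` at the backend boundary. The following is a minimal sketch of that lowering step, assuming a hypothetical `SymbolicSize` element type and a stand-in `ConcreteGatherConfig`; it does not reproduce the real `GatherConfig` or `ExecOp` definitions.

```rust
/// Hypothetical graph-facing slice size: a constant or an axis of a runtime input.
#[derive(Debug, Clone, PartialEq)]
enum SymbolicSize {
    Const(usize),
    InputAxis { input_idx: usize, axis: usize },
}

/// Stand-in for the concrete backend-facing config (only the field of interest).
#[derive(Debug, PartialEq)]
struct ConcreteGatherConfig {
    slice_sizes: Vec<usize>,
}

/// Resolve symbolic slice sizes against the concrete shapes of the runtime
/// inputs, immediately before dispatch. The backend only sees `usize` values.
fn lower_slice_sizes(
    symbolic: &[SymbolicSize],
    input_shapes: &[Vec<usize>],
) -> Result<ConcreteGatherConfig, String> {
    let mut slice_sizes = Vec::with_capacity(symbolic.len());
    for (position, size) in symbolic.iter().enumerate() {
        let resolved = match size {
            SymbolicSize::Const(n) => Some(*n),
            SymbolicSize::InputAxis { input_idx, axis } => input_shapes
                .get(*input_idx)
                .and_then(|shape| shape.get(*axis))
                .copied(),
        };
        match resolved {
            Some(n) => slice_sizes.push(n),
            None => {
                return Err(format!(
                    "slice_sizes[{}] refers to a missing input axis",
                    position
                ))
            }
        }
    }
    Ok(ConcreteGatherConfig { slice_sizes })
}
```

In this sketch the updates tensor would be passed as an extra input (for example at index 2) purely so that its shape is available at resolution time, matching the shape-source role described above.
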
The generated dynamic gather is AD-closed: its forward rule applies the same dynamic gather to the operand tangent, and its transpose emits the same inverse scatter as concrete `Gather` while returning `None` for indices and shape sources.

## Compiler Pass Contract

Compiler passes must state what shape guarantee they require.

| Consumer | Required guarantee | Rule |
|---|---|---|
| Rank checks | exact rank | Rank is always exact metadata. |
| `Transpose` metadata | any known extents | Permute extent metadata without changing guarantees. |
| `BroadcastInDim` execution shape | exact target extents | Reject or defer if any target extent is not exact. |
| `Reshape` execution shape | exact target extents | Never use upper bounds as reshape sizes. |
| `DotDecomposer` merge reshapes | runtime shape inputs for execution; best extent metadata | Emit reshape parameters from actual input shapes, and propagate exact/upper-bound/unknown metadata without upgrading guarantees. |
| DCE and last-use analysis | no extent guarantee | Shape metadata is irrelevant. |
| Diagnostics | best available | May print exact, upper-bound, or unknown metadata. |

The important invariant is that an optimization pass may become conservative, but it may not silently reinterpret upper-bound metadata as exact metadata.

## AD Contract

AD rules must not call `constant_value().unwrap_or_else(panic)` for user-reachable symbolic shapes. There are two acceptable outcomes for newly touched user-reachable symbolic shape paths:

- emit a graph using exact symbolic metadata
- return an unsupported-dynamic-shape error

Current AD rule signatures are not uniformly `Result`-returning. The implementation should introduce one shared error channel before adding future AD paths that cannot be expressed as graph ops.

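
One possible shape for such a shared error channel is sketched here. The `UnsupportedDynamicShape` type, the `ObservedExtent` and `Extent` enums, and the `require_concrete` helper are all hypothetical names; the note does not fix an actual error type or signature.

```rust
use std::fmt;

/// Hypothetical classification of the extent that blocked graph construction.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ObservedExtent {
    Symbolic,
    UpperBound,
    Unknown,
}

/// Hypothetical structured error returned instead of panicking.
#[derive(Debug, Clone, PartialEq)]
struct UnsupportedDynamicShape {
    primitive: &'static str,
    field: &'static str,
    observed: ObservedExtent,
}

impl fmt::Display for UnsupportedDynamicShape {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "{}: `{}` requires an exact extent, found {:?}",
            self.primitive, self.field, self.observed
        )
    }
}

impl std::error::Error for UnsupportedDynamicShape {}

/// Simplified extent as seen by a path that needs a concrete value.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Extent {
    Concrete(usize),
    Symbolic,
    UpperBound(usize),
    Unknown,
}

/// Replacement for the `unwrap_or_else(panic)` pattern: return a value or a
/// structured error, never panic and never treat a bound as the true size.
fn require_concrete(
    extent: Extent,
    primitive: &'static str,
    field: &'static str,
) -> Result<usize, UnsupportedDynamicShape> {
    let observed = match extent {
        Extent::Concrete(n) => return Ok(n),
        Extent::Symbolic => ObservedExtent::Symbolic,
        Extent::UpperBound(_) => ObservedExtent::UpperBound,
        Extent::Unknown => ObservedExtent::Unknown,
    };
    Err(UnsupportedDynamicShape { primitive, field, observed })
}
```

A rule that cannot express its output as graph ops would call a helper like `require_concrete` and propagate the error with `?`, which only works once the rule's signature returns `Result`.
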
The error should identify:

- the primitive
- the metadata field that required an exact extent
- whether the observed extent was symbolic, upper-bound, or unknown

## Alternatives Considered

### Local `DynamicTruncate` patch

Returning another static shape from `shape_infer` is not acceptable. The pre-truncation input shape is only an upper bound, and another guessed static extent would have the same false-exactness bug.

### Symbolic `GatherConfig` only

This would fix one panic, but it would not define how compiler passes should treat exact versus conservative metadata. It also leaves `DynamicTruncate` incorrect.

### Full runtime-scalar shape expressions first

This is the clean end state, but it touches shape inference, lowering, execution, AD, and config resolution at once. The first implementation should land the exactness contract and conservative handling before threading runtime scalar expressions through all layers.

## Migration Plan

1. Done: add the metadata contract types or equivalent internal representation. Preserve current exact behavior for existing static programs.
2. Done: mark `DynamicTruncate`'s truncated axis as an upper bound instead of exact. Update consumers so they do not use that extent as a concrete shape parameter.
3. Done: replace `transpose_scatter`'s symbolic update-window panic with a dynamic graph gather config.
4. Done: add graph-facing symbolic config support for gather-like `slice_sizes`.
5. Deferred: add structured AD construction errors for future dynamic or unsupported shape requirements.
6. Deferred: add exact runtime scalar extents for `DynamicTruncate` once the compiler and execution layers can resolve them safely.

Each step should add focused regression tests. The first tests should cover:

- `DynamicTruncate` metadata no longer being treated as exact pre-truncation shape.
- `transpose_scatter` with symbolic update-window dimensions no longer panicking.
- compiler passes refusing to build `Reshape` or `BroadcastInDim` parameters from upper-bound extents.

## Open Questions

- Should graph-facing symbolic gather config eventually replace `StdTensorOp::Gather` directly, or should `GatherDynamicSliceSizes` remain as the narrow dynamic variant?
- Which public transform APIs should surface unsupported dynamic-shape AD errors first?

The preferred bias is to keep public surface narrow until the compiler and AD contracts settle.