SVD AD Notes

Conventions

Unless noted otherwise, Linearization and Transpose are written for the raw-output-space thin SVD before any DB observable such as u_abs, vh_abs, or uvh_product is applied. For complex tensors, Transpose means the adjoint under the real Frobenius inner product

\langle X, Y \rangle_{\mathbb{R}} = \operatorname{Re}\operatorname{tr}(X^\dagger Y).
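
As a small illustration of this convention, the following NumPy sketch evaluates the real Frobenius inner product and checks numerically that the adjoint of X \mapsto B X C is Y \mapsto B^\dagger Y C^\dagger. The helper names real_inner and crandn are made up for this example and are not part of the DB.

```python
import numpy as np

def real_inner(X, Y):
    """Real Frobenius inner product Re tr(X^dagger Y)."""
    # np.vdot flattens both arguments and conjugates the first one.
    return float(np.real(np.vdot(X, Y)))

rng = np.random.default_rng(0)

def crandn(*shape):
    """Random complex Gaussian array (illustration only)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

B, C = crandn(3, 3), crandn(4, 4)
X, Y = crandn(3, 4), crandn(3, 4)

# Adjoint of the linear map X -> B X C is Y -> B^dagger Y C^dagger.
lhs = real_inner(B @ X @ C, Y)
rhs = real_inner(X, B.conj().T @ Y @ C.conj().T)
print(abs(lhs - rhs))  # expected to be at round-off level
```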

Forward

The raw operator is

A \mapsto (U, S, V^\dagger), \qquad A = U \operatorname{diag}(S) V^\dagger.

Linearization

Let

dP = U^\dagger (dA) V, \qquad dS = \operatorname{Re}(\operatorname{diag}(dP)), \qquad dX = dP - \operatorname{diag}(dS).

Then, with \operatorname{sym}(X) = X + X^\dagger and E the spectral-gap matrix defined in Step 1 of the Reverse Rule (division by E is elementwise), the square-thin linearization is

dU = U \left(\frac{\operatorname{sym}(dX \Sigma)}{E} + \operatorname{diag}\!\left(i \, \operatorname{Im}(\operatorname{diag}(dX)) \oslash (2 S)\right)\right),

dV = V \left(\frac{\operatorname{sym}(\Sigma dX)}{E} - \operatorname{diag}\!\left(i \, \operatorname{Im}(\operatorname{diag}(dX)) \oslash (2 S)\right)\right),

together with the non-square corrections recorded below.

JVP

The JVP is the same linearization evaluated at dA, returned on the raw factor outputs (dU, dS, dV^\dagger).

Transpose

For raw output cotangents (\bar{U}, \bar{S}, \bar{V}), the transpose map is

\bar{A} = U \Gamma V^\dagger + \mathbf{1}_{M > K}(I_M - U U^\dagger)\bar{U}\operatorname{diag}(S_{\text{inv}})V^\dagger + \mathbf{1}_{N > K}U\operatorname{diag}(S_{\text{inv}})\bar{V}^\dagger(I_N - V V^\dagger),

with

\Gamma = \Gamma_{\bar{U}} + \Gamma_{\bar{V}} + \Gamma_{\bar{S}}

defined by the spectral-gap helpers below.

VJP (JAX convention)

In the JAX convention, the VJP is the same raw transpose rule applied to the thin SVD outputs. Gauge-dependent observables should be removed after the raw rule, not baked into it.

VJP (PyTorch convention)

PyTorch uses the same raw adjoint, returned as cotangents for U, S, and Vh. The public DB families avoid raw singular-vector gauge issues by publishing gauge-insensitive observables.

Forward Definition

For a real or complex matrix

A = U \Sigma V^\dagger, \qquad A \in \mathbb{C}^{M \times N}, \qquad K = \min(M, N),

the thin SVD uses

  • U \in \mathbb{C}^{M \times K} with U^\dagger U = I_K
  • \Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_K) with \sigma_i > 0
  • V \in \mathbb{C}^{N \times K} with V^\dagger V = I_K

If a decomposition is returned with full orthonormal factors, the AD rules still depend only on the leading thin factors. The thin SVD is therefore the mathematical source of truth for this note and for the oracle DB.

Reverse Rule

Given cotangents \bar{U}, \bar{S}, and \bar{V} of a real scalar loss \ell, compute the cotangent \bar{A} = \partial \ell / \partial A^*. If an implementation returns Vh = V^\dagger instead of V, translate its cotangent back via \bar{V} = (\bar{Vh})^\dagger before substituting into the formulas below.
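
The Vh-to-V translation can be checked with a short NumPy sketch: under the real Frobenius inner product, pairing \bar{V} with dV gives the same number as pairing \bar{Vh} with d(V^\dagger). The arrays below are random placeholders, and the names Vh_bar, V_bar, and crandn are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(3)

def crandn(*shape):
    """Random complex Gaussian array (illustration only)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

N, K = 5, 3
Vh_bar = crandn(K, N)     # cotangent reported for Vh = V^dagger
dV = crandn(N, K)         # arbitrary tangent of V
dVh = dV.conj().T         # induced tangent of Vh

V_bar = Vh_bar.conj().T   # translated cotangent for V

# Both pairings describe the same change of the real scalar loss.
pair_V = float(np.real(np.vdot(V_bar, dV)))
pair_Vh = float(np.real(np.vdot(Vh_bar, dVh)))
```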

Step 1: Spectral-gap helpers

Define

E_{ij} = \begin{cases} \sigma_j^2 - \sigma_i^2, & i \neq j, \\ 1, & i = j, \end{cases}

and the stabilized inverse-gap matrix, which approximates the elementwise reciprocal of E off the diagonal,

F_{ij} = \begin{cases} \dfrac{\sigma_j^2 - \sigma_i^2}{(\sigma_j^2 - \sigma_i^2)^2 + \eta} \approx \dfrac{1}{\sigma_j^2 - \sigma_i^2}, & i \neq j, \\ 0, & i = j, \end{cases}

with a small \eta > 0 that keeps the entries bounded when singular values are repeated or nearly repeated. Also define

S_{\text{inv},i} = \frac{\sigma_i}{\sigma_i^2 + \eta} \approx \frac{1}{\sigma_i}.

The matrices E and F encode the same off-diagonal inverse-gap information.
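
A minimal NumPy sketch of these helpers, assuming the singular values are given as a 1-D real array. The function name gap_helpers and the default value of eta are choices made for this example only.

```python
import numpy as np

def gap_helpers(S, eta=1e-12):
    """Return (E, F, S_inv) for singular values S, following Step 1."""
    S = np.asarray(S, dtype=float)
    s2 = S**2
    gap = s2[None, :] - s2[:, None]   # gap[i, j] = sigma_j^2 - sigma_i^2
    E = gap + np.eye(S.size)          # off-diagonal gaps, diagonal set to 1
    F = gap / (gap**2 + eta)          # stabilized inverse gap, diagonal is 0
    S_inv = S / (s2 + eta)            # stabilized 1 / sigma_i
    return E, F, S_inv

E, F, S_inv = gap_helpers([3.0, 2.0, 1.0])

# Off the diagonal, F agrees with 1 / E up to the eta regularization.
off = ~np.eye(3, dtype=bool)
max_dev = np.max(np.abs(F[off] - 1.0 / E[off]))
```

Multiplying elementwise by F and dividing elementwise by E are interchangeable off the diagonal; they differ only on the diagonal, where F is 0 and E is 1.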

Step 2: Inner matrix split

Introduce

\Gamma = \Gamma_{\bar{U}} + \Gamma_{\bar{V}} + \Gamma_{\bar{S}}.

The three contributions are:

From \bar{U}

J = F \odot (U^\dagger \bar{U})

\Gamma_{\bar{U}} = (J + J^\dagger)\Sigma + \operatorname{diag}\!\left(i \, \operatorname{Im}(\operatorname{diag}(U^\dagger \bar{U})) \odot S_{\text{inv}}\right).

The off-diagonal part reconstructs the skew-Hermitian variation allowed by the constraint U^\dagger U = I. In the complex case, the diagonal imaginary term captures the phase gauge freedom of the singular vectors. In the real case this term vanishes.

From \bar{V}

K = F \odot (V^\dagger \bar{V})

\Gamma_{\bar{V}} = \Sigma (K + K^\dagger).

This is the right-singular-vector analogue of the \bar{U} path.

From \bar{S}

\Gamma_{\bar{S}} = \operatorname{diag}(\bar{S}).

This is the direct singular-value path.

Step 3: Core square-case formula

For the square-thin part,

\bar{A}_{\text{core}} = U \Gamma V^\dagger.

Equivalently, the same expression can be written as

\bar{A}_{\text{core}} = U \left[ \left(\frac{\operatorname{skew}(U^\dagger \bar{U})}{E}\right)\Sigma + \Sigma \left(\frac{\operatorname{skew}(V^\dagger \bar{V})}{E}\right) + \operatorname{diag}(\bar{S}) + \operatorname{diag}\!\left(i \, \operatorname{Im}(\operatorname{diag}(U^\dagger \bar{U})) \odot S_{\text{inv}}\right) \right] V^\dagger,

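where \operatorname{skew}(X) = X - X^\dagger, division by E is elementwise, and the diagonal imaginary term is grouped into the U side using the gauge constraint discussed below. Under that constraint the diagonal entries of the two skew terms cancel, which is why this form agrees with the zero-diagonal F form of Step 2.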

Step 4: Non-square corrections

The columns of U and of V each span only a K-dimensional subspace. When M \neq N, the component of \bar{U} outside the column space of U (if M > K) or of \bar{V} outside the column space of V (if N > K) contributes an extra term.

Tall case: M > K

\bar{A} \mathrel{+}= (I_M - U U^\dagger)\bar{U} \operatorname{diag}(S_{\text{inv}}) V^\dagger.

Wide case: N > K

\bar{A} \mathrel{+}= U \operatorname{diag}(S_{\text{inv}}) \bar{V}^\dagger (I_N - V V^\dagger).

Complete reverse rule

\bar{A} = U \Gamma V^\dagger + \mathbf{1}_{M > K}(I_M - U U^\dagger)\bar{U}\operatorname{diag}(S_{\text{inv}})V^\dagger + \mathbf{1}_{N > K}U\operatorname{diag}(S_{\text{inv}})\bar{V}^\dagger(I_N - V V^\dagger).
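
The complete rule can be sketched in NumPy as follows. This is an illustrative transcription of Steps 1 to 4, not the implementation of any particular library; the names svd_vjp, loss_and_cotangents, and compare_ad_fd are hypothetical. The sketch compares \langle \bar{A}, dA \rangle_{\mathbb{R}} with a central finite difference for the gauge-invariant loss \operatorname{Re}(U_{1,1}^* V_{1,1}) + c \cdot S, on a test matrix built with singular values 3, 2, 1 so that the gaps are known.

```python
import numpy as np

def svd_vjp(U, S, Vh, U_bar, S_bar, V_bar, eta=1e-20):
    """Sketch of the complete reverse rule. V_bar is the cotangent of V, not Vh.
    The result has complex dtype even for real inputs."""
    V = Vh.conj().T
    (M, K), N = U.shape, V.shape[0]
    s2 = S**2
    gap = s2[None, :] - s2[:, None]                  # sigma_j^2 - sigma_i^2
    F = gap / (gap**2 + eta)                         # zero diagonal
    S_inv = S / (s2 + eta)
    UhUb = U.conj().T @ U_bar
    VhVb = V.conj().T @ V_bar
    J, Kf = F * UhUb, F * VhVb
    Gamma = (J + J.conj().T) * S[None, :]            # (J + J^dagger) Sigma
    Gamma = Gamma + S[:, None] * (Kf + Kf.conj().T)  # Sigma (K + K^dagger)
    Gamma = Gamma + np.diag(S_bar + 1j * np.imag(np.diag(UhUb)) * S_inv)
    A_bar = U @ Gamma @ Vh
    if M > K:   # tall correction
        A_bar = A_bar + ((U_bar - U @ UhUb) * S_inv[None, :]) @ Vh
    if N > K:   # wide correction
        A_bar = A_bar + (U * S_inv[None, :]) @ (V_bar - V @ VhVb).conj().T
    return A_bar

def loss_and_cotangents(A, c):
    """Loss Re(conj(U[0,0]) V[0,0]) + c . S and its raw cotangents."""
    U, S, Vh = np.linalg.svd(A, full_matrices=False)
    V = Vh.conj().T
    loss = float(np.real(np.conj(U[0, 0]) * V[0, 0]) + c @ S)
    U_bar = np.zeros_like(U); U_bar[0, 0] = V[0, 0]
    V_bar = np.zeros_like(V); V_bar[0, 0] = U[0, 0]
    return loss, (U, S, Vh), (U_bar, c, V_bar)

def compare_ad_fd(M, N, seed=0, h=1e-6):
    """Return (<A_bar, dA>, central finite difference) for an M x N test matrix."""
    rng = np.random.default_rng(seed)
    def crandn(*shape):
        return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
    Q1, _ = np.linalg.qr(crandn(M, 3))
    Q2, _ = np.linalg.qr(crandn(N, 3))
    A = (Q1 * np.array([3.0, 2.0, 1.0])) @ Q2.conj().T   # singular values 3, 2, 1
    dA = crandn(M, N)
    c = np.array([0.3, -1.2, 0.7])
    _, (U, S, Vh), (U_bar, S_bar, V_bar) = loss_and_cotangents(A, c)
    A_bar = svd_vjp(U, S, Vh, U_bar, S_bar, V_bar)
    ad = float(np.real(np.vdot(A_bar, dA)))
    fd = (loss_and_cotangents(A + h * dA, c)[0]
          - loss_and_cotangents(A - h * dA, c)[0]) / (2 * h)
    return ad, fd

ad_tall, fd_tall = compare_ad_fd(5, 3)   # exercises the M > K term
ad_wide, fd_wide = compare_ad_fd(3, 5)   # exercises the N > K term
```

The finite-difference comparison is only meaningful because the chosen loss is invariant under the phase gauge; a gauge-dependent loss would pick up whatever phases the SVD routine happens to return at each perturbed point.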

Gauge condition and ill-defined losses

For complex SVD, (U, V) is only defined up to a diagonal phase action

(U, V) \mapsto (U L, V L), \qquad L = \operatorname{diag}(e^{i \theta_k}).

The loss must therefore be invariant along this fibre. A necessary condition is

\operatorname{Im}(\operatorname{diag}(U^\dagger \bar{U} + V^\dagger \bar{V})) = 0.

Losses that violate this condition depend on the arbitrary phase choice, so their derivatives through the singular vectors are ill defined. The DB’s gauge_ill_defined family records those expected failures.
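
The necessary condition can be tested directly on the cotangents. The sketch below uses a hand-built SVD of diag(2, 1) with deliberately non-trivial phases, so the numbers are deterministic. The function name gauge_residual is an assumption of this example.

```python
import numpy as np

def gauge_residual(U, Vh, U_bar, V_bar):
    """max_k |Im diag(U^dagger U_bar + V^dagger V_bar)|; near 0 for gauge-invariant losses."""
    V = Vh.conj().T
    d = np.diag(U.conj().T @ U_bar + V.conj().T @ V_bar)
    return float(np.max(np.abs(np.imag(d))))

# A = diag(2, 1), written with the same phase factor on each (u_k, v_k) pair.
phases = np.exp(1j * np.array([0.3, -0.5]))
U = np.diag(phases)
V = np.diag(phases)
Vh = V.conj().T
S = np.array([2.0, 1.0])
A = (U * S) @ Vh   # the phases cancel, leaving diag(2, 1)

# Gauge-invariant loss Re(conj(U[0,0]) V[0,0]).
U_bar_ok = np.zeros_like(U); U_bar_ok[0, 0] = V[0, 0]
V_bar_ok = np.zeros_like(V); V_bar_ok[0, 0] = U[0, 0]
res_ok = gauge_residual(U, Vh, U_bar_ok, V_bar_ok)

# Gauge-dependent loss Re(U[0,0]): cotangent 1 on U[0,0], nothing on V.
U_bar_bad = np.zeros_like(U); U_bar_bad[0, 0] = 1.0
V_bar_bad = np.zeros_like(V)
res_bad = gauge_residual(U, Vh, U_bar_bad, V_bar_bad)
```

Here res_ok is at round-off level, while res_bad equals sin(0.3), the imaginary part introduced by the arbitrary phase of the first singular-vector pair.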

Forward Rule

The forward rule solves for (dU, dS, dV) using the same spectral-gap machinery. Define

dP = U^\dagger (dA) V, \qquad dS = \operatorname{Re}(\operatorname{diag}(dP)), \qquad dX = dP - \operatorname{diag}(dS),

and let \operatorname{sym}(X) = X + X^\dagger. Then:

Square-thin part

dU = U \left(\frac{\operatorname{sym}(dX \Sigma)}{E} + \operatorname{diag}\!\left(i \, \operatorname{Im}(\operatorname{diag}(dX)) \oslash (2 S)\right)\right)

dV = V \left(\frac{\operatorname{sym}(\Sigma dX)}{E} - \operatorname{diag}\!\left(i \, \operatorname{Im}(\operatorname{diag}(dX)) \oslash (2 S)\right)\right)

dS = \operatorname{Re}(\operatorname{diag}(dP)).

In the real case the diagonal imaginary terms vanish.

Non-square forward corrections

For M > K,

dU \mathrel{+}= (I_M - U U^\dagger)(dA) V \operatorname{diag}(S_{\text{inv}}).

For N > K,

dV \mathrel{+}= (I_N - V V^\dagger)(dA)^\dagger U \operatorname{diag}(S_{\text{inv}}).

Equivalent formulations may return V^\dagger instead of V and therefore report d(V^\dagger) = (dV)^\dagger directly.
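
A NumPy sketch of the forward rule, under the same assumptions as the formulas above (distinct, nonzero singular values). The names svd_jvp and product_rule_residuals are hypothetical. Instead of finite differences, the check uses two identities that the tangents must satisfy: the product rule dA = dU \Sigma V^\dagger + U \operatorname{diag}(dS) V^\dagger + U \Sigma (dV)^\dagger, and skew-Hermitian U^\dagger dU and V^\dagger dV.

```python
import numpy as np

def svd_jvp(A, dA, eta=1e-20):
    """Sketch of the forward rule; returns ((U, S, Vh), (dU, dS, dV)).
    dU and dV have complex dtype even for real inputs."""
    U, S, Vh = np.linalg.svd(A, full_matrices=False)
    V = Vh.conj().T
    M, N = A.shape
    K = S.size
    s2 = S**2
    gap = s2[None, :] - s2[:, None]          # sigma_j^2 - sigma_i^2
    F = gap / (gap**2 + eta)                 # plays the role of 1 / E off the diagonal
    S_inv = S / (s2 + eta)
    dP = U.conj().T @ dA @ V
    dS = np.real(np.diag(dP))
    dX = dP - np.diag(dS)

    def sym(X):
        return X + X.conj().T

    phase = np.diag(0.5j * np.imag(np.diag(dX)) * S_inv)   # i Im(diag dX) / (2 S)
    dU = U @ (F * sym(dX * S[None, :]) + phase)            # sym(dX Sigma) / E
    dV = V @ (F * sym(S[:, None] * dX) - phase)            # sym(Sigma dX) / E
    if M > K:   # (I - U U^dagger) dA V diag(S_inv)
        dU = dU + (dA @ V - U @ dP) * S_inv[None, :]
    if N > K:   # (I - V V^dagger) dA^dagger U diag(S_inv)
        dV = dV + (dA.conj().T @ U - V @ dP.conj().T) * S_inv[None, :]
    return (U, S, Vh), (dU, dS, dV)

def product_rule_residuals(M, N, seed=0):
    """Residuals of the product rule and of the skew-Hermitian tangent conditions."""
    rng = np.random.default_rng(seed)
    def crandn(*shape):
        return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
    A, dA = crandn(M, N), crandn(M, N)
    (U, S, Vh), (dU, dS, dV) = svd_jvp(A, dA)
    V = Vh.conj().T
    recon = (dU * S) @ Vh + (U * dS) @ Vh + (U * S) @ dV.conj().T
    wu = U.conj().T @ dU
    wv = V.conj().T @ dV
    return (np.linalg.norm(recon - dA),
            np.linalg.norm(wu + wu.conj().T),
            np.linalg.norm(wv + wv.conj().T))

res_tall = product_rule_residuals(5, 3)
res_wide = product_rule_residuals(3, 5)
```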

Numerical and Domain Notes

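  • The formulas assume distinct singular values. Repeated singular values make the inverse spectral gap 1 / (\sigma_j^2 - \sigma_i^2) infinite, and nearly repeated ones make it very large; the \eta regularization in F only bounds these entries.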
  • The non-square corrections and the complex diagonal phase term divide by \sigma_i, so for rectangular or complex A the rules also assume the active singular values are nonzero so that S_{\text{inv}} is well defined.
  • full_matrices=True does not make the extra singular-vector columns differentiable; implementations narrow to the thin factors before applying AD.
  • Raw singular vectors are gauge-dependent, so the DB does not publish raw U or Vh.

Verification

Forward reconstruction

Check

\|A - U \operatorname{diag}(S) V^\dagger\|_F < \varepsilon,

together with U^\dagger U \approx I, V^\dagger V \approx I, and descending nonnegative singular values.
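
These forward checks can be written in a few lines of NumPy. The tolerance eps and the random test matrix are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 5, 3
K = min(M, N)
A = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))

# Thin SVD: U is M x K, S has length K, Vh is K x N.
U, S, Vh = np.linalg.svd(A, full_matrices=False)
eps = 1e-10  # assumed tolerance

recon_err = np.linalg.norm(A - U @ np.diag(S) @ Vh)        # Frobenius norm
u_orth_err = np.linalg.norm(U.conj().T @ U - np.eye(K))    # U^dagger U = I
v_orth_err = np.linalg.norm(Vh @ Vh.conj().T - np.eye(K))  # V^dagger V = I
descending = bool(np.all(np.diff(S) <= 0))
nonnegative = bool(np.all(S >= 0))
```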

Backward checks

Representative scalar test functions:

  • dU only: f(A) = \operatorname{Re}(\psi^\dagger H \psi) with \psi = U_{:,1}
  • dV only: f(A) = \operatorname{Re}(\psi^\dagger H \psi) with \psi = V_{:,1}
  • dS only: f(A) = \sum_i \sigma_i
  • mixed: f(A) = \operatorname{Re}(U_{1,1}^* V_{1,1})

where H is a random Hermitian matrix independent of A.
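
One way to set up these checks is sketched below in NumPy. The four test functions follow the list above, with index 1 in the formulas mapped to index 0 in code. The helper names factors and directional_fd are assumptions of this example. Only the dS-only case is compared here, because its cotangent \bar{S} = (1, \ldots, 1) gives \bar{A} = U V^\dagger directly from the reverse rule; the other losses can be passed to the same finite-difference helper.

```python
import numpy as np

rng = np.random.default_rng(4)

def crandn(*shape):
    """Random complex Gaussian array (illustration only)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

M, N = 4, 3
G_u, G_v = crandn(M, M), crandn(N, N)
H_u = G_u + G_u.conj().T   # random Hermitian, independent of A
H_v = G_v + G_v.conj().T

def factors(A):
    """Thin SVD returned as (U, S, V)."""
    U, S, Vh = np.linalg.svd(A, full_matrices=False)
    return U, S, Vh.conj().T

def f_u(A):       # dU only
    psi = factors(A)[0][:, 0]
    return float(np.real(np.vdot(psi, H_u @ psi)))

def f_v(A):       # dV only
    psi = factors(A)[2][:, 0]
    return float(np.real(np.vdot(psi, H_v @ psi)))

def f_s(A):       # dS only
    return float(np.sum(factors(A)[1]))

def f_mixed(A):   # mixed U and V
    U, _, V = factors(A)
    return float(np.real(np.conj(U[0, 0]) * V[0, 0]))

def directional_fd(f, A, dA, h=1e-6):
    """Central finite difference of f along the direction dA."""
    return (f(A + h * dA) - f(A - h * dA)) / (2 * h)

A, dA = crandn(M, N), crandn(M, N)
U, S, V = factors(A)

fd_s = directional_fd(f_s, A, dA)
ad_s = float(np.real(np.vdot(U @ V.conj().T, dA)))   # <U V^dagger, dA>
```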

References

  1. J. Townsend, “Differentiating the Singular Value Decomposition,” 2016.
  2. M. B. Giles, “An extended collection of matrix derivative results for forward and reverse mode automatic differentiation,” 2008.
  3. M. Seeger et al., “Auto-Differentiating Linear Algebra,” 2018.

DB Families

### u_abs

The DB publishes U.abs() rather than raw U to remove sign and phase gauge ambiguity.

### s

The DB publishes the singular values directly.

### vh_abs

The DB publishes the pair (S, Vh.abs()) so that singular values remain paired with a gauge-stable right singular-vector observable.

### uvh_product

The DB publishes (U @ Vh, S), which preserves the gauge-invariant subspace information while keeping the singular values explicit.
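
The gauge stability of these observables can be illustrated by applying an explicit phase action (U, V) \mapsto (U L, V L) to a computed SVD. The phase angles below are arbitrary values chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
U, S, Vh = np.linalg.svd(A, full_matrices=False)

# Apply a diagonal phase L = diag(exp(i theta_k)) to both factors.
L = np.exp(1j * np.array([0.3, 1.1, 2.0]))
U2 = U * L[None, :]              # U L
Vh2 = np.conj(L)[:, None] * Vh   # (V L)^dagger = L^dagger V^dagger

recon_err = np.linalg.norm(A - (U2 * S) @ Vh2)          # still a valid SVD of A
raw_diff = np.linalg.norm(U2 - U)                       # raw U changes
u_abs_diff = np.linalg.norm(np.abs(U2) - np.abs(U))     # u_abs does not
vh_abs_diff = np.linalg.norm(np.abs(Vh2) - np.abs(Vh))  # vh_abs does not
uvh_diff = np.linalg.norm(U2 @ Vh2 - U @ Vh)            # uvh_product does not
```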

### svdvals/identity

The svdvals family is the singular-value-only projection of the same spectral rule. It reuses the singular-value part of the SVD differential.
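
As a sketch of that projection, the singular-value differential dS = \operatorname{Re}(\operatorname{diag}(U^\dagger (dA) V)) can be compared with a central finite difference of the singular values alone. The test matrix is built with singular values 3, 2, 1 so that the ordering is stable under the small perturbation; that construction and the step size h are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(5)

def crandn(*shape):
    """Random complex Gaussian array (illustration only)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

def svdvals(X):
    """Singular values only, in descending order."""
    return np.linalg.svd(X, compute_uv=False)

M, N = 4, 3
Q1, _ = np.linalg.qr(crandn(M, N))   # M x N with orthonormal columns
Q2, _ = np.linalg.qr(crandn(N, N))   # N x N unitary
A = (Q1 * np.array([3.0, 2.0, 1.0])) @ Q2.conj().T
dA = crandn(M, N)

U, S, Vh = np.linalg.svd(A, full_matrices=False)
dS = np.real(np.diag(U.conj().T @ dA @ Vh.conj().T))   # singular-value part of the differential

h = 1e-6
dS_fd = (svdvals(A + h * dA) - svdvals(A - h * dA)) / (2 * h)
max_err = float(np.max(np.abs(dS - dS_fd)))
```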

### gauge_ill_defined

This family records expected failure cases where the chosen loss is not gauge-invariant and derivatives through the decomposition are intentionally ill defined.