From bc81aece53a3ba09aa3342751a6fe71b828d4e0e Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sat, 30 May 2026 12:39:33 -0700 Subject: [PATCH 01/22] Add spec-layer matrix factorizations (Cholesky, QR, symmetric eig, SVD) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Motivation: formalizing Computational Hypergraph Discovery (CHD) in TorchLean. CHD (prototype: https://github.com/TheoBourdais/ComputationalHypergraphDiscovery) is a Gaussian-process / kernel method: it fits a kernel ridge regression per node and prunes "ancestors" by a noise/activation criterion. Essentially every statistically meaningful quantity it computes — the regression solution, the gamma (noise) selection, and the Z-test — is derived from the *full* symmetric eigendecomposition of the kernel matrix K (all eigenvalues AND eigenvectors). TorchLean's spec layer lacked this. The only eigendecomposition available (`Spec.eigendecompSpec`) is a power-iteration stub that recovers just the largest eigenpair, and there were no Cholesky / QR / SVD routines at all. That makes CHD inexpressible. This commit adds real reference implementations so the kernel linear algebra CHD depends on can be written in Lean. 
New: NN/Spec/Core/Tensor/Factorizations.lean (namespace Spec), executable over Float / ℝ, shape-indexed like the rest of the spec layer: - choleskySpec : A = L · Lᵀ for SPD A (lower-triangular L) - qrSpec/qrQSpec/qrRSpec : A = Q · R via modified Gram–Schmidt - symEigJacobiSpec : FULL symmetric eigendecomposition via cyclic Jacobi (all n eigenpairs) — the replacement for the largest-only stub, which is left untouched so PCA keeps building - svdSpec : A = U · diag(σ) · Vᵀ via the eig of Aᵀ·A The iterative Jacobi loop runs over a strict `Array (Array α)` representation (converted to/from Spec.Tensor only at the boundary): threading the functional `Fin n → Fin n → α` representation through one matrix product per rotation builds deep closure chains that blow up under evaluation, whereas arrays are strict values and keep execution cheap. Examples: NN/Examples/Factorization/{Common,Cholesky,QR,SymEig,SVD}.lean (+ the NN.Examples.Factorization umbrella). Each reconstructs the input from its factors and asserts (compiled `assertLt` via #eval, which fails the build) that the maximum reconstruction error is below tolerance. All reconstruct to 0.000000; SymEig recovers eigenvalues {1.3249, 2.4608, 5.2143} and SVD singular values {5, 3, 0} as expected. These are executable reference defs (matching the bar of the existing determinantSpec / inverseSpec / eigendecompSpec); formal correctness theorems are a planned follow-up. 
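Usage sketch, not part of this patch: the example files all follow one pattern, which a minimal 2×2 case makes concrete. The namespace `NN.Examples.Factorization.Sketch` is hypothetical; `mkMat`, `mm`, `tr`, `maxMatErr` and `assertLt` are the helpers from Common.lean in this patch, and the module header mirrors Cholesky.lean. By hand, the Cholesky factor of [[4, 2], [2, 5]] is L = [[2, 0], [1, 2]].

```lean
module

public import NN.Examples.Factorization.Common
meta import NN.Examples.Factorization.Common

@[expose] public section

-- Hypothetical namespace, used only for this sketch.
namespace NN.Examples.Factorization.Sketch

/-- 2×2 SPD input. By hand: `L = [[2, 0], [1, 2]]`, since `L·Lᵀ = [[4, 2], [2, 5]]`. -/
def A : Spec.Tensor Float (.dim 2 (.dim 2 .scalar)) :=
  mkMat [[4, 2],
         [2, 5]]

/-- Lower-triangular factor computed by the new spec. -/
def L : Spec.Tensor Float (.dim 2 (.dim 2 .scalar)) := Spec.choleskySpec A

/-- Maximum entrywise error of the reconstruction `L·Lᵀ` against `A`. -/
def reconErr : Float := maxMatErr A (mm L (tr L))

-- Raises an IO error (and so fails the build) unless reconErr < tol = 1e-6.
#eval assertLt "Sketch: A = L·Lᵀ" reconErr

end NN.Examples.Factorization.Sketch
```

Because `assertLt` throws an `IO` error when the error is not below `tol`, a regression in `choleskySpec` should surface as a failed `#eval` during the build rather than as a silently wrong printout.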
Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Examples/Factorization.lean | 30 +++ NN/Examples/Factorization/Cholesky.lean | 42 +++ NN/Examples/Factorization/Common.lean | 71 +++++ NN/Examples/Factorization/QR.lean | 44 ++++ NN/Examples/Factorization/SVD.lean | 50 ++++ NN/Examples/Factorization/SymEig.lean | 52 ++++ NN/Spec/Core/Tensor/Factorizations.lean | 332 ++++++++++++++++++++++++ 7 files changed, 621 insertions(+) create mode 100644 NN/Examples/Factorization.lean create mode 100644 NN/Examples/Factorization/Cholesky.lean create mode 100644 NN/Examples/Factorization/Common.lean create mode 100644 NN/Examples/Factorization/QR.lean create mode 100644 NN/Examples/Factorization/SVD.lean create mode 100644 NN/Examples/Factorization/SymEig.lean create mode 100644 NN/Spec/Core/Tensor/Factorizations.lean diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean new file mode 100644 index 0000000..7f7be49 --- /dev/null +++ b/NN/Examples/Factorization.lean @@ -0,0 +1,30 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Examples.Factorization.Common +public import NN.Examples.Factorization.Cholesky +public import NN.Examples.Factorization.QR +public import NN.Examples.Factorization.SymEig +public import NN.Examples.Factorization.SVD + +/-! +# Matrix factorization examples + +Executable sanity checks for the spec-layer matrix factorizations in +`NN.Spec.Core.Tensor.Factorizations`: + +- `Cholesky` — `A = L · Lᵀ` +- `QR` — `A = Q · R`, `Qᵀ·Q = I` +- `SymEig` — full symmetric eigendecomposition `A = V · diag(λ) · Vᵀ` +- `SVD` — `A = U · diag(σ) · Vᵀ` + +Each example reconstructs the original matrix and asserts (via `#eval assertLt`) that the maximum +reconstruction error is below `tol`, so the build fails if a factorization is incorrect. 
+-/ + +@[expose] public section diff --git a/NN/Examples/Factorization/Cholesky.lean b/NN/Examples/Factorization/Cholesky.lean new file mode 100644 index 0000000..fc23255 --- /dev/null +++ b/NN/Examples/Factorization/Cholesky.lean @@ -0,0 +1,42 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Examples.Factorization.Common +meta import NN.Examples.Factorization.Common + +/-! +# Example: Cholesky factorization + +`choleskySpec A` returns the lower-triangular `L` with `A = L · Lᵀ` for a symmetric +positive-definite `A`. Here we factor a 3×3 SPD matrix and check the reconstruction error. +-/ + +@[expose] public section + + +namespace NN.Examples.Factorization.Cholesky + +/-- A symmetric positive-definite test matrix. -/ +def A : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := + mkMat [[4, 2, 2], + [2, 5, 3], + [2, 3, 6]] + +/-- The Cholesky factor `L` (lower-triangular). -/ +def L : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := Spec.choleskySpec A + +/-- Reconstruction error `‖A - L·Lᵀ‖_max`. -/ +def reconErr : Float := maxMatErr A (mm L (tr L)) + +-- Inspect the diagonal of the factor. +#eval vecToList (Spec.ofVecFn (fun i : Fin 3 => Spec.get2 L i i)) + +-- Compiled assertion: the factorization reconstructs A (fails the build otherwise). +#eval assertLt "Cholesky A = L·Lᵀ" reconErr + +end NN.Examples.Factorization.Cholesky diff --git a/NN/Examples/Factorization/Common.lean b/NN/Examples/Factorization/Common.lean new file mode 100644 index 0000000..c970e32 --- /dev/null +++ b/NN/Examples/Factorization/Common.lean @@ -0,0 +1,71 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Spec.Core.Tensor.Factorizations + +/-! 
+# Factorization examples — shared helpers + +Small `Float`-valued helpers used by the matrix-factorization examples +(`Cholesky`, `QR`, `SymEig`, `SVD`). These examples are *executable sanity checks*: each one +reconstructs the original matrix from its factors and asserts (via `#eval assertLt`) that the maximum +entrywise reconstruction error is below a tolerance, so the build fails if a factorization is wrong. + +These run over `Float` (the executable 64-bit runtime scalar), which is the precision the +factorizations target for Gaussian-process / kernel-method use. +-/ + +@[expose] public section + + +namespace NN.Examples.Factorization + +/-- Build an `m × n` `Float` matrix tensor from a row-major nested list. Missing entries are `0`. -/ +def mkMat {m n : Nat} (rows : List (List Float)) : Spec.Tensor Float (.dim m (.dim n .scalar)) := + Spec.ofMatFn (fun i j => (rows.getD i.val []).getD j.val 0.0) + +/-- Maximum entrywise absolute difference between two `m × n` matrices. -/ +def maxMatErr {m n : Nat} (A B : Spec.Tensor Float (.dim m (.dim n .scalar))) : Float := + (List.finRange m).foldl (fun acc i => + (List.finRange n).foldl + (fun a j => max a (Float.abs (Spec.get2 A i j - Spec.get2 B i j))) acc) 0.0 + +/-- Matrix product `A · B` (thin wrapper over `matMulSpec`). -/ +def mm {m n p : Nat} (A : Spec.Tensor Float (.dim m (.dim n .scalar))) + (B : Spec.Tensor Float (.dim n (.dim p .scalar))) : Spec.Tensor Float (.dim m (.dim p .scalar)) := + Spec.matMulSpec A B + +/-- Matrix transpose. -/ +def tr {m n : Nat} (A : Spec.Tensor Float (.dim m (.dim n .scalar))) : + Spec.Tensor Float (.dim n (.dim m .scalar)) := + Spec.Tensor.matrixTransposeSpec A + +/-- Turn a length-`n` vector into an `n × n` diagonal matrix. 
-/ +def diagFromVec {n : Nat} (v : Spec.Tensor Float (.dim n .scalar)) : + Spec.Tensor Float (.dim n (.dim n .scalar)) := + Spec.ofMatFn (fun i j => if i.val == j.val then Spec.Tensor.toScalar (Spec.get v i) else 0.0) + +/-- Read a vector tensor back out as a `List Float` (for display). -/ +def vecToList {n : Nat} (v : Spec.Tensor Float (.dim n .scalar)) : List Float := + (List.finRange n).map (fun i => Spec.Tensor.toScalar (Spec.get v i)) + +/-- Shared tolerance for reconstruction-error assertions. -/ +def tol : Float := 1e-6 + +/-- +Compiled assertion used by the examples: print `name: OK (err)` when `err < tol`, otherwise raise an +`IO` error so the build/`#eval` fails. Running this through `#eval` evaluates with the compiler +(fast), unlike `#guard`, which forces slow kernel reduction of the whole factorization. +-/ +def assertLt (name : String) (err : Float) (tolerance : Float := tol) : IO Unit := + if err < tolerance then + IO.println s!"{name}: OK (err = {err})" + else + throw (IO.userError s!"{name}: FAIL (err = {err} ≥ tol = {tolerance})") + +end NN.Examples.Factorization diff --git a/NN/Examples/Factorization/QR.lean b/NN/Examples/Factorization/QR.lean new file mode 100644 index 0000000..0080de7 --- /dev/null +++ b/NN/Examples/Factorization/QR.lean @@ -0,0 +1,44 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Examples.Factorization.Common +meta import NN.Examples.Factorization.Common + +/-! +# Example: QR factorization + +`qrSpec A` returns `(Q, R)` with `A = Q · R`, `Q` having orthonormal columns and `R` +upper-triangular (modified Gram–Schmidt). We check both `A = Q·R` and `Qᵀ·Q = I`. +-/ + +@[expose] public section + + +namespace NN.Examples.Factorization.QR + +/-- A 3×3 test matrix (the classic Householder/QR example). 
-/ +def A : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := + mkMat [[12, -51, 4], + [6, 167, -68], + [-4, 24, -41]] + +/-- Orthonormal `Q` factor. -/ +def Q : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := Spec.qrQSpec A +/-- Upper-triangular `R` factor. -/ +def R : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := Spec.qrRSpec A + +/-- Reconstruction error `‖A - Q·R‖_max`. -/ +def reconErr : Float := maxMatErr A (mm Q R) +/-- Orthonormality error `‖Qᵀ·Q - I‖_max`. -/ +def orthoErr : Float := maxMatErr (mm (tr Q) Q) (Spec.identityTensorSpec 3) + +-- Compiled assertions (fail the build otherwise). +#eval assertLt "QR A = Q·R" reconErr +#eval assertLt "QR Qᵀ·Q = I" orthoErr + +end NN.Examples.Factorization.QR diff --git a/NN/Examples/Factorization/SVD.lean b/NN/Examples/Factorization/SVD.lean new file mode 100644 index 0000000..0b2369c --- /dev/null +++ b/NN/Examples/Factorization/SVD.lean @@ -0,0 +1,50 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Examples.Factorization.Common +meta import NN.Examples.Factorization.Common + +/-! +# Example: singular value decomposition + +`svdSpec A sweeps` returns `(U, σ, V)` with `A = U · diag(σ) · Vᵀ`. The singular values come from +the symmetric eigendecomposition of `Aᵀ·A`. We check the reconstruction of a 2×3 matrix whose +singular values are `{5, 3, 0}`. +-/ + +@[expose] public section + + +namespace NN.Examples.Factorization.SVD + +/-- A 2×3 test matrix with singular values `{5, 3}` (third is `0` since rank 2 < 3). -/ +def A : Spec.Tensor Float (.dim 2 (.dim 3 .scalar)) := + mkMat [[3, 2, 2], + [2, 3, -2]] + +/-- `(U, σ, V)` from the SVD. -/ +def svd : Spec.Tensor Float (.dim 2 (.dim 3 .scalar)) × Spec.Tensor Float (.dim 3 .scalar) × + Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := + Spec.svdSpec A 12 + +/-- Left singular vectors `U` (2×3). 
-/ +def U : Spec.Tensor Float (.dim 2 (.dim 3 .scalar)) := svd.1 +/-- Singular values `σ`. -/ +def σ : Spec.Tensor Float (.dim 3 .scalar) := svd.2.1 +/-- Right singular vectors `V` (3×3). -/ +def V : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := svd.2.2 + +/-- Reconstruction error `‖A - U·diag(σ)·Vᵀ‖_max`. -/ +def reconErr : Float := maxMatErr A (mm (mm U (diagFromVec σ)) (tr V)) + +#eval vecToList σ + +-- Compiled assertion (fails the build otherwise). +#eval assertLt "SVD A = U·diag(σ)·Vᵀ" reconErr + +end NN.Examples.Factorization.SVD diff --git a/NN/Examples/Factorization/SymEig.lean b/NN/Examples/Factorization/SymEig.lean new file mode 100644 index 0000000..5426e07 --- /dev/null +++ b/NN/Examples/Factorization/SymEig.lean @@ -0,0 +1,52 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Examples.Factorization.Common +meta import NN.Examples.Factorization.Common + +/-! +# Example: symmetric eigendecomposition (cyclic Jacobi) + +`symEigJacobiSpec A sweeps` returns `(eigenvalues, V)` for a symmetric `A`, where the columns of +`V` are the (orthonormal) eigenvectors. Unlike the power-iteration `eigendecompSpec`, this recovers +**all** eigenpairs. We check the spectral reconstruction `A = V · diag(λ) · Vᵀ` and orthogonality +`Vᵀ · V = I`. +-/ + +@[expose] public section + + +namespace NN.Examples.Factorization.SymEig + +/-- A symmetric test matrix (eigenvalues ≈ {1.3249, 2.4608, 5.2143}). -/ +def A : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := + mkMat [[2, 1, 1], + [1, 3, 1], + [1, 1, 4]] + +/-- Eigenvalues (diagonal after Jacobi sweeps) and eigenvector matrix `V`. -/ +def eig : Spec.Tensor Float (.dim 3 .scalar) × Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := + Spec.symEigJacobiSpec A 8 + +/-- Eigenvalues. -/ +def evals : Spec.Tensor Float (.dim 3 .scalar) := eig.1 +/-- Eigenvector matrix (columns are eigenvectors). 
-/ +def V : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := eig.2 + +/-- Spectral reconstruction error `‖A - V·diag(λ)·Vᵀ‖_max`. -/ +def reconErr : Float := maxMatErr A (mm (mm V (diagFromVec evals)) (tr V)) +/-- Orthogonality error `‖Vᵀ·V - I‖_max`. -/ +def orthoErr : Float := maxMatErr (mm (tr V) V) (Spec.identityTensorSpec 3) + +#eval vecToList evals + +-- Compiled assertions (fail the build otherwise). +#eval assertLt "SymEig A = V·diag(λ)·Vᵀ" reconErr +#eval assertLt "SymEig Vᵀ·V = I" orthoErr + +end NN.Examples.Factorization.SymEig diff --git a/NN/Spec/Core/Tensor/Factorizations.lean b/NN/Spec/Core/Tensor/Factorizations.lean new file mode 100644 index 0000000..0fbe845 --- /dev/null +++ b/NN/Spec/Core/Tensor/Factorizations.lean @@ -0,0 +1,332 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Spec.Core.Tensor.Linalg +public import NN.Spec.Core.TensorReductionShape.LinearAlgebra + +/-! +# Matrix factorizations (spec layer) + +This file provides **real**, shape-indexed reference implementations of the matrix +factorizations that classical / scientific-ML models (Gaussian processes, kernel ridge +regression, PCA, least squares) depend on, and which were previously missing from the spec +layer: + +- `choleskySpec` — Cholesky factorization `A = L · Lᵀ` (lower-triangular `L`), for SPD `A`. +- `qrSpec` — QR factorization `A = Q · R` via modified Gram–Schmidt + (`Q` has orthonormal columns, `R` upper-triangular). +- `symEigJacobiSpec` — **full** symmetric eigendecomposition via the cyclic Jacobi algorithm + (all eigenpairs, not just the largest). +- `svdSpec` — singular value decomposition `A = U · diag(σ) · Vᵀ`, built on the + symmetric eigendecomposition of `Aᵀ·A`. + +## Relationship to `eigendecompSpec` + +`Spec.eigendecompSpec` (in `NN/Spec/Models/CommonHelpers.lean`) is a power-iteration *stub* that +only recovers the **largest** eigenpair. 
It is intentionally left untouched (PCA depends on it). +`symEigJacobiSpec` here is the full replacement: for a symmetric matrix it returns *all* `n` +eigenvalues and an orthogonal matrix of eigenvectors. + +## Intent / tradeoffs + +Like the rest of the spec layer (`determinantSpec`, `inverseSpec`, `matMulSpec`), these prioritize +**mathematical clarity** and **shape safety** over performance, and are intended for small/medium +matrices and proof-oriented reference code. For large-scale numerics, use array-backed runtime +kernels. + +Internally the algorithms are written over the plain function representation +`Fin n → Fin n → α` (matrices) and `Fin n → α` (vectors), then wrapped back into `Spec.Tensor` +at the boundary. This keeps the numerical formulas readable and keeps later correctness proofs +working on ordinary functions rather than on nested `Tensor` `match`es. + +The iterative routines (Jacobi) take an explicit `sweeps` count: convergence of Jacobi is +asymptotic, so the caller chooses how much work to do. A dozen sweeps is ample for the small +matrices these specs target. +-/ + +@[expose] public section + + +namespace Spec + +open Tensor + +variable {α : Type} [Context α] + +/-! ## Boundary conversions between `Spec.Tensor` and plain functions -/ + +/-- View a matrix tensor as a function `Fin m → Fin n → α`. -/ +def toMatFn {m n : Nat} (A : Tensor α (.dim m (.dim n .scalar))) : Fin m → Fin n → α := + fun i j => get2 A i j + +/-- Build a matrix tensor from a function `Fin m → Fin n → α`. -/ +def ofMatFn {m n : Nat} (f : Fin m → Fin n → α) : Tensor α (.dim m (.dim n .scalar)) := + Tensor.dim (fun i => Tensor.dim (fun j => Tensor.scalar (f i j))) + +/-- View a vector tensor as a function `Fin n → α`. -/ +def toVecFn {n : Nat} (v : Tensor α (.dim n .scalar)) : Fin n → α := + fun i => Tensor.toScalar (get v i) + +/-- Build a vector tensor from a function `Fin n → α`. 
-/ +def ofVecFn {n : Nat} (f : Fin n → α) : Tensor α (.dim n .scalar) := + Tensor.dim (fun i => Tensor.scalar (f i)) + +/-! ## Small numeric helpers on the function representation -/ + +/-- Dot product of two length-`p` vectors. -/ +def dotFn {p : Nat} (u v : Fin p → α) : α := + (List.finRange p).foldl (fun s i => s + u i * v i) 0 + +/-- Euclidean norm of a length-`p` vector. -/ +def normFn {p : Nat} (v : Fin p → α) : α := + MathFunctions.sqrt (dotFn v v) + +/-- Decide `x < y` as a `Bool` (via the `Context`'s decidable `>`). -/ +def ltBool (x y : α) : Bool := Context.gtBool y x + +/-! ## Cholesky factorization + +For a symmetric positive-definite `A`, compute the lower-triangular `L` with `A = L · Lᵀ`. + +The columns are computed left to right. Column `j` uses only columns `0 .. j-1`: + +- diagonal: `L[j,j] = sqrt(A[j,j] - Σ_{k < j} L[j,k]²)` +- below: `L[i,j] = (A[i,j] - Σ_{k < j} L[i,k]·L[j,k]) / L[j,j]` for `i > j` +- above: `L[i,j] = 0` for `i < j` +-/ + +/-- +The list of columns of the Cholesky factor `L`, as length-`n` vectors, computed left to right. +Element `j` of the result is column `j` of `L`. Built by a left fold so that when column `j` is +formed, `cols` already holds columns `0 .. j-1`. +-/ +def choleskyColsFn {n : Nat} (A : Fin n → Fin n → α) : List (Fin n → α) := + (List.finRange n).foldl (fun cols j => + -- Σ_{k < j} L[j,k]² + let sumsq := (cols.map (fun ck => ck j)).foldl (fun s x => s + x * x) 0 + let Ljj := MathFunctions.sqrt (A j j - sumsq) + let colj : Fin n → α := fun i => + if i.val < j.val then 0 + else if i.val == j.val then Ljj + else + -- Σ_{k < j} L[i,k]·L[j,k] + let s := (cols.map (fun ck => ck i * ck j)).foldl (fun acc x => acc + x) 0 + (A i j - s) / Ljj + cols ++ [colj]) [] + +/-- Cholesky factor as a function: `L[i,j] = (choleskyColsFn A)[j] i`. -/ +def choleskyFn {n : Nat} (A : Fin n → Fin n → α) : Fin n → Fin n → α := + let cols := choleskyColsFn A + fun i j => (cols.getD j.val (fun _ => 0)) i + +/-- +Cholesky factorization of a symmetric positive-definite matrix `A`, returning the +lower-triangular factor `L` with `A = L · Lᵀ`. + +PyTorch analogue: `torch.linalg.cholesky(A)`. 
+-/ +def choleskySpec {n : Nat} (A : Tensor α (.dim n (.dim n .scalar))) : + Tensor α (.dim n (.dim n .scalar)) := + ofMatFn (choleskyFn (toMatFn A)) + +/-! ## QR factorization (Gram–Schmidt) + +For `A : m × n`, produce `Q : m × n` with orthonormal columns and `R : n × n` upper-triangular +such that `A = Q · R`. As written, each `r[k,j] = qₖ · a` uses the original column `a` (the classical +update); modified Gram–Schmidt would project the running residual instead. +-/ + +/-- Internal state for the Gram–Schmidt fold: computed `Q` columns and `R` columns so far. -/ +structure GSState (m n : Nat) (α : Type) where + /-- Orthonormal `Q` columns produced so far (each of length `m`). -/ + qs : List (Fin m → α) + /-- `R` columns produced so far (each of length `n`, upper-triangular). -/ + rcols : List (Fin n → α) + +/-- +Run Gram–Schmidt over the columns of `A`, returning the `Q` columns and `R` columns. +Column `j` is orthogonalized against the previously produced `Q` columns. +-/ +def gramSchmidtFn {m n : Nat} (A : Fin m → Fin n → α) : GSState m n α := + (List.finRange n).foldl (fun (st : GSState m n α) j => + let a : Fin m → α := fun i => A i j + -- r[k,j] = qₖ · a for each previously computed column k + let rkjs : List α := st.qs.map (fun qk => dotFn qk a) + -- v = a - Σ r[k,j] qₖ + let v : Fin m → α := fun i => + a i - (List.zip st.qs rkjs).foldl (fun acc (qk, r) => acc + r * qk i) 0 + let rjj := normFn v + let qj : Fin m → α := fun i => if Context.gtBool rjj 0 then v i / rjj else 0 + let rcolj : Fin n → α := fun k => + if k.val < j.val then rkjs.getD k.val 0 + else if k.val == j.val then rjj + else 0 + { qs := st.qs ++ [qj], rcols := st.rcols ++ [rcolj] }) { qs := [], rcols := [] } + +/-- The `Q` factor (orthonormal columns) of the QR factorization of `A`. 
-/ +def qrQSpec {m n : Nat} (A : Tensor α (.dim m (.dim n .scalar))) : + Tensor α (.dim m (.dim n .scalar)) := + let st := gramSchmidtFn (toMatFn A) + ofMatFn (fun i j => (st.qs.getD j.val (fun _ => 0)) i) + +/-- The `R` factor (upper-triangular) of the QR factorization of `A`. -/ +def qrRSpec {m n : Nat} (A : Tensor α (.dim m (.dim n .scalar))) : + Tensor α (.dim n (.dim n .scalar)) := + let st := gramSchmidtFn (toMatFn A) + ofMatFn (fun k j => (st.rcols.getD j.val (fun _ => 0)) k) + +/-- +QR factorization of `A : m × n` via modified Gram–Schmidt, returning `(Q, R)` with +`A = Q · R`, `Q` orthonormal columns, `R` upper-triangular. + +PyTorch analogue: `torch.linalg.qr(A)`. +-/ +def qrSpec {m n : Nat} (A : Tensor α (.dim m (.dim n .scalar))) : + Tensor α (.dim m (.dim n .scalar)) × Tensor α (.dim n (.dim n .scalar)) := + (qrQSpec A, qrRSpec A) + +/-! ## Symmetric eigendecomposition (cyclic Jacobi) + +For a symmetric `A`, iteratively apply Givens rotations `J` that zero one off-diagonal entry at a +time, accumulating `A ← Jᵀ A J` and `V ← V J`. Each `J` is orthogonal, so every step is an +orthogonal similarity: the spectrum is preserved and `V` stays orthogonal. After enough sweeps the +off-diagonal mass vanishes; the diagonal holds the eigenvalues and the columns of `V` are the +eigenvectors. +-/ + +/-! +The iteration below runs over an `Array (Array α)` representation rather than `Fin n → Fin n → α`. +Arrays are strict values, so threading them through the rotation loop cannot build the deep closure +chains that a functional representation would (one matrix product per rotation), which is what keeps +execution cheap. We convert to/from `Spec.Tensor` only at the boundary. +-/ + +/-- Read entry `(i, j)` of an `Array (Array α)` matrix (`0` if out of bounds). -/ +def arrGet (M : Array (Array α)) (i j : Nat) : α := (M.getD i #[]).getD j 0 + +/-- Materialize a matrix function into a strict `Array (Array α)`. 
-/ +def matToArr {n : Nat} (X : Fin n → Fin n → α) : Array (Array α) := + Array.ofFn (fun i : Fin n => Array.ofFn (fun j : Fin n => X i j)) + +/-- Matrix product `X · Y` of two `n × n` array matrices. -/ +def arrMatMul (n : Nat) (X Y : Array (Array α)) : Array (Array α) := + Array.ofFn (fun i : Fin n => Array.ofFn (fun j : Fin n => + (List.finRange n).foldl (fun s k => s + arrGet X i.val k.val * arrGet Y k.val j.val) 0)) + +/-- Transpose of an `n × n` array matrix. -/ +def arrTr (n : Nat) (X : Array (Array α)) : Array (Array α) := + Array.ofFn (fun i : Fin n => Array.ofFn (fun j : Fin n => arrGet X j.val i.val)) + +/-- `n × n` identity as an array matrix. -/ +def arrId (n : Nat) : Array (Array α) := + Array.ofFn (fun i : Fin n => Array.ofFn (fun j : Fin n => if i.val == j.val then 1 else 0)) + +/-- +Givens rotation in the `(p, q)` plane as an array matrix: +identity except `J[p,p]=J[q,q]=c`, `J[p,q]=s`, `J[q,p]=-s`. +-/ +def arrGivens (n : Nat) (p q : Nat) (c s : α) : Array (Array α) := + Array.ofFn (fun i : Fin n => Array.ofFn (fun j : Fin n => + if i.val == p && j.val == p then c + else if i.val == q && j.val == q then c + else if i.val == p && j.val == q then s + else if i.val == q && j.val == p then -s + else if i.val == j.val then 1 else 0)) + +/-- +Apply one Jacobi rotation that targets off-diagonal entry `(p, q)`, updating `(A, V)` as strict +arrays. If `A[p,q]` is already (numerically) zero, the state is returned unchanged. + +The rotation parameters follow Golub & Van Loan: +`τ = (A[q,q] - A[p,p]) / (2 A[p,q])`, `t = sign(τ)/(|τ| + sqrt(1+τ²))` (or `1` if `τ = 0`), +`c = 1/sqrt(1+t²)`, `s = t·c`. 
+-/ +def arrJacobiRotate (n : Nat) (A V : Array (Array α)) (p q : Nat) : + Array (Array α) × Array (Array α) := + let apq := arrGet A p q + if Context.gtBool (MathFunctions.abs apq) 0 then + let τ := (arrGet A q q - arrGet A p p) / (Numbers.two * apq) + let absτ := MathFunctions.abs τ + let sgn : α := if ltBool τ 0 then Numbers.neg_one else 1 + let t : α := + if Context.gtBool absτ 0 then sgn / (absτ + MathFunctions.sqrt (1 + τ * τ)) else 1 + let c := 1 / MathFunctions.sqrt (1 + t * t) + let s := t * c + let J := arrGivens n p q c s + (arrMatMul n (arrTr n J) (arrMatMul n A J), arrMatMul n V J) + else + (A, V) + +/-- All index pairs `(p, q)` with `p < q`, in row-major order (one cyclic Jacobi sweep). -/ +def jacobiPairs (n : Nat) : List (Nat × Nat) := + (List.range n).flatMap (fun p => + (List.range n).filterMap (fun q => if p < q then some (p, q) else none)) + +/-- One Jacobi sweep: rotate through every `(p, q)` pair with `p < q`. -/ +def arrJacobiSweep (n : Nat) (st : Array (Array α) × Array (Array α)) : + Array (Array α) × Array (Array α) := + (jacobiPairs n).foldl (fun s pq => arrJacobiRotate n s.1 s.2 pq.1 pq.2) st + +/-- Run `sweeps` Jacobi sweeps starting from `(A, I)`, returning the rotated `A` and accumulated `V`. -/ +def arrJacobiRun (n : Nat) (A : Array (Array α)) (sweeps : Nat) : + Array (Array α) × Array (Array α) := + (List.range sweeps).foldl (fun st _ => arrJacobiSweep n st) (A, arrId n) + +/-- +Full symmetric eigendecomposition of `A` via cyclic Jacobi, returning `(eigenvalues, eigenvectors)`. + +The eigenvalues are the diagonal of the rotated matrix; the eigenvectors are the **columns** of the +returned matrix `V` (so `eigenvectors[i, j]` is the `i`-th component of the `j`-th eigenvector). +`sweeps` controls how many Jacobi sweeps to run (default `12`). + +Unlike `eigendecompSpec`, this recovers **all** `n` eigenpairs. + +PyTorch analogue: `torch.linalg.eigh(A)`. 
+-/ +def symEigJacobiSpec {n : Nat} (A : Tensor α (.dim n (.dim n .scalar))) (sweeps : Nat := 12) : + Tensor α (.dim n .scalar) × Tensor α (.dim n (.dim n .scalar)) := + let (Af, Vf) := arrJacobiRun n (matToArr (toMatFn A)) sweeps + (ofVecFn (fun i => arrGet Af i.val i.val), ofMatFn (fun i j => arrGet Vf i.val j.val)) + +/-! ## Singular value decomposition + +For `A : m × n`, form the symmetric `M = Aᵀ·A : n × n`, eigendecompose it as `M = V Λ Vᵀ`, +take `σ = sqrt(max(Λ, 0))`, and recover `U` columns as `uⱼ = A vⱼ / σⱼ` (zero when `σⱼ = 0`). +Then `A = U · diag(σ) · Vᵀ`. This is the simplest reference SVD and is exact (up to the Jacobi +sweep count) for `A` of full column rank. +-/ + +/-- +Singular value decomposition of `A : m × n` returning `(U, σ, V)` with +`A = U · diag(σ) · Vᵀ`, `U : m × n` with orthonormal columns (full-rank case), `σ : n` the singular +values, and `V : n × n` orthogonal. + +`sweeps` controls the Jacobi sweep count used for the eigendecomposition of `Aᵀ·A`. + +PyTorch analogue: `torch.linalg.svd(A, full_matrices=False)`. 
+-/ +def svdSpec {m n : Nat} (A : Tensor α (.dim m (.dim n .scalar))) (sweeps : Nat := 12) : + Tensor α (.dim m (.dim n .scalar)) × Tensor α (.dim n .scalar) × + Tensor α (.dim n (.dim n .scalar)) := + let Af := toMatFn A + -- M = Aᵀ A (n × n, symmetric PSD), as a strict array matrix + let M : Array (Array α) := + Array.ofFn (fun i : Fin n => Array.ofFn (fun j : Fin n => + (List.finRange m).foldl (fun s k => s + Af k i * Af k j) 0)) + let (Mf, Vf) := arrJacobiRun n M sweeps + let σ : Fin n → α := fun j => + let d := arrGet Mf j.val j.val + MathFunctions.sqrt (if ltBool d 0 then 0 else d) + let U : Fin m → Fin n → α := fun i j => + let sj := σ j + if Context.gtBool sj 0 then + ((List.finRange n).foldl (fun s k => s + Af i k * arrGet Vf k.val j.val) 0) / sj + else 0 + (ofMatFn U, ofVecFn σ, ofMatFn (fun i j => arrGet Vf i.val j.val)) + +end Spec From 35923c4cbc47241741911a3dea7d0f356e46c707 Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sat, 30 May 2026 15:13:00 -0700 Subject: [PATCH 02/22] Add correctness theorems for matrix factorizations (CHD foundation) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Formal verification companion to the spec-layer factorizations added in e0d08ac. The motivation is Computational Hypergraph Discovery (https://github.com/TheoBourdais/ComputationalHypergraphDiscovery), a Gaussian- process / kernel-ridge method whose numerical core is the full symmetric eigendecomposition of a kernel matrix K (solve_variationnal, find_gamma, Z_test). This commit gives that core a verified linear-algebra foundation. New: NN/Proofs/Tensor/Basic/Factorizations.lean (sorry-free, over ℝ via Mathlib). A refinement architecture: specification predicates on Mathlib matrices, with the CHD consequences proved from the spec independent of the float algorithm. - Specifications: IsCholesky, IsQR, IsSymEig, IsSVD. 
- CHD foundation (consumed by solve_variationnal / find_gamma / Z_test): * IsSymEig.add_smul_inv — the regularized inverse (K + γI)⁻¹ = V·diag(1/(λ+γ))·Vᵀ, proved from orthogonality of V. * IsSymEig.trace_eq / det_eq (trace K = Σλ, det K = Πλ), isHermitian. * IsSVD.gram_isSymEig — an SVD of A is an eigendecomposition of the Gram matrix AᵀA with eigenvalues σ², the form CHD actually builds. - Exact algorithm invariants: trace_orthogonal_conj / det_orthogonal_conj (every Jacobi sweep is a spectrum-preserving orthogonal similarity), givens_normSq (c²+s²=1), choleskyFn_lower_triangular (via a reusable List.foldl indexing lemma getD_foldl_finRange). - Residual certificate (Tier D): symEig_reconstruction_residual and symEig_frobenius_residual prove the reconstruction error equals the off-diagonal mass of the rotated matrix exactly; isSymEig_of_diagonal closes the loop in the zero-residual limit. This replaces an impossible a-priori convergence proof (Mathlib v4.30.0 has no Jacobi convergence theory, and Float never diagonalizes exactly), matching the runtime assertLt checks. Scope honesty: the exact algebraic reconstruction of the finite float folds (A=L·Lᵀ, A=QR) is documented as the remaining increment (it needs a prefix-fold induction plus per-pivot positivity); the spec-level facts CHD relies on do not depend on it. Blueprint: new chapter Ch4_Verification/Factorizations.lean ("Matrix Factorizations for Kernel Methods"), registered in Guide.lean and cross-linked from ScientificMLVerification. 
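Sketch of the simplest invariant above (an illustration only; the statement and proof of `trace_orthogonal_conj` in this patch are not reproduced here and may differ). It assumes a plain `import Mathlib`, uses the hypothetical names `A` and `J`, and states orthogonality as `J * Jᵀ = 1` so that the cancellation is immediate:

```lean
import Mathlib

open Matrix

-- If `J * Jᵀ = 1`, conjugating `A` by `J` leaves the trace unchanged.
example {n : ℕ} (A J : Matrix (Fin n) (Fin n) ℝ) (hJ : J * Jᵀ = 1) :
    (Jᵀ * A * J).trace = A.trace := by
  -- trace (X * Y) = trace (Y * X), with X = Jᵀ * A and Y = J
  rw [Matrix.trace_mul_comm]
  -- reassociate J * (Jᵀ * A) to (J * Jᵀ) * A, then cancel J * Jᵀ = 1
  rw [← Matrix.mul_assoc, hJ, Matrix.one_mul]
```

One Jacobi rotation updates `A ← Jᵀ A J`, so this is the per-rotation reason the trace (Σλ) survives every sweep; the determinant invariant follows the same shape with multiplicativity of `det` in place of cyclicity of `trace`.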
Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Proofs/Tensor/Basic.lean | 1 + NN/Proofs/Tensor/Basic/Factorizations.lean | 387 ++++++++++++++++++ blueprint/TorchLeanBlueprint/Guide.lean | 3 + .../Ch4_Verification/Factorizations.lean | 115 ++++++ .../ScientificMLVerification.lean | 6 + 5 files changed, 512 insertions(+) create mode 100644 NN/Proofs/Tensor/Basic/Factorizations.lean create mode 100644 blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean diff --git a/NN/Proofs/Tensor/Basic.lean b/NN/Proofs/Tensor/Basic.lean index f9a7b2f..c079721 100644 --- a/NN/Proofs/Tensor/Basic.lean +++ b/NN/Proofs/Tensor/Basic.lean @@ -9,6 +9,7 @@ module public import NN.Proofs.Tensor.Basic.Core public import NN.Proofs.Tensor.Basic.Folds public import NN.Proofs.Tensor.Basic.LinearAlgebra +public import NN.Proofs.Tensor.Basic.Factorizations public import NN.Proofs.Tensor.Basic.BoundsNorms public import NN.Proofs.Tensor.Basic.Algebra diff --git a/NN/Proofs/Tensor/Basic/Factorizations.lean b/NN/Proofs/Tensor/Basic/Factorizations.lean new file mode 100644 index 0000000..e4f5fc6 --- /dev/null +++ b/NN/Proofs/Tensor/Basic/Factorizations.lean @@ -0,0 +1,387 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Spec.Core.Tensor.Factorizations +public import NN.Proofs.Tensor.Basic.LinearAlgebra +public import Mathlib.Analysis.Matrix.Spectrum +public import Mathlib.Analysis.Matrix.PosDef +public import Mathlib.Analysis.Matrix.HermitianFunctionalCalculus +public import Mathlib.LinearAlgebra.Matrix.PosDef +public import Mathlib.LinearAlgebra.UnitaryGroup +public import Mathlib.LinearAlgebra.Matrix.NonsingularInverse +public import Mathlib.Data.List.GetD + +/-! 
+# Correctness of the matrix factorizations (foundation for CHD) + +This file provides the **formal correctness theorems** for the spec-layer factorizations in +[`NN.Spec.Core.Tensor.Factorizations`](../../../Spec/Core/Tensor/Factorizations.lean) +(`choleskySpec`, `qrSpec`, `symEigJacobiSpec`, `svdSpec`). The motivation is +[Computational Hypergraph Discovery](https://github.com/TheoBourdais/ComputationalHypergraphDiscovery): +a Gaussian-process / kernel-ridge method whose numerical core reduces to the **full symmetric +eigendecomposition** of a kernel matrix `K`. CHD's `solve_variationnal`, `find_gamma` and `Z_test` +are all expressed through the eigendecomposition of `K`, so a verified linear-algebra foundation is a +prerequisite for formalizing CHD. + +## Architecture (refinement) + +* **Specifications** (`IsCholesky`, `IsQR`, `IsSymEig`, `IsSVD`) are `Prop`s on Mathlib + `Matrix (Fin n) (Fin n) ℝ`. Mathlib's `Matrix m n α` is *definitionally* `m → n → α`, so the + function representation `Spec.toMatFn` produced by the executable specs bridges for free. +* **Foundation theorems** (this is what CHD consumes) are proved from the *specifications*, independent + of the executable algorithm, via Mathlib's spectral theorem and continuous functional calculus. +* **Algorithm theorems** connect the executable `Spec.*Fn` defs to the specifications. Proven here: + the Cholesky factor is lower-triangular (`choleskyFn_lower_triangular`); the Jacobi/SVD routines + satisfy their *exact* invariants — orthogonal similarity preserves trace/determinant + (`trace_orthogonal_conj`, `det_orthogonal_conj`), the Givens rotation is orthogonal + (`givens_normSq`), and the eigendecomposition is exact in the zero-residual limit + (`isSymEig_of_diagonal`), with the finite-sweep error captured a-posteriori by + `symEig_frobenius_residual`. 
+ +## Scope honesty + +`A = V · diag(λ) · Vᵀ` is **not** an exact theorem for the finite-sweep / floating-point Jacobi output; +it is the *target* certified at runtime by the `assertLt` checks in `NN/Examples/Factorization`, and +bounded a-posteriori here by `symEig_frobenius_residual` (residual = off-diagonal mass of `Af`). +Mathlib v4.30.0 has no Jacobi convergence theory and `Float` never diagonalizes exactly, so no +a-priori convergence theorem is possible. + +The exact algebraic reconstruction of the executable *finite* factorizations — `A = L · Lᵀ` for +`choleskyFn` (under SPD pivots) and `A = Q · R`, `Qᵀ Q = 1` for `gramSchmidtFn` (under full column +rank) — is the remaining increment: it requires an induction relating the `List.foldl` prefix at step +`j` to the first `j` columns (extending `getD_foldl_finRange`) plus the per-pivot positivity discharge. +The specification-level consequences CHD needs (above) are independent of that algorithmic step. +-/ + +@[expose] public section + +namespace Spec.Factorization + +open Matrix +open scoped BigOperators + +variable {n : Nat} + +/-! ## Specifications + +The mathematical meaning of each factorization, as a predicate over real matrices. Over `ℝ`, +`star = id` so `conjTranspose = transpose`; we phrase everything with `ᵀ`. +-/ + +/-- `L` is a Cholesky factor of `A`: lower-triangular with `A = L · Lᵀ`. -/ +def IsCholesky (A L : Matrix (Fin n) (Fin n) ℝ) : Prop := + (∀ i j, i < j → L i j = 0) ∧ A = L * Lᵀ + +/-- `(Q, R)` is a QR factorization of `A`: `Q` has orthonormal columns, `R` is upper-triangular, +`A = Q · R`. -/ +def IsQR {m k : Nat} (A Q : Matrix (Fin m) (Fin k) ℝ) (R : Matrix (Fin k) (Fin k) ℝ) : Prop := + Qᵀ * Q = 1 ∧ (∀ i j, j < i → R i j = 0) ∧ A = Q * R + +/-- `(Λ, V)` is a symmetric eigendecomposition of `A`: `V` orthogonal, `A = V · diag(Λ) · Vᵀ`. 
-/ +def IsSymEig (A : Matrix (Fin n) (Fin n) ℝ) (Λ : Fin n → ℝ) (V : Matrix (Fin n) (Fin n) ℝ) : Prop := + Vᵀ * V = 1 ∧ A = V * Matrix.diagonal Λ * Vᵀ + +/-- `(U, σ, V)` is a (thin) SVD of `A`: `U`, `V` have orthonormal columns, `σ ≥ 0`, +`A = U · diag(σ) · Vᵀ`. -/ +def IsSVD {m k : Nat} (A U : Matrix (Fin m) (Fin k) ℝ) (σ : Fin k → ℝ) + (V : Matrix (Fin k) (Fin k) ℝ) : Prop := + Uᵀ * U = 1 ∧ Vᵀ * V = 1 ∧ (∀ j, 0 ≤ σ j) ∧ A = U * Matrix.diagonal σ * Vᵀ + +/-! ## Foundation theorems consumed by CHD + +These follow from the *specification*, not from any particular algorithm. -/ + +/-- A symmetric eigendecomposition exhibits `A` as Hermitian (here: symmetric, over `ℝ`). -/ +theorem IsSymEig.isHermitian {A : Matrix (Fin n) (Fin n) ℝ} {Λ V} + (h : IsSymEig A Λ V) : A.IsHermitian := by + obtain ⟨_, hA⟩ := h + unfold Matrix.IsHermitian + rw [hA] + simp [Matrix.mul_assoc] + +/-- From a symmetric eigendecomposition, an orthogonal matrix `V` satisfies `V · Vᵀ = 1` as well as +`Vᵀ · V = 1`. -/ +theorem IsSymEig.mul_transpose_self {A : Matrix (Fin n) (Fin n) ℝ} {Λ V} + (h : IsSymEig A Λ V) : V * Vᵀ = 1 := + mul_eq_one_comm.mp h.1 + +/-! ### The kernel-ridge / `solve_variationnal` identity + +CHD repeatedly forms `(K + γ I)⁻¹ b`. Diagonalizing `K = V diag(λ) Vᵀ` turns this into a per-eigenvalue +rescaling `V diag(1/(λ+γ)) Vᵀ b`, which is the basis of `solve_variationnal`, `find_gamma` and the +`Z_test`. The identity below is proved purely from orthogonality of `V` (no appeal to Mathlib's own +spectral decomposition), so it holds for *any* eigendecomposition the algorithm returns. -/ + +/-- Conjugating a diagonal by an orthogonal `V` is inverted by conjugating the entrywise inverse: +`(V · diag(d) · Vᵀ) · (V · diag(d⁻¹) · Vᵀ) = 1` when every `d i ≠ 0`. 
-/ +theorem orthogonal_conj_diagonal_mul_inv {V : Matrix (Fin n) (Fin n) ℝ} (hV : Vᵀ * V = 1) + {d : Fin n → ℝ} (hd : ∀ i, d i ≠ 0) : + (V * Matrix.diagonal d * Vᵀ) * (V * Matrix.diagonal (fun i => (d i)⁻¹) * Vᵀ) = 1 := by + have hdd : (Matrix.diagonal d) * (Matrix.diagonal (fun i => (d i)⁻¹)) + = (1 : Matrix (Fin n) (Fin n) ℝ) := by + rw [Matrix.diagonal_mul_diagonal] + rw [show (fun i => d i * (d i)⁻¹) = (fun _ : Fin n => (1 : ℝ)) from + funext fun i => mul_inv_cancel₀ (hd i)] + exact Matrix.diagonal_one + calc + (V * Matrix.diagonal d * Vᵀ) * (V * Matrix.diagonal (fun i => (d i)⁻¹) * Vᵀ) + = V * Matrix.diagonal d * (Vᵀ * V) * Matrix.diagonal (fun i => (d i)⁻¹) * Vᵀ := by + simp [Matrix.mul_assoc] + _ = V * (Matrix.diagonal d * Matrix.diagonal (fun i => (d i)⁻¹)) * Vᵀ := by + rw [hV]; simp [Matrix.mul_assoc] + _ = V * Vᵀ := by rw [hdd, Matrix.mul_one] + _ = 1 := mul_eq_one_comm.mp hV + +/-- `K + γ I` rewritten through the eigendecomposition: `V · diag(λ + γ) · Vᵀ`. -/ +theorem IsSymEig.add_smul_eq {A : Matrix (Fin n) (Fin n) ℝ} {Λ V} + (h : IsSymEig A Λ V) (γ : ℝ) : + A + γ • (1 : Matrix (Fin n) (Fin n) ℝ) + = V * Matrix.diagonal (fun i => Λ i + γ) * Vᵀ := by + obtain ⟨hV, hA⟩ := h + have hVV : V * Vᵀ = 1 := mul_eq_one_comm.mp hV + have hsplit : Matrix.diagonal (fun i => Λ i + γ) + = Matrix.diagonal Λ + γ • (1 : Matrix (Fin n) (Fin n) ℝ) := by + ext i j + by_cases hij : i = j <;> + simp [Matrix.add_apply, Matrix.smul_apply, hij] + rw [hsplit, hA] + rw [Matrix.mul_add, Matrix.add_mul] + congr 1 + rw [Matrix.mul_smul, Matrix.smul_mul, Matrix.mul_one, hVV] + +/-- **Regularized inverse / `solve_variationnal`.** For `γ` avoiding `-λᵢ`, the regularized system +`K + γ I` is inverted by per-eigenvalue rescaling: `(K + γ I)⁻¹ = V · diag(1/(λ + γ)) · Vᵀ`. 
-/ +theorem IsSymEig.add_smul_inv {A : Matrix (Fin n) (Fin n) ℝ} {Λ V} + (h : IsSymEig A Λ V) (γ : ℝ) (hγ : ∀ i, Λ i + γ ≠ 0) : + (A + γ • (1 : Matrix (Fin n) (Fin n) ℝ))⁻¹ + = V * Matrix.diagonal (fun i => (Λ i + γ)⁻¹) * Vᵀ := by + apply Matrix.inv_eq_right_inv + rw [h.add_smul_eq γ] + exact orthogonal_conj_diagonal_mul_inv h.1 hγ + +/-! ### Spectral trace and determinant (used by `find_gamma` / model-evidence terms) -/ + +/-- `trace K = Σ λᵢ`. -/ +theorem IsSymEig.trace_eq {A : Matrix (Fin n) (Fin n) ℝ} {Λ V} + (h : IsSymEig A Λ V) : A.trace = ∑ i, Λ i := by + obtain ⟨hV, hA⟩ := h + rw [hA, Matrix.trace_mul_comm, ← Matrix.mul_assoc, hV, Matrix.one_mul, + Matrix.trace_diagonal] + +/-- `det K = Π λᵢ`. -/ +theorem IsSymEig.det_eq {A : Matrix (Fin n) (Fin n) ℝ} {Λ V} + (h : IsSymEig A Λ V) : A.det = ∏ i, Λ i := by + obtain ⟨hV, hA⟩ := h + have hVV : V * Vᵀ = 1 := mul_eq_one_comm.mp hV + rw [hA, Matrix.det_mul, Matrix.det_mul, Matrix.det_diagonal, + mul_right_comm, ← Matrix.det_mul, hVV, Matrix.det_one, one_mul] + +/-! ### SVD ⟹ eigendecomposition of the Gram matrix + +CHD forms the kernel/Gram matrix `K = Aᵀ A` and eigendecomposes it. An SVD of `A` *is* such an +eigendecomposition, with eigenvalues `σᵢ²` and the same orthogonal `V`. -/ + +/-- The right singular vectors `V` of `A` diagonalize the Gram matrix `Aᵀ A`, with eigenvalues `σᵢ²`. 
-/ +theorem IsSVD.gram_isSymEig {m k : Nat} {A U : Matrix (Fin m) (Fin k) ℝ} + {σ : Fin k → ℝ} {V} (h : IsSVD A U σ V) : + IsSymEig (Aᵀ * A) (fun i => σ i ^ 2) V := by + obtain ⟨hU, hV, _, hA⟩ := h + refine ⟨hV, ?_⟩ + have hσσ : Matrix.diagonal σ * Matrix.diagonal σ + = Matrix.diagonal (fun i => σ i ^ 2) := by + rw [Matrix.diagonal_mul_diagonal]; simp [pow_two] + rw [hA, Matrix.transpose_mul, Matrix.transpose_mul, Matrix.transpose_transpose, + Matrix.diagonal_transpose] + -- V Dᵀ Uᵀ · U D Vᵀ with Dᵀ = D + calc + V * (Matrix.diagonal σ * Uᵀ) * (U * Matrix.diagonal σ * Vᵀ) + = V * Matrix.diagonal σ * (Uᵀ * U) * Matrix.diagonal σ * Vᵀ := by + simp [Matrix.mul_assoc] + _ = V * (Matrix.diagonal σ * Matrix.diagonal σ) * Vᵀ := by + rw [hU]; simp [Matrix.mul_assoc] + _ = V * Matrix.diagonal (fun i => σ i ^ 2) * Vᵀ := by rw [hσσ] + +/-! ## Tier B — exact structural & invariant facts + +These hold *exactly* (no convergence/rounding caveat). The orthogonal-similarity invariants below are +the precise sense in which the Jacobi iteration is faithful: every sweep is an orthogonal similarity +`A ← Jᵀ A J`, so trace, determinant and spectrum are preserved at every step, independent of how far +the off-diagonal has been driven down. -/ + +/-- Orthogonal similarity preserves the trace: `trace (V · M · Vᵀ) = trace M` when `Vᵀ · V = 1`. -/ +theorem trace_orthogonal_conj {V M : Matrix (Fin n) (Fin n) ℝ} (hV : Vᵀ * V = 1) : + (V * M * Vᵀ).trace = M.trace := by + rw [Matrix.trace_mul_comm, ← Matrix.mul_assoc, hV, Matrix.one_mul] + +/-- Orthogonal similarity preserves the determinant: `det (V · M · Vᵀ) = det M` when `Vᵀ · V = 1`. 
-/ +theorem det_orthogonal_conj {V M : Matrix (Fin n) (Fin n) ℝ} (hV : Vᵀ * V = 1) : + (V * M * Vᵀ).det = M.det := by + have hVV : V * Vᵀ = 1 := mul_eq_one_comm.mp hV + rw [Matrix.det_mul, Matrix.det_mul, mul_right_comm, ← Matrix.det_mul, hVV, Matrix.det_one, + one_mul] + +/-- **Givens rotation is orthogonal.** With `c = 1/√(1+t²)` and `s = t·c` (the parameters +`arrJacobiRotate` uses), the rotation satisfies `c² + s² = 1`, so every Jacobi step is an orthogonal +transformation. -/ +theorem givens_normSq (t : ℝ) : + (1 / Real.sqrt (1 + t ^ 2)) ^ 2 + (t * (1 / Real.sqrt (1 + t ^ 2))) ^ 2 = 1 := by + have hpos : (0 : ℝ) < 1 + t ^ 2 := by positivity + have hsqrt : Real.sqrt (1 + t ^ 2) ^ 2 = 1 + t ^ 2 := Real.sq_sqrt hpos.le + have hne : (1 + t ^ 2) ≠ 0 := ne_of_gt hpos + have hc2 : (1 / Real.sqrt (1 + t ^ 2)) ^ 2 = 1 / (1 + t ^ 2) := by + rw [div_pow, one_pow, hsqrt] + rw [mul_pow, hc2] + field_simp + +/-! ### Fold-indexing for the column-building specs + +`choleskyColsFn` and `gramSchmidtFn` build their output with a left fold that appends one column per +index. The lemmas here read off the column produced at a given position, bridging the executable +`List.foldl` form to per-entry reasoning. They are generic over the appended-value function `g`. -/ + +section FoldSnoc + +variable {β : Type _} {ι : Type _} + +/-- A left fold that appends one element per input grows the accumulator by `l.length`. -/ +private theorem length_foldl_snoc (g : List β → ι → β) (l : List ι) (acc : List β) : + (l.foldl (fun s a => s ++ [g s a]) acc).length = acc.length + l.length := by + induction l generalizing acc with + | nil => simp + | cons a t ih => + rw [List.foldl_cons, ih] + simp only [List.length_append, List.length_cons, List.length_nil] + omega + +/-- A fold that only appends never changes an index already inside the accumulator. 
-/ +private theorem getD_foldl_snoc_lt (g : List β → ι → β) (d : β) (l : List ι) (acc : List β) + (k : Nat) (hk : k < acc.length) : + (l.foldl (fun s a => s ++ [g s a]) acc).getD k d = acc.getD k d := by + induction l generalizing acc with + | nil => simp + | cons a t ih => + rw [List.foldl_cons, + ih (acc ++ [g acc a]) (by rw [List.length_append]; omega), + List.getD_append _ _ _ _ hk] + +/-- The element at position `j` of the snoc-fold over `finRange n` is `g` applied to the fold of the +length-`j` prefix and the index `j`. -/ +private theorem getD_foldl_finRange (g : List β → Fin n → β) (d : β) (j : Fin n) : + ((List.finRange n).foldl (fun s a => s ++ [g s a]) []).getD j.val d + = g (((List.finRange n).take j.val).foldl (fun s a => s ++ [g s a]) []) j := by + have hjlen : j.val < (List.finRange n).length := by + rw [List.length_finRange]; exact j.isLt + have htake : (List.finRange n).take (j.val + 1) + = (List.finRange n).take j.val ++ [j] := by + rw [List.take_succ_eq_append_getElem hjlen] + congr 1 + simp [List.getElem_finRange] + have hplen : (((List.finRange n).take j.val).foldl (fun s a => s ++ [g s a]) []).length + = j.val := by + rw [length_foldl_snoc, List.length_nil, List.length_take, List.length_finRange, Nat.zero_add, + Nat.min_eq_left (Nat.le_of_lt j.isLt)] + calc + ((List.finRange n).foldl (fun s a => s ++ [g s a]) []).getD j.val d + = (((List.finRange n).drop (j.val + 1)).foldl (fun s a => s ++ [g s a]) + ((List.finRange n).take (j.val + 1) |>.foldl (fun s a => s ++ [g s a]) [])).getD + j.val d := by + conv_lhs => rw [show List.finRange n + = (List.finRange n).take (j.val + 1) ++ (List.finRange n).drop (j.val + 1) from + (List.take_append_drop _ _).symm] + rw [List.foldl_append] + _ = ((List.finRange n).take (j.val + 1) |>.foldl (fun s a => s ++ [g s a]) []).getD j.val d := by + apply getD_foldl_snoc_lt + rw [length_foldl_snoc, List.length_nil, List.length_take, List.length_finRange, + Nat.zero_add] + omega + _ = g (((List.finRange n).take 
j.val).foldl (fun s a => s ++ [g s a]) []) j := by + rw [htake, List.foldl_append, List.foldl_cons, List.foldl_nil] + rw [List.getD_append_right _ _ _ _ (le_of_eq hplen), hplen, Nat.sub_self] + rfl + +end FoldSnoc + +/-! ### Cholesky factor is lower-triangular + +A structural fact about the executable `choleskyFn`, proved directly from the column fold: the entry +above the diagonal is forced to `0` by the construction. -/ + +/-- Reading an entry of a matrix tensor built by `ofMatFn` returns the underlying function value. -/ +theorem get2_ofMatFn {m k : Nat} (f : Fin m → Fin k → ℝ) (i : Fin m) (j : Fin k) : + Spec.get2 (Spec.ofMatFn f) i j = f i j := rfl + +/-- The executable Cholesky factor is lower-triangular: entries strictly above the diagonal vanish. -/ +theorem choleskyFn_lower_triangular (A : Fin n → Fin n → ℝ) {i j : Fin n} (hij : i.val < j.val) : + Spec.choleskyFn A i j = 0 := by + unfold Spec.choleskyFn Spec.choleskyColsFn + rw [getD_foldl_finRange] + rw [if_pos hij] + +/-- Tensor-level statement: the Cholesky factor `choleskySpec A` is lower-triangular. -/ +theorem choleskySpec_lower_triangular (A : Spec.Tensor ℝ (.dim n (.dim n .scalar))) + {i j : Fin n} (hij : i.val < j.val) : + Spec.get2 (Spec.choleskySpec A) i j = 0 := by + rw [show Spec.choleskySpec A = Spec.ofMatFn (Spec.choleskyFn (Spec.toMatFn A)) from rfl, + get2_ofMatFn] + exact choleskyFn_lower_triangular _ hij + +/-! ## Tier D — convergence as an a-posteriori residual certificate + +The cyclic Jacobi iteration produces `(Λ, V)` from the rotated matrix `Af = Vᵀ A V` (an *exact* +orthogonal similarity — see `trace_orthogonal_conj`), with `Λ` the diagonal of `Af`. After finitely +many sweeps `Af` is only *approximately* diagonal, so `A = V·diag(Λ)·Vᵀ` does not hold exactly (and +never does in floating point). 
Mathlib v4.30.0 has no Jacobi convergence theory, so instead of an +*a-priori* convergence proof we give the *a-posteriori* certificate: the reconstruction residual is +exactly the orthogonal conjugation of the off-diagonal part of `Af`, hence its Frobenius mass equals +the off-diagonal mass — which the runtime `assertLt` checks in `NN/Examples/Factorization` bound on +concrete inputs. -/ + +/-- The off-diagonal part of a matrix (`0` iff the matrix is diagonal). -/ +def offDiagonal (M : Matrix (Fin n) (Fin n) ℝ) : Matrix (Fin n) (Fin n) ℝ := + M - Matrix.diagonal (fun i => M i i) + +/-- **Exact residual identity.** Reconstructing with the diagonal of `Af` leaves exactly the orthogonal +conjugation of `Af`'s off-diagonal part: `A − V·diag(Af)·Vᵀ = V · offDiag(Af) · Vᵀ`. -/ +theorem symEig_reconstruction_residual {A V Af : Matrix (Fin n) (Fin n) ℝ} + (hA : A = V * Af * Vᵀ) : + A - V * Matrix.diagonal (fun i => Af i i) * Vᵀ = V * offDiagonal Af * Vᵀ := by + rw [hA, offDiagonal, Matrix.mul_sub, Matrix.sub_mul] + +/-- **Frobenius residual certificate.** The squared Frobenius reconstruction error +`‖A − V·diag(Af)·Vᵀ‖²` equals the squared Frobenius off-diagonal mass `‖offDiag(Af)‖²` (expressed as +`trace(Rᵀ R)`), because orthogonal conjugation preserves the Frobenius norm. In particular it is `0` +iff `Af` is diagonal — the exact sense in which "more Jacobi sweeps ⟹ smaller residual". 
-/ +theorem symEig_frobenius_residual {A V Af : Matrix (Fin n) (Fin n) ℝ} (hV : Vᵀ * V = 1) + (hA : A = V * Af * Vᵀ) : + ((A - V * Matrix.diagonal (fun i => Af i i) * Vᵀ)ᵀ + * (A - V * Matrix.diagonal (fun i => Af i i) * Vᵀ)).trace + = ((offDiagonal Af)ᵀ * offDiagonal Af).trace := by + rw [symEig_reconstruction_residual hA] + have hB : (V * offDiagonal Af * Vᵀ)ᵀ = V * (offDiagonal Af)ᵀ * Vᵀ := by + rw [Matrix.transpose_mul, Matrix.transpose_mul, Matrix.transpose_transpose, Matrix.mul_assoc] + have key : (V * offDiagonal Af * Vᵀ)ᵀ * (V * offDiagonal Af * Vᵀ) + = V * ((offDiagonal Af)ᵀ * offDiagonal Af) * Vᵀ := by + rw [hB] + calc + (V * (offDiagonal Af)ᵀ * Vᵀ) * (V * offDiagonal Af * Vᵀ) + = V * (offDiagonal Af)ᵀ * (Vᵀ * V) * offDiagonal Af * Vᵀ := by simp [Matrix.mul_assoc] + _ = V * ((offDiagonal Af)ᵀ * offDiagonal Af) * Vᵀ := by rw [hV]; simp [Matrix.mul_assoc] + rw [key] + exact trace_orthogonal_conj hV + +/-- **Conditional correctness of Jacobi.** When the rotated matrix `Af = Vᵀ A V` is diagonal (zero +residual — the limit the sweeps drive toward), the Jacobi output `(diag Af, V)` is an *exact* +symmetric eigendecomposition `IsSymEig`. Together with `symEig_frobenius_residual` this is the precise +correctness statement: orthogonality and the orthogonal-similarity hold always; full diagonalization +holds exactly in the zero-residual limit. 
-/ +theorem isSymEig_of_diagonal {A V Af : Matrix (Fin n) (Fin n) ℝ} (hV : Vᵀ * V = 1) + (hA : A = V * Af * Vᵀ) (hdiag : Af = Matrix.diagonal (fun i => Af i i)) : + IsSymEig A (fun i => Af i i) V := + ⟨hV, by rw [hA]; conv_lhs => rw [hdiag]⟩ + +end Spec.Factorization diff --git a/blueprint/TorchLeanBlueprint/Guide.lean b/blueprint/TorchLeanBlueprint/Guide.lean index 59903a3..b97833d 100644 --- a/blueprint/TorchLeanBlueprint/Guide.lean +++ b/blueprint/TorchLeanBlueprint/Guide.lean @@ -34,6 +34,7 @@ import TorchLeanBlueprint.Guide.Ch4_Verification.ApproximationTheory import TorchLeanBlueprint.Guide.Ch4_Verification.ClassicalMLProofs import TorchLeanBlueprint.Guide.Ch4_Verification.ProbabilityAndGradients import TorchLeanBlueprint.Guide.Ch4_Verification.ScientificMLVerification +import TorchLeanBlueprint.Guide.Ch4_Verification.Factorizations import TorchLeanBlueprint.Guide.Ch4_Verification.Certificates import TorchLeanBlueprint.Guide.Ch4_Verification.FP32Soundness import TorchLeanBlueprint.Guide.Ch4_Verification.TwoStageWorkflows @@ -233,6 +234,8 @@ into precise mathematical statements. {include 2 TorchLeanBlueprint.Guide.Ch4_Verification.ScientificMLVerification} +{include 2 TorchLeanBlueprint.Guide.Ch4_Verification.Factorizations} + {include 2 TorchLeanBlueprint.Guide.Ch4_Verification.Certificates} {include 2 TorchLeanBlueprint.Guide.Ch4_Verification.TwoStageWorkflows} diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean new file mode 100644 index 0000000..13f8df3 --- /dev/null +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -0,0 +1,115 @@ +import VersoManual +import VersoBlueprint + +open Verso.Genre Manual + +#doc (Manual) "Matrix Factorizations for Kernel Methods" => +%%% +tag := "matrix-factorizations" +%%% + +Kernel and Gaussian-process methods do not reduce to a single forward pass. 
Their numerical core is a +matrix factorization. The motivating target here is +[Computational Hypergraph Discovery](https://github.com/TheoBourdais/ComputationalHypergraphDiscovery) +(CHD): a Gaussian-process / kernel-ridge method that recovers the dependency structure of a system by +repeatedly solving regularized kernel systems and testing the resulting variances. Every quantity CHD +inspects — the variational solution, the noise/ridge parameter, and the `Z`-test — is a function of the +*full symmetric eigendecomposition* of a kernel matrix `K`. + +TorchLean previously had only a power-iteration stub that recovers the *largest* eigenpair. The spec +layer now provides real, shape-indexed reference factorizations in +[`NN.Spec.Core.Tensor.Factorizations`](https://github.com/lean-dojo/TorchLean/blob/main/NN/Spec/Core/Tensor/Factorizations.lean): +Cholesky (`choleskySpec`), QR via modified Gram–Schmidt (`qrSpec`), the full symmetric +eigendecomposition via cyclic Jacobi (`symEigJacobiSpec`), and the SVD (`svdSpec`). The correctness +theorems live in +[`NN.Proofs.Tensor.Basic.Factorizations`](https://github.com/lean-dojo/TorchLean/blob/main/NN/Proofs/Tensor/Basic/Factorizations.lean). + +# What "verified factorization" can and cannot mean + +A subtle but decisive point governs the whole chapter. The executable specs are +`Context`-polymorphic and run over Lean's native `Float` (IEEE binary64). Two of them — Cholesky and +QR — are *finite* constructions, so over the reals they reconstruct their input exactly under the +usual success hypotheses. The other two — the cyclic Jacobi eigensolver and the SVD built on it — are +*iterative*. After a finite number of sweeps the rotated matrix is only approximately diagonal, and in +floating point it is never exactly diagonal. Mathlib v4.30.0 contains no Jacobi convergence theory. + +So `A = V · diag(λ) · Vᵀ` is _not_ an a-priori theorem about the floating-point output. 
The honest +verification therefore splits into three kinds of statement, all proved over `ℝ`: + +- *Specification consequences*: facts CHD consumes, proved from a predicate that says "these matrices + form an eigendecomposition", independent of any algorithm. +- *Exact invariants*: properties the algorithm satisfies on the nose at every step. +- *A-posteriori certificate*: an exact identity bounding the reconstruction residual by the + off-diagonal mass, with the runtime `assertLt` checks supplying the numeric bound on concrete inputs. + +# Specification consequences (the CHD foundation) + +The specification predicate is `IsSymEig A Λ V`: an orthogonal `V` (`Vᵀ V = 1`) with +`A = V · diag(Λ) · Vᵀ`. From it the kernel-method facts follow without reference to the solver. + +The central one is the regularized inverse behind `solve_variationnal`. CHD repeatedly forms +`(K + γ I)⁻¹ b`; diagonalizing turns this into a per-eigenvalue rescaling: + +$$`(K+\gamma I)^{-1} = V\,\operatorname{diag}\!\left(\tfrac{1}{\lambda_i+\gamma}\right) V^\top, +\qquad \gamma \neq -\lambda_i.` + +This is `IsSymEig.add_smul_inv`, proved purely from orthogonality of `V` (so it holds for *any* +eigendecomposition the solver returns, not only Mathlib's canonical one). The supporting rewrite +`IsSymEig.add_smul_eq` expresses `K + γI = V · diag(λ + γ) · Vᵀ`, and +`orthogonal_conj_diagonal_mul_inv` is the reusable fact that conjugating a diagonal by an orthogonal +matrix is inverted by conjugating the entrywise inverse. + +The scalar summaries used by `find_gamma` and the evidence terms are `IsSymEig.trace_eq` +(`trace K = Σ λᵢ`) and `IsSymEig.det_eq` (`det K = Π λᵢ`). Symmetry itself is `IsSymEig.isHermitian`. + +CHD actually builds the Gram matrix `K = Aᵀ A`. `IsSVD.gram_isSymEig` records that an SVD of `A` is +exactly an eigendecomposition of that Gram matrix, with eigenvalues `σᵢ²` and the same orthogonal `V` — +connecting the SVD spec to the eigendecomposition foundation. 
+ +# Exact invariants of the algorithms + +Some properties hold exactly, with no convergence or rounding caveat, and these pin down the precise +sense in which the iterative solver is faithful. + +The cyclic Jacobi iteration applies Givens rotations `J` with `A ← Jᵀ A J` and `V ← V J`. Each `J` is +orthogonal: with `c = 1/\sqrt{1+t^2}` and `s = t c` (the parameters the implementation uses), +`givens_normSq` proves `c² + s² = 1`. Consequently every sweep is an *orthogonal similarity*, and +`trace_orthogonal_conj` and `det_orthogonal_conj` show that the trace and determinant of the running +matrix equal those of the original at every step — the spectrum is preserved exactly, however far the +off-diagonal has been driven down. + +For the finite Cholesky construction, `choleskyFn_lower_triangular` (and its tensor-level form +`choleskySpec_lower_triangular`) proves the factor is lower-triangular: entries above the diagonal +vanish by construction. The proof reads the column produced at each position out of the `List.foldl` +that builds the factor, via the reusable indexing lemma `getD_foldl_finRange`. + +# The a-posteriori residual certificate + +For the iterative routines, the replacement for an impossible a-priori convergence proof is an exact +residual identity. Writing `Af = Vᵀ A V` for the rotated matrix and `Λ` for its diagonal, +`symEig_reconstruction_residual` shows + +$$`A - V\,\operatorname{diag}(A_f)\,V^\top \;=\; V\,\operatorname{offDiag}(A_f)\,V^\top,` + +so the reconstruction error is exactly the orthogonal conjugation of the off-diagonal part of `Af`. +Because orthogonal conjugation preserves the Frobenius norm, `symEig_frobenius_residual` upgrades this +to an equality of squared Frobenius masses: + +$$`\bigl\|A - V\,\operatorname{diag}(A_f)\,V^\top\bigr\|_F^2 + \;=\; \bigl\|\operatorname{offDiag}(A_f)\bigr\|_F^2,` + +expressed in Lean as an equality of `trace(Rᵀ R)` terms. 
The residual is `0` exactly when `Af` is +diagonal, which is the precise meaning of "more Jacobi sweeps shrink the error". And in that +zero-residual limit, `isSymEig_of_diagonal` shows the solver output `(diag Af, V)` is an exact +`IsSymEig` decomposition. The numeric `assertLt` reconstruction checks in +`NN/Examples/Factorization` are concrete instances of this certificate: they bound the off-diagonal +mass on specific matrices. + +# What remains + +The exact algebraic reconstruction of the *finite* executable factorizations — `A = L · Lᵀ` for the +Cholesky column fold under positive pivots, and `A = Q · R` with `Qᵀ Q = 1` for Gram–Schmidt under +full column rank — is the natural next increment. It needs an induction relating the `List.foldl` +prefix at step `j` to the first `j` produced columns (a strengthening of `getD_foldl_finRange`) +together with the per-pivot positivity discharge from `Matrix.PosDef`. The specification-level facts +the kernel methods rely on are independent of that step, so the CHD foundation is already in place. diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/ScientificMLVerification.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/ScientificMLVerification.lean index 4f8a12a..e9a497a 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/ScientificMLVerification.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/ScientificMLVerification.lean @@ -146,3 +146,9 @@ models. These sections give readers a clear place to start when their question i The answer is the same pattern we use elsewhere: small certificate formats, explicit parsers, checked predicates, and theorem statements that say exactly which mathematical claim follows. + +Kernel and Gaussian-process methods bring their own version of this discipline through matrix +factorizations rather than corridors or residuals. 
The next section, Matrix Factorizations for Kernel
+Methods, develops the eigendecomposition foundation that Computational Hypergraph Discovery relies on,
+including the same split between exact specification consequences and a-posteriori numeric
+certificates.

From e9d851ffc36ca849ceb87e6d99b39db53f08531d Mon Sep 17 00:00:00 2001
From: Nicolas Rouquette
Date: Sat, 30 May 2026 15:50:44 -0700
Subject: [PATCH 03/22] Add exact Cholesky reconstruction A = L·Lᵀ (finite-fold increment)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Prove the exact algebraic reconstruction of the finite executable Cholesky
factorization over ℝ, the increment documented in Basic/Factorizations.lean.

`isCholesky_of_pos`: for symmetric A with positive executable pivots
(0 < choleskyFn A j j, the success condition over ℝ), L = choleskyFn A is a
genuine Cholesky factor (lower-triangular, A = L·Lᵀ), satisfying the
`IsCholesky` spec. Tensor-level corollary `choleskySpec_reconstruction`.

Method: a general snoc-fold read lemma (`getD_foldl_snoc_read`) reads the j-th
built column as the step function on the length-j prefix; `prefix_eq_map`
identifies that prefix with the first j columns of L; `take_map_sum_eq`
rewrites the code's List.foldl sums as masked Finset partial sums. Positive
pivots discharge the √-radicand and divisor side conditions; symmetry of A
lifts the lower-triangular reconstruction to the full matrix.

Blueprint: new "Exact Cholesky reconstruction" section; "What remains" narrowed
to QR (dual-list GSState structure-fold + the orthonormality invariant).

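Illustration of the snoc-fold read pattern (a toy sketch in core Lean, not the
patch's code). The names `SnocFoldDemo`, `step` and `snocFold` are hypothetical
and stand in for the Cholesky column step; the point is only that position k of
the full fold equals the step function applied to the fold of the length-k
prefix and the k-th input, which is the shape of `getD_foldl_snoc_read`:

```lean
namespace SnocFoldDemo

-- Hypothetical step function: the appended value depends on everything built
-- so far (here, the sum of the accumulator) and on the current index.
def step (s : List Nat) (a : Nat) : Nat := s.foldl (· + ·) 0 + a + 1

-- A left fold that appends ("snocs") one element per input index.
def snocFold (l : List Nat) : List Nat :=
  l.foldl (fun s a => s ++ [step s a]) []

-- Concrete instance of the read lemma at k = 2 on the inputs [0, 1, 2, 3]:
-- entry 2 of the full fold is `step` applied to the fold of the first two
-- inputs and to the input at position 2.
example :
    (snocFold [0, 1, 2, 3]).getD 2 0
      = step (snocFold ([0, 1, 2, 3].take 2)) 2 := by
  decide

end SnocFoldDemo
```

In the patch the same reading is proved once for an arbitrary step function and
input list, then specialized to the Cholesky column step.
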
Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Proofs/Tensor/Basic.lean | 1 + .../Basic/FactorizationsReconstruction.lean | 377 ++++++++++++++++++ .../Ch4_Verification/Factorizations.lean | 38 +- 3 files changed, 410 insertions(+), 6 deletions(-) create mode 100644 NN/Proofs/Tensor/Basic/FactorizationsReconstruction.lean diff --git a/NN/Proofs/Tensor/Basic.lean b/NN/Proofs/Tensor/Basic.lean index c079721..9e44d53 100644 --- a/NN/Proofs/Tensor/Basic.lean +++ b/NN/Proofs/Tensor/Basic.lean @@ -10,6 +10,7 @@ public import NN.Proofs.Tensor.Basic.Core public import NN.Proofs.Tensor.Basic.Folds public import NN.Proofs.Tensor.Basic.LinearAlgebra public import NN.Proofs.Tensor.Basic.Factorizations +public import NN.Proofs.Tensor.Basic.FactorizationsReconstruction public import NN.Proofs.Tensor.Basic.BoundsNorms public import NN.Proofs.Tensor.Basic.Algebra diff --git a/NN/Proofs/Tensor/Basic/FactorizationsReconstruction.lean b/NN/Proofs/Tensor/Basic/FactorizationsReconstruction.lean new file mode 100644 index 0000000..386d77c --- /dev/null +++ b/NN/Proofs/Tensor/Basic/FactorizationsReconstruction.lean @@ -0,0 +1,377 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Spec.Core.Tensor.Factorizations +public import NN.Proofs.Tensor.Basic.Factorizations +public import Mathlib.Data.List.GetD +public import Mathlib.Algebra.BigOperators.Fin + +/-! +# Exact reconstruction of the finite Cholesky factorization + +This file proves the *exact* algebraic reconstruction of the finite executable Cholesky +factorization from [`NN.Spec.Core.Tensor.Factorizations`](../../../Spec/Core/Tensor/Factorizations.lean), +the increment promised in `NN.Proofs.Tensor.Basic.Factorizations`. 
Unlike the iterative Jacobi/SVD +routines (whose reconstruction is only an a-posteriori residual certificate), Cholesky is a *finite* +construction, so over `ℝ` it reconstructs its input on the nose under the success hypothesis. + +## Main result + +`isCholesky_of_pos`: for a symmetric `A : Fin n → Fin n → ℝ` whose executable Cholesky pivots are all +positive (`0 < choleskyFn A j j`, the exact condition under which the algorithm succeeds over `ℝ`), +the factor `L = choleskyFn A` satisfies the specification `Spec.Factorization.IsCholesky`: +it is lower-triangular and `A = L · Lᵀ`. `choleskySpec_reconstruction` is the tensor-level corollary. + +## Method + +The executable factor is built by a `List.foldl` that snocs one column per index. The core technical +device is `getD_foldl_snoc_read`, a general lemma reading the `j`-th element of such a fold as the +step function applied to the length-`j` prefix. From it, `prefix_eq_map` identifies the prefix of +columns with the first `j` columns of the final factor `L`, and `take_map_sum_eq` turns the code's +`List.foldl` sums into masked `Finset` partial sums. The positive-pivot hypothesis discharges the two +side conditions (`√` radicand `> 0` for the diagonal, divisor `≠ 0` for the below-diagonal entries). + +## Scope + +The QR factorization's exact reconstruction (`A = Q · R` from `gramSchmidtFn`, plus the orthonormality +`Qᵀ Q = 1`) is the remaining finite-fold increment. It needs analogous read lemmas for the +`GSState` *dual-list* structure-fold (the step writes both `qs` and `rcols`), and `Qᵀ Q = 1` +additionally requires the Gram–Schmidt orthogonality invariant, which Mathlib only provides for its +own `gramSchmidt`, not for this executable variant. +-/ + +@[expose] public section + +namespace Spec.Factorization.Reconstruction + +open Matrix +open scoped BigOperators + +variable {n : Nat} + +/-! ## List/Finset bridges -/ + +/-- A left `+`-fold accumulates the list sum. 
-/ +theorem foldl_add_eq_sum (l : List ℝ) (a : ℝ) : + l.foldl (· + ·) a = a + l.sum := by + induction l generalizing a with + | nil => simp + | cons x t ih => rw [List.foldl_cons, ih, List.sum_cons]; ring + +/-- A left `s + x*x`-fold accumulates the sum of squares. -/ +theorem foldl_addsq_eq_sum (l : List ℝ) (a : ℝ) : + l.foldl (fun s x => s + x * x) a = a + (l.map (fun x => x * x)).sum := by + induction l generalizing a with + | nil => simp + | cons x t ih => rw [List.foldl_cons, ih, List.map_cons, List.sum_cons]; ring + +/-- A `Fin n` sum is the foldl-sum over `finRange n`. -/ +theorem finsum_eq_finRange_sum (h : Fin n → ℝ) : + ∑ i, h i = ((List.finRange n).map h).sum := by + rw [← List.sum_toFinset _ (List.nodup_finRange n)] + · simp [List.toFinset_finRange] + +/-! ## General snoc-fold read lemmas -/ + +section FoldSnoc + +variable {β : Type _} {ι : Type _} + +/-- A left fold that appends one element per input grows the accumulator by `l.length`. -/ +theorem length_foldl_snoc (g : List β → ι → β) (l : List ι) (acc : List β) : + (l.foldl (fun s a => s ++ [g s a]) acc).length = acc.length + l.length := by + induction l generalizing acc with + | nil => simp + | cons a t ih => + rw [List.foldl_cons, ih] + simp only [List.length_append, List.length_cons, List.length_nil] + omega + +/-- A fold that only appends never changes an index already inside the accumulator. -/ +theorem getD_foldl_snoc_lt (g : List β → ι → β) (d : β) (l : List ι) (acc : List β) + (k : Nat) (hk : k < acc.length) : + (l.foldl (fun s a => s ++ [g s a]) acc).getD k d = acc.getD k d := by + induction l generalizing acc with + | nil => simp + | cons a t ih => + rw [List.foldl_cons, + ih (acc ++ [g acc a]) (by rw [List.length_append]; omega), + List.getD_append _ _ _ _ hk] + +/-- The element at position `k` of the snoc-fold over an arbitrary list `l` is `g` applied to the +fold of the length-`k` prefix and the `k`-th element. 
-/ +theorem getD_foldl_snoc_read (g : List β → ι → β) (d : β) (l : List ι) (k : Nat) + (hk : k < l.length) : + (l.foldl (fun s a => s ++ [g s a]) []).getD k d + = g ((l.take k).foldl (fun s a => s ++ [g s a]) []) (l[k]'hk) := by + have htake : l.take (k + 1) = l.take k ++ [l[k]'hk] := List.take_succ_eq_append_getElem hk + have hplen : ((l.take k).foldl (fun s a => s ++ [g s a]) []).length = k := by + rw [length_foldl_snoc, List.length_nil, List.length_take, Nat.zero_add, + Nat.min_eq_left (le_of_lt hk)] + calc + (l.foldl (fun s a => s ++ [g s a]) []).getD k d + = ((l.drop (k + 1)).foldl (fun s a => s ++ [g s a]) + ((l.take (k + 1)).foldl (fun s a => s ++ [g s a]) [])).getD k d := by + conv_lhs => rw [show l = l.take (k + 1) ++ l.drop (k + 1) from + (List.take_append_drop _ _).symm] + rw [List.foldl_append] + _ = ((l.take (k + 1)).foldl (fun s a => s ++ [g s a]) []).getD k d := by + apply getD_foldl_snoc_lt + rw [length_foldl_snoc, List.length_nil, List.length_take, Nat.zero_add] + omega + _ = g ((l.take k).foldl (fun s a => s ++ [g s a]) []) (l[k]'hk) := by + rw [htake, List.foldl_append, List.foldl_cons, List.foldl_nil] + rw [List.getD_append_right _ _ _ _ (le_of_eq hplen), hplen, Nat.sub_self] + rfl + +end FoldSnoc + +/-! ## Cholesky: the column-building step + +`choleskyColsFn` is a left fold that snocs one column per index. `cholStep` names the function it +appends, so that the read lemmas above can be specialized to it. -/ + +/-- The column appended at index `j` of the Cholesky fold, given the columns `cols` built so far. 
-/ +noncomputable def cholStep (A : Fin n → Fin n → ℝ) (cols : List (Fin n → ℝ)) (j : Fin n) : + Fin n → ℝ := + let sumsq := (cols.map (fun ck => ck j)).foldl (fun s x => s + x * x) 0 + let Ljj := MathFunctions.sqrt (A j j - sumsq) + fun i => + if i.val < j.val then 0 + else if i.val == j.val then Ljj + else + let s := (cols.map (fun ck => ck i * ck j)).foldl (fun acc x => acc + x) 0 + (A i j - s) / Ljj + +/-- `choleskyColsFn` is the snoc-fold appending `cholStep`. -/ +theorem choleskyColsFn_eq (A : Fin n → Fin n → ℝ) : + Spec.choleskyColsFn A + = (List.finRange n).foldl (fun cols j => cols ++ [cholStep A cols j]) [] := rfl + +/-- The diagonal value produced by `cholStep`. -/ +theorem cholStep_diag (A : Fin n → Fin n → ℝ) (cols : List (Fin n → ℝ)) (j : Fin n) : + cholStep A cols j j + = MathFunctions.sqrt (A j j - (cols.map (fun ck => ck j)).foldl (fun s x => s + x * x) 0) := by + simp only [cholStep] + rw [if_neg (lt_irrefl _), if_pos (beq_self_eq_true _)] + +/-- The below-diagonal value produced by `cholStep`. -/ +theorem cholStep_offdiag (A : Fin n → Fin n → ℝ) (cols : List (Fin n → ℝ)) {i j : Fin n} + (hij : j.val < i.val) : + cholStep A cols j i + = (A i j - (cols.map (fun ck => ck i * ck j)).foldl (fun acc x => acc + x) 0) + / MathFunctions.sqrt (A j j - (cols.map (fun ck => ck j)).foldl (fun s x => s + x * x) 0) := by + simp only [cholStep] + rw [if_neg (by omega), if_neg (by rw [beq_iff_eq]; omega)] + +/-- The length-`j` prefix of Cholesky columns built before index `j`. -/ +noncomputable def prefixCols (A : Fin n → Fin n → ℝ) (j : Fin n) : List (Fin n → ℝ) := + ((List.finRange n).take j.val).foldl (fun cols k => cols ++ [cholStep A cols k]) [] + +/-- Entry `(i, j)` of the executable Cholesky factor equals `cholStep` evaluated on the prefix. 
-/ +theorem choleskyFn_eq_step (A : Fin n → Fin n → ℝ) (i j : Fin n) : + Spec.choleskyFn A i j = cholStep A (prefixCols A j) j i := by + have hlen : j.val < (List.finRange n).length := by rw [List.length_finRange]; exact j.isLt + show (Spec.choleskyColsFn A).getD j.val (fun _ => 0) i = _ + rw [choleskyColsFn_eq, getD_foldl_snoc_read (fun cols k => cholStep A cols k) (fun _ => 0) + (List.finRange n) j.val hlen] + have hj : (List.finRange n)[j.val]'hlen = j := by simp [List.getElem_finRange] + rw [hj] + rfl + +/-- The prefix of Cholesky columns is exactly the first `j` columns of the final factor `L`, +each presented as the function `r ↦ L r k`. -/ +theorem prefix_eq_map (A : Fin n → Fin n → ℝ) (j : Fin n) : + prefixCols A j + = ((List.finRange n).take j.val).map (fun k => fun r => Spec.choleskyFn A r k) := by + have hjval : ((List.finRange n).take j.val).length = j.val := by + rw [List.length_take, List.length_finRange, Nat.min_eq_left (le_of_lt j.isLt)] + apply List.ext_getElem + · unfold prefixCols + rw [length_foldl_snoc (fun cols k => cholStep A cols k), List.length_nil, Nat.zero_add, + List.length_map] + · intro p h1 h2 + rw [List.length_map, hjval] at h2 + have hpn : p < n := lt_trans h2 j.isLt + rw [List.getElem_map] + have hidx : ((List.finRange n).take j.val)[p]'(by rw [hjval]; exact h2) = (⟨p, hpn⟩ : Fin n) := by + rw [List.getElem_take, List.getElem_finRange]; exact Fin.ext rfl + rw [show (prefixCols A j)[p]'h1 = (prefixCols A j).getD p (fun _ => 0) from + (List.getD_eq_getElem _ _ h1).symm] + unfold prefixCols + rw [getD_foldl_snoc_read (fun cols k => cholStep A cols k) (fun _ => 0) + ((List.finRange n).take j.val) p (by rw [hjval]; exact h2)] + rw [List.take_take, Nat.min_eq_left (le_of_lt h2), hidx] + funext r + rw [choleskyFn_eq_step] + rfl + +/-! ### List/Finset partial-sum bridges -/ + +/-- Every element of a `finRange` prefix has index below the cut. 
-/ +theorem mem_take_finRange {m : Nat} {x : Fin n} (hx : x ∈ (List.finRange n).take m) : + x.val < m := by + obtain ⟨p, hp, hpx⟩ := List.getElem_of_mem hx + rw [List.length_take, List.length_finRange] at hp + rw [List.getElem_take, List.getElem_finRange] at hpx + subst hpx + exact lt_of_lt_of_le hp (Nat.min_le_left m n) + +/-- Every element of a `finRange` tail has index at least the cut. -/ +theorem mem_drop_finRange {m : Nat} {x : Fin n} (hx : x ∈ (List.finRange n).drop m) : + m ≤ x.val := by + obtain ⟨p, hp, hpx⟩ := List.getElem_of_mem hx + rw [List.getElem_drop, List.getElem_finRange] at hpx + subst hpx + exact Nat.le_add_right m p + +/-- Mapping `f` over a `finRange` prefix and summing equals the masked full sum. -/ +theorem take_map_sum_eq (m : Nat) (f : Fin n → ℝ) : + (((List.finRange n).take m).map f).sum = ∑ k : Fin n, if k.val < m then f k else 0 := by + rw [finsum_eq_finRange_sum] + conv_rhs => rw [show (List.finRange n) + = (List.finRange n).take m ++ (List.finRange n).drop m from (List.take_append_drop _ _).symm] + rw [List.map_append, List.sum_append] + have htake : ((List.finRange n).take m).map (fun k => if k.val < m then f k else 0) + = ((List.finRange n).take m).map f := + List.map_congr_left (fun x hx => if_pos (mem_take_finRange hx)) + have hdrop : (((List.finRange n).drop m).map (fun k => if k.val < m then f k else 0)).sum = 0 := by + rw [List.sum_eq_zero] + intro y hy + rw [List.mem_map] at hy + obtain ⟨x, hx, rfl⟩ := hy + exact if_neg (by have := mem_drop_finRange hx; omega) + rw [htake, hdrop, add_zero] + +/-- The Cholesky cross-sum equals the masked partial dot product of rows `i` and `j` of `L`. 
-/ +theorem cross_sum_eq (A : Fin n → Fin n → ℝ) (i j : Fin n) : + ((prefixCols A j).map (fun ck => ck i * ck j)).foldl (fun acc x => acc + x) 0 + = ∑ k : Fin n, if k.val < j.val then Spec.choleskyFn A i k * Spec.choleskyFn A j k else 0 := by + rw [prefix_eq_map, List.map_map, foldl_add_eq_sum, zero_add, + show ((fun ck : Fin n → ℝ => ck i * ck j) ∘ fun k => fun r => Spec.choleskyFn A r k) + = (fun k => Spec.choleskyFn A i k * Spec.choleskyFn A j k) from rfl] + exact take_map_sum_eq j.val (fun k => Spec.choleskyFn A i k * Spec.choleskyFn A j k) + +/-- The Cholesky diagonal sum-of-squares equals the masked partial squared norm of row `j` of `L`. -/ +theorem sumsq_eq (A : Fin n → Fin n → ℝ) (j : Fin n) : + ((prefixCols A j).map (fun ck => ck j)).foldl (fun s x => s + x * x) 0 + = ∑ k : Fin n, if k.val < j.val then Spec.choleskyFn A j k * Spec.choleskyFn A j k else 0 := by + rw [prefix_eq_map, List.map_map, foldl_addsq_eq_sum, zero_add, List.map_map, + show ((fun x : ℝ => x * x) ∘ ((fun ck : Fin n → ℝ => ck j) ∘ fun k => fun r => Spec.choleskyFn A r k)) + = (fun k => Spec.choleskyFn A j k * Spec.choleskyFn A j k) from rfl] + exact take_map_sum_eq j.val (fun k => Spec.choleskyFn A j k * Spec.choleskyFn A j k) + +/-! ### Closed-form entries of the executable Cholesky factor -/ + +/-- Over `ℝ`, the `Context` square root is `Real.sqrt`. -/ +theorem mfsqrt_eq (x : ℝ) : MathFunctions.sqrt x = Real.sqrt x := rfl + +/-- The diagonal entry of `L` in closed form: `L[j,j] = √(A[j,j] − Σ_{k<j} L[j,k]²)`. -/ +theorem choleskyFn_diag_eq (A : Fin n → Fin n → ℝ) (j : Fin n) : + Spec.choleskyFn A j j + = Real.sqrt (A j j - ∑ k, if k.val < j.val then Spec.choleskyFn A j k * Spec.choleskyFn A j k else 0) := by + rw [choleskyFn_eq_step A j j, cholStep_diag, sumsq_eq, mfsqrt_eq] + +/-- The below-diagonal entry of `L` in closed form: +`L[i,j] = (A[i,j] − Σ_{k<j} L[i,k]·L[j,k]) / L[j,j]` for `i > j`. -/ +theorem choleskyFn_offdiag_eq (A : Fin n → Fin n → ℝ) {i j : Fin n} (hij : j.val < i.val) : + Spec.choleskyFn A i j + = (A i j - ∑ k, if k.val < j.val then Spec.choleskyFn A i k * Spec.choleskyFn A j k else 0) + / Spec.choleskyFn A j j := by + rw [choleskyFn_eq_step A i j, cholStep_offdiag _ _ hij, cross_sum_eq, sumsq_eq, mfsqrt_eq, + ← choleskyFn_diag_eq] + +/-!
### Reconstruction `A = L · Lᵀ` + +Each entry of `L · Lᵀ` is reconstructed by peeling the pivot term (`k = j`) off the product sum, +then applying the closed-form entries and the positive-pivot hypothesis (`0 < L[j,j]`), which is +exactly the condition under which the executable Cholesky succeeds over `ℝ`. -/ + +/-- Per-entry reconstruction for the lower part (`j ≤ i`): the `(i, j)` entry of `L · Lᵀ` is `A i j`. -/ +theorem choleskyFn_dot_eq (A : Fin n → Fin n → ℝ) + (hpos : ∀ j : Fin n, 0 < Spec.choleskyFn A j j) {i j : Fin n} (hji : j.val ≤ i.val) : + (∑ k, Spec.choleskyFn A i k * Spec.choleskyFn A j k) = A i j := by + set L := Spec.choleskyFn A with hL + have key : ∀ k : Fin n, L i k * L j k + = (if k.val < j.val then L i k * L j k else 0) + (if k = j then L i j * L j j else 0) := by + intro k + rcases lt_trichotomy k.val j.val with h | h | h + · have hne : k ≠ j := fun hk => by rw [hk] at h; exact lt_irrefl _ h + rw [if_pos h, if_neg hne, add_zero] + · have hkj : k = j := Fin.ext h + rw [if_neg (by omega), if_pos hkj, zero_add, hkj] + · have hne : k ≠ j := fun hk => by rw [hk] at h; exact lt_irrefl _ h + rw [if_neg (by omega), if_neg hne, add_zero, + show L j k = 0 from Spec.Factorization.choleskyFn_lower_triangular A h, mul_zero] + rw [show (∑ k, L i k * L j k) + = ∑ k, ((if k.val < j.val then L i k * L j k else 0) + (if k = j then L i j * L j j else 0)) + from Finset.sum_congr rfl (fun k _ => key k), + Finset.sum_add_distrib, Finset.sum_ite_eq' Finset.univ j (fun _ => L i j * L j j)] + simp only [Finset.mem_univ, if_true] + rcases eq_or_lt_of_le hji with heq | hlt + · have hij' : i = j := Fin.ext heq.symm + subst hij' + have hrad : 0 < A i i - (∑ k, if k.val < i.val then L i k * L i k else 0) := by + have hp := hpos i + rw [hL, choleskyFn_diag_eq] at hp + exact Real.sqrt_pos.mp hp + have hsq : L i i * L i i = A i i - (∑ k, if k.val < i.val then L i k * L i k else 0) := by + conv_lhs => rw [hL, choleskyFn_diag_eq A i] + exact Real.mul_self_sqrt hrad.le + rw [hsq]; ring + · have hne : L j j ≠ 0 :=
ne_of_gt (hpos j) + have hmul : L i j * L j j + = A i j - (∑ k, if k.val < j.val then L i k * L j k else 0) := by + rw [hL, choleskyFn_offdiag_eq A hlt, div_mul_eq_mul_div, mul_div_assoc, div_self hne, mul_one] + rw [hmul]; ring + +/-- Per-entry reconstruction for all `(i, j)`, using symmetry of `A`. -/ +theorem choleskyFn_dot (A : Fin n → Fin n → ℝ) (hsymm : ∀ i j, A i j = A j i) + (hpos : ∀ j : Fin n, 0 < Spec.choleskyFn A j j) (i j : Fin n) : + (∑ k, Spec.choleskyFn A i k * Spec.choleskyFn A j k) = A i j := by + rcases le_total j.val i.val with h | h + · exact choleskyFn_dot_eq A hpos h + · rw [show (∑ k, Spec.choleskyFn A i k * Spec.choleskyFn A j k) + = ∑ k, Spec.choleskyFn A j k * Spec.choleskyFn A i k + from Finset.sum_congr rfl (fun k _ => mul_comm _ _), + choleskyFn_dot_eq A hpos h, hsymm j i] + +/-- **Exact Cholesky reconstruction.** For a symmetric `A` whose executable Cholesky pivots are all +positive (`0 < L[j,j]`, the success condition over `ℝ`), the factor `L = choleskyFn A` is a genuine +Cholesky factor: lower-triangular with `A = L · Lᵀ`. -/ +theorem isCholesky_of_pos (A : Fin n → Fin n → ℝ) (hsymm : ∀ i j, A i j = A j i) + (hpos : ∀ j : Fin n, 0 < Spec.choleskyFn A j j) : + Spec.Factorization.IsCholesky (Matrix.of A) (Matrix.of (Spec.choleskyFn A)) := by + refine ⟨?_, ?_⟩ + · intro a b hab + show Spec.choleskyFn A a b = 0 + exact Spec.Factorization.choleskyFn_lower_triangular A (Fin.lt_def.mp hab) + · ext i j + rw [Matrix.mul_apply] + simp only [Matrix.of_apply, Matrix.transpose_apply] + exact (choleskyFn_dot A hsymm hpos i j).symm + +/-- **Tensor-level Cholesky reconstruction.** For a symmetric tensor `A` whose `choleskySpec` pivots +are positive, every entry of `A` is reconstructed by `L · Lᵀ`: +`A[i,j] = Σ_k L[i,k] · L[j,k]`, with `L = choleskySpec A`. 
-/ +theorem choleskySpec_reconstruction (A : Spec.Tensor ℝ (.dim n (.dim n .scalar))) + (hsymm : ∀ i j, Spec.get2 A i j = Spec.get2 A j i) + (hpos : ∀ j : Fin n, 0 < Spec.get2 (Spec.choleskySpec A) j j) (i j : Fin n) : + Spec.get2 A i j + = ∑ k, Spec.get2 (Spec.choleskySpec A) i k * Spec.get2 (Spec.choleskySpec A) j k := by + have hg : ∀ a b, Spec.get2 (Spec.choleskySpec A) a b = Spec.choleskyFn (Spec.toMatFn A) a b := by + intro a b + rw [show Spec.choleskySpec A = Spec.ofMatFn (Spec.choleskyFn (Spec.toMatFn A)) from rfl, + Spec.Factorization.get2_ofMatFn] + simp only [hg] + show Spec.toMatFn A i j = _ + refine (choleskyFn_dot (Spec.toMatFn A) (fun a b => hsymm a b) (fun b => ?_) i j).symm + rw [← hg b b]; exact hpos b + +end Spec.Factorization.Reconstruction diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index 13f8df3..0297902 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -83,6 +83,29 @@ For the finite Cholesky construction, `choleskyFn_lower_triangular` (and its ten vanish by construction. The proof reads the column produced at each position out of the `List.foldl` that builds the factor, via the reusable indexing lemma `getD_foldl_finRange`. +# Exact Cholesky reconstruction + +Cholesky is a _finite_ construction, so unlike the iterative routines it admits an exact +reconstruction theorem — no residual, no convergence caveat. 
In +[`NN.Proofs.Tensor.Basic.FactorizationsReconstruction`](https://github.com/lean-dojo/TorchLean/blob/main/NN/Proofs/Tensor/Basic/FactorizationsReconstruction.lean), +`isCholesky_of_pos` proves that for a symmetric `A` whose executable pivots are all positive +(`0 < L[j,j]`, exactly the condition under which the algorithm succeeds over the reals) the factor +`L = choleskyFn A` is a genuine Cholesky factor: + +$$`L \text{ lower-triangular} \quad\text{and}\quad A = L\,L^\top.` + +The tensor-level corollary `choleskySpec_reconstruction` states the same per entry: +`A[i,j] = Σ_k L[i,k]·L[j,k]`. + +The proof turns the executable algorithm — a `List.foldl` that snocs one column per index — into +per-entry algebra. The reusable lemma `getD_foldl_snoc_read` reads the `j`-th column as the step +function applied to the length-`j` prefix; `prefix_eq_map` then identifies that prefix with the first +`j` columns of the final `L`, and `take_map_sum_eq` rewrites the code's `List.foldl` sums as masked +`Finset` partial sums. Lower-triangularity collapses the matrix product to a partial sum plus a single +pivot term, and the positive-pivot hypothesis discharges the two side conditions: `√` of a positive +radicand for the diagonal (`Real.mul_self_sqrt`) and a non-zero divisor for the below-diagonal +entries. Symmetry of `A` extends the lower-triangular reconstruction to the whole matrix. + # The a-posteriori residual certificate For the iterative routines, the replacement for an impossible a-priori convergence proof is an exact @@ -107,9 +130,12 @@ mass on specific matrices. # What remains -The exact algebraic reconstruction of the *finite* executable factorizations — `A = L · Lᵀ` for the -Cholesky column fold under positive pivots, and `A = Q · R` with `Qᵀ Q = 1` for Gram–Schmidt under -full column rank — is the natural next increment. 
It needs an induction relating the `List.foldl` -prefix at step `j` to the first `j` produced columns (a strengthening of `getD_foldl_finRange`) -together with the per-pivot positivity discharge from `Matrix.PosDef`. The specification-level facts -the kernel methods rely on are independent of that step, so the CHD foundation is already in place. +With Cholesky's exact reconstruction in place, the remaining finite-fold increment is the QR +factorization: `A = Q · R` from modified Gram–Schmidt under full column rank, and the orthonormality +`Qᵀ Q = 1`. The `A = Q · R` part is within reach of the same machinery, but `gramSchmidtFn` threads a +`GSState` that snocs onto _two_ lists at once (the `Q` columns and the `R` columns), so it needs read +lemmas for that dual-list structure-fold rather than the single-list `getD_foldl_snoc_read` used for +Cholesky. The orthonormality `Qᵀ Q = 1` is harder still: it rests on the Gram–Schmidt orthogonality +invariant, which Mathlib provides for its own `gramSchmidt` but not for this executable variant. The +specification-level facts the kernel methods rely on are independent of these steps, so the CHD +foundation is already in place. From eaee2bc1127ee0cd22d081be5b56d540fed9d2c9 Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sat, 30 May 2026 16:10:44 -0700 Subject: [PATCH 04/22] =?UTF-8?q?Add=20exact=20QR=20reconstruction=20A=20?= =?UTF-8?q?=3D=20Q=C2=B7R=20(finite-fold=20increment)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Prove the exact algebraic reconstruction of the finite executable QR (modified Gram–Schmidt) factorization over ℝ, extending the Cholesky work. `qr_mul_eq`: for A : Fin m → Fin n → ℝ whose executable R-pivots are positive (0 < Rmat A j j — full column rank, the success condition), the factors satisfy A = Q·R with R upper-triangular (`Rmat_upper_triangular`). `qrSpec_reconstruction` is the tensor-level corollary. 
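For orientation, the single-list read that the Cholesky proof uses, and that the `Q`-list projection reduces to, can be checked on a toy instance. The sketch below is illustrative only: `SnocToy`, `snocFold`, `g`, and `l` are hypothetical names that are not part of this patch, and it assumes nothing beyond core Lean 4 (no Mathlib).

```lean
namespace SnocToy

-- Hypothetical toy fold: append `g s a`, where `s` is the list built so far.
def snocFold (g : List Nat → Nat → Nat) (l : List Nat) : List Nat :=
  l.foldl (fun s a => s ++ [g s a]) []

-- Illustrative step function: length of the history plus the current input.
def g (s : List Nat) (a : Nat) : Nat := s.length + a

def l : List Nat := [10, 20, 30]

-- Left side of the read: element 2 of the full fold.
#eval (snocFold g l).getD 2 0                  -- 32
-- Right side: the step applied to the fold of the length-2 prefix and `l[2]`.
#eval g (snocFold g (l.take 2)) (l.getD 2 0)   -- 32

end SnocToy
```

Both evaluations agree, which is the content of `getD_foldl_snoc_read` at `k = 2` for this instance: the element written at position `k` depends only on the prefix built before it.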
Method: gramSchmidtFn threads a GSState that snocs onto both the Q-list and the R-list at once. The appended values depend only on the Q-history, so the Q-list is a single-list snoc-fold (`gs_proj_qs`, read by `getD_foldl_snoc_read`) and the R-list is the Q-prefix tail `rTail` (read by `gs_fold_split` + `rTail_getD`). The orthogonalization sum v = a − Σ rₖⱼqₖ (a List.zip fold) collapses to a map-fold (`cross_fold_eq`) and then a masked Finset partial sum (`take_map_sum_eq`); the positive pivot cancels the v/rⱼⱼ normalization exactly. Not done (documented, not sorry): orthonormality Qᵀ Q = 1, which rests on the Gram–Schmidt orthogonality invariant Mathlib only has for its own gramSchmidt. Blueprint: new "Exact QR reconstruction" section; "What remains" narrowed to the orthonormality invariant. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../Basic/FactorizationsReconstruction.lean | 362 +++++++++++++++++- .../Ch4_Verification/Factorizations.lean | 31 +- 2 files changed, 363 insertions(+), 30 deletions(-) diff --git a/NN/Proofs/Tensor/Basic/FactorizationsReconstruction.lean b/NN/Proofs/Tensor/Basic/FactorizationsReconstruction.lean index 386d77c..90cd5cb 100644 --- a/NN/Proofs/Tensor/Basic/FactorizationsReconstruction.lean +++ b/NN/Proofs/Tensor/Basic/FactorizationsReconstruction.lean @@ -12,37 +12,43 @@ public import Mathlib.Data.List.GetD public import Mathlib.Algebra.BigOperators.Fin /-! 
-# Exact reconstruction of the finite Cholesky factorization +# Exact reconstruction of the finite factorizations (Cholesky and QR) -This file proves the *exact* algebraic reconstruction of the finite executable Cholesky -factorization from [`NN.Spec.Core.Tensor.Factorizations`](../../../Spec/Core/Tensor/Factorizations.lean), +This file proves the *exact* algebraic reconstruction of the finite executable Cholesky and QR +factorizations from [`NN.Spec.Core.Tensor.Factorizations`](../../../Spec/Core/Tensor/Factorizations.lean), the increment promised in `NN.Proofs.Tensor.Basic.Factorizations`. Unlike the iterative Jacobi/SVD -routines (whose reconstruction is only an a-posteriori residual certificate), Cholesky is a *finite* -construction, so over `ℝ` it reconstructs its input on the nose under the success hypothesis. +routines (whose reconstruction is only an a-posteriori residual certificate), Cholesky and Gram–Schmidt +are *finite* constructions, so over `ℝ` they reconstruct their input on the nose under the success +hypotheses. -## Main result +## Main results -`isCholesky_of_pos`: for a symmetric `A : Fin n → Fin n → ℝ` whose executable Cholesky pivots are all -positive (`0 < choleskyFn A j j`, the exact condition under which the algorithm succeeds over `ℝ`), -the factor `L = choleskyFn A` satisfies the specification `Spec.Factorization.IsCholesky`: -it is lower-triangular and `A = L · Lᵀ`. `choleskySpec_reconstruction` is the tensor-level corollary. +* `isCholesky_of_pos`: for a symmetric `A : Fin n → Fin n → ℝ` whose executable Cholesky pivots are all + positive (`0 < choleskyFn A j j`, the exact condition under which the algorithm succeeds over `ℝ`), + the factor `L = choleskyFn A` satisfies the spec `Spec.Factorization.IsCholesky`: lower-triangular + and `A = L · Lᵀ`. `choleskySpec_reconstruction` is the tensor-level corollary. 
+* `qr_mul_eq`: for `A : Fin m → Fin n → ℝ` whose executable Gram–Schmidt `R`-pivots are positive + (`0 < Rmat A j j`, full column rank), the factors `Q = gramSchmidtFn A` and `R` satisfy `A = Q · R`, + with `R` upper-triangular (`Rmat_upper_triangular`). `qrSpec_reconstruction` is the tensor-level + corollary. ## Method -The executable factor is built by a `List.foldl` that snocs one column per index. The core technical -device is `getD_foldl_snoc_read`, a general lemma reading the `j`-th element of such a fold as the -step function applied to the length-`j` prefix. From it, `prefix_eq_map` identifies the prefix of -columns with the first `j` columns of the final factor `L`, and `take_map_sum_eq` turns the code's -`List.foldl` sums into masked `Finset` partial sums. The positive-pivot hypothesis discharges the two -side conditions (`√` radicand `> 0` for the diagonal, divisor `≠ 0` for the below-diagonal entries). +Each executable factor is built by a `List.foldl` that snocs one column per index. The core technical +device is `getD_foldl_snoc_read`, a general lemma reading the `j`-th element of such a fold as the step +function applied to the length-`j` prefix. From it, `prefix_eq_map`/`qsPrefix_eq_map` identify the +prefix with the first `j` columns of the final factor, and `take_map_sum_eq` turns the code's +`List.foldl` sums into masked `Finset` partial sums. The QR fold threads a `GSState` that snocs onto +*both* the `Q`-list and the `R`-list at once; `gs_proj_qs` and `gs_fold_split`/`rTail_getD` recover the +single-list read lemmas for each projection (the step depends only on the `Q`-history). The +positive-pivot hypotheses discharge the `√`-radicand and divisor side conditions. ## Scope -The QR factorization's exact reconstruction (`A = Q · R` from `gramSchmidtFn`, plus the orthonormality -`Qᵀ Q = 1`) is the remaining finite-fold increment. 
It needs analogous read lemmas for the -`GSState` *dual-list* structure-fold (the step writes both `qs` and `rcols`), and `Qᵀ Q = 1` -additionally requires the Gram–Schmidt orthogonality invariant, which Mathlib only provides for its -own `gramSchmidt`, not for this executable variant. +The one piece *not* proved is the orthonormality of the QR factor, `Qᵀ Q = 1`. Unlike `A = Q · R` +(which is a purely algebraic consequence of the orthogonalization step), it rests on the Gram–Schmidt +orthogonality invariant, which Mathlib provides for its own `gramSchmidt` but not for this executable +variant — so it stays the documented remaining increment, never a `sorry`. -/ @[expose] public section @@ -374,4 +380,318 @@ theorem choleskySpec_reconstruction (A : Spec.Tensor ℝ (.dim n (.dim n .scalar refine (choleskyFn_dot (Spec.toMatFn A) (fun a b => hsymm a b) (fun b => ?_) i j).symm rw [← hg b b]; exact hpos b +/-! ## QR (modified Gram–Schmidt): exact reconstruction `A = Q · R` + +`gramSchmidtFn` threads a `GSState` that snocs a column onto *both* the `Q`-list and the `R`-list at +each index. Crucially the appended values depend only on the `Q`-history (`st.qs`), never on the +`R`-history, so the `Q`-list is itself a single-list snoc-fold (`gs_proj_qs`) and the `R`-list is the +`Q`-prefix-indexed tail `rTail`. -/ + +section QR + +variable {m : Nat} + +open Spec (GSState) + +/-- Column `j` of `A` as a function of the row. -/ +noncomputable def gsA (A : Fin m → Fin n → ℝ) (j : Fin n) : Fin m → ℝ := fun i => A i j + +/-- The `R` off-diagonal entries `rₖⱼ = qₖ · a` for the columns `qs` built so far. -/ +noncomputable def gsRkjs (A : Fin m → Fin n → ℝ) (qs : List (Fin m → ℝ)) (j : Fin n) : List ℝ := + qs.map (fun qk => Spec.dotFn qk (gsA A j)) + +/-- The orthogonalized (not-yet-normalized) vector `v = a − Σ rₖⱼ qₖ`. 
-/ +noncomputable def gsV (A : Fin m → Fin n → ℝ) (qs : List (Fin m → ℝ)) (j : Fin n) : Fin m → ℝ := + fun i => gsA A j i + - (List.zip qs (gsRkjs A qs j)).foldl (fun acc (qk, r) => acc + r * qk i) 0 + +/-- The diagonal `R` entry `rⱼⱼ = ‖v‖`. -/ +noncomputable def gsRjj (A : Fin m → Fin n → ℝ) (qs : List (Fin m → ℝ)) (j : Fin n) : ℝ := + Spec.normFn (gsV A qs j) + +/-- The `Q` column appended at index `j`: `v / rⱼⱼ` (or `0` when `rⱼⱼ = 0`). -/ +noncomputable def qStep (A : Fin m → Fin n → ℝ) (qs : List (Fin m → ℝ)) (j : Fin n) : Fin m → ℝ := + fun i => if Context.gtBool (gsRjj A qs j) 0 then gsV A qs j i / gsRjj A qs j else 0 + +/-- The `R` column appended at index `j`: `rₖⱼ` above the diagonal, `rⱼⱼ` on it, `0` below. -/ +noncomputable def rStep (A : Fin m → Fin n → ℝ) (qs : List (Fin m → ℝ)) (j : Fin n) : Fin n → ℝ := + fun k => if k.val < j.val then (gsRkjs A qs j).getD k.val 0 + else if k.val == j.val then gsRjj A qs j else 0 + +/-- `gramSchmidtFn` as the dual-list snoc-fold appending `qStep`/`rStep`. -/ +theorem gramSchmidtFn_eq (A : Fin m → Fin n → ℝ) : + Spec.gramSchmidtFn A + = (List.finRange n).foldl + (fun st j => (⟨st.qs ++ [qStep A st.qs j], st.rcols ++ [rStep A st.qs j]⟩ : GSState m n ℝ)) + ⟨[], []⟩ := rfl + +/-- The `Q`-list projection of the structure fold is the single-list `qStep` snoc-fold. -/ +theorem gs_proj_qs (A : Fin m → Fin n → ℝ) (l : List (Fin n)) (q0 : List (Fin m → ℝ)) + (r0 : List (Fin n → ℝ)) : + (l.foldl (fun st j => (⟨st.qs ++ [qStep A st.qs j], st.rcols ++ [rStep A st.qs j]⟩ : GSState m n ℝ)) + ⟨q0, r0⟩).qs + = l.foldl (fun qs j => qs ++ [qStep A qs j]) q0 := by + induction l generalizing q0 r0 with + | nil => rfl + | cons a t ih => simp only [List.foldl_cons]; exact ih _ _ + +/-- The `Q` columns built before index `j`.
-/ +noncomputable def qsPrefix (A : Fin m → Fin n → ℝ) (j : Fin n) : List (Fin m → ℝ) := + ((List.finRange n).take j.val).foldl (fun qs k => qs ++ [qStep A qs k]) [] + +/-- The `R`-list tail: the `R` columns produced from `Q`-prefix `q0` over the indices `l`. -/ +noncomputable def rTail (A : Fin m → Fin n → ℝ) (q0 : List (Fin m → ℝ)) : List (Fin n) → + List (Fin n → ℝ) + | [] => [] + | j :: rest => rStep A q0 j :: rTail A (q0 ++ [qStep A q0 j]) rest + +/-- The structure fold splits into the `qStep` snoc-fold (`Q`-list) and the `rTail` (`R`-list). -/ +theorem gs_fold_split (A : Fin m → Fin n → ℝ) (l : List (Fin n)) (q0 : List (Fin m → ℝ)) + (r0 : List (Fin n → ℝ)) : + (l.foldl (fun st j => (⟨st.qs ++ [qStep A st.qs j], st.rcols ++ [rStep A st.qs j]⟩ : GSState m n ℝ)) + ⟨q0, r0⟩) + = ⟨l.foldl (fun qs j => qs ++ [qStep A qs j]) q0, r0 ++ rTail A q0 l⟩ := by + induction l generalizing q0 r0 with + | nil => simp [rTail] + | cons j rest ih => + simp only [List.foldl_cons, rTail] + rw [ih] + simp [List.append_assoc] + +/-- Reading the `k`-th element of `rTail` recovers `rStep` applied to the length-`k` `Q`-prefix. -/ +theorem rTail_getD (A : Fin m → Fin n → ℝ) (q0 : List (Fin m → ℝ)) (l : List (Fin n)) (k : Nat) + (hk : k < l.length) (d : Fin n → ℝ) : + (rTail A q0 l).getD k d + = rStep A ((l.take k).foldl (fun qs j => qs ++ [qStep A qs j]) q0) (l[k]'hk) := by + induction l generalizing q0 k with + | nil => simp at hk + | cons j rest ih => + cases k with + | zero => simp [rTail] + | succ k' => + simp only [rTail, List.getD_cons_succ, List.take_succ_cons, List.foldl_cons, + List.getElem_cons_succ] + exact ih (q0 ++ [qStep A q0 j]) k' (by simpa using hk) + +/-- Semantics of the `Context` `>` test over `ℝ`. -/ +theorem gtBool_true_iff {x y : ℝ} : Context.gtBool x y = true ↔ y < x := by + unfold Context.gtBool; exact decide_eq_true_iff + +/-- A left fold `acc + h x` accumulates the mapped list sum. 
-/ +theorem foldl_addf_eq_sum {β : Type _} (h : β → ℝ) (l : List β) (a : ℝ) : + l.foldl (fun acc x => acc + h x) a = a + (l.map h).sum := by + induction l generalizing a with + | nil => simp + | cons x t ih => rw [List.foldl_cons, ih, List.map_cons, List.sum_cons]; ring + +/-! ### Entries of the executable `Q` and `R` factors -/ + +/-- Entry `(i, k)` of the `Q` factor produced by `gramSchmidtFn`. -/ +noncomputable def Qmat (A : Fin m → Fin n → ℝ) (i : Fin m) (k : Fin n) : ℝ := + (Spec.gramSchmidtFn A).qs.getD k.val (fun _ => 0) i + +/-- Entry `(k, j)` of the `R` factor produced by `gramSchmidtFn`. -/ +noncomputable def Rmat (A : Fin m → Fin n → ℝ) (k j : Fin n) : ℝ := + (Spec.gramSchmidtFn A).rcols.getD j.val (fun _ => 0) k + +/-- Column `k` of `Q` as a function of the row. -/ +noncomputable def Qcol (A : Fin m → Fin n → ℝ) (k : Fin n) : Fin m → ℝ := fun r => Qmat A r k + +/-- Closed form of a `Q` entry: `qStep` evaluated on the `Q`-prefix. -/ +theorem Qmat_eq (A : Fin m → Fin n → ℝ) (i : Fin m) (k : Fin n) : + Qmat A i k = qStep A (qsPrefix A k) k i := by + have hqs : (Spec.gramSchmidtFn A).qs + = (List.finRange n).foldl (fun qs j => qs ++ [qStep A qs j]) [] := by + rw [gramSchmidtFn_eq]; exact gs_proj_qs A (List.finRange n) [] [] + unfold Qmat + rw [hqs, getD_foldl_snoc_read (fun qs j => qStep A qs j) (fun _ => 0) (List.finRange n) k.val + (by rw [List.length_finRange]; exact k.isLt)] + have hk : (List.finRange n)[k.val]'(by rw [List.length_finRange]; exact k.isLt) = k := by + simp [List.getElem_finRange] + rw [hk]; rfl + +/-- Closed form of an `R` entry: `rStep` evaluated on the `Q`-prefix. 
-/ +theorem Rmat_eq (A : Fin m → Fin n → ℝ) (k j : Fin n) : + Rmat A k j = rStep A (qsPrefix A j) j k := by + have hrc : (Spec.gramSchmidtFn A).rcols = rTail A [] (List.finRange n) := by + rw [gramSchmidtFn_eq, gs_fold_split]; simp + unfold Rmat + rw [hrc, rTail_getD A [] (List.finRange n) j.val (by rw [List.length_finRange]; exact j.isLt)] + have hk : (List.finRange n)[j.val]'(by rw [List.length_finRange]; exact j.isLt) = j := by + simp [List.getElem_finRange] + rw [hk]; rfl + +/-- `R` is upper-triangular: entries strictly below the diagonal vanish. -/ +theorem rStep_above (A : Fin m → Fin n → ℝ) (qs : List (Fin m → ℝ)) {j k : Fin n} + (hjk : j.val < k.val) : rStep A qs j k = 0 := by + simp only [rStep]; rw [if_neg (by omega), if_neg (by rw [beq_iff_eq]; omega)] + +/-- The diagonal `R` entry is `rⱼⱼ`. -/ +theorem rStep_diag (A : Fin m → Fin n → ℝ) (qs : List (Fin m → ℝ)) (j : Fin n) : + rStep A qs j j = gsRjj A qs j := by + simp only [rStep]; rw [if_neg (lt_irrefl _), if_pos (beq_self_eq_true _)] + +/-- The `Q` column when the pivot is positive: `qⱼ = v / rⱼⱼ`. -/ +theorem qStep_pos (A : Fin m → Fin n → ℝ) (qs : List (Fin m → ℝ)) (j : Fin n) + (h : 0 < gsRjj A qs j) (i : Fin m) : + qStep A qs j i = gsV A qs j i / gsRjj A qs j := by + simp only [qStep]; rw [if_pos (gtBool_true_iff.mpr h)] + +/-! ### The orthogonalization sum as a `Finset` sum -/ + +set_option linter.unusedSimpArgs false in +/-- The zip-fold defining `v` collapses to a single map-fold over the `Q` columns. -/ +theorem cross_fold_eq (qs : List (Fin m → ℝ)) (g : (Fin m → ℝ) → ℝ) (i : Fin m) (a : ℝ) : + (List.zip qs (qs.map g)).foldl (fun acc (qk, r) => acc + r * qk i) a + = a + (qs.map (fun qk => g qk * qk i)).sum := by + induction qs generalizing a with + | nil => simp + | cons x xs ih => + simp only [List.map_cons, List.zip_cons_cons, List.foldl_cons] + rw [ih]; simp only [List.map_cons, List.sum_cons]; ring + +/-- Closed form of `v i`: `A i j` minus the partial projection sum. 
-/ +theorem gsV_eq (A : Fin m → Fin n → ℝ) (qs : List (Fin m → ℝ)) (j : Fin n) (i : Fin m) : + gsV A qs j i = gsA A j i - (qs.map (fun qk => Spec.dotFn qk (gsA A j) * qk i)).sum := by + unfold gsV gsRkjs + rw [cross_fold_eq qs (fun qk => Spec.dotFn qk (gsA A j)) i 0, zero_add] + +/-- Length of the `Q`-prefix list. -/ +theorem qsPrefix_length (A : Fin m → Fin n → ℝ) (j : Fin n) : (qsPrefix A j).length = j.val := by + unfold qsPrefix + rw [length_foldl_snoc (fun qs k => qStep A qs k), List.length_nil, Nat.zero_add, List.length_take, + List.length_finRange, Nat.min_eq_left (le_of_lt j.isLt)] + +/-- The `Q`-prefix is exactly the first `j` columns of the final factor `Q`. -/ +theorem qsPrefix_eq_map (A : Fin m → Fin n → ℝ) (j : Fin n) : + qsPrefix A j = ((List.finRange n).take j.val).map (fun k => Qcol A k) := by + have hjval : ((List.finRange n).take j.val).length = j.val := by + rw [List.length_take, List.length_finRange, Nat.min_eq_left (le_of_lt j.isLt)] + apply List.ext_getElem + · unfold qsPrefix + rw [length_foldl_snoc (fun qs k => qStep A qs k), List.length_nil, Nat.zero_add, + List.length_map] + · intro p h1 h2 + rw [List.length_map, hjval] at h2 + have hpn : p < n := lt_trans h2 j.isLt + rw [List.getElem_map] + have hidx : ((List.finRange n).take j.val)[p]'(by rw [hjval]; exact h2) = (⟨p, hpn⟩ : Fin n) := by + rw [List.getElem_take, List.getElem_finRange]; exact Fin.ext rfl + rw [show (qsPrefix A j)[p]'h1 = (qsPrefix A j).getD p (fun _ => 0) from + (List.getD_eq_getElem _ _ h1).symm] + unfold qsPrefix + rw [getD_foldl_snoc_read (fun qs k => qStep A qs k) (fun _ => 0) + ((List.finRange n).take j.val) p (by rw [hjval]; exact h2)] + rw [List.take_take, Nat.min_eq_left (le_of_lt h2), hidx] + funext r + rw [show Qcol A (⟨p, hpn⟩ : Fin n) r = Qmat A r ⟨p, hpn⟩ from rfl, Qmat_eq] + rfl + +/-- `getD` commutes with `dotFn`-mapping when the index is in range. 
-/ +theorem getD_map_dotFn (qs : List (Fin m → ℝ)) (a : Fin m → ℝ) (k : Nat) (hk : k < qs.length) : + (qs.map (fun qk => Spec.dotFn qk a)).getD k 0 = Spec.dotFn (qs.getD k (fun _ => 0)) a := by + rw [List.getD_eq_getElem _ _ (by rw [List.length_map]; exact hk), List.getElem_map, + List.getD_eq_getElem _ _ hk] + +/-- A `Q`-prefix entry equals the final `Q` column at that index. -/ +theorem qsPrefix_getD (A : Fin m → Fin n → ℝ) {k j : Fin n} (hkj : k.val < j.val) : + (qsPrefix A j).getD k.val (fun _ => 0) = Qcol A k := by + rw [qsPrefix_eq_map, + List.getD_eq_getElem _ _ (by rw [List.length_map, List.length_take, List.length_finRange, + Nat.min_eq_left (le_of_lt j.isLt)]; exact hkj), + List.getElem_map] + congr 1 + rw [List.getElem_take, List.getElem_finRange]; exact Fin.ext rfl + +/-- The strictly above-diagonal `R` entry (`k < j`) is the inner product of the corresponding `Q` column with column `j` of `A`. -/ +theorem R_below (A : Fin m → Fin n → ℝ) {k j : Fin n} (hkj : k.val < j.val) : + Rmat A k j = Spec.dotFn (Qcol A k) (gsA A j) := by + rw [Rmat_eq]; simp only [rStep]; rw [if_pos hkj]; unfold gsRkjs + rw [getD_map_dotFn (qsPrefix A j) (gsA A j) k.val (by rw [qsPrefix_length]; exact hkj), + qsPrefix_getD A hkj] + +/-- The projection sum equals the masked partial sum `Σ_{k<j} R[k,j]·Q[i,k]`. -/ +theorem cross_sum_qr (A : Fin m → Fin n → ℝ) (i : Fin m) (j : Fin n) : + ((qsPrefix A j).map (fun qk => Spec.dotFn qk (gsA A j) * qk i)).sum + = ∑ k, if k.val < j.val then Rmat A k j * Qmat A i k else 0 := by + rw [qsPrefix_eq_map] + rw [List.map_map] + rw [take_map_sum_eq] + apply Finset.sum_congr rfl + intro k _ + by_cases hkj : k.val < j.val + · rw [if_pos hkj, if_pos hkj] + show Spec.dotFn (Qcol A k) (gsA A j) * Qmat A i k = Rmat A k j * Qmat A i k + rw [R_below A hkj] + · rw [if_neg hkj, if_neg hkj] + +/-! ### Exact reconstruction `A = Q · R` -/ + +/-- `R` is upper-triangular: entries strictly below the diagonal vanish. 
-/ +theorem Rmat_upper_triangular (A : Fin m → Fin n → ℝ) {k j : Fin n} (hjk : j.val < k.val) : + Rmat A k j = 0 := by + rw [Rmat_eq]; exact rStep_above A (qsPrefix A j) hjk + +/-- **Per-entry QR reconstruction.** When every `R` pivot is positive (`0 < R[j,j]`, the full +column-rank success condition), `A[i,j] = Σ_k Q[i,k]·R[k,j]`. -/ +theorem qr_reconstruction (A : Fin m → Fin n → ℝ) (hrank : ∀ j : Fin n, 0 < Rmat A j j) + (i : Fin m) (j : Fin n) : + A i j = ∑ k, Qmat A i k * Rmat A k j := by + have key : ∀ k : Fin n, Qmat A i k * Rmat A k j + = (if k.val < j.val then Qmat A i k * Rmat A k j else 0) + + (if k = j then Qmat A i j * Rmat A j j else 0) := by + intro k + rcases lt_trichotomy k.val j.val with h | h | h + · have hne : k ≠ j := fun hk => by rw [hk] at h; exact lt_irrefl _ h + rw [if_pos h, if_neg hne, add_zero] + · have hkj : k = j := Fin.ext h + rw [if_neg (by omega), if_pos hkj, zero_add, hkj] + · have hne : k ≠ j := fun hk => by rw [hk] at h; exact lt_irrefl _ h + rw [if_neg (by omega), if_neg hne, add_zero, Rmat_upper_triangular A h, mul_zero] + rw [show (∑ k, Qmat A i k * Rmat A k j) + = ∑ k, ((if k.val < j.val then Qmat A i k * Rmat A k j else 0) + + (if k = j then Qmat A i j * Rmat A j j else 0)) + from Finset.sum_congr rfl (fun k _ => key k), + Finset.sum_add_distrib, Finset.sum_ite_eq' Finset.univ j (fun _ => Qmat A i j * Rmat A j j)] + simp only [Finset.mem_univ, if_true] + have hρpos : 0 < gsRjj A (qsPrefix A j) j := by + have h := hrank j; rwa [Rmat_eq, rStep_diag] at h + have hdiag : Qmat A i j * Rmat A j j = gsV A (qsPrefix A j) j i := by + rw [Qmat_eq, qStep_pos A (qsPrefix A j) j hρpos, + show Rmat A j j = gsRjj A (qsPrefix A j) j from by rw [Rmat_eq]; exact rStep_diag _ _ j, + div_mul_eq_mul_div, mul_div_assoc, div_self (ne_of_gt hρpos), mul_one] + rw [hdiag, gsV_eq, cross_sum_qr, + show gsA A j i = A i j from rfl, + show (∑ k, if k.val < j.val then Qmat A i k * Rmat A k j else 0) + = (∑ k, if k.val < j.val then Rmat A k j * Qmat A i k 
else 0) + from Finset.sum_congr rfl (fun k _ => by + by_cases hkj : k.val < j.val + · rw [if_pos hkj, if_pos hkj, mul_comm] + · rw [if_neg hkj, if_neg hkj])] + ring + +/-- **Matrix-level QR reconstruction.** `A = Q · R` for the executable Gram–Schmidt factors, +under positive `R` pivots (full column rank). -/ +theorem qr_mul_eq (A : Fin m → Fin n → ℝ) (hrank : ∀ j : Fin n, 0 < Rmat A j j) : + Matrix.of A = Matrix.of (fun i k => Qmat A i k) * Matrix.of (fun k j => Rmat A k j) := by + ext i j + rw [Matrix.mul_apply] + simp only [Matrix.of_apply] + exact qr_reconstruction A hrank i j + +/-- **Tensor-level QR reconstruction.** For a tensor `A` whose `qrSpec` `R`-pivots are positive +(full column rank), every entry of `A` is reconstructed by `Q · R`: +`A[i,j] = Σ_k Q[i,k]·R[k,j]`, with `Q = qrQSpec A`, `R = qrRSpec A`. -/ +theorem qrSpec_reconstruction (A : Spec.Tensor ℝ (.dim m (.dim n .scalar))) + (hrank : ∀ j : Fin n, 0 < Spec.get2 (Spec.qrRSpec A) j j) (i : Fin m) (j : Fin n) : + Spec.get2 A i j + = ∑ k, Spec.get2 (Spec.qrQSpec A) i k * Spec.get2 (Spec.qrRSpec A) k j := by + have hQ : ∀ a b, Spec.get2 (Spec.qrQSpec A) a b = Qmat (Spec.toMatFn A) a b := fun _ _ => rfl + have hR : ∀ a b, Spec.get2 (Spec.qrRSpec A) a b = Rmat (Spec.toMatFn A) a b := fun _ _ => rfl + simp only [hQ, hR] + show Spec.toMatFn A i j = _ + exact qr_reconstruction (Spec.toMatFn A) (fun b => by rw [← hR b b]; exact hrank b) i j + +end QR + end Spec.Factorization.Reconstruction diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index 0297902..2fdaf92 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -128,14 +128,27 @@ zero-residual limit, `isSymEig_of_diagonal` shows the solver output `(diag Af, V `NN/Examples/Factorization` are concrete instances of this certificate: they 
bound the off-diagonal mass on specific matrices. +# Exact QR reconstruction + +The QR factorization admits the same treatment. `qr_mul_eq` (in the same file) proves that for an +`A` whose executable Gram–Schmidt `R`-pivots are all positive (`0 < R[j,j]`, the full-column-rank +success condition) the factors satisfy + +$$`R \text{ upper-triangular} \quad\text{and}\quad A = Q\,R,` + +with `qrSpec_reconstruction` the tensor-level corollary. The new wrinkle is that `gramSchmidtFn` +threads a `GSState` that snocs onto _two_ lists at once — the `Q` columns and the `R` columns. Because +the appended values depend only on the `Q`-history, the `Q`-list is itself a single-list snoc-fold +(`gs_proj_qs`, read by `getD_foldl_snoc_read` as for Cholesky), and the `R`-list is the `Q`-prefix +tail `rTail`, read by `gs_fold_split` together with `rTail_getD`. The orthogonalization sum +`v = a − Σ rₖⱼ qₖ`, a fold over `List.zip`, collapses to a single map-fold (`cross_fold_eq`) and then +to a masked `Finset` partial sum, after which the positive-pivot hypothesis cancels the `v / rⱼⱼ` +normalization exactly. + # What remains -With Cholesky's exact reconstruction in place, the remaining finite-fold increment is the QR -factorization: `A = Q · R` from modified Gram–Schmidt under full column rank, and the orthonormality -`Qᵀ Q = 1`. The `A = Q · R` part is within reach of the same machinery, but `gramSchmidtFn` threads a -`GSState` that snocs onto _two_ lists at once (the `Q` columns and the `R` columns), so it needs read -lemmas for that dual-list structure-fold rather than the single-list `getD_foldl_snoc_read` used for -Cholesky. The orthonormality `Qᵀ Q = 1` is harder still: it rests on the Gram–Schmidt orthogonality -invariant, which Mathlib provides for its own `gramSchmidt` but not for this executable variant. The -specification-level facts the kernel methods rely on are independent of these steps, so the CHD -foundation is already in place. 
+The one finite-fold property still open is the orthonormality of the QR factor, `Qᵀ Q = 1`. Unlike +`A = Q · R` — a purely algebraic consequence of the orthogonalization step, proved above — it rests on +the Gram–Schmidt orthogonality invariant, which Mathlib provides for its own `gramSchmidt` but not for +this executable variant. The specification-level facts the kernel methods rely on are independent of +that step, so the CHD foundation is already in place. From 31450f3ff8263dd77e11ca7198792cfe808fc330 Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sat, 30 May 2026 16:48:17 -0700 Subject: [PATCH 05/22] Add QR orthonormality QᵀQ=1 via Mathlib gramSchmidt bridge MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes the last open finite-fold property: orthonormality of the executable Gram–Schmidt Q factor. Rather than re-derive the orthogonality induction, the new file NN/Proofs/Tensor/Basic/FactorizationsOrthonormal.lean unifies the executable variant with Mathlib's gramSchmidt. Reading columns of A as EuclideanSpace ℝ (Fin m) vectors (gsCol), Qcol_bridge proves by strong induction that the j-th executable Q column equals gramSchmidtNormed ℝ (gsCol A) j; orthonormality then follows from Mathlib's gramSchmidtNormed_orthonormal'. Yields Q_orthonormal (qₐ·q_b = δₐᵦ), QT_mul_Q_eq_one, the full IsQR predicate isQR_of_pos, and the tensor-level qrSpec_orthonormal. Three reusable connectors over ℝ — dotFn_eq_inner, normFn_eq_norm, proj_normalize — are stated generally enough to lift into a future Mathlib matrix-level QR contribution. Blueprint chapter and reconstruction docstring updated: only the iterative Jacobi/SVD convergence now remains (residual certificate only). Sorry-free; NN.Examples.Factorization still reconstructs at err 0.000000, with the empirical Qᵀ·Q = I check now formally backed. 
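Usage sketch (illustrative only, not part of this patch): specializing the tensor-level statement `qrSpec_orthonormal` to `a = b` should give unit squared norm for every column of the executable `Q` factor. It assumes the new file can be imported by module name from a downstream file and that the theorem keeps the signature stated in this commit; the `example` itself is hypothetical and has not been checked against the build.

```lean
import NN.Proofs.Tensor.Basic.FactorizationsOrthonormal

open Spec.Factorization.Reconstruction

-- Hypothetical downstream use: each column of `qrQSpec A` has unit squared norm,
-- obtained by instantiating `qrSpec_orthonormal` at `a = b`.
example {m n : Nat} (A : Spec.Tensor ℝ (.dim m (.dim n .scalar)))
    (hrank : ∀ j : Fin n, 0 < Spec.get2 (Spec.qrRSpec A) j j) (a : Fin n) :
    (∑ i, Spec.get2 (Spec.qrQSpec A) i a * Spec.get2 (Spec.qrQSpec A) i a) = 1 := by
  -- `h : (∑ i, Q[i,a] * Q[i,a]) = if a = a then 1 else 0`
  have h := qrSpec_orthonormal A hrank a a
  -- Collapse the `if a = a` branch, then close the goal with `h`.
  rwa [if_pos rfl] at h
```

The only hypothesis needed is the positive-pivot condition `hrank`, the same full-column-rank assumption used by `qrSpec_reconstruction`.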
Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Proofs/Tensor/Basic.lean | 1 + .../Basic/FactorizationsOrthonormal.lean | 252 ++++++++++++++++++ .../Basic/FactorizationsReconstruction.lean | 9 +- .../Ch4_Verification/Factorizations.lean | 29 +- 4 files changed, 282 insertions(+), 9 deletions(-) create mode 100644 NN/Proofs/Tensor/Basic/FactorizationsOrthonormal.lean diff --git a/NN/Proofs/Tensor/Basic.lean b/NN/Proofs/Tensor/Basic.lean index 9e44d53..fb2f877 100644 --- a/NN/Proofs/Tensor/Basic.lean +++ b/NN/Proofs/Tensor/Basic.lean @@ -11,6 +11,7 @@ public import NN.Proofs.Tensor.Basic.Folds public import NN.Proofs.Tensor.Basic.LinearAlgebra public import NN.Proofs.Tensor.Basic.Factorizations public import NN.Proofs.Tensor.Basic.FactorizationsReconstruction +public import NN.Proofs.Tensor.Basic.FactorizationsOrthonormal public import NN.Proofs.Tensor.Basic.BoundsNorms public import NN.Proofs.Tensor.Basic.Algebra diff --git a/NN/Proofs/Tensor/Basic/FactorizationsOrthonormal.lean b/NN/Proofs/Tensor/Basic/FactorizationsOrthonormal.lean new file mode 100644 index 0000000..f45a8f6 --- /dev/null +++ b/NN/Proofs/Tensor/Basic/FactorizationsOrthonormal.lean @@ -0,0 +1,252 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Proofs.Tensor.Basic.FactorizationsReconstruction +public import Mathlib.Analysis.InnerProductSpace.GramSchmidtOrtho +public import Mathlib.Analysis.InnerProductSpace.PiL2 + +/-! +# Orthonormality of the executable Gram–Schmidt `Q` factor (`Qᵀ Q = 1`) + +This file closes the one finite-fold property left open by +[`NN.Proofs.Tensor.Basic.FactorizationsReconstruction`](FactorizationsReconstruction.lean): the +orthonormality of the `Q` factor produced by the executable modified Gram–Schmidt `gramSchmidtFn`. + +The strategy is to **unify the executable variant with Mathlib's `gramSchmidt`** rather than re-derive +the orthogonality induction by hand. 
Reading the columns of `A` as vectors of +`EuclideanSpace ℝ (Fin m)`, the `j`-th executable `Q` column equals Mathlib's `gramSchmidtNormed ℝ` +of the column map (`Qcol_bridge`), so the orthonormality follows from Mathlib's +`gramSchmidtNormed_orthonormal'`. + +## Main results + +* `Qcol_bridge`: `WithLp.toLp 2 (Qcol A k) = gramSchmidtNormed ℝ (gsCol A) k` — the executable `Q` + column is Mathlib's normalized Gram–Schmidt vector, proved by strong induction on `k`. +* `Q_orthonormal`: `dotFn (Qcol A a) (Qcol A b) = if a = b then 1 else 0` under positive `R` pivots. +* `QT_mul_Q_eq_one` and `isQR_of_pos`: the matrix-level `Qᵀ Q = 1` and the full + `Spec.Factorization.IsQR` predicate for the executable factors (combining with the reconstruction + `A = Q · R` and `R` upper-triangular from the companion file). +* `qrSpec_orthonormal`: the tensor-level corollary. + +## Method + +The bridge rests on three connectors over `ℝ`: `dotFn = ⟪·,·⟫` and `normFn = ‖·‖` on +`EuclideanSpace ℝ (Fin m)`, and the projection identity `proj_normalize` showing the un-normalized +Gram–Schmidt projection term equals the normalized one. The strong induction feeds the partial +identification of the earlier `Q` columns into `gramSchmidt_def''`, term by term. +-/ + +@[expose] public section + +namespace Spec.Factorization.Reconstruction + +open Matrix +open scoped BigOperators RealInnerProductSpace +open InnerProductSpace + +/-! ## Connectors between the executable scalar ops and the Euclidean inner product -/ + +/-- `dotFn` as a `Finset` sum. -/ +theorem dotFn_eq_sum {p : Nat} (u v : Fin p → ℝ) : Spec.dotFn u v = ∑ i, u i * v i := by + unfold Spec.dotFn + rw [foldl_addf_eq_sum (fun i => u i * v i) (List.finRange p) 0, zero_add, + ← finsum_eq_finRange_sum (fun i => u i * v i)] + +/-- The executable dot product is the Euclidean inner product over `ℝ`. 
-/ +theorem dotFn_eq_inner {p : Nat} (u v : Fin p → ℝ) : + Spec.dotFn u v + = ⟪(WithLp.toLp 2 u : EuclideanSpace ℝ (Fin p)), WithLp.toLp 2 v⟫_ℝ := by + rw [dotFn_eq_sum, PiLp.inner_apply] + apply Finset.sum_congr rfl + intro i _ + rw [RCLike.inner_apply', PiLp.toLp_apply, PiLp.toLp_apply] + simp + +/-- The executable Euclidean norm is the `EuclideanSpace` norm over `ℝ`. -/ +theorem normFn_eq_norm {p : Nat} (v : Fin p → ℝ) : + Spec.normFn v = ‖(WithLp.toLp 2 v : EuclideanSpace ℝ (Fin p))‖ := by + rw [Spec.normFn, mfsqrt_eq, EuclideanSpace.norm_eq] + congr 1 + rw [dotFn_eq_sum] + apply Finset.sum_congr rfl + intro i _ + rw [PiLp.toLp_apply, Real.norm_eq_abs, sq_abs, sq] + +/-- The Gram–Schmidt projection term, with the normalized vector pulled out. Holds with no +non-degeneracy hypothesis (both sides vanish when `w = 0`, e.g. a zero `gramSchmidt` vector). -/ +theorem proj_normalize {F : Type*} [NormedAddCommGroup F] [InnerProductSpace ℝ F] (w x : F) : + (⟪w, x⟫_ℝ / ‖w‖ ^ 2) • w = ⟪‖w‖⁻¹ • w, x⟫_ℝ • (‖w‖⁻¹ • w) := by + rw [real_inner_smul_left, smul_smul] + congr 1 + rw [div_eq_mul_inv, ← inv_pow, sq] + ring + +/-- `gramSchmidtNormed` over `ℝ`, with the scalar coercion removed. -/ +theorem gn_eq {n : Nat} {F : Type*} [NormedAddCommGroup F] [InnerProductSpace ℝ F] + (f : Fin n → F) (i : Fin n) : + gramSchmidtNormed ℝ f i = ‖gramSchmidt ℝ f i‖⁻¹ • gramSchmidt ℝ f i := by + rw [gramSchmidtNormed] + norm_num + +/-- A masked full sum equals the sum over `Iio`. -/ +theorem sum_Iio_eq_mask {n : Nat} (k : Fin n) (h : Fin n → ℝ) : + ∑ i ∈ Finset.Iio k, h i = ∑ i, if i.val < k.val then h i else 0 := by + rw [← Finset.sum_filter] + congr 1 + ext i + simp only [Finset.mem_Iio, Finset.mem_filter, Finset.mem_univ, true_and, Fin.lt_def] + +/-! ## The bridge to Mathlib's `gramSchmidt` -/ + +section QR + +variable {m n : Nat} + +/-- Column `j` of `A` as a vector of `EuclideanSpace ℝ (Fin m)`. 
-/ +noncomputable def gsCol (A : Fin m → Fin n → ℝ) (j : Fin n) : EuclideanSpace ℝ (Fin m) := + WithLp.toLp 2 (gsA A j) + +/-- `gsCol A k` reads as the executable column `gsA A k`. -/ +theorem gsCol_apply (A : Fin m → Fin n → ℝ) (k : Fin n) (r : Fin m) : + gsCol A k r = gsA A k r := rfl + +/-- **Orthogonalized-vector bridge.** Given that the earlier `Q` columns coincide with Mathlib's +normalized Gram–Schmidt vectors, the executable orthogonalized vector `v` at index `k` equals +Mathlib's (un-normalized) `gramSchmidt` vector. -/ +theorem gsV_bridge (A : Fin m → Fin n → ℝ) (k : Fin n) + (ih : ∀ i : Fin n, i.val < k.val → + (WithLp.toLp 2 (Qcol A i) : EuclideanSpace ℝ (Fin m)) = gramSchmidtNormed ℝ (gsCol A) i) : + gramSchmidt ℝ (gsCol A) k = WithLp.toLp 2 (gsV A (qsPrefix A k) k) := by + -- Rewrite Mathlib's vector via the explicit recurrence. + rw [show gramSchmidt ℝ (gsCol A) k + = gsCol A k - ∑ i ∈ Finset.Iio k, + (⟪gramSchmidt ℝ (gsCol A) i, gsCol A k⟫_ℝ / ‖gramSchmidt ℝ (gsCol A) i‖ ^ 2) + • gramSchmidt ℝ (gsCol A) i + from eq_sub_of_add_eq (gramSchmidt_def'' ℝ (gsCol A) k).symm] + -- Replace each projection term by the normalized form, then by the executable `Q` column. + have hproj : ∀ i ∈ Finset.Iio k, + (⟪gramSchmidt ℝ (gsCol A) i, gsCol A k⟫_ℝ / ‖gramSchmidt ℝ (gsCol A) i‖ ^ 2) + • gramSchmidt ℝ (gsCol A) i + = ⟪(WithLp.toLp 2 (Qcol A i) : EuclideanSpace ℝ (Fin m)), gsCol A k⟫_ℝ + • (WithLp.toLp 2 (Qcol A i) : EuclideanSpace ℝ (Fin m)) := by + intro i hi + have hik : i < k := Finset.mem_Iio.mp hi + rw [proj_normalize (gramSchmidt ℝ (gsCol A) i) (gsCol A k), ← gn_eq, ih i hik] + rw [Finset.sum_congr rfl hproj] + -- Compare entrywise. + ext r + rw [PiLp.sub_apply] + show gsCol A k r - _ = gsV A (qsPrefix A k) k r + rw [gsV_eq, gsCol_apply] + congr 1 + -- The Euclidean `Iio` sum, applied at `r`, equals the executable list projection sum. 
+ rw [WithLp.ofLp_sum, Finset.sum_apply] + rw [show ∑ i ∈ Finset.Iio k, + (WithLp.ofLp (⟪(WithLp.toLp 2 (Qcol A i) : EuclideanSpace ℝ (Fin m)), gsCol A k⟫_ℝ + • (WithLp.toLp 2 (Qcol A i) : EuclideanSpace ℝ (Fin m)))) r + = ∑ i ∈ Finset.Iio k, Spec.dotFn (Qcol A i) (gsA A k) * Qcol A i r from by + apply Finset.sum_congr rfl + intro i _ + rw [show WithLp.ofLp (⟪(WithLp.toLp 2 (Qcol A i) : EuclideanSpace ℝ (Fin m)), gsCol A k⟫_ℝ + • (WithLp.toLp 2 (Qcol A i) : EuclideanSpace ℝ (Fin m))) r + = ⟪(WithLp.toLp 2 (Qcol A i) : EuclideanSpace ℝ (Fin m)), gsCol A k⟫_ℝ • Qcol A i r + from rfl, smul_eq_mul, gsCol, ← dotFn_eq_inner]] + rw [sum_Iio_eq_mask, qsPrefix_eq_map, List.map_map, take_map_sum_eq] + rfl + +/-- **Normalized-column bridge.** The executable `Q` column at index `k` equals Mathlib's +`gramSchmidtNormed`. Proved by strong induction on `k`, under positive `R` pivots (full column rank). -/ +theorem Qcol_bridge (A : Fin m → Fin n → ℝ) (hrank : ∀ j : Fin n, 0 < Rmat A j j) : + ∀ k : Fin n, + (WithLp.toLp 2 (Qcol A k) : EuclideanSpace ℝ (Fin m)) = gramSchmidtNormed ℝ (gsCol A) k := by + have main : ∀ N : Nat, ∀ k : Fin n, k.val = N → + (WithLp.toLp 2 (Qcol A k) : EuclideanSpace ℝ (Fin m)) = gramSchmidtNormed ℝ (gsCol A) k := by + intro N + induction N using Nat.strong_induction_on with + | _ N ih => + intro k hk + have IH : ∀ i : Fin n, i.val < k.val → + (WithLp.toLp 2 (Qcol A i) : EuclideanSpace ℝ (Fin m)) = gramSchmidtNormed ℝ (gsCol A) i := + fun i hi => ih i.val (hk ▸ hi) i rfl + have hρpos : 0 < gsRjj A (qsPrefix A k) k := by + have h := hrank k; rwa [Rmat_eq, rStep_diag] at h + have hgsV := gsV_bridge A k IH + rw [gn_eq, hgsV] + ext r + rw [PiLp.smul_apply, PiLp.toLp_apply, PiLp.toLp_apply, smul_eq_mul] + show Qcol A k r = _ + rw [show Qcol A k r = Qmat A r k from rfl, Qmat_eq, qStep_pos A (qsPrefix A k) k hρpos, + ← normFn_eq_norm] + show gsV A (qsPrefix A k) k r / gsRjj A (qsPrefix A k) k + = (Spec.normFn (gsV A (qsPrefix A k) k))⁻¹ * gsV A (qsPrefix A k) k r 
+ rw [show Spec.normFn (gsV A (qsPrefix A k) k) = gsRjj A (qsPrefix A k) k from rfl, + div_eq_mul_inv, mul_comm] + exact fun k => main k.val k rfl + +/-! ## Orthonormality `Qᵀ Q = 1` -/ + +/-- Each normalized Gram–Schmidt vector is non-zero (the pivot is positive). -/ +theorem gn_ne_zero (A : Fin m → Fin n → ℝ) (hrank : ∀ j : Fin n, 0 < Rmat A j j) (j : Fin n) : + gramSchmidtNormed ℝ (gsCol A) j ≠ 0 := by + have hpos : 0 < ‖gramSchmidt ℝ (gsCol A) j‖ := by + have h := hrank j + rw [Rmat_eq, rStep_diag] at h + rwa [gsV_bridge A j (fun i _ => Qcol_bridge A hrank i), ← normFn_eq_norm] + rw [gn_eq] + exact smul_ne_zero (inv_ne_zero (ne_of_gt hpos)) (norm_pos_iff.mp hpos) + +/-- **Orthonormality of the executable `Q` columns.** Under positive `R` pivots, +`qₐ · q_b = δₐᵦ`. -/ +theorem Q_orthonormal (A : Fin m → Fin n → ℝ) (hrank : ∀ j : Fin n, 0 < Rmat A j j) (a b : Fin n) : + Spec.dotFn (Qcol A a) (Qcol A b) = if a = b then 1 else 0 := by + rw [dotFn_eq_inner] + show ⟪(WithLp.toLp 2 (Qcol A a) : EuclideanSpace ℝ (Fin m)), WithLp.toLp 2 (Qcol A b)⟫_ℝ = _ + rw [Qcol_bridge A hrank a, Qcol_bridge A hrank b] + have horth := orthonormal_iff_ite.mp (gramSchmidtNormed_orthonormal' (gsCol A)) + ⟨a, gn_ne_zero A hrank a⟩ ⟨b, gn_ne_zero A hrank b⟩ + rw [horth] + simp only [Subtype.mk.injEq] + +/-- **Matrix-level orthonormality.** `Qᵀ Q = 1` for the executable Gram–Schmidt `Q` factor. 
-/ +theorem QT_mul_Q_eq_one (A : Fin m → Fin n → ℝ) (hrank : ∀ j : Fin n, 0 < Rmat A j j) : + (Matrix.of (fun i k => Qmat A i k))ᵀ * Matrix.of (fun i k => Qmat A i k) = 1 := by + ext a b + rw [Matrix.mul_apply] + simp only [Matrix.transpose_apply, Matrix.of_apply, Matrix.one_apply] + rw [show (∑ i, Qmat A i a * Qmat A i b) = Spec.dotFn (Qcol A a) (Qcol A b) from by + rw [dotFn_eq_sum]; rfl, + Q_orthonormal A hrank a b] + +/-- **Full QR specification.** For `A` with positive executable `R`-pivots (full column rank), the +executable Gram–Schmidt factors satisfy `Spec.Factorization.IsQR`: `Qᵀ Q = 1`, `R` upper-triangular, +and `A = Q · R`. -/ +theorem isQR_of_pos (A : Fin m → Fin n → ℝ) (hrank : ∀ j : Fin n, 0 < Rmat A j j) : + Spec.Factorization.IsQR (Matrix.of A) (Matrix.of (fun i k => Qmat A i k)) + (Matrix.of (fun k j => Rmat A k j)) := by + refine ⟨QT_mul_Q_eq_one A hrank, ?_, qr_mul_eq A hrank⟩ + intro i j hji + show Rmat A i j = 0 + exact Rmat_upper_triangular A (Fin.lt_def.mp hji) + +/-- **Tensor-level orthonormality.** For a tensor `A` with positive `qrRSpec` pivots, the `Q` factor +`qrQSpec A` has orthonormal columns: `Σ_i Q[i,a]·Q[i,b] = δₐᵦ`. 
-/ +theorem qrSpec_orthonormal (A : Spec.Tensor ℝ (.dim m (.dim n .scalar))) + (hrank : ∀ j : Fin n, 0 < Spec.get2 (Spec.qrRSpec A) j j) (a b : Fin n) : + (∑ i, Spec.get2 (Spec.qrQSpec A) i a * Spec.get2 (Spec.qrQSpec A) i b) + = if a = b then 1 else 0 := by + have hQ : ∀ x y, Spec.get2 (Spec.qrQSpec A) x y = Qmat (Spec.toMatFn A) x y := fun _ _ => rfl + have hR : ∀ x y, Spec.get2 (Spec.qrRSpec A) x y = Rmat (Spec.toMatFn A) x y := fun _ _ => rfl + simp only [hQ] + rw [show (∑ i, Qmat (Spec.toMatFn A) i a * Qmat (Spec.toMatFn A) i b) + = Spec.dotFn (Qcol (Spec.toMatFn A) a) (Qcol (Spec.toMatFn A) b) from by + rw [dotFn_eq_sum]; rfl] + exact Q_orthonormal (Spec.toMatFn A) (fun j => by rw [← hR]; exact hrank j) a b + +end QR + +end Spec.Factorization.Reconstruction diff --git a/NN/Proofs/Tensor/Basic/FactorizationsReconstruction.lean b/NN/Proofs/Tensor/Basic/FactorizationsReconstruction.lean index 90cd5cb..6aa6e73 100644 --- a/NN/Proofs/Tensor/Basic/FactorizationsReconstruction.lean +++ b/NN/Proofs/Tensor/Basic/FactorizationsReconstruction.lean @@ -45,10 +45,11 @@ positive-pivot hypotheses discharge the `√`-radicand and divisor side conditio ## Scope -The one piece *not* proved is the orthonormality of the QR factor, `Qᵀ Q = 1`. Unlike `A = Q · R` -(which is a purely algebraic consequence of the orthogonalization step), it rests on the Gram–Schmidt -orthogonality invariant, which Mathlib provides for its own `gramSchmidt` but not for this executable -variant — so it stays the documented remaining increment, never a `sorry`. +This file proves `A = L · Lᵀ` and `A = Q · R` purely algebraically. The remaining QR property — +orthonormality of the `Q` factor, `Qᵀ Q = 1` — is proved in the companion file +[`NN.Proofs.Tensor.Basic.FactorizationsOrthonormal`](FactorizationsOrthonormal.lean) by bridging the +executable Gram–Schmidt to Mathlib's `gramSchmidt`, completing the full `Spec.Factorization.IsQR` +predicate (`isQR_of_pos`). 
-/ @[expose] public section diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index 2fdaf92..d8e51b5 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -145,10 +145,29 @@ tail `rTail`, read by `gs_fold_split` together with `rTail_getD`. The orthogonal to a masked `Finset` partial sum, after which the positive-pivot hypothesis cancels the `v / rⱼⱼ` normalization exactly. +# Orthonormality of the QR factor (`Qᵀ Q = 1`) + +The remaining finite-fold property — orthonormality of the `Q` factor, `Qᵀ Q = 1` — is proved in +[`NN.Proofs.Tensor.Basic.FactorizationsOrthonormal`](https://github.com/lean-dojo/TorchLean/blob/main/NN/Proofs/Tensor/Basic/FactorizationsOrthonormal.lean) +by *unifying the executable variant with Mathlib's `gramSchmidt`* rather than re-deriving the +orthogonality induction by hand. Reading the columns of `A` as vectors of `EuclideanSpace ℝ (Fin m)`, +`Qcol_bridge` proves by strong induction that the `j`-th executable `Q` column equals Mathlib's +`gramSchmidtNormed ℝ` of the column map. The orthonormality then follows from Mathlib's +`gramSchmidtNormed_orthonormal'`, giving `Q_orthonormal` (`qₐ · q_b = δₐᵦ`), the matrix-level +`QT_mul_Q_eq_one`, and the full `IsQR` predicate `isQR_of_pos` (orthonormal `Q`, upper-triangular `R`, +`A = Q · R`). + +The bridge rests on three small connectors over `ℝ`: the executable `dotFn`/`normFn` are the Euclidean +inner product and norm (`dotFn_eq_inner`, `normFn_eq_norm`), and `proj_normalize` shows the +un-normalized Gram–Schmidt projection term equals the normalized one (with no non-degeneracy +hypothesis). The positive-pivot assumption (`0 < R[j,j]`, full column rank) supplies the non-vanishing +of each `gramSchmidt` vector via `gn_ne_zero`. 
These connectors are stated generally enough to lift +into a future Mathlib matrix-level QR contribution. + # What remains -The one finite-fold property still open is the orthonormality of the QR factor, `Qᵀ Q = 1`. Unlike -`A = Q · R` — a purely algebraic consequence of the orthogonalization step, proved above — it rests on -the Gram–Schmidt orthogonality invariant, which Mathlib provides for its own `gramSchmidt` but not for -this executable variant. The specification-level facts the kernel methods rely on are independent of -that step, so the CHD foundation is already in place. +With Cholesky and QR fully reconstructed (`A = L · Lᵀ`, `A = Q · R`, `Qᵀ Q = 1`), the only properties +not available as a-priori theorems are the *iterative* ones: full diagonalization for the cyclic Jacobi +eigensolver and the SVD built on it. Mathlib v4.30.0 has no Jacobi convergence theory, so those remain +captured by the exact a-posteriori residual certificate above, never by `sorry`. The specification-level +facts the kernel methods rely on are independent of that step, so the CHD foundation is complete. From f6717b5693d9817195b96476887b58419bbba075 Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sat, 30 May 2026 22:03:05 -0700 Subject: [PATCH 06/22] Make Jacobi residual certificate unconditional + reviewer examples MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Discharge the two hypotheses the symmetric-eigendecomposition residual certificate assumed (`Vᵀ V = 1` and `A = V·Af·Vᵀ`) for the real `symEigJacobiSpec` output, so the certificate holds outright. NN/Proofs/Tensor/Basic/FactorizationsJacobi.lean (sorry-free, warning-free): - toM bridge from `Array (Array ℝ)` to Mathlib `Matrix`; toM_matMul/tr/id show the array ops realise the matrix ops (unconditionally). - givens_orthogonal: each Givens rotation with c²+s²=1 is orthogonal (Jᵀ J = 1), via the three column shapes and a 9-case dot-product split. 
- JacInv loop invariant preserved by jacInv_rotate/_sweep/_run (List.foldlRecOn over jacobiPairs; base case (A, I)). - jacobi_orthogonal, jacobi_similarity (no hypotheses) ⟹ unconditional symEigJacobi_{reconstruction,frobenius}_residual and symEigJacobi_isSymEig_of_diagonal, with worked examples. Blueprint Ch4: new "Faithfulness of the Jacobi run" section; "What remains" narrowed from the iterative properties to just the convergence rate. Examples (positive + negative controls, compiled #eval assertions): - Cholesky: indefinite A correctly fails (NaN; uses summed Frobenius error). - QR: rank-deficient A reconstructs but Qᵀ Q ≠ I (full rank needed). - SVD: Vᵀ V = I; permuted σ fails to reconstruct. - SymEig: orthogonality exact at 1 sweep; off-diagonal residual asymptotic; exact residual certificate verified numerically (lhs = rhs, |Δ| = 0). Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Examples/Factorization.lean | 23 +- NN/Examples/Factorization/Cholesky.lean | 19 + NN/Examples/Factorization/Common.lean | 55 ++- NN/Examples/Factorization/QR.lean | 25 + NN/Examples/Factorization/SVD.lean | 20 +- NN/Examples/Factorization/SymEig.lean | 78 ++- NN/Proofs/Tensor/Basic.lean | 1 + .../Tensor/Basic/FactorizationsJacobi.lean | 458 ++++++++++++++++++ .../Ch4_Verification/Factorizations.lean | 46 +- 9 files changed, 695 insertions(+), 30 deletions(-) create mode 100644 NN/Proofs/Tensor/Basic/FactorizationsJacobi.lean diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean index 7f7be49..5bd7a28 100644 --- a/NN/Examples/Factorization.lean +++ b/NN/Examples/Factorization.lean @@ -16,15 +16,24 @@ public import NN.Examples.Factorization.SVD # Matrix factorization examples Executable sanity checks for the spec-layer matrix factorizations in -`NN.Spec.Core.Tensor.Factorizations`: +`NN.Spec.Core.Tensor.Factorizations`, designed to corroborate the formal correctness theorems in +`NN.Proofs.Tensor.Basic.{Factorizations, FactorizationsReconstruction, 
FactorizationsOrthonormal,
+FactorizationsJacobi}`. Each check runs through compiled `#eval` assertions, so the build fails if a
+factorization misbehaves.
 
-- `Cholesky` — `A = L · Lᵀ`
-- `QR` — `A = Q · R`, `Qᵀ·Q = I`
-- `SymEig` — full symmetric eigendecomposition `A = V · diag(λ) · Vᵀ`
-- `SVD` — `A = U · diag(σ) · Vᵀ`
+- `Cholesky` — `A = L · Lᵀ`; **negative control**: an indefinite `A` correctly fails (no SPD factor).
+- `QR` — `A = Q · R`, `Qᵀ·Q = I`; **negative control**: a rank-deficient `A` still reconstructs
+  but `Qᵀ Q ≠ I`, separating the two guarantees and showing full column rank is needed.
+- `SymEig` — `A = V · diag(λ) · Vᵀ`; orthogonality `Vᵀ V = I` is exact at *any* sweep count (witness
+  of the a-priori `jacobi_orthogonal`), diagonalization is asymptotic, and the **exact residual
+  certificate** `‖A − V·diag(λ)·Vᵀ‖² = ‖offDiag(VᵀAV)‖²` (`symEigJacobi_frobenius_residual`) is
+  verified numerically.
+- `SVD` — `A = U · diag(σ) · Vᵀ`, `Vᵀ V = I`; **negative control**: a permuted `σ` fails to
+  reconstruct.
 
-Each example reconstructs the original matrix and asserts (via `#guard`) that the maximum
-reconstruction error is below `tol`, so the build fails if a factorization is incorrect.
+Both **positive** checks (a valid factorization reconstructs to `err ≈ 0`) and **negative controls**
+(the same metric reports a large error / `NaN` when a hypothesis is violated) are included, so a
+reviewer can see the checks are not vacuous.
 -/
 
 @[expose] public section
diff --git a/NN/Examples/Factorization/Cholesky.lean b/NN/Examples/Factorization/Cholesky.lean
index fc23255..da351a2 100644
--- a/NN/Examples/Factorization/Cholesky.lean
+++ b/NN/Examples/Factorization/Cholesky.lean
@@ -39,4 +39,23 @@ def reconErr : Float := maxMatErr A (mm L (tr L))
 -- Compiled assertion: the factorization reconstructs A (fails the build otherwise).
 #eval assertLt "Cholesky A = L·Lᵀ" reconErr
 
+/-!
## Negative control: the SPD hypothesis is necessary + +`isCholesky_of_pos` requires the executable pivots `L[j,j]` to be positive (`0 < choleskyFn A j j`), +which is exactly the success condition over the reals. The matrix below is symmetric but *not* +positive-definite (eigenvalues `3` and `-1`), so the diagonal step takes `√(negative)` and the +reconstruction is `NaN` — never a small error. This documents that the hypothesis genuinely bites. -/ + +/-- A symmetric but **indefinite** matrix (eigenvalues `{3, -1}`), outside Cholesky's domain. -/ +def Abad : Spec.Tensor Float (.dim 2 (.dim 2 .scalar)) := + mkMat [[1, 2], + [2, 1]] + +def Lbad : Spec.Tensor Float (.dim 2 (.dim 2 .scalar)) := Spec.choleskySpec Abad +-- Use the *summed* Frobenius error here, not `maxMatErr`: IEEE `max` ignores `NaN`, whereas the sum +-- propagates the `NaN` produced by `√(negative)`, faithfully reporting that no factor exists. +def reconErrBad : Float := frobSqErr Abad (mm Lbad (tr Lbad)) + +#eval assertReconFails "Cholesky on indefinite A correctly fails (no SPD ⇒ no factor)" reconErrBad + end NN.Examples.Factorization.Cholesky diff --git a/NN/Examples/Factorization/Common.lean b/NN/Examples/Factorization/Common.lean index c970e32..a5d2074 100644 --- a/NN/Examples/Factorization/Common.lean +++ b/NN/Examples/Factorization/Common.lean @@ -50,15 +50,32 @@ def diagFromVec {n : Nat} (v : Spec.Tensor Float (.dim n .scalar)) : Spec.Tensor Float (.dim n (.dim n .scalar)) := Spec.ofMatFn (fun i j => if i.val == j.val then Spec.Tensor.toScalar (Spec.get v i) else 0.0) +/-- Extract the diagonal of a square matrix as a length-`n` vector. -/ +def diagOf {n : Nat} (M : Spec.Tensor Float (.dim n (.dim n .scalar))) : + Spec.Tensor Float (.dim n .scalar) := + Spec.ofVecFn (fun i => Spec.get2 M i i) + /-- Read a vector tensor back out as a `List Float` (for display). 
-/ def vecToList {n : Nat} (v : Spec.Tensor Float (.dim n .scalar)) : List Float := (List.finRange n).map (fun i => Spec.Tensor.toScalar (Spec.get v i)) +/-- Squared Frobenius distance `Σ_{i,j} (A_ij - B_ij)²` between two `m × n` matrices. -/ +def frobSqErr {m n : Nat} (A B : Spec.Tensor Float (.dim m (.dim n .scalar))) : Float := + (List.finRange m).foldl (fun acc i => + (List.finRange n).foldl + (fun a j => let d := Spec.get2 A i j - Spec.get2 B i j; a + d * d) acc) 0.0 + +/-- Squared Frobenius off-diagonal mass `Σ_{i≠j} M_ij²` of a square matrix. -/ +def offDiagFrobSq {n : Nat} (M : Spec.Tensor Float (.dim n (.dim n .scalar))) : Float := + (List.finRange n).foldl (fun acc i => + (List.finRange n).foldl + (fun a j => if i.val == j.val then a else let x := Spec.get2 M i j; a + x * x) acc) 0.0 + /-- Shared tolerance for reconstruction-error assertions. -/ def tol : Float := 1e-6 /-- -Compiled assertion used by the examples: print `name: OK (err)` when `err < tol`, otherwise raise an +Compiled **positive** assertion: print `name: OK (err)` when `err < tol`, otherwise raise an `IO` error so the build/`#eval` fails. Running this through `#eval` evaluates with the compiler (fast), unlike `#guard`, which forces slow kernel reduction of the whole factorization. -/ @@ -68,4 +85,40 @@ def assertLt (name : String) (err : Float) (tolerance : Float := tol) : IO Unit else throw (IO.userError s!"{name}: FAIL (err = {err} ≥ tol = {tolerance})") +/-- +Compiled **negative-control** assertion: succeeds only when `err ≥ threshold`, i.e. when a property +that *should not* hold is correctly detected as violated. Gives the metric teeth — a reviewer can see +the same `maxMatErr`/residual that reports `0` on a valid factorization reports a large value on an +invalid one, so the positive checks are not vacuous. 
+-/ +def assertGe (name : String) (err : Float) (threshold : Float := 0.5) : IO Unit := + if err ≥ threshold then + IO.println s!"{name}: OK (correctly rejected, err = {err} ≥ {threshold})" + else + throw (IO.userError s!"{name}: FAIL (err = {err} < {threshold}; expected the property to fail)") + +/-- +Compiled **negative-control** assertion that a reconstruction *fails*: succeeds when the error is not +below `tol` — including the `NaN` produced when a hypothesis is violated (e.g. Cholesky of a +non-positive-definite matrix takes `√(negative)`). Documents that the success hypotheses (SPD pivots, +full column rank) are genuinely necessary. +-/ +def assertReconFails (name : String) (err : Float) (tolerance : Float := tol) : IO Unit := + if err < tolerance then + throw (IO.userError s!"{name}: FAIL (unexpectedly reconstructed, err = {err} < {tolerance})") + else + IO.println s!"{name}: OK (correctly failed, err = {err})" + +/-- +Compiled assertion that two scalars agree to `tolerance`. Used to verify the *exact* residual +identity numerically: the reconstruction error and the off-diagonal mass it equals are computed by +independent routines and shown to match, so the identity `symEigJacobi_frobenius_residual` is not a +tautology of the code. +-/ +def assertApproxEq (name : String) (a b : Float) (tolerance : Float := tol) : IO Unit := + if Float.abs (a - b) < tolerance then + IO.println s!"{name}: OK (lhs = {a}, rhs = {b}, |Δ| = {Float.abs (a - b)})" + else + throw (IO.userError s!"{name}: FAIL (lhs = {a}, rhs = {b}, |Δ| = {Float.abs (a - b)} ≥ {tolerance})") + end NN.Examples.Factorization diff --git a/NN/Examples/Factorization/QR.lean b/NN/Examples/Factorization/QR.lean index 0080de7..b2e549a 100644 --- a/NN/Examples/Factorization/QR.lean +++ b/NN/Examples/Factorization/QR.lean @@ -41,4 +41,29 @@ def orthoErr : Float := maxMatErr (mm (tr Q) Q) (Spec.identityTensorSpec 3) #eval assertLt "QR A = Q·R" reconErr #eval assertLt "QR Qᵀ·Q = I" orthoErr +/-! 
## Negative control: full column rank is necessary for orthonormality + +`qrSpec_orthonormal` (`Qᵀ Q = 1`) requires full column rank — positive `R`-pivots +(`0 < R[j,j]`). The matrix below has a dependent column (`col₂ = 2·col₁`), so Gram–Schmidt produces a +**zero** `Q` column where the pivot vanishes: `A = Q·R` still holds, but `Qᵀ Q` has a `0` on the +diagonal, so orthonormality fails. This separates the two guarantees and shows the rank hypothesis +genuinely bites. -/ + +/-- A rank-2 matrix (`col₂ = 2·col₁`): reconstructs, but `Q` cannot be orthonormal. -/ +def Adef : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := + mkMat [[1, 2, 0], + [2, 4, 1], + [1, 2, 0]] + +def Qdef : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := Spec.qrQSpec Adef +def Rdef : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := Spec.qrRSpec Adef + +/-- Reconstruction still holds even without full rank. -/ +def reconErrDef : Float := maxMatErr Adef (mm Qdef Rdef) +/-- Orthonormality fails: `Qᵀ·Q` has a zero diagonal entry, so it is far from `I`. -/ +def orthoErrDef : Float := maxMatErr (mm (tr Qdef) Qdef) (Spec.identityTensorSpec 3) + +#eval assertLt "QR(rank-deficient) A = Q·R still reconstructs" reconErrDef +#eval assertGe "QR(rank-deficient) Qᵀ·Q = I correctly fails (needs full column rank)" orthoErrDef + end NN.Examples.Factorization.QR diff --git a/NN/Examples/Factorization/SVD.lean b/NN/Examples/Factorization/SVD.lean index 0b2369c..399b1fb 100644 --- a/NN/Examples/Factorization/SVD.lean +++ b/NN/Examples/Factorization/SVD.lean @@ -41,10 +41,28 @@ def V : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := svd.2.2 /-- Reconstruction error `‖A - U·diag(σ)·Vᵀ‖_max`. -/ def reconErr : Float := maxMatErr A (mm (mm U (diagFromVec σ)) (tr V)) +/-- Orthogonality error `‖Vᵀ·V - I‖_max` for the right singular vectors. -/ +def orthoErrV : Float := maxMatErr (mm (tr V) V) (Spec.identityTensorSpec 3) #eval vecToList σ --- Compiled assertion (fails the build otherwise). 
+-- Compiled assertions (fail the build otherwise). #eval assertLt "SVD A = U·diag(σ)·Vᵀ" reconErr +-- `V` are the eigenvectors of `Aᵀ A` (see `IsSVD.gram_isSymEig`), hence orthogonal a-priori — the +-- numeric witness of `jacobi_orthogonal` applied to the Gram matrix, even though `σ₃ = 0` (rank 2). +#eval assertLt "SVD Vᵀ·V = I" orthoErrV + +/-! ## Negative control: a wrong factor is rejected + +Permuting the singular values (so they no longer pair with their vectors) must break the +reconstruction — otherwise the `maxMatErr` reconstruction check would be vacuous. -/ + +/-- A deliberately mismatched singular-value vector (permuted, and nonzero where the true `σ₃ = 0`). -/ +def σbad : Spec.Tensor Float (.dim 3 .scalar) := + Spec.ofVecFn (fun i => ([3.0, 5.0, 1.0] : List Float).getD i.val 0.0) +/-- Reconstruction with the mismatched `σ` (should be far from `A`). -/ +def reconErrBad : Float := maxMatErr A (mm (mm U (diagFromVec σbad)) (tr V)) + +#eval assertGe "SVD with permuted σ correctly fails to reconstruct" reconErrBad end NN.Examples.Factorization.SVD diff --git a/NN/Examples/Factorization/SymEig.lean b/NN/Examples/Factorization/SymEig.lean index 5426e07..a896a99 100644 --- a/NN/Examples/Factorization/SymEig.lean +++ b/NN/Examples/Factorization/SymEig.lean @@ -14,8 +14,23 @@ meta import NN.Examples.Factorization.Common `symEigJacobiSpec A sweeps` returns `(eigenvalues, V)` for a symmetric `A`, where the columns of `V` are the (orthonormal) eigenvectors. Unlike the power-iteration `eigendecompSpec`, this recovers -**all** eigenpairs. We check the spectral reconstruction `A = V · diag(λ) · Vᵀ` and orthogonality -`Vᵀ · V = I`. +**all** eigenpairs. 
+ +These checks are designed to give a reviewer confidence in the matching formal development +(`NN.Proofs.Tensor.Basic.FactorizationsJacobi`), and in particular to exhibit the precise boundary +between what is proved *exactly / a-priori* and what is only *asymptotic*: + +* **Spectral reconstruction** `A = V · diag(λ) · Vᵀ` and orthogonality `Vᵀ V = I` hold at high sweep + counts (positive checks). +* **Orthogonality is exact at *any* sweep count** — even after a single sweep `Vᵀ V = I` to machine + precision. This is the numeric witness of `jacobi_orthogonal`, which is an a-priori theorem (no + convergence hypothesis). +* **Diagonalization is only asymptotic**: one sweep leaves a genuine off-diagonal residual that more + sweeps drive to zero. This is the "rate" that remains a-posteriori (`What remains` in the blueprint). +* **The exact residual certificate** `‖A − V·diag(λ)·Vᵀ‖_F² = ‖offDiag(VᵀAV)‖_F²` + (`symEigJacobi_frobenius_residual`) is checked numerically at a *low* sweep count, where both sides + are large and equal — the two sides are computed by independent routines, so the match is evidence + the identity is real and not a tautology of the code. -/ @[expose] public section @@ -29,24 +44,55 @@ def A : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := [1, 3, 1], [1, 1, 4]] -/-- Eigenvalues (diagonal after Jacobi sweeps) and eigenvector matrix `V`. -/ -def eig : Spec.Tensor Float (.dim 3 .scalar) × Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := +/-- The 3×3 identity (target for the orthogonality checks). -/ +def I3 : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := Spec.identityTensorSpec 3 + +/-- Eigendecomposition after 8 sweeps (converged) and after 1 sweep (not yet converged). 
-/ +def eig8 : Spec.Tensor Float (.dim 3 .scalar) × Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := Spec.symEigJacobiSpec A 8 +def eig1 : Spec.Tensor Float (.dim 3 .scalar) × Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := + Spec.symEigJacobiSpec A 1 + +/-- Eigenvalues and eigenvector matrix `V` (columns are eigenvectors) at 8 sweeps. -/ +def evals8 : Spec.Tensor Float (.dim 3 .scalar) := eig8.1 +def V8 : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := eig8.2 +/-- Eigenvector matrix after a single sweep. -/ +def V1 : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := eig1.2 + +/-- Rotated matrices `Af = Vᵀ A V` after 1 and 8 sweeps (diagonal in the limit). -/ +def Af1 : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := mm (mm (tr V1) A) V1 +def Af8 : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := mm (mm (tr V8) A) V8 + +/-- Spectral reconstruction error `‖A - V·diag(λ)·Vᵀ‖_max` at 8 sweeps. -/ +def reconErr8 : Float := maxMatErr A (mm (mm V8 (diagFromVec evals8)) (tr V8)) +/-- Orthogonality error `‖Vᵀ·V - I‖_max` at 8 and at 1 sweep. -/ +def orthoErr8 : Float := maxMatErr (mm (tr V8) V8) I3 +def orthoErr1 : Float := maxMatErr (mm (tr V1) V1) I3 + +/-- Off-diagonal mass of `Af` after 1 and 8 sweeps (the squared reconstruction residual). -/ +def offResid1 : Float := offDiagFrobSq Af1 +def offResid8 : Float := offDiagFrobSq Af8 + +/-- Reconstruction side of the exact certificate, computed independently at 1 sweep. -/ +def reconFrobSq1 : Float := frobSqErr A (mm (mm V1 (diagFromVec (diagOf Af1))) (tr V1)) + +#eval vecToList evals8 +#eval IO.println s!"off-diagonal mass: 1 sweep = {offResid1}, 8 sweeps = {offResid8}" -/-- Eigenvalues. -/ -def evals : Spec.Tensor Float (.dim 3 .scalar) := eig.1 -/-- Eigenvector matrix (columns are eigenvectors). -/ -def V : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := eig.2 +-- Positive checks at convergence. 
+#eval assertLt "SymEig(8) A = V·diag(λ)·Vᵀ" reconErr8 +#eval assertLt "SymEig(8) Vᵀ·V = I" orthoErr8 -/-- Spectral reconstruction error `‖A - V·diag(λ)·Vᵀ‖_max`. -/ -def reconErr : Float := maxMatErr A (mm (mm V (diagFromVec evals)) (tr V)) -/-- Orthogonality error `‖Vᵀ·V - I‖_max`. -/ -def orthoErr : Float := maxMatErr (mm (tr V) V) (Spec.identityTensorSpec 3) +-- Orthogonality is EXACT after a single sweep (numeric witness of the a-priori `jacobi_orthogonal`). +#eval assertLt "SymEig(1) Vᵀ·V = I (orthogonality exact at any sweep count)" orthoErr1 -#eval vecToList evals +-- Diagonalization is only asymptotic: 1 sweep leaves a real residual, 8 sweeps remove it. +#eval assertGe "SymEig(1) off-diagonal residual is non-negligible" offResid1 0.01 +#eval assertLt "SymEig(8) off-diagonal residual ≈ 0" offResid8 --- Compiled assertions (fail the build otherwise). -#eval assertLt "SymEig A = V·diag(λ)·Vᵀ" reconErr -#eval assertLt "SymEig Vᵀ·V = I" orthoErr +-- The EXACT residual certificate `‖A - V·diag(λ)·Vᵀ‖² = ‖offDiag(VᵀAV)‖²`, at a sweep count where +-- both sides are large — independent computations agree (witness of `symEigJacobi_frobenius_residual`). 
+#eval assertApproxEq "SymEig residual certificate ‖A-V·diagΛ·Vᵀ‖² = ‖offDiag(VᵀAV)‖²" + reconFrobSq1 offResid1 end NN.Examples.Factorization.SymEig diff --git a/NN/Proofs/Tensor/Basic.lean b/NN/Proofs/Tensor/Basic.lean index fb2f877..33ec81a 100644 --- a/NN/Proofs/Tensor/Basic.lean +++ b/NN/Proofs/Tensor/Basic.lean @@ -12,6 +12,7 @@ public import NN.Proofs.Tensor.Basic.LinearAlgebra public import NN.Proofs.Tensor.Basic.Factorizations public import NN.Proofs.Tensor.Basic.FactorizationsReconstruction public import NN.Proofs.Tensor.Basic.FactorizationsOrthonormal +public import NN.Proofs.Tensor.Basic.FactorizationsJacobi public import NN.Proofs.Tensor.Basic.BoundsNorms public import NN.Proofs.Tensor.Basic.Algebra diff --git a/NN/Proofs/Tensor/Basic/FactorizationsJacobi.lean b/NN/Proofs/Tensor/Basic/FactorizationsJacobi.lean new file mode 100644 index 0000000..dafec5f --- /dev/null +++ b/NN/Proofs/Tensor/Basic/FactorizationsJacobi.lean @@ -0,0 +1,458 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Proofs.Tensor.Basic.Factorizations + +/-! +# The cyclic Jacobi run is faithful (orthogonality + orthogonal similarity) + +The a-posteriori residual certificate in +[`NN.Proofs.Tensor.Basic.Factorizations`](./Factorizations.lean) +(`symEig_reconstruction_residual`, `symEig_frobenius_residual`, `isSymEig_of_diagonal`) is stated +*conditionally*: it assumes the two algebraic premises `Vᵀ V = 1` (the accumulated eigenvector matrix +is orthogonal) and `A = V · Af · Vᵀ` (the rotated matrix is an orthogonal similarity of the input). +Both are *exact, finite, a-priori* facts about the executable `Spec.arrJacobiRun` — they need no +Jacobi convergence theory. This file proves them and thereby discharges the hypotheses, turning the +certificate into an **unconditional** statement about the real solver output. 
+ +The development is a refinement bridge from the strict `Array (Array ℝ)` representation the iteration +runs over to Mathlib `Matrix (Fin n) (Fin n) ℝ`: + +* `toM` reads an array matrix as a `Matrix`; `toM_matMul`/`toM_tr`/`toM_id` show the array operations + realise the corresponding matrix operations. +* `givens_orthogonal` is the one genuinely-new piece: each Givens rotation `arrGivens n p q c s` with + `c² + s² = 1` is an orthogonal matrix (`Jᵀ J = 1`). +* `JacInv` is the loop invariant `Vᵀ V = 1 ∧ A₀ = V · A · Vᵀ`; `jacInv_rotate`/`jacInv_sweep`/ + `jacInv_run` propagate it through one rotation, one sweep, and the whole run. +* `jacobi_orthogonal` and `jacobi_similarity` are the discharged premises for the actual + `symEigJacobiSpec` output, and `symEigJacobi_*` re-state the residual certificate unconditionally. +-/ + +@[expose] public section + +namespace Spec.Factorization + +open Matrix +open scoped BigOperators + +variable {n : Nat} + +/-! ## Reading array matrices as Mathlib matrices -/ + +/-- Reading position `i` (in bounds) of `Array.ofFn f` returns `f ⟨i, _⟩`. -/ +theorem getD_ofFn {β : Type} (f : Fin n → β) (i : Nat) (hi : i < n) (d : β) : + (Array.ofFn f).getD i d = f ⟨i, hi⟩ := by + rw [Array.getD_eq_getD_getElem?, Array.getElem?_eq_getElem (by simpa using hi), + Option.getD_some, Array.getElem_ofFn] + +/-- Reading entry `(i, j)` of a doubly-`ofFn` array matrix returns the underlying function value. -/ +theorem arrGet_ofFn₂ (F : Fin n → Fin n → ℝ) (i j : Fin n) : + Spec.arrGet (Array.ofFn (fun a : Fin n => Array.ofFn (fun b : Fin n => F a b))) i.val j.val + = F i j := by + unfold Spec.arrGet + rw [getD_ofFn (fun a : Fin n => Array.ofFn (fun b : Fin n => F a b)) i.val i.isLt #[], + getD_ofFn (fun b : Fin n => F ⟨i.val, i.isLt⟩ b) j.val j.isLt 0] + +/-- View an `Array (Array ℝ)` as a `Matrix (Fin n) (Fin n) ℝ`. 
-/ +noncomputable def toM (n : Nat) (M : Array (Array ℝ)) : Matrix (Fin n) (Fin n) ℝ := + Matrix.of (fun i j => Spec.arrGet M i.val j.val) + +@[simp] theorem toM_apply (M : Array (Array ℝ)) (i j : Fin n) : + toM n M i j = Spec.arrGet M i.val j.val := rfl + +/-- The array matrix product realises the matrix product. -/ +theorem toM_matMul (X Y : Array (Array ℝ)) : + toM n (Spec.arrMatMul n X Y) = toM n X * toM n Y := by + ext i j + rw [Matrix.mul_apply] + simp only [toM_apply] + unfold Spec.arrMatMul + rw [arrGet_ofFn₂] + exact Spec.finRange_foldl_add_eq_finset_sum + (fun k => Spec.arrGet X i.val k.val * Spec.arrGet Y k.val j.val) + +/-- The array transpose realises the matrix transpose. -/ +theorem toM_tr (X : Array (Array ℝ)) : toM n (Spec.arrTr n X) = (toM n X)ᵀ := by + ext i j + rw [Matrix.transpose_apply] + simp only [toM_apply] + unfold Spec.arrTr + rw [arrGet_ofFn₂] + +/-- The array identity realises the matrix identity. -/ +theorem toM_id : toM n (Spec.arrId n) = 1 := by + ext i j + simp only [toM_apply] + unfold Spec.arrId + rw [arrGet_ofFn₂] + by_cases h : i = j + · subst h; simp + · rw [Matrix.one_apply_ne h] + simp [Fin.val_ne_of_ne h] + +/-! ## The Givens rotation is orthogonal -/ + +/-- Entrywise value of the Givens array matrix (boolean conditions). -/ +theorem toM_givens_apply (p q : Nat) (c s : ℝ) (a b : Fin n) : + toM n (Spec.arrGivens n p q c s) a b + = (if a.val == p && b.val == p then c + else if a.val == q && b.val == q then c + else if a.val == p && b.val == q then s + else if a.val == q && b.val == p then -s + else if a.val == b.val then 1 else 0) := by + simp only [toM_apply] + unfold Spec.arrGivens + rw [arrGet_ofFn₂] + +/-- Entrywise value of the Givens array matrix (propositional conditions). 
-/ +theorem toM_givens_apply' (p q : Nat) (c s : ℝ) (a b : Fin n) : + toM n (Spec.arrGivens n p q c s) a b + = (if a.val = p ∧ b.val = p then c + else if a.val = q ∧ b.val = q then c + else if a.val = p ∧ b.val = q then s + else if a.val = q ∧ b.val = p then -s + else if a.val = b.val then 1 else 0) := by + rw [toM_givens_apply] + simp only [Bool.and_eq_true, beq_iff_eq] + +/-- Column `p` of the Givens matrix: `c` at row `p`, `-s` at row `q`, `0` elsewhere. -/ +theorem givens_col_fp (p q : Nat) (hp : p < n) (hq : q < n) (hpq : p ≠ q) (c s : ℝ) (k : Fin n) : + toM n (Spec.arrGivens n p q c s) k ⟨p, hp⟩ + = (if k = ⟨p, hp⟩ then c else if k = ⟨q, hq⟩ then -s else 0) := by + rw [toM_givens_apply'] + by_cases hkp : k.val = p + · simp [hkp, Fin.ext_iff] + · by_cases hkq : k.val = q + · simp [hkq, hpq, Ne.symm hpq, Fin.ext_iff] + · simp [hkp, hkq, hpq, Fin.ext_iff] + +/-- Column `q` of the Givens matrix: `s` at row `p`, `c` at row `q`, `0` elsewhere. -/ +theorem givens_col_fq (p q : Nat) (hp : p < n) (hq : q < n) (hpq : p ≠ q) (c s : ℝ) (k : Fin n) : + toM n (Spec.arrGivens n p q c s) k ⟨q, hq⟩ + = (if k = ⟨p, hp⟩ then s else if k = ⟨q, hq⟩ then c else 0) := by + rw [toM_givens_apply'] + by_cases hkp : k.val = p + · simp [hkp, hpq, Ne.symm hpq, Fin.ext_iff] + · by_cases hkq : k.val = q + · simp [hkq, Ne.symm hpq, Fin.ext_iff] + · simp [hkp, hkq, Ne.symm hpq, Fin.ext_iff] + +/-- Any other column `o ∉ {p, q}` of the Givens matrix is the `o`-th standard basis vector. -/ +theorem givens_col_other (p q : Nat) (c s : ℝ) (o k : Fin n) + (hop : o.val ≠ p) (hoq : o.val ≠ q) : + toM n (Spec.arrGivens n p q c s) k o = (if k = o then 1 else 0) := by + rw [toM_givens_apply'] + by_cases hko : k = o + · simp [hko, hop, hoq] + · simp [hop, hoq, hko, Fin.val_ne_of_ne hko] + +/-- A sum of products of two indicator functions is the Kronecker delta. 
-/ +private theorem sum_ite_mul_ite (i j : Fin n) : + ∑ k : Fin n, (if k = i then (1 : ℝ) else 0) * (if k = j then 1 else 0) + = if i = j then 1 else 0 := by + by_cases hij : i = j + · subst hij + have hterm : ∀ k : Fin n, + (if k = i then (1 : ℝ) else 0) * (if k = i then 1 else 0) = if k = i then 1 else 0 := + fun k => by by_cases hk : k = i <;> simp [hk] + rw [if_pos rfl, Finset.sum_congr rfl (fun k _ => hterm k), Finset.sum_ite_eq'] + simp + · rw [if_neg hij] + refine Finset.sum_eq_zero (fun k _ => ?_) + by_cases hki : k = i + · subst hki; simp [hij] + · rw [if_neg hki, zero_mul] + +/-- A sum of products of two functions each supported on `{fp, fq}` (with `fp ≠ fq`). -/ +private theorem sum_two_supp (fp fq : Fin n) (hfpq : fp ≠ fq) (x1 y1 x2 y2 : ℝ) : + ∑ k : Fin n, (if k = fp then x1 else if k = fq then y1 else 0) + * (if k = fp then x2 else if k = fq then y2 else 0) + = x1 * x2 + y1 * y2 := by + have hterm : ∀ k : Fin n, + (if k = fp then x1 else if k = fq then y1 else 0) + * (if k = fp then x2 else if k = fq then y2 else 0) + = (if k = fp then x1 * x2 else 0) + (if k = fq then y1 * y2 else 0) := by + intro k + by_cases hkp : k = fp + · subst hkp; simp [hfpq] + · by_cases hkq : k = fq + · subst hkq; simp [hkp] + · simp [hkp, hkq] + rw [Finset.sum_congr rfl (fun k _ => hterm k), Finset.sum_add_distrib, + Finset.sum_ite_eq', Finset.sum_ite_eq'] + simp + +/-- A function supported on `{fp, fq}` times an indicator at `o ∉ {fp, fq}` sums to zero. -/ +private theorem sum_two_supp_mul_ite (fp fq o : Fin n) (hop : o ≠ fp) (hoq : o ≠ fq) (x1 y1 : ℝ) : + ∑ k : Fin n, (if k = fp then x1 else if k = fq then y1 else 0) * (if k = o then (1 : ℝ) else 0) + = 0 := by + refine Finset.sum_eq_zero (fun k _ => ?_) + by_cases hko : k = o + · subst hko; rw [if_neg hop, if_neg hoq, zero_mul] + · rw [if_neg hko, mul_zero] + +/-- **Givens rotation is orthogonal.** For `c² + s² = 1` and `p ≠ q`, `Jᵀ J = 1`. 
-/ +theorem givens_orthogonal (p q : Nat) (hp : p < n) (hq : q < n) (hpq : p ≠ q) (c s : ℝ) + (hcs : c ^ 2 + s ^ 2 = 1) : + (toM n (Spec.arrGivens n p q c s))ᵀ * toM n (Spec.arrGivens n p q c s) = 1 := by + have hfpq : (⟨p, hp⟩ : Fin n) ≠ ⟨q, hq⟩ := fun h => hpq (Fin.ext_iff.mp h) + ext i j + rw [Matrix.mul_apply, Matrix.one_apply] + simp only [Matrix.transpose_apply] + by_cases hip : i = ⟨p, hp⟩ + · subst hip + by_cases hjp : j = ⟨p, hp⟩ + · -- (p, p) + subst hjp + rw [Finset.sum_congr rfl (fun k _ => by rw [givens_col_fp p q hp hq hpq c s k]), + sum_two_supp _ _ hfpq c (-s) c (-s), if_pos rfl] + nlinarith [hcs] + · by_cases hjq : j = ⟨q, hq⟩ + · -- (p, q) + subst hjq + rw [Finset.sum_congr rfl (fun k _ => by + rw [givens_col_fp p q hp hq hpq c s k, givens_col_fq p q hp hq hpq c s k]), + sum_two_supp _ _ hfpq c (-s) s c, if_neg hfpq] + ring + · -- (p, other) + have hjp' : j.val ≠ p := fun h => hjp (Fin.ext h) + have hjq' : j.val ≠ q := fun h => hjq (Fin.ext h) + rw [Finset.sum_congr rfl (fun k _ => by + rw [givens_col_fp p q hp hq hpq c s k, givens_col_other p q c s j k hjp' hjq']), + sum_two_supp_mul_ite _ _ j hjp hjq c (-s), if_neg (Ne.symm hjp)] + · by_cases hiq : i = ⟨q, hq⟩ + · subst hiq + by_cases hjp : j = ⟨p, hp⟩ + · -- (q, p) + subst hjp + rw [Finset.sum_congr rfl (fun k _ => by + rw [givens_col_fq p q hp hq hpq c s k, givens_col_fp p q hp hq hpq c s k]), + sum_two_supp _ _ hfpq s c c (-s), if_neg (Ne.symm hfpq)] + ring + · by_cases hjq : j = ⟨q, hq⟩ + · -- (q, q) + subst hjq + rw [Finset.sum_congr rfl (fun k _ => by rw [givens_col_fq p q hp hq hpq c s k]), + sum_two_supp _ _ hfpq s c s c, if_pos rfl] + nlinarith [hcs] + · -- (q, other) + have hjp' : j.val ≠ p := fun h => hjp (Fin.ext h) + have hjq' : j.val ≠ q := fun h => hjq (Fin.ext h) + rw [Finset.sum_congr rfl (fun k _ => by + rw [givens_col_fq p q hp hq hpq c s k, givens_col_other p q c s j k hjp' hjq']), + sum_two_supp_mul_ite _ _ j hjp hjq s c, if_neg (Ne.symm hjq)] + · -- i other + have hip' : 
i.val ≠ p := fun h => hip (Fin.ext h) + have hiq' : i.val ≠ q := fun h => hiq (Fin.ext h) + by_cases hjp : j = ⟨p, hp⟩ + · -- (other, p) + subst hjp + rw [Finset.sum_congr rfl (fun k _ => by + rw [givens_col_other p q c s i k hip' hiq', givens_col_fp p q hp hq hpq c s k, + mul_comm]), + sum_two_supp_mul_ite _ _ i hip hiq c (-s), if_neg hip] + · by_cases hjq : j = ⟨q, hq⟩ + · -- (other, q) + subst hjq + rw [Finset.sum_congr rfl (fun k _ => by + rw [givens_col_other p q c s i k hip' hiq', givens_col_fq p q hp hq hpq c s k, + mul_comm]), + sum_two_supp_mul_ite _ _ i hip hiq s c, if_neg hiq] + · -- (other, other) + have hjp' : j.val ≠ p := fun h => hjp (Fin.ext h) + have hjq' : j.val ≠ q := fun h => hjq (Fin.ext h) + rw [Finset.sum_congr rfl (fun k _ => by + rw [givens_col_other p q c s i k hip' hiq', givens_col_other p q c s j k hjp' hjq']), + sum_ite_mul_ite] + +/-- The Golub–Van Loan rotation parameters the implementation uses satisfy `c² + s² = 1` for any +intermediate value `t`: this is `givens_normSq` with `MathFunctions.sqrt = Real.sqrt` and `t·t = t²`. -/ +theorem code_givens_normSq (t : ℝ) : + (1 / MathFunctions.sqrt (1 + t * t)) ^ 2 + (t * (1 / MathFunctions.sqrt (1 + t * t))) ^ 2 = 1 := by + have h1 : MathFunctions.sqrt (1 + t * t) = Real.sqrt (1 + t ^ 2) := by + rw [show (1 : ℝ) + t * t = 1 + t ^ 2 from by ring]; rfl + rw [h1] + exact givens_normSq t + +/-! ## The loop invariant -/ + +/-- The Jacobi loop invariant relative to the input `A₀`: the running `V` is orthogonal and the +running pair `(A, V)` satisfies the orthogonal-similarity identity `A₀ = V · A · Vᵀ`. -/ +def JacInv (A0 : Matrix (Fin n) (Fin n) ℝ) (st : Array (Array ℝ) × Array (Array ℝ)) : Prop := + (toM n st.2)ᵀ * toM n st.2 = 1 ∧ A0 = toM n st.2 * toM n st.1 * (toM n st.2)ᵀ + +/-- One orthogonal-similarity update by an orthogonal `J` preserves the invariant. 
-/ +theorem jacInv_step {A0 : Matrix (Fin n) (Fin n) ℝ} {A V J : Array (Array ℝ)} + (hJ : (toM n J)ᵀ * toM n J = 1) (h : JacInv A0 (A, V)) : + JacInv A0 (Spec.arrMatMul n (Spec.arrTr n J) (Spec.arrMatMul n A J), Spec.arrMatMul n V J) := by + obtain ⟨hVo, hsim⟩ := h + simp only [JacInv] at hVo hsim ⊢ + have hJJ : toM n J * (toM n J)ᵀ = 1 := mul_eq_one_comm.mp hJ + refine ⟨?_, ?_⟩ + · rw [toM_matMul, Matrix.transpose_mul] + calc (toM n J)ᵀ * (toM n V)ᵀ * (toM n V * toM n J) + = (toM n J)ᵀ * ((toM n V)ᵀ * toM n V) * toM n J := by + simp only [Matrix.mul_assoc] + _ = (toM n J)ᵀ * toM n J := by rw [hVo, Matrix.mul_one] + _ = 1 := hJ + · simp only [toM_matMul, toM_tr, Matrix.transpose_mul] + rw [hsim] + have e1 : (toM n V * toM n J) * ((toM n J)ᵀ * (toM n A * toM n J)) * ((toM n J)ᵀ * (toM n V)ᵀ) + = toM n V * (toM n J * (toM n J)ᵀ) * toM n A * (toM n J * (toM n J)ᵀ) * (toM n V)ᵀ := by + simp only [Matrix.mul_assoc] + rw [e1, hJJ] + simp only [Matrix.mul_one, Matrix.mul_assoc] + +/-- One Jacobi rotation preserves the invariant (the parameters always give an orthogonal `J`, and +the no-op branch is trivial). -/ +theorem jacInv_rotate {A0 : Matrix (Fin n) (Fin n) ℝ} (p q : Nat) (hp : p < n) (hq : q < n) + (hpq : p ≠ q) {st : Array (Array ℝ) × Array (Array ℝ)} (h : JacInv A0 st) : + JacInv A0 (Spec.arrJacobiRotate n st.1 st.2 p q) := by + unfold Spec.arrJacobiRotate + extract_lets apq + split + · exact jacInv_step (givens_orthogonal p q hp hq hpq _ _ (code_givens_normSq _)) h + · exact h + +/-- Every pair produced by `jacobiPairs n` has `p < q < n`. -/ +theorem jacobiPairs_spec {pq : Nat × Nat} (h : pq ∈ Spec.jacobiPairs n) : + pq.1 < pq.2 ∧ pq.2 < n := by + unfold Spec.jacobiPairs at h + simp only [List.mem_flatMap, List.mem_filterMap, List.mem_range] at h + obtain ⟨p, _, q, hq, hcond⟩ := h + split at hcond + · rename_i hlt + simp only [Option.some.injEq] at hcond + rw [← hcond] + exact ⟨hlt, hq⟩ + · simp at hcond + +/-- One Jacobi sweep preserves the invariant. 
-/ +theorem jacInv_sweep {A0 : Matrix (Fin n) (Fin n) ℝ} {st : Array (Array ℝ) × Array (Array ℝ)} + (h : JacInv A0 st) : JacInv A0 (Spec.arrJacobiSweep n st) := by + unfold Spec.arrJacobiSweep + refine List.foldlRecOn _ _ h ?_ + intro b hb pq hmem + obtain ⟨hlt, hqn⟩ := jacobiPairs_spec hmem + exact jacInv_rotate pq.1 pq.2 (Nat.lt_trans hlt hqn) hqn (Nat.ne_of_lt hlt) hb + +/-- **The whole Jacobi run preserves the invariant.** Starting from `(A, I)`, after any number of +sweeps the accumulated `V` is orthogonal and `toM A = V · Af · Vᵀ`. -/ +theorem jacInv_run (A : Array (Array ℝ)) (sweeps : Nat) : + JacInv (toM n A) (Spec.arrJacobiRun n A sweeps) := by + unfold Spec.arrJacobiRun + refine List.foldlRecOn _ _ ?_ ?_ + · refine ⟨?_, ?_⟩ + · show (toM n (Spec.arrId n))ᵀ * toM n (Spec.arrId n) = 1 + rw [toM_id]; simp + · show toM n A = toM n (Spec.arrId n) * toM n A * (toM n (Spec.arrId n))ᵀ + rw [toM_id]; simp + · intro b hb _ _ + exact jacInv_sweep hb + +/-! ## Discharging the residual-certificate hypotheses for the real solver output -/ + +/-- View of the input tensor `A` as a `Matrix`. -/ +noncomputable def inputMat (A : Spec.Tensor ℝ (.dim n (.dim n .scalar))) : + Matrix (Fin n) (Fin n) ℝ := + Matrix.of (Spec.toMatFn A) + +/-- The eigenvector matrix `V` produced by the Jacobi run on `A` (columns are the eigenvectors). -/ +noncomputable def jacobiV (A : Spec.Tensor ℝ (.dim n (.dim n .scalar))) (sweeps : Nat) : + Matrix (Fin n) (Fin n) ℝ := + toM n (Spec.arrJacobiRun n (Spec.matToArr (Spec.toMatFn A)) sweeps).2 + +/-- The rotated matrix `Af = Vᵀ A V` produced by the Jacobi run (diagonal in the zero-residual +limit; its diagonal holds the eigenvalues). -/ +noncomputable def jacobiAf (A : Spec.Tensor ℝ (.dim n (.dim n .scalar))) (sweeps : Nat) : + Matrix (Fin n) (Fin n) ℝ := + toM n (Spec.arrJacobiRun n (Spec.matToArr (Spec.toMatFn A)) sweeps).1 + +/-- `toM` of the materialised input function is the input matrix. 
-/ +theorem toM_matToArr (X : Fin n → Fin n → ℝ) : toM n (Spec.matToArr X) = Matrix.of X := by + ext i j + simp only [toM_apply, Matrix.of_apply] + unfold Spec.matToArr + rw [arrGet_ofFn₂] + +/-- **Discharged premise 1 — orthogonality.** The eigenvector matrix the Jacobi solver returns is +orthogonal, with no convergence hypothesis. -/ +theorem jacobi_orthogonal (A : Spec.Tensor ℝ (.dim n (.dim n .scalar))) (sweeps : Nat) : + (jacobiV A sweeps)ᵀ * jacobiV A sweeps = 1 := + (jacInv_run (Spec.matToArr (Spec.toMatFn A)) sweeps).1 + +/-- **Discharged premise 2 — orthogonal similarity.** The input equals `V · Af · Vᵀ` exactly, with no +convergence hypothesis. -/ +theorem jacobi_similarity (A : Spec.Tensor ℝ (.dim n (.dim n .scalar))) (sweeps : Nat) : + inputMat A = jacobiV A sweeps * jacobiAf A sweeps * (jacobiV A sweeps)ᵀ := by + have h := (jacInv_run (n := n) (Spec.matToArr (Spec.toMatFn A)) sweeps).2 + rw [toM_matToArr] at h + exact h + +/-- **Unconditional residual identity.** Reconstructing with the diagonal of `Af` leaves exactly the +orthogonal conjugation of `Af`'s off-diagonal part — now stated about the real `symEigJacobiSpec` +output rather than under a hypothesis. -/ +theorem symEigJacobi_reconstruction_residual (A : Spec.Tensor ℝ (.dim n (.dim n .scalar))) + (sweeps : Nat) : + inputMat A + - jacobiV A sweeps * Matrix.diagonal (fun i => jacobiAf A sweeps i i) * (jacobiV A sweeps)ᵀ + = jacobiV A sweeps * offDiagonal (jacobiAf A sweeps) * (jacobiV A sweeps)ᵀ := + symEig_reconstruction_residual (jacobi_similarity A sweeps) + +/-- **Unconditional Frobenius residual certificate.** The squared reconstruction error equals the +squared off-diagonal mass of `Af` — unconditionally for the real solver output. 
-/ +theorem symEigJacobi_frobenius_residual (A : Spec.Tensor ℝ (.dim n (.dim n .scalar))) + (sweeps : Nat) : + ((inputMat A + - jacobiV A sweeps * Matrix.diagonal (fun i => jacobiAf A sweeps i i) + * (jacobiV A sweeps)ᵀ)ᵀ + * (inputMat A + - jacobiV A sweeps * Matrix.diagonal (fun i => jacobiAf A sweeps i i) + * (jacobiV A sweeps)ᵀ)).trace + = ((offDiagonal (jacobiAf A sweeps))ᵀ * offDiagonal (jacobiAf A sweeps)).trace := + symEig_frobenius_residual (jacobi_orthogonal A sweeps) (jacobi_similarity A sweeps) + +/-- **Unconditional correctness in the zero-residual limit.** When the rotated matrix is diagonal, +the solver output is an exact symmetric eigendecomposition of the input — no hypotheses beyond +diagonality. -/ +theorem symEigJacobi_isSymEig_of_diagonal (A : Spec.Tensor ℝ (.dim n (.dim n .scalar))) + (sweeps : Nat) + (hdiag : jacobiAf A sweeps = Matrix.diagonal (fun i => jacobiAf A sweeps i i)) : + IsSymEig (inputMat A) (fun i => jacobiAf A sweeps i i) (jacobiV A sweeps) := + isSymEig_of_diagonal (jacobi_orthogonal A sweeps) (jacobi_similarity A sweeps) hdiag + +/-- The eigenvector matrix read back from the public `symEigJacobiSpec` output is `jacobiV`, so the +theorems above are statements about the actual returned `V`. -/ +theorem symEigJacobiSpec_V_eq (A : Spec.Tensor ℝ (.dim n (.dim n .scalar))) (sweeps : Nat) : + Matrix.of (fun i j => Spec.get2 (Spec.symEigJacobiSpec A sweeps).2 i j) = jacobiV A sweeps := + rfl + +/-! ## Example: the residual certificate is now unconditional + +`symEig_frobenius_residual` and `isSymEig_of_diagonal` used to *take* `Vᵀ V = 1` and +`A = V · Af · Vᵀ` as hypotheses. For the real `symEigJacobiSpec` output those are now theorems +(`jacobi_orthogonal`, `jacobi_similarity`), so the certificate follows from the input and sweep +count alone — no premises to discharge at the call site. -/ + +/-- The Frobenius residual identity for a `3×3` Jacobi run with `8` sweeps, with no hypotheses. 
-/ +example (A : Spec.Tensor ℝ (.dim 3 (.dim 3 .scalar))) : + ((inputMat A + - jacobiV A 8 * Matrix.diagonal (fun i => jacobiAf A 8 i i) * (jacobiV A 8)ᵀ)ᵀ + * (inputMat A + - jacobiV A 8 * Matrix.diagonal (fun i => jacobiAf A 8 i i) * (jacobiV A 8)ᵀ)).trace + = ((offDiagonal (jacobiAf A 8))ᵀ * offDiagonal (jacobiAf A 8)).trace := + symEigJacobi_frobenius_residual A 8 + +/-- In the zero-residual limit the output is a genuine eigendecomposition; the only hypothesis is +diagonality of the rotated matrix — orthogonality and the orthogonal similarity come for free. -/ +example (A : Spec.Tensor ℝ (.dim 3 (.dim 3 .scalar))) + (h : jacobiAf A 8 = Matrix.diagonal (fun i => jacobiAf A 8 i i)) : + IsSymEig (inputMat A) (fun i => jacobiAf A 8 i i) (jacobiV A 8) := + symEigJacobi_isSymEig_of_diagonal A 8 h + +end Spec.Factorization diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index d8e51b5..de1522d 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -128,6 +128,36 @@ zero-residual limit, `isSymEig_of_diagonal` shows the solver output `(diag Af, V `NN/Examples/Factorization` are concrete instances of this certificate: they bound the off-diagonal mass on specific matrices. +# Faithfulness of the Jacobi run: orthogonality and orthogonal similarity + +The three certificate theorems above are stated *conditionally* — they take the orthogonality +`Vᵀ V = 1` and the orthogonal-similarity identity `A = V · Af · Vᵀ` as hypotheses. Both are +*exact, finite, a-priori* facts about the executable `arrJacobiRun`, needing no convergence theory, +and +[`NN.Proofs.Tensor.Basic.FactorizationsJacobi`](https://github.com/lean-dojo/TorchLean/blob/main/NN/Proofs/Tensor/Basic/FactorizationsJacobi.lean) +proves them, discharging the hypotheses for the real solver output. 
+ +The development bridges the strict `Array (Array ℝ)` representation the loop runs over to Mathlib +`Matrix` via `toM`, with `toM_matMul`/`toM_tr`/`toM_id` showing the array operations realise the +matrix ones. The single genuinely-new ingredient is `givens_orthogonal`: each rotation +`arrGivens n p q c s` with `c² + s² = 1` is an orthogonal matrix (`Jᵀ J = 1`), proved by reducing the +column dot products to the `c² + s² = 1` identity (`givens_normSq`) for the diagonal blocks and to +orthogonality of distinct standard basis vectors elsewhere. From it, the loop invariant +`JacInv A₀ (A, V) := Vᵀ V = 1 ∧ A₀ = V · A · Vᵀ` is preserved by one rotation (`jacInv_rotate` — the +no-op branch trivially, the rotating branch because conjugating by an orthogonal `J` cancels in +`J Jᵀ = 1`), hence by a whole sweep (`jacInv_sweep`, a `List.foldlRecOn` over `jacobiPairs`) and the +whole run (`jacInv_run`, starting from `(A, I)` where the invariant is immediate). + +Specialised to the `symEigJacobiSpec` output, this gives the two premises as theorems with no +hypotheses: `jacobi_orthogonal` (`Vᵀ V = 1`) and `jacobi_similarity` (`A = V · Af · Vᵀ`). +Feeding them into the certificate yields the *unconditional* restatements +`symEigJacobi_reconstruction_residual`, `symEigJacobi_frobenius_residual`, and +`symEigJacobi_isSymEig_of_diagonal`: the residual identity and the zero-residual-limit correctness now +hold for the actual returned `(Λ, V)` outright. So the returned `V` is a genuine orthogonal matrix and +`Af` a genuine orthogonal similarity of the input *regardless of how far the sweeps have converged* — +the only thing the residual certificate still defers to runtime is the *size* of the off-diagonal +mass, never the algebraic faithfulness of the decomposition. + # Exact QR reconstruction The QR factorization admits the same treatment. `qr_mul_eq` (in the same file) proves that for an @@ -166,8 +196,14 @@ into a future Mathlib matrix-level QR contribution. 
 # What remains

-With Cholesky and QR fully reconstructed (`A = L · Lᵀ`, `A = Q · R`, `Qᵀ Q = 1`), the only properties
-not available as a-priori theorems are the *iterative* ones: full diagonalization for the cyclic Jacobi
-eigensolver and the SVD built on it. Mathlib v4.30.0 has no Jacobi convergence theory, so those remain
-captured by the exact a-posteriori residual certificate above, never by `sorry`. The specification-level
-facts the kernel methods rely on are independent of that step, so the CHD foundation is complete.
+With Cholesky and QR fully reconstructed (`A = L · Lᵀ`, `A = Q · R`, `Qᵀ Q = 1`), and the Jacobi run
+now proved faithful — `V` orthogonal and `A = V · Af · Vᵀ` exactly, so the residual certificate holds
+*unconditionally* for the real solver output — the single property still not available as an a-priori
+theorem is the *rate*: that finitely many cyclic-Jacobi sweeps drive `Af`'s off-diagonal mass to zero.
+That is the research-grade Forsythe–Henrici / Schönhage convergence result for cyclic (rather than
+classical, largest-pivot) Jacobi, and Mathlib v4.30.0 has no Jacobi convergence theory, so it remains
+captured by the exact a-posteriori residual certificate above — bounded numerically by the `assertLt`
+checks on concrete inputs — never by `sorry`. Everything else is exact: the algebraic faithfulness of
+the decomposition (orthogonality, orthogonal similarity, the residual identity, and correctness in the
+zero-residual limit) is proved, and the specification-level facts the kernel methods rely on are
+independent of the convergence step, so the CHD foundation is complete.
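
The invariant step and the residual identity summarised above can be sketched directly on Mathlib matrices, without the `Array (Array ℝ)` bridge. This is a minimal standalone sketch, not the code in `FactorizationsJacobi.lean`. It assumes `import Mathlib`, states both facts as anonymous `example`s over plain `Matrix (Fin n) (Fin n) ℝ` values (the names `A₀`, `A`, `Af`, `V`, `J` are illustrative), writes the off-diagonal part as `Af - Matrix.diagonal (fun i => Af i i)` instead of going through the repository's `offDiagonal`, and uses `mul_eq_one_comm` the same way the patch does, which may depend on the Mathlib version in use.

```lean
import Mathlib

open Matrix

/-- Sketch of the invariant step: if `V` is orthogonal and `A₀ = V · A · Vᵀ`, then after
conjugating by an orthogonal `J` the pair `(Jᵀ A J, V J)` satisfies the same two facts. -/
example {n : ℕ} (A₀ A V J : Matrix (Fin n) (Fin n) ℝ)
    (hJ : Jᵀ * J = 1) (hV : Vᵀ * V = 1) (hsim : A₀ = V * A * Vᵀ) :
    (V * J)ᵀ * (V * J) = 1 ∧ A₀ = (V * J) * (Jᵀ * A * J) * (V * J)ᵀ := by
  -- For square matrices a left inverse is also a right inverse.
  have hJJ : J * Jᵀ = 1 := mul_eq_one_comm.mp hJ
  constructor
  · -- Orthogonality of the accumulated `V · J`.
    calc (V * J)ᵀ * (V * J)
        = Jᵀ * (Vᵀ * V) * J := by
          rw [Matrix.transpose_mul]
          simp only [Matrix.mul_assoc]
      _ = Jᵀ * J := by rw [hV, Matrix.mul_one]
      _ = 1 := hJ
  · -- Similarity: the inner `J · Jᵀ` factors cancel.
    calc A₀ = V * A * Vᵀ := hsim
      _ = V * (J * Jᵀ) * A * (J * Jᵀ) * Vᵀ := by
          simp only [hJJ, Matrix.mul_one]
      _ = (V * J) * (Jᵀ * A * J) * (V * J)ᵀ := by
          rw [Matrix.transpose_mul]
          simp only [Matrix.mul_assoc]

/-- Sketch of the residual identity: under `A = V · Af · Vᵀ`, reconstructing from the
diagonal of `Af` leaves exactly the conjugated off-diagonal part. -/
example {n : ℕ} (A Af V : Matrix (Fin n) (Fin n) ℝ) (hsim : A = V * Af * Vᵀ) :
    A - V * Matrix.diagonal (fun i => Af i i) * Vᵀ
      = V * (Af - Matrix.diagonal (fun i => Af i i)) * Vᵀ := by
  rw [hsim, Matrix.mul_sub, Matrix.sub_mul]
```

Neither statement mentions the rotation angle or the number of sweeps, which is why both hold after any finite run. Only the size of the off-diagonal part is left to the runtime check.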
From dad1aa54aedcd57a07b6bb6f3403bdf607a3c9b4 Mon Sep 17 00:00:00 2001
From: Nicolas Rouquette
Date: Sun, 31 May 2026 08:27:52 -0700
Subject: [PATCH 07/22] Prove per-rotation Jacobi off-diagonal decrease (Tier 2) + reviewer examples
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add the classical Jacobi progress identity as an exact, finite theorem over ℝ:
conjugating a symmetric A by the Givens rotation that annihilates pivot (p,q)
drops the squared off-diagonal mass by exactly 2·A[p,q]²
(`jacobi_off_decrease`).

New: NN/Proofs/Tensor/Basic/FactorizationsJacobiDecrease.lean
- frobSq/diagSq/offSq mass machinery + frobSq_eq_diagSq_add_offSq.
- frobSq_orthogonal_conj: orthogonal similarity preserves total Frobenius
  mass, so driving the off-diagonal down ≡ driving the diagonal up.
- givens_conj_pp/qq/pq/other: explicit conjugation entries via bilinear
  support lemmas; the 2×2 block-Frobenius identity is frobSq_orthogonal_conj
  specialised to Fin 2 (no hand-tuned linear_combination coefficients).
- jacobi_off_decrease: the per-rotation decrease, under symmetry + the pivot
  annihilation (the defining equation the Golub–Van Loan angle solves; the
  explicit pivot is givens_conj_pq).

Examples: NN/Examples/Factorization/JacobiDecrease.lean (+ Common helpers)
- Positive: one rotation takes off-diagonal mass 6 → 4 = 6 − 2·1², pivot
  annihilated to 0, total mass conserved at 35 — all to |Δ| = 0.
- Negative controls: a wrong-angle (orthogonal) rotation misses the decrease;
  a non-orthogonal conjugation breaks Frobenius-mass invariance.

Blueprint: new "Per-rotation progress" section; "What remains" narrowed to the
aggregate cyclic rate (Forsythe–Henrici/Schönhage), since per-rotation progress
is now exact.

Sorry-free; builds green (NN.Proofs.Tensor.Basic, NN.Examples.Factorization,
blueprint); repo_lint clean on all new/changed files.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 NN/Examples/Factorization.lean                |   4 +
 NN/Examples/Factorization/Common.lean         |  28 ++
 NN/Examples/Factorization/JacobiDecrease.lean | 111 ++++++
 NN/Proofs/Tensor/Basic.lean                   |   1 +
 .../Basic/FactorizationsJacobiDecrease.lean   | 316 ++++++++++++++++++
 .../Ch4_Verification/Factorizations.lean      |  54 ++-
 6 files changed, 503 insertions(+), 11 deletions(-)
 create mode 100644 NN/Examples/Factorization/JacobiDecrease.lean
 create mode 100644 NN/Proofs/Tensor/Basic/FactorizationsJacobiDecrease.lean

diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean
index 5bd7a28..0dc3a6d 100644
--- a/NN/Examples/Factorization.lean
+++ b/NN/Examples/Factorization.lean
@@ -11,6 +11,7 @@ public import NN.Examples.Factorization.Cholesky
 public import NN.Examples.Factorization.QR
 public import NN.Examples.Factorization.SymEig
 public import NN.Examples.Factorization.SVD
+public import NN.Examples.Factorization.JacobiDecrease

 /-!
 # Matrix factorization examples
@@ -30,6 +31,9 @@ factorization misbehaves.
   verified numerically.
 - `SVD` — `A = U · diag(σ) · Vᵀ`, `Vᵀ V = I`; **negative control**: a permuted `σ` fails to
   reconstruct.
+- `JacobiDecrease` — the per-rotation progress identity `‖offDiag(Jᵀ A J)‖² = ‖offDiag A‖² − 2·A[p,q]²`
+  (`jacobi_off_decrease`) and Frobenius-mass invariance; **negative controls**: a wrong-angle rotation
+  misses the decrease, a non-orthogonal one breaks mass invariance.
Both **positive** checks (a valid factorization reconstructs to `err ≈ 0`) and **negative controls** (the same metric reports a large error / `NaN` when a hypothesis is violated) are included, so a diff --git a/NN/Examples/Factorization/Common.lean b/NN/Examples/Factorization/Common.lean index a5d2074..94a6d4a 100644 --- a/NN/Examples/Factorization/Common.lean +++ b/NN/Examples/Factorization/Common.lean @@ -71,6 +71,34 @@ def offDiagFrobSq {n : Nat} (M : Spec.Tensor Float (.dim n (.dim n .scalar))) : (List.finRange n).foldl (fun a j => if i.val == j.val then a else let x := Spec.get2 M i j; a + x * x) acc) 0.0 +/-- Total squared Frobenius mass `Σ_{i,j} M_ij²` of a square matrix (off-diagonal + diagonal mass). -/ +def totalFrobSq {n : Nat} (M : Spec.Tensor Float (.dim n (.dim n .scalar))) : Float := + (List.finRange n).foldl (fun acc i => + (List.finRange n).foldl + (fun a j => let x := Spec.get2 M i j; a + x * x) acc) 0.0 + +/-- View a square `Float` matrix tensor as a strict array matrix (the representation the Jacobi +iteration runs over). -/ +def arrOfMat {n : Nat} (A : Spec.Tensor Float (.dim n (.dim n .scalar))) : Array (Array Float) := + Spec.matToArr (Spec.toMatFn A) + +/-- Read a strict array matrix back as a square `Float` matrix tensor. -/ +def matOfArr {n : Nat} (M : Array (Array Float)) : Spec.Tensor Float (.dim n (.dim n .scalar)) := + Spec.ofMatFn (fun i j => Spec.arrGet M i.val j.val) + +/-- Apply the **annihilating** Jacobi rotation at pivot `(p, q)`: returns `A' = Jᵀ A J` for the +Givens rotation whose angle zeroes `A'[p,q]` (the rotation the solver actually performs). -/ +def jacobiRotateAt {n : Nat} (A : Spec.Tensor Float (.dim n (.dim n .scalar))) (p q : Nat) : + Spec.Tensor Float (.dim n (.dim n .scalar)) := + matOfArr (Spec.arrJacobiRotate n (arrOfMat A) (Spec.arrId n) p q).1 + +/-- Apply an **arbitrary** Givens conjugation `A' = Jᵀ A J` with caller-chosen `(c, s)` at `(p, q)` +(not necessarily the annihilating angle, nor even orthogonal). 
Used for negative controls. -/ +def givensConjAt {n : Nat} (A : Spec.Tensor Float (.dim n (.dim n .scalar))) (p q : Nat) + (c s : Float) : Spec.Tensor Float (.dim n (.dim n .scalar)) := + let J := Spec.arrGivens n p q c s + matOfArr (Spec.arrMatMul n (Spec.arrTr n J) (Spec.arrMatMul n (arrOfMat A) J)) + /-- Shared tolerance for reconstruction-error assertions. -/ def tol : Float := 1e-6 diff --git a/NN/Examples/Factorization/JacobiDecrease.lean b/NN/Examples/Factorization/JacobiDecrease.lean new file mode 100644 index 0000000..763aa18 --- /dev/null +++ b/NN/Examples/Factorization/JacobiDecrease.lean @@ -0,0 +1,111 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Examples.Factorization.Common +meta import NN.Examples.Factorization.Common + +/-! +# Example: the cyclic Jacobi sweep makes progress (per-rotation off-diagonal decrease) + +These checks corroborate the **Tier 2** development in +`NN.Proofs.Tensor.Basic.FactorizationsJacobiDecrease`: the exact per-rotation identity behind Jacobi +convergence. For a symmetric `A`, conjugating by the Givens rotation that annihilates the pivot +`(p, q)` removes exactly `2 · A[p,q]²` of squared off-diagonal mass: + +`‖offDiag(Jᵀ A J)‖² = ‖offDiag A‖² − 2 · A[p,q]²` (`jacobi_off_decrease`) + +while preserving the total Frobenius mass `‖A‖²` (`frobSq_orthogonal_conj`). + +The checks exhibit both halves of the theorem, *and* its hypotheses biting (negative controls): + +* **Positive — exact decrease.** One annihilating rotation drops the off-diagonal mass by precisely + `2 · A[p,q]²` (independent computations of the two sides agree). +* **Positive — pivot annihilated.** The rotated `A'[p,q]` is `≈ 0` (the defining property of the + angle; this is the `hannih` hypothesis holding on the concrete matrix). 
+* **Positive — Frobenius mass preserved.** `‖A'‖² = ‖A‖²`: the orthogonal similarity moves mass from + the off-diagonal *onto the diagonal* without creating or destroying any. +* **Negative — the angle matters.** A *wrong-angle* (but still orthogonal) Givens rotation fails to + achieve the `2 · A[p,q]²` decrease: the annihilation hypothesis `hannih` is genuinely needed. +* **Negative — orthogonality matters.** A *non-orthogonal* conjugation (`c² + s² ≠ 1`) does **not** + preserve `‖A‖²`, so it is not a similarity and the whole argument collapses without + `givens_orthogonal`. +-/ + +@[expose] public section + + +namespace NN.Examples.Factorization.JacobiDecrease + +/-- A symmetric `3×3` test matrix; the `(0,1)` pivot is `A[0,1] = 1`, so the predicted off-diagonal +drop from annihilating it is `2 · 1² = 2`. -/ +def A : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := + mkMat [[2, 1, 1], + [1, 3, 1], + [1, 1, 4]] + +/-- The pivot we annihilate. -/ +def p : Nat := 0 +def q : Nat := 1 + +/-- `A' = Jᵀ A J` after the annihilating rotation at `(0,1)`. -/ +def A' : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := jacobiRotateAt A p q + +/-- Off-diagonal mass before and after the rotation. -/ +def offBefore : Float := offDiagFrobSq A +def offAfter : Float := offDiagFrobSq A' + +/-- The squared pivot `A[0,1]²` and the predicted post-rotation off-diagonal mass. -/ +def pivotSq : Float := let x := Spec.get2 A ⟨p, by decide⟩ ⟨q, by decide⟩; x * x +def offPredicted : Float := offBefore - 2 * pivotSq + +#eval IO.println s!"off-diagonal mass: before = {offBefore}, after = {offAfter}, predicted = {offPredicted}" +#eval IO.println s!"pivot A[0,1] = {Spec.get2 A ⟨p, by decide⟩ ⟨q, by decide⟩}, rotated A'[0,1] = {Spec.get2 A' ⟨p, by decide⟩ ⟨q, by decide⟩}" + +-- Positive — the exact per-rotation decrease `‖offDiag A'‖² = ‖offDiag A‖² − 2·A[p,q]²` +-- (`jacobi_off_decrease`). The two sides are computed independently and shown to agree. 
+#eval assertApproxEq "Jacobi(1 rot) off-diagonal decrease = 2·A[p,q]²" offAfter offPredicted + +-- Positive — the pivot really is annihilated (the `hannih` hypothesis holds here). +#eval assertLt "Jacobi rotation annihilates the pivot A'[p,q] ≈ 0" + (Float.abs (Spec.get2 A' ⟨p, by decide⟩ ⟨q, by decide⟩)) + +-- Positive — total Frobenius mass is preserved (`frobSq_orthogonal_conj`): the orthogonal similarity +-- shifts mass from the off-diagonal onto the diagonal but conserves the total. +#eval assertApproxEq "Jacobi rotation preserves total Frobenius mass ‖A'‖² = ‖A‖²" + (totalFrobSq A') (totalFrobSq A) + +/-! ## Negative control 1: the rotation angle matters + +A wrong-angle (but orthogonal, `c² + s² = 1`) Givens rotation does not annihilate the pivot, so the +exact `2 · A[p,q]²` decrease fails. This is the numerical teeth of the `hannih` hypothesis. -/ + +/-- A fixed orthogonal rotation with the *wrong* angle (`c = 0.6, s = 0.8`, so `c² + s² = 1`). -/ +def Awrong : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := givensConjAt A p q 0.6 0.8 +def offWrong : Float := offDiagFrobSq Awrong + +#eval IO.println s!"wrong-angle off-diagonal mass = {offWrong} (predicted-if-annihilating = {offPredicted})" + +-- The wrong angle misses the predicted decrease by a wide margin. +#eval assertGe "wrong-angle rotation fails the 2·A[p,q]² decrease (annihilation hypothesis needed)" + (Float.abs (offWrong - offPredicted)) 0.5 + +/-! ## Negative control 2: orthogonality matters + +A non-orthogonal conjugation (`c² + s² ≠ 1`) is not a similarity, so it does **not** preserve the +total Frobenius mass — `frobSq_orthogonal_conj` genuinely needs `givens_orthogonal`. -/ + +/-- A non-orthogonal "rotation" (`c = 0.6, s = 0.6`, so `c² + s² = 0.72 ≠ 1`). 
-/ +def Askew : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := givensConjAt A p q 0.6 0.6 + +#eval IO.println s!"non-orthogonal conj total mass = {totalFrobSq Askew} (original = {totalFrobSq A})" + +-- A non-orthogonal conjugation changes the total Frobenius mass. +#eval assertGe "non-orthogonal conjugation breaks Frobenius-mass invariance (orthogonality needed)" + (Float.abs (totalFrobSq Askew - totalFrobSq A)) 0.5 + +end NN.Examples.Factorization.JacobiDecrease diff --git a/NN/Proofs/Tensor/Basic.lean b/NN/Proofs/Tensor/Basic.lean index 33ec81a..f2e8ffa 100644 --- a/NN/Proofs/Tensor/Basic.lean +++ b/NN/Proofs/Tensor/Basic.lean @@ -13,6 +13,7 @@ public import NN.Proofs.Tensor.Basic.Factorizations public import NN.Proofs.Tensor.Basic.FactorizationsReconstruction public import NN.Proofs.Tensor.Basic.FactorizationsOrthonormal public import NN.Proofs.Tensor.Basic.FactorizationsJacobi +public import NN.Proofs.Tensor.Basic.FactorizationsJacobiDecrease public import NN.Proofs.Tensor.Basic.BoundsNorms public import NN.Proofs.Tensor.Basic.Algebra diff --git a/NN/Proofs/Tensor/Basic/FactorizationsJacobiDecrease.lean b/NN/Proofs/Tensor/Basic/FactorizationsJacobiDecrease.lean new file mode 100644 index 0000000..9c86e70 --- /dev/null +++ b/NN/Proofs/Tensor/Basic/FactorizationsJacobiDecrease.lean @@ -0,0 +1,316 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Proofs.Tensor.Basic.FactorizationsJacobi + +/-! +# The cyclic Jacobi sweep makes progress (per-rotation off-diagonal decrease) + +[`FactorizationsJacobi`](./FactorizationsJacobi.lean) made the residual certificate *unconditional*: +the solver output always satisfies orthogonality and the orthogonal-similarity `A = V · Af · Vᵀ`, so +the reconstruction error equals the off-diagonal mass `‖offDiag(Af)‖²` of the rotated matrix. What +that certificate does **not** say is that the off-diagonal mass actually *goes down*. 
This file +proves the classical Jacobi progress identity, which is exactly that statement at the level of a +single rotation: + +> If a symmetric `A` is conjugated by the Givens rotation that annihilates the pivot `(p, q)`, the +> squared off-diagonal mass drops by exactly `2 · A[p,q]²`. + +The two ingredients, both *exact* over `ℝ`: + +* `frobSq_orthogonal_conj` — orthogonal similarity preserves the total Frobenius mass + `‖A‖² = trace(Aᵀ A)`. Combined with `frobSq_eq_diagSq_add_offSq` (`‖A‖² = diag-mass + off-mass`), + driving the off-diagonal mass down is the *same thing* as driving the diagonal mass up. +* `givens_conj_*` — the explicit entries of `Jᵀ A J` in the rotation plane. A Givens conjugation only + touches rows/columns `p, q`, so the diagonal mass changes by `A'[p,p]² + A'[q,q]² − A[p,p]² − + A[q,q]²`, and the `2×2` block algebra (with `c² + s² = 1` and the annihilation `A'[p,q] = 0`) turns + that into `2 · A[p,q]²`. + +The pivot-annihilation is taken as a hypothesis (`hannih`): it is the defining property of the +rotation angle. `givens_conj_pq` gives the explicit value of that pivot entry after the rotation, so +`hannih` is the concrete equation `c·s·A[p,p] + c²·A[p,q] − s²·A[q,p] − c·s·A[q,q] = 0` that the +Golub–Van Loan parameters the code uses are chosen to solve. Scope, as elsewhere in this development: +this is the per-rotation decrease, the exact finite fact behind convergence; the *rate* over a whole +sweep (and hence the number of sweeps needed) remains the research-grade piece Mathlib has no theory +for. +-/ + +@[expose] public section + +namespace Spec.Factorization + +open Matrix +open scoped BigOperators + +variable {n : Nat} + +/-! ## Frobenius mass: total, diagonal, off-diagonal -/ + +/-- Total squared Frobenius mass `‖M‖² = trace(Mᵀ M) = ∑ᵢⱼ M[i,j]²`. -/ +def frobSq (M : Matrix (Fin n) (Fin n) ℝ) : ℝ := (Mᵀ * M).trace + +/-- Squared diagonal mass `∑ᵢ M[i,i]²`. 
-/ +def diagSq (M : Matrix (Fin n) (Fin n) ℝ) : ℝ := ∑ i, (M i i) ^ 2 + +/-- Squared off-diagonal mass `‖offDiag M‖² = trace((offDiag M)ᵀ (offDiag M))`. This is the quantity +the residual certificate equates with the reconstruction error. -/ +def offSq (M : Matrix (Fin n) (Fin n) ℝ) : ℝ := + ((offDiagonal M)ᵀ * offDiagonal M).trace + +/-- `‖M‖²` as the sum of all squared entries. -/ +theorem frobSq_eq_sum (M : Matrix (Fin n) (Fin n) ℝ) : + frobSq M = ∑ i, ∑ j, (M i j) ^ 2 := by + unfold frobSq + rw [Matrix.trace] + simp only [Matrix.diag_apply, Matrix.mul_apply, Matrix.transpose_apply] + rw [Finset.sum_comm] + exact Finset.sum_congr rfl (fun i _ => Finset.sum_congr rfl (fun j _ => by ring)) + +/-- The off-diagonal part has entries `M[i,j]` off the diagonal and `0` on it. -/ +theorem offDiagonal_apply (M : Matrix (Fin n) (Fin n) ℝ) (i j : Fin n) : + offDiagonal M i j = if i = j then 0 else M i j := by + unfold offDiagonal + rw [Matrix.sub_apply] + by_cases h : i = j + · subst h; simp + · rw [Matrix.diagonal_apply_ne _ h, sub_zero, if_neg h] + +/-- `‖offDiag M‖²` as the sum of squared off-diagonal entries. 
-/ +theorem offSq_eq_sum (M : Matrix (Fin n) (Fin n) ℝ) : + offSq M = ∑ i, ∑ j, if i = j then 0 else (M i j) ^ 2 := by + unfold offSq + rw [Matrix.trace] + simp only [Matrix.diag_apply, Matrix.mul_apply, Matrix.transpose_apply] + rw [Finset.sum_comm] + refine Finset.sum_congr rfl (fun i _ => Finset.sum_congr rfl (fun j _ => ?_)) + rw [offDiagonal_apply] + by_cases h : i = j + · subst h; simp + · simp only [if_neg h]; ring + +/-- **The Frobenius mass splits as diagonal mass plus off-diagonal mass.** -/ +theorem frobSq_eq_diagSq_add_offSq (M : Matrix (Fin n) (Fin n) ℝ) : + frobSq M = diagSq M + offSq M := by + rw [frobSq_eq_sum, offSq_eq_sum, diagSq, ← Finset.sum_add_distrib] + refine Finset.sum_congr rfl (fun i _ => ?_) + have hsplit : ∀ j : Fin n, + (M i j) ^ 2 = (if i = j then (M i j) ^ 2 else 0) + (if i = j then 0 else (M i j) ^ 2) := by + intro j; by_cases h : i = j <;> simp [h] + rw [Finset.sum_congr rfl (fun j _ => hsplit j), Finset.sum_add_distrib, Finset.sum_ite_eq] + simp + +/-- **Orthogonal similarity preserves the total Frobenius mass.** Every Jacobi step is such a +similarity, so `‖A‖²` is an exact invariant of the whole run. -/ +theorem frobSq_orthogonal_conj {J M : Matrix (Fin n) (Fin n) ℝ} (hJ : Jᵀ * J = 1) : + frobSq (Jᵀ * M * J) = frobSq M := by + have hJJ : J * Jᵀ = 1 := mul_eq_one_comm.mp hJ + unfold frobSq + have hprod : ((Jᵀ * M * J)ᵀ * (Jᵀ * M * J)) = Jᵀ * (Mᵀ * M) * J := by + rw [Matrix.transpose_mul, Matrix.transpose_mul, Matrix.transpose_transpose] + simp only [Matrix.mul_assoc] + rw [← Matrix.mul_assoc J Jᵀ (M * J), hJJ, Matrix.one_mul] + rw [hprod, Matrix.trace_mul_comm, ← Matrix.mul_assoc, hJJ, Matrix.one_mul] + +/-! ## The `2×2` block algebra + +The rotation only mixes rows/columns `p` and `q`, so all the analysis happens in a `2×2` block. The +key fact is that an orthogonal `2×2` conjugation preserves the block's Frobenius mass; we obtain it by +specialising `frobSq_orthogonal_conj` to `Fin 2`. 
-/ + +/-- **`2×2` block Frobenius preservation.** Conjugating the block `!![a, b; b', d]` by the orthogonal +rotation `!![c, s; -s, c]` (with `c² + s² = 1`) preserves the sum of squared entries. Proved by +specialising `frobSq_orthogonal_conj` to `Fin 2`. -/ +private theorem block_frob (a b b' d c s : ℝ) (hcs : c ^ 2 + s ^ 2 = 1) : + (c ^ 2 * a - c * s * (b + b') + s ^ 2 * d) ^ 2 + (s ^ 2 * a + c * s * (b + b') + c ^ 2 * d) ^ 2 + + (c * s * a + c ^ 2 * b - s ^ 2 * b' - c * s * d) ^ 2 + + (c * s * a + c ^ 2 * b' - s ^ 2 * b - c * s * d) ^ 2 + = a ^ 2 + b ^ 2 + b' ^ 2 + d ^ 2 := by + have hR : (!![c, s; -s, c] : Matrix (Fin 2) (Fin 2) ℝ)ᵀ * !![c, s; -s, c] = 1 := by + ext i j; fin_cases i <;> fin_cases j <;> + simp [Matrix.mul_apply, Fin.sum_univ_two] <;> nlinarith [hcs] + have hfrob := frobSq_orthogonal_conj (M := (!![a, b; b', d] : Matrix (Fin 2) (Fin 2) ℝ)) hR + rw [frobSq_eq_sum, frobSq_eq_sum] at hfrob + simp [Fin.sum_univ_two, Matrix.mul_apply, Matrix.transpose_apply] at hfrob + linear_combination hfrob + +/-- **The diagonal-mass increase.** Under the rotation parameters (`c² + s² = 1`), symmetry of the +pivot (`b' = b`), and the annihilation equation `c·s·(a − d) + (c² − s²)·b = 0`, the two rotated +diagonal squares exceed the originals by exactly `2 b²`. -/ +private theorem block_diag_algebra (a b d c s : ℝ) (hcs : c ^ 2 + s ^ 2 = 1) + (hann : c * s * (a - d) + (c ^ 2 - s ^ 2) * b = 0) : + (c ^ 2 * a - 2 * c * s * b + s ^ 2 * d) ^ 2 + (s ^ 2 * a + 2 * c * s * b + c ^ 2 * d) ^ 2 + - a ^ 2 - d ^ 2 = 2 * b ^ 2 := by + have hbf := block_frob a b b d c s hcs + have h0 : c * s * a + c ^ 2 * b - s ^ 2 * b - c * s * d = 0 := by linear_combination hann + linear_combination hbf - 2 * (c * s * a + c ^ 2 * b - s ^ 2 * b - c * s * d) * h0 + +/-! ## Sum helpers -/ + +/-- Sum of a function `f` against an indicator supported on the pair `{p', q'}` (with `p' ≠ q'`). 
-/ +private theorem sum_pair (p' q' : Fin n) (hpq : p' ≠ q') (vp vq : ℝ) (f : Fin n → ℝ) : + ∑ l, (if l = p' then vp else if l = q' then vq else 0) * f l = vp * f p' + vq * f q' := by + have hterm : ∀ l : Fin n, + (if l = p' then vp else if l = q' then vq else 0) * f l + = (if l = p' then vp * f l else 0) + (if l = q' then vq * f l else 0) := by + intro l + by_cases hlp : l = p' + · subst hlp; simp [hpq] + · by_cases hlq : l = q' + · subst hlq; simp [hlp] + · simp [hlp, hlq] + rw [Finset.sum_congr rfl (fun l _ => hterm l), Finset.sum_add_distrib, + Finset.sum_ite_eq', Finset.sum_ite_eq'] + simp + +/-- A fintype sum of a function supported on the pair `{p', q'}` collapses to the two values. -/ +private theorem sum_eq_pair (p' q' : Fin n) (hpq : p' ≠ q') (g : Fin n → ℝ) + (h0 : ∀ o, o ≠ p' → o ≠ q' → g o = 0) : ∑ o, g o = g p' + g q' := by + rw [← Finset.sum_pair hpq] + refine (Finset.sum_subset (Finset.subset_univ _) (fun o _ ho => ?_)).symm + simp only [Finset.mem_insert, Finset.mem_singleton, not_or] at ho + exact h0 o ho.1 ho.2 + +/-! ## Entries of `A · J` and of the conjugation `Jᵀ · A · J` + +`J = toM n (arrGivens n p q c s)` has columns supported on `{p, q}` (off `{p, q}` it is the identity), +so multiplying by it only combines columns `p`, `q`. -/ + +variable (A : Matrix (Fin n) (Fin n) ℝ) (p q : Nat) (hp : p < n) (hq : q < n) (hpq : p ≠ q) + (c s : ℝ) + +include hpq in +private theorem fin_pq_ne : (⟨p, hp⟩ : Fin n) ≠ ⟨q, hq⟩ := fun h => hpq (Fin.ext_iff.mp h) + +include hpq in +/-- Column `p` of `A · J`: `c · A[·,p] − s · A[·,q]`. -/ +theorem givens_AJ_p (k : Fin n) : + (A * toM n (Spec.arrGivens n p q c s)) k ⟨p, hp⟩ + = c * A k ⟨p, hp⟩ - s * A k ⟨q, hq⟩ := by + rw [Matrix.mul_apply, + Finset.sum_congr rfl (fun l _ => by rw [givens_col_fp p q hp hq hpq c s l, mul_comm]), + sum_pair ⟨p, hp⟩ ⟨q, hq⟩ (fin_pq_ne p q hp hq hpq) c (-s) (fun l => A k l)] + ring + +include hpq in +/-- Column `q` of `A · J`: `s · A[·,p] + c · A[·,q]`. 
-/ +theorem givens_AJ_q (k : Fin n) : + (A * toM n (Spec.arrGivens n p q c s)) k ⟨q, hq⟩ + = s * A k ⟨p, hp⟩ + c * A k ⟨q, hq⟩ := by + rw [Matrix.mul_apply, + Finset.sum_congr rfl (fun l _ => by rw [givens_col_fq p q hp hq hpq c s l, mul_comm]), + sum_pair ⟨p, hp⟩ ⟨q, hq⟩ (fin_pq_ne p q hp hq hpq) s c (fun l => A k l)] + +/-- Any other column `o ∉ {p, q}` of `A · J` is unchanged. -/ +theorem givens_AJ_other (o : Fin n) (hop : o.val ≠ p) (hoq : o.val ≠ q) (k : Fin n) : + (A * toM n (Spec.arrGivens n p q c s)) k o = A k o := by + rw [Matrix.mul_apply, + Finset.sum_congr rfl (fun l _ => by rw [givens_col_other p q c s o l hop hoq])] + simp + +include hpq in +/-- The `(p, p)` entry of the conjugation `Jᵀ · A · J`. -/ +theorem givens_conj_pp : + ((toM n (Spec.arrGivens n p q c s))ᵀ * A * toM n (Spec.arrGivens n p q c s)) ⟨p, hp⟩ ⟨p, hp⟩ + = c ^ 2 * A ⟨p, hp⟩ ⟨p, hp⟩ - c * s * (A ⟨p, hp⟩ ⟨q, hq⟩ + A ⟨q, hq⟩ ⟨p, hp⟩) + + s ^ 2 * A ⟨q, hq⟩ ⟨q, hq⟩ := by + rw [Matrix.mul_assoc, Matrix.mul_apply, + Finset.sum_congr rfl (fun k _ => by + rw [Matrix.transpose_apply, givens_col_fp p q hp hq hpq c s k, + givens_AJ_p A p q hp hq hpq c s k]), + sum_pair ⟨p, hp⟩ ⟨q, hq⟩ (fin_pq_ne p q hp hq hpq) c (-s) + (fun k => c * A k ⟨p, hp⟩ - s * A k ⟨q, hq⟩)] + ring + +include hpq in +/-- The `(q, q)` entry of the conjugation `Jᵀ · A · J`. -/ +theorem givens_conj_qq : + ((toM n (Spec.arrGivens n p q c s))ᵀ * A * toM n (Spec.arrGivens n p q c s)) ⟨q, hq⟩ ⟨q, hq⟩ + = s ^ 2 * A ⟨p, hp⟩ ⟨p, hp⟩ + c * s * (A ⟨p, hp⟩ ⟨q, hq⟩ + A ⟨q, hq⟩ ⟨p, hp⟩) + + c ^ 2 * A ⟨q, hq⟩ ⟨q, hq⟩ := by + rw [Matrix.mul_assoc, Matrix.mul_apply, + Finset.sum_congr rfl (fun k _ => by + rw [Matrix.transpose_apply, givens_col_fq p q hp hq hpq c s k, + givens_AJ_q A p q hp hq hpq c s k]), + sum_pair ⟨p, hp⟩ ⟨q, hq⟩ (fin_pq_ne p q hp hq hpq) s c + (fun k => s * A k ⟨p, hp⟩ + c * A k ⟨q, hq⟩)] + ring + +include hpq in +/-- The `(p, q)` entry of the conjugation `Jᵀ · A · J` — the entry the rotation is chosen to +annihilate. 
-/ +theorem givens_conj_pq : + ((toM n (Spec.arrGivens n p q c s))ᵀ * A * toM n (Spec.arrGivens n p q c s)) ⟨p, hp⟩ ⟨q, hq⟩ + = c * s * A ⟨p, hp⟩ ⟨p, hp⟩ + c ^ 2 * A ⟨p, hp⟩ ⟨q, hq⟩ - s ^ 2 * A ⟨q, hq⟩ ⟨p, hp⟩ + - c * s * A ⟨q, hq⟩ ⟨q, hq⟩ := by + rw [Matrix.mul_assoc, Matrix.mul_apply, + Finset.sum_congr rfl (fun k _ => by + rw [Matrix.transpose_apply, givens_col_fp p q hp hq hpq c s k, + givens_AJ_q A p q hp hq hpq c s k]), + sum_pair ⟨p, hp⟩ ⟨q, hq⟩ (fin_pq_ne p q hp hq hpq) c (-s) + (fun k => s * A k ⟨p, hp⟩ + c * A k ⟨q, hq⟩)] + ring + +/-- Any other diagonal entry `(o, o)` with `o ∉ {p, q}` is unchanged by the conjugation. -/ +theorem givens_conj_other (o : Fin n) (hop : o.val ≠ p) (hoq : o.val ≠ q) : + ((toM n (Spec.arrGivens n p q c s))ᵀ * A * toM n (Spec.arrGivens n p q c s)) o o = A o o := by + rw [Matrix.mul_assoc, Matrix.mul_apply, + Finset.sum_congr rfl (fun k _ => by + rw [Matrix.transpose_apply, givens_col_other p q c s o k hop hoq, + givens_AJ_other A p q c s o hop hoq k])] + simp + +/-! ## The per-rotation off-diagonal decrease -/ + +include hpq in +/-- **Per-rotation Jacobi progress.** For a *symmetric pivot* (`A[q,p] = A[p,q]`) and the Givens +rotation that *annihilates* it (`(Jᵀ A J)[p,q] = 0`), conjugating `A` by `J` decreases the squared +off-diagonal mass by exactly `2 · A[p,q]²`: + +`‖offDiag(Jᵀ A J)‖² = ‖offDiag A‖² − 2 · A[p,q]²`. + +This is the exact finite identity behind Jacobi convergence: each rotation removes `2 · A[p,q]²` of +off-diagonal mass. The *rate* over a sweep (how the pivots are chosen and how fast the total mass +falls) is the research-grade part Mathlib has no theory for. 
-/ +theorem jacobi_off_decrease (hcs : c ^ 2 + s ^ 2 = 1) + (hsym : A ⟨q, hq⟩ ⟨p, hp⟩ = A ⟨p, hp⟩ ⟨q, hq⟩) + (hannih : ((toM n (Spec.arrGivens n p q c s))ᵀ * A * toM n (Spec.arrGivens n p q c s)) + ⟨p, hp⟩ ⟨q, hq⟩ = 0) : + offSq ((toM n (Spec.arrGivens n p q c s))ᵀ * A * toM n (Spec.arrGivens n p q c s)) + = offSq A - 2 * (A ⟨p, hp⟩ ⟨q, hq⟩) ^ 2 := by + have hpq' : (⟨p, hp⟩ : Fin n) ≠ ⟨q, hq⟩ := fin_pq_ne p q hp hq hpq + set G := toM n (Spec.arrGivens n p q c s) with hGdef + have hJ : Gᵀ * G = 1 := givens_orthogonal p q hp hq hpq c s hcs + have hfrob : frobSq (Gᵀ * A * G) = frobSq A := frobSq_orthogonal_conj hJ + have hsumP := frobSq_eq_diagSq_add_offSq (Gᵀ * A * G) + have hsumA := frobSq_eq_diagSq_add_offSq A + -- The annihilation equation in explicit form (using symmetry). + have hann : c * s * (A ⟨p, hp⟩ ⟨p, hp⟩ - A ⟨q, hq⟩ ⟨q, hq⟩) + + (c ^ 2 - s ^ 2) * A ⟨p, hp⟩ ⟨q, hq⟩ = 0 := by + have hpq0 := givens_conj_pq A p q hp hq hpq c s + rw [hGdef] at hannih + rw [hannih] at hpq0 + rw [hsym] at hpq0 + linear_combination -hpq0 + -- The diagonal mass increases by exactly `2 A[p,q]²`. 
+ have hdiag : diagSq (Gᵀ * A * G) - diagSq A = 2 * (A ⟨p, hp⟩ ⟨q, hq⟩) ^ 2 := by + unfold diagSq + rw [← Finset.sum_sub_distrib, + sum_eq_pair ⟨p, hp⟩ ⟨q, hq⟩ hpq' + (fun o => (Gᵀ * A * G) o o ^ 2 - A o o ^ 2) ?_] + · simp only [hGdef, givens_conj_pp A p q hp hq hpq c s, givens_conj_qq A p q hp hq hpq c s, + hsym] + have hba := block_diag_algebra (A ⟨p, hp⟩ ⟨p, hp⟩) (A ⟨p, hp⟩ ⟨q, hq⟩) (A ⟨q, hq⟩ ⟨q, hq⟩) c s + hcs hann + linear_combination hba + · intro o hop' hoq' + have hop : o.val ≠ p := fun h => hop' (Fin.ext h) + have hoq : o.val ≠ q := fun h => hoq' (Fin.ext h) + simp only [hGdef, givens_conj_other A p q c s o hop hoq, sub_self] + linarith [hfrob, hsumP, hsumA, hdiag] + +end Spec.Factorization diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index de1522d..4003d8f 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -158,6 +158,34 @@ hold for the actual returned `(Λ, V)` outright. So the returned `V` is a genuin the only thing the residual certificate still defers to runtime is the *size* of the off-diagonal mass, never the algebraic faithfulness of the decomposition. +# Per-rotation progress: the off-diagonal mass decreases + +Faithfulness says the residual *equals* the off-diagonal mass of `Af`; it does not say that mass ever +goes *down*. The classical Jacobi progress identity, proved in +[`NN.Proofs.Tensor.Basic.FactorizationsJacobiDecrease`](https://github.com/lean-dojo/TorchLean/blob/main/NN/Proofs/Tensor/Basic/FactorizationsJacobiDecrease.lean), +is exactly that statement at the level of a single rotation. 
For a symmetric `A`, conjugating by the +Givens rotation that *annihilates* the pivot `(p, q)` decreases the squared off-diagonal mass by +exactly `2 · A[p,q]²`: + +$$`\bigl\|\operatorname{offDiag}(J^\top A J)\bigr\|_F^2 = \bigl\|\operatorname{offDiag} A\bigr\|_F^2 - 2\,A[p,q]^2.` + +This is `jacobi_off_decrease`, and it rests on two exact facts. First, *orthogonal similarity +preserves the total Frobenius mass* (`frobSq_orthogonal_conj`): `‖Jᵀ A J‖² = ‖A‖²`, since +`trace((Jᵀ A J)ᵀ (Jᵀ A J)) = trace(Aᵀ A)` after the `J Jᵀ = 1` cancellation. Splitting that total as +diagonal-plus-off-diagonal mass (`frobSq_eq_diagSq_add_offSq`) shows that driving the off-diagonal +down is *the same thing* as driving the diagonal up. Second, the rotation only mixes rows and columns +`p, q`, so the diagonal mass changes by `A'[p,p]² + A'[q,q]² − A[p,p]² − A[q,q]²`; the explicit +conjugation entries (`givens_conj_pp`, `givens_conj_qq`, `givens_conj_pq`, computed from the Givens +columns via the support lemmas) plus the `2×2` block-Frobenius identity — itself just +`frobSq_orthogonal_conj` specialised to `Fin 2` — turn that, under `c² + s² = 1` and the annihilation +`A'[p,q] = 0`, into precisely `2 · A[p,q]²`. The annihilation is the defining equation the +Golub–Van Loan rotation angle solves, and `givens_conj_pq` exhibits the pivot entry whose vanishing +it is. The executable witnesses in +[`NN.Examples.Factorization.JacobiDecrease`](https://github.com/lean-dojo/TorchLean/blob/main/NN/Examples/Factorization/JacobiDecrease.lean) +confirm the identity numerically (one rotation takes the off-diagonal mass `6 → 4 = 6 − 2·1²` with +total mass conserved at `35`) and show its hypotheses biting: a wrong-angle rotation misses the +decrease, a non-orthogonal one breaks mass invariance. + # Exact QR reconstruction The QR factorization admits the same treatment. `qr_mul_eq` (in the same file) proves that for an @@ -196,14 +224,18 @@ into a future Mathlib matrix-level QR contribution. 
# What remains -With Cholesky and QR fully reconstructed (`A = L · Lᵀ`, `A = Q · R`, `Qᵀ Q = 1`), and the Jacobi run -now proved faithful — `V` orthogonal and `A = V · Af · Vᵀ` exactly, so the residual certificate holds -*unconditionally* for the real solver output — the single property still not available as an a-priori -theorem is the *rate*: that finitely many cyclic-Jacobi sweeps drive `Af`'s off-diagonal mass to zero. -That is the research-grade Forsythe–Henrici / Schönhage convergence result for cyclic (rather than -classical, largest-pivot) Jacobi, and Mathlib v4.30.0 has no Jacobi convergence theory, so it remains -captured by the exact a-posteriori residual certificate above — bounded numerically by the `assertLt` -checks on concrete inputs — never by `sorry`. Everything else is exact: the algebraic faithfulness of -the decomposition (orthogonality, orthogonal similarity, the residual identity, and correctness in the -zero-residual limit) is proved, and the specification-level facts the kernel methods rely on are -independent of the convergence step, so the CHD foundation is complete. +With Cholesky and QR fully reconstructed (`A = L · Lᵀ`, `A = Q · R`, `Qᵀ Q = 1`), the Jacobi run +proved faithful — `V` orthogonal and `A = V · Af · Vᵀ` exactly, so the residual certificate holds +*unconditionally* for the real solver output — and the *per-rotation* progress proved exactly (each +annihilating rotation removes `2 · A[p,q]²` of off-diagonal mass), the single property still not +available as an a-priori theorem is the *aggregate rate*: that a full *cyclic* sweep, choosing its +pivots in fixed row-major order rather than always the largest, drives the off-diagonal mass to zero +fast enough that finitely many sweeps suffice. 
Summing the per-rotation decrease over a sweep is exact; +what is research-grade is bounding the *sum of the pivots* below in terms of the total off-diagonal +mass when the pivots are visited cyclically — the Forsythe–Henrici / Schönhage convergence result. +Mathlib v4.30.0 has no Jacobi convergence theory, so that aggregate rate remains captured by the exact +a-posteriori residual certificate above — bounded numerically by the `assertLt` checks on concrete +inputs — never by `sorry`. Everything else is exact: the algebraic faithfulness of the decomposition +(orthogonality, orthogonal similarity, the residual identity, the per-rotation decrease, and +correctness in the zero-residual limit) is proved, and the specification-level facts the kernel methods +rely on are independent of the convergence step, so the CHD foundation is complete. From fd1b505d79b37a67452667da08bbb705c275421d Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sun, 31 May 2026 09:10:45 -0700 Subject: [PATCH 08/22] Prove classical Jacobi linear convergence rate (Tier 3) + reviewer examples MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add the aggregate-rate development for the largest-pivot Jacobi strategy, building on the per-rotation decrease (Tier 2): - offSq_le_count_mul_max: the largest off-diagonal pivot carries at least the average share of the mass, ‖offDiag A‖² ≤ (n²−n)·A[p,q]². - jacobi_off_decrease_classical: substituting that into the exact per-rotation decrease yields a genuine linear contraction by 1 − 2/(n²−n) < 1. - geom_bound_of_contraction / tendsto_zero_of_contraction: a fixed-factor contraction iterates to ρᵏ·a₀ and (with offSq_nonneg, factor < 1) tends to 0, so the classical algorithm provably converges geometrically. Stated for an arbitrary per-step factor, so a future cyclic per-sweep bound plugs in directly. 
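Reviewer note (a sketch, not part of the Lean sources): the three lemmas above chain together as follows. The shorthand `S`, `S'`, `N` and `S_k` is introduced here for illustration only and does not appear in the proofs; `n ≥ 2` is assumed so that `N > 0`, matching the `hn` hypothesis of `jacobi_off_decrease_classical`.

```latex
% Shorthand (assumed for this note only):
%   S   = \|\operatorname{offDiag} A\|_F^2            off-diagonal mass before the rotation
%   S'  = \|\operatorname{offDiag}(J^\top A J)\|_F^2  off-diagonal mass after the rotation
%   N   = n^2 - n                                     number of off-diagonal positions
\begin{align*}
S'  &= S - 2\,A[p,q]^2
      && \text{(\texttt{jacobi\_off\_decrease}, exact)} \\
S   &\le N \cdot A[p,q]^2
      \;\Longrightarrow\; A[p,q]^2 \ge \tfrac{S}{N}
      && \text{(\texttt{offSq\_le\_count\_mul\_max}, largest pivot)} \\
S'  &\le S - \tfrac{2S}{N} = \Bigl(1 - \tfrac{2}{N}\Bigr) S
      && \text{(\texttt{jacobi\_off\_decrease\_classical})} \\
S_k &\le \Bigl(1 - \tfrac{2}{N}\Bigr)^{k} S_0 \;\longrightarrow\; 0
      && \text{(\texttt{geom\_bound\_of\_contraction}, \texttt{tendsto\_zero\_of\_contraction})}
\end{align*}
% Example: n = 3 gives N = 6 and a per-step factor of 1 - 2/6 = 2/3.
% Edge case: n = 2 gives N = 2 and factor 0, i.e. a single rotation diagonalises a 2x2 matrix.
```

Only the second line uses the largest-pivot hypothesis, so that is the step a cyclic per-sweep bound would have to replace.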
Honest scope: the cyclic ordering the solver uses does not satisfy the largest-pivot hypothesis (the research-grade Forsythe–Henrici/Schönhage rate); that gap stays captured by the exact a-posteriori residual certificate, never by sorry. Reviewer examples (NN/Examples/Factorization/JacobiRate.lean): largest pivot meets the rate (mass 50.04 → 0.04, far under the guaranteed 33.36); negative control — a tiny non-largest pivot misses the guaranteed factor. Blueprint gains an "Aggregate rate" section and a precise restatement of what remains. sorry-free, warning-free; repo_lint shows no new violations. Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Examples/Factorization.lean | 5 + NN/Examples/Factorization/JacobiRate.lean | 102 +++++++++++ NN/Proofs/Tensor/Basic.lean | 1 + .../Basic/FactorizationsJacobiRate.lean | 163 ++++++++++++++++++ .../Ch4_Verification/Factorizations.lean | 58 +++++-- 5 files changed, 318 insertions(+), 11 deletions(-) create mode 100644 NN/Examples/Factorization/JacobiRate.lean create mode 100644 NN/Proofs/Tensor/Basic/FactorizationsJacobiRate.lean diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean index 0dc3a6d..10be535 100644 --- a/NN/Examples/Factorization.lean +++ b/NN/Examples/Factorization.lean @@ -12,6 +12,7 @@ public import NN.Examples.Factorization.QR public import NN.Examples.Factorization.SymEig public import NN.Examples.Factorization.SVD public import NN.Examples.Factorization.JacobiDecrease +public import NN.Examples.Factorization.JacobiRate /-! # Matrix factorization examples @@ -34,6 +35,10 @@ factorization misbehaves. - `JacobiDecrease` — the per-rotation progress identity `‖offDiag(Jᵀ A J)‖² = ‖offDiag A‖² − 2·A[p,q]²` (`jacobi_off_decrease`) and Frobenius-mass invariance; **negative controls**: a wrong-angle rotation misses the decrease, a non-orthogonal one breaks mass invariance. 
+- `JacobiRate` — the *aggregate* linear-contraction rate of the classical largest-pivot strategy: + `‖offDiag(Jᵀ A J)‖² ≤ (1 − 2/(n²−n))·‖offDiag A‖²` (`jacobi_off_decrease_classical`); **negative + control**: annihilating a non-largest (tiny) pivot misses the guaranteed factor, so the rate is + specific to the largest-pivot choice. Both **positive** checks (a valid factorization reconstructs to `err ≈ 0`) and **negative controls** (the same metric reports a large error / `NaN` when a hypothesis is violated) are included, so a diff --git a/NN/Examples/Factorization/JacobiRate.lean b/NN/Examples/Factorization/JacobiRate.lean new file mode 100644 index 0000000..60b51aa --- /dev/null +++ b/NN/Examples/Factorization/JacobiRate.lean @@ -0,0 +1,102 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Examples.Factorization.Common +meta import NN.Examples.Factorization.Common + +/-! +# Example: the classical Jacobi sweep contracts at a fixed linear rate + +These checks corroborate the **Tier 3** development in +`NN.Proofs.Tensor.Basic.FactorizationsJacobiRate`: the *aggregate* convergence rate of the classical +(largest-pivot) Jacobi strategy. Annihilating the **largest** off-diagonal pivot multiplies the +off-diagonal mass by at most `1 − 2/(n² − n) < 1`: + +`‖offDiag(Jᵀ A J)‖² ≤ (1 − 2/(n² − n)) · ‖offDiag A‖²` (`jacobi_off_decrease_classical`) + +because the largest pivot carries at least the average share `‖offDiag A‖²/(n² − n)` of the mass +(`offSq_le_count_mul_max`). The test matrix has one dominant off-diagonal entry, so the contrast +between annihilating it and annihilating a tiny one is stark. + +The checks exhibit the theorem *and* its largest-pivot hypothesis biting (negative control): + +* **Positive — pivot carries ≥ average share.** `‖offDiag A‖² ≤ (n² − n) · A[p,q]²` for the largest + pivot (`offSq_le_count_mul_max` on the concrete matrix). 
+* **Positive — largest pivot meets the rate.** Annihilating the dominant entry `A[0,1]` contracts the + off-diagonal mass below `(1 − 2/(n² − n)) · ‖offDiag A‖²` (in fact far below — it nearly diagonalises). +* **Negative — a non-largest pivot misses the rate.** Annihilating a *tiny* off-diagonal entry + `A[0,2]` still removes `2·A[0,2]²` of mass (the per-rotation identity always holds), but that is far + too little to meet the guaranteed factor: the off-diagonal mass stays *above* `(1 − 2/(n²−n))·‖offDiag A‖²`. + This is exactly why the rate is for the *largest-pivot* strategy and the cyclic sweep needs the + research-grade Forsythe–Henrici bound instead. +-/ + +@[expose] public section + + +namespace NN.Examples.Factorization.JacobiRate + +/-- A symmetric `3×3` matrix with one dominant off-diagonal entry `A[0,1] = 5` and two tiny ones +`A[0,2] = A[1,2] = 0.1`. -/ +def A : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := + mkMat [[1, 5, 0.1], + [5, 2, 0.1], + [0.1, 0.1, 3]] + +/-- Off-diagonal count `n² − n` for `n = 3`, and the guaranteed contraction factor `1 − 2/(n²−n)`. -/ +def offCount : Float := 3 * 3 - 3 -- = 6 +def factor : Float := 1 - 2 / offCount -- = 2/3 + +def offBefore : Float := offDiagFrobSq A + +/-- The largest off-diagonal entry is `A[0,1] = 5`; its square is the per-rotation drop budget. -/ +def bigSq : Float := let x := Spec.get2 A ⟨0, by decide⟩ ⟨1, by decide⟩; x * x -- = 25 +/-- A tiny off-diagonal entry `A[0,2] = 0.1`. -/ +def smallSq : Float := let x := Spec.get2 A ⟨0, by decide⟩ ⟨2, by decide⟩; x * x -- = 0.01 + +/-- Annihilate the **largest** pivot `(0,1)` — the classical choice. -/ +def Abig : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := jacobiRotateAt A 0 1 +def offBig : Float := offDiagFrobSq Abig + +/-- Annihilate a **tiny** pivot `(0,2)` — a non-largest (e.g. cyclic-order) choice. 
-/ +def Asmall : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := jacobiRotateAt A 0 2 +def offSmall : Float := offDiagFrobSq Asmall + +#eval IO.println s!"off-diagonal mass = {offBefore}; average share = {offBefore / offCount}; \ + guaranteed post-rotation bound (1-2/(n²-n))·mass = {factor * offBefore}" +#eval IO.println s!"largest pivot² = {bigSq} (≥ average) → off-mass after = {offBig}" +#eval IO.println s!"tiny pivot² = {smallSq} (< average) → off-mass after = {offSmall}" + +-- Positive — the largest pivot carries at least the average share: `‖offDiag A‖² ≤ (n²−n)·A[p,q]²` +-- (`offSq_le_count_mul_max`). The "violation amount" is `0` when the bound holds. +#eval assertLt "largest pivot carries ≥ average share: ‖offDiag A‖² ≤ (n²−n)·A[0,1]²" + (max (0.0 : Float) (offBefore - offCount * bigSq)) + +-- Positive — annihilating the largest pivot meets the linear rate (`jacobi_off_decrease_classical`). +#eval assertLt "classical contraction: ‖offDiag A'‖² ≤ (1−2/(n²−n))·‖offDiag A‖²" + (max (0.0 : Float) (offBig - factor * offBefore)) + +-- Positive — the largest pivot really is annihilated. +#eval assertLt "largest-pivot rotation annihilates A'[0,1] ≈ 0" + (Float.abs (Spec.get2 Abig ⟨0, by decide⟩ ⟨1, by decide⟩)) + +/-! ## Negative control: the largest-pivot hypothesis is necessary + +Annihilating a *tiny* off-diagonal entry obeys the per-rotation identity (mass drops by `2·A[0,2]²`) +but removes far too little to meet the guaranteed factor — the off-diagonal mass stays above +`(1 − 2/(n²−n))·‖offDiag A‖²`. -/ + +-- The tiny pivot is below the average share, so the count bound does *not* certify the rate for it. +#eval IO.println s!"tiny pivot still annihilated: A'[0,2] = {Spec.get2 Asmall ⟨0, by decide⟩ ⟨2, by decide⟩}, \ + and mass did drop ({offBefore} → {offSmall}) — just not by enough" + +-- Negative — a non-largest pivot misses the guaranteed contraction by a wide margin. 
+#eval assertGe "non-largest pivot fails the (1−2/(n²−n)) rate (largest-pivot hypothesis needed)" + (offSmall - factor * offBefore) 0.5 + +end NN.Examples.Factorization.JacobiRate diff --git a/NN/Proofs/Tensor/Basic.lean b/NN/Proofs/Tensor/Basic.lean index f2e8ffa..33718c6 100644 --- a/NN/Proofs/Tensor/Basic.lean +++ b/NN/Proofs/Tensor/Basic.lean @@ -14,6 +14,7 @@ public import NN.Proofs.Tensor.Basic.FactorizationsReconstruction public import NN.Proofs.Tensor.Basic.FactorizationsOrthonormal public import NN.Proofs.Tensor.Basic.FactorizationsJacobi public import NN.Proofs.Tensor.Basic.FactorizationsJacobiDecrease +public import NN.Proofs.Tensor.Basic.FactorizationsJacobiRate public import NN.Proofs.Tensor.Basic.BoundsNorms public import NN.Proofs.Tensor.Basic.Algebra diff --git a/NN/Proofs/Tensor/Basic/FactorizationsJacobiRate.lean b/NN/Proofs/Tensor/Basic/FactorizationsJacobiRate.lean new file mode 100644 index 0000000..70bc78c --- /dev/null +++ b/NN/Proofs/Tensor/Basic/FactorizationsJacobiRate.lean @@ -0,0 +1,163 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Proofs.Tensor.Basic.FactorizationsJacobiDecrease + +/-! +# The aggregate Jacobi convergence rate (classical largest-pivot strategy) + +[`FactorizationsJacobiDecrease`](./FactorizationsJacobiDecrease.lean) proved the *per-rotation* +identity `‖offDiag(Jᵀ A J)‖² = ‖offDiag A‖² − 2 · A[p,q]²` exactly. That is a statement about one +rotation; it says nothing on its own about how fast the off-diagonal mass falls over many rotations — +the *aggregate rate*, and hence how many sweeps are needed. + +This file proves the aggregate rate **for the classical (largest-pivot) strategy**, which is the +elementary, a-priori-provable part of the convergence story: + +* `offSq_le_count_mul_max` — the off-diagonal mass is at most the off-diagonal count `n² − n` times + the largest squared off-diagonal entry. 
So if the pivot `(p, q)` is chosen to be the *largest* + off-diagonal entry, `A[p,q]² ≥ ‖offDiag A‖² / (n² − n)`. +* `jacobi_off_decrease_classical` — combining that lower bound on the pivot with the exact + per-rotation decrease gives a genuine **linear contraction**: one largest-pivot rotation multiplies + the off-diagonal mass by at most `1 − 2/(n² − n) < 1`. +* `geom_bound_of_contraction` / `tendsto_zero_of_contraction` — any quantity that contracts by a fixed + factor `ρ < 1` at every step is bounded by `ρ^k` and tends to `0`. Composed with the single-step + contraction (with `ρ = 1 − 2/(n² − n)` and `offSq_nonneg`), this is an a-priori proof that the + classical Jacobi eigenvalue algorithm drives the off-diagonal mass to zero geometrically. + +## Honest scope: classical vs. cyclic + +The executable solver runs the **cyclic** sweep (pivots visited in fixed row-major order), *not* the +classical largest-pivot rule. The per-step contraction above genuinely fails for a cyclic pivot: a +fixed-order pivot need not be the largest off-diagonal entry, so `2 · A[p,q]²` can fall well short of +`2 · ‖offDiag A‖² / (n² − n)` (and a later rotation in the same sweep can even refill an entry an +earlier one zeroed). Bounding the *sum of the cyclic pivots* below — the per-sweep contraction factor +— is the Forsythe–Henrici / Schönhage result, which Mathlib v4.30.0 has no theory for and which is not +provable by this elementary argument. The abstract `geom_bound_of_contraction` is stated for an +*arbitrary* per-step factor `ρ`, so the moment such a cyclic per-sweep bound is available it plugs in +directly; until then the cyclic rate stays captured by the exact a-posteriori residual certificate of +[`FactorizationsJacobi`](./FactorizationsJacobi.lean), never by `sorry`. +-/ + +@[expose] public section + +namespace Spec.Factorization + +open Matrix +open scoped BigOperators + +variable {n : Nat} + +/-! 
## The off-diagonal mass is nonnegative and bounded by the count times the largest entry -/ + +/-- The squared off-diagonal mass is nonnegative (it is a sum of squares). -/ +theorem offSq_nonneg (M : Matrix (Fin n) (Fin n) ℝ) : 0 ≤ offSq M := by + rw [offSq_eq_sum] + refine Finset.sum_nonneg (fun i _ => Finset.sum_nonneg (fun j _ => ?_)) + by_cases h : i = j + · simp [h] + · simp only [if_neg h]; positivity + +/-- The constant off-diagonal sum: there are exactly `n² − n` off-diagonal positions, so summing a +constant `K` over them gives `(n² − n) · K`. -/ +private theorem sum_const_offdiag (K : ℝ) : + ∑ i : Fin n, ∑ j : Fin n, (if i = j then (0 : ℝ) else K) = ((n : ℝ) ^ 2 - (n : ℝ)) * K := by + have hinner : ∀ i : Fin n, + ∑ j : Fin n, (if i = j then (0 : ℝ) else K) = ((n : ℝ) - 1) * K := by + intro i + have hsplit : ∀ j : Fin n, + (if i = j then (0 : ℝ) else K) = K - (if i = j then K else 0) := by + intro j; by_cases h : i = j <;> simp [h] + rw [Finset.sum_congr rfl (fun j _ => hsplit j), Finset.sum_sub_distrib, + Finset.sum_const, Finset.sum_ite_eq] + simp only [Finset.mem_univ, if_true, Finset.card_univ, Fintype.card_fin, nsmul_eq_mul] + ring + rw [Finset.sum_congr rfl (fun i _ => hinner i), Finset.sum_const] + simp only [Finset.card_univ, Fintype.card_fin, nsmul_eq_mul] + ring + +/-- **The off-diagonal mass is at most the off-diagonal count `n² − n` times the largest squared +off-diagonal entry.** With `(p', q')` achieving that maximum, this says +`‖offDiag M‖² ≤ (n² − n) · M[p',q']²`, i.e. the largest pivot carries at least an average share of the +mass — the bound the classical Jacobi strategy exploits. 
-/ +theorem offSq_le_count_mul_max (M : Matrix (Fin n) (Fin n) ℝ) (p' q' : Fin n) + (hmax : ∀ i j : Fin n, i ≠ j → (M i j) ^ 2 ≤ (M p' q') ^ 2) : + offSq M ≤ ((n : ℝ) ^ 2 - (n : ℝ)) * (M p' q') ^ 2 := by + rw [offSq_eq_sum, ← sum_const_offdiag ((M p' q') ^ 2)] + refine Finset.sum_le_sum (fun i _ => Finset.sum_le_sum (fun j _ => ?_)) + by_cases h : i = j + · simp [h] + · simp only [if_neg h]; exact hmax i j h + +/-! ## The classical (largest-pivot) single-step contraction -/ + +variable (A : Matrix (Fin n) (Fin n) ℝ) (p q : Nat) (hp : p < n) (hq : q < n) (hpq : p ≠ q) + (c s : ℝ) + +include hpq in +/-- **Classical Jacobi linear convergence — one step.** If the pivot `(p, q)` is the *largest* +off-diagonal entry (`hmax`), `A` is symmetric there (`hsym`), and `J` is the Givens rotation that +annihilates it (`hannih`), then conjugating by `J` contracts the squared off-diagonal mass by the +fixed factor `1 − 2/(n² − n) < 1`: + +`‖offDiag(Jᵀ A J)‖² ≤ (1 − 2/(n² − n)) · ‖offDiag A‖²`. + +This is the exact per-rotation decrease `2 · A[p,q]²` (`jacobi_off_decrease`) combined with the +pivot lower bound `A[p,q]² ≥ ‖offDiag A‖²/(n² − n)` (`offSq_le_count_mul_max`). It is an a-priori +convergence rate for the largest-pivot strategy; the *cyclic* strategy the solver uses does not +satisfy the largest-pivot hypothesis and needs the research-grade Forsythe–Henrici bound instead. 
-/ +theorem jacobi_off_decrease_classical (hn : 2 ≤ n) (hcs : c ^ 2 + s ^ 2 = 1) + (hsym : A ⟨q, hq⟩ ⟨p, hp⟩ = A ⟨p, hp⟩ ⟨q, hq⟩) + (hannih : ((toM n (Spec.arrGivens n p q c s))ᵀ * A * toM n (Spec.arrGivens n p q c s)) + ⟨p, hp⟩ ⟨q, hq⟩ = 0) + (hmax : ∀ i j : Fin n, i ≠ j → (A i j) ^ 2 ≤ (A ⟨p, hp⟩ ⟨q, hq⟩) ^ 2) : + offSq ((toM n (Spec.arrGivens n p q c s))ᵀ * A * toM n (Spec.arrGivens n p q c s)) + ≤ (1 - 2 / ((n : ℝ) ^ 2 - (n : ℝ))) * offSq A := by + have hdec := jacobi_off_decrease A p q hp hq hpq c s hcs hsym hannih + have hbound := offSq_le_count_mul_max A ⟨p, hp⟩ ⟨q, hq⟩ hmax + have hN : (0 : ℝ) < (n : ℝ) ^ 2 - (n : ℝ) := by + have h2 : (2 : ℝ) ≤ (n : ℝ) := by exact_mod_cast hn + nlinarith + rw [hdec, sub_mul, one_mul, div_mul_eq_mul_div] + apply sub_le_sub_left + rw [div_le_iff₀ hN] + nlinarith [hbound] + +/-! ## Iterating the contraction: geometric convergence -/ + +end Spec.Factorization + +namespace Spec.Factorization + +/-- **A fixed-factor contraction is bounded by a geometric sequence.** If `a (k+1) ≤ ρ · a k` for +all `k` with `0 ≤ ρ`, then `a k ≤ ρ^k · a 0`. Applied with `a k = ‖offDiag Aₖ‖²` and +`ρ = 1 − 2/(n² − n)` from `jacobi_off_decrease_classical`, this is the geometric a-priori rate of the +classical Jacobi algorithm. The factor `ρ` is arbitrary, so any future per-sweep cyclic bound plugs +in here unchanged. -/ +theorem geom_bound_of_contraction (a : ℕ → ℝ) (ρ : ℝ) (hρ : 0 ≤ ρ) + (hstep : ∀ k, a (k + 1) ≤ ρ * a k) : ∀ k, a k ≤ ρ ^ k * a 0 := by + intro k + induction k with + | zero => simp + | succ m ih => + calc a (m + 1) ≤ ρ * a m := hstep m + _ ≤ ρ * (ρ ^ m * a 0) := mul_le_mul_of_nonneg_left ih hρ + _ = ρ ^ (m + 1) * a 0 := by ring + +/-- **The contraction drives the quantity to zero.** With a genuine factor `ρ < 1` (and `0 ≤ a k`, +which holds for `offSq` by `offSq_nonneg`), the off-diagonal mass tends to `0`: the classical Jacobi +algorithm provably converges to a diagonal matrix. 
-/ +theorem tendsto_zero_of_contraction (a : ℕ → ℝ) (ρ : ℝ) (hρ0 : 0 ≤ ρ) (hρ1 : ρ < 1) + (hnn : ∀ k, 0 ≤ a k) (hstep : ∀ k, a (k + 1) ≤ ρ * a k) : + Filter.Tendsto a Filter.atTop (nhds 0) := by + apply squeeze_zero hnn (geom_bound_of_contraction a ρ hρ0 hstep) + have hpow : Filter.Tendsto (fun k => ρ ^ k) Filter.atTop (nhds 0) := + tendsto_pow_atTop_nhds_zero_of_lt_one hρ0 hρ1 + simpa using hpow.mul_const (a 0) + +end Spec.Factorization diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index 4003d8f..e4869b5 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -186,6 +186,35 @@ confirm the identity numerically (one rotation takes the off-diagonal mass `6 total mass conserved at `35`) and show its hypotheses biting: a wrong-angle rotation misses the decrease, a non-orthogonal one breaks mass invariance. +# Aggregate rate: linear convergence of the classical strategy + +The per-rotation identity removes `2 · A[p,q]²` of off-diagonal mass per step. Turning that into an +*aggregate* rate — a factor by which the mass falls each step, and hence a bound on how many steps are +needed — requires a lower bound on the pivot. For the *classical* strategy, which always annihilates +the *largest* off-diagonal entry, that bound is elementary, and +[`NN.Proofs.Tensor.Basic.FactorizationsJacobiRate`](https://github.com/lean-dojo/TorchLean/blob/main/NN/Proofs/Tensor/Basic/FactorizationsJacobiRate.lean) +proves it exactly over `ℝ`. 
There are `n² − n` off-diagonal positions, so the largest one carries at +least the average share of the mass (`offSq_le_count_mul_max`): + +$$`A[p,q]^2 \;\ge\; \frac{\bigl\|\operatorname{offDiag} A\bigr\|_F^2}{n^2 - n}.` + +Substituting this into the per-rotation decrease gives a genuine *linear contraction* +(`jacobi_off_decrease_classical`): + +$$`\bigl\|\operatorname{offDiag}(J^\top A J)\bigr\|_F^2 \;\le\; \Bigl(1 - \tfrac{2}{n^2 - n}\Bigr)\,\bigl\|\operatorname{offDiag} A\bigr\|_F^2,` + +a fixed factor strictly below `1`. A fixed-factor contraction iterates to a geometric bound +(`geom_bound_of_contraction`: `aₖ ≤ ρᵏ · a₀`) and, since `offSq ≥ 0` (`offSq_nonneg`) and the factor +is `< 1`, drives the off-diagonal mass to zero (`tendsto_zero_of_contraction`). So the classical +Jacobi eigenvalue algorithm provably converges, with an a-priori geometric rate. The geometric +machinery is stated for an *arbitrary* per-step factor `ρ`, so it is exactly the slot a future cyclic +per-sweep bound would fill. The executable witnesses in +[`NN.Examples.Factorization.JacobiRate`](https://github.com/lean-dojo/TorchLean/blob/main/NN/Examples/Factorization/JacobiRate.lean) +exhibit the contrast on a matrix with one dominant entry (`A[0,1] = 5`): annihilating the largest +pivot collapses the off-diagonal mass `50.04 → 0.04`, far under the guaranteed `33.36`, while +annihilating a tiny pivot `A[0,2] = 0.1` removes only `0.02` and stays *above* the guaranteed bound — +the numerical teeth of the largest-pivot hypothesis. + # Exact QR reconstruction The QR factorization admits the same treatment. `qr_mul_eq` (in the same file) proves that for an @@ -226,16 +255,23 @@ into a future Mathlib matrix-level QR contribution. 
With Cholesky and QR fully reconstructed (`A = L · Lᵀ`, `A = Q · R`, `Qᵀ Q = 1`), the Jacobi run proved faithful — `V` orthogonal and `A = V · Af · Vᵀ` exactly, so the residual certificate holds -*unconditionally* for the real solver output — and the *per-rotation* progress proved exactly (each -annihilating rotation removes `2 · A[p,q]²` of off-diagonal mass), the single property still not -available as an a-priori theorem is the *aggregate rate*: that a full *cyclic* sweep, choosing its -pivots in fixed row-major order rather than always the largest, drives the off-diagonal mass to zero -fast enough that finitely many sweeps suffice. Summing the per-rotation decrease over a sweep is exact; -what is research-grade is bounding the *sum of the pivots* below in terms of the total off-diagonal -mass when the pivots are visited cyclically — the Forsythe–Henrici / Schönhage convergence result. -Mathlib v4.30.0 has no Jacobi convergence theory, so that aggregate rate remains captured by the exact -a-posteriori residual certificate above — bounded numerically by the `assertLt` checks on concrete -inputs — never by `sorry`. Everything else is exact: the algebraic faithfulness of the decomposition -(orthogonality, orthogonal similarity, the residual identity, the per-rotation decrease, and +*unconditionally* for the real solver output — the *per-rotation* progress proved exactly (each +annihilating rotation removes `2 · A[p,q]²` of off-diagonal mass), and the *aggregate* rate of the +*classical largest-pivot* strategy proved to be geometric (linear contraction by `1 − 2/(n²−n)`, +iterating to convergence), the one property still not available as an a-priori theorem is the +aggregate rate *for the cyclic ordering the solver actually uses*: that visiting pivots in fixed +row-major order, rather than always the largest, still drives the off-diagonal mass to zero fast +enough that finitely many sweeps suffice. The gap is precise. 
The classical bound rests on the +largest pivot carrying at least the average share of the mass; a cyclically-chosen pivot need not, so +its single-step decrease can fall arbitrarily short of `2·‖offDiag A‖²/(n²−n)` (and a later rotation +in the same sweep can refill an entry an earlier one zeroed). Summing the per-rotation decrease over a +sweep is exact; what is research-grade is bounding the *sum of the cyclic pivots* below in terms of +the total off-diagonal mass — the Forsythe–Henrici / Schönhage convergence result. Mathlib v4.30.0 has +no cyclic-Jacobi convergence theory, so that cyclic rate remains captured by the exact a-posteriori +residual certificate above — bounded numerically by the `assertLt` checks on concrete inputs — never +by `sorry`; and the geometric machinery (`geom_bound_of_contraction`, `tendsto_zero_of_contraction`) +is stated for an arbitrary per-step factor, ready to consume such a bound the moment it exists. +Everything else is exact: the algebraic faithfulness of the decomposition (orthogonality, orthogonal +similarity, the residual identity, the per-rotation decrease, the classical-strategy linear rate, and correctness in the zero-residual limit) is proved, and the specification-level facts the kernel methods rely on are independent of the convergence step, so the CHD foundation is complete. From b03ab4da72f4239dbf943f67e4831a0e2b933b82 Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sun, 31 May 2026 11:18:11 -0700 Subject: [PATCH 09/22] Prove Cholesky positive-pivot keystone; make kernel-ridge solve unconditional MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add the deferred keystone `choleskyFn_diag_pos_of_posDef`: for a positive-definite A, every executable Cholesky pivot is strictly positive (equivalently the radicand A[j,j] − Σ_{k < j} L[j,k]² > 0 at every step). Proof avoids matrix inverses entirely.
Strong induction on the pivot index: - `choleskyFn_dot_eq_local` — localized reconstruction needing only the smaller pivot's positivity (which is all the original `choleskyFn_dot_eq` ever uses), powering the induction. - The Schur-complement witness z is built by reusing the already-proven back-substitution `triSolveUpperFn_mulVec` on the leading block (z m = 1, z annihilates the leading columns of L). - `double_sum_gram` collapses the Gram part of zᵀAz to L[m,m]²; the residual part reduces to the single (m,m) entry, giving zᵀAz = radicand exactly. - `Matrix.PosDef.dotProduct_mulVec_pos` forces zᵀAz > 0, hence radicand > 0, hence the pivot √radicand > 0. Unconditional corollaries (no pivot hypothesis): - `solveRidgeFn_mulVec_of_posSemidef` — for PSD K and γ > 0, solveRidgeFn solves (K+γ·I)·x = b exactly. The fully discharged verified core of CHD `solve_variationnal`. - `solveRidgeSpec_mulVec_of_posSemidef` — tensor-level form. Also lands the spec-layer solve defs (triSolveLowerFn/triSolveUpperFn/ cholSolveFn/addScaledIdFn/solveRidgeFn/solveRidgeSpec), the registration import, positive/negative `#eval` examples exhibiting the keystone dichotomy (SPD K+γ·I → all pivots > 0; singular K → a zero pivot), and the blueprint update (the direct kernel-ridge solve route now has nothing left to prove). sorry/omega/admit-free across all new and changed source. 
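The keystone dichotomy can be seen on the smallest possible case. The following is a minimal sketch, assuming plain `Float` arithmetic and a hypothetical helper `pivots2` that is not part of TorchLean (the actual checks in `RidgeSolve.lean` go through `Spec.choleskySpec`):

```lean
/-- Hypothetical helper (illustration only): the two Cholesky pivots of the
symmetric 2×2 matrix `[[a, b], [b, d]]`, computed by the textbook recurrence. -/
def pivots2 (a b d : Float) : Float × Float :=
  let l11 := Float.sqrt a                  -- first pivot: √A[0,0]
  let l21 := b / l11                       -- off-diagonal entry L[1,0]
  let l22 := Float.sqrt (d - l21 * l21)    -- second pivot: √(A[1,1] − L[1,0]²)
  (l11, l22)

-- Singular PSD kernel K = [[1, 1], [1, 1]] (rank 1): the second radicand is
-- 1 − 1 = 0, so the second pivot is zero.
#eval pivots2 1.0 1.0 1.0

-- Regularized K + γ·I with γ = 0.5: both radicands are strictly positive,
-- so both pivots are strictly positive.
#eval pivots2 1.5 1.0 1.5
```

The first call mirrors the negative control (PSD but not PD, hence a zero pivot); the second mirrors the positive check (PSD `K` plus `γ > 0` gives an SPD matrix with all pivots positive).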
Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Examples/Factorization.lean | 8 + NN/Examples/Factorization/RidgeSolve.lean | 129 ++++ NN/Proofs/Tensor/Basic.lean | 1 + .../Tensor/Basic/FactorizationsSolve.lean | 643 ++++++++++++++++++ NN/Spec/Core/Tensor/Factorizations.lean | 47 ++ .../Ch4_Verification/Factorizations.lean | 65 +- 6 files changed, 891 insertions(+), 2 deletions(-) create mode 100644 NN/Examples/Factorization/RidgeSolve.lean create mode 100644 NN/Proofs/Tensor/Basic/FactorizationsSolve.lean diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean index 10be535..2e4d60d 100644 --- a/NN/Examples/Factorization.lean +++ b/NN/Examples/Factorization.lean @@ -13,6 +13,7 @@ public import NN.Examples.Factorization.SymEig public import NN.Examples.Factorization.SVD public import NN.Examples.Factorization.JacobiDecrease public import NN.Examples.Factorization.JacobiRate +public import NN.Examples.Factorization.RidgeSolve /-! # Matrix factorization examples @@ -39,6 +40,13 @@ factorization misbehaves. `‖offDiag(Jᵀ A J)‖² ≤ (1 − 2/(n²−n))·‖offDiag A‖²` (`jacobi_off_decrease_classical`); **negative control**: annihilating a non-largest (tiny) pivot misses the guaranteed factor, so the rate is specific to the largest-pivot choice. +- `RidgeSolve` — the kernel-ridge (Tikhonov) linear solve `(K + γ·I)·x = b` via Cholesky + + forward/back substitution (`solveRidgeFn_mulVec_of_posSemidef`, the verified core of CHD + `solve_variationnal`, now *unconditional* for PSD `K` and `γ > 0`): for a rank-deficient Gram kernel + `K = G·Gᵀ` and `γ > 0`, `solveRidgeFn` reconstructs `b` to machine precision; **negative control**: + with `γ = 0` the singular `K` has a zero Cholesky pivot and the solve diverges (`NaN`), so + regularization is necessary. Also exhibits the **keystone** `choleskyFn_diag_pos_of_posDef`: the SPD + `K + γ·I` has all-positive Cholesky pivots, while the singular `K` has a zero pivot (PosDef needed). 
Both **positive** checks (a valid factorization reconstructs to `err ≈ 0`) and **negative controls** (the same metric reports a large error / `NaN` when a hypothesis is violated) are included, so a diff --git a/NN/Examples/Factorization/RidgeSolve.lean b/NN/Examples/Factorization/RidgeSolve.lean new file mode 100644 index 0000000..6fd9ab2 --- /dev/null +++ b/NN/Examples/Factorization/RidgeSolve.lean @@ -0,0 +1,129 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Examples.Factorization.Common +meta import NN.Examples.Factorization.Common + +/-! +# Example: the kernel-ridge (Tikhonov) linear solve + +These checks corroborate the development in `NN.Proofs.Tensor.Basic.FactorizationsSolve`: the +Cholesky-based solve of `(K + γ·I)·x = b`, the linear solve at the heart of CHD `solve_variationnal`. + +The verified pipeline is: + +* `triSolveLowerFn` / `triSolveUpperFn` solve triangular systems by forward/back substitution + (`triSolveLowerFn_mulVec`, `triSolveUpperFn_mulVec` — exact); +* `cholSolveFn` composes them through a Cholesky factor `L` to solve `(L·Lᵀ)·x = b` + (`cholSolveFn_mulVec` — exact); +* `solveRidgeFn` factors `K + γ·I` and solves, giving `(K + γ·I)·x = b` + (`solveRidgeFn_mulVec`, under the SPD success condition `posDef_addScaledIdFn` provides). + +The kernel `K = G · Gᵀ` here is a rank-deficient (singular) Gram matrix — exactly the GP/kernel +setting CHD targets — so it is *not* invertible on its own. The checks exhibit: + +* **Positive — regularization makes it solvable.** With `γ = 0.5 > 0`, `K + γ·I` is SPD, the Cholesky + succeeds, and `solveRidgeFn` returns `x` with `(K + γ·I)·x = b` to machine precision (the exact + `solveRidgeFn_mulVec`). +* **Negative — regularization is necessary.** With `γ = 0` the singular `K` has a zero Cholesky pivot: + forward/back substitution divides by zero and the residual blows up (`NaN`/large). 
This is why CHD + regularizes; it is also exactly the `γ > 0` hypothesis of `posDef_addScaledIdFn`. +-/ + +@[expose] public section + + +namespace NN.Examples.Factorization.RidgeSolve + +/-- Build a length-`n` `Float` vector from a list (missing entries `0`). -/ +def mkVec {n : Nat} (xs : List Float) : Spec.Tensor Float (.dim n .scalar) := + Spec.ofVecFn (fun i => xs.getD i.val 0.0) + +/-- The regularized matrix `K + γ·I` as a tensor. -/ +def addGammaI {n : Nat} (K : Spec.Tensor Float (.dim n (.dim n .scalar))) (γ : Float) : + Spec.Tensor Float (.dim n (.dim n .scalar)) := + Spec.ofMatFn (fun i j => Spec.get2 K i j + (if i.val == j.val then γ else 0.0)) + +/-- `ℓ¹` magnitude `Σᵢ |vᵢ|` of a vector (residual size). A *sum* rather than a `max` so that a `NaN` +entry — produced when an unregularized singular solve divides by a zero pivot — propagates to the +result instead of being silently dropped by `Float`'s `max`. -/ +def vecAbsErr {n : Nat} (v : Spec.Tensor Float (.dim n .scalar)) : Float := + (List.finRange n).foldl (fun a i => a + Float.abs (Spec.Tensor.toScalar (Spec.get v i))) 0.0 + +/-- Residual `(K + γ·I)·x − b` of a proposed solution `x`. -/ +def ridgeResidual {n : Nat} (K : Spec.Tensor Float (.dim n (.dim n .scalar))) (γ : Float) + (b x : Spec.Tensor Float (.dim n .scalar)) : Spec.Tensor Float (.dim n .scalar) := + Spec.ofVecFn (fun i => + Spec.Tensor.toScalar (Spec.get (Spec.matVecMulSpec (addGammaI K γ) x) i) + - Spec.Tensor.toScalar (Spec.get b i)) + +/-- A `3 × 2` factor; its Gram `K = G · Gᵀ` is a rank-2 (hence singular) `3 × 3` kernel matrix. -/ +def G : Spec.Tensor Float (.dim 3 (.dim 2 .scalar)) := + mkMat [[1, 2], + [3, 1], + [0, 1]] + +/-- The (symmetric, PSD, singular) kernel `K = G · Gᵀ`. -/ +def K : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := mm G (tr G) + +def γ : Float := 0.5 +def b : Spec.Tensor Float (.dim 3 .scalar) := mkVec [1, 2, 3] + +/-- The ridge solution `x = (K + γ·I)⁻¹ b`, via the verified Cholesky solve. 
-/ +def x : Spec.Tensor Float (.dim 3 .scalar) := Spec.solveRidgeSpec K γ b + +#eval IO.println s!"K = G·Gᵀ (rank-2, singular); γ = {γ}; b = {vecToList b}" +#eval IO.println s!"ridge solution x = {vecToList x}" +#eval IO.println s!"residual (K+γI)·x − b = {vecToList (ridgeResidual K γ b x)}" + +-- Positive — the verified solve reconstructs `b`: `(K + γ·I)·x = b` (instance of `solveRidgeFn_mulVec`). +#eval assertLt "kernel-ridge solve: (K + γ·I)·x = b to machine precision" + (vecAbsErr (ridgeResidual K γ b x)) + +/-! ## Negative control: regularization is necessary + +The kernel `K` is singular, so with `γ = 0` its Cholesky has a zero pivot and the substitution +divides by zero — the "solution" does not satisfy the (singular) system. -/ + +def x0 : Spec.Tensor Float (.dim 3 .scalar) := Spec.solveRidgeSpec K 0.0 b + +#eval IO.println s!"unregularized (γ = 0) on singular K: x0 = {vecToList x0}, \ + residual = {vecToList (ridgeResidual K 0.0 b x0)}" + +-- Negative — without regularization the singular system is not solved (zero pivot → NaN/blow-up). +#eval assertReconFails "unregularized solve of singular K fails (γ = 0 → zero Cholesky pivot)" + (vecAbsErr (ridgeResidual K 0.0 b x0)) + +/-! ## Keystone: positive-definite ⟹ strictly positive Cholesky pivots + +`Spec.Factorization.Reconstruction.choleskyFn_diag_pos_of_posDef` proves that an SPD matrix has *all* +Cholesky pivots `> 0` — exactly the success condition the solve needs — and +`solveRidgeFn_mulVec_of_posSemidef` uses it to make the ridge solve unconditional for PSD `K`, `γ > 0`. +These checks exhibit the dichotomy the keystone formalizes. -/ + +/-- Count of non-positive Cholesky pivots of a square matrix. A `NaN` pivot (from `√(negative)` on a +non-SPD matrix) also counts, since `NaN > 0` is `false`. The keystone guarantees this is `0` for an +SPD matrix. 
-/ +def numNonPosPivots {k : Nat} (M : Spec.Tensor Float (.dim k (.dim k .scalar))) : Float := + let L := Spec.choleskySpec M + (List.finRange k).foldl (fun acc j => acc + (if Spec.get2 L j j > 0 then 0.0 else 1.0)) 0.0 + +/-- The SPD regularized matrix `K + γ·I` (`γ = 0.5 > 0`, `K` PSD). -/ +def Kγ : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := addGammaI K γ + +#eval IO.println s!"Cholesky pivots of K + γ·I (SPD): {vecToList (diagOf (Spec.choleskySpec Kγ))}" +#eval IO.println s!"Cholesky pivots of K (singular, γ = 0): {vecToList (diagOf (Spec.choleskySpec K))}" + +-- Positive — SPD ⟹ every Cholesky pivot is > 0 (an instance of `choleskyFn_diag_pos_of_posDef`). +#eval assertLt "SPD K + γ·I has all-positive Cholesky pivots (keystone)" (numNonPosPivots Kγ) + +-- Negative — the singular kernel `K` (PSD but not PD) has a non-positive pivot, so PosDef is needed. +#eval assertGe "singular K has a non-positive Cholesky pivot (PosDef necessary)" + (numNonPosPivots K) 0.5 + +end NN.Examples.Factorization.RidgeSolve diff --git a/NN/Proofs/Tensor/Basic.lean b/NN/Proofs/Tensor/Basic.lean index 33718c6..efaa964 100644 --- a/NN/Proofs/Tensor/Basic.lean +++ b/NN/Proofs/Tensor/Basic.lean @@ -11,6 +11,7 @@ public import NN.Proofs.Tensor.Basic.Folds public import NN.Proofs.Tensor.Basic.LinearAlgebra public import NN.Proofs.Tensor.Basic.Factorizations public import NN.Proofs.Tensor.Basic.FactorizationsReconstruction +public import NN.Proofs.Tensor.Basic.FactorizationsSolve public import NN.Proofs.Tensor.Basic.FactorizationsOrthonormal public import NN.Proofs.Tensor.Basic.FactorizationsJacobi public import NN.Proofs.Tensor.Basic.FactorizationsJacobiDecrease diff --git a/NN/Proofs/Tensor/Basic/FactorizationsSolve.lean b/NN/Proofs/Tensor/Basic/FactorizationsSolve.lean new file mode 100644 index 0000000..b5edb94 --- /dev/null +++ b/NN/Proofs/Tensor/Basic/FactorizationsSolve.lean @@ -0,0 +1,643 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file 
LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Proofs.Tensor.Basic.FactorizationsReconstruction +public import NN.Proofs.Tensor.Basic.FactorizationsOrthonormal +public import Mathlib.Data.Real.StarOrdered + +/-! +# The Cholesky linear solve and the kernel-ridge (Tikhonov) solve + +[`FactorizationsReconstruction`](./FactorizationsReconstruction.lean) proved that the executable +Cholesky factor satisfies `A = L · Lᵀ` exactly (over `ℝ`, under positive pivots). This file uses that +to verify the *linear solve* built on top of it — forward/back substitution — and hence the +kernel-ridge solve `(K + γ·I)·x = b` that is the numerical heart of CHD `solve_variationnal`. + +## Main results + +* `triSolveLowerFn_mulVec` / `triSolveUpperFn_mulVec` — forward/back substitution are correct: for a + lower- (resp. upper-) triangular matrix with nonzero diagonal, the computed vector solves + `L · y = b` (resp. `U · x = y`) **exactly**. These are finite, non-iterative algorithms, so the + identity is exact over `ℝ` — no residual/asymptotic caveat. +* `cholSolveFn_mulVec` — composing the two substitutions through a Cholesky factor `L` solves + `(L · Lᵀ) · x = b` exactly. +* `solveRidgeFn_mulVec` — the kernel-ridge solve: if the Cholesky pivots of `K + γ·I` are positive + (the success condition), then `solveRidgeFn K γ b` solves `(K + γ·I)·x = b` exactly. +* `choleskyFn_diag_pos_of_posDef` — the **keystone**: a positive-definite matrix has strictly positive + executable Cholesky pivots (the radicand `A[j,j] − Σ_{k<j} L[j,k]² > 0` at each step), proved via an + explicit Schur-complement quadratic-form witness. +* `solveRidgeFn_mulVec_of_posSemidef` (and its tensor-level form) — composing the two: for a + positive-semidefinite kernel `K` and `γ > 0`, `solveRidgeFn K γ b` solves `(K + γ·I)·x = b` exactly, + with **no pivot hypothesis**. This is the fully discharged verified `solve_variationnal`.
+ +## Method + +Each substitution is a `Function.update` fold over the index list (`finRange n` forward, its reverse +for back-substitution). The key observation is that **no induction on the solved values is needed**: +the entry `yᵢ` is *defined* to make row `i` of the equation hold, so unfolding its definition and +using triangularity (the not-yet-visited and structurally-zero terms drop out of the row dot product) +gives `(L · y)ᵢ = bᵢ` directly. Two generic lemmas — `foldl_update_read` (the value written at the +split index) and `foldl_update_stable` (earlier entries are never overwritten) — capture the fold +bookkeeping; `sum_split_lt_eq_gt` performs the `k < i / k = i / k > i` trichotomy on the row sum. +-/ + +@[expose] public section + +namespace Spec.Factorization.Reconstruction + +open Matrix +open scoped BigOperators + +variable {n : Nat} + +/-! ## Generic `Function.update`-fold bookkeeping -/ + +/-- An update-fold never changes an index it does not visit. -/ +theorem foldl_update_not_mem (H : (Fin n → ℝ) → Fin n → ℝ) (l : List (Fin n)) + (init : Fin n → ℝ) {x : Fin n} (hx : x ∉ l) : + (l.foldl (fun acc j => Function.update acc j (H acc j)) init) x = init x := by + induction l generalizing init with + | nil => simp + | cons a t ih => + rw [List.foldl_cons, ih (Function.update init a (H init a)) + (fun h => hx (List.mem_cons_of_mem _ h))] + have hxa : x ≠ a := by rintro rfl; exact hx (by simp) + exact Function.update_of_ne hxa _ _ + +/-- Reading an update-fold over `l₁ ++ i :: l₂` at the split index `i` (not revisited in `l₂`) +returns the step value applied to the `l₁`-prefix state. 
-/ +theorem foldl_update_read (H : (Fin n → ℝ) → Fin n → ℝ) (l₁ l₂ : List (Fin n)) + (init : Fin n → ℝ) {i : Fin n} (hi : i ∉ l₂) : + ((l₁ ++ i :: l₂).foldl (fun acc j => Function.update acc j (H acc j)) init) i + = H (l₁.foldl (fun acc j => Function.update acc j (H acc j)) init) i := by + rw [List.foldl_append, List.foldl_cons, foldl_update_not_mem H l₂ _ hi, Function.update_self] + +/-- An update-fold over `l₁ ++ i :: l₂` agrees with its `l₁`-prefix at any index `≠ i` not in `l₂`. -/ +theorem foldl_update_stable (H : (Fin n → ℝ) → Fin n → ℝ) (l₁ l₂ : List (Fin n)) + (init : Fin n → ℝ) {i m : Fin n} (hm : m ∉ l₂) (hmi : m ≠ i) : + ((l₁ ++ i :: l₂).foldl (fun acc j => Function.update acc j (H acc j)) init) m + = (l₁.foldl (fun acc j => Function.update acc j (H acc j)) init) m := by + rw [List.foldl_append, List.foldl_cons, foldl_update_not_mem H l₂ _ hm, + Function.update_of_ne hmi] + +/-! ## Splitting a `Fin n` sum at an index -/ + +/-- Split a sum over `Fin n` into the `k < i`, `k = i`, and `k > i` parts. -/ +theorem sum_split_lt_eq_gt (i : Fin n) (f : Fin n → ℝ) : + (∑ k, f k) = (∑ k, if k.val < i.val then f k else 0) + f i + + (∑ k, if i.val < k.val then f k else 0) := by + rw [show f i = ∑ k, (if k = i then f k else 0) by + rw [Finset.sum_ite_eq' Finset.univ i f]; simp] + rw [← Finset.sum_add_distrib, ← Finset.sum_add_distrib] + apply Finset.sum_congr rfl + intro k _ + rcases lt_trichotomy k.val i.val with h | h | h + · have hne : k ≠ i := fun e => by rw [e] at h; exact lt_irrefl _ h + rw [if_pos h, if_neg hne, if_neg (by linarith), add_zero, add_zero] + · have hki : k = i := Fin.ext h + rw [if_neg (by linarith), if_pos hki, if_neg (by linarith), zero_add, add_zero] + · have hne : k ≠ i := fun e => by rw [e] at h; exact lt_irrefl _ h + rw [if_neg (by linarith), if_neg hne, if_pos h, zero_add, zero_add] + +/-! ## `finRange` order splits -/ + +/-- `finRange n` splits at index `i` as the strictly-smaller prefix, `i`, then the strictly-larger +suffix. 
-/ +theorem finRange_split (i : Fin n) : + List.finRange n + = (List.finRange n).take i.val ++ i :: (List.finRange n).drop (i.val + 1) := by + have hlen : i.val < (List.finRange n).length := by rw [List.length_finRange]; exact i.isLt + conv_lhs => rw [← List.take_append_drop i.val (List.finRange n)] + congr 1 + rw [List.drop_eq_getElem_cons hlen] + congr 1 + simp [List.getElem_finRange] + +/-! ## Forward substitution solves a lower-triangular system exactly -/ + +/-- **Forward substitution is correct.** For a lower-triangular `L` (`L i j = 0` when `i < j`) with +nonzero diagonal, `triSolveLowerFn L b` solves `L · y = b` exactly: row `i` of `L · y` is `bᵢ`. -/ +theorem triSolveLowerFn_mulVec (L : Fin n → Fin n → ℝ) + (hlow : ∀ i j, i < j → L i j = 0) (hdiag : ∀ i, L i i ≠ 0) (b : Fin n → ℝ) (i : Fin n) : + (∑ k, L i k * Spec.triSolveLowerFn L b k) = b i := by + set H : (Fin n → ℝ) → Fin n → ℝ := fun acc j => (b j - Spec.dotFn (L j) acc) / L j j with hH + set y := Spec.triSolveLowerFn L b with hy + set pre := ((List.finRange n).take i.val).foldl + (fun acc j => Function.update acc j (H acc j)) (fun _ => 0) with hpre + -- `y` is the update-fold over `finRange n`. + have hyeq : y = (List.finRange n).foldl (fun acc j => Function.update acc j (H acc j)) + (fun _ => 0) := rfl + -- `i` is not revisited after its turn, and not in its own prefix. + have hi₂ : i ∉ (List.finRange n).drop (i.val + 1) := fun hmem => by + have := mem_drop_finRange hmem; linarith + have hi₁ : i ∉ (List.finRange n).take i.val := fun hmem => by + have := mem_take_finRange hmem; exact lt_irrefl _ this + -- value written at `i`, prefix value at `i`, and stability for `k < i`. 
+ have hy_i : y i = (b i - Spec.dotFn (L i) pre) / L i i := by + rw [hyeq]; conv_lhs => rw [finRange_split i] + rw [foldl_update_read H _ _ _ hi₂] + have hpre_i : pre i = 0 := by rw [hpre, foldl_update_not_mem H _ _ hi₁] + have hy_lt : ∀ m : Fin n, m.val < i.val → y m = pre m := by + intro m hm + have hm₂ : m ∉ (List.finRange n).drop (i.val + 1) := fun hmem => by + have := mem_drop_finRange hmem; linarith + have hmi : m ≠ i := fun e => by rw [e] at hm; exact lt_irrefl _ hm + rw [hyeq]; conv_lhs => rw [finRange_split i] + rw [foldl_update_stable H _ _ _ hm₂ hmi] + -- the row dot product `dotFn (L i) pre` is the masked partial sum over `k < i`. + have hdot : Spec.dotFn (L i) pre = ∑ k, if k.val < i.val then L i k * pre k else 0 := by + rw [dotFn_eq_sum, sum_split_lt_eq_gt i (fun k => L i k * pre k)] + rw [hpre_i, mul_zero] + rw [show (∑ k, if i.val < k.val then L i k * pre k else 0) = 0 by + apply Finset.sum_eq_zero; intro k _ + by_cases hk : i.val < k.val + · rw [if_pos hk, hlow i k (by exact hk), zero_mul] + · rw [if_neg hk]] + ring + -- assemble row `i` of `L · y`. + rw [sum_split_lt_eq_gt i (fun k => L i k * y k)] + rw [show (∑ k, if i.val < k.val then L i k * y k else 0) = 0 by + apply Finset.sum_eq_zero; intro k _ + by_cases hk : i.val < k.val + · rw [if_pos hk, hlow i k (by exact hk), zero_mul] + · rw [if_neg hk]] + rw [show (∑ k, if k.val < i.val then L i k * y k else 0) + = ∑ k, if k.val < i.val then L i k * pre k else 0 by + apply Finset.sum_congr rfl; intro k _ + by_cases hk : k.val < i.val + · rw [if_pos hk, if_pos hk, hy_lt k hk] + · rw [if_neg hk, if_neg hk]] + rw [← hdot, hy_i, add_zero] + have hdi : L i i ≠ 0 := hdiag i + field_simp + ring + +/-! ## Back substitution solves an upper-triangular system exactly -/ + +/-- `(finRange n).reverse` splits at index `i` as the strictly-larger block (reversed suffix), +then `i`, then the strictly-smaller block (reversed prefix). 
-/ +theorem finRange_reverse_split (i : Fin n) : + (List.finRange n).reverse + = ((List.finRange n).drop (i.val + 1)).reverse + ++ i :: ((List.finRange n).take i.val).reverse := by + conv_lhs => rw [finRange_split i] + rw [List.reverse_append, List.reverse_cons, List.append_assoc, List.singleton_append] + +/-- **Back substitution is correct.** For an upper-triangular `U` (`U i j = 0` when `j < i`) with +nonzero diagonal, `triSolveUpperFn U c` solves `U · x = c` exactly: row `i` of `U · x` is `cᵢ`. -/ +theorem triSolveUpperFn_mulVec (U : Fin n → Fin n → ℝ) + (hup : ∀ i j, j < i → U i j = 0) (hdiag : ∀ i, U i i ≠ 0) (c : Fin n → ℝ) (i : Fin n) : + (∑ k, U i k * Spec.triSolveUpperFn U c k) = c i := by + set H : (Fin n → ℝ) → Fin n → ℝ := fun acc j => (c j - Spec.dotFn (U j) acc) / U j j with hH + set y := Spec.triSolveUpperFn U c with hy + set pre := (((List.finRange n).drop (i.val + 1)).reverse).foldl + (fun acc j => Function.update acc j (H acc j)) (fun _ => 0) with hpre + have hyeq : y = ((List.finRange n).reverse).foldl + (fun acc j => Function.update acc j (H acc j)) (fun _ => 0) := rfl + have hi₂ : i ∉ ((List.finRange n).take i.val).reverse := fun hmem => by + rw [List.mem_reverse] at hmem; have := mem_take_finRange hmem; exact lt_irrefl _ this + have hi₁ : i ∉ ((List.finRange n).drop (i.val + 1)).reverse := fun hmem => by + rw [List.mem_reverse] at hmem; have := mem_drop_finRange hmem; linarith + have hy_i : y i = (c i - Spec.dotFn (U i) pre) / U i i := by + rw [hyeq]; conv_lhs => rw [finRange_reverse_split i] + rw [foldl_update_read H _ _ _ hi₂] + have hpre_i : pre i = 0 := by rw [hpre, foldl_update_not_mem H _ _ hi₁] + have hy_gt : ∀ m : Fin n, i.val < m.val → y m = pre m := by + intro m hm + have hm₂ : m ∉ ((List.finRange n).take i.val).reverse := fun hmem => by + rw [List.mem_reverse] at hmem; have := mem_take_finRange hmem; linarith + have hmi : m ≠ i := fun e => by rw [e] at hm; exact lt_irrefl _ hm + rw [hyeq]; conv_lhs => rw [finRange_reverse_split i] + 
rw [foldl_update_stable H _ _ _ hm₂ hmi] + have hdot : Spec.dotFn (U i) pre = ∑ k, if i.val < k.val then U i k * pre k else 0 := by + rw [dotFn_eq_sum, sum_split_lt_eq_gt i (fun k => U i k * pre k)] + rw [hpre_i, mul_zero] + rw [show (∑ k, if k.val < i.val then U i k * pre k else 0) = 0 by + apply Finset.sum_eq_zero; intro k _ + by_cases hk : k.val < i.val + · rw [if_pos hk, hup i k (by exact hk), zero_mul] + · rw [if_neg hk]] + ring + rw [sum_split_lt_eq_gt i (fun k => U i k * y k)] + rw [show (∑ k, if k.val < i.val then U i k * y k else 0) = 0 by + apply Finset.sum_eq_zero; intro k _ + by_cases hk : k.val < i.val + · rw [if_pos hk, hup i k (by exact hk), zero_mul] + · rw [if_neg hk]] + rw [show (∑ k, if i.val < k.val then U i k * y k else 0) + = ∑ k, if i.val < k.val then U i k * pre k else 0 by + apply Finset.sum_congr rfl; intro k _ + by_cases hk : i.val < k.val + · rw [if_pos hk, if_pos hk, hy_gt k hk] + · rw [if_neg hk, if_neg hk]] + rw [← hdot, hy_i] + have hdi : U i i ≠ 0 := hdiag i + field_simp + ring + +/-! ## The Cholesky linear solve -/ + +/-- **Cholesky solve is correct.** For a lower-triangular `L` with nonzero diagonal, the two-pass +substitution `cholSolveFn L b` solves `(L · Lᵀ) · x = b` exactly. 
-/ +theorem cholSolveFn_mulVec (L : Fin n → Fin n → ℝ) + (hlow : ∀ i j, i < j → L i j = 0) (hdiag : ∀ i, L i i ≠ 0) (b : Fin n → ℝ) : + (Matrix.of L * (Matrix.of L)ᵀ) *ᵥ (Spec.cholSolveFn L b) = b := by + set z := Spec.triSolveLowerFn L b with hz + set U : Fin n → Fin n → ℝ := fun i k => L k i with hU + have hup : ∀ i j, j < i → U i j = 0 := fun i j hji => hlow j i hji + have hUdiag : ∀ i, U i i ≠ 0 := fun i => hdiag i + have hUp : (Matrix.of L)ᵀ *ᵥ (Spec.cholSolveFn L b) = z := by + funext i + have hx : Spec.cholSolveFn L b = Spec.triSolveUpperFn U z := rfl + show (∑ k, ((Matrix.of L)ᵀ i k) * Spec.cholSolveFn L b k) = z i + simp only [Matrix.transpose_apply, Matrix.of_apply] + rw [hx] + exact triSolveUpperFn_mulVec U hup hUdiag z i + have hLow : (Matrix.of L) *ᵥ z = b := by + funext i + show (∑ k, (Matrix.of L i k) * z k) = b i + simp only [Matrix.of_apply] + exact triSolveLowerFn_mulVec L hlow hdiag b i + calc (Matrix.of L * (Matrix.of L)ᵀ) *ᵥ (Spec.cholSolveFn L b) + = Matrix.of L *ᵥ ((Matrix.of L)ᵀ *ᵥ (Spec.cholSolveFn L b)) := by + rw [Matrix.mulVec_mulVec] + _ = Matrix.of L *ᵥ z := by rw [hUp] + _ = b := hLow + +/-! ## The kernel-ridge (Tikhonov) solve -/ + +/-- **Kernel-ridge solve is correct (conditional on Cholesky success).** If `K` is symmetric and the +Cholesky pivots of `K + γ·I` are positive — exactly the condition under which the SPD Cholesky +succeeds — then `solveRidgeFn K γ b` solves `(K + γ·I)·x = b` exactly. This is the verified core of +CHD `solve_variationnal`; the positive-pivot hypothesis is discharged unconditionally for an SPD +`K + γ·I` (PSD kernel `K`, `γ > 0`) in the companion development. 
-/ +theorem solveRidgeFn_mulVec (K : Fin n → Fin n → ℝ) (γ : ℝ) (b : Fin n → ℝ) + (hsymm : ∀ i j, K i j = K j i) + (hpos : ∀ j : Fin n, 0 < Spec.choleskyFn (Spec.addScaledIdFn K γ) j j) : + (Matrix.of (Spec.addScaledIdFn K γ)) *ᵥ (Spec.solveRidgeFn K γ b) = b := by + set A := Spec.addScaledIdFn K γ with hA + have hAsymm : ∀ i j, A i j = A j i := by + intro i j + show K i j + (if i = j then γ else 0) = K j i + (if j = i then γ else 0) + rw [hsymm i j] + by_cases h : i = j + · rw [h] + · rw [if_neg h, if_neg (fun e => h e.symm)] + obtain ⟨hlowM, hreconM⟩ := isCholesky_of_pos A hAsymm hpos + have hlow : ∀ i j, i < j → Spec.choleskyFn A i j = 0 := fun i j hij => by + have := hlowM i j hij; simpa using this + have hdiag : ∀ i, Spec.choleskyFn A i i ≠ 0 := fun i => ne_of_gt (hpos i) + have hxeq : Spec.solveRidgeFn K γ b = Spec.cholSolveFn (Spec.choleskyFn A) b := rfl + rw [hxeq, hreconM] + exact cholSolveFn_mulVec (Spec.choleskyFn A) hlow hdiag b + +/-! ## The regularized matrix `K + γ·I` is symmetric positive-definite + +For a positive-semidefinite kernel `K` and regularization `γ > 0`, `K + γ·I` is positive definite — +the precondition under which the Cholesky-based `solveRidgeFn` is the genuine linear solve. Combined +with the keystone below (`choleskyFn_diag_pos_of_posDef`: an SPD matrix has strictly positive +executable Cholesky pivots), this discharges the positive-pivot hypothesis of `solveRidgeFn_mulVec` +unconditionally, giving `solveRidgeFn_mulVec_of_posSemidef`. -/ + +/-- `Matrix.of (addScaledIdFn K γ) = Matrix.of K + γ • 1`. -/ +theorem of_addScaledIdFn (K : Fin n → Fin n → ℝ) (γ : ℝ) : + Matrix.of (Spec.addScaledIdFn K γ) = Matrix.of K + γ • (1 : Matrix (Fin n) (Fin n) ℝ) := by + ext i j + simp only [Matrix.of_apply, Matrix.add_apply, Matrix.smul_apply, Matrix.one_apply, + Spec.addScaledIdFn, smul_eq_mul] + by_cases h : i = j <;> simp [h] + +/-- **The regularized (ridge) matrix is SPD.** For a PSD kernel `K` and `γ > 0`, `K + γ·I` is positive +definite. 
This is the precondition that makes the Cholesky ridge solve `solveRidgeFn` well-posed +(its Cholesky factorization exists with positive pivots). -/ +theorem posDef_addScaledIdFn {K : Fin n → Fin n → ℝ} (hK : (Matrix.of K).PosSemidef) + {γ : ℝ} (hγ : 0 < γ) : (Matrix.of (Spec.addScaledIdFn K γ)).PosDef := by + rw [of_addScaledIdFn] + exact Matrix.PosDef.posSemidef_add hK (Matrix.PosDef.one.smul hγ) + +/-! ## Keystone: a positive-definite matrix has strictly positive Cholesky pivots + +The remaining ingredient that makes `solveRidgeFn_mulVec` unconditional for SPD inputs: for a +*positive-definite* `A`, every executable Cholesky pivot is `> 0`. Equivalently, the radicand +`A[j,j] − Σ_{k<j} L[j,k]² > 0` at every step, so the `√` never sees a non-positive argument. + +The argument is the classical Schur-complement fact, formalized as an **explicit quadratic-form +witness** (so it needs no matrix inverse). By strong induction on `j`, the leading `j`-block +reconstructs from the pivots below `j` (`choleskyFn_dot_eq_local`). Back-substitution +(`triSolveUpperFn`, already proven correct in this file) produces a vector `z` with `z j = 1` whose +`A`-quadratic form `zᵀ A z` is exactly the radicand; positive-definiteness forces `zᵀ A z > 0`. -/ + +/-- A double Gram sum collapses to a sum of squares: +`∑ᵢ∑ⱼ zᵢ·(∑ₗ Mᵢₗ Mⱼₗ)·zⱼ = ∑ₗ (∑ᵢ zᵢ Mᵢₗ)²`. (The `A = M·Mᵀ` reconstruction turns the witness +quadratic form into a manifestly nonnegative shape.)
-/ +theorem double_sum_gram (z : Fin n → ℝ) (M : Fin n → Fin n → ℝ) : + (∑ i, ∑ j, z i * ((∑ l, M i l * M j l) * z j)) + = ∑ l, (∑ i, z i * M i l) * (∑ i, z i * M i l) := by + have hexp : (∑ i, ∑ j, z i * ((∑ l, M i l * M j l) * z j)) + = ∑ i, ∑ j, ∑ l, (z i * M i l) * (z j * M j l) := by + refine Finset.sum_congr rfl (fun i _ => Finset.sum_congr rfl (fun j _ => ?_)) + rw [Finset.sum_mul, Finset.mul_sum] + exact Finset.sum_congr rfl (fun l _ => by ring) + rw [hexp, + show (∑ i, ∑ j, ∑ l, (z i * M i l) * (z j * M j l)) + = ∑ i, ∑ l, ∑ j, (z i * M i l) * (z j * M j l) + from Finset.sum_congr rfl (fun i _ => Finset.sum_comm), + Finset.sum_comm] + refine Finset.sum_congr rfl (fun l _ => ?_) + rw [Fintype.sum_mul_sum (fun i => z i * M i l) (fun j => z j * M j l)] + +/-- **Localized per-entry Cholesky reconstruction.** The proof of `choleskyFn_dot_eq` only uses the +positivity of the *smaller* pivot `L[j,j]`, so for `j ≤ i` the reconstruction `∑ₖ L[i,k]·L[j,k] = +A[i,j]` holds assuming only `0 < L[j,j]` — not global positivity. This is what powers the strong +induction in `choleskyFn_diag_pos_of_posDef`. 
-/ +theorem choleskyFn_dot_eq_local (A : Fin n → Fin n → ℝ) {i j : Fin n} + (hjpos : 0 < Spec.choleskyFn A j j) (hji : j.val ≤ i.val) : + (∑ k, Spec.choleskyFn A i k * Spec.choleskyFn A j k) = A i j := by + set L := Spec.choleskyFn A with hL + have key : ∀ k : Fin n, L i k * L j k + = (if k.val < j.val then L i k * L j k else 0) + (if k = j then L i j * L j j else 0) := by + intro k + rcases lt_trichotomy k.val j.val with h | h | h + · have hne : k ≠ j := fun hk => by rw [hk] at h; exact lt_irrefl _ h + rw [if_pos h, if_neg hne, add_zero] + · have hkj : k = j := Fin.ext h + rw [if_neg (by rw [h]; exact lt_irrefl _), if_pos hkj, zero_add, hkj] + · have hne : k ≠ j := fun hk => by rw [hk] at h; exact lt_irrefl _ h + rw [if_neg (Nat.not_lt.mpr (le_of_lt h)), if_neg hne, add_zero, + show L j k = 0 from by rw [hL]; exact Spec.Factorization.choleskyFn_lower_triangular A h, + mul_zero] + rw [show (∑ k, L i k * L j k) + = ∑ k, ((if k.val < j.val then L i k * L j k else 0) + (if k = j then L i j * L j j else 0)) + from Finset.sum_congr rfl (fun k _ => key k), + Finset.sum_add_distrib, Finset.sum_ite_eq' Finset.univ j (fun _ => L i j * L j j)] + simp only [Finset.mem_univ, if_true] + rcases eq_or_lt_of_le hji with heq | hlt + · have hij' : i = j := Fin.ext heq.symm + subst hij' + have hrad : 0 < A i i - (∑ k, if k.val < i.val then L i k * L i k else 0) := by + have hp := hjpos + rw [hL, choleskyFn_diag_eq] at hp + exact Real.sqrt_pos.mp hp + have hsq : L i i * L i i = A i i - (∑ k, if k.val < i.val then L i k * L i k else 0) := by + conv_lhs => rw [hL, choleskyFn_diag_eq A i] + exact Real.mul_self_sqrt hrad.le + rw [hsq]; ring + · have hne : L j j ≠ 0 := ne_of_gt hjpos + have hmul : L i j * L j j + = A i j - (∑ k, if k.val < j.val then L i k * L j k else 0) := by + rw [hL, choleskyFn_offdiag_eq A hlt, div_mul_eq_mul_div, mul_div_assoc, div_self hne, mul_one] + rw [hmul]; ring + +/-- **The radicand / Schur keystone.** For a positive-definite `A`, every executable Cholesky 
pivot is +strictly positive: `0 < L[j,j]`. Hence the SPD Cholesky succeeds and the ridge solve is exact. -/ +theorem choleskyFn_diag_pos_of_posDef (A : Fin n → Fin n → ℝ) (hpd : (Matrix.of A).PosDef) + (m : Fin n) : 0 < Spec.choleskyFn A m m := by + -- symmetry of `A` from Hermitian-ness + have hsymm : ∀ i j, A i j = A j i := by + intro i j + have h := hpd.1.apply i j + simp only [Matrix.of_apply, star_trivial] at h + exact h.symm + -- strong induction on `m.val` + suffices H : ∀ N : Nat, ∀ m : Fin n, m.val = N → 0 < Spec.choleskyFn A m m by + exact H m.val m rfl + intro N + induction N using Nat.strong_induction_on with + | _ N IH => + intro m hmN + have ihpos : ∀ i : Fin n, i.val < m.val → 0 < Spec.choleskyFn A i i := fun i hi => + IH i.val (hmN ▸ hi) i rfl + -- reduce to positivity of the radicand + rw [choleskyFn_diag_eq A m, Real.sqrt_pos] + -- localized reconstruction for pairs `≤ m` with at least one index `< m` + have hAij : ∀ i j : Fin n, i.val ≤ m.val → j.val ≤ m.val → (i.val < m.val ∨ j.val < m.val) → + (∑ l, Spec.choleskyFn A i l * Spec.choleskyFn A j l) = A i j := by + intro i j _ _ hor + rcases le_total j.val i.val with hle | hle + · have hjm : j.val < m.val := by + rcases hor with h | h + · exact lt_of_le_of_lt hle h + · exact h + exact choleskyFn_dot_eq_local A (ihpos j hjm) hle + · have him : i.val < m.val := by + rcases hor with h | h + · exact h + · exact lt_of_le_of_lt hle h + rw [show (∑ l, Spec.choleskyFn A i l * Spec.choleskyFn A j l) + = ∑ l, Spec.choleskyFn A j l * Spec.choleskyFn A i l + from Finset.sum_congr rfl (fun l _ => mul_comm _ _), + choleskyFn_dot_eq_local A (ihpos i him) hle, hsymm j i] + -- the back-substitution system solving `(Lₘᵀ) z = −(row m of L)` on the leading block + set U' : Fin n → Fin n → ℝ := fun l i => + if l.val < m.val then (if i.val < m.val then Spec.choleskyFn A i l else 0) + else (if i = l then 1 else 0) with hU' + set c : Fin n → ℝ := fun l => if l.val < m.val then -(Spec.choleskyFn A m l) else 0 with hc + set 
x' := Spec.triSolveUpperFn U' c with hx' + set z : Fin n → ℝ := fun i => if i = m then 1 else x' i with hz + have zm1 : z m = 1 := by simp [hz] + -- `U'` is upper-triangular with nonzero diagonal + have hup : ∀ a b : Fin n, b.val < a.val → U' a b = 0 := by + intro a b hba + simp only [hU'] + by_cases ha : a.val < m.val + · rw [if_pos ha] + by_cases hb : b.val < m.val + · rw [if_pos hb]; exact Spec.Factorization.choleskyFn_lower_triangular A hba + · rw [if_neg hb] + · rw [if_neg ha, if_neg (by intro e; rw [e] at hba; exact lt_irrefl _ hba)] + have hUdiag : ∀ a : Fin n, U' a a ≠ 0 := by + intro a + simp only [hU'] + by_cases ha : a.val < m.val + · rw [if_pos ha, if_pos ha]; exact ne_of_gt (ihpos a ha) + · rw [if_neg ha]; simp + have hsolve : ∀ l : Fin n, (∑ i, U' l i * x' i) = c l := fun l => + triSolveUpperFn_mulVec U' hup hUdiag c l + -- entries `≥ m` of the solve vanish + have hx'_ge : ∀ l : Fin n, m.val ≤ l.val → x' l = 0 := by + intro l hl + have hlm : ¬ l.val < m.val := Nat.not_lt.mpr hl + have hsum : (∑ i, U' l i * x' i) = x' l := by + rw [show (∑ i, U' l i * x' i) = ∑ i, (if i = l then x' i else 0) from + Finset.sum_congr rfl (fun i _ => by + simp only [hU', if_neg hlm] + by_cases hi : i = l + · rw [if_pos hi, if_pos hi, one_mul] + · rw [if_neg hi, if_neg hi, zero_mul]), + Finset.sum_ite_eq' Finset.univ l (fun i => x' i)] + simp + have hcl : c l = 0 := by simp only [hc, if_neg hlm] + have := hsolve l + rw [hsum, hcl] at this + exact this + have hz_gt : ∀ i : Fin n, m.val < i.val → z i = 0 := by + intro i hi + have hne : i ≠ m := fun e => by rw [e] at hi; exact lt_irrefl _ hi + simp only [hz, if_neg hne] + exact hx'_ge i (le_of_lt hi) + -- the witness annihilates the leading columns of `L` + have hker : ∀ l : Fin n, l.val < m.val → (∑ i, z i * Spec.choleskyFn A i l) = 0 := by + intro l hlm + have hpl : (∑ i, (if i.val < m.val then x' i * Spec.choleskyFn A i l else 0)) + = -(Spec.choleskyFn A m l) := by + have h := hsolve l + rw [show c l = -(Spec.choleskyFn A 
m l) from by simp only [hc, if_pos hlm]] at h + rw [← h] + refine Finset.sum_congr rfl (fun i _ => ?_) + simp only [hU', if_pos hlm] + by_cases hi : i.val < m.val + · rw [if_pos hi, if_pos hi, mul_comm] + · rw [if_neg hi, if_neg hi, zero_mul] + have tw : ∀ i : Fin n, z i * Spec.choleskyFn A i l + = (if i.val < m.val then x' i * Spec.choleskyFn A i l else 0) + + (if i = m then Spec.choleskyFn A m l else 0) := by + intro i + rcases lt_trichotomy i.val m.val with hi | hi | hi + · have hne : i ≠ m := fun e => by rw [e] at hi; exact lt_irrefl _ hi + rw [if_pos hi, if_neg hne, add_zero] + simp only [hz, if_neg hne] + · have him : i = m := Fin.ext hi + rw [if_neg (by rw [hi]; exact lt_irrefl _), if_pos him, zero_add, him, zm1, one_mul] + · have hne : i ≠ m := fun e => by rw [e] at hi; exact lt_irrefl _ hi + rw [if_neg (Nat.not_lt.mpr (le_of_lt hi)), if_neg hne, add_zero, + show z i = 0 from hz_gt i hi, zero_mul] + rw [show (∑ i, z i * Spec.choleskyFn A i l) + = ∑ i, ((if i.val < m.val then x' i * Spec.choleskyFn A i l else 0) + + (if i = m then Spec.choleskyFn A m l else 0)) + from Finset.sum_congr rfl (fun i _ => tw i), + Finset.sum_add_distrib, Finset.sum_ite_eq' Finset.univ m (fun _ => Spec.choleskyFn A m l)] + simp only [Finset.mem_univ, if_true] + rw [hpl]; ring + -- value of the column-`l` contraction `∑ᵢ zᵢ L[i,l]` + have wval : ∀ l : Fin n, (∑ i, z i * Spec.choleskyFn A i l) + = if l = m then Spec.choleskyFn A m m else 0 := by + intro l + rcases lt_trichotomy l.val m.val with hl | hl | hl + · rw [if_neg (fun e => by rw [e] at hl; exact lt_irrefl _ hl)] + exact hker l hl + · have hlm : l = m := Fin.ext hl + rw [if_pos hlm, hlm] + have hper : ∀ i : Fin n, z i * Spec.choleskyFn A i m + = if i = m then Spec.choleskyFn A m m else 0 := by + intro i + rcases lt_trichotomy i.val m.val with hi | hi | hi + · rw [if_neg (fun e => by rw [e] at hi; exact lt_irrefl _ hi), + Spec.Factorization.choleskyFn_lower_triangular A hi, mul_zero] + · have him : i = m := Fin.ext hi + rw 
[if_pos him, him, zm1, one_mul] + · rw [if_neg (fun e => by rw [e] at hi; exact lt_irrefl _ hi), + show z i = 0 from hz_gt i hi, zero_mul] + rw [Finset.sum_congr rfl (fun i _ => hper i), + Finset.sum_ite_eq' Finset.univ m (fun _ => Spec.choleskyFn A m m)] + simp + · rw [if_neg (fun e => by rw [e] at hl; exact lt_irrefl _ hl)] + refine Finset.sum_eq_zero (fun i _ => ?_) + rcases Nat.lt_or_ge i.val l.val with hi | hi + · rw [Spec.Factorization.choleskyFn_lower_triangular A hi, mul_zero] + · rw [show z i = 0 from hz_gt i (lt_of_lt_of_le hl hi), zero_mul] + -- the Gram term `T1 = ∑ₗ (∑ᵢ zᵢ L[i,l])² = L[m,m]²` + have T1eval : (∑ l, (∑ i, z i * Spec.choleskyFn A i l) * (∑ i, z i * Spec.choleskyFn A i l)) + = Spec.choleskyFn A m m * Spec.choleskyFn A m m := by + rw [show (∑ l, (∑ i, z i * Spec.choleskyFn A i l) * (∑ i, z i * Spec.choleskyFn A i l)) + = ∑ l, (if l = m then Spec.choleskyFn A m m * Spec.choleskyFn A m m else 0) + from Finset.sum_congr rfl (fun l _ => by + rw [wval l] + by_cases hlm : l = m + · rw [if_pos hlm, if_pos hlm] + · rw [if_neg hlm, if_neg hlm, mul_zero]), + Finset.sum_ite_eq' Finset.univ m (fun _ => Spec.choleskyFn A m m * Spec.choleskyFn A m m)] + simp + -- the residual term `T2 = ∑ᵢ∑ⱼ zᵢ (A[i,j] − R[i,j]) zⱼ` reduces to the `(m,m)` entry + have T2eval : (∑ i, ∑ j, z i + * ((A i j - ∑ l, Spec.choleskyFn A i l * Spec.choleskyFn A j l) * z j)) + = A m m - ∑ l, Spec.choleskyFn A m l * Spec.choleskyFn A m l := by + rw [Finset.sum_eq_single m] + · rw [Finset.sum_eq_single m] + · rw [zm1]; ring + · intro j _ hj + rcases lt_trichotomy j.val m.val with hjm | hjm | hjm + · rw [hAij m j (le_refl _) (le_of_lt hjm) (Or.inr hjm)]; ring + · exact absurd (Fin.ext hjm) hj + · rw [show z j = 0 from hz_gt j hjm, mul_zero, mul_zero] + · intro h; exact absurd (Finset.mem_univ m) h + · intro i _ hi + refine Finset.sum_eq_zero (fun j _ => ?_) + rcases lt_trichotomy i.val m.val with him | him | him + · rcases lt_trichotomy j.val m.val with hjm | hjm | hjm + · rw [hAij i 
j (le_of_lt him) (le_of_lt hjm) (Or.inl him)]; ring + · rw [hAij i j (le_of_lt him) (le_of_eq hjm) (Or.inl him)]; ring + · rw [show z j = 0 from hz_gt j hjm, mul_zero, mul_zero] + · exact absurd (Fin.ext him) hi + · rw [show z i = 0 from hz_gt i him, zero_mul] + · intro h; exact absurd (Finset.mem_univ m) h + -- splitting the full squared norm of row `m` of `L` (the `> m` part vanishes) + have Rmm_split : (∑ l, Spec.choleskyFn A m l * Spec.choleskyFn A m l) + = (∑ k, if k.val < m.val then Spec.choleskyFn A m k * Spec.choleskyFn A m k else 0) + + Spec.choleskyFn A m m * Spec.choleskyFn A m m := by + rw [sum_split_lt_eq_gt m (fun l => Spec.choleskyFn A m l * Spec.choleskyFn A m l), + show (∑ k, if m.val < k.val then Spec.choleskyFn A m k * Spec.choleskyFn A m k else 0) = 0 + from Finset.sum_eq_zero (fun k _ => by + by_cases hk : m.val < k.val + · rw [if_pos hk, Spec.Factorization.choleskyFn_lower_triangular A hk, zero_mul] + · rw [if_neg hk])] + ring + -- the witness quadratic form `zᵀ A z` equals the radicand + have hqf : star z ⬝ᵥ (Matrix.of A *ᵥ z) = ∑ i, ∑ j, z i * (A i j * z j) := by + show (∑ i, star (z i) * (∑ j, (Matrix.of A) i j * z j)) = _ + refine Finset.sum_congr rfl (fun i _ => ?_) + rw [star_trivial, Finset.mul_sum] + exact Finset.sum_congr rfl (fun j _ => by rw [Matrix.of_apply]) + have hQsplit : (∑ i, ∑ j, z i * (A i j * z j)) + = (∑ i, ∑ j, z i * ((∑ l, Spec.choleskyFn A i l * Spec.choleskyFn A j l) * z j)) + + (∑ i, ∑ j, z i + * ((A i j - ∑ l, Spec.choleskyFn A i l * Spec.choleskyFn A j l) * z j)) := by + rw [← Finset.sum_add_distrib] + refine Finset.sum_congr rfl (fun i _ => ?_) + rw [← Finset.sum_add_distrib] + exact Finset.sum_congr rfl (fun j _ => by ring) + have hqf_eq_rad : star z ⬝ᵥ (Matrix.of A *ᵥ z) + = A m m - ∑ k, if k.val < m.val then Spec.choleskyFn A m k * Spec.choleskyFn A m k else 0 := by + rw [hqf, hQsplit, double_sum_gram z (Spec.choleskyFn A), T1eval, T2eval, Rmm_split] + ring + -- positive-definiteness applied to the nonzero 
witness finishes it + have hz_ne : z ≠ 0 := fun h => one_ne_zero (by + have hzm := congrFun h m; rwa [zm1, Pi.zero_apply] at hzm) + have hpos := hpd.dotProduct_mulVec_pos hz_ne + rw [hqf_eq_rad] at hpos + exact hpos + +/-! ## The kernel-ridge solve, unconditional for SPD inputs -/ + +/-- **Kernel-ridge solve, unconditional for an SPD regularized system.** For a positive-semidefinite +kernel `K` and `γ > 0`, `solveRidgeFn K γ b` solves `(K + γ·I)·x = b` exactly — with *no* pivot +hypothesis. This is the fully discharged verified `solve_variationnal`: the keystone +`choleskyFn_diag_pos_of_posDef` supplies the positive pivots from `posDef_addScaledIdFn`. -/ +theorem solveRidgeFn_mulVec_of_posSemidef (K : Fin n → Fin n → ℝ) (γ : ℝ) (b : Fin n → ℝ) + (hK : (Matrix.of K).PosSemidef) (hγ : 0 < γ) : + (Matrix.of (Spec.addScaledIdFn K γ)) *ᵥ (Spec.solveRidgeFn K γ b) = b := by + have hpd : (Matrix.of (Spec.addScaledIdFn K γ)).PosDef := posDef_addScaledIdFn hK hγ + have hsymm : ∀ i j, K i j = K j i := by + intro i j + have h := hK.1.apply i j + simp only [Matrix.of_apply, star_trivial] at h + exact h.symm + exact solveRidgeFn_mulVec K γ b hsymm + (fun j => choleskyFn_diag_pos_of_posDef (Spec.addScaledIdFn K γ) hpd j) + +/-- **Tensor-level kernel-ridge solve, unconditional for SPD inputs.** For a tensor kernel `K` whose +matrix view is positive-semidefinite and `γ > 0`, `solveRidgeSpec K γ b` solves `(K + γ·I)·x = b` +exactly: `(K + γ·I) *ᵥ (solveRidgeSpec K γ b) = b`. 
-/ +theorem solveRidgeSpec_mulVec_of_posSemidef (K : Spec.Tensor ℝ (.dim n (.dim n .scalar))) (γ : ℝ) + (b : Spec.Tensor ℝ (.dim n .scalar)) (hK : (Matrix.of (Spec.toMatFn K)).PosSemidef) (hγ : 0 < γ) : + (Matrix.of (Spec.addScaledIdFn (Spec.toMatFn K) γ)) *ᵥ (Spec.toVecFn (Spec.solveRidgeSpec K γ b)) + = Spec.toVecFn b := by + have hround : Spec.toVecFn (Spec.solveRidgeSpec K γ b) + = Spec.solveRidgeFn (Spec.toMatFn K) γ (Spec.toVecFn b) := by + funext i; rfl + rw [hround] + exact solveRidgeFn_mulVec_of_posSemidef (Spec.toMatFn K) γ (Spec.toVecFn b) hK hγ diff --git a/NN/Spec/Core/Tensor/Factorizations.lean b/NN/Spec/Core/Tensor/Factorizations.lean index 0fbe845..98c2b42 100644 --- a/NN/Spec/Core/Tensor/Factorizations.lean +++ b/NN/Spec/Core/Tensor/Factorizations.lean @@ -134,6 +134,53 @@ def choleskySpec {n : Nat} (A : Tensor α (.dim n (.dim n .scalar))) : Tensor α (.dim n (.dim n .scalar)) := ofMatFn (choleskyFn (toMatFn A)) +/-! ## Triangular solves and the kernel-ridge (Tikhonov) linear solve + +Once `A` is factored as `A = L · Lᵀ` (Cholesky), the linear system `A · x = b` is solved by two +triangular substitutions: forward-solve `L · z = b`, then back-solve `Lᵀ · x = z`. Each substitution +visits the unknowns in an order such that, when row `i` is reached, every unknown it depends on has +already been computed; the accumulator `acc` holds those values and `0` everywhere else, so the dot +`dotFn (row i) acc` is exactly the required partial sum (the not-yet-solved and structurally-zero +terms drop out). This is the linear solve at the heart of CHD `solve_variationnal`. -/ + +/-- Forward substitution: solve `L · y = b` for a lower-triangular `L` with nonzero diagonal. 
+Unknowns are visited `0, 1, …, n-1`; when row `i` is reached `acc` holds `y₀ … yᵢ₋₁` (and `0` +elsewhere), so `dotFn (L i) acc = Σ_{k<i} L[i,k]·yₖ` by lower-triangularity. -/ +def triSolveLowerFn {n : Nat} (L : Fin n → Fin n → α) (b : Fin n → α) : Fin n → α := + (List.finRange n).foldl + (fun acc i => Function.update acc i ((b i - dotFn (L i) acc) / L i i)) + (fun _ => 0) + +/-- Back substitution: solve `U · x = y` for an upper-triangular `U` with nonzero diagonal. +Unknowns are visited `n-1, …, 1, 0`; when row `i` is reached `acc` holds `xᵢ₊₁ … xₙ₋₁` (and `0` +elsewhere), so `dotFn (U i) acc = Σ_{k>i} U[i,k]·xₖ` by upper-triangularity. -/ +def triSolveUpperFn {n : Nat} (U : Fin n → Fin n → α) (y : Fin n → α) : Fin n → α := + (List.finRange n).reverse.foldl + (fun acc i => Function.update acc i ((y i - dotFn (U i) acc) / U i i)) + (fun _ => 0) + +/-- Solve `A · x = b` given a Cholesky factor `L` of `A` (so `A = L · Lᵀ`): forward-solve +`L · z = b`, then back-solve `Lᵀ · x = z`. -/ +def cholSolveFn {n : Nat} (L : Fin n → Fin n → α) (b : Fin n → α) : Fin n → α := + triSolveUpperFn (fun i k => L k i) (triSolveLowerFn L b) + +/-- The regularized matrix `K + γ·I` as a function. For a symmetric PSD kernel `K` and `γ > 0` +this is symmetric positive-definite, so its Cholesky factorization succeeds. -/ +def addScaledIdFn {n : Nat} (K : Fin n → Fin n → α) (γ : α) : Fin n → Fin n → α := + fun i j => K i j + (if i = j then γ else 0) + +/-- The Tikhonov-regularized (kernel-ridge) solve `(K + γ·I)·x = b`, via the Cholesky factorization +of `K + γ·I`. This is the linear solve at the core of CHD `solve_variationnal`. -/ +def solveRidgeFn {n : Nat} (K : Fin n → Fin n → α) (γ : α) (b : Fin n → α) : Fin n → α := + cholSolveFn (choleskyFn (addScaledIdFn K γ)) b + +/-- Tensor-level kernel-ridge solve: `(K + γ·I)·x = b`. + +PyTorch analogue: `torch.linalg.solve(K + γ·I, b)` (specialized to the SPD Cholesky path). -/ +def solveRidgeSpec {n : Nat} (K : Tensor α (.dim n (.dim n .scalar))) (γ : α) + (b : Tensor α (.dim n .scalar)) : Tensor α (.dim n .scalar) := + ofVecFn (solveRidgeFn (toMatFn K) γ (toVecFn b)) + /-! 
## QR factorization (modified Gram–Schmidt) For `A : m × n`, produce `Q : m × n` with orthonormal columns and `R : n × n` upper-triangular diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index e4869b5..a50fddd 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -106,6 +106,56 @@ pivot term, and the positive-pivot hypothesis discharges the two side conditions radicand for the diagonal (`Real.mul_self_sqrt`) and a non-zero divisor for the below-diagonal entries. Symmetry of `A` extends the lower-triangular reconstruction to the whole matrix. +# Solving the regularized system: verified `solve_variationnal` + +The eigendecomposition route above gives `(K + γI)⁻¹` as an abstract identity. But CHD does not form +inverses; it *solves* the regularized system `(K + γI)·x = b`, and the SPD structure makes the direct +Cholesky route both faster and — crucially for verification — *exact*: because `K + γI` is symmetric +positive-definite, its Cholesky factorization is finite, so the whole solve carries no asymptotic +caveat. This is the second, complementary verified route to `solve_variationnal`, in +[`NN.Proofs.Tensor.Basic.FactorizationsSolve`](https://github.com/lean-dojo/TorchLean/blob/main/NN/Proofs/Tensor/Basic/FactorizationsSolve.lean). + +The solve is two triangular substitutions. Forward substitution `triSolveLowerFn` and back +substitution `triSolveUpperFn` are *exact*: for a lower- (resp. 
upper-) triangular matrix with nonzero +diagonal, + +$$`(L\,y)_i = b_i \quad\text{and}\quad (U\,x)_i = c_i \qquad\text{for every } i.` + +The key observation is that *no induction on the solved values is needed*: the entry `yᵢ` is defined +precisely to make row `i` balance, so unfolding it and using triangularity — the not-yet-visited and +structurally-zero terms drop out of the row dot product — gives the identity directly +(`triSolveLowerFn_mulVec`, `triSolveUpperFn_mulVec`). Each substitution is a `Function.update` fold +over the index list (`finRange n` forward, its reverse for back-substitution); two generic lemmas, +`foldl_update_read` and `foldl_update_stable`, capture the bookkeeping that the value written at index +`i` is never overwritten and earlier values are already in place. + +Composing them through a Cholesky factor solves the SPD system exactly (`cholSolveFn_mulVec`): + +$$`(L\,L^\top)\,x = b, \qquad x = \texttt{backSolve}\,L^\top\,(\texttt{forwardSolve}\,L\,b).` + +Specializing `L` to the Cholesky factor of `K + γI` gives `solveRidgeFn_mulVec`: if the Cholesky +pivots of `K + γI` are positive — the success condition — then `solveRidgeFn K γ b` solves +`(K + γI)·x = b` *exactly*. The `RidgeSolve` example exercises this on a rank-deficient Gram kernel +`K = G·Gᵀ`: with `γ = 0.5` the residual is zero to machine precision, while the *negative control* +`γ = 0` hits a zero pivot on the singular `K` and diverges — regularization is what makes the solve +well-posed. + +That success condition is now discharged, so the headline `solveRidgeFn_mulVec_of_posSemidef` is +*unconditional*: for a positive-semidefinite kernel `K` and `γ > 0`, `solveRidgeFn K γ b` solves +`(K + γI)·x = b` exactly with no pivot hypothesis. Two facts combine. First, `posDef_addScaledIdFn` +proves `K + γI` is positive-definite (via `Matrix.PosDef.one`, `Matrix.PosDef.smul`, +`Matrix.PosDef.posSemidef_add`) — genuinely SPD, exactly the regime where Cholesky succeeds. 
Second, +the *keystone* `choleskyFn_diag_pos_of_posDef` proves that a positive-definite matrix has +*strictly positive* executable Cholesky pivots (equivalently the radicand `A[j,j] − Σ_{k<j} L[j,k]² > 0` at each +step). The proof is the leading-principal Schur-complement fact, formalized as an *explicit +quadratic-form witness* so it needs no matrix inverse: by strong induction on `j`, the leading block +reconstructs from the pivots below `j` (`choleskyFn_dot_eq_local`), and back-substitution — the +`triSolveUpperFn` already proven correct here — produces a vector `z` with `z_j = 1` whose `A`-quadratic +form `zᵀ A z` *equals* the radicand; positive-definiteness (`Matrix.PosDef.dotProduct_mulVec_pos`) +forces `zᵀ A z > 0`. The `RidgeSolve` example also exhibits the keystone directly: the SPD `K + γI` has +all-positive pivots, while the singular `K` has a zero pivot — PosDef is necessary. Nothing here is an +unproved axiom. + # The a-posteriori residual certificate For the iterative routines, the replacement for an impossible a-priori convergence proof is an exact @@ -271,7 +321,18 @@ no cyclic-Jacobi convergence theory, so that cyclic rate remains captured by the residual certificate above — bounded numerically by the `assertLt` checks on concrete inputs — never by `sorry`; and the geometric machinery (`geom_bound_of_contraction`, `tendsto_zero_of_contraction`) is stated for an arbitrary per-step factor, ready to consume such a bound the moment it exists. + +On the *direct* solve route there is nothing left to do, because it avoids the eigensolver entirely. +The kernel-ridge solve `(K + γI)·x = b` is proved correct *exactly* (via verified forward/back +substitution and Cholesky), the regularized matrix is proved SPD for `γ > 0` (`posDef_addScaledIdFn`), +and the positive-pivot success condition is now discharged from that SPD fact by the keystone +`choleskyFn_diag_pos_of_posDef` (the radicand `A[j,j] − Σ_{k<j} L[j,k]² > 0`, proved via the explicit +Schur-complement quadratic-form witness). 
Composing them, `solveRidgeFn_mulVec_of_posSemidef` makes the +verified `solve_variationnal` *unconditional* for any positive-semidefinite kernel `K` and `γ > 0`, with +no pivot hypothesis remaining. + Everything else is exact: the algebraic faithfulness of the decomposition (orthogonality, orthogonal similarity, the residual identity, the per-rotation decrease, the classical-strategy linear rate, and -correctness in the zero-residual limit) is proved, and the specification-level facts the kernel methods -rely on are independent of the convergence step, so the CHD foundation is complete. +correctness in the zero-residual limit), the finite Cholesky/QR reconstructions, and the +Cholesky-based regularized solve are proved, and the specification-level facts the kernel methods rely +on are independent of the convergence step, so the CHD foundation is complete. From e59f536725397bb15ade6c13c83cc5c37c9d98ca Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sun, 31 May 2026 11:47:09 -0700 Subject: [PATCH 10/22] Close the kernel-ridge solve loop: SPD Cholesky capstone + inverse form MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two capstones on top of the positive-pivot keystone, making the verified solve_variationnal match the form CHD actually specifies. Proofs (NN/Proofs/Tensor/Basic/FactorizationsSolve.lean): * cholesky_posDef — for any PosDef A, the executable choleskyFn IS a genuine Cholesky factor (lower-triangular, A = L·Lᵀ, strictly positive diagonal), with no pivot/symmetry/success hypothesis. Combines the keystone choleskyFn_diag_pos_of_posDef with isCholesky_of_pos. * solveRidgeFn_eq_inv_mulVec (+ tensor-level solveRidgeSpec_eq_inv_mulVec) — solveRidgeFn K γ b = (K + γ·I)⁻¹ b, the closed form CHD solve_variationnal specifies. Invertibility from Matrix.PosDef.isUnit; the solve identity (K + γ·I)·x = b then pins x uniquely to the inverse. No inverse is ever formed by the algorithm. 
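A minimal sketch of the uniqueness step behind the inverse form, for readers of this message. It is illustrative only and not part of the patch: it is stated for a generic matrix `M` over a commutative ring `R` (both names are assumptions of the sketch) instead of the patch's `Matrix.of (Spec.addScaledIdFn K γ)`, and it assumes the same Mathlib lemmas that the proof of `solveRidgeFn_eq_inv_mulVec` in this patch uses.

```lean
import Mathlib.LinearAlgebra.Matrix.NonsingularInverse

open Matrix

-- Illustrative sketch, not part of the patch: `M`, `x`, `b`, `R` are generic placeholders.
-- If `M` has a unit determinant and `M *ᵥ x = b`, then `x` is forced to be `M⁻¹ *ᵥ b`.
example {n R : Type*} [Fintype n] [DecidableEq n] [CommRing R]
    (M : Matrix n n R) (x b : n → R)
    (hdet : IsUnit M.det) (h : M *ᵥ x = b) : x = M⁻¹ *ᵥ b := by
  calc x = (M⁻¹ * M) *ᵥ x := by
        -- `M⁻¹ * M = 1` for a unit determinant, and `1 *ᵥ x = x`
        rw [Matrix.nonsing_inv_mul M hdet, Matrix.one_mulVec]
    _ = M⁻¹ *ᵥ (M *ᵥ x) := by rw [Matrix.mulVec_mulVec]
    _ = M⁻¹ *ᵥ b := by rw [h]
```

In the patch itself, `hdet` comes from `Matrix.PosDef.isUnit` applied to `K + γ·I`, and `h` is `solveRidgeFn_mulVec_of_posSemidef`.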
Examples (NN/Examples/Factorization/RidgeSolve.lean), 8 #eval checks green: * Capstone reconstruction: SPD K + γ·I gives L·Lᵀ = K + γ·I to machine precision; negative control an indefinite matrix hits √(negative) = NaN and fails — PosDef (not mere symmetry) is necessary. (Documented subtlety: the singular PSD K still reconstructs with a zero pivot; the zero pivot breaks only the solve, the dichotomy the keystone isolates — so the reconstruction negative control uses a genuinely indefinite matrix.) * Inverse form: columns built by solveRidgeSpec K γ eⱼ assemble into (K + γ·I)⁻¹, and (K + γ·I)·(K + γ·I)⁻¹ = I; negative control γ = 0 on the singular K diverges (NaN). Docs: blueprint Ch4 Factorizations chapter gains a capstone paragraph and a closed-loop note in "What remains"; the two example module docstrings updated. sorry/admit/omega-free. Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Examples/Factorization.lean | 6 +- NN/Examples/Factorization/RidgeSolve.lean | 68 +++++++++++++++++++ .../Tensor/Basic/FactorizationsSolve.lean | 60 ++++++++++++++++ .../Ch4_Verification/Factorizations.lean | 26 ++++++- 4 files changed, 158 insertions(+), 2 deletions(-) diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean index 2e4d60d..79dc2e1 100644 --- a/NN/Examples/Factorization.lean +++ b/NN/Examples/Factorization.lean @@ -46,7 +46,11 @@ factorization misbehaves. `K = G·Gᵀ` and `γ > 0`, `solveRidgeFn` reconstructs `b` to machine precision; **negative control**: with `γ = 0` the singular `K` has a zero Cholesky pivot and the solve diverges (`NaN`), so regularization is necessary. Also exhibits the **keystone** `choleskyFn_diag_pos_of_posDef`: the SPD - `K + γ·I` has all-positive Cholesky pivots, while the singular `K` has a zero pivot (PosDef needed). 
+ `K + γ·I` has all-positive Cholesky pivots, while the singular `K` has a zero pivot (PosDef needed); + and the two **capstones** — `cholesky_posDef` (the SPD Cholesky reconstructs `L·Lᵀ = K + γ·I` + exactly, while an *indefinite* matrix fails with a `NaN` pivot) and `solveRidgeFn_eq_inv_mulVec` (the + solve *is* the regularized inverse: its columns assemble into `(K + γ·I)⁻¹` with + `(K + γ·I)·(K + γ·I)⁻¹ = I`). Both **positive** checks (a valid factorization reconstructs to `err ≈ 0`) and **negative controls** (the same metric reports a large error / `NaN` when a hypothesis is violated) are included, so a diff --git a/NN/Examples/Factorization/RidgeSolve.lean b/NN/Examples/Factorization/RidgeSolve.lean index 6fd9ab2..2eeb5ff 100644 --- a/NN/Examples/Factorization/RidgeSolve.lean +++ b/NN/Examples/Factorization/RidgeSolve.lean @@ -33,6 +33,16 @@ setting CHD targets — so it is *not* invertible on its own. The checks exhibit * **Negative — regularization is necessary.** With `γ = 0` the singular `K` has a zero Cholesky pivot: forward/back substitution divides by zero and the residual blows up (`NaN`/large). This is why CHD regularizes; it is also exactly the `γ > 0` hypothesis of `posDef_addScaledIdFn`. + +It then exercises the two capstone theorems that close the solve story: + +* `cholesky_posDef` — for the SPD `K + γ·I` the executable Cholesky reconstructs *exactly* + (`L · Lᵀ = K + γ·I`); an *indefinite* matrix instead gets a `√(negative) = NaN` pivot and fails, so + positive-definiteness is what the capstone needs. +* `solveRidgeFn_eq_inv_mulVec` — `solveRidgeFn K γ b = (K + γ·I)⁻¹ b`, the closed form CHD + `solve_variationnal` specifies. Solving against each basis vector builds the columns of the inverse, + and the assembled matrix satisfies `(K + γ·I) · (K + γ·I)⁻¹ = I` — no inverse is ever formed by the + algorithm; every column is a verified Cholesky solve. 
-/ @[expose] public section @@ -126,4 +136,62 @@ def Kγ : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := addGammaI K γ #eval assertGe "singular K has a non-positive Cholesky pivot (PosDef necessary)" (numNonPosPivots K) 0.5 +/-! ## Capstone: the SPD Cholesky reconstructs exactly + +`Spec.Factorization.Reconstruction.cholesky_posDef` bundles the keystone with the reconstruction +theorem: for the *positive-definite* `K + γ·I`, the executable Cholesky factor is a genuine factor — +`L · Lᵀ = K + γ·I` exactly — with no pivot or symmetry hypothesis. The negative control is an +*indefinite* symmetric matrix: there a radicand goes negative, the pivot is `√(negative) = NaN`, and +reconstruction fails — so positive-definiteness (not mere symmetry) is what the capstone needs. (Note +the singular `K` itself, being PSD, *does* reconstruct with a zero pivot; the zero pivot breaks only +the *solve*, which is the dichotomy the keystone above isolates.) -/ + +/-- An indefinite symmetric matrix (top-left block has eigenvalues `3, −1`): not PosDef, so its +Cholesky hits `√(negative)`. -/ +def Aindef : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := + mkMat [[1, 2, 0], + [2, 1, 0], + [0, 0, 1]] + +-- Positive — the SPD `K + γ·I` Cholesky reconstructs exactly: `L · Lᵀ = K + γ·I` (`cholesky_posDef`). +#eval assertLt "SPD Cholesky reconstructs: L·Lᵀ = K + γ·I (capstone)" + (frobSqErr (let L := Spec.choleskySpec Kγ; mm L (tr L)) Kγ) + +-- Negative — an indefinite matrix gets a `√(negative) = NaN` pivot, so it does not reconstruct. +#eval assertReconFails "indefinite matrix Cholesky does not reconstruct (PosDef necessary)" + (frobSqErr (let L := Spec.choleskySpec Aindef; mm L (tr L)) Aindef) + +/-! ## Closing the loop: the ridge solve *is* the regularized inverse + +`Spec.Factorization.Reconstruction.solveRidgeFn_eq_inv_mulVec` proves `solveRidgeFn K γ b += (K + γ·I)⁻¹ b` — the closed form CHD `solve_variationnal` specifies. 
Solving against each standard +basis vector `eⱼ` therefore produces column `j` of `(K + γ·I)⁻¹`; assembling the columns gives a +genuine inverse, witnessed by `(K + γ·I) · (K + γ·I)⁻¹ = I`. No matrix inverse is formed by the +algorithm — every column comes from the verified Cholesky solve. -/ + +/-- The `j`-th standard basis vector. -/ +def unitVec {k : Nat} (j : Fin k) : Spec.Tensor Float (.dim k .scalar) := + Spec.ofVecFn (fun i => if i = j then 1.0 else 0.0) + +/-- The `k × k` identity matrix. -/ +def idMat {k : Nat} : Spec.Tensor Float (.dim k (.dim k .scalar)) := + Spec.ofMatFn (fun i j => if i = j then 1.0 else 0.0) + +/-- The regularized inverse `(K + γ·I)⁻¹`, built column-by-column by the verified ridge solve: column +`j` is `solveRidgeSpec K γ eⱼ` (an instance of `solveRidgeFn_eq_inv_mulVec`). -/ +def ridgeInv {k : Nat} (K : Spec.Tensor Float (.dim k (.dim k .scalar))) (γ : Float) : + Spec.Tensor Float (.dim k (.dim k .scalar)) := + Spec.ofMatFn (fun i j => Spec.Tensor.toScalar (Spec.get (Spec.solveRidgeSpec K γ (unitVec j)) i)) + +#eval IO.println s!"(K+γI)⁻¹ diagonal (assembled from ridge solves): \ + {vecToList (diagOf (ridgeInv K γ))}" + +-- Positive — the assembled inverse really inverts: `(K + γ·I) · (K + γ·I)⁻¹ = I`. +#eval assertLt "ridge solve builds the regularized inverse: (K+γI)·(K+γI)⁻¹ = I" + (frobSqErr (mm Kγ (ridgeInv K γ)) idMat) + +-- Negative — with `γ = 0` the singular `K` has no inverse: the column solves diverge (NaN). 
+#eval assertReconFails "unregularized singular K has no inverse (γ = 0 → solve diverges)" + (frobSqErr (mm K (ridgeInv K 0.0)) idMat) + end NN.Examples.Factorization.RidgeSolve diff --git a/NN/Proofs/Tensor/Basic/FactorizationsSolve.lean b/NN/Proofs/Tensor/Basic/FactorizationsSolve.lean index b5edb94..86fa1d9 100644 --- a/NN/Proofs/Tensor/Basic/FactorizationsSolve.lean +++ b/NN/Proofs/Tensor/Basic/FactorizationsSolve.lean @@ -611,6 +611,28 @@ theorem choleskyFn_diag_pos_of_posDef (A : Fin n → Fin n → ℝ) (hpd : (Matr rw [hqf_eq_rad] at hpos exact hpos +/-! ## Capstone: the executable Cholesky *is* the factorization of any SPD matrix + +Combining the keystone with the reconstruction theorem proved in `FactorizationsReconstruction`, the +executable `choleskyFn` is — with *no* hypothesis beyond positive-definiteness — a genuine Cholesky +factor of any SPD matrix: lower-triangular, with `A = L · Lᵀ`, and strictly positive diagonal. This is +the unconditional statement "`choleskyFn` computes the Cholesky factorization of an SPD matrix". -/ + +/-- **The executable Cholesky factorization of an SPD matrix.** For a positive-definite `A`, +`choleskyFn A` is a genuine Cholesky factor of `A` (lower-triangular, `A = L · Lᵀ`) with strictly +positive diagonal — no pivot, symmetry, or success hypothesis. The positivity of the pivots is the +keystone `choleskyFn_diag_pos_of_posDef`; the factorization identity is `isCholesky_of_pos` fed by it. -/ +theorem cholesky_posDef (A : Fin n → Fin n → ℝ) (hpd : (Matrix.of A).PosDef) : + Spec.Factorization.IsCholesky (Matrix.of A) (Matrix.of (Spec.choleskyFn A)) + ∧ ∀ j, 0 < Spec.choleskyFn A j j := by + have hsymm : ∀ i j, A i j = A j i := by + intro i j + have h := hpd.1.apply i j + simp only [Matrix.of_apply, star_trivial] at h + exact h.symm + have hpos : ∀ j, 0 < Spec.choleskyFn A j j := fun j => choleskyFn_diag_pos_of_posDef A hpd j + exact ⟨isCholesky_of_pos A hsymm hpos, hpos⟩ + /-! 
## The kernel-ridge solve, unconditional for SPD inputs -/ /-- **Kernel-ridge solve, unconditional for an SPD regularized system.** For a positive-semidefinite @@ -641,3 +663,41 @@ theorem solveRidgeSpec_mulVec_of_posSemidef (K : Spec.Tensor ℝ (.dim n (.dim n funext i; rfl rw [hround] exact solveRidgeFn_mulVec_of_posSemidef (Spec.toMatFn K) γ (Spec.toVecFn b) hK hγ + +/-! ## Closing the loop: the ridge solve *is* the regularized inverse + +CHD `solve_variationnal` is specified as `x = (K + γ·I)⁻¹ b`. The solve theorems above prove +`(K + γ·I)·x = b`; positive-definiteness makes `K + γ·I` invertible, so that equation pins `x` down +*uniquely* — and identifies the computed `solveRidgeFn` with the closed form `(K + γ·I)⁻¹ b`. This is +the exact statement CHD consumes, with no inverse ever formed by the algorithm itself. -/ + +/-- **The ridge solve equals the regularized inverse applied to `b`.** For a positive-semidefinite +kernel `K` and `γ > 0`, the computed `solveRidgeFn K γ b` is exactly `(K + γ·I)⁻¹ b` — the closed form +CHD `solve_variationnal` specifies. Invertibility comes from `posDef_addScaledIdFn` (PosDef ⟹ unit), +and the solve identity `solveRidgeFn_mulVec_of_posSemidef` then forces equality with the inverse. 
-/ +theorem solveRidgeFn_eq_inv_mulVec (K : Fin n → Fin n → ℝ) (γ : ℝ) (b : Fin n → ℝ) + (hK : (Matrix.of K).PosSemidef) (hγ : 0 < γ) : + Spec.solveRidgeFn K γ b = (Matrix.of (Spec.addScaledIdFn K γ))⁻¹ *ᵥ b := by + set M := Matrix.of (Spec.addScaledIdFn K γ) with hM + have hpd : M.PosDef := posDef_addScaledIdFn hK hγ + have hdet : IsUnit M.det := (Matrix.isUnit_iff_isUnit_det (A := M)).mp hpd.isUnit + have hsolve : M *ᵥ (Spec.solveRidgeFn K γ b) = b := + solveRidgeFn_mulVec_of_posSemidef K γ b hK hγ + calc Spec.solveRidgeFn K γ b + = (M⁻¹ * M) *ᵥ (Spec.solveRidgeFn K γ b) := by + rw [Matrix.nonsing_inv_mul M hdet, Matrix.one_mulVec] + _ = M⁻¹ *ᵥ (M *ᵥ (Spec.solveRidgeFn K γ b)) := by rw [Matrix.mulVec_mulVec] + _ = M⁻¹ *ᵥ b := by rw [hsolve] + +/-- **Tensor-level: the ridge solve equals the regularized inverse.** For a tensor kernel `K` whose +matrix view is positive-semidefinite and `γ > 0`, `solveRidgeSpec K γ b` is the regularized inverse +`(K + γ·I)⁻¹` applied to `b`. -/ +theorem solveRidgeSpec_eq_inv_mulVec (K : Spec.Tensor ℝ (.dim n (.dim n .scalar))) (γ : ℝ) + (b : Spec.Tensor ℝ (.dim n .scalar)) (hK : (Matrix.of (Spec.toMatFn K)).PosSemidef) (hγ : 0 < γ) : + Spec.toVecFn (Spec.solveRidgeSpec K γ b) + = (Matrix.of (Spec.addScaledIdFn (Spec.toMatFn K) γ))⁻¹ *ᵥ Spec.toVecFn b := by + have hround : Spec.toVecFn (Spec.solveRidgeSpec K γ b) + = Spec.solveRidgeFn (Spec.toMatFn K) γ (Spec.toVecFn b) := by + funext i; rfl + rw [hround] + exact solveRidgeFn_eq_inv_mulVec (Spec.toMatFn K) γ (Spec.toVecFn b) hK hγ diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index a50fddd..d60d5a0 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -156,6 +156,27 @@ forces `zᵀ A z > 0`. 
The `RidgeSolve` example also exhibits the keystone direc all-positive pivots, while the singular `K` has a zero pivot — PosDef is necessary. Nothing here is an unproved axiom. +Two capstones close the solve story. First, the keystone and the reconstruction theorem combine into +`cholesky_posDef`: for *any* positive-definite `A`, the executable `choleskyFn` is — with no pivot, +symmetry, or success hypothesis — a genuine Cholesky factor (`A = L · Lᵀ`, lower-triangular, strictly +positive diagonal). This is the unconditional statement "`choleskyFn` computes the Cholesky +factorization of an SPD matrix". The `RidgeSolve` example exhibits both directions: the SPD `K + γI` +reconstructs to machine precision, while an *indefinite* matrix hits a `√(negative) = NaN` pivot and +fails — positive-definiteness, not mere symmetry, is the hypothesis the capstone needs. (A singular +PSD `K` still reconstructs, with a zero pivot; the zero pivot breaks only the *solve*, which is exactly +the dichotomy the keystone isolates.) + +Second, `solveRidgeFn_eq_inv_mulVec` identifies the computed solve with the closed form CHD specifies: + +$$`\texttt{solveRidgeFn}\,K\,\gamma\,b \;=\; (K + \gamma I)^{-1} b.` + +The solve theorems prove `(K + γI)·x = b`; positive-definiteness makes `K + γI` invertible +(`Matrix.PosDef.isUnit`), so that equation pins `x` down *uniquely* and forces equality with the +inverse — closing the loop to `solve_variationnal`'s `(K + γI)⁻¹ b` *without the algorithm ever forming +an inverse*. The `RidgeSolve` example makes this concrete: solving against each standard basis vector +`eⱼ` produces column `j` of `(K + γI)⁻¹`, and the assembled matrix satisfies +`(K + γI) · (K + γI)⁻¹ = I` to machine precision, every column coming from the verified Cholesky solve. 
+ # The a-posteriori residual certificate For the iterative routines, the replacement for an impossible a-priori convergence proof is an exact @@ -329,7 +350,10 @@ and the positive-pivot success condition is now discharged from that SPD fact by `choleskyFn_diag_pos_of_posDef` (the radicand `A[j,j] − Σ_{k<j} L[j,k]² > 0`, proved via the explicit Schur-complement quadratic-form witness). Composing them, `solveRidgeFn_mulVec_of_posSemidef` makes the verified `solve_variationnal` *unconditional* for any positive-semidefinite kernel `K` and `γ > 0`, with -no pivot hypothesis remaining. +no pivot hypothesis remaining. The loop to the CHD specification is closed by +`solveRidgeFn_eq_inv_mulVec`, which upgrades the solve identity `(K + γI)·x = b` to the closed form +`x = (K + γI)⁻¹ b` (uniqueness from invertibility), and by `cholesky_posDef`, which states +unconditionally that the executable Cholesky *is* the factorization of any SPD matrix. Everything else is exact: the algebraic faithfulness of the decomposition (orthogonality, orthogonal similarity, the residual identity, the per-rotation decrease, the classical-strategy linear rate, and From a79b972c09f2e09027c39d4ee5db8bd605608ba5 Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sun, 31 May 2026 12:19:05 -0700 Subject: [PATCH 11/22] Identify the CHD eig-form routines: solve_variationnal, find_gamma, Z_test MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Tier A's predicate foundation (IsSymEig, add_smul_inv, trace_eq, det_eq) already existed; this adds the three concrete CHD routines built on it, mirroring interpolatory.py's eigendecomposition route, with their exact algebra proved over ℝ from the IsSymEig specification (no appeal to Jacobi convergence). Spec (NN/Spec/Core/Tensor/Factorizations.lean): executable eig-form mirrors — projFn (Pga = Vᵀga), ridgeCoeffFn (rᵢ = γ/(λᵢ+γ)), variationalSolveFn, varNoiseFn, plus tensor wrappers variationalSolveSpec / varNoiseSpec. 
Proofs (new NN/Proofs/Tensor/Basic/FactorizationsVariational.lean, sorry/omega-free): - variationalSolveFn_eq_neg_inv_mulVec: the eig-form solve_variationnal IS -(K+γI)⁻¹·ga (from add_smul_inv). - variationalSolveFn_eq_neg_solveRidgeFn: eig route = Cholesky route — two independent implementations of solve_variationnal agree on the one closed form. - IsSymEig.eigenvalues_nonneg: PSD ⟹ λ ≥ 0 (via VᵀAV PSD-congruence), discharging λᵢ+γ ≠ 0 from γ > 0. - varNoiseFn_eq_ratio: the noise / find_gamma loss / Z_test statistic is the spectral ratio Σ(Pgaᵢ·rᵢ)² / Σ Pgaᵢ²·rᵢ. - ridgeCoeffFn_pos/le_one, varNoiseFn_nonneg/le_one: for a PSD spectrum and γ > 0 the noise is a genuine fraction in [0,1]. - projFn_mulVec_self, varNoiseFn_projFn_mulVec: Z_test spectral invariance — feeding ga = V·z drops V, so the statistic depends on the kernel only through its spectrum. Examples (new NN/Examples/Factorization/Variational.lean): 8 green #eval checks on an SPD kernel — (K+γI)·yb = -ga and yb = -solveRidgeSpec to machine precision, noise ∈ [0,1], spectral invariance noise(V·z)=noise(z); negative controls: wrong eigenvectors break the solve (residual 3.72), γ = -0.7 pushes noise to -7.19. Blueprint: new "CHD routines" section + updated "What remains". Scope honesty: only the deterministic algebra is proved; Z_test's Gaussian sampling/percentiles are statistical, not algebraic, and are exercised numerically rather than proved. 
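
For orientation, the eig-form core described above can be sketched in plain Lean 4 with no TorchLean or Mathlib imports. This is a hypothetical standalone mirror, not the code in NN/Spec/Core/Tensor/Factorizations.lean: the names dot, proj, ridgeCoeff, variationalSolve, varNoise and residual are illustrative, V is stored as a list of its columns, and the eigenpairs of the example kernel [[2,1,0],[1,2,1],[0,1,2]] are entered in closed form (2 − √2, 2, 2 + √2) rather than computed by Jacobi.

```lean
-- Standalone sketch: core Lean 4 only (no TorchLean, no Mathlib).
-- Every name in this namespace is a hypothetical stand-in, not the TorchLean API.
namespace EigSketch

/-- Dot product of two equal-length lists. -/
def dot (u v : List Float) : Float :=
  (List.zipWith (fun a b => a * b) u v).foldl (fun acc x => acc + x) 0.0

/-- `Pga = Vᵀ·ga`, with `V` stored as the list of its columns (the eigenvectors). -/
def proj (cols : List (List Float)) (ga : List Float) : List Float :=
  cols.map (fun v => dot v ga)

/-- Shrinkage coefficients `rᵢ = γ/(λᵢ + γ)`. -/
def ridgeCoeff (evals : List Float) (γ : Float) : List Float :=
  evals.map (fun ev => γ / (ev + γ))

/-- `yb = -V·(Pga/(λ + γ))`, accumulated column by column as `-Σⱼ cⱼ·vⱼ`. -/
def variationalSolve (evals : List Float) (cols : List (List Float)) (γ : Float)
    (ga : List Float) : List Float :=
  let c := List.zipWith (fun p ev => p / (ev + γ)) (proj cols ga) evals
  (List.zip cols c).foldl
    (fun (acc : List Float) (vc : List Float × Float) =>
      List.zipWith (fun a x => a - vc.2 * x) acc vc.1)
    (ga.map (fun _ => (0.0 : Float)))

/-- `noise = (pc·pc)/(pc·Pga)` with `pc = Pga * r` entrywise. -/
def varNoise (evals : List Float) (γ : Float) (pga : List Float) : Float :=
  let pc := List.zipWith (fun p r => p * r) pga (ridgeCoeff evals γ)
  dot pc pc / dot pc pga

/-- Residual `(K + γ·I)·yb + ga`, which the solve identity predicts to be zero. -/
def residual (kernel : List (List Float)) (γ : Float) (ga yb : List Float) : List Float :=
  List.zipWith (fun g p => p + g) ga
    (List.zipWith (fun row y => dot row yb + γ * y) kernel yb)

-- Example data: the 3×3 tridiagonal kernel and its closed-form eigenpairs.
def s : Float := Float.sqrt 2.0
def kernel : List (List Float) := [[2, 1, 0], [1, 2, 1], [0, 1, 2]]
def evals : List Float := [2.0 - s, 2.0, 2.0 + s]
def cols : List (List Float) :=
  [[0.5, -s / 2.0, 0.5], [1.0 / s, 0.0, -1.0 / s], [0.5, s / 2.0, 0.5]]
def ga : List Float := [1, 2, 3]

#eval residual kernel 0.5 ga (variationalSolve evals cols 0.5 ga)
#eval varNoise evals 0.5 (proj cols ga)

end EigSketch
```

If the sketch is faithful, the first #eval should print entries near zero (up to rounding) and the second a value between 0 and 1. The #eval checks in NN/Examples/Factorization/Variational.lean remain the authoritative ones.
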
Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Examples/Factorization.lean | 8 + NN/Examples/Factorization/Variational.lean | 162 ++++++++++++++ NN/Proofs/Tensor/Basic.lean | 1 + .../Basic/FactorizationsVariational.lean | 200 ++++++++++++++++++ NN/Spec/Core/Tensor/Factorizations.lean | 54 +++++ .../Ch4_Verification/Factorizations.lean | 57 ++++- 6 files changed, 481 insertions(+), 1 deletion(-) create mode 100644 NN/Examples/Factorization/Variational.lean create mode 100644 NN/Proofs/Tensor/Basic/FactorizationsVariational.lean diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean index 79dc2e1..3d06494 100644 --- a/NN/Examples/Factorization.lean +++ b/NN/Examples/Factorization.lean @@ -14,6 +14,7 @@ public import NN.Examples.Factorization.SVD public import NN.Examples.Factorization.JacobiDecrease public import NN.Examples.Factorization.JacobiRate public import NN.Examples.Factorization.RidgeSolve +public import NN.Examples.Factorization.Variational /-! # Matrix factorization examples @@ -51,6 +52,13 @@ factorization misbehaves. exactly, while an *indefinite* matrix fails with a `NaN` pivot) and `solveRidgeFn_eq_inv_mulVec` (the solve *is* the regularized inverse: its columns assemble into `(K + γ·I)⁻¹` with `(K + γ·I)·(K + γ·I)⁻¹ = I`). +- `Variational` — the *eigendecomposition* form of CHD `perform_regression_and_find_gamma` + (`interpolatory.py`): from `eigh(K)`, the variational solve `yb = -(K + γ·I)⁻¹·ga`, the agreement of + the eig and Cholesky routes (`variationalSolveFn_eq_neg_solveRidgeFn`), the + `noise`/`find_gamma`-loss/`Z_test` statistic as a spectral ratio bounded in `[0,1]` + (`varNoiseFn_nonneg`, `varNoiseFn_le_one`), and `Z_test` spectral invariance + (`varNoiseFn_projFn_mulVec`); **negative controls**: wrong eigenvectors break the solve, and `γ < 0` + pushes the noise outside `[0,1]`. 
Both **positive** checks (a valid factorization reconstructs to `err ≈ 0`) and **negative controls** (the same metric reports a large error / `NaN` when a hypothesis is violated) are included, so a diff --git a/NN/Examples/Factorization/Variational.lean b/NN/Examples/Factorization/Variational.lean new file mode 100644 index 0000000..83f6e44 --- /dev/null +++ b/NN/Examples/Factorization/Variational.lean @@ -0,0 +1,162 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Examples.Factorization.Common +meta import NN.Examples.Factorization.Common + +/-! +# Example: the CHD variational solve, noise, and `Z_test` statistic (eigendecomposition form) + +These checks corroborate `NN.Proofs.Tensor.Basic.FactorizationsVariational`, the eigendecomposition +route CHD's `perform_regression_and_find_gamma` actually takes (`interpolatory.py`). From `eigh(K)` it +forms the projected data `Pga = Vᵀ·ga` and shrinkage coefficients `rᵢ = γ/(λᵢ+γ)`, then runs three +routines off that shared core. We exercise each: + +* **The variational solve is the regularized inverse.** `variationalSolveSpec` returns + `yb = -(K+γ·I)⁻¹·ga`, so `(K+γ·I)·yb = -ga` to machine precision (`variationalSolveFn_eq_neg_inv_mulVec`). +* **Eig route = Cholesky route.** The same `yb` equals `-solveRidgeSpec K γ ga` — the verified Cholesky + solve from `FactorizationsSolve` — to machine precision (`variationalSolveFn_eq_neg_solveRidgeFn`): + two independent implementations, one closed form. +* **The noise is a fraction.** `varNoiseSpec` (the `noise`, the `find_gamma` loss, the `Z_test` + statistic) lies in `[0,1]` (`varNoiseFn_nonneg`, `varNoiseFn_le_one`). +* **`Z_test` spectral invariance.** Feeding `ga = V·z` makes `V` drop out: the noise of `V·z` under `V` + equals the noise of `z` under the identity (`varNoiseFn_projFn_mulVec`) — the statistic depends on + the kernel only through its eigenvalues. 
+ +Negative controls give the metrics teeth: + +* feeding the **wrong** eigenvectors (the identity instead of the true `V`) breaks the solve — the + residual `(K+γ·I)·yb + ga` is large, so the *actual* eigendecomposition is needed; +* with **`γ < 0`** the shrinkage coefficients leave `(0,1]` and the noise falls outside `[0,1]`, so + `γ > 0` is necessary for the bound. +-/ + +@[expose] public section + + +namespace NN.Examples.Factorization.Variational + +/-- Build a length-`n` `Float` vector from a list (missing entries `0`). -/ +def mkVec {n : Nat} (xs : List Float) : Spec.Tensor Float (.dim n .scalar) := + Spec.ofVecFn (fun i => xs.getD i.val 0.0) + +/-- The regularized matrix `K + γ·I` as a tensor. -/ +def addGammaI {n : Nat} (K : Spec.Tensor Float (.dim n (.dim n .scalar))) (γ : Float) : + Spec.Tensor Float (.dim n (.dim n .scalar)) := + Spec.ofMatFn (fun i j => Spec.get2 K i j + (if i.val == j.val then γ else 0.0)) + +/-- Matrix–vector product `M · v`. -/ +def mv {n : Nat} (M : Spec.Tensor Float (.dim n (.dim n .scalar))) + (v : Spec.Tensor Float (.dim n .scalar)) : Spec.Tensor Float (.dim n .scalar) := + Spec.matVecMulSpec M v + +/-- Entrywise negation of a vector. -/ +def negVec {n : Nat} (v : Spec.Tensor Float (.dim n .scalar)) : Spec.Tensor Float (.dim n .scalar) := + Spec.ofVecFn (fun i => 0.0 - Spec.Tensor.toScalar (Spec.get v i)) + +/-- `ℓ¹` magnitude `Σᵢ |vᵢ|` (a sum, so a `NaN` entry propagates instead of being dropped). -/ +def vecAbsErr {n : Nat} (v : Spec.Tensor Float (.dim n .scalar)) : Float := + (List.finRange n).foldl (fun a i => a + Float.abs (Spec.Tensor.toScalar (Spec.get v i))) 0.0 + +/-- `ℓ¹` distance `Σᵢ |uᵢ − vᵢ|` between two vectors. -/ +def vecDist {n : Nat} (u v : Spec.Tensor Float (.dim n .scalar)) : Float := + (List.finRange n).foldl + (fun a i => a + Float.abs (Spec.Tensor.toScalar (Spec.get u i) + - Spec.Tensor.toScalar (Spec.get v i))) 0.0 + +/-- The `k × k` identity matrix. 
-/ +def idMat {k : Nat} : Spec.Tensor Float (.dim k (.dim k .scalar)) := + Spec.ofMatFn (fun i j => if i = j then 1.0 else 0.0) + +/-- A symmetric positive-definite kernel (eigenvalues ≈ {0.5858, 2, 3.4142}). -/ +def K : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := + mkMat [[2, 1, 0], + [1, 2, 1], + [0, 1, 2]] + +def γ : Float := 0.5 +def ga : Spec.Tensor Float (.dim 3 .scalar) := mkVec [1, 2, 3] + +/-- Eigendecomposition `K = V·diag(λ)·Vᵀ` via cyclic Jacobi (12 sweeps). -/ +def eig : Spec.Tensor Float (.dim 3 .scalar) × Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := + Spec.symEigJacobiSpec K 12 +def evals : Spec.Tensor Float (.dim 3 .scalar) := eig.1 +def V : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := eig.2 + +/-- The variational solution `yb = -(K + γ·I)⁻¹·ga` (eigendecomposition form). -/ +def yb : Spec.Tensor Float (.dim 3 .scalar) := Spec.variationalSolveSpec evals V γ ga + +#eval IO.println s!"eigenvalues λ = {vecToList evals}; γ = {γ}; ga = {vecToList ga}" +#eval IO.println s!"variational solution yb = {vecToList yb}" +#eval IO.println s!"(K+γI)·yb + ga = {vecToList (Spec.ofVecFn (fun i => + Spec.Tensor.toScalar (Spec.get (mv (addGammaI K γ) yb) i) + + Spec.Tensor.toScalar (Spec.get ga i)))}" + +/-! ## The variational solve is the regularized-inverse solve -/ + +-- Positive — `yb = -(K+γI)⁻¹·ga`, so `(K+γI)·yb = -ga`, i.e. `(K+γI)·yb + ga ≈ 0`. +#eval assertLt "variational solve: (K+γI)·yb = -ga to machine precision" + (vecAbsErr (Spec.ofVecFn (fun i => + Spec.Tensor.toScalar (Spec.get (mv (addGammaI K γ) yb) i) + + Spec.Tensor.toScalar (Spec.get ga i)))) + +-- Positive — eig route = Cholesky route: `yb = -solveRidgeSpec K γ ga` to machine precision. +#eval assertLt "eig-form solve = -(Cholesky ridge solve) (two implementations agree)" + (vecDist yb (negVec (Spec.solveRidgeSpec K γ ga))) + +/-! ## The noise level is a fraction in `[0,1]` -/ + +/-- The CHD `noise` / `find_gamma` loss / `Z_test` statistic at this `(K, γ, ga)`. 
-/ +def noise : Float := Spec.varNoiseSpec evals V γ ga + +#eval IO.println s!"noise level = {noise}" + +-- Positive — `noise ≤ 1` (err = noise − 1 < tol ⟺ noise < 1 + tol). +#eval assertLt "noise ≤ 1 (find_gamma loss is a fraction)" (noise - 1.0) +-- Positive — `0 ≤ noise` (err = −noise < tol ⟺ noise > −tol). +#eval assertLt "0 ≤ noise" (0.0 - noise) + +/-! ## `Z_test` spectral invariance: feeding `ga = V·z` drops `V` -/ + +def z : Spec.Tensor Float (.dim 3 .scalar) := mkVec [0.7, -1.3, 2.1] +/-- Data expressed in eigencoordinates: `ga = V·z`. -/ +def gaVz : Spec.Tensor Float (.dim 3 .scalar) := mv V z + +#eval IO.println s!"noise(V·z under V) = {Spec.varNoiseSpec evals V γ gaVz}; \ + noise(z under I) = {Spec.varNoiseSpec evals idMat γ z}" + +-- Positive — `noise` of `V·z` under `V` equals `noise` of `z` under the identity (spectral only). +#eval assertApproxEq "Z_test statistic depends only on the spectrum (ga = V·z ⟹ V drops out)" + (Spec.varNoiseSpec evals V γ gaVz) (Spec.varNoiseSpec evals idMat γ z) + +/-! ## Negative controls -/ + +/-- The solve fed the **wrong** eigenvectors (identity instead of the true `V`). -/ +def ybWrong : Spec.Tensor Float (.dim 3 .scalar) := Spec.variationalSolveSpec evals idMat γ ga + +#eval IO.println s!"wrong-V residual (K+γI)·ybWrong + ga = {vecToList (Spec.ofVecFn (fun i => + Spec.Tensor.toScalar (Spec.get (mv (addGammaI K γ) ybWrong) i) + + Spec.Tensor.toScalar (Spec.get ga i)))}" + +-- Negative — with the wrong eigenvectors the solve no longer inverts: the residual is large. +#eval assertGe "wrong eigenvectors break the solve (true eigendecomposition needed)" + (vecAbsErr (Spec.ofVecFn (fun i => + Spec.Tensor.toScalar (Spec.get (mv (addGammaI K γ) ybWrong) i) + + Spec.Tensor.toScalar (Spec.get ga i)))) 0.5 + +/-- The noise level computed with `γ < 0` (here `γ = -0.7`, below the smallest eigenvalue ≈ 0.586, so a +shrinkage coefficient `rᵢ = γ/(λᵢ+γ)` leaves `(0,1]`). 
-/ +def noiseNeg : Float := Spec.varNoiseSpec evals V (-0.7) ga + +#eval IO.println s!"noise with γ = -0.7 (outside [0,1]) = {noiseNeg}" + +-- Negative — with `γ < 0` the noise falls outside `[0,1]`, so `γ > 0` is necessary for the bound. +#eval assertGe "γ < 0 pushes noise outside [0,1] (γ > 0 necessary)" + (max (0.0 - noiseNeg) (noiseNeg - 1.0)) 0.01 + +end NN.Examples.Factorization.Variational diff --git a/NN/Proofs/Tensor/Basic.lean b/NN/Proofs/Tensor/Basic.lean index efaa964..aecf544 100644 --- a/NN/Proofs/Tensor/Basic.lean +++ b/NN/Proofs/Tensor/Basic.lean @@ -12,6 +12,7 @@ public import NN.Proofs.Tensor.Basic.LinearAlgebra public import NN.Proofs.Tensor.Basic.Factorizations public import NN.Proofs.Tensor.Basic.FactorizationsReconstruction public import NN.Proofs.Tensor.Basic.FactorizationsSolve +public import NN.Proofs.Tensor.Basic.FactorizationsVariational public import NN.Proofs.Tensor.Basic.FactorizationsOrthonormal public import NN.Proofs.Tensor.Basic.FactorizationsJacobi public import NN.Proofs.Tensor.Basic.FactorizationsJacobiDecrease diff --git a/NN/Proofs/Tensor/Basic/FactorizationsVariational.lean b/NN/Proofs/Tensor/Basic/FactorizationsVariational.lean new file mode 100644 index 0000000..61facf0 --- /dev/null +++ b/NN/Proofs/Tensor/Basic/FactorizationsVariational.lean @@ -0,0 +1,200 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Proofs.Tensor.Basic.Factorizations +public import NN.Proofs.Tensor.Basic.FactorizationsSolve + +/-! +# CHD `solve_variationnal`, `find_gamma`, and `Z_test` (eigendecomposition form) + +[`Factorizations`](./Factorizations.lean) proved the *predicate-level* spectral facts CHD consumes — +the regularized inverse `(K + γ·I)⁻¹ = V·diag(1/(λ+γ))·Vᵀ` (`IsSymEig.add_smul_inv`), the trace/det +sums, and the SVD ⟹ Gram-eigendecomposition bridge. 
This file closes the gap up to the three concrete +CHD routines built on those facts (`interpolatory.py`): `solve_variationnal`, `find_gamma`, `Z_test`. + +All three are computed from `eigh(K)` and share one arithmetic core: the projected data +`Pga = Vᵀ·ga` and the shrinkage coefficients `rᵢ = γ/(λᵢ + γ)`. The executable definitions +(`Spec.variationalSolveFn`, `Spec.varNoiseFn`, …) mirror `interpolatory.py` verbatim. The theorems +here identify what they compute: + +* **`variationalSolveFn_eq_neg_inv_mulVec`** — the eigendecomposition-form variational solution + `yb = -V·(Pga/(λ+γ))` *is* the regularized-inverse solve `-(K + γ·I)⁻¹·ga` (from `add_smul_inv`). +* **`variationalSolveFn_eq_neg_solveRidgeFn`** — hence the eig route and the *Cholesky* route + (`solveRidgeFn`, verified in `FactorizationsSolve`) compute the **same** `solve_variationnal`, up to + CHD's sign convention. Two independent implementations, one closed form. +* **`varNoiseFn_eq_ratio`** — the `noise` level (= the `find_gamma` loss = the `Z_test` per-sample + statistic) is the spectral ratio `Σ (Pgaᵢ·rᵢ)² / Σ Pgaᵢ²·rᵢ`. +* **`varNoiseFn_nonneg` / `varNoiseFn_le_one`** — for a PSD spectrum (`λᵢ ≥ 0`) and `γ > 0` the noise + lies in `[0, 1]`, because each shrinkage coefficient does (`ridgeCoeffFn_pos`, `ridgeCoeffFn_le_one`). + This is the meaningful invariant: the CHD noise level is a genuine fraction. +* **`projFn_mulVec_self` / `varNoiseFn_projFn_mulVec`** — feeding data `ga = V·z` makes `V` drop out of + the statistic, so it depends on the kernel only through its spectrum. This is the deterministic + content of "the `Z_test` null distribution depends only on the eigenvalues" (the *distributional* + step — Gaussian sampling and percentiles — is out of scope here, exercised numerically instead). + +`IsSymEig.eigenvalues_nonneg` supplies the `λᵢ ≥ 0` hypothesis from a positive-semidefinite kernel. 
+ +Scope honesty: everything here is exact over `ℝ`, proved from the *specification* `IsSymEig` (so it +holds for whatever eigendecomposition the solver returns), not from the asymptotic Jacobi convergence. +-/ + +@[expose] public section + +namespace Spec.Factorization + +open Matrix +open scoped BigOperators +open Spec.Factorization.Reconstruction + +variable {n : Nat} + +/-! ## Bridge: the projection is `Vᵀ·ga` -/ + +/-- `Spec.projFn V ga = Vᵀ *ᵥ ga`: the executable projection is multiplication by `Vᵀ`. -/ +theorem projFn_eq_mulVec (V : Matrix (Fin n) (Fin n) ℝ) (ga : Fin n → ℝ) : + Spec.projFn V ga = Vᵀ *ᵥ ga := by + funext i + rw [Spec.projFn, dotFn_eq_sum] + show ∑ k, V k i * ga k = ∑ k, Vᵀ i k * ga k + exact Finset.sum_congr rfl (fun k _ => by rw [Matrix.transpose_apply]) + +/-- Feeding `ga = V·z` recovers `z`: `projFn V (V *ᵥ z) = z` when `Vᵀ·V = 1`. The change of variables +that makes the `Z_test` statistic depend on the kernel only through its spectrum. -/ +theorem projFn_mulVec_self {V : Matrix (Fin n) (Fin n) ℝ} (hV : Vᵀ * V = 1) (z : Fin n → ℝ) : + Spec.projFn V (V *ᵥ z) = z := by + rw [projFn_eq_mulVec, Matrix.mulVec_mulVec, hV, Matrix.one_mulVec] + +/-! ## The variational solution is the regularized inverse -/ + +/-- **The eigendecomposition-form `solve_variationnal` is the regularized-inverse solve.** Given an +eigendecomposition `IsSymEig A Λ V` and `γ` avoiding every `-λᵢ`, the CHD solution +`yb = -V·(Pga/(λ+γ))` equals `-(A + γ·I)⁻¹·ga`. Proved directly from `add_smul_inv`. 
-/ +theorem variationalSolveFn_eq_neg_inv_mulVec + {A V : Matrix (Fin n) (Fin n) ℝ} {Λ : Fin n → ℝ} + (h : IsSymEig A Λ V) (γ : ℝ) (hγ : ∀ i, Λ i + γ ≠ 0) (ga : Fin n → ℝ) : + Spec.variationalSolveFn Λ V γ ga + = -((A + γ • (1 : Matrix (Fin n) (Fin n) ℝ))⁻¹ *ᵥ ga) := by + rw [h.add_smul_inv γ hγ] + funext i + simp only [Spec.variationalSolveFn, Pi.neg_apply] + congr 1 + rw [dotFn_eq_sum] + rw [show (V * Matrix.diagonal (fun j => (Λ j + γ)⁻¹) * Vᵀ) *ᵥ ga + = V *ᵥ (fun j => (Λ j + γ)⁻¹ * Spec.projFn V ga j) from by + rw [← Matrix.mulVec_mulVec, ← Matrix.mulVec_mulVec] + congr 1 + funext j + rw [Matrix.mulVec_diagonal, ← projFn_eq_mulVec]] + show ∑ j, V i j * (Spec.projFn V ga j / (Λ j + γ)) + = ∑ j, V i j * ((Λ j + γ)⁻¹ * Spec.projFn V ga j) + exact Finset.sum_congr rfl (fun j _ => by rw [div_eq_mul_inv]; ring) + +/-! ## PSD kernels have nonnegative eigenvalues -/ + +/-- For a positive-semidefinite `A`, every eigenvalue in *any* `IsSymEig` decomposition is `≥ 0`. The +`i`-th eigenvalue is the quadratic form `vᵢᵀ A vᵢ` of the `i`-th eigenvector, which PSD makes +nonnegative. -/ +theorem IsSymEig.eigenvalues_nonneg {A V : Matrix (Fin n) (Fin n) ℝ} {Λ : Fin n → ℝ} + (h : IsSymEig A Λ V) (hA : A.PosSemidef) (i : Fin n) : 0 ≤ Λ i := by + obtain ⟨hV, hAeq⟩ := h + -- `Vᵀ A V = diag Λ` (orthogonal conjugation collapses to the diagonal) + have hconj : Vᵀ * A * V = Matrix.diagonal Λ := by + rw [hAeq, + show Vᵀ * (V * Matrix.diagonal Λ * Vᵀ) * V + = (Vᵀ * V) * Matrix.diagonal Λ * (Vᵀ * V) by simp [Matrix.mul_assoc], + hV, Matrix.one_mul, Matrix.mul_one] + -- over ℝ, `Vᴴ = Vᵀ`, so PSD-congruence `Vᵀ A V` is PSD, i.e. 
`diag Λ` is PSD + have hVH : (Vᴴ : Matrix (Fin n) (Fin n) ℝ) = Vᵀ := by + ext a b; simp [Matrix.conjTranspose_apply, Matrix.transpose_apply] + have hps : (Matrix.diagonal Λ).PosSemidef := by + have hcong := hA.conjTranspose_mul_mul_same V + rwa [hVH, hconj] at hcong + have hdiag := hps.diag_nonneg (i := i) + rwa [Matrix.diagonal_apply_eq] at hdiag + +/-- **The eig route and the Cholesky route agree.** For a PSD kernel `K` and `γ > 0`, the +eigendecomposition-form `variationalSolveFn` equals `-solveRidgeFn` (the verified Cholesky solve of +`FactorizationsSolve`): two independent implementations of CHD `solve_variationnal`, both equal to +`-(K + γ·I)⁻¹·ga`. -/ +theorem variationalSolveFn_eq_neg_solveRidgeFn + {K : Fin n → Fin n → ℝ} {Λ : Fin n → ℝ} {V : Matrix (Fin n) (Fin n) ℝ} + (h : IsSymEig (Matrix.of K) Λ V) (hK : (Matrix.of K).PosSemidef) {γ : ℝ} (hγ : 0 < γ) + (ga : Fin n → ℝ) : + Spec.variationalSolveFn Λ V γ ga = -(Spec.solveRidgeFn K γ ga) := by + have hΛ : ∀ i, 0 ≤ Λ i := h.eigenvalues_nonneg hK + have hγne : ∀ i, Λ i + γ ≠ 0 := fun i => (by have := hΛ i; linarith : (0:ℝ) < Λ i + γ).ne' + rw [variationalSolveFn_eq_neg_inv_mulVec h γ hγne ga, + show Spec.solveRidgeFn K γ ga = (Matrix.of (Spec.addScaledIdFn K γ))⁻¹ *ᵥ ga from + solveRidgeFn_eq_inv_mulVec K γ ga hK hγ, + of_addScaledIdFn] + +/-! ## The noise / `find_gamma` loss / `Z_test` statistic -/ + +/-- **The noise functional as a spectral ratio.** `varNoiseFn` (the CHD `noise`, the `find_gamma` loss, +and the `Z_test` per-sample statistic) is `Σᵢ (Pgaᵢ·rᵢ)² / Σᵢ Pgaᵢ²·rᵢ`, with `rᵢ = γ/(λᵢ + γ)`. 
-/ +theorem varNoiseFn_eq_ratio (Λ : Fin n → ℝ) (γ : ℝ) (Pga : Fin n → ℝ) : + Spec.varNoiseFn Λ γ Pga + = (∑ i, (Pga i * (γ / (Λ i + γ))) ^ 2) / (∑ i, Pga i ^ 2 * (γ / (Λ i + γ))) := by + simp only [Spec.varNoiseFn, Spec.ridgeCoeffFn] + rw [dotFn_eq_sum, dotFn_eq_sum] + congr 1 + · exact Finset.sum_congr rfl (fun i _ => by ring) + · exact Finset.sum_congr rfl (fun i _ => by ring) + +/-- A shrinkage coefficient is strictly positive for a PSD spectrum and `γ > 0`. -/ +theorem ridgeCoeffFn_pos {Λ : Fin n → ℝ} (hΛ : ∀ i, 0 ≤ Λ i) {γ : ℝ} (hγ : 0 < γ) (i : Fin n) : + 0 < Spec.ridgeCoeffFn Λ γ i := by + rw [Spec.ridgeCoeffFn]; exact div_pos hγ (by have := hΛ i; linarith) + +/-- A shrinkage coefficient is at most `1` for a PSD spectrum and `γ > 0`. -/ +theorem ridgeCoeffFn_le_one {Λ : Fin n → ℝ} (hΛ : ∀ i, 0 ≤ Λ i) {γ : ℝ} (hγ : 0 < γ) (i : Fin n) : + Spec.ridgeCoeffFn Λ γ i ≤ 1 := by + rw [Spec.ridgeCoeffFn, div_le_one (by have := hΛ i; linarith)] + have := hΛ i; linarith + +/-- **The noise level is nonnegative** for a PSD spectrum and `γ > 0`. -/ +theorem varNoiseFn_nonneg {Λ : Fin n → ℝ} (hΛ : ∀ i, 0 ≤ Λ i) {γ : ℝ} (hγ : 0 < γ) + (Pga : Fin n → ℝ) : 0 ≤ Spec.varNoiseFn Λ γ Pga := by + rw [varNoiseFn_eq_ratio] + apply div_nonneg + · exact Finset.sum_nonneg (fun i _ => sq_nonneg _) + · refine Finset.sum_nonneg (fun i _ => ?_) + have hd : (0:ℝ) < Λ i + γ := by have := hΛ i; linarith + exact mul_nonneg (sq_nonneg _) (div_nonneg hγ.le hd.le) + +/-- **The noise level is at most `1`** for a PSD spectrum and `γ > 0`: each squared shrinkage +coefficient `rᵢ²` is dominated by `rᵢ` (since `0 ≤ rᵢ ≤ 1`), so the numerator is at most the +denominator. The CHD `noise` is therefore a genuine fraction in `[0, 1]`. 
-/ +theorem varNoiseFn_le_one {Λ : Fin n → ℝ} (hΛ : ∀ i, 0 ≤ Λ i) {γ : ℝ} (hγ : 0 < γ) + (Pga : Fin n → ℝ) : Spec.varNoiseFn Λ γ Pga ≤ 1 := by + rw [varNoiseFn_eq_ratio] + have hdenom_nonneg : 0 ≤ ∑ i, Pga i ^ 2 * (γ / (Λ i + γ)) := + Finset.sum_nonneg (fun i _ => by + have hd : (0:ℝ) < Λ i + γ := by have := hΛ i; linarith + exact mul_nonneg (sq_nonneg _) (div_nonneg hγ.le hd.le)) + have hle : (∑ i, (Pga i * (γ / (Λ i + γ))) ^ 2) ≤ ∑ i, Pga i ^ 2 * (γ / (Λ i + γ)) := by + refine Finset.sum_le_sum (fun i _ => ?_) + have hd : (0:ℝ) < Λ i + γ := by have := hΛ i; linarith + have hr0 : 0 ≤ γ / (Λ i + γ) := div_nonneg hγ.le hd.le + have hr1 : γ / (Λ i + γ) ≤ 1 := by rw [div_le_one hd]; have := hΛ i; linarith + rw [show (Pga i * (γ / (Λ i + γ))) ^ 2 = Pga i ^ 2 * (γ / (Λ i + γ)) ^ 2 by ring] + apply mul_le_mul_of_nonneg_left _ (sq_nonneg _) + nlinarith [mul_nonneg hr0 (sub_nonneg.mpr hr1)] + rcases hdenom_nonneg.lt_or_eq with hpos | h0 + · rw [div_le_one hpos]; exact hle + · rw [← h0, div_zero]; exact zero_le_one + +/-- **`Z_test` spectral invariance.** Replacing the data by `ga = V·z` removes `V` from the statistic: +`varNoiseFn Λ γ (projFn V (V·z)) = varNoiseFn Λ γ z`. So the functional `Z_test` samples depends on the +kernel only through its eigenvalues. -/ +theorem varNoiseFn_projFn_mulVec {V : Matrix (Fin n) (Fin n) ℝ} (hV : Vᵀ * V = 1) + (Λ : Fin n → ℝ) (γ : ℝ) (z : Fin n → ℝ) : + Spec.varNoiseFn Λ γ (Spec.projFn V (V *ᵥ z)) = Spec.varNoiseFn Λ γ z := by + rw [projFn_mulVec_self hV] + +end Spec.Factorization diff --git a/NN/Spec/Core/Tensor/Factorizations.lean b/NN/Spec/Core/Tensor/Factorizations.lean index 98c2b42..b06f667 100644 --- a/NN/Spec/Core/Tensor/Factorizations.lean +++ b/NN/Spec/Core/Tensor/Factorizations.lean @@ -376,4 +376,58 @@ def svdSpec {m n : Nat} (A : Tensor α (.dim m (.dim n .scalar))) (sweeps : Nat else 0 (ofMatFn U, ofVecFn σ, ofMatFn (fun i j => arrGet Vf i.val j.val)) +/-! 
## CHD variational solve, noise, and γ-selection (eigendecomposition form) + +CHD's `perform_regression_and_find_gamma` does not use the Cholesky route above; it works through the +*eigendecomposition* `K = V · diag(λ) · Vᵀ` returned by `eigh(K)` (`symEigJacobiSpec`). Three routines +share one arithmetic core — the *projected data* `Pga = Vᵀ · ga` and the *shrinkage coefficients* +`rᵢ = γ/(λᵢ + γ)`: + +* `solve_variationnal` returns the solution `yb = -V·(Pga/(λ+γ))` (`= -(K+γ·I)⁻¹·ga`) and a scalar + `noise` level; +* `find_gamma` minimises that *same* `noise` functional over `γ`; +* `Z_test` evaluates the `noise` functional on random Gaussian data to obtain a null distribution. + +The definitions below mirror `interpolatory.py` verbatim, taking the eigenpairs `(Λ, V)` the solver +returns — exactly as CHD passes `eigh(K)` into them. Their algebraic identities (the solve is the +regularized inverse; the noise is a spectral quadratic-form ratio in `[0,1]`) are proved in +[`NN.Proofs.Tensor.Basic.FactorizationsVariational`](../../../Proofs/Tensor/Basic/FactorizationsVariational.lean). +-/ + +/-- Projected data `Pga = Vᵀ · ga`: component `i` is `⟨vᵢ, ga⟩`, the coordinate of `ga` along +eigenvector `i` (column `i` of `V`). Mirrors `np.dot(eigenvectors.T, ga)`. -/ +def projFn {n : Nat} (V : Fin n → Fin n → α) (ga : Fin n → α) : Fin n → α := + fun i => dotFn (fun k => V k i) ga + +/-- Shrinkage coefficient `rᵢ = γ/(λᵢ + γ)`. For a PSD spectrum (`λᵢ ≥ 0`) and `γ > 0` this lies in +`(0, 1]`. Mirrors `coeffs = gamma / (eigenvalues + gamma)`. -/ +def ridgeCoeffFn {n : Nat} (Λ : Fin n → α) (γ : α) : Fin n → α := + fun i => γ / (Λ i + γ) + +/-- The CHD variational solution `yb = -V·(Pga / (λ + γ))` in eigendecomposition form. Equal to +`-(K + γ·I)⁻¹·ga`. Mirrors `yb = -np.dot(eigenvectors, Pga / (eigenvalues + gamma))`. 
-/ +def variationalSolveFn {n : Nat} (Λ : Fin n → α) (V : Fin n → Fin n → α) (γ : α) (ga : Fin n → α) : + Fin n → α := + let Pga := projFn V ga + fun i => -dotFn (V i) (fun j => Pga j / (Λ j + γ)) + +/-- The CHD `noise` level (also the `find_gamma` loss and the `Z_test` per-sample statistic): +`Σᵢ (Pgaᵢ·rᵢ)² / Σᵢ (Pgaᵢ·rᵢ)·Pgaᵢ`, with `rᵢ = γ/(λᵢ + γ)`. Mirrors +`noise = np.dot(Pgacoeff, Pgacoeff) / np.dot(Pgacoeff, Pga)` where `Pgacoeff = Pga * coeffs`. -/ +def varNoiseFn {n : Nat} (Λ : Fin n → α) (γ : α) (Pga : Fin n → α) : α := + let pc := fun i => Pga i * ridgeCoeffFn Λ γ i + dotFn pc pc / dotFn pc Pga + +/-- Tensor-level CHD variational solve `yb = -(K + γ·I)⁻¹·ga`, from eigenpairs `(evals, V)`. -/ +def variationalSolveSpec {n : Nat} (evals : Tensor α (.dim n .scalar)) + (V : Tensor α (.dim n (.dim n .scalar))) (γ : α) (ga : Tensor α (.dim n .scalar)) : + Tensor α (.dim n .scalar) := + ofVecFn (variationalSolveFn (toVecFn evals) (toMatFn V) γ (toVecFn ga)) + +/-- Tensor-level CHD `noise` / `find_gamma` loss, from eigenvalues, `γ` and the data `ga` (projected +internally as `Pga = Vᵀ·ga`). -/ +def varNoiseSpec {n : Nat} (evals : Tensor α (.dim n .scalar)) + (V : Tensor α (.dim n (.dim n .scalar))) (γ : α) (ga : Tensor α (.dim n .scalar)) : α := + varNoiseFn (toVecFn evals) γ (projFn (toMatFn V) (toVecFn ga)) + end Spec diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index d60d5a0..1c0b959 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -177,6 +177,56 @@ an inverse*. The `RidgeSolve` example makes this concrete: solving against each `eⱼ` produces column `j` of `(K + γI)⁻¹`, and the assembled matrix satisfies `(K + γI) · (K + γI)⁻¹ = I` to machine precision, every column coming from the verified Cholesky solve. 
+# The CHD routines: variational solve, `find_gamma`, and `Z_test` + +The two solve routes above invert `K + γI`. But CHD's `perform_regression_and_find_gamma` +(`interpolatory.py`) does not stop there: it takes the *eigendecomposition* route — `eigh(K)` once, then +three routines computed from the eigenpairs `(λ, V)`. They share one arithmetic core: the *projected +data* `Pga = Vᵀ ga` and the *shrinkage coefficients* `rᵢ = γ/(λᵢ + γ)`. +[`NN.Proofs.Tensor.Basic.FactorizationsVariational`](https://github.com/lean-dojo/TorchLean/blob/main/NN/Proofs/Tensor/Basic/FactorizationsVariational.lean) +identifies what each computes; everything is exact over `ℝ`, proved from the *specification* +`IsSymEig` (so it holds for whatever eigendecomposition the solver returns, asymptotic or not). + +The variational solution `solve_variationnal` returns, in eigendecomposition form, +`yb = -V (Pga/(λ+γ))`. `variationalSolveFn_eq_neg_inv_mulVec` proves this *is* the regularized-inverse +solve, directly from `add_smul_inv`: + +$$`\texttt{variationalSolveFn}\,\Lambda\,V\,\gamma\,ga \;=\; -\,(K + \gamma I)^{-1} ga.` + +So the eigendecomposition route and the Cholesky route compute the *same* `solve_variationnal`: +`variationalSolveFn_eq_neg_solveRidgeFn` proves `variationalSolveFn = -solveRidgeFn` for a +positive-semidefinite kernel `K` and `γ > 0` — two independent implementations agreeing on the one +closed form `-(K + γI)⁻¹ ga`. The supporting fact `IsSymEig.eigenvalues_nonneg` (a PSD matrix's +eigenvalues are `≥ 0`, via the congruence `Vᵀ A V` being positive-semidefinite) discharges the +`λᵢ + γ ≠ 0` side condition from `γ > 0`. + +`find_gamma` and `Z_test` share a second quantity, the `noise` level. 
`varNoiseFn_eq_ratio` exhibits it +as a spectral quadratic-form ratio: + +$$`\texttt{noise} \;=\; \frac{\sum_i (Pga_i\, r_i)^2}{\sum_i Pga_i^2\, r_i}, +\qquad r_i = \frac{\gamma}{\lambda_i + \gamma}.` + +`find_gamma` minimises this functional over `γ`; `Z_test` evaluates it on random Gaussian data. The +load-bearing invariant is that the `noise` is a genuine *fraction*: for a PSD spectrum (`λᵢ ≥ 0`) and +`γ > 0`, each coefficient satisfies `0 < rᵢ ≤ 1` (`ridgeCoeffFn_pos`, `ridgeCoeffFn_le_one`), so +`rᵢ² ≤ rᵢ` makes the numerator dominated by the denominator termwise (`Pgaᵢ²·rᵢ² ≤ Pgaᵢ²·rᵢ`), giving + +$$`0 \;\le\; \texttt{noise} \;\le\; 1` + +(`varNoiseFn_nonneg`, `varNoiseFn_le_one`). Finally, the `Z_test` statistic depends on the kernel only +through its *spectrum*: replacing the data by `ga = V z` makes `V` cancel, `projFn V (V z) = z` +(`projFn_mulVec_self`), so `varNoiseFn Λ γ (projFn V (V z)) = varNoiseFn Λ γ z` +(`varNoiseFn_projFn_mulVec`). This is the deterministic content of "the `Z_test` null distribution +depends only on the eigenvalues"; the *distributional* step — Gaussian sampling and the 5%/95% +percentiles — is statistical rather than algebraic and is left to runtime, exercised numerically. + +The `Variational` example confirms all four on a concrete SPD kernel: `(K + γI)·yb = -ga` and +`yb = -solveRidgeSpec K γ ga` to machine precision, `noise ∈ [0,1]`, and the spectral invariance +`noise(V z) = noise(z)` to machine precision. Its *negative controls* show the hypotheses biting: +feeding the *wrong* eigenvectors (the identity in place of `V`) makes the solve residual large, and +`γ < 0` pushes the `noise` outside `[0,1]` — so the true eigendecomposition and `γ > 0` are both +necessary.
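
The `noise` ratio described above can be checked by hand on a tiny spectrum. What follows is a minimal standalone sketch in core Lean 4 over plain `List Float`, assuming no TorchLean or Mathlib imports. The names `ridgeCoeff`, `dot`, and `noise` are hypothetical stand-ins introduced for illustration; they are not the `ridgeCoeffFn` / `dotFn` / `varNoiseFn` definitions from the patch.

```lean
-- Hypothetical standalone sketch (core Lean 4 only, no Mathlib or TorchLean imports).

/-- Shrinkage coefficient `r = γ / (lam + γ)`. -/
def ridgeCoeff (γ lam : Float) : Float := γ / (lam + γ)

/-- Dot product of two lists (extra entries of the longer list are ignored). -/
def dot (xs ys : List Float) : Float :=
  (List.zipWith (fun (x y : Float) => x * y) xs ys).foldl
    (fun (acc t : Float) => acc + t) 0.0

/-- `noise = Σ (Pgaᵢ·rᵢ)² / Σ (Pgaᵢ·rᵢ)·Pgaᵢ`, with `rᵢ = γ / (lamᵢ + γ)`. -/
def noise (lams : List Float) (γ : Float) (pga : List Float) : Float :=
  let r  := lams.map (ridgeCoeff γ)
  let pc := List.zipWith (fun (p c : Float) => p * c) pga r
  dot pc pc / dot pc pga

-- Spectrum {0, 1, 3}, γ = 1, projected data (1, 1, 1):
-- r = (1, 0.5, 0.25), so noise = 1.3125 / 1.75 = 0.75, inside [0, 1].
#eval noise [0.0, 1.0, 3.0] 1.0 [1.0, 1.0, 1.0]
```

Because every `rᵢ` here lies in `(0, 1]`, each numerator term is at most the matching denominator term, which is the same termwise comparison the `[0,1]` bound relies on.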
+ # The a-posteriori residual certificate For the iterative routines, the replacement for an impossible a-priori convergence proof is an exact @@ -359,4 +409,9 @@ Everything else is exact: the algebraic faithfulness of the decomposition (ortho similarity, the residual identity, the per-rotation decrease, the classical-strategy linear rate, and correctness in the zero-residual limit), the finite Cholesky/QR reconstructions, and the Cholesky-based regularized solve are proved, and the specification-level facts the kernel methods rely -on are independent of the convergence step, so the CHD foundation is complete. +on are independent of the convergence step. The three concrete CHD routines built on them are now +identified too: the eigendecomposition-form `solve_variationnal` equals `-(K + γI)⁻¹ ga` and agrees +with the Cholesky route, and the `noise`/`find_gamma`-loss/`Z_test` statistic is a spectral ratio +provably in `[0,1]` that depends on the kernel only through its spectrum. So the CHD foundation is +complete, the one remaining open item being statistical, not algebraic — the `Z_test`'s Gaussian +sampling and percentiles, exercised numerically rather than proved. From 2b82b8faa9a56efd46f3483abaa836b33230bb13 Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sun, 31 May 2026 13:11:09 -0700 Subject: [PATCH 12/22] Discharge the PSD hypothesis: the CHD linear-mode kernel is positive-semidefinite MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Every verified CHD solve/find_gamma/Z_test theorem assumes hK : (Matrix.of K).PosSemidef, but CHD builds K from data (Modes/kernels.py) — that hypothesis was never discharged. This proves it for the linear mode, the same move as the positive-pivot keystone: turn an assumed precondition into a theorem. 
Spec (NN/Spec/Core/Tensor/Factorizations.lean): maskColsFn (which_dim column masking), linearKernelFn (K[i,j] = 1 + scale·⟨Φi,Φj⟩ = 𝟙𝟙ᵀ + scale·ΦΦᵀ, mirroring LinearMode.vectorized_kernel), and the tensor wrapper linearKernelSpec. Proofs (new NN/Proofs/Tensor/Basic/FactorizationsKernels.lean, sorry/omega-free): - linearKernelFn_posSemidef: K is PSD for scale ≥ 0 — 𝟙𝟙ᵀ is a rank-one Gram (PSD), ΦΦᵀ is a Gram (posSemidef_self_mul_conjTranspose), scale ≥ 0 keeps it PSD (PosSemidef.smul), PosSemidef.add closes the sum. - linearKernelFn_symm: symmetry, a corollary of PosSemidef.isHermitian. - linearKernelSpec_posSemidef: the tensor-level statement the solve theorems consume — so solveRidgeSpec (linearKernelSpec X w scale) γ b is now an unconditional exact solve for γ > 0, no PSD hypothesis left to assume. Examples (new NN/Examples/Factorization/LinearKernel.lean): 6 green #eval checks — K = Kᵀ, matches the CHD LinearMode formula, all Jacobi eigenvalues ≥ 0 (feature masking preserved), the PSD kernel feeds an exact ridge solve (residual 0); negative control: scale = -1 makes 𝟙𝟙ᵀ − ΦΦᵀ indefinite (two negative eigenvalues). Blueprint: new "Building the kernel" section; flags the follow-ons — quadratic via the Schur product theorem (PosSemidef.hadamard, in Mathlib) and Gaussian via Schoenberg (Bochner not in Mathlib v4.30.0, the new research-grade item). 
Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Examples/Factorization.lean | 7 + NN/Examples/Factorization/LinearKernel.lean | 134 ++++++++++++++++++ NN/Proofs/Tensor/Basic.lean | 1 + .../Tensor/Basic/FactorizationsKernels.lean | 99 +++++++++++++ NN/Spec/Core/Tensor/Factorizations.lean | 32 +++++ .../Ch4_Verification/Factorizations.lean | 31 ++++ 6 files changed, 304 insertions(+) create mode 100644 NN/Examples/Factorization/LinearKernel.lean create mode 100644 NN/Proofs/Tensor/Basic/FactorizationsKernels.lean diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean index 3d06494..bae41f9 100644 --- a/NN/Examples/Factorization.lean +++ b/NN/Examples/Factorization.lean @@ -15,6 +15,7 @@ public import NN.Examples.Factorization.JacobiDecrease public import NN.Examples.Factorization.JacobiRate public import NN.Examples.Factorization.RidgeSolve public import NN.Examples.Factorization.Variational +public import NN.Examples.Factorization.LinearKernel /-! # Matrix factorization examples @@ -59,6 +60,12 @@ factorization misbehaves. (`varNoiseFn_nonneg`, `varNoiseFn_le_one`), and `Z_test` spectral invariance (`varNoiseFn_projFn_mulVec`); **negative controls**: wrong eigenvectors break the solve, and `γ < 0` pushes the noise outside `[0,1]`. +- `LinearKernel` — CHD *builds* the kernel from data (`Modes/kernels.py`); the linear mode is + `K = 𝟙𝟙ᵀ + scale·Φ·Φᵀ`, proven symmetric positive-semidefinite for `scale ≥ 0` + (`linearKernelFn_posSemidef`), which discharges the `PosSemidef` hypothesis every solve/`find_gamma` + theorem assumes. Checks: `K = Kᵀ`, matches the CHD `LinearMode` formula, all Jacobi eigenvalues `≥ 0` + (masking a feature preserved), and the PSD kernel feeds an exact ridge solve; **negative control**: + `scale < 0` makes `K` indefinite (a negative eigenvalue appears). 
Both **positive** checks (a valid factorization reconstructs to `err ≈ 0`) and **negative controls** (the same metric reports a large error / `NaN` when a hypothesis is violated) are included, so a diff --git a/NN/Examples/Factorization/LinearKernel.lean b/NN/Examples/Factorization/LinearKernel.lean new file mode 100644 index 0000000..4d5d617 --- /dev/null +++ b/NN/Examples/Factorization/LinearKernel.lean @@ -0,0 +1,134 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Examples.Factorization.Common +meta import NN.Examples.Factorization.Common + +/-! +# Example: the CHD linear-mode kernel is symmetric positive-semidefinite + +These checks corroborate `NN.Proofs.Tensor.Basic.FactorizationsKernels`. The whole verified CHD +solve / `find_gamma` / `Z_test` development assumes the kernel matrix `K` is positive-semidefinite; +CHD *builds* `K` from data (`Modes/kernels.py`). For the linear mode, + +`K = 𝟙𝟙ᵀ + scale · Φ·Φᵀ` (`Φ` = column-masked data), + +which `linearKernelFn_posSemidef` proves PSD for `scale ≥ 0` — discharging that standing hypothesis for +the real linear kernel. We exhibit: + +* **symmetric** — `K = Kᵀ` to machine precision (`linearKernelFn_symm`); +* **matches CHD** — `K[i,j] = 1 + scale·⟨xᵢ, xⱼ⟩` agrees with the direct `LinearMode` formula; +* **positive-semidefinite** — every Jacobi eigenvalue is `≥ 0` (the numeric witness of + `linearKernelFn_posSemidef`), and masking a feature (`w = [1,0]`) keeps it PSD; +* **feeds the verified solve** — because `K` is PSD, `solveRidgeSpec K γ b` is the exact regularized + solve for `γ > 0` (`(K+γ·I)·x = b` to machine precision), with no PSD hypothesis left to assume. + +**Negative control**: with `scale = -1` the kernel `𝟙𝟙ᵀ − Φ·Φᵀ` is indefinite — a Jacobi eigenvalue +goes negative — so `scale ≥ 0` is necessary for the PSD guarantee. 
+-/ + +@[expose] public section + + +namespace NN.Examples.Factorization.LinearKernel + +/-- Build a length-`n` `Float` vector from a list (missing entries `0`). -/ +def mkVec {n : Nat} (xs : List Float) : Spec.Tensor Float (.dim n .scalar) := + Spec.ofVecFn (fun i => xs.getD i.val 0.0) + +/-- Count Jacobi eigenvalues that are negative (below `−10⁻⁹`). `0` certifies positive-semidefiniteness +numerically; `≥ 1` certifies an indefinite matrix. -/ +def numNegEigs {k : Nat} (M : Spec.Tensor Float (.dim k (.dim k .scalar))) : Float := + let evals := (Spec.symEigJacobiSpec M 12).1 + (List.finRange k).foldl + (fun a i => a + (if Spec.Tensor.toScalar (Spec.get evals i) < -1e-9 then 1.0 else 0.0)) 0.0 + +/-- The regularized matrix `K + γ·I` as a tensor. -/ +def addGammaI {n : Nat} (K : Spec.Tensor Float (.dim n (.dim n .scalar))) (γ : Float) : + Spec.Tensor Float (.dim n (.dim n .scalar)) := + Spec.ofMatFn (fun i j => Spec.get2 K i j + (if i.val == j.val then γ else 0.0)) + +/-- `ℓ¹` magnitude `Σᵢ |vᵢ|` (a sum, so a `NaN` propagates). -/ +def vecAbsErr {n : Nat} (v : Spec.Tensor Float (.dim n .scalar)) : Float := + (List.finRange n).foldl (fun a i => a + Float.abs (Spec.Tensor.toScalar (Spec.get v i))) 0.0 + +/-- Residual `(K + γ·I)·x − b`. -/ +def ridgeResidual {n : Nat} (K : Spec.Tensor Float (.dim n (.dim n .scalar))) (γ : Float) + (b x : Spec.Tensor Float (.dim n .scalar)) : Spec.Tensor Float (.dim n .scalar) := + Spec.ofVecFn (fun i => + Spec.Tensor.toScalar (Spec.get (Spec.matVecMulSpec (addGammaI K γ) x) i) + - Spec.Tensor.toScalar (Spec.get b i)) + +/-- A `4 × 2` data matrix (4 samples, 2 features). -/ +def X : Spec.Tensor Float (.dim 4 (.dim 2 .scalar)) := + mkMat [[1, 0], + [0, 1], + [1, 1], + [2, 1]] + +/-- Selection mask `which_dim = [1,1]` (both features active). -/ +def wAll : Spec.Tensor Float (.dim 2 .scalar) := mkVec [1, 1] +def scale : Float := 2.0 + +/-- The linear-mode kernel `K = 𝟙𝟙ᵀ + scale·Φ·Φᵀ` (4×4). 
-/ +def K : Spec.Tensor Float (.dim 4 (.dim 4 .scalar)) := Spec.linearKernelSpec X wAll scale + +/-- Direct CHD `LinearMode` formula `Kref[i,j] = 1 + scale·Σ_k X[i,k]·X[j,k]` (mask all-ones). -/ +def Kref : Spec.Tensor Float (.dim 4 (.dim 4 .scalar)) := + Spec.ofMatFn (fun i j => 1.0 + scale * + (List.finRange 2).foldl (fun a k => a + Spec.get2 X i k * Spec.get2 X j k) 0.0) + +#eval IO.println s!"linear kernel K =\n{(List.finRange 4).map (fun i => + (List.finRange 4).map (fun j => Spec.get2 K i j))}" +#eval IO.println s!"eigenvalues of K = {vecToList (Spec.symEigJacobiSpec K 12).1}" + +-- Positive — `K` is symmetric (`linearKernelFn_symm`). +#eval assertLt "linear kernel is symmetric: K = Kᵀ" (maxMatErr K (tr K)) + +-- Positive — `K` matches the direct CHD `LinearMode` formula. +#eval assertLt "linear kernel matches CHD LinearMode formula" (maxMatErr K Kref) + +-- Positive — `K` is positive-semidefinite: no negative Jacobi eigenvalue (`linearKernelFn_posSemidef`). +#eval assertLt "linear kernel is PSD: no negative eigenvalue" (numNegEigs K) + +/-! ## Masking a feature preserves PSD -/ + +/-- Mask out feature 1: `which_dim = [1,0]`. -/ +def wMask : Spec.Tensor Float (.dim 2 .scalar) := mkVec [1, 0] +def Kmask : Spec.Tensor Float (.dim 4 (.dim 4 .scalar)) := Spec.linearKernelSpec X wMask scale + +#eval IO.println s!"masked-feature kernel eigenvalues = {vecToList (Spec.symEigJacobiSpec Kmask 12).1}" + +-- Positive — masking a feature keeps the kernel PSD (PSD holds for any mask). +#eval assertLt "masked linear kernel is still PSD" (numNegEigs Kmask) + +/-! ## The PSD kernel feeds the verified ridge solve -/ + +def γ : Float := 0.5 +def b : Spec.Tensor Float (.dim 4 .scalar) := mkVec [1, 2, 3, 4] +/-- The ridge solution against the linear kernel `K`. 
-/ +def x : Spec.Tensor Float (.dim 4 .scalar) := Spec.solveRidgeSpec K γ b + +#eval IO.println s!"ridge solve on the linear kernel: residual = {vecToList (ridgeResidual K γ b x)}" + +-- Positive — `K` PSD ⟹ `solveRidgeSpec K γ b` is the exact solve of `(K+γI)·x = b` (γ > 0, no +-- PSD hypothesis to assume — it is now proven for this kernel). +#eval assertLt "PSD linear kernel ⟹ exact ridge solve (K+γI)·x = b" + (vecAbsErr (ridgeResidual K γ b x)) + +/-! ## Negative control: `scale < 0` breaks positive-semidefiniteness -/ + +/-- The same kernel with `scale = -1`: `𝟙𝟙ᵀ − Φ·Φᵀ`, indefinite. -/ +def Kneg : Spec.Tensor Float (.dim 4 (.dim 4 .scalar)) := Spec.linearKernelSpec X wAll (-1.0) + +#eval IO.println s!"scale = -1 kernel eigenvalues = {vecToList (Spec.symEigJacobiSpec Kneg 12).1}" + +-- Negative — with `scale < 0` the kernel is indefinite: at least one eigenvalue is negative. +#eval assertGe "scale < 0 breaks PSD (indefinite kernel)" (numNegEigs Kneg) 1.0 + +end NN.Examples.Factorization.LinearKernel diff --git a/NN/Proofs/Tensor/Basic.lean b/NN/Proofs/Tensor/Basic.lean index aecf544..7ca7dc3 100644 --- a/NN/Proofs/Tensor/Basic.lean +++ b/NN/Proofs/Tensor/Basic.lean @@ -13,6 +13,7 @@ public import NN.Proofs.Tensor.Basic.Factorizations public import NN.Proofs.Tensor.Basic.FactorizationsReconstruction public import NN.Proofs.Tensor.Basic.FactorizationsSolve public import NN.Proofs.Tensor.Basic.FactorizationsVariational +public import NN.Proofs.Tensor.Basic.FactorizationsKernels public import NN.Proofs.Tensor.Basic.FactorizationsOrthonormal public import NN.Proofs.Tensor.Basic.FactorizationsJacobi public import NN.Proofs.Tensor.Basic.FactorizationsJacobiDecrease diff --git a/NN/Proofs/Tensor/Basic/FactorizationsKernels.lean b/NN/Proofs/Tensor/Basic/FactorizationsKernels.lean new file mode 100644 index 0000000..dc399aa --- /dev/null +++ b/NN/Proofs/Tensor/Basic/FactorizationsKernels.lean @@ -0,0 +1,99 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license 
as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Proofs.Tensor.Basic.Factorizations +public import NN.Proofs.Tensor.Basic.FactorizationsOrthonormal +public import Mathlib.Data.Real.StarOrdered + +/-! +# CHD mode kernels are symmetric positive-semidefinite + +The entire verified CHD solve / `find_gamma` / `Z_test` development takes the kernel matrix `K` as +input under the hypothesis `(Matrix.of K).PosSemidef`. CHD does not receive `K`; it *builds* it from +data (`Modes/kernels.py`). This file discharges that standing hypothesis for the **linear mode** — the +first and simplest of CHD's kernels — exactly as the positive-pivot keystone discharged the Cholesky +success condition. + +The linear-mode kernel is `K[i,j] = 1 + scale · ⟨Φ i, Φ j⟩` with `Φ` the column-masked data, i.e. + +`K = 𝟙𝟙ᵀ + scale · Φ·Φᵀ`, + +a sum of the all-ones matrix (a rank-one Gram, PSD) and a scaled Gram matrix `Φ·Φᵀ` (PSD for +`scale ≥ 0` by `posSemidef_self_mul_conjTranspose`). `PosSemidef.add` / `PosSemidef.smul` finish it. + +* `linearKernelFn_posSemidef` — `(Matrix.of (linearKernelFn X w scale)).PosSemidef` for `0 ≤ scale`. +* `linearKernelFn_symm` — `K` is symmetric (a corollary, via `PosSemidef.isHermitian`). +* `linearKernelSpec_posSemidef` — the tensor-level statement, the form the solve theorems consume. + +Quadratic mode (`PosSemidef.hadamard`, the Schur product theorem) and Gaussian mode (Bochner / +Schoenberg, not in Mathlib v4.30.0) are the natural follow-ons. +-/ + +@[expose] public section + +namespace Spec.Factorization + +open Matrix +open scoped BigOperators +open Spec.Factorization.Reconstruction + +variable {n d : Nat} + +/-- Over `ℝ`, `Φᴴ = Φᵀ` (the star is trivial), for any rectangular matrix. 
-/ +private theorem conjTranspose_eq_transpose {m k : Nat} (Φ : Matrix (Fin m) (Fin k) ℝ) : + (Φᴴ : Matrix (Fin k) (Fin m) ℝ) = Φᵀ := by + ext a b; simp [Matrix.conjTranspose_apply, Matrix.transpose_apply] + +/-- The Gram matrix `Φ·Φᵀ` is positive-semidefinite (real form of +`posSemidef_self_mul_conjTranspose`). -/ +private theorem posSemidef_mul_transpose_self {m k : Nat} (Φ : Matrix (Fin m) (Fin k) ℝ) : + (Φ * Φᵀ).PosSemidef := by + have h := Matrix.posSemidef_self_mul_conjTranspose Φ + rwa [conjTranspose_eq_transpose Φ] at h + +/-- **The linear-mode kernel is symmetric positive-semidefinite.** For data `X`, selection mask `w`, +and `scale ≥ 0`, `K = 𝟙𝟙ᵀ + scale·Φ·Φᵀ` is PSD — discharging the `PosSemidef` hypothesis of the CHD +solve / `find_gamma` development for the real linear kernel. -/ +theorem linearKernelFn_posSemidef (X : Fin n → Fin d → ℝ) (w : Fin d → ℝ) {scale : ℝ} + (hscale : 0 ≤ scale) : (Matrix.of (Spec.linearKernelFn X w scale)).PosSemidef := by + -- the masked data as a matrix, and the all-ones column + set Φ : Matrix (Fin n) (Fin d) ℝ := Matrix.of (Spec.maskColsFn X w) with hΦ + set Ψ : Matrix (Fin n) (Fin 1) ℝ := Matrix.of (fun _ _ => 1) with hΨ + -- `K = Ψ·Ψᵀ + scale • (Φ·Φᵀ)` + have hKeq : Matrix.of (Spec.linearKernelFn X w scale) = Ψ * Ψᵀ + scale • (Φ * Φᵀ) := by + ext i j + simp only [Matrix.of_apply, Matrix.add_apply, Matrix.smul_apply, smul_eq_mul, + Matrix.mul_apply, Matrix.transpose_apply, Spec.linearKernelFn, hΦ, hΨ] + rw [dotFn_eq_sum, Fin.sum_univ_one] + simp only [Spec.maskColsFn] + ring + rw [hKeq] + exact (posSemidef_mul_transpose_self Ψ).add ((posSemidef_mul_transpose_self Φ).smul hscale) + +/-- The linear-mode kernel is symmetric: `K[i,j] = K[j,i]`. 
-/ +theorem linearKernelFn_symm (X : Fin n → Fin d → ℝ) (w : Fin d → ℝ) {scale : ℝ} + (hscale : 0 ≤ scale) (i j : Fin n) : + Spec.linearKernelFn X w scale i j = Spec.linearKernelFn X w scale j i := by + have h := (linearKernelFn_posSemidef X w hscale).isHermitian + have e : (Matrix.of (Spec.linearKernelFn X w scale))ᴴ i j + = (Matrix.of (Spec.linearKernelFn X w scale)) i j := by rw [h] + simpa [Matrix.conjTranspose_apply, Matrix.of_apply] using e.symm + +/-- **Tensor-level: the linear-mode kernel is positive-semidefinite.** The form the verified solve +consumes: `(Matrix.of (toMatFn (linearKernelSpec X w scale))).PosSemidef` for `scale ≥ 0`, so e.g. +`solveRidgeSpec (linearKernelSpec X w scale) γ b` is the exact regularized solve for any `γ > 0`. -/ +theorem linearKernelSpec_posSemidef (X : Spec.Tensor ℝ (.dim n (.dim d .scalar))) + (w : Spec.Tensor ℝ (.dim d .scalar)) {scale : ℝ} (hscale : 0 ≤ scale) : + (Matrix.of (Spec.toMatFn (Spec.linearKernelSpec X w scale))).PosSemidef := by + have hround : Spec.toMatFn (Spec.linearKernelSpec X w scale) + = Spec.linearKernelFn (Spec.toMatFn X) (Spec.toVecFn w) scale := by + funext i j; rfl + rw [hround] + exact linearKernelFn_posSemidef _ _ hscale + +end Spec.Factorization diff --git a/NN/Spec/Core/Tensor/Factorizations.lean b/NN/Spec/Core/Tensor/Factorizations.lean index b06f667..eac2076 100644 --- a/NN/Spec/Core/Tensor/Factorizations.lean +++ b/NN/Spec/Core/Tensor/Factorizations.lean @@ -430,4 +430,36 @@ def varNoiseSpec {n : Nat} (evals : Tensor α (.dim n .scalar)) (V : Tensor α (.dim n (.dim n .scalar))) (γ : α) (ga : Tensor α (.dim n .scalar)) : α := varNoiseFn (toVecFn evals) γ (projFn (toMatFn V) (toVecFn ga)) +/-! ## CHD mode kernels (`Modes/kernels.py`) + +Everything above takes the kernel matrix `K` as input, assuming it is symmetric positive-semidefinite. +CHD *builds* `K` from data: for each pair of samples, a mode kernel evaluates a feature inner product. 
+The simplest is the **linear mode** (`LinearMode.vectorized_kernel`): + +`K[i,j] = 1 + scale · Σ_k (which_dim_k · X[i,k]) · X[j,k]`, + +a constant `1` plus a scaled Gram matrix of the (column-masked) data. For the binary selection mask +`which_dim_k ∈ {0,1}` CHD uses, `which_dim_k² = which_dim_k`, hence +`which_dim_k · X[i,k] · X[j,k] = (which_dim_k X[i,k])·(which_dim_k X[j,k])`, +so `K = 𝟙𝟙ᵀ + scale · Φ·Φᵀ` with `Φ` the masked data — manifestly symmetric positive-semidefinite for +`scale ≥ 0`. That PSD fact (proved in `FactorizationsKernels`) discharges the standing `PosSemidef` +hypothesis of the solve/`find_gamma` development for the real linear kernel. -/ + +/-- Column-mask the data matrix by a per-feature weight `w` (CHD `which_dim`): zero out / scale +feature `k` by `w k`. -/ +def maskColsFn {n d : Nat} (X : Fin n → Fin d → α) (w : Fin d → α) : Fin n → Fin d → α := + fun i k => w k * X i k + +/-- Linear-mode kernel matrix `K[i,j] = 1 + scale · ⟨Φ i, Φ j⟩`, `Φ = maskColsFn X w` the masked data. +For a binary selection mask this is exactly CHD `LinearMode.vectorized_kernel`. -/ +def linearKernelFn {n d : Nat} (X : Fin n → Fin d → α) (w : Fin d → α) (scale : α) : + Fin n → Fin n → α := + fun i j => 1 + scale * dotFn (maskColsFn X w i) (maskColsFn X w j) + +/-- Tensor-level linear-mode kernel: `K = 𝟙𝟙ᵀ + scale · Φ·Φᵀ` from data `X` and selection mask `w`. + +PyTorch analogue (for a binary mask `which_dim`): `1 + scale * (X * which_dim).matmul(X.T)`.
-/ +def linearKernelSpec {n d : Nat} (X : Tensor α (.dim n (.dim d .scalar))) + (w : Tensor α (.dim d .scalar)) (scale : α) : Tensor α (.dim n (.dim n .scalar)) := + ofMatFn (linearKernelFn (toMatFn X) (toVecFn w) scale) + end Spec diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index 1c0b959..7a3dcb9 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -227,6 +227,37 @@ feeding the *wrong* eigenvectors (the identity in place of `V`) makes the solve `γ < 0` pushes the `noise` outside `[0,1]` — so the true eigendecomposition and `γ > 0` are both necessary. +# Building the kernel: the linear mode is positive-semidefinite + +Every result above takes the kernel `K` as input *under the hypothesis* that it is +positive-semidefinite — the solve needs `K + γI` to be SPD, the noise bound needs `λᵢ ≥ 0`. But CHD does not +receive `K`; it *builds* it from data (`Modes/kernels.py`). Discharging that standing `PosSemidef` +hypothesis for the kernels CHD actually constructs is the same move as the positive-pivot keystone: +turn an assumed precondition into a theorem. +[`NN.Proofs.Tensor.Basic.FactorizationsKernels`](https://github.com/lean-dojo/TorchLean/blob/main/NN/Proofs/Tensor/Basic/FactorizationsKernels.lean) +takes the first and simplest mode. The *linear* kernel is + +$$`K[i,j] = 1 + \texttt{scale}\cdot\langle \Phi_i, \Phi_j\rangle, +\qquad K = \mathbf{1}\mathbf{1}^\top + \texttt{scale}\cdot \Phi\,\Phi^\top,` + +with `Φ` the column-masked data (`which_dim`).
`linearKernelFn_posSemidef` proves this is symmetric +positive-semidefinite whenever `scale ≥ 0`: the all-ones matrix `𝟙𝟙ᵀ` is a rank-one Gram matrix (hence PSD), +`Φ Φᵀ` is a Gram matrix (PSD by `posSemidef_self_mul_conjTranspose`), `scale ≥ 0` keeps the scaled +matrix PSD (`PosSemidef.smul`), and `PosSemidef.add` closes the sum. Symmetry (`linearKernelFn_symm`) is then +a corollary of `PosSemidef.isHermitian`. Composed with the solve development, this makes +`solveRidgeSpec (linearKernelSpec X w scale) γ b` an *unconditional* exact solve for `γ > 0` — no PSD +hypothesis left to assume. The `LinearKernel` example confirms `K = Kᵀ`, the match with the CHD +`LinearMode` formula, all-nonnegative Jacobi eigenvalues (also when a feature is masked out), and the +downstream exact ridge solve; its *negative control* takes `scale = -1`, where `𝟙𝟙ᵀ − Φ Φᵀ` is +indefinite and a Jacobi eigenvalue goes negative — so the hypothesis `scale ≥ 0` cannot be dropped. + +The other two modes are the natural follow-ons. The *quadratic* kernel involves the entrywise +(Hadamard) square of a PSD matrix, which is PSD by the *Schur product theorem* (`PosSemidef.hadamard`, +available in Mathlib), so the kernel is PSD under the hyperparameter sign conditions. The *Gaussian* (RBF) kernel +`exp(-(xᵢ-xⱼ)²/(2ℓ²))` is PSD by Schoenberg's theorem — writing `exp(xy/ℓ²)` as a power series whose +terms are Hadamard powers of a rank-one Gram matrix — but Mathlib v4.30.0 has no Bochner/Gaussian-kernel PSD +theory, so it is the new, genuinely research-grade item, parallel to the cyclic-Jacobi rate.
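
The linear-mode argument comes down to the quadratic form `vᵀ K v = (𝟙ᵀ v)² + scale·‖Φᵀ v‖²`, a sum of squares when `scale ≥ 0`. The following is a minimal numeric sketch of that identity in core Lean 4 over plain lists, for the unmasked case. It is not the Lean proof, which goes through `PosSemidef.add` and `PosSemidef.smul`. All names (`dot`, `linearKernel`, `quadForm`, `featureSums`, `gramForm`) and the test vector `v` are hypothetical and introduced only for illustration; the data matrix is the `4 × 2` example used in `LinearKernel.lean`.

```lean
-- Hypothetical standalone sketch (core Lean 4 only); none of these names are TorchLean API.

def dot (xs ys : List Float) : Float :=
  (List.zipWith (fun (x y : Float) => x * y) xs ys).foldl
    (fun (acc t : Float) => acc + t) 0.0

/-- `K[i][j] = 1 + scale * ⟨xᵢ, xⱼ⟩` for unmasked rows. -/
def linearKernel (scale : Float) (rows : List (List Float)) : List (List Float) :=
  rows.map (fun xi => rows.map (fun xj => 1.0 + scale * dot xi xj))

/-- The quadratic form `vᵀ K v`. -/
def quadForm (K : List (List Float)) (v : List Float) : Float :=
  dot v (K.map (fun row => dot row v))

/-- `Φᵀ v`: for each feature `k`, the sum `Σᵢ vᵢ * X[i][k]`. -/
def featureSums (rows : List (List Float)) (v : List Float) : List Float :=
  let d := (rows.headD []).length
  (List.zip v rows).foldl
    (fun (acc : List Float) (p : Float × List Float) =>
      List.zipWith (fun (a x : Float) => a + p.1 * x) acc p.2)
    (List.replicate d 0.0)

/-- The same form written as `(𝟙ᵀ v)² + scale * ‖Φᵀ v‖²`. -/
def gramForm (scale : Float) (rows : List (List Float)) (v : List Float) : Float :=
  let s  := v.foldl (fun (acc t : Float) => acc + t) 0.0
  let fs := featureSums rows v
  s * s + scale * dot fs fs

def X : List (List Float) := [[1, 0], [0, 1], [1, 1], [2, 1]]
def v : List Float := [1, -1, 0.5, -0.5]

-- Both evaluate to 2.5 on this data: 𝟙ᵀ v = 0, Φᵀ v = (0.5, -1), and 2 * 1.25 = 2.5.
#eval quadForm (linearKernel 2.0 X) v
#eval gramForm 2.0 X v
```

With a negative `scale` the second summand can outweigh the first, which is how the `scale = -1` negative control produces a negative eigenvalue.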
+ # The a-posteriori residual certificate For the iterative routines, the replacement for an impossible a-priori convergence proof is an exact From 4a1c17c1e24a46d0b86833c57e2f9246753c7cc8 Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sun, 31 May 2026 14:45:20 -0700 Subject: [PATCH 13/22] Prove the CHD quadratic-mode kernel is positive-semidefinite MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Discharge the standing PosSemidef hypothesis for the second of CHD's non-interpolatory kernels. The quadratic mode K[i,j] = scale·(alpha + ⟨Φ i, Φ j⟩)² + (1 − alpha²·scale) expands algebraically to K = 𝟙𝟙ᵀ + (2·scale·alpha)·Φ·Φᵀ + scale·(Φ·Φᵀ ⊙ Φ·Φᵀ), a sum of three PSD pieces: the all-ones Gram, a nonnegative multiple of the data Gram Φ·Φᵀ, and a nonnegative multiple of its Hadamard square — PSD by the Schur product theorem (PosSemidef.hadamard). So K is PSD for scale ≥ 0 and alpha ≥ 0. - Spec: quadraticKernelFn / quadraticKernelSpec (NN/Spec/.../Factorizations.lean), the exact CHD QuadraticMode.vectorized_kernel. - Proof: quadraticKernelFn_posSemidef + _symm + tensor-level quadraticKernelSpec_posSemidef (FactorizationsKernels), sorry/omega-free. - Example: NN/Examples/Factorization/QuadraticKernel.lean — 8 green checks (symmetry, CHD-formula match, all eigenvalues ≥ 0 with masking preserved, exact ridge solve), two genuine negative controls (alpha = −1 → 2 negative eigenvalues, scale = −1 → 3). - Blueprint: dedicated quadratic-mode section (Ch4 Factorizations). Gaussian mode (Schoenberg/Bochner, absent from Mathlib v4.30.0) remains the research-grade follow-on. 
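
The entrywise algebra behind the expansion stated in this message can be checked in isolation. A minimal sketch, assuming a project with Mathlib available; `m` is a hypothetical placeholder for the scalar `⟨Φ i, Φ j⟩`, and this `example` is an illustration, not a lemma from the patch.

```lean
import Mathlib.Tactic.Ring

-- Entrywise identity: the constant `1` gives 𝟙𝟙ᵀ, the `m` term gives Φ·Φᵀ,
-- and `m * m` is the entry of the Hadamard square Φ·Φᵀ ⊙ Φ·Φᵀ.
example {R : Type} [CommRing R] (scale alpha m : R) :
    scale * (alpha + m) ^ 2 + (1 - alpha ^ 2 * scale)
      = 1 + (2 * scale * alpha) * m + scale * (m * m) := by
  ring
```

The `scale·alpha²` produced by the square cancels against the `−alpha²·scale` offset, which is why the constant term is exactly `1`.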
Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Examples/Factorization.lean | 8 + .../Factorization/QuadraticKernel.lean | 150 ++++++++++++++++++ .../Tensor/Basic/FactorizationsKernels.lean | 63 +++++++- NN/Spec/Core/Tensor/Factorizations.lean | 18 +++ .../Ch4_Verification/Factorizations.lean | 31 +++- 5 files changed, 262 insertions(+), 8 deletions(-) create mode 100644 NN/Examples/Factorization/QuadraticKernel.lean diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean index bae41f9..424a2f1 100644 --- a/NN/Examples/Factorization.lean +++ b/NN/Examples/Factorization.lean @@ -16,6 +16,7 @@ public import NN.Examples.Factorization.JacobiRate public import NN.Examples.Factorization.RidgeSolve public import NN.Examples.Factorization.Variational public import NN.Examples.Factorization.LinearKernel +public import NN.Examples.Factorization.QuadraticKernel /-! # Matrix factorization examples @@ -66,6 +67,13 @@ factorization misbehaves. theorem assumes. Checks: `K = Kᵀ`, matches the CHD `LinearMode` formula, all Jacobi eigenvalues `≥ 0` (masking a feature preserved), and the PSD kernel feeds an exact ridge solve; **negative control**: `scale < 0` makes `K` indefinite (a negative eigenvalue appears). +- `QuadraticKernel` — CHD's *quadratic* mode (`Modes/kernels.py`), + `K = scale·(alpha + Φ·Φᵀ)² + (1 − alpha²·scale) = 𝟙𝟙ᵀ + (2·scale·alpha)·Φ·Φᵀ + scale·(Φ·Φᵀ ⊙ Φ·Φᵀ)`, + proven symmetric positive-semidefinite for `scale ≥ 0` and `alpha ≥ 0` via the **Schur product + theorem** on the Hadamard square (`quadraticKernelFn_posSemidef`). Checks mirror the linear mode: + `K = Kᵀ`, matches the CHD `QuadraticMode` formula, all Jacobi eigenvalues `≥ 0` (masking preserved), + PSD kernel feeds an exact ridge solve; **negative controls**: both `alpha < 0` and `scale < 0` make + `K` indefinite, so both bounds are necessary. 
Both **positive** checks (a valid factorization reconstructs to `err ≈ 0`) and **negative controls** (the same metric reports a large error / `NaN` when a hypothesis is violated) are included, so a diff --git a/NN/Examples/Factorization/QuadraticKernel.lean b/NN/Examples/Factorization/QuadraticKernel.lean new file mode 100644 index 0000000..4b6196e --- /dev/null +++ b/NN/Examples/Factorization/QuadraticKernel.lean @@ -0,0 +1,150 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Examples.Factorization.Common +meta import NN.Examples.Factorization.Common + +/-! +# Example: the CHD quadratic-mode kernel is symmetric positive-semidefinite + +These checks corroborate `NN.Proofs.Tensor.Basic.FactorizationsKernels`. As with the linear mode, the +whole verified CHD solve / `find_gamma` / `Z_test` development assumes the kernel `K` is +positive-semidefinite, and CHD *builds* `K` from data (`Modes/kernels.py`). For the quadratic mode, + +`K[i,j] = scale · (alpha + ⟨Φ i, Φ j⟩)² + (1 − alpha²·scale)` (`Φ` = column-masked data), + +which expands to `K = 𝟙𝟙ᵀ + (2·scale·alpha)·Φ·Φᵀ + scale·(Φ·Φᵀ ⊙ Φ·Φᵀ)` — the last term a **Hadamard +square**, PSD by the Schur product theorem. `quadraticKernelFn_posSemidef` proves `K` PSD for +`scale ≥ 0` and `alpha ≥ 0`, discharging that standing hypothesis for the real quadratic kernel. 
We +exhibit: + +* **symmetric** — `K = Kᵀ` to machine precision (`quadraticKernelFn_symm`); +* **matches CHD** — `K[i,j] = scale·(alpha + ⟨xᵢ, xⱼ⟩)² + (1 − alpha²·scale)` agrees with the direct + `QuadraticMode.vectorized_kernel` formula; +* **positive-semidefinite** — every Jacobi eigenvalue is `≥ 0` (the numeric witness of + `quadraticKernelFn_posSemidef`), and masking a feature (`w = [1,0]`) keeps it PSD; +* **feeds the verified solve** — because `K` is PSD, `solveRidgeSpec K γ b` is the exact regularized + solve for `γ > 0` (`(K+γ·I)·x = b` to machine precision). + +**Negative controls**: with `alpha < 0` the middle term `2·scale·alpha·Φ·Φᵀ` goes negative (the +diagonal `scale·alpha² + … ` drops below zero) and with `scale < 0` the whole quadratic part flips sign +— in both cases a Jacobi eigenvalue goes negative, so `scale ≥ 0` *and* `alpha ≥ 0` are both necessary. +-/ + +@[expose] public section + + +namespace NN.Examples.Factorization.QuadraticKernel + +/-- Build a length-`n` `Float` vector from a list (missing entries `0`). -/ +def mkVec {n : Nat} (xs : List Float) : Spec.Tensor Float (.dim n .scalar) := + Spec.ofVecFn (fun i => xs.getD i.val 0.0) + +/-- Count Jacobi eigenvalues that are negative (below `−10⁻⁹`). `0` certifies positive-semidefiniteness +numerically; `≥ 1` certifies an indefinite matrix. -/ +def numNegEigs {k : Nat} (M : Spec.Tensor Float (.dim k (.dim k .scalar))) : Float := + let evals := (Spec.symEigJacobiSpec M 12).1 + (List.finRange k).foldl + (fun a i => a + (if Spec.Tensor.toScalar (Spec.get evals i) < -1e-9 then 1.0 else 0.0)) 0.0 + +/-- The regularized matrix `K + γ·I` as a tensor. -/ +def addGammaI {n : Nat} (K : Spec.Tensor Float (.dim n (.dim n .scalar))) (γ : Float) : + Spec.Tensor Float (.dim n (.dim n .scalar)) := + Spec.ofMatFn (fun i j => Spec.get2 K i j + (if i.val == j.val then γ else 0.0)) + +/-- `ℓ¹` magnitude `Σᵢ |vᵢ|` (a sum, so a `NaN` propagates). 
-/ +def vecAbsErr {n : Nat} (v : Spec.Tensor Float (.dim n .scalar)) : Float := + (List.finRange n).foldl (fun a i => a + Float.abs (Spec.Tensor.toScalar (Spec.get v i))) 0.0 + +/-- Residual `(K + γ·I)·x − b`. -/ +def ridgeResidual {n : Nat} (K : Spec.Tensor Float (.dim n (.dim n .scalar))) (γ : Float) + (b x : Spec.Tensor Float (.dim n .scalar)) : Spec.Tensor Float (.dim n .scalar) := + Spec.ofVecFn (fun i => + Spec.Tensor.toScalar (Spec.get (Spec.matVecMulSpec (addGammaI K γ) x) i) + - Spec.Tensor.toScalar (Spec.get b i)) + +/-- A `4 × 2` data matrix (4 samples, 2 features). -/ +def X : Spec.Tensor Float (.dim 4 (.dim 2 .scalar)) := + mkMat [[1, 0], + [0, 1], + [1, 1], + [2, 1]] + +/-- Selection mask `which_dim = [1,1]` (both features active). -/ +def wAll : Spec.Tensor Float (.dim 2 .scalar) := mkVec [1, 1] +/-- Kernel scale (CHD `QuadraticMode._scale`). -/ +def scale : Float := 2.0 +/-- Quadratic offset (CHD `alpha = 0.5·scales["linear"]/scale`; here `0.5·2.0/2.0 = 0.5`). -/ +def alpha : Float := 0.5 + +/-- The quadratic-mode kernel `K = scale·(alpha + Φ·Φᵀ)² + (1 − alpha²·scale)` (4×4). -/ +def K : Spec.Tensor Float (.dim 4 (.dim 4 .scalar)) := Spec.quadraticKernelSpec X wAll scale alpha + +/-- Direct CHD `QuadraticMode.vectorized_kernel` formula (mask all-ones): +`Kref[i,j] = scale·(alpha + Σ_k X[i,k]·X[j,k])² + (1 − alpha²·scale)`. -/ +def Kref : Spec.Tensor Float (.dim 4 (.dim 4 .scalar)) := + Spec.ofMatFn (fun i j => + let m := (List.finRange 2).foldl (fun a k => a + Spec.get2 X i k * Spec.get2 X j k) 0.0 + scale * (alpha + m) ^ 2 + (1.0 - alpha ^ 2 * scale)) + +#eval IO.println s!"quadratic kernel K =\n{(List.finRange 4).map (fun i => + (List.finRange 4).map (fun j => Spec.get2 K i j))}" +#eval IO.println s!"eigenvalues of K = {vecToList (Spec.symEigJacobiSpec K 12).1}" + +-- Positive — `K` is symmetric (`quadraticKernelFn_symm`). 
+#eval assertLt "quadratic kernel is symmetric: K = Kᵀ" (maxMatErr K (tr K)) + +-- Positive — `K` matches the direct CHD `QuadraticMode` formula. +#eval assertLt "quadratic kernel matches CHD QuadraticMode formula" (maxMatErr K Kref) + +-- Positive — `K` is PSD: no negative Jacobi eigenvalue (`quadraticKernelFn_posSemidef`). +#eval assertLt "quadratic kernel is PSD: no negative eigenvalue" (numNegEigs K) + +/-! ## Masking a feature preserves PSD -/ + +/-- Mask out feature 1: `which_dim = [1,0]`. -/ +def wMask : Spec.Tensor Float (.dim 2 .scalar) := mkVec [1, 0] +def Kmask : Spec.Tensor Float (.dim 4 (.dim 4 .scalar)) := Spec.quadraticKernelSpec X wMask scale alpha + +#eval IO.println s!"masked-feature kernel eigenvalues = {vecToList (Spec.symEigJacobiSpec Kmask 12).1}" + +-- Positive — masking a feature keeps the kernel PSD (PSD holds for any mask). +#eval assertLt "masked quadratic kernel is still PSD" (numNegEigs Kmask) + +/-! ## The PSD kernel feeds the verified ridge solve -/ + +def γ : Float := 0.5 +def b : Spec.Tensor Float (.dim 4 .scalar) := mkVec [1, 2, 3, 4] +/-- The ridge solution against the quadratic kernel `K`. -/ +def x : Spec.Tensor Float (.dim 4 .scalar) := Spec.solveRidgeSpec K γ b + +#eval IO.println s!"ridge solve on the quadratic kernel: residual = {vecToList (ridgeResidual K γ b x)}" + +-- Positive — `K` PSD ⟹ `solveRidgeSpec K γ b` is the exact solve of `(K+γI)·x = b` (γ > 0). +#eval assertLt "PSD quadratic kernel ⟹ exact ridge solve (K+γI)·x = b" + (vecAbsErr (ridgeResidual K γ b x)) + +/-! ## Negative controls: `alpha < 0` and `scale < 0` break positive-semidefiniteness -/ + +/-- The same kernel with `alpha = −1`: the linear term `2·scale·alpha·Φ·Φᵀ` is negative. 
-/ +def Kalpha : Spec.Tensor Float (.dim 4 (.dim 4 .scalar)) := Spec.quadraticKernelSpec X wAll scale (-1.0) + +#eval IO.println s!"alpha = -1 kernel eigenvalues = {vecToList (Spec.symEigJacobiSpec Kalpha 12).1}" + +-- Negative — with `alpha < 0` the kernel is indefinite: at least one eigenvalue is negative. +#eval assertGe "alpha < 0 breaks PSD (indefinite kernel)" (numNegEigs Kalpha) 1.0 + +/-- The same kernel with `scale = −1`: the whole quadratic part flips sign. -/ +def Kscale : Spec.Tensor Float (.dim 4 (.dim 4 .scalar)) := Spec.quadraticKernelSpec X wAll (-1.0) alpha + +#eval IO.println s!"scale = -1 kernel eigenvalues = {vecToList (Spec.symEigJacobiSpec Kscale 12).1}" + +-- Negative — with `scale < 0` the kernel is indefinite: at least one eigenvalue is negative. +#eval assertGe "scale < 0 breaks PSD (indefinite kernel)" (numNegEigs Kscale) 1.0 + +end NN.Examples.Factorization.QuadraticKernel diff --git a/NN/Proofs/Tensor/Basic/FactorizationsKernels.lean b/NN/Proofs/Tensor/Basic/FactorizationsKernels.lean index dc399aa..fd698b3 100644 --- a/NN/Proofs/Tensor/Basic/FactorizationsKernels.lean +++ b/NN/Proofs/Tensor/Basic/FactorizationsKernels.lean @@ -9,6 +9,7 @@ module public import NN.Proofs.Tensor.Basic.Factorizations public import NN.Proofs.Tensor.Basic.FactorizationsOrthonormal public import Mathlib.Data.Real.StarOrdered +public import Mathlib.Analysis.Matrix.Order /-! # CHD mode kernels are symmetric positive-semidefinite @@ -30,8 +31,21 @@ a sum of the all-ones matrix (a rank-one Gram, PSD) and a scaled Gram matrix `Φ * `linearKernelFn_symm` — `K` is symmetric (a corollary, via `PosSemidef.isHermitian`). * `linearKernelSpec_posSemidef` — the tensor-level statement, the form the solve theorems consume. -Quadratic mode (`PosSemidef.hadamard`, the Schur product theorem) and Gaussian mode (Bochner / -Schoenberg, not in Mathlib v4.30.0) are the natural follow-ons. 
+The **quadratic mode** is `K[i,j] = scale·(alpha + ⟨Φ i, Φ j⟩)² + (1 − alpha²·scale)`, which expands +algebraically to + +`K = 𝟙𝟙ᵀ + (2·scale·alpha)·Φ·Φᵀ + scale·(Φ·Φᵀ ⊙ Φ·Φᵀ)`, + +a sum of: the all-ones Gram (PSD), a nonnegative multiple of the Gram `Φ·Φᵀ` (PSD), and a nonnegative +multiple of the **Hadamard square** of that Gram — PSD by the **Schur product theorem** +(`PosSemidef.hadamard`). So `K` is PSD whenever `scale ≥ 0` and `alpha ≥ 0`. + +* `quadraticKernelFn_posSemidef` — `(Matrix.of (quadraticKernelFn X w scale alpha)).PosSemidef` for + `0 ≤ scale` and `0 ≤ alpha`. +* `quadraticKernelFn_symm` / `quadraticKernelSpec_posSemidef` — symmetry and the tensor-level form. + +Gaussian mode (Bochner / Schoenberg positive-definiteness, not in Mathlib v4.30.0) is the natural +remaining follow-on. -/ @[expose] public section @@ -96,4 +110,49 @@ theorem linearKernelSpec_posSemidef (X : Spec.Tensor ℝ (.dim n (.dim d .scalar rw [hround] exact linearKernelFn_posSemidef _ _ hscale +/-- **The quadratic-mode kernel is positive-semidefinite.** For data `X`, selection mask `w`, and +`scale ≥ 0`, `alpha ≥ 0`, `K[i,j] = scale·(alpha + ⟨Φ i, Φ j⟩)² + (1 − alpha²·scale)` is PSD. The proof +expands `K = 𝟙𝟙ᵀ + (2·scale·alpha)·Φ·Φᵀ + scale·(Φ·Φᵀ ⊙ Φ·Φᵀ)` and adds three PSD pieces, the last via +the **Schur product theorem** `PosSemidef.hadamard`. 
-/ +theorem quadraticKernelFn_posSemidef (X : Fin n → Fin d → ℝ) (w : Fin d → ℝ) {scale alpha : ℝ} + (hscale : 0 ≤ scale) (halpha : 0 ≤ alpha) : + (Matrix.of (Spec.quadraticKernelFn X w scale alpha)).PosSemidef := by + -- the masked data as a matrix, the all-ones column, and the data Gram `M = Φ·Φᵀ` + set Φ : Matrix (Fin n) (Fin d) ℝ := Matrix.of (Spec.maskColsFn X w) with hΦ + set Ψ : Matrix (Fin n) (Fin 1) ℝ := Matrix.of (fun _ _ => 1) with hΨ + -- `K = Ψ·Ψᵀ + (2·scale·alpha)·(Φ·Φᵀ) + scale·((Φ·Φᵀ) ⊙ (Φ·Φᵀ))` + have hKeq : Matrix.of (Spec.quadraticKernelFn X w scale alpha) + = Ψ * Ψᵀ + (2 * scale * alpha) • (Φ * Φᵀ) + scale • ((Φ * Φᵀ) ⊙ (Φ * Φᵀ)) := by + ext i j + simp only [Matrix.of_apply, Matrix.add_apply, Matrix.smul_apply, smul_eq_mul, + Matrix.mul_apply, Matrix.transpose_apply, Matrix.hadamard_apply, Spec.quadraticKernelFn, hΦ, hΨ] + rw [dotFn_eq_sum, Fin.sum_univ_one] + simp only [Spec.maskColsFn] + ring + rw [hKeq] + have hM : (Φ * Φᵀ).PosSemidef := posSemidef_mul_transpose_self Φ + have hc : (0 : ℝ) ≤ 2 * scale * alpha := by positivity + exact ((posSemidef_mul_transpose_self Ψ).add (hM.smul hc)).add ((hM.hadamard hM).smul hscale) + +/-- The quadratic-mode kernel is symmetric: `K[i,j] = K[j,i]`. -/ +theorem quadraticKernelFn_symm (X : Fin n → Fin d → ℝ) (w : Fin d → ℝ) {scale alpha : ℝ} + (hscale : 0 ≤ scale) (halpha : 0 ≤ alpha) (i j : Fin n) : + Spec.quadraticKernelFn X w scale alpha i j = Spec.quadraticKernelFn X w scale alpha j i := by + have h := (quadraticKernelFn_posSemidef X w hscale halpha).isHermitian + have e : (Matrix.of (Spec.quadraticKernelFn X w scale alpha))ᴴ i j + = (Matrix.of (Spec.quadraticKernelFn X w scale alpha)) i j := by rw [h] + simpa [Matrix.conjTranspose_apply, Matrix.of_apply] using e.symm + +/-- **Tensor-level: the quadratic-mode kernel is positive-semidefinite.** The form the verified solve +consumes, so e.g. 
`solveRidgeSpec (quadraticKernelSpec X w scale alpha) γ b` is the exact regularized +solve for any `γ > 0` whenever `scale ≥ 0` and `alpha ≥ 0`. -/ +theorem quadraticKernelSpec_posSemidef (X : Spec.Tensor ℝ (.dim n (.dim d .scalar))) + (w : Spec.Tensor ℝ (.dim d .scalar)) {scale alpha : ℝ} (hscale : 0 ≤ scale) (halpha : 0 ≤ alpha) : + (Matrix.of (Spec.toMatFn (Spec.quadraticKernelSpec X w scale alpha))).PosSemidef := by + have hround : Spec.toMatFn (Spec.quadraticKernelSpec X w scale alpha) + = Spec.quadraticKernelFn (Spec.toMatFn X) (Spec.toVecFn w) scale alpha := by + funext i j; rfl + rw [hround] + exact quadraticKernelFn_posSemidef _ _ hscale halpha + end Spec.Factorization diff --git a/NN/Spec/Core/Tensor/Factorizations.lean b/NN/Spec/Core/Tensor/Factorizations.lean index eac2076..20c1c8a 100644 --- a/NN/Spec/Core/Tensor/Factorizations.lean +++ b/NN/Spec/Core/Tensor/Factorizations.lean @@ -462,4 +462,22 @@ def linearKernelSpec {n d : Nat} (X : Tensor α (.dim n (.dim d .scalar))) (w : Tensor α (.dim d .scalar)) (scale : α) : Tensor α (.dim n (.dim n .scalar)) := ofMatFn (linearKernelFn (toMatFn X) (toVecFn w) scale) +/-- Quadratic-mode kernel matrix `K[i,j] = scale · (alpha + ⟨Φ i, Φ j⟩)² + (1 − alpha²·scale)`, +`Φ = maskColsFn X w` the masked data. This is exactly CHD `QuadraticMode.vectorized_kernel`. Algebraically +it expands to `K = 𝟙𝟙ᵀ + (2·scale·alpha)·Φ·Φᵀ + scale·(Φ·Φᵀ ⊙ Φ·Φᵀ)` (the last a Hadamard square), so it +is positive-semidefinite for `scale ≥ 0` and `alpha ≥ 0` by the Schur product theorem — see +`FactorizationsKernels`. The square is written as a product to stay polymorphic over `Context α`. 
-/ +def quadraticKernelFn {n d : Nat} (X : Fin n → Fin d → α) (w : Fin d → α) (scale alpha : α) : + Fin n → Fin n → α := + fun i j => + let m := dotFn (maskColsFn X w i) (maskColsFn X w j) + scale * ((alpha + m) * (alpha + m)) + (1 - alpha * alpha * scale) + +/-- Tensor-level quadratic-mode kernel from data `X`, selection mask `w`, `scale`, and offset `alpha`. + +PyTorch analogue: `scale * (alpha + (X * which_dim).matmul(X.T))**2 + (1 - alpha**2 * scale)`. -/ +def quadraticKernelSpec {n d : Nat} (X : Tensor α (.dim n (.dim d .scalar))) + (w : Tensor α (.dim d .scalar)) (scale alpha : α) : Tensor α (.dim n (.dim n .scalar)) := + ofMatFn (quadraticKernelFn (toMatFn X) (toVecFn w) scale alpha) + end Spec diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index 7a3dcb9..f764e37 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -251,12 +251,31 @@ hypothesis left to assume. The `LinearKernel` example confirms `K = Kᵀ`, the m downstream exact ridge solve; its *negative control* takes `scale = -1`, where `𝟙𝟙ᵀ − Φ Φᵀ` is indefinite and a Jacobi eigenvalue goes negative — so `scale ≥ 0` is necessary. -The other two modes are the natural follow-ons. The *quadratic* kernel is an entrywise square of a -PSD matrix, so it is PSD by the *Schur product theorem* (`PosSemidef.hadamard`, available in -Mathlib), under the hyperparameter sign conditions. The *Gaussian* (RBF) kernel -`exp(-(xᵢ-xⱼ)²/2\ell²)` is PSD by Schoenberg's theorem — writing `exp(xy/\ell²)` as a power series whose -terms are Hadamard powers of a rank-one Gram — but Mathlib v4.30.0 has no Bochner/Gaussian-kernel PSD -theory, so it is the new honest research-grade item, parallel to the cyclic-Jacobi rate. 
+# Building the kernel: the quadratic mode is positive-semidefinite + +The *quadratic* mode (`QuadraticMode.vectorized_kernel`) is the second kernel CHD builds: + +$$`K[i,j] = \texttt{scale}\cdot(\alpha + \langle \Phi_i, \Phi_j\rangle)^2 + (1 - \alpha^2\texttt{scale}).` + +Squaring and collecting terms makes the PSD structure explicit: + +$$`K = \mathbf{1}\mathbf{1}^\top + (2\,\texttt{scale}\,\alpha)\cdot \Phi\,\Phi^\top + + \texttt{scale}\cdot\bigl(\Phi\,\Phi^\top \odot \Phi\,\Phi^\top\bigr),` + +a sum of three PSD pieces: the all-ones Gram, a nonnegative multiple of the data Gram `Φ Φᵀ`, and a +nonnegative multiple of its *Hadamard square* `Φ Φᵀ ⊙ Φ Φᵀ`. The last is PSD by the *Schur product +theorem* `PosSemidef.hadamard` (the Hadamard product of PSD matrices is PSD), which Mathlib v4.30.0 +provides. `quadraticKernelFn_posSemidef` assembles the three with `PosSemidef.add`/`PosSemidef.smul` +and proves `K` PSD whenever `scale ≥ 0` *and* `alpha ≥ 0` — both conditions are real: the +`QuadraticKernel` example's two *negative controls* take `alpha = -1` and `scale = -1`, and each makes +a Jacobi eigenvalue go negative. As with the linear mode, this discharges the standing `PosSemidef` +hypothesis, so `solveRidgeSpec (quadraticKernelSpec X w scale alpha) γ b` is an unconditional exact +solve for `γ > 0`, and `quadraticKernelFn_symm` gives symmetry from `PosSemidef.isHermitian`. + +The remaining mode is the *Gaussian* (RBF) kernel `exp(-(xᵢ-xⱼ)²/2\ell²)`, PSD by Schoenberg's theorem +— writing `exp(xy/\ell²)` as a power series whose terms are Hadamard powers of a rank-one Gram — but +Mathlib v4.30.0 has no Bochner/Gaussian-kernel PSD theory, so it remains the honest research-grade +item, parallel to the cyclic-Jacobi rate. 
# The a-posteriori residual certificate From 7713616e58d57e902963fc30ce934acc72658ab6 Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sun, 31 May 2026 16:11:24 -0700 Subject: [PATCH 14/22] Prove the CHD Gaussian-mode kernel is positive-semidefinite MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Discharge the third and last CHD kernel mode's PosSemidef hypothesis, without Bochner/Schoenberg (absent from Mathlib v4.30.0), via an elementary Hadamard-exponential route that reuses the Schur product theorem already used for the quadratic mode. Spec (Factorizations.lean): gaussianKernelFn/gaussianKernelSpec, K[i,j] = scale·∏_dim (1 + w[dim]·exp(−(X[i,dim]−X[j,dim])²/2l²)), matching CHD GaussianMode (foldl product + squared-diff-as-product to stay polymorphic over the law-free Context). Proof (FactorizationsKernels.lean), sorry/admit/omega-free: - posSemidef_of_tendsto — the PSD cone is closed under entrywise limits (the one genuinely new, upstreamable lemma: the quadratic form is continuous in the entries, {≥0} is closed). - posSemidef_map_exp — entrywise exp of a PSD matrix is PSD via the Hadamard-power series exp∘G = Σ G^∘k/k!. - posSemidef_gaussianCol — a single Gaussian exp(−c(yᵢ−yⱼ)²) is PSD by diagonal congruence of the entrywise exponential of the rank-one Gram. - gaussianKernelFn_posSemidef — each feature factor 𝟙𝟙ᵀ + w·Gaussian is PSD, product over features via the Schur product theorem; PSD for scale ≥ 0 and a nonnegative mask w ≥ 0. Plus _symm and the tensor-level gaussianKernelSpec_posSemidef. Example (GaussianKernel.lean): 7/7 #eval checks — symmetric, matches the CHD GaussianMode formula, PSD (no negative eigenvalue), masked PSD, exact ridge solve; negative controls scale=−1 and a negative mask weight w=[−2,0] both correctly rejected. Blueprint Ch4: replaced the "research-grade" Gaussian paragraph with a section documenting the discharge; all three modes now PSD-verified. 
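
For orientation, the scalar identity behind the diagonal-congruence step of posSemidef_gaussianCol can be checked on its own. The following is a minimal sketch rather than part of the patch: it assumes a blanket `import Mathlib`, the variable names `c`, `a`, `b` are illustrative, and the tactic script simply mirrors the `heq` step in the proof (combine the exponentials, then compare exponents with `ring`).

```lean
import Mathlib

-- Hypothetical standalone check (not part of this patch):
-- exp(−c·(a−b)²) factors as exp(−c·a²) · exp(2c·a·b) · exp(−c·b²),
-- i.e. entry (i,j) of D·(exp∘(2c·yyᵀ))·D with a = yᵢ, b = yⱼ.
example (c a b : ℝ) :
    Real.exp (-(c * ((a - b) * (a - b))))
      = Real.exp (-(c * (a * a))) * Real.exp (2 * c * (a * b))
          * Real.exp (-(c * (b * b))) := by
  -- merge the three exponentials on the right into a single exp of a sum
  rw [← Real.exp_add, ← Real.exp_add]
  -- reduce to equality of the exponents, which is pure ring arithmetic
  congr 1
  ring
```

The outer factors depend on a single index each, which is why they can be collected into the diagonal matrix D, while the middle factor is the entrywise exponential of the scaled rank-one Gram.
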
Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Examples/Factorization.lean | 10 + NN/Examples/Factorization/GaussianKernel.lean | 155 +++++++++++ .../Tensor/Basic/FactorizationsKernels.lean | 241 +++++++++++++++++- NN/Spec/Core/Tensor/Factorizations.lean | 25 ++ .../Ch4_Verification/Factorizations.lean | 55 +++- 5 files changed, 477 insertions(+), 9 deletions(-) create mode 100644 NN/Examples/Factorization/GaussianKernel.lean diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean index 424a2f1..d6774e7 100644 --- a/NN/Examples/Factorization.lean +++ b/NN/Examples/Factorization.lean @@ -17,6 +17,7 @@ public import NN.Examples.Factorization.RidgeSolve public import NN.Examples.Factorization.Variational public import NN.Examples.Factorization.LinearKernel public import NN.Examples.Factorization.QuadraticKernel +public import NN.Examples.Factorization.GaussianKernel /-! # Matrix factorization examples @@ -74,6 +75,15 @@ factorization misbehaves. `K = Kᵀ`, matches the CHD `QuadraticMode` formula, all Jacobi eigenvalues `≥ 0` (masking preserved), PSD kernel feeds an exact ridge solve; **negative controls**: both `alpha < 0` and `scale < 0` make `K` indefinite, so both bounds are necessary. +- `GaussianKernel` — CHD's *Gaussian* (fully-nonlinear) mode (`Modes/kernels.py`), + `K = scale·∏_dim (1 + w[dim]·exp(−(X[i,dim]−X[j,dim])²/2l²))`, proven symmetric positive-semidefinite + for `scale ≥ 0` and a nonnegative mask `w ≥ 0` (`gaussianKernelFn_posSemidef`) — *without* + Bochner/Schoenberg, via the entrywise-exponential Hadamard-power series (the PSD cone closed under + limits) and the **Schur product theorem** over features. Checks mirror the other modes: `K = Kᵀ`, + matches the CHD `GaussianMode` product formula, all Jacobi eigenvalues `≥ 0` (masking preserved), PSD + kernel feeds an exact ridge solve; **negative controls**: `scale < 0` and a *negative mask weight* + (`w = [−2,0]`, which drives the diagonal below zero) both make `K` indefinite. 
With the linear, + quadratic, and Gaussian modes all discharged, every CHD kernel build is now PSD-verified. Both **positive** checks (a valid factorization reconstructs to `err ≈ 0`) and **negative controls** (the same metric reports a large error / `NaN` when a hypothesis is violated) are included, so a diff --git a/NN/Examples/Factorization/GaussianKernel.lean b/NN/Examples/Factorization/GaussianKernel.lean new file mode 100644 index 0000000..112b7f4 --- /dev/null +++ b/NN/Examples/Factorization/GaussianKernel.lean @@ -0,0 +1,155 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Examples.Factorization.Common +meta import NN.Examples.Factorization.Common + +/-! +# Example: the CHD Gaussian-mode product kernel is symmetric positive-semidefinite + +These checks corroborate `NN.Proofs.Tensor.Basic.FactorizationsKernels`. As with the linear and +quadratic modes, the whole verified CHD solve / `find_gamma` / `Z_test` development assumes the kernel +`K` is positive-semidefinite, and CHD *builds* `K` from data (`Modes/kernels.py`). The Gaussian +(fully-nonlinear) mode introduces the per-feature Gaussian `exp(−Δ²/2l²)`, whose product contribution is + +`K[i,j] = scale · ∏_dim (1 + w[dim] · exp(−(X[i,dim]−X[j,dim])²/2l²))` (`scale · jnp.prod(1 + w·exps)`). + +`gaussianKernelFn_posSemidef` proves `K` PSD for `scale ≥ 0` and a nonnegative mask `w ≥ 0`, *without* +Bochner/Schoenberg (absent from Mathlib): the entrywise exponential of a PSD matrix is PSD (a +Hadamard-power series, the PSD cone closed under limits), each feature factor `𝟙𝟙ᵀ + w·Gaussian` is PSD, +and the product over features is PSD by the **Schur product theorem**. 
We exhibit: + +* **symmetric** — `K = Kᵀ` to machine precision (`gaussianKernelFn_symm`); +* **matches CHD** — `K[i,j]` agrees with the direct `scale · ∏_dim (1 + w·exp(−Δ²/2l²))` formula; +* **positive-semidefinite** — every Jacobi eigenvalue is `≥ 0` (the numeric witness of + `gaussianKernelFn_posSemidef`), and masking a feature (`w = [1,0]`) keeps it PSD; +* **feeds the verified solve** — because `K` is PSD, `solveRidgeSpec K γ b` is the exact regularized + solve for `γ > 0` (`(K+γ·I)·x = b` to machine precision). + +**Negative controls**: with `scale < 0` the whole kernel flips sign, and with a *negative* mask weight +(`w = [−2,0]`) a feature factor `1 − 2·exp(−Δ²/2l²)` drives the diagonal below zero — in both cases a +Jacobi eigenvalue goes negative, so `scale ≥ 0` *and* `w ≥ 0` are both necessary. +-/ + +@[expose] public section + + +namespace NN.Examples.Factorization.GaussianKernel + +/-- Build a length-`n` `Float` vector from a list (missing entries `0`). -/ +def mkVec {n : Nat} (xs : List Float) : Spec.Tensor Float (.dim n .scalar) := + Spec.ofVecFn (fun i => xs.getD i.val 0.0) + +/-- Count Jacobi eigenvalues that are negative (below `−10⁻⁹`). `0` certifies positive-semidefiniteness +numerically; `≥ 1` certifies an indefinite matrix. -/ +def numNegEigs {k : Nat} (M : Spec.Tensor Float (.dim k (.dim k .scalar))) : Float := + let evals := (Spec.symEigJacobiSpec M 12).1 + (List.finRange k).foldl + (fun a i => a + (if Spec.Tensor.toScalar (Spec.get evals i) < -1e-9 then 1.0 else 0.0)) 0.0 + +/-- The regularized matrix `K + γ·I` as a tensor. -/ +def addGammaI {n : Nat} (K : Spec.Tensor Float (.dim n (.dim n .scalar))) (γ : Float) : + Spec.Tensor Float (.dim n (.dim n .scalar)) := + Spec.ofMatFn (fun i j => Spec.get2 K i j + (if i.val == j.val then γ else 0.0)) + +/-- `ℓ¹` magnitude `Σᵢ |vᵢ|` (a sum, so a `NaN` propagates). 
-/ +def vecAbsErr {n : Nat} (v : Spec.Tensor Float (.dim n .scalar)) : Float := + (List.finRange n).foldl (fun a i => a + Float.abs (Spec.Tensor.toScalar (Spec.get v i))) 0.0 + +/-- Residual `(K + γ·I)·x − b`. -/ +def ridgeResidual {n : Nat} (K : Spec.Tensor Float (.dim n (.dim n .scalar))) (γ : Float) + (b x : Spec.Tensor Float (.dim n .scalar)) : Spec.Tensor Float (.dim n .scalar) := + Spec.ofVecFn (fun i => + Spec.Tensor.toScalar (Spec.get (Spec.matVecMulSpec (addGammaI K γ) x) i) + - Spec.Tensor.toScalar (Spec.get b i)) + +/-- A `4 × 2` data matrix (4 samples, 2 features). -/ +def X : Spec.Tensor Float (.dim 4 (.dim 2 .scalar)) := + mkMat [[1, 0], + [0, 1], + [1, 1], + [2, 1]] + +/-- Selection mask `which_dim = [1,1]` (both features active). -/ +def wAll : Spec.Tensor Float (.dim 2 .scalar) := mkVec [1, 1] +/-- Kernel scale (CHD `GaussianMode._scale`). -/ +def scale : Float := 1.0 +/-- Gaussian length scale (CHD `GaussianMode.l`). -/ +def l : Float := 1.0 + +/-- The Gaussian-mode product kernel `K = scale · ∏_dim (1 + w·exp(−Δ²/2l²))` (4×4). -/ +def K : Spec.Tensor Float (.dim 4 (.dim 4 .scalar)) := Spec.gaussianKernelSpec X wAll scale l + +/-- Direct CHD `GaussianMode` product formula (mask all-ones): +`Kref[i,j] = scale · ∏_k (1 + w[k]·exp(−(X[i,k]−X[j,k])²/2l²))`. -/ +def Kref : Spec.Tensor Float (.dim 4 (.dim 4 .scalar)) := + Spec.ofMatFn (fun i j => + scale * (List.finRange 2).foldl + (fun acc k => + let dx := Spec.get2 X i k - Spec.get2 X j k + acc * (1.0 + Spec.Tensor.toScalar (Spec.get wAll k) * Float.exp (-(dx * dx) / (2.0 * l * l)))) + 1.0) + +#eval IO.println s!"Gaussian kernel K =\n{(List.finRange 4).map (fun i => + (List.finRange 4).map (fun j => Spec.get2 K i j))}" +#eval IO.println s!"eigenvalues of K = {vecToList (Spec.symEigJacobiSpec K 12).1}" + +-- Positive — `K` is symmetric (`gaussianKernelFn_symm`). 
+#eval assertLt "Gaussian kernel is symmetric: K = Kᵀ" (maxMatErr K (tr K)) + +-- Positive — `K` matches the direct CHD `GaussianMode` product formula. +#eval assertLt "Gaussian kernel matches CHD GaussianMode formula" (maxMatErr K Kref) + +-- Positive — `K` is PSD: no negative Jacobi eigenvalue (`gaussianKernelFn_posSemidef`). +#eval assertLt "Gaussian kernel is PSD: no negative eigenvalue" (numNegEigs K) + +/-! ## Masking a feature preserves PSD -/ + +/-- Mask out feature 1: `which_dim = [1,0]` (still `w ≥ 0`). -/ +def wMask : Spec.Tensor Float (.dim 2 .scalar) := mkVec [1, 0] +def Kmask : Spec.Tensor Float (.dim 4 (.dim 4 .scalar)) := Spec.gaussianKernelSpec X wMask scale l + +#eval IO.println s!"masked-feature kernel eigenvalues = {vecToList (Spec.symEigJacobiSpec Kmask 12).1}" + +-- Positive — masking a feature keeps the kernel PSD (PSD holds for any nonnegative mask). +#eval assertLt "masked Gaussian kernel is still PSD" (numNegEigs Kmask) + +/-! ## The PSD kernel feeds the verified ridge solve -/ + +def γ : Float := 0.5 +def b : Spec.Tensor Float (.dim 4 .scalar) := mkVec [1, 2, 3, 4] +/-- The ridge solution against the Gaussian kernel `K`. -/ +def x : Spec.Tensor Float (.dim 4 .scalar) := Spec.solveRidgeSpec K γ b + +#eval IO.println s!"ridge solve on the Gaussian kernel: residual = {vecToList (ridgeResidual K γ b x)}" + +-- Positive — `K` PSD ⟹ `solveRidgeSpec K γ b` is the exact solve of `(K+γI)·x = b` (γ > 0). +#eval assertLt "PSD Gaussian kernel ⟹ exact ridge solve (K+γI)·x = b" + (vecAbsErr (ridgeResidual K γ b x)) + +/-! ## Negative controls: `scale < 0` and a negative mask weight break positive-semidefiniteness -/ + +/-- The same kernel with `scale = −1`: the whole product is negated. 
-/ +def Kscale : Spec.Tensor Float (.dim 4 (.dim 4 .scalar)) := Spec.gaussianKernelSpec X wAll (-1.0) l + +#eval IO.println s!"scale = -1 kernel eigenvalues = {vecToList (Spec.symEigJacobiSpec Kscale 12).1}" + +-- Negative — with `scale < 0` the kernel is indefinite: at least one eigenvalue is negative. +#eval assertGe "scale < 0 breaks PSD (indefinite kernel)" (numNegEigs Kscale) 1.0 + +/-- A negative mask weight `w = [−2,0]`: the feature factor `1 − 2·exp(−Δ²/2l²)` makes the diagonal +(`Δ = 0 ⟹ 1 − 2 = −1`) negative, so the kernel cannot be PSD. -/ +def wNeg : Spec.Tensor Float (.dim 2 .scalar) := mkVec [-2, 0] +def Kw : Spec.Tensor Float (.dim 4 (.dim 4 .scalar)) := Spec.gaussianKernelSpec X wNeg scale l + +#eval IO.println s!"w = [-2,0] kernel eigenvalues = {vecToList (Spec.symEigJacobiSpec Kw 12).1}" + +-- Negative — a negative mask weight makes the kernel indefinite: at least one eigenvalue is negative. +#eval assertGe "negative mask weight breaks PSD (indefinite kernel)" (numNegEigs Kw) 1.0 + +end NN.Examples.Factorization.GaussianKernel diff --git a/NN/Proofs/Tensor/Basic/FactorizationsKernels.lean b/NN/Proofs/Tensor/Basic/FactorizationsKernels.lean index fd698b3..864c6df 100644 --- a/NN/Proofs/Tensor/Basic/FactorizationsKernels.lean +++ b/NN/Proofs/Tensor/Basic/FactorizationsKernels.lean @@ -10,6 +10,9 @@ public import NN.Proofs.Tensor.Basic.Factorizations public import NN.Proofs.Tensor.Basic.FactorizationsOrthonormal public import Mathlib.Data.Real.StarOrdered public import Mathlib.Analysis.Matrix.Order +public import Mathlib.Analysis.Normed.Algebra.Exponential +public import Mathlib.Analysis.SpecialFunctions.Exponential +public import Mathlib.Topology.Instances.Matrix /-! # CHD mode kernels are symmetric positive-semidefinite @@ -44,8 +47,22 @@ multiple of the **Hadamard square** of that Gram — PSD by the **Schur product `0 ≤ scale` and `0 ≤ alpha`. * `quadraticKernelFn_symm` / `quadraticKernelSpec_posSemidef` — symmetry and the tensor-level form. 
-Gaussian mode (Bochner / Schoenberg positive-definiteness, not in Mathlib v4.30.0) is the natural -remaining follow-on. +The **Gaussian mode** product kernel +`K[i,j] = scale · ∏_dim (1 + w[dim]·exp(−(X[i,dim]−X[j,dim])²/2l²))` is also discharged here, *without* +Bochner/Schoenberg (absent from Mathlib v4.30.0), via an elementary Hadamard-exponential route that +reuses the same Schur product theorem: + +* `posSemidef_of_tendsto` — the PSD cone is **closed under entrywise limits** (the one genuinely new, + independently-useful lemma: the quadratic form is continuous in the entries, `{≥0}` is closed). +* `posSemidef_map_exp` — the **entrywise exponential** of a PSD matrix is PSD: `exp∘G = Σₖ G^∘k/k!`, + each Hadamard power `G^∘k` PSD by the Schur product theorem, the partial sums PSD, the limit PSD. +* `posSemidef_gaussianCol` — a single Gaussian matrix `exp(−c·(yᵢ−yⱼ)²)` is PSD (`c ≥ 0`), by writing it + as `D·(exp∘(2c·yyᵀ))·Dᵀ` — a diagonal congruence of an entrywise-exponential of the rank-one Gram. +* `gaussianKernelFn_posSemidef` — each feature factor `𝟙𝟙ᵀ + w[dim]·Gaussian` is PSD, and the product + over features is PSD by the Schur product theorem; so `K` is PSD for `scale ≥ 0` and a mask `w ≥ 0`. + `gaussianKernelFn_symm` / `gaussianKernelSpec_posSemidef` give symmetry and the tensor-level form. + +All three CHD non-interpolatory/nonlinear modes (linear, quadratic, Gaussian) are now PSD-discharged. -/ @[expose] public section @@ -155,4 +172,224 @@ theorem quadraticKernelSpec_posSemidef (X : Spec.Tensor ℝ (.dim n (.dim d .sca rw [hround] exact quadraticKernelFn_posSemidef _ _ hscale halpha +/-! ## The Gaussian mode: an elementary Hadamard-exponential PSD proof + +CHD's Gaussian (fully-nonlinear) kernel introduces `exp(−Δ²/2l²)`, which has no *finite* algebraic +PSD identity. 
We discharge it without Bochner/Schoenberg by the classical Schur route: the entrywise +exponential of a PSD matrix is PSD (Hadamard-power series), and the Gaussian is a diagonal congruence +of such an exponential. The PSD-cone-closed-under-limits lemma is the only genuinely new ingredient. -/ + +open scoped Topology +open Filter + +variable {N : Nat} + +/-- The all-ones matrix `𝟙𝟙ᵀ` is positive-semidefinite (a rank-one Gram). -/ +private theorem posSemidef_ones : (Matrix.of (fun _ _ : Fin N => (1 : ℝ))).PosSemidef := by + have h := Matrix.posSemidef_vecMulVec_self_star (fun _ : Fin N => (1 : ℝ)) + have he : Matrix.vecMulVec (fun _ : Fin N => (1 : ℝ)) (star (fun _ : Fin N => (1 : ℝ))) + = Matrix.of (fun _ _ : Fin N => (1 : ℝ)) := by + ext i j; simp [Matrix.vecMulVec_apply, Pi.star_apply] + rwa [he] at h + +/-- **The PSD cone is closed under entrywise limits.** If real positive-semidefinite matrices `A k` +converge entrywise to `B`, then `B` is positive-semidefinite. The quadratic form `xᵀ·M·x` is continuous +in `M`'s entries, and `{y | 0 ≤ y}` is closed. 
-/ +private theorem posSemidef_of_tendsto {A : ℕ → Matrix (Fin N) (Fin N) ℝ} + {B : Matrix (Fin N) (Fin N) ℝ} (hA : ∀ k, (A k).PosSemidef) + (hlim : Tendsto A atTop (𝓝 B)) : B.PosSemidef := by + have hentry : ∀ i j, Tendsto (fun k => A k i j) atTop (𝓝 (B i j)) := + fun i j => (hlim.apply_nhds i).apply_nhds j + -- entry symmetry of any real Hermitian matrix + have hsymm_entry : ∀ (M : Matrix (Fin N) (Fin N) ℝ), M.IsHermitian → ∀ i j, M i j = M j i := by + intro M hM i j + have e : Mᴴ i j = M i j := congrFun (congrFun hM i) j + rw [Matrix.conjTranspose_apply, star_trivial] at e + exact e.symm + -- B is Hermitian (symmetric over ℝ) + have hBsymm : B.IsHermitian := by + ext i j + rw [Matrix.conjTranspose_apply, star_trivial] + refine tendsto_nhds_unique (hentry j i) ?_ + have hfun : (fun k => A k j i) = (fun k => A k i j) := + funext fun k => (hsymm_entry (A k) (hA k).isHermitian j i) + rw [hfun]; exact hentry i j + refine Matrix.PosSemidef.of_dotProduct_mulVec_nonneg hBsymm (fun x => ?_) + have hquad : ∀ (M : Matrix (Fin N) (Fin N) ℝ), + star x ⬝ᵥ (M *ᵥ x) = ∑ i, ∑ j, star (x i) * (M i j * x j) := by + intro M + simp only [dotProduct, Matrix.mulVec, Pi.star_apply, Finset.mul_sum] + have key : Tendsto (fun k => star x ⬝ᵥ (A k *ᵥ x)) atTop (𝓝 (star x ⬝ᵥ (B *ᵥ x))) := by + simp only [hquad] + refine tendsto_finsetSum _ (fun i _ => ?_) + refine tendsto_finsetSum _ (fun j _ => ?_) + exact tendsto_const_nhds.mul ((hentry i j).mul tendsto_const_nhds) + exact ge_of_tendsto' key (fun k => (hA k).dotProduct_mulVec_nonneg x) + +/-- The `k`-fold Hadamard (entrywise) power of `G`, with `G^∘0 = 𝟙𝟙ᵀ` (the all-ones matrix). 
-/ +private def hadamardPow (G : Matrix (Fin N) (Fin N) ℝ) : ℕ → Matrix (Fin N) (Fin N) ℝ + | 0 => Matrix.of (fun _ _ => 1) + | (k + 1) => G ⊙ hadamardPow G k + +private theorem hadamardPow_apply (G : Matrix (Fin N) (Fin N) ℝ) (k : ℕ) (i j : Fin N) : + hadamardPow G k i j = (G i j) ^ k := by + induction k with + | zero => simp [hadamardPow] + | succ k ih => + rw [hadamardPow, Matrix.hadamard_apply, ih, pow_succ]; ring + +private theorem posSemidef_hadamardPow {G : Matrix (Fin N) (Fin N) ℝ} (hG : G.PosSemidef) (k : ℕ) : + (hadamardPow G k).PosSemidef := by + induction k with + | zero => exact posSemidef_ones + | succ k ih => exact hG.hadamard ih + +/-- **The entrywise exponential of a PSD matrix is PSD.** `exp∘G = Σₖ G^∘k/k!`: each Hadamard power is +PSD by the Schur product theorem, the partial sums are PSD, and the PSD cone is closed under limits. -/ +private theorem posSemidef_map_exp {G : Matrix (Fin N) (Fin N) ℝ} (hG : G.PosSemidef) : + (G.map Real.exp).PosSemidef := by + set S : ℕ → Matrix (Fin N) (Fin N) ℝ := + fun n => ∑ k ∈ Finset.range n, ((k.factorial : ℝ)⁻¹) • hadamardPow G k with hS + have hSpsd : ∀ n, (S n).PosSemidef := by + intro n + refine Matrix.posSemidef_sum _ (fun k _ => ?_) + exact (posSemidef_hadamardPow hG k).smul (by positivity) + have hlim : Tendsto S atTop (𝓝 (G.map Real.exp)) := by + refine tendsto_pi_nhds.mpr (fun i => ?_) + refine tendsto_pi_nhds.mpr (fun j => ?_) + have hentry : (fun n => S n i j) + = (fun n => ∑ k ∈ Finset.range n, ((k.factorial : ℝ)⁻¹) * (G i j) ^ k) := by + funext n + simp only [hS, Matrix.sum_apply, Matrix.smul_apply, smul_eq_mul, hadamardPow_apply] + rw [hentry] + have hsum : HasSum (fun k => ((k.factorial : ℝ)⁻¹) * (G i j) ^ k) (Real.exp (G i j)) := by + have h := NormedSpace.exp_series_hasSum_exp' (𝕂 := ℝ) (G i j) + simp only [smul_eq_mul] at h + rwa [← Real.exp_eq_exp_ℝ] at h + have hmap : (G.map Real.exp) i j = Real.exp (G i j) := by simp [Matrix.map_apply] + rw [hmap] + exact hsum.tendsto_sum_nat + exact 
posSemidef_of_tendsto hSpsd hlim + +/-- **A single Gaussian matrix is positive-semidefinite.** For `c ≥ 0`, the matrix +`exp(−c·(yᵢ−yⱼ)²)` is PSD: writing the exponent as `−c·yᵢ² + 2c·yᵢyⱼ − c·yⱼ²`, it is the diagonal +congruence `D·(exp∘(2c·yyᵀ))·Dᵀ` of the entrywise exponential of the (PSD) rank-one Gram `yyᵀ`. -/ +private theorem posSemidef_gaussianCol (y : Fin N → ℝ) {c : ℝ} (hc : 0 ≤ c) : + (Matrix.of (fun i j => Real.exp (-(c * ((y i - y j) * (y i - y j)))))).PosSemidef := by + set G : Matrix (Fin N) (Fin N) ℝ := (2 * c) • Matrix.vecMulVec y y with hG + have hGpsd : G.PosSemidef := by + have hv : (Matrix.vecMulVec y (star y)).PosSemidef := Matrix.posSemidef_vecMulVec_self_star y + have hstar : Matrix.vecMulVec y (star y) = Matrix.vecMulVec y y := by + ext i j; simp [Matrix.vecMulVec_apply] + rw [hstar] at hv + rw [hG]; exact hv.smul (mul_nonneg (by norm_num) hc) + have hMpsd : (G.map Real.exp).PosSemidef := posSemidef_map_exp hGpsd + set D : Matrix (Fin N) (Fin N) ℝ := Matrix.diagonal (fun i => Real.exp (-(c * (y i * y i)))) with hD + have hcong : (D * (G.map Real.exp) * Dᴴ).PosSemidef := hMpsd.mul_mul_conjTranspose_same D + have hDH : (Dᴴ : Matrix (Fin N) (Fin N) ℝ) = D := by + rw [hD]; simp + rw [hDH] at hcong + have heq : D * (G.map Real.exp) * D + = Matrix.of (fun i j => Real.exp (-(c * ((y i - y j) * (y i - y j))))) := by + ext i j + rw [hD, Matrix.mul_diagonal, Matrix.diagonal_mul] + simp only [Matrix.of_apply, hG, Matrix.map_apply, Matrix.smul_apply, Matrix.vecMulVec_apply, + smul_eq_mul] + rw [← Real.exp_add, ← Real.exp_add] + congr 1; ring + rwa [heq] at hcong + +/-- Folding scalar multiplication over a list is the product of the mapped list. 
-/ +private theorem foldl_mul_eq_prod {ι : Type} (l : List ι) (g : ι → ℝ) (a : ℝ) : + l.foldl (fun acc x => acc * g x) a = a * (l.map g).prod := by + induction l generalizing a with + | nil => simp + | cons x xs ih => simp only [List.foldl_cons, List.map_cons, List.prod_cons, ih]; ring + +/-- A Hadamard product (over a finset) of positive-semidefinite matrices is positive-semidefinite — +the Schur product theorem, iterated. -/ +private theorem posSemidef_prod_hadamard {ι : Type} [DecidableEq ι] + (F : ι → Matrix (Fin N) (Fin N) ℝ) (s : Finset ι) (hF : ∀ k ∈ s, (F k).PosSemidef) : + (Matrix.of (fun i j => ∏ k ∈ s, (F k) i j)).PosSemidef := by + induction s using Finset.induction with + | empty => simpa only [Finset.prod_empty] using (posSemidef_ones (N := N)) + | @insert a s ha ih => + rw [show (Matrix.of (fun i j => ∏ k ∈ insert a s, (F k) i j)) + = (F a) ⊙ Matrix.of (fun i j => ∏ k ∈ s, (F k) i j) from by + ext i j; simp only [Matrix.hadamard_apply, Matrix.of_apply, Finset.prod_insert ha]] + exact (hF a (Finset.mem_insert_self a s)).hadamard + (ih (fun k hk => hF k (Finset.mem_insert_of_mem hk))) + +variable {n d : Nat} + +/-- **The Gaussian-mode product kernel is positive-semidefinite.** For data `X`, a nonnegative +selection mask `w ≥ 0`, and `scale ≥ 0`, +`K[i,j] = scale · ∏_dim (1 + w[dim]·exp(−(X[i,dim]−X[j,dim])²/2l²))` is PSD. Each feature factor +`𝟙𝟙ᵀ + w[dim]·Gaussian` is PSD (`posSemidef_ones` + `posSemidef_gaussianCol`), and the product over +features is PSD by the **Schur product theorem** (`posSemidef_prod_hadamard`). 
-/ +theorem gaussianKernelFn_posSemidef (X : Fin n → Fin d → ℝ) (w : Fin d → ℝ) {scale l : ℝ} + (hscale : 0 ≤ scale) (hw : ∀ k, 0 ≤ w k) : + (Matrix.of (Spec.gaussianKernelFn X w scale l)).PosSemidef := by + -- the per-feature factor matrices + set F : Fin d → Matrix (Fin n) (Fin n) ℝ := + fun k => Matrix.of (fun i j => 1 + w k * + Real.exp (-((X i k - X j k) * (X i k - X j k)) / ((1 + 1) * l * l))) with hF + have hFpsd : ∀ k, (F k).PosSemidef := by + intro k + -- the per-feature Gaussian is PSD via `posSemidef_gaussianCol` + have hGauss : (Matrix.of (fun i j => + Real.exp (-((X i k - X j k) * (X i k - X j k)) / ((1 + 1) * l * l)))).PosSemidef := by + have h := posSemidef_gaussianCol (fun i => X i k) + (c := ((1 + 1) * l * l)⁻¹) (inv_nonneg.mpr (by nlinarith [mul_self_nonneg l])) + have he : (Matrix.of (fun i j => + Real.exp (-(((1 + 1) * l * l)⁻¹ * ((X i k - X j k) * (X i k - X j k)))))) + = Matrix.of (fun i j => + Real.exp (-((X i k - X j k) * (X i k - X j k)) / ((1 + 1) * l * l))) := by + ext i j + show Real.exp _ = Real.exp _ + congr 1; ring + rwa [he] at h + -- F k = 𝟙𝟙ᵀ + w k • Gaussian + have hsplit : F k = Matrix.of (fun _ _ : Fin n => (1 : ℝ)) + + (w k) • Matrix.of (fun i j => + Real.exp (-((X i k - X j k) * (X i k - X j k)) / ((1 + 1) * l * l))) := by + rw [hF]; ext i j + simp only [Matrix.add_apply, Matrix.smul_apply, Matrix.of_apply, smul_eq_mul] + rw [hsplit] + exact posSemidef_ones.add (hGauss.smul (hw k)) + -- the product matrix is PSD + have hPpsd : (Matrix.of (fun i j => ∏ k, (F k) i j)).PosSemidef := + posSemidef_prod_hadamard F Finset.univ (fun k _ => hFpsd k) + -- the kernel is `scale • (product matrix)` + have hKeq : Matrix.of (Spec.gaussianKernelFn X w scale l) + = scale • Matrix.of (fun i j => ∏ k, (F k) i j) := by + ext i j + rw [Matrix.smul_apply, Matrix.of_apply, Matrix.of_apply, smul_eq_mul, Spec.gaussianKernelFn, + foldl_mul_eq_prod, one_mul, ← List.ofFn_eq_map, List.prod_ofFn] + rfl + rw [hKeq] + exact hPpsd.smul hscale + +/-- The 
Gaussian-mode product kernel is symmetric: `K[i,j] = K[j,i]`. -/ +theorem gaussianKernelFn_symm (X : Fin n → Fin d → ℝ) (w : Fin d → ℝ) {scale l : ℝ} + (hscale : 0 ≤ scale) (hw : ∀ k, 0 ≤ w k) (i j : Fin n) : + Spec.gaussianKernelFn X w scale l i j = Spec.gaussianKernelFn X w scale l j i := by + have h := (gaussianKernelFn_posSemidef (scale := scale) (l := l) X w hscale hw).isHermitian + have e : (Matrix.of (Spec.gaussianKernelFn X w scale l))ᴴ i j + = (Matrix.of (Spec.gaussianKernelFn X w scale l)) i j := by rw [h] + simpa [Matrix.conjTranspose_apply, Matrix.of_apply] using e.symm + +/-- **Tensor-level: the Gaussian-mode product kernel is positive-semidefinite.** The form the verified +solve consumes, so e.g. `solveRidgeSpec (gaussianKernelSpec X w scale l) γ b` is the exact regularized +solve for any `γ > 0` whenever `scale ≥ 0` and the mask `w ≥ 0`. -/ +theorem gaussianKernelSpec_posSemidef (X : Spec.Tensor ℝ (.dim n (.dim d .scalar))) + (w : Spec.Tensor ℝ (.dim d .scalar)) {scale l : ℝ} (hscale : 0 ≤ scale) + (hw : ∀ k, 0 ≤ Spec.toVecFn w k) : + (Matrix.of (Spec.toMatFn (Spec.gaussianKernelSpec X w scale l))).PosSemidef := by + have hround : Spec.toMatFn (Spec.gaussianKernelSpec X w scale l) + = Spec.gaussianKernelFn (Spec.toMatFn X) (Spec.toVecFn w) scale l := by + funext i j; rfl + rw [hround] + exact gaussianKernelFn_posSemidef _ _ hscale hw + end Spec.Factorization diff --git a/NN/Spec/Core/Tensor/Factorizations.lean b/NN/Spec/Core/Tensor/Factorizations.lean index 20c1c8a..6e6c675 100644 --- a/NN/Spec/Core/Tensor/Factorizations.lean +++ b/NN/Spec/Core/Tensor/Factorizations.lean @@ -480,4 +480,29 @@ def quadraticKernelSpec {n d : Nat} (X : Tensor α (.dim n (.dim d .scalar))) (w : Tensor α (.dim d .scalar)) (scale alpha : α) : Tensor α (.dim n (.dim n .scalar)) := ofMatFn (quadraticKernelFn (toMatFn X) (toVecFn w) scale alpha) +/-- Gaussian-mode product kernel matrix +`K[i,j] = scale · ∏_dim (1 + w[dim] · exp(−(X[i,dim]−X[j,dim])²/(2·l²)))` — the Gaussian 
contribution of +CHD `GaussianMode` (`scale * jnp.prod(1 + which_dim * exps, axis=2)`, with +`exps[i,j,dim] = exp(−(X[i,dim]−X[j,dim])²/(2·l²))`). + +Each feature factor `1 + w[dim]·k` (with `k` the per-feature Gaussian) is positive-semidefinite — `𝟙𝟙ᵀ` +plus a nonnegative multiple of the Gaussian PSD matrix — and the product over features is PSD by the +**Schur product theorem**, so `K` is PSD for `scale ≥ 0` and a nonnegative mask `w ≥ 0` +(see `FactorizationsKernels`). The product is an explicit `foldl` and the squared difference a product, to +stay polymorphic over the law-free `Context α`. -/ +def gaussianKernelFn {n d : Nat} (X : Fin n → Fin d → α) (w : Fin d → α) (scale l : α) : + Fin n → Fin n → α := + fun i j => scale * (List.finRange d).foldl + (fun acc dim => + acc * (1 + w dim * + MathFunctions.exp (-((X i dim - X j dim) * (X i dim - X j dim)) / ((1 + 1) * l * l)))) 1 + +/-- Tensor-level Gaussian-mode product kernel from data `X`, selection mask `w`, `scale`, and length +scale `l`. + +PyTorch analogue: `scale * torch.prod(1 + which_dim * torch.exp(-(dx**2)/(2*l**2)), dim=2)`. -/ +def gaussianKernelSpec {n d : Nat} (X : Tensor α (.dim n (.dim d .scalar))) + (w : Tensor α (.dim d .scalar)) (scale l : α) : Tensor α (.dim n (.dim n .scalar)) := + ofMatFn (gaussianKernelFn (toMatFn X) (toVecFn w) scale l) + end Spec diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index f764e37..0624096 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -272,10 +272,47 @@ a Jacobi eigenvalue go negative. As with the linear mode, this discharges the st hypothesis, so `solveRidgeSpec (quadraticKernelSpec X w scale alpha) γ b` is an unconditional exact solve for `γ > 0`, and `quadraticKernelFn_symm` gives symmetry from `PosSemidef.isHermitian`. 
-The remaining mode is the *Gaussian* (RBF) kernel `exp(-(xᵢ-xⱼ)²/2\ell²)`, PSD by Schoenberg's theorem -— writing `exp(xy/\ell²)` as a power series whose terms are Hadamard powers of a rank-one Gram — but -Mathlib v4.30.0 has no Bochner/Gaussian-kernel PSD theory, so it remains the honest research-grade -item, parallel to the cyclic-Jacobi rate. +# Building the kernel: the Gaussian mode is positive-semidefinite + +The third and last mode is the *Gaussian* (fully-nonlinear) kernel. CHD's `GaussianMode` builds, per +feature, the Gaussian `exp(-(X_{i,d}-X_{j,d})^2/2\ell^2)` and takes their masked product: + +$$`K[i,j] = \texttt{scale}\cdot\prod_{d} \bigl(1 + w_d\,\exp(-(X_{i,d}-X_{j,d})^2/2\ell^2)\bigr).` + +Unlike the linear and quadratic modes, the Gaussian has *no finite algebraic PSD identity* — `exp` is a +genuine limit. The textbook proof is Schoenberg/Bochner, which Mathlib v4.30.0 does not have. But there +is an elementary route that reuses the *same* Schur product theorem, and +`gaussianKernelFn_posSemidef` carries it out. It rests on one genuinely new, independently useful +lemma and three assembly steps: + +- *The PSD cone is closed under entrywise limits* (`posSemidef_of_tendsto`): if real PSD matrices `A_k` + converge entrywise to `B`, then `B` is PSD. The quadratic form `x^\top M x` is a finite polynomial in + the entries, hence continuous, and `\{y \mid 0 \le y\}` is closed — so `0 \le x^\top A_k x` passes to + the limit. This is the only piece Mathlib lacked, and it belongs in Mathlib. +- *The entrywise exponential of a PSD matrix is PSD* (`posSemidef_map_exp`): writing + `\exp\circ G = \sum_k G^{\odot k}/k!` (Hadamard powers), each `G^{\odot k}` is PSD by the Schur + product theorem, each partial sum is PSD (a finite sum of PSD matrices), and the partial sums converge + entrywise to `\exp\circ G` (the real exponential series) — so the limit is PSD by the lemma above. 
+- *A single Gaussian matrix is PSD* (`posSemidef_gaussianCol`): for `c \ge 0`, the matrix + `\exp(-c\,(y_i-y_j)^2)` factors as the diagonal congruence + `D\,(\exp\circ(2c\,yy^\top))\,D^\top` with `D = \operatorname{diag}(\exp(-c\,y_i^2))`; the middle + factor is the entrywise exponential of the (PSD, rank-one) Gram `yy^\top`, and congruence preserves + PSD. +- *Each feature factor and their product* (`gaussianKernelFn_posSemidef`): `\mathbf{1}\mathbf{1}^\top + + w_d\cdot\text{Gaussian}_d` is PSD for `w_d \ge 0`, and the product over features is a Hadamard product + of PSD matrices — PSD by the Schur product theorem again. Scaling by `\texttt{scale} \ge 0` finishes. + +So `K` is PSD whenever `scale ≥ 0` and the mask is nonnegative (`w ≥ 0`) — discharging the standing +`PosSemidef` hypothesis for the Gaussian mode, and `gaussianKernelFn_symm` gives symmetry from +`PosSemidef.isHermitian`. The `GaussianKernel` example confirms `K = Kᵀ`, the match with the CHD +`GaussianMode` product formula, all-nonnegative Jacobi eigenvalues (with feature masking preserved), +and the downstream exact ridge solve; its two *negative controls* take `scale = -1` and a *negative +mask weight* `w = [-2,0]` (whose factor `1 - 2\exp(-\Delta^2/2\ell^2)` drives the diagonal below zero), +each producing a negative eigenvalue — so `scale ≥ 0` and `w ≥ 0` are both necessary. + +With the linear, quadratic, and Gaussian modes all discharged, *every kernel CHD builds is now +PSD-verified*: there is no `PosSemidef` hypothesis left to assume anywhere in the solve / `find_gamma` / +`Z_test` development. # The a-posteriori residual certificate @@ -462,6 +499,10 @@ Cholesky-based regularized solve are proved, and the specification-level facts t on are independent of the convergence step. 
The three concrete CHD routines built on them are now identified too: the eigendecomposition-form `solve_variationnal` equals `-(K + γI)⁻¹ ga` and agrees with the Cholesky route, and the `noise`/`find_gamma`-loss/`Z_test` statistic is a spectral ratio -provably in `[0,1]` that depends on the kernel only through its spectrum. So the CHD foundation is -complete, the one remaining open item being statistical, not algebraic — the `Z_test`'s Gaussian -sampling and percentiles, exercised numerically rather than proved. +provably in `[0,1]` that depends on the kernel only through its spectrum. And the kernel build itself +is now PSD-verified for *all three* CHD modes — linear, quadratic, and Gaussian — so the standing +`PosSemidef` hypothesis is discharged from data, not assumed, even for the fully-nonlinear kernel. So +the CHD foundation is complete; the two remaining open items are the cyclic-Jacobi convergence *rate* +(captured exactly by the a-posteriori residual certificate, never by `sorry`) and the `Z_test`'s +Gaussian sampling and percentiles — one a proof-only gap on a quantity CHD does not need to *run*, the +other statistical rather than algebraic and exercised numerically. From 83c90a0e7d17ce65b21632d8765414e1d4eb3e4a Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sun, 31 May 2026 16:58:42 -0700 Subject: [PATCH 15/22] Formalize the CHD discovery decision layer (sound & complete) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CHD's outer hypergraph-discovery loop (decision.py, _GraphDiscoveryMain.py) turns the verified `noise` statistic into graph-structure decisions. This proves each of those decisions correct, over the same executable specs that run on Float. 

Spec (NN/Spec/Core/Tensor/Factorizations.lean), mirroring the Python verbatim:
  - argMinFn / argMaxFn — the np.argmin / np.argmax folds
  - kernelChooserFn — MinNoiseKernelChooser (valid = noise < Z_low; the `2`
    sentinel written `1+1` to stay Context-polymorphic)
  - modeIncrementFn / modeChooserFn — MaxIncrementModeChooser
  - allPrunedFn — the np.all(active_modes == 0) stopping rule

Proofs (NN/Proofs/Tensor/Basic/FactorizationsDecision.lean, new, sorry-free).
A single generic fold-selection lemma (foldl_select) underwrites both argmin
and argmax; the comparison proofs bridge the Context order test to the real
`<` (gtBool_eq_decide):
  - argMinFn_le / argMaxFn_le — the prune step removes a least-activated
    ancestor
  - kernelChooserFn_eq_some / _eq_none — MinNoiseKernelChooser is sound AND
    complete: returns `some s` with s valid and of least noise among all valid
    kernels exactly when a valid kernel exists, else `none`. Its `2`-sentinel
    correctness rests directly on the verified varNoiseFn_le_one (hypothesis
    `hbound`), so the decision is a proved selection over a statistic whose
    [0,1] range was itself proved.
  - modeChooserFn_ge — MaxIncrementModeChooser reports the largest noise-jump
    iteration
  - allPrunedFn_iff — the stopping test holds iff every ancestor is pruned

Examples (NN/Examples/Factorization/Discovery.lean, new): 13 #eval checks with
negative controls. The end-to-end block eigendecomposes an SPD kernel and runs
a find_gamma sweep feeding the verified varNoiseSpec at several gamma straight
into argMinFn (noises [0.004, 0.040, 0.287], all in [0,1], argmin = 0).

Blueprint Ch4 gains a discovery-decision-layer section and an updated closing
summary. Registrations: Basic.lean (proofs), Factorization.lean (examples).

Verified: NN.Examples.Factorization + NN.Proofs.Tensor.Basic green (2705 jobs);
banned-tactic sweep clean.

Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Examples/Factorization.lean | 12 ++ NN/Examples/Factorization/Discovery.lean | 174 ++++++++++++++++ NN/Proofs/Tensor/Basic.lean | 1 + .../Tensor/Basic/FactorizationsDecision.lean | 197 ++++++++++++++++++ NN/Spec/Core/Tensor/Factorizations.lean | 56 +++++ .../Ch4_Verification/Factorizations.lean | 63 +++++- 6 files changed, 497 insertions(+), 6 deletions(-) create mode 100644 NN/Examples/Factorization/Discovery.lean create mode 100644 NN/Proofs/Tensor/Basic/FactorizationsDecision.lean diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean index d6774e7..f887644 100644 --- a/NN/Examples/Factorization.lean +++ b/NN/Examples/Factorization.lean @@ -18,6 +18,7 @@ public import NN.Examples.Factorization.Variational public import NN.Examples.Factorization.LinearKernel public import NN.Examples.Factorization.QuadraticKernel public import NN.Examples.Factorization.GaussianKernel +public import NN.Examples.Factorization.Discovery /-! # Matrix factorization examples @@ -84,6 +85,17 @@ factorization misbehaves. kernel feeds an exact ridge solve; **negative controls**: `scale < 0` and a *negative mask weight* (`w = [−2,0]`, which drives the diagonal below zero) both make `K` indefinite. With the linear, quadratic, and Gaussian modes all discharged, every CHD kernel build is now PSD-verified. +- `Discovery` — CHD's *discovery decision layer* (`decision.py`, `_GraphDiscoveryMain.py`), which turns + the verified `noise` statistic into graph structure: the activation prune step (`argMinFn`, picks the + least-activated ancestor), the `MinNoiseKernelChooser` (`kernelChooserFn`, the least-noise valid kernel + with `noise < Z_low`, or `none`), the `MaxIncrementModeChooser` (`modeChooserFn`, the largest + `noise`-jump iteration), and the stopping rule (`allPrunedFn`), proved sound/complete in + `FactorizationsDecision`. 
Checks: argmin picks the least-activated ancestor (and not the most), the + chooser selects the unique valid kernel / least noise among valid / `none` when none valid, the mode + chooser picks the largest-increment iteration, and the stopping rule fires only on the all-zero mask; + an **end-to-end** block then feeds the verified `varNoiseSpec` at several `γ` into `argMinFn`, a + `find_gamma` sweep selecting the least-noise regularization (all noises in `[0,1]`); **negative + controls** confirm the most-activated ancestor and tiny-increment iterations are correctly rejected. Both **positive** checks (a valid factorization reconstructs to `err ≈ 0`) and **negative controls** (the same metric reports a large error / `NaN` when a hypothesis is violated) are included, so a diff --git a/NN/Examples/Factorization/Discovery.lean b/NN/Examples/Factorization/Discovery.lean new file mode 100644 index 0000000..f1bae3a --- /dev/null +++ b/NN/Examples/Factorization/Discovery.lean @@ -0,0 +1,174 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Examples.Factorization.Common +meta import NN.Examples.Factorization.Common + +/-! +# Example: the CHD discovery decision layer + +These checks corroborate `NN.Proofs.Tensor.Basic.FactorizationsDecision`. Once a kernel is built and its +`noise` level computed (`varNoiseSpec`, proven to lie in `[0,1]`), CHD's *discovery loop* turns those +numbers into graph-structure decisions (`decision.py`, `_GraphDiscoveryMain.py`). 
We exercise the four +deterministic choices the loop makes, each with a positive check and a negative control: + +* **prune the least-activated ancestor** — `argMinFn` returns the index of the smallest activation + (`min_activation = np.argmin(activations)`); the *most*-activated ancestor is correctly **not** chosen; +* **pick the kernel mode that admits an edge** — `kernelChooserFn` (`MinNoiseKernelChooser`) returns the + valid kernel (`noise < Z_low`) of least `noise`, or `none` when no kernel is valid; +* **report the pruning iteration of largest `noise` jump** — `modeChooserFn` (`MaxIncrementModeChooser`) + returns the `argmax` of the increments; +* **stop when every ancestor is pruned** — `allPrunedFn` fires on the all-zero mask and not before. + +The final block closes the loop end-to-end: it builds an SPD kernel, eigendecomposes it, and runs a +`find_gamma`-style sweep — feeding the *verified* `varNoiseSpec` at several `γ` straight into `argMinFn` +to select the regularization with least noise. Every decision runs over `Float`, the executable runtime +scalar. +-/ + +@[expose] public section + + +namespace NN.Examples.Factorization.Discovery + +/-- A length-3 `Float` family `Fin 3 → Float` from three entries. -/ +def vec3 (a b c : Float) : Fin 3 → Float := fun i => [a, b, c].getD i.val 0.0 +/-- A length-4 `Float` family `Fin 4 → Float` from four entries. -/ +def vec4 (a b c d : Float) : Fin 4 → Float := fun i => [a, b, c, d].getD i.val 0.0 + +/-- Build a length-`n` `Float` vector tensor from a list (missing entries `0`). -/ +def mkVec {n : Nat} (xs : List Float) : Spec.Tensor Float (.dim n .scalar) := + Spec.ofVecFn (fun i => xs.getD i.val 0.0) + +/-- Encode a chooser verdict as an `Int`: `-1` for `none` ("no ancestor"), else the chosen index. -/ +def chooserCode {m : Nat} (o : Option (Fin m)) : Int := + match o with + | none => -1 + | some i => Int.ofNat i.val + +/-- Compiled positive assertion that a `Bool` decision is `true`. 
-/ +def assertTrue (name : String) (b : Bool) : IO Unit := + if b then IO.println s!"{name}: OK" + else throw (IO.userError s!"{name}: FAIL (expected true)") + +/-- Compiled negative-control assertion that a `Bool` decision is `false` (the property correctly does +*not* hold). -/ +def assertFalse (name : String) (b : Bool) : IO Unit := + if b then throw (IO.userError s!"{name}: FAIL (expected false)") + else IO.println s!"{name}: OK (correctly false)" + +/-! ## Pruning: `argMinFn` removes the least-activated ancestor -/ + +/-- Activations of four candidate ancestors; ancestor 1 is the least-activated. -/ +def activations : Fin 4 → Float := vec4 0.8 0.2 0.5 0.9 + +#eval IO.println s!"activations = {(List.finRange 4).map activations}, \ + argMin = {(Spec.argMinFn activations).val}" + +-- Positive — the prune step removes the least-activated ancestor (`argMinFn_le`). +#eval assertTrue "prune picks the least-activated ancestor (argmin = 1)" + ((Spec.argMinFn activations).val == 1) + +-- Negative — it does *not* remove the most-activated ancestor (index 3). +#eval assertFalse "prune does not pick the most-activated ancestor" + ((Spec.argMinFn activations).val == 3) + +/-! ## Kernel chooser: least-noise valid kernel, or `none` -/ + +/-- Three candidate kernels' `noise` levels and `Z_low` lower bounds. Validity is `noise < Z_low`: +kernel 0 invalid (`0.3 ≥ 0.2`), kernel 1 valid (`0.1 < 0.4`), kernel 2 invalid (`0.5 ≥ 0.1`). -/ +def noisesA : Fin 3 → Float := vec3 0.3 0.1 0.5 +def ZlowsA : Fin 3 → Float := vec3 0.2 0.4 0.1 + +#eval IO.println s!"kernel chooser (one valid) -> code {chooserCode (Spec.kernelChooserFn noisesA ZlowsA)}" + +-- Positive — exactly kernel 1 is valid, so the chooser admits an edge via kernel 1 (`kernelChooserFn_eq_some`). 
+#eval assertTrue "kernel chooser selects the unique valid kernel (some 1)" + (chooserCode (Spec.kernelChooserFn noisesA ZlowsA) == 1) + +/-- Two valid kernels (0 and 1); the chooser must take the one of *least* noise (kernel 0, `0.05`). -/ +def noisesB : Fin 3 → Float := vec3 0.05 0.1 0.5 +def ZlowsB : Fin 3 → Float := vec3 0.2 0.4 0.1 + +-- Positive — among valid kernels the chooser takes least noise (kernel 0 beats kernel 1). +#eval assertTrue "kernel chooser takes least noise among valid (some 0)" + (chooserCode (Spec.kernelChooserFn noisesB ZlowsB) == 0) + +/-- No kernel is valid (`noise ≥ Z_low` everywhere): the chooser reports "no ancestor". -/ +def noisesC : Fin 3 → Float := vec3 0.5 0.6 0.7 +def ZlowsC : Fin 3 → Float := vec3 0.1 0.2 0.3 + +-- Negative — no valid kernel ⟹ no edge (`kernelChooserFn_eq_none`); code `-1`. +#eval assertTrue "kernel chooser reports none when no kernel is valid (code -1)" + (chooserCode (Spec.kernelChooserFn noisesC ZlowsC) == -1) + +/-! ## Mode chooser: the iteration of largest `noise` increment -/ + +/-- The per-iteration `noise` sequence of a pruning run. The big jump `0.08 → 0.9` is between iterations +1 and 2, so `MaxIncrementModeChooser` reports iteration 1 (increment `0.82`). -/ +def noiseSeq : Fin 4 → Float := vec4 0.05 0.08 0.9 0.95 + +#eval IO.println s!"increments = {(List.finRange 4).map (Spec.modeIncrementFn noiseSeq)}, \ + modeChooser = {(Spec.modeChooserFn noiseSeq).val}" + +-- Positive — the mode chooser reports the largest-jump iteration (`modeChooserFn_ge`). +#eval assertTrue "mode chooser picks the largest noise-increment iteration (argmax = 1)" + ((Spec.modeChooserFn noiseSeq).val == 1) + +-- Negative — it does *not* report a tiny-increment iteration (iteration 0, increment 0.03). +#eval assertFalse "mode chooser does not pick a tiny-increment iteration" + ((Spec.modeChooserFn noiseSeq).val == 0) + +/-! 
## Stopping rule: fire exactly when all ancestors are pruned -/ + +-- Positive — the loop stops when every ancestor mode is zero (`allPrunedFn_iff`). +#eval assertTrue "stopping rule fires when all ancestors are pruned" + (Spec.allPrunedFn (vec3 0.0 0.0 0.0)) + +-- Negative — it does not fire while an ancestor remains active. +#eval assertFalse "stopping rule does not fire while an ancestor remains" + (Spec.allPrunedFn (vec3 0.0 1.0 0.0)) + +/-! ## End-to-end: `find_gamma` feeds the verified `noise` into `argMinFn` + +A `find_gamma`-style sweep: build an SPD kernel, eigendecompose it, evaluate the verified +`varNoiseSpec` at several `γ`, and let `argMinFn` pick the regularization of least noise — exactly the +discovery layer consuming the verified statistic. More regularization means more noise, so the smallest +`γ` wins (index 0). -/ + +/-- A `3 × 3` symmetric positive-definite kernel. -/ +def K : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := + mkMat [[2.0, 0.5, 0.3], + [0.5, 2.0, 0.4], + [0.3, 0.4, 2.0]] + +/-- Its eigendecomposition `(evals, V)` from the Jacobi solver. -/ +def evals : Spec.Tensor Float (.dim 3 .scalar) := (Spec.symEigJacobiSpec K 12).1 +def V : Spec.Tensor Float (.dim 3 (.dim 3 .scalar)) := (Spec.symEigJacobiSpec K 12).2 + +/-- The data vector `ga`. -/ +def ga : Spec.Tensor Float (.dim 3 .scalar) := mkVec [1.0, 2.0, 3.0] + +/-- The candidate regularizations, increasing. -/ +def gammas : Fin 3 → Float := vec3 0.01 0.1 1.0 + +/-- The verified `noise` at each candidate `γ` (`find_gamma`'s loss). -/ +def noiseAt : Fin 3 → Float := fun i => Spec.varNoiseSpec evals V (gammas i) ga + +#eval IO.println s!"find_gamma noises = {(List.finRange 3).map noiseAt}, \ + argMin γ index = {(Spec.argMinFn noiseAt).val}" + +-- Positive — every swept noise is a genuine fraction in [0,1] (numeric witness of `varNoiseFn_nonneg`/`_le_one`). 
+#eval assertTrue "find_gamma noises all lie in [0,1]" + ((List.finRange 3).all (fun i => 0.0 ≤ noiseAt i && noiseAt i ≤ 1.0)) + +-- Positive — `find_gamma` (argmin of the verified noise) selects the least-regularized γ (index 0). +#eval assertTrue "find_gamma selects least-noise γ via argMinFn (index 0)" + ((Spec.argMinFn noiseAt).val == 0) + +end NN.Examples.Factorization.Discovery diff --git a/NN/Proofs/Tensor/Basic.lean b/NN/Proofs/Tensor/Basic.lean index 7ca7dc3..6f094ff 100644 --- a/NN/Proofs/Tensor/Basic.lean +++ b/NN/Proofs/Tensor/Basic.lean @@ -13,6 +13,7 @@ public import NN.Proofs.Tensor.Basic.Factorizations public import NN.Proofs.Tensor.Basic.FactorizationsReconstruction public import NN.Proofs.Tensor.Basic.FactorizationsSolve public import NN.Proofs.Tensor.Basic.FactorizationsVariational +public import NN.Proofs.Tensor.Basic.FactorizationsDecision public import NN.Proofs.Tensor.Basic.FactorizationsKernels public import NN.Proofs.Tensor.Basic.FactorizationsOrthonormal public import NN.Proofs.Tensor.Basic.FactorizationsJacobi diff --git a/NN/Proofs/Tensor/Basic/FactorizationsDecision.lean b/NN/Proofs/Tensor/Basic/FactorizationsDecision.lean new file mode 100644 index 0000000..c927e86 --- /dev/null +++ b/NN/Proofs/Tensor/Basic/FactorizationsDecision.lean @@ -0,0 +1,197 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Proofs.Tensor.Basic.FactorizationsVariational +public import NN.Proofs.Tensor.Basic.FactorizationsReconstruction + +/-! +# CHD discovery decision layer (`decision.py`, `_GraphDiscoveryMain.py`) + +[`FactorizationsVariational`](./FactorizationsVariational.lean) proved that CHD's `noise` level is a +spectral fraction in `[0,1]`. This file closes the gap up to the *graph-structure decisions* CHD makes +from those numbers — the outer discovery loop. 
Each is a deterministic comparison over finite data; the +executable specs (`Spec.argMinFn`, `Spec.kernelChooserFn`, …) mirror the Python verbatim, and the +theorems here establish their selection guarantees: + +* **`argMinFn_le` / `argMaxFn_le`** — the fold-based `np.argmin`/`np.argmax` really return the index of a + least / greatest element (the activation prune step, the mode chooser). +* **`kernelChooserFn_eq_some` / `kernelChooserFn_eq_none`** — `MinNoiseKernelChooser` is *sound and + complete*: it returns `some s` with `s` valid and of least `noise` among valid kernels exactly when a + valid kernel exists, and `none` otherwise. The `noise ≤ 1` precondition that makes the `2` sentinel + work is exactly the verified `varNoiseFn_le_one`. +* **`modeChooserFn_ge`** — `MaxIncrementModeChooser` returns the iteration of largest `noise` increment. +* **`allPrunedFn_iff`** — the stopping test `np.all(active_modes == 0)` holds iff every ancestor is + pruned. + +Scope honesty: everything is exact over `ℝ`. The comparisons in the specs go through the `Context` order +test (`gtBool`/`ltBool`); `gtBool_true_iff` (from `FactorizationsReconstruction`) bridges them to the +real `<`, after which the selection proofs are pure order theory over `Fin (n+1)`. +-/ + +@[expose] public section + +namespace Spec.Factorization + +open Spec.Factorization.Reconstruction + +variable {n : Nat} + +/-! ## Bridge: the `Context` order tests over `ℝ` -/ + +/-- Over `ℝ`, `gtBool x y` is the decidable `y < x`. -/ +theorem gtBool_eq_decide (x y : ℝ) : Context.gtBool x y = decide (y < x) := by + by_cases h : y < x + · have h1 : Context.gtBool x y = true := gtBool_true_iff.mpr h + rw [h1]; simp [h] + · have h1 : Context.gtBool x y = false := by + cases hc : Context.gtBool x y with + | false => rfl + | true => exact absurd (gtBool_true_iff.mp hc) h + rw [h1]; simp [h] + +/-- Over `ℝ`, `ltBool x y` is the decidable `x < y`. 
-/ +theorem ltBool_eq_decide (x y : ℝ) : Spec.ltBool x y = decide (x < y) := by + rw [Spec.ltBool, gtBool_eq_decide] + +/-! ## A generic fold-selection lemma + +Both `argMinFn` and `argMaxFn` are `List.foldl`s of the shape "keep the running best, swap in `j` when +the `Bool` test `cmp (key j) (key best)` fires". The next lemma proves such a fold returns a `le`-best +index over `init :: l`, for any preorder `le` whose strict part is decided by `cmp`. Instantiating +`le := (· ≤ ·)` gives the argmax guarantee; `le := (· ≥ ·)` gives argmin. -/ + +private theorem foldl_select {m : Nat} (key : Fin m → ℝ) (cmp : ℝ → ℝ → Bool) + (le : ℝ → ℝ → Prop) (hrefl : ∀ x, le x x) + (htrans : ∀ x y z, le x y → le y z → le x z) + (htrue : ∀ x y, cmp x y = true → le y x) (hfalse : ∀ x y, cmp x y = false → le x y) + (init : Fin m) (l : List (Fin m)) : + le (key init) + (key (l.foldl (fun best j => if cmp (key j) (key best) then j else best) init)) + ∧ ∀ j ∈ l, + le (key j) + (key (l.foldl (fun best j => if cmp (key j) (key best) then j else best) init)) := by + induction l generalizing init with + | nil => exact ⟨hrefl _, by simp⟩ + | cons j₀ t ih => + rw [List.foldl_cons] + set best' := (if cmp (key j₀) (key init) then j₀ else init) with hb + have hstep_init : le (key init) (key best') := by + by_cases hcmp : cmp (key j₀) (key init) = true + · rw [hb, if_pos hcmp]; exact htrue _ _ hcmp + · rw [hb, if_neg hcmp]; exact hrefl _ + have hstep_j0 : le (key j₀) (key best') := by + by_cases hcmp : cmp (key j₀) (key init) = true + · rw [hb, if_pos hcmp]; exact hrefl _ + · rw [hb, if_neg hcmp] + rw [Bool.not_eq_true] at hcmp + exact hfalse _ _ hcmp + obtain ⟨hm, hc⟩ := ih best' + refine ⟨htrans _ _ _ hstep_init hm, ?_⟩ + intro j hj + rcases List.mem_cons.mp hj with rfl | hj' + · exact htrans _ _ _ hstep_j0 hm + · exact hc j hj' + +/-! 
## `argmin` / `argmax` -/ + +/-- **`argMinFn` returns the index of a least element.** -/ +theorem argMinFn_le (a : Fin (n + 1) → ℝ) (j : Fin (n + 1)) : + a (Spec.argMinFn a) ≤ a j := by + have h := foldl_select (key := a) (cmp := Spec.ltBool) (le := fun p q => q ≤ p) + (fun x => le_refl x) (fun x y z hxy hyz => le_trans hyz hxy) + (fun x y hh => by rw [ltBool_eq_decide] at hh; exact (of_decide_eq_true hh).le) + (fun x y hh => by rw [ltBool_eq_decide] at hh; exact not_lt.mp (of_decide_eq_false hh)) + (0 : Fin (n + 1)) (List.finRange (n + 1)) + exact h.2 j (List.mem_finRange j) + +/-- **`argMaxFn` returns the index of a greatest element.** -/ +theorem argMaxFn_le (a : Fin (n + 1) → ℝ) (j : Fin (n + 1)) : + a j ≤ a (Spec.argMaxFn a) := by + have h := foldl_select (key := a) (cmp := Context.gtBool) (le := fun p q => p ≤ q) + (fun x => le_refl x) (fun x y z => le_trans) + (fun x y hh => by rw [gtBool_eq_decide] at hh; exact (of_decide_eq_true hh).le) + (fun x y hh => by rw [gtBool_eq_decide] at hh; exact not_lt.mp (of_decide_eq_false hh)) + (0 : Fin (n + 1)) (List.finRange (n + 1)) + exact h.2 j (List.mem_finRange j) + +/-! ## `MinNoiseKernelChooser` -/ + +/-- **`MinNoiseKernelChooser` is sound and complete (some branch).** If some kernel is valid +(`noise < Z_low`) and all noises respect the ceiling `noise ≤ 1` (the verified `varNoiseFn_le_one`), +the chooser returns `some s` with `s` itself valid and of least `noise` among all valid kernels. 
-/ +theorem kernelChooserFn_eq_some {noises Zlows : Fin (n + 1) → ℝ} + (hbound : ∀ i, noises i ≤ 1) {v : Fin (n + 1)} (hv : noises v < Zlows v) : + ∃ s, Spec.kernelChooserFn noises Zlows = some s ∧ noises s < Zlows s + ∧ ∀ j, noises j < Zlows j → noises s ≤ noises j := by + -- the `np.where`-replaced key (valid ↦ noise, invalid ↦ the `2` sentinel `1 + 1`) + set key : Fin (n + 1) → ℝ := + (fun i => if Spec.ltBool (noises i) (Zlows i) then noises i else (1 : ℝ) + 1) with hkeydef + have hkv : ∀ i, noises i < Zlows i → key i = noises i := by + intro i hi + show (if Spec.ltBool (noises i) (Zlows i) then noises i else (1 : ℝ) + 1) = noises i + rw [ltBool_eq_decide]; simp [hi] + have hkinv : ∀ i, ¬ noises i < Zlows i → key i = (1 : ℝ) + 1 := by + intro i hi + show (if Spec.ltBool (noises i) (Zlows i) then noises i else (1 : ℝ) + 1) = (1 : ℝ) + 1 + rw [ltBool_eq_decide]; simp [hi] + set s := Spec.argMinFn key with hs + have hle : ∀ j, key s ≤ key j := fun j => argMinFn_le key j + -- the chosen `s` is valid: otherwise `key s = 2 ≤ key v = noises v ≤ 1`, impossible + have hsvalid : noises s < Zlows s := by + by_contra hns + have hchain := hle v + rw [hkinv s hns, hkv v hv] at hchain + have := le_trans hchain (hbound v) + norm_num at this + refine ⟨s, ?_, hsvalid, ?_⟩ + · show (if Spec.ltBool (noises s) (Zlows s) then some s else none) = some s + have hbt : Spec.ltBool (noises s) (Zlows s) = true := by rw [ltBool_eq_decide]; simp [hsvalid] + rw [if_pos hbt] + · intro j hj + have hchain := hle j + rwa [hkv s hsvalid, hkv j hj] at hchain + +/-- **`MinNoiseKernelChooser` is sound and complete (none branch).** If no kernel is valid, the chooser +returns `none` — CHD's "no ancestor" verdict. 
-/ +theorem kernelChooserFn_eq_none {noises Zlows : Fin (n + 1) → ℝ} + (hno : ∀ i, ¬ noises i < Zlows i) : Spec.kernelChooserFn noises Zlows = none := by + set key : Fin (n + 1) → ℝ := + (fun i => if Spec.ltBool (noises i) (Zlows i) then noises i else (1 : ℝ) + 1) with hkeydef + set s := Spec.argMinFn key with hs + show (if Spec.ltBool (noises s) (Zlows s) then some s else none) = none + have hbf : Spec.ltBool (noises s) (Zlows s) = true → False := by + rw [ltBool_eq_decide]; intro h; exact (hno s) (of_decide_eq_true h) + rw [if_neg hbf] + +/-! ## `MaxIncrementModeChooser` -/ + +/-- **`MaxIncrementModeChooser` returns the iteration of largest `noise` increment.** -/ +theorem modeChooserFn_ge (noises : Fin (n + 1) → ℝ) (j : Fin (n + 1)) : + Spec.modeIncrementFn noises j ≤ Spec.modeIncrementFn noises (Spec.modeChooserFn noises) := by + rw [Spec.modeChooserFn] + exact argMaxFn_le (Spec.modeIncrementFn noises) j + +/-! ## The stopping rule -/ + +/-- **The stopping test `np.all(active_modes == 0)` holds iff every ancestor is pruned.** -/ +theorem allPrunedFn_iff {k : Nat} (m : Fin k → ℝ) : + Spec.allPrunedFn m = true ↔ ∀ i, m i = 0 := by + rw [Spec.allPrunedFn, List.all_eq_true] + have key : ∀ i : Fin k, + ((!Context.gtBool (m i) 0 && !Context.gtBool 0 (m i)) = true) ↔ m i = 0 := by + intro i + rw [gtBool_eq_decide, gtBool_eq_decide, ← decide_not, ← decide_not, Bool.and_eq_true, + decide_eq_true_eq, decide_eq_true_eq] + constructor + · rintro ⟨h1, h2⟩; exact le_antisymm (not_lt.mp h1) (not_lt.mp h2) + · intro h; rw [h]; exact ⟨lt_irrefl 0, lt_irrefl 0⟩ + constructor + · intro h i; exact (key i).mp (h i (List.mem_finRange i)) + · intro h i _; exact (key i).mpr (h i) + +end Spec.Factorization diff --git a/NN/Spec/Core/Tensor/Factorizations.lean b/NN/Spec/Core/Tensor/Factorizations.lean index 6e6c675..13bd437 100644 --- a/NN/Spec/Core/Tensor/Factorizations.lean +++ b/NN/Spec/Core/Tensor/Factorizations.lean @@ -505,4 +505,60 @@ def gaussianKernelSpec {n d : Nat} (X : Tensor 
α (.dim n (.dim d .scalar))) (w : Tensor α (.dim d .scalar)) (scale l : α) : Tensor α (.dim n (.dim n .scalar)) := ofMatFn (gaussianKernelFn (toMatFn X) (toVecFn w) scale l) +/-! ## CHD discovery decision layer (`decision.py`, `_GraphDiscoveryMain.py`) + +Everything above turns a kernel `K` into a `noise` level (`varNoiseFn`, proven to lie in `[0,1]`) and a +`Z_test` lower bound `Z_low`. CHD's outer *discovery loop* turns those numbers into graph-structure +decisions. Four deterministic choices, each a comparison over finite data: + +* **which feature to prune next** — the least-*activated* ancestor (`min_activation = np.argmin(activations)` + in `helper_functions.step`); +* **which kernel mode admits an edge** — `MinNoiseKernelChooser`: the valid kernel (`noise < Z_low`) of + least `noise`, or none if no kernel is valid (`decision.py`); +* **which pruning iteration to report** — `MaxIncrementModeChooser`: the iteration of largest `noise` + increment (`decision.py`); +* **when to stop** — `np.all(active_modes == 0)`, every ancestor pruned (`_GraphDiscoveryMain.py`). + +The definitions mirror the Python verbatim; their *selection guarantees* (the chosen index really is the +least/greatest, the chooser is sound and complete against `noise < Z_low`) are proved over `ℝ` in +[`NN.Proofs.Tensor.Basic.FactorizationsDecision`](../../../Proofs/Tensor/Basic/FactorizationsDecision.lean). +Comparisons use the `Context` order test (`ltBool`/`gtBool`), so the same definitions run over `Float`. -/ + +/-- Index of a least element of a nonempty finite family (first on ties), by a left fold — CHD's +`np.argmin(activations)` for picking the least-activated ancestor to prune. 
-/ +def argMinFn {n : Nat} (a : Fin (n + 1) → α) : Fin (n + 1) := + (List.finRange (n + 1)).foldl (fun best j => if ltBool (a j) (a best) then j else best) + (0 : Fin (n + 1)) + +/-- Index of a greatest element of a nonempty finite family (first on ties), by a left fold — the +`np.argmax` underlying the mode chooser. -/ +def argMaxFn {n : Nat} (a : Fin (n + 1) → α) : Fin (n + 1) := + (List.finRange (n + 1)).foldl (fun best j => if Context.gtBool (a j) (a best) then j else best) + (0 : Fin (n + 1)) + +/-- CHD `MinNoiseKernelChooser`. Among candidate kernels with per-kernel `noise` and `Z_low`, a kernel +is *valid* (admits an edge) when `noise < Z_low`; return the valid kernel of least `noise`, or `none` +if none is valid. Mirrors `valid = noises < Z_lows; argmin(np.where(valid, noises, 2))` — the `2` +sentinel (any value above the `noise` ceiling `1`) written `1 + 1` to stay polymorphic. -/ +def kernelChooserFn {n : Nat} (noises Zlows : Fin (n + 1) → α) : Option (Fin (n + 1)) := + let key := fun i => if ltBool (noises i) (Zlows i) then noises i else 1 + 1 + let s := argMinFn key + if ltBool (noises s) (Zlows s) then some s else none + +/-- Per-iteration `noise` increments of CHD `MaxIncrementModeChooser`: +`increments[i] = noises[i+1] − noises[i]` for an interior `i`, and `1 − noises[last]` at the end (the +gap to the `noise` ceiling, `np.append(increments, 1 - list_of_noises[-1])`). -/ +def modeIncrementFn {n : Nat} (noises : Fin (n + 1) → α) : Fin (n + 1) → α := + fun i => if h : i.val + 1 < n + 1 then noises ⟨i.val + 1, h⟩ - noises i else 1 - noises i + +/-- CHD `MaxIncrementModeChooser`: report the pruning iteration with the largest jump in `noise` +(`np.argmax(increments)`). -/ +def modeChooserFn {n : Nat} (noises : Fin (n + 1) → α) : Fin (n + 1) := + argMaxFn (modeIncrementFn noises) + +/-- CHD stopping rule `np.all(active_modes == 0)`: every ancestor has been pruned. 
An entry counts as +zero exactly when it is neither positive nor negative in the `Context` order. -/ +def allPrunedFn {k : Nat} (m : Fin k → α) : Bool := + (List.finRange k).all (fun i => !Context.gtBool (m i) 0 && !Context.gtBool 0 (m i)) + end Spec diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index 0624096..33f44d6 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -314,6 +314,50 @@ With the linear, quadratic, and Gaussian modes all discharged, *every kernel CHD PSD-verified*: there is no `PosSemidef` hypothesis left to assume anywhere in the solve / `find_gamma` / `Z_test` development. +# The discovery decision layer: turning `noise` into graph structure + +Everything above produces *numbers* — a kernel, its eigendecomposition, and the `noise` level +(`varNoiseFn`, proven to be a fraction in `[0,1]`). CHD's outer *discovery loop* +(`decision.py`, `_GraphDiscoveryMain.py`) turns those numbers into the actual hypergraph: which +ancestors a node depends on, through which kernel mode. +[`NN.Proofs.Tensor.Basic.FactorizationsDecision`](https://github.com/lean-dojo/TorchLean/blob/main/NN/Proofs/Tensor/Basic/FactorizationsDecision.lean) +formalizes the four deterministic choices the loop makes and proves each one's selection guarantee. They +are comparisons over finite data, so the specs (`Spec.argMinFn`, `Spec.kernelChooserFn`, …) mirror the +Python verbatim and run over `Float`; the proofs are over `ℝ`, bridged from the `Context` order test +(`gtBool`/`ltBool`) to the real `<` by `gtBool_eq_decide`. 
+ +The backbone is a single generic fold-selection lemma (`foldl_select`): the running-best `List.foldl` +that both `np.argmin` and `np.argmax` compile to returns a `le`-extremal index over the whole family, for +*any* preorder `le` whose strict part the `Bool` test decides. Instantiating `le := (· ≤ ·)` and +`(· ≥ ·)` gives the two endpoints: + +- *Prune the least-activated ancestor.* Each step of `helper_functions.step` drops the candidate of + smallest *activation* (`min_activation = np.argmin(activations)`); `argMinFn_le` proves the fold returns + a global minimizer, `argMaxFn_le` the dual. +- *Choose the kernel mode that admits an edge.* `MinNoiseKernelChooser` calls a kernel *valid* when its + `noise` falls below its `Z_low`, and returns the valid kernel of least `noise` + (`argmin(np.where(valid, noises, 2))`), or "no ancestor" if none is valid. `kernelChooserFn_eq_some` + and `kernelChooserFn_eq_none` prove it *sound and complete*: it returns `some s` with `s` itself valid + and of least `noise` among *all* valid kernels exactly when some kernel is valid, and `none` otherwise. + The `2` sentinel that suppresses invalid kernels only works because `noise ≤ 1` — which is exactly the + verified `varNoiseFn_le_one`, threaded in as the hypothesis `hbound`. The bound proved two sections ago + is what makes the decision correct. +- *Report the pruning iteration of largest `noise` jump.* `MaxIncrementModeChooser` takes the `argmax` + of the successive `noise` increments (with `1 − noise_last` appended); `modeChooserFn_ge` proves the + reported iteration has the maximal increment. +- *Stop when every ancestor is pruned.* `allPrunedFn_iff` proves the stopping test + `np.all(active_modes == 0)` holds iff every entry is zero. + +So the loop's structural decisions are not heuristics layered on top of unverified arithmetic: each is a +proved-correct selection over the `noise` statistic whose `[0,1]` range was itself proved. 
The +`Discovery` example runs all four on concrete data — argmin picks the least-activated ancestor (and not +the most-activated one), the chooser selects the unique valid kernel, takes least noise among two valid +ones, and reports `none` when all are invalid, the mode chooser picks the largest-jump iteration, and the +stopping rule fires only on the all-zero mask — and then closes the stack end-to-end: it builds an SPD +kernel, eigendecomposes it, and runs a `find_gamma`-style sweep that feeds the verified `varNoiseSpec` at +several `γ` straight into `argMinFn`, selecting the least-noise regularization (the smallest `γ`, every +swept noise landing in `[0,1]` as proved). + # The a-posteriori residual certificate For the iterative routines, the replacement for an impossible a-priori convergence proof is an exact @@ -499,10 +543,17 @@ Cholesky-based regularized solve are proved, and the specification-level facts t on are independent of the convergence step. The three concrete CHD routines built on them are now identified too: the eigendecomposition-form `solve_variationnal` equals `-(K + γI)⁻¹ ga` and agrees with the Cholesky route, and the `noise`/`find_gamma`-loss/`Z_test` statistic is a spectral ratio -provably in `[0,1]` that depends on the kernel only through its spectrum. And the kernel build itself +provably in `[0,1]` that depends on the kernel only through its spectrum. The kernel build itself is now PSD-verified for *all three* CHD modes — linear, quadratic, and Gaussian — so the standing -`PosSemidef` hypothesis is discharged from data, not assumed, even for the fully-nonlinear kernel. So -the CHD foundation is complete; the two remaining open items are the cyclic-Jacobi convergence *rate* -(captured exactly by the a-posteriori residual certificate, never by `sorry`) and the `Z_test`'s -Gaussian sampling and percentiles — one a proof-only gap on a quantity CHD does not need to *run*, the -other statistical rather than algebraic and exercised numerically. 
+`PosSemidef` hypothesis is discharged from data, not assumed, even for the fully-nonlinear kernel. And +the *discovery decision layer* on top — the kernel chooser, the activation prune step, the mode +chooser, and the stopping rule — is now proved sound and complete, with the chooser's correctness +resting directly on the verified `noise ≤ 1` bound, so the structural decisions are proved selections +over a statistic whose range was itself proved. + +So the CHD foundation is complete, from the kernel build through the regularized solve and the noise +statistic up to the graph-structure decisions. The two remaining open items are both narrow and +deliberately scoped: the cyclic-Jacobi convergence *rate* (captured exactly by the a-posteriori +residual certificate, never by `sorry`) and the `Z_test`'s Gaussian sampling and percentiles — one a +proof-only gap on a quantity CHD does not need to *run*, the other statistical rather than algebraic +and exercised numerically. From bdfdbbaedb21c94caecc6eeb1c36cbdf0c65ea5c Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sun, 31 May 2026 20:18:23 -0700 Subject: [PATCH 16/22] Speed up Cholesky/ridge-solve #eval via strict array @[implemented_by] MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The cyclic-Jacobi eigensolver was migrated to strict `Array (Array α)`, but `choleskyColsFn` and the triangular solves were left on the functional `Fin n → Fin n → α` representation. There each Cholesky column is a closure that re-evaluates every previous column, so reading the full factor `L` is exponential — and the spec is compiled without `precompileModules`, so any `#eval` of `solveRidgeSpec`/`choleskySpec` runs that closure in the interpreter. A single 4×4 ridge solve cost ~310 s; the QuadraticKernel example took ~645 s to build, and every ridge/Cholesky-using example was similarly slow. 
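For orientation, the `@[implemented_by]` mechanism this commit relies on can be sketched with a toy pair of definitions. This is only an illustration under stated assumptions: `sumTo` and `sumToImpl` are hypothetical names unrelated to this patch. Note that Lean does not check that the two definitions agree; the attribute is trusted, which is why agreement has to be validated separately (here, by the examples' residual checks).

```lean
/-- Hypothetical strict runtime version: an accumulator loop over `0 .. n-1`. -/
def sumToImpl (n : Nat) : Nat := Id.run do
  let mut acc : Nat := 0
  for i in [0:n] do
    acc := acc + i
  return acc

/-- Hypothetical reference definition, the one a proof would unfold.
    Compiled and interpreted code runs `sumToImpl` instead. -/
@[implemented_by sumToImpl]
def sumTo : Nat → Nat
  | 0 => 0
  | n + 1 => sumTo n + n

#eval sumTo 4  -- 6
```

Proofs about `sumTo` still see the recursive equations; only `#eval` and compiled callers are redirected to the loop.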
Add two strict, array-backed runtime implementations and register them with `@[implemented_by]`: * `choleskyColsImpl` → `choleskyColsFn` — materializes each column as an `Array α`, so a back-reference `L[i,k]` is an O(1) lookup. * `solveRidgeImpl` → `solveRidgeFn` — factors `K + γ·I = L·Lᵀ` and runs both triangular substitutions entirely over `Array`s, building no `Fin n → α` accumulator closures. `@[implemented_by]` swaps only the compiled/interpreted runtime code; the functional definitions — and every correctness proof that reasons about them (`FactorizationsSolve`, `FactorizationsReconstruction`, …) — are untouched. The examples' residual checks `(K+γ·I)·x − b ≈ 0` / `A = L·Lᵀ` numerically validate that the array path agrees with the proven definition. Result: QuadraticKernel 645 s → 6.8 s; a full clean rebuild of `NN.Examples.Factorization` + `NN.Proofs.Tensor.Basic` is 18.8 s (2705 jobs). No proof changes; sorry/admit/omega-free. Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Spec/Core/Tensor/Factorizations.lean | 78 ++++++++++++++++++++++++- 1 file changed, 77 insertions(+), 1 deletion(-) diff --git a/NN/Spec/Core/Tensor/Factorizations.lean b/NN/Spec/Core/Tensor/Factorizations.lean index 13bd437..6181663 100644 --- a/NN/Spec/Core/Tensor/Factorizations.lean +++ b/NN/Spec/Core/Tensor/Factorizations.lean @@ -100,11 +100,43 @@ The columns are computed left to right. Column `j` uses only columns `0 .. j-1`: - above: `L[i,j] = 0` for `i < j` -/ +/-- +Strict, array-backed runtime implementation of `choleskyColsFn` (registered via `@[implemented_by]`). +Each column is *materialized* into an `Array α`, so a back-reference `L[i,k]` is an `O(1)` lookup +rather than a closure that re-evaluates the whole prefix. The closure form below is mathematically +clean (and is what the proofs reason about), but reading the full factor `L` from it re-evaluates +columns exponentially — ruinous in the interpreter (`#eval`). 
This computes the *same* factor strictly; +the numeric examples (`A = L·Lᵀ`, the ridge-solve residual ≈ 0) validate the two agree. +-/ +def choleskyColsImpl {n : Nat} (A : Fin n → Fin n → α) : List (Fin n → α) := + let cols : Array (Array α) := (List.finRange n).foldl (fun cols j => + let jv := j.val + -- Σ_{k<j} L[j,k]² + let sumsq := (List.finRange n).foldl (fun s k => if k.val < jv then s + (cols.getD k.val #[]).getD jv 0 * (cols.getD k.val #[]).getD jv 0 + else s) 0 + let Ljj := MathFunctions.sqrt (A j j - sumsq) + let colArr : Array α := Array.ofFn (fun i : Fin n => + if i.val < jv then 0 + else if i.val == jv then Ljj + else + -- Σ_{k<j} L[i,k]·L[j,k] + let s := (List.finRange n).foldl (fun acc k => if k.val < jv then + acc + (cols.getD k.val #[]).getD i.val 0 * (cols.getD k.val #[]).getD jv 0 else acc) 0 + (A i j - s) / Ljj) + cols.push colArr) #[] + (List.finRange n).map (fun j => fun i => (cols.getD j.val #[]).getD i.val 0) + /-- The list of columns of the Cholesky factor `L`, as length-`n` vectors, computed left to right. Element `j` of the result is column `j` of `L`. Built by a left fold so that when column `j` is formed, `cols` already holds columns `0 .. j-1`. + +The runtime implementation is `choleskyColsImpl` (strict arrays); the closure form here is the one the +correctness proofs reason about. Both compute the same factor. -/ +@[implemented_by choleskyColsImpl] def choleskyColsFn {n : Nat} (A : Fin n → Fin n → α) : List (Fin n → α) := (List.finRange n).foldl (fun cols j => -- Σ_{k K i j + (if i = j then γ else 0) +/-- +Strict, array-backed runtime implementation of `solveRidgeFn` (registered via `@[implemented_by]`). +It factors `K + γ·I = L·Lᵀ` and runs both triangular substitutions entirely over `Array`s, so no step +materializes the deep `Fin n → α` closures the functional definition builds — those re-evaluate +columns / the substitution accumulator exponentially, which is ruinous in the interpreter (`#eval`). +Same linear solve; the numeric examples (residual `(K+γ·I)·x − b ≈ 0`) validate the two agree. 
-/ +def solveRidgeImpl {n : Nat} (K : Fin n → Fin n → α) (γ : α) (b : Fin n → α) : Fin n → α := + let A : Fin n → Fin n → α := fun i j => K i j + (if i.val == j.val then γ else 0) + -- Cholesky columns, left to right: `cols[j][i] = L[i][j]` (strict arrays, `O(1)` back-reference). + let cols : Array (Array α) := (List.finRange n).foldl (fun cols j => + let jv := j.val + let sumsq := (List.finRange n).foldl + (fun s k => if k.val < jv then let v := (cols.getD k.val #[]).getD jv 0; s + v * v else s) 0 + let Ljj := MathFunctions.sqrt (A j j - sumsq) + cols.push (Array.ofFn (fun i : Fin n => + if i.val < jv then 0 + else if i.val == jv then Ljj + else + let s := (List.finRange n).foldl (fun acc k => + if k.val < jv then + acc + (cols.getD k.val #[]).getD i.val 0 * (cols.getD k.val #[]).getD jv 0 + else acc) 0 + (A i j - s) / Ljj))) #[] + let Lent : Nat → Nat → α := fun i j => (cols.getD j #[]).getD i 0 + -- Forward solve `L · z = b`: `z[i] = (b[i] − Σ_{k<i} L[i,k]·z[k]) / L[i,i]`, `i = 0 … n−1`. + let z : Array α := (List.finRange n).foldl (fun z i => + let iv := i.val + let s := (List.finRange n).foldl + (fun acc k => if k.val < iv then acc + Lent iv k.val * z.getD k.val 0 else acc) 0 + z.push ((b i - s) / Lent iv iv)) #[] + -- Back solve `Lᵀ · x = z`: `x[i] = (z[i] − Σ_{k>i} L[k,i]·x[k]) / L[i,i]`, `i = n−1 … 0`. + let x : Array α := (List.finRange n).reverse.foldl (fun xs i => + let iv := i.val + let s := (List.finRange n).foldl + (fun acc k => if iv < k.val then acc + Lent k.val iv * xs.getD k.val 0 else acc) 0 + xs.set! iv ((z.getD iv 0 - s) / Lent iv iv)) (Array.replicate n 0) + fun i => x.getD i.val 0 + /-- The Tikhonov-regularized (kernel-ridge) solve `(K + γ·I)·x = b`, via the Cholesky factorization -of `K + γ·I`. This is the linear solve at the core of CHD `solve_variationnal`. -/ +of `K + γ·I`. This is the linear solve at the core of CHD `solve_variationnal`. + +The runtime implementation is `solveRidgeImpl` (strict arrays); the closure form here, built from the +verified `choleskyFn` / `triSolve*` pieces, is what the correctness proofs reason about. 
Both compute +the same solution. -/ +@[implemented_by solveRidgeImpl] def solveRidgeFn {n : Nat} (K : Fin n → Fin n → α) (γ : α) (b : Fin n → α) : Fin n → α := cholSolveFn (choleskyFn (addScaledIdFn K γ)) b From 07ba1b3dacd15d6473de7a759d83f2dfa5633729 Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sun, 31 May 2026 20:20:16 -0700 Subject: [PATCH 17/22] Formalize the CHD Z_test significance thresholds (well-posed) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CHD's Z_test (interpolatory.py) decides edge significance by comparing the observed `noise` against the null distribution of the same statistic under random data: draw N samples, score each one's `noise`, sort, and read off the 5th/95th percentiles as Z_low/Z_high; an edge is significant when `noise < Z_low`. Spec layer (Factorizations.lean): `sampleNoisesFn` (each draw scored by the *same* `varNoiseFn`), `leBool`, the order statistic `kthSmallestFn` (mergeSort + index), `zLowIdx`/`zHighIdx` (⌊0.05·N⌋ / ⌊0.95·N⌋), `zLowFn`/`zHighFn`, `zSignificantFn`, and tensor wrappers `zLowSpec`/`zHighSpec` — mirroring Z_test verbatim and running on Float. Proofs (FactorizationsDecision, sorry-free): the order-statistic toolkit (`kthSmallestFn_mem`/`_nonneg`/`_le_one`/`_mono`, bridged to Mathlib's `sortedLE_mergeSort` via `leBool_eq_le`) and the Z_test guarantees — `sampleNoisesFn_nonneg`/`_le_one` (every null sample inherits the verified `noise ∈ [0,1]` bound), `zLowFn`/`zHighFn ∈ [0,1]`, `zLowFn_le_zHighFn` (Z_low ≤ Z_high by order-statistic monotonicity), and `zTest_admits_edge` (`noise < Z_low` ⟹ `MinNoiseKernelChooser` returns `some 0`), the chooser's `noise ≤ 1` precondition again being `varNoiseFn_le_one`. The keystone: the same [0,1] bound governs both the data noise and the whole null distribution, so the test is well-posed. 
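The sort-and-index shape of the thresholds described above can be sketched as follows. This is a minimal illustration, not the definitions added by this commit: every name ending in `Sketch` is hypothetical, the scalar is fixed to `Float` rather than the polymorphic `α`, and the percentile indices are assumed to be the floors `⌊0.05·N⌋` and `⌊0.95·N⌋` computed in `Nat` arithmetic.

```lean
/-- Hypothetical order statistic: sort ascending, then index (fallback `0.0` out of range). -/
def kthSmallestSketch (xs : List Float) (k : Nat) : Float :=
  (xs.mergeSort (fun a b => decide (a ≤ b))).getD k 0.0

/-- Assumed percentile indices `⌊0.05·N⌋` and `⌊0.95·N⌋`, using truncating `Nat` division. -/
def zLowIdxSketch (N : Nat) : Nat := N * 5 / 100
def zHighIdxSketch (N : Nat) : Nat := N * 95 / 100

/-- `Z_low` / `Z_high` of a list of already-scored null noises. -/
def zLowSketch (noises : List Float) : Float :=
  kthSmallestSketch noises (zLowIdxSketch noises.length)
def zHighSketch (noises : List Float) : Float :=
  kthSmallestSketch noises (zHighIdxSketch noises.length)

/-- An edge is significant when the observed noise is strictly below `Z_low`. -/
def zSignificantSketch (noise zLow : Float) : Bool := decide (noise < zLow)

#eval zLowIdxSketch 20   -- 1
#eval zHighIdxSketch 20  -- 19
```

Because both thresholds index the same ascending list and the low index never exceeds the high one, `Z_low ≤ Z_high` follows from sortedness alone, which is the monotonicity argument the proofs make over `ℝ`.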
Examples (Discovery): builds the null distribution from a real eigendecomposition, checks 0 ≤ Z_low ≤ Z_high ≤ 1, shows dominant-eigenvector data clears the lower tail (significant), and rejects a high noise and a noise at the upper tail (negative controls). Blueprint: new "The Z_test: a null-distribution significance threshold" section and updated scope summary — only the distributional half (Gaussian draws + calibrated percentile, needs Mathlib.Probability) remains. Verified: NN.Examples.Factorization + NN.Proofs.Tensor.Basic 2705 jobs green; blueprint 4949 jobs; sorry/admit/omega-free. Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Examples/Factorization.lean | 5 + NN/Examples/Factorization/Discovery.lean | 87 ++++++++- .../Tensor/Basic/FactorizationsDecision.lean | 167 ++++++++++++++++++ NN/Spec/Core/Tensor/Factorizations.lean | 67 +++++++ .../Ch4_Verification/Factorizations.lean | 60 ++++++- 5 files changed, 374 insertions(+), 12 deletions(-) diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean index f887644..386e356 100644 --- a/NN/Examples/Factorization.lean +++ b/NN/Examples/Factorization.lean @@ -96,6 +96,11 @@ factorization misbehaves. an **end-to-end** block then feeds the verified `varNoiseSpec` at several `γ` into `argMinFn`, a `find_gamma` sweep selecting the least-noise regularization (all noises in `[0,1]`); **negative controls** confirm the most-activated ancestor and tiny-increment iterations are correctly rejected. + A closing **`Z_test`** block exercises the statistical layer: the null-distribution thresholds + `Z_low`/`Z_high` (5th/95th percentiles of the per-sample `noise`) are well-posed + (`0 ≤ Z_low ≤ Z_high ≤ 1`), data aligned with the dominant eigenvector clears the lower tail + (`noise < Z_low`, **positive**), and a high noise / a noise at the upper tail are rejected + (**negative controls**) — feeding `MinNoiseKernelChooser` exactly as in CHD. 
Both **positive** checks (a valid factorization reconstructs to `err ≈ 0`) and **negative controls** (the same metric reports a large error / `NaN` when a hypothesis is violated) are included, so a diff --git a/NN/Examples/Factorization/Discovery.lean b/NN/Examples/Factorization/Discovery.lean index f1bae3a..4a40538 100644 --- a/NN/Examples/Factorization/Discovery.lean +++ b/NN/Examples/Factorization/Discovery.lean @@ -25,10 +25,17 @@ deterministic choices the loop makes, each with a positive check and a negative returns the `argmax` of the increments; * **stop when every ancestor is pruned** — `allPrunedFn` fires on the all-zero mask and not before. -The final block closes the loop end-to-end: it builds an SPD kernel, eigendecomposes it, and runs a -`find_gamma`-style sweep — feeding the *verified* `varNoiseSpec` at several `γ` straight into `argMinFn` -to select the regularization with least noise. Every decision runs over `Float`, the executable runtime -scalar. +A `find_gamma`-style block then closes the loop end-to-end: it builds an SPD kernel, eigendecomposes +it, and feeds the *verified* `varNoiseSpec` at several `γ` straight into `argMinFn` to select the +regularization with least noise. + +A final **`Z_test`** block adds the statistical layer (`interpolatory.py`): the observed `noise` is +judged against the null distribution of the *same* statistic under random data — `Z_low`/`Z_high` are +the 5th/95th percentiles of the per-sample noises. We check the thresholds are well-posed +(`0 ≤ Z_low ≤ Z_high ≤ 1`, each percentile inheriting the verified `noise ∈ [0,1]` bound) and that the +verdict `noise < Z_low` flags a real edge — with a genuine positive (data aligned with the dominant +eigenvector clears the lower tail) and negatives (a high noise, and a noise sitting at the upper tail, +are both correctly rejected). Every decision runs over `Float`, the executable runtime scalar. 
-/ @[expose] public section @@ -171,4 +178,76 @@ def noiseAt : Fin 3 → Float := fun i => Spec.varNoiseSpec evals V (gammas i) g #eval assertTrue "find_gamma selects least-noise γ via argMinFn (index 0)" ((Spec.argMinFn noiseAt).val == 0) +/-! ## `Z_test`: the null-distribution significance thresholds + +CHD decides an edge is real by comparing the observed `noise` against the null distribution of the +*same* statistic under random data: draw `N` samples, score each one's `noise`, sort, and read off the +5th/95th percentiles as `Z_low`/`Z_high` (`Z_test` in `interpolatory.py`). An edge is significant when +`noise < Z_low`. These checks corroborate `FactorizationsDecision`: the thresholds are well-posed +(`0 ≤ Z_low ≤ Z_high ≤ 1`, each percentile inheriting the verified `noise ∈ [0,1]` bound) and the +verdict drives `MinNoiseKernelChooser`. -/ + +/-- An `N = 20` family of pseudo-random null draws `sⱼ ∈ ℝ³` (deterministic, standing in for CHD's +`jax.random.normal` samples). With `N = 20` the percentile indices are `Z_low = ⌊0.05·20⌋ = 1` and +`Z_high = ⌊0.95·20⌋ = 19`. -/ +def zSamples : Fin 20 → Fin 3 → Float := + fun j i => (Float.ofNat ((j.val * 31 + i.val * 17 + 7) % 23) - 11.0) / 7.0 + +/-- The regularization at which we run the `Z_test`. -/ +def gammaZ : Float := 0.1 + +/-- `Z_low`: the 5th percentile of the null `noise` distribution from the verified eigendecomposition. -/ +def zLow : Float := Spec.zLowFn (Spec.toVecFn evals) (Spec.toMatFn V) gammaZ zSamples +/-- `Z_high`: the 95th percentile of the null `noise` distribution. -/ +def zHigh : Float := Spec.zHighFn (Spec.toVecFn evals) (Spec.toMatFn V) gammaZ zSamples + +#eval IO.println s!"Z_test null thresholds: Z_low = {zLow}, Z_high = {zHigh}" + +-- Positive — the thresholds are ordered (`zLowFn_le_zHighFn`); `leBool` is the very key the sort uses. 
+#eval assertTrue "Z_low ≤ Z_high (order-statistic monotonicity)" (Spec.leBool zLow zHigh) + +-- Positive — both thresholds are genuine fractions in [0,1] (`zLowFn_nonneg`/`_le_one`, `zHighFn_*`). +#eval assertTrue "Z_low and Z_high both lie in [0,1]" + (Spec.leBool 0.0 zLow && Spec.leBool zLow 1.0 && Spec.leBool 0.0 zHigh && Spec.leBool zHigh 1.0) + +/-- The dominant eigen-direction (largest eigenvalue), found by the verified `argMaxFn`. -/ +def domIdx : Fin 3 := Spec.argMaxFn (Spec.toVecFn evals) +/-- A "real signal": data aligned with the dominant eigenvector. Its `noise` is exactly the shrinkage +`γ/(λ_dom+γ)` — the smallest shrinkage, so well below the null tail — the kind of edge CHD keeps. -/ +def signalGa : Fin 3 → Float := fun i => Spec.toMatFn V i domIdx +/-- The observed `noise` of the signal-aligned data (the verified `varNoiseFn`). -/ +def obsSignal : Float := Spec.varNoiseFn (Spec.toVecFn evals) gammaZ (Spec.projFn (Spec.toMatFn V) signalGa) + +#eval IO.println s!"signal-aligned noise = {obsSignal}, significant (noise < Z_low)? \ + {Spec.zSignificantFn obsSignal zLow}" + +-- Positive — the signal-aligned noise is itself a fraction in [0,1] (witness of `varNoiseFn_*`). +#eval assertTrue "signal-aligned noise lies in [0,1]" + (Spec.leBool 0.0 obsSignal && Spec.leBool obsSignal 1.0) + +-- Positive — end-to-end: data aligned with the dominant eigenvector clears the null's lower tail, so +-- the `Z_test` flags a real edge (`noise < Z_low`). +#eval assertTrue "end-to-end: dominant-direction signal is significant (noise < Z_low)" + (Spec.zSignificantFn obsSignal zLow) + +-- Positive — a clearly-significant edge (noise 0.05 below threshold 0.20) is flagged (`zSignificantFn`). +#eval assertTrue "significant edge: noise 0.05 < Z_low 0.20" (Spec.zSignificantFn 0.05 0.20) + +-- Negative — a noise *above* the threshold is correctly not significant. 
+#eval assertFalse "non-significant: noise 0.50 ≥ Z_low 0.20" (Spec.zSignificantFn 0.50 0.20) + +-- Negative — the 95th-percentile value itself is never below the 5th (`zHigh ≥ zLow`), so feeding it as +-- an "observed" noise is correctly judged non-significant — a faithful negative from the real null. +#eval assertFalse "Z_high is not below Z_low (a noise at the upper tail is not significant)" + (Spec.zSignificantFn zHigh zLow) + +-- Positive — the `Z_test` verdict feeds `MinNoiseKernelChooser` (`zTest_admits_edge`): a significant +-- single kernel is admitted as `some 0`. +#eval assertTrue "significant kernel is admitted (chooser → some 0)" + (chooserCode (Spec.kernelChooserFn (fun _ : Fin 1 => (0.05 : Float)) (fun _ : Fin 1 => 0.20)) == 0) + +-- Negative — a non-significant single kernel is rejected (`none`, code -1). +#eval assertTrue "non-significant kernel is rejected (chooser → none, code -1)" + (chooserCode (Spec.kernelChooserFn (fun _ : Fin 1 => (0.50 : Float)) (fun _ : Fin 1 => 0.20)) == -1) + end NN.Examples.Factorization.Discovery diff --git a/NN/Proofs/Tensor/Basic/FactorizationsDecision.lean b/NN/Proofs/Tensor/Basic/FactorizationsDecision.lean index c927e86..42a7254 100644 --- a/NN/Proofs/Tensor/Basic/FactorizationsDecision.lean +++ b/NN/Proofs/Tensor/Basic/FactorizationsDecision.lean @@ -194,4 +194,171 @@ theorem allPrunedFn_iff {k : Nat} (m : Fin k → ℝ) : · intro h i; exact (key i).mp (h i (List.mem_finRange i)) · intro h i _; exact (key i).mpr (h i) +/-! ## CHD `Z_test`: the null-distribution significance thresholds + +`Z_test` (`interpolatory.py`) builds the null distribution of the `noise` statistic under random data, +sorts the per-sample noises, and reports the 5th/95th percentiles as `Z_low`/`Z_high`. The numerical +heart — the *value* of each sample's noise — is the **same** `varNoiseFn` whose `[0,1]` bound we already +proved (`varNoiseFn_nonneg`/`_le_one`). 
So the percentiles inherit that bound, and `Z_low ≤ Z_high`
+because a 5th percentile never exceeds a 95th — pure order-statistic monotonicity over the sorted list.
+
+The order statistic `kthSmallestFn` sorts with the `Context` comparator `leBool`; over `ℝ` that is the
+real `≤` (`leBool_eq_le`), letting Mathlib's `sortedLE_mergeSort` supply sortedness. -/
+
+/-- Over `ℝ`, the `Context` comparator `leBool x y` is the decidable `x ≤ y`. -/
+theorem leBool_eq_decide (x y : ℝ) : Spec.leBool x y = decide (x ≤ y) := by
+  rw [Spec.leBool, ltBool_eq_decide, ← decide_not, decide_eq_decide]
+  exact not_lt
+
+/-- Over `ℝ`, the `leBool` sort key *is* the decided `(· ≤ ·)`, so `kthSmallestFn` sorts with the
+real order (matching Mathlib's `sortedLE_mergeSort`). -/
+private theorem leBool_eq_le : (Spec.leBool : ℝ → ℝ → Bool) = (fun x y => decide (x ≤ y)) := by
+  funext x y; exact leBool_eq_decide x y
+
+/-- `getD` at an in-range index is the corresponding `getElem` (the `0` fallback is unused). -/
+private theorem getD_zero_eq {L : List ℝ} {i : Nat} (h : i < L.length) : L.getD i 0 = L[i] := by
+  rw [List.getD_eq_getElem?_getD, List.getElem?_eq_getElem h, Option.getD_some]
+
+/-- `kthSmallestFn` over `ℝ` is the `k`-th entry of the list sorted by the *real* order. -/
+theorem kthSmallestFn_eq_sorted_getD {N : Nat} (a : Fin N → ℝ) (k : Nat) :
+    Spec.kthSmallestFn a k = (((List.finRange N).map a).mergeSort (· ≤ ·)).getD k 0 := by
+  rw [Spec.kthSmallestFn, leBool_eq_le]
+
+/-! ### Order-statistic facts -/
+
+/-- **`kthSmallestFn` is one of the family's values** (for an in-range `k`): sorting permutes, so the
+selected entry came from `a`.
-/
+theorem kthSmallestFn_mem {N : Nat} (a : Fin N → ℝ) {k : Nat} (hk : k < N) :
+    ∃ i, Spec.kthSmallestFn a k = a i := by
+  have hlen : (((List.finRange N).map a).mergeSort (· ≤ ·)).length = N := by
+    rw [List.length_mergeSort, List.length_map, List.length_finRange]
+  have hk' : k < (((List.finRange N).map a).mergeSort (· ≤ ·)).length := by rw [hlen]; exact hk
+  have hmem : Spec.kthSmallestFn a k ∈ ((List.finRange N).map a).mergeSort (· ≤ ·) := by
+    rw [kthSmallestFn_eq_sorted_getD, getD_zero_eq hk']
+    exact List.getElem_mem hk'
+  rw [List.mem_mergeSort, List.mem_map] at hmem
+  obtain ⟨i, _, hi⟩ := hmem
+  exact ⟨i, hi.symm⟩
+
+/-- **An in-range order statistic is `≥ 0`** when every value is. -/
+theorem kthSmallestFn_nonneg {N : Nat} (a : Fin N → ℝ) (hpos : ∀ i, 0 ≤ a i) {k : Nat}
+    (hk : k < N) : 0 ≤ Spec.kthSmallestFn a k := by
+  obtain ⟨i, hi⟩ := kthSmallestFn_mem a hk; rw [hi]; exact hpos i
+
+/-- **An in-range order statistic is `≤ 1`** when every value is. -/
+theorem kthSmallestFn_le_one {N : Nat} (a : Fin N → ℝ) (hle : ∀ i, a i ≤ 1) {k : Nat}
+    (hk : k < N) : Spec.kthSmallestFn a k ≤ 1 := by
+  obtain ⟨i, hi⟩ := kthSmallestFn_mem a hk; rw [hi]; exact hle i
+
+/-- **Order statistics are monotone in their rank** (`k ≤ k' → kₜₕ ≤ k'ₜₕ`): the underlying list is
+sorted ascending, so later indices hold larger values. This is exactly why `Z_low ≤ Z_high`.
-/
+theorem kthSmallestFn_mono {N : Nat} (a : Fin N → ℝ) {k k' : Nat} (hkk : k ≤ k') (hk' : k' < N) :
+    Spec.kthSmallestFn a k ≤ Spec.kthSmallestFn a k' := by
+  have hlen : (((List.finRange N).map a).mergeSort (· ≤ ·)).length = N := by
+    rw [List.length_mergeSort, List.length_map, List.length_finRange]
+  have hkL : k < (((List.finRange N).map a).mergeSort (· ≤ ·)).length := by
+    rw [hlen]; exact lt_of_le_of_lt hkk hk'
+  have hk'L : k' < (((List.finRange N).map a).mergeSort (· ≤ ·)).length := by rw [hlen]; exact hk'
+  rw [kthSmallestFn_eq_sorted_getD, kthSmallestFn_eq_sorted_getD, getD_zero_eq hkL, getD_zero_eq hk'L]
+  exact List.sortedLE_mergeSort.getElem_le_getElem_of_le hkk
+
+/-! ### The percentile indices -/
+
+/-- The 5th-percentile index is in range for a nonempty sample. -/
+theorem zLowIdx_lt {N : Nat} (hN : 0 < N) : Spec.zLowIdx N < N := by
+  rw [Spec.zLowIdx]; exact Nat.div_lt_self hN (by norm_num)
+
+/-- The 95th-percentile index is in range for a nonempty sample. -/
+theorem zHighIdx_lt {N : Nat} (hN : 0 < N) : Spec.zHighIdx N < N := by
+  rw [Spec.zHighIdx, Nat.div_lt_iff_lt_mul (by norm_num : (0 : Nat) < 20)]
+  nlinarith [hN]
+
+/-- The 5th-percentile index never exceeds the 95th. -/
+theorem zLowIdx_le_zHighIdx (N : Nat) : Spec.zLowIdx N ≤ Spec.zHighIdx N := by
+  rw [Spec.zLowIdx, Spec.zHighIdx]
+  exact Nat.div_le_div_right (Nat.le_mul_of_pos_left N (by norm_num))
+
+/-! ### Each null sample's noise inherits the `[0,1]` bound -/
+
+/-- **Every `Z_test` null sample is a genuine fraction** (`0 ≤ noise`): it is `varNoiseFn` of the
+projected draw, and `varNoiseFn_nonneg` already bounds that. -/
+theorem sampleNoisesFn_nonneg {n N : Nat} {Λ : Fin n → ℝ} (hΛ : ∀ i, 0 ≤ Λ i) {γ : ℝ}
+    (hγ : 0 < γ) (V : Fin n → Fin n → ℝ) (samples : Fin N → Fin n → ℝ) (j : Fin N) :
+    0 ≤ Spec.sampleNoisesFn Λ V γ samples j := by
+  rw [Spec.sampleNoisesFn]; exact varNoiseFn_nonneg hΛ hγ _
+
+/-- **Every `Z_test` null sample is `≤ 1`** (`varNoiseFn_le_one`).
-/
+theorem sampleNoisesFn_le_one {n N : Nat} {Λ : Fin n → ℝ} (hΛ : ∀ i, 0 ≤ Λ i) {γ : ℝ}
+    (hγ : 0 < γ) (V : Fin n → Fin n → ℝ) (samples : Fin N → Fin n → ℝ) (j : Fin N) :
+    Spec.sampleNoisesFn Λ V γ samples j ≤ 1 := by
+  rw [Spec.sampleNoisesFn]; exact varNoiseFn_le_one hΛ hγ _
+
+/-! ### `Z_low` / `Z_high` are well-posed thresholds -/
+
+/-- **`Z_low` is a genuine fraction in `[0,1]` (lower bound).** -/
+theorem zLowFn_nonneg {n N : Nat} {Λ : Fin n → ℝ} (hΛ : ∀ i, 0 ≤ Λ i) {γ : ℝ} (hγ : 0 < γ)
+    (V : Fin n → Fin n → ℝ) (samples : Fin N → Fin n → ℝ) (hN : 0 < N) :
+    0 ≤ Spec.zLowFn Λ V γ samples := by
+  rw [Spec.zLowFn]
+  exact kthSmallestFn_nonneg _ (fun j => sampleNoisesFn_nonneg hΛ hγ V samples j) (zLowIdx_lt hN)
+
+/-- **`Z_low ≤ 1`.** -/
+theorem zLowFn_le_one {n N : Nat} {Λ : Fin n → ℝ} (hΛ : ∀ i, 0 ≤ Λ i) {γ : ℝ} (hγ : 0 < γ)
+    (V : Fin n → Fin n → ℝ) (samples : Fin N → Fin n → ℝ) (hN : 0 < N) :
+    Spec.zLowFn Λ V γ samples ≤ 1 := by
+  rw [Spec.zLowFn]
+  exact kthSmallestFn_le_one _ (fun j => sampleNoisesFn_le_one hΛ hγ V samples j) (zLowIdx_lt hN)
+
+/-- **`Z_high` is a genuine fraction in `[0,1]` (lower bound).** -/
+theorem zHighFn_nonneg {n N : Nat} {Λ : Fin n → ℝ} (hΛ : ∀ i, 0 ≤ Λ i) {γ : ℝ} (hγ : 0 < γ)
+    (V : Fin n → Fin n → ℝ) (samples : Fin N → Fin n → ℝ) (hN : 0 < N) :
+    0 ≤ Spec.zHighFn Λ V γ samples := by
+  rw [Spec.zHighFn]
+  exact kthSmallestFn_nonneg _ (fun j => sampleNoisesFn_nonneg hΛ hγ V samples j) (zHighIdx_lt hN)
+
+/-- **`Z_high ≤ 1`.** -/
+theorem zHighFn_le_one {n N : Nat} {Λ : Fin n → ℝ} (hΛ : ∀ i, 0 ≤ Λ i) {γ : ℝ} (hγ : 0 < γ)
+    (V : Fin n → Fin n → ℝ) (samples : Fin N → Fin n → ℝ) (hN : 0 < N) :
+    Spec.zHighFn Λ V γ samples ≤ 1 := by
+  rw [Spec.zHighFn]
+  exact kthSmallestFn_le_one _ (fun j => sampleNoisesFn_le_one hΛ hγ V samples j) (zHighIdx_lt hN)
+
+/-- **`Z_low ≤ Z_high`.** The lower percentile of the null distribution never exceeds the upper one —
+the order-statistic monotonicity over the shared sorted noises.
The test `Z_low ≤ noise ≤ Z_high` it
+implies (the "no anomaly" window of `_GraphDiscoveryMain.py`) is therefore non-degenerate. -/
+theorem zLowFn_le_zHighFn {n N : Nat} (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ)
+    (samples : Fin N → Fin n → ℝ) (hN : 0 < N) :
+    Spec.zLowFn Λ V γ samples ≤ Spec.zHighFn Λ V γ samples := by
+  rw [Spec.zLowFn, Spec.zHighFn]
+  exact kthSmallestFn_mono _ (zLowIdx_le_zHighIdx N) (zHighIdx_lt hN)
+
+/-! ### Tying the `Z_test` verdict back to the kernel chooser -/
+
+/-- **A significant edge is never anomalously noisy.** If the observed `noise` clears the lower tail
+(`noise < Z_low`), it also sits below the upper tail (`noise < Z_high`), because `Z_low ≤ Z_high`. -/
+theorem zSignificant_lt_zHighFn {n N : Nat} (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ)
+    (samples : Fin N → Fin n → ℝ) (hN : 0 < N) {obs : ℝ}
+    (hsig : obs < Spec.zLowFn Λ V γ samples) : obs < Spec.zHighFn Λ V γ samples :=
+  lt_of_lt_of_le hsig (zLowFn_le_zHighFn Λ V γ samples hN)
+
+/-- **The `Z_test` decision feeds the kernel chooser.** When the observed `noise` of the data clears the
+`Z_low` threshold (`zSignificantFn = true`), the single-kernel `MinNoiseKernelChooser` admits the edge —
+returns `some 0`. This connects the statistical layer (`Z_test`) to the discovery decision layer
+(`kernelChooserFn`, proved sound/complete above); the `noise ≤ 1` ceiling the chooser needs is the
+verified `varNoiseFn_le_one`.
-/
+theorem zTest_admits_edge {n N : Nat} {Λ : Fin n → ℝ} (hΛ : ∀ i, 0 ≤ Λ i) {γ : ℝ} (hγ : 0 < γ)
+    (V : Fin n → Fin n → ℝ) (samples : Fin N → Fin n → ℝ) (ga : Fin n → ℝ)
+    (hsig : Spec.zSignificantFn (Spec.varNoiseFn Λ γ (Spec.projFn V ga))
+      (Spec.zLowFn Λ V γ samples) = true) :
+    Spec.kernelChooserFn (fun _ : Fin 1 => Spec.varNoiseFn Λ γ (Spec.projFn V ga))
+      (fun _ : Fin 1 => Spec.zLowFn Λ V γ samples) = some 0 := by
+  have hlt : Spec.varNoiseFn Λ γ (Spec.projFn V ga) < Spec.zLowFn Λ V γ samples := by
+    rw [Spec.zSignificantFn, ltBool_eq_decide] at hsig; exact of_decide_eq_true hsig
+  have hb : ∀ i : Fin 1, (fun _ : Fin 1 => Spec.varNoiseFn Λ γ (Spec.projFn V ga)) i ≤ 1 :=
+    fun _ => varNoiseFn_le_one hΛ hγ _
+  obtain ⟨s, hs, _, _⟩ := kernelChooserFn_eq_some
+    (noises := fun _ : Fin 1 => Spec.varNoiseFn Λ γ (Spec.projFn V ga))
+    (Zlows := fun _ : Fin 1 => Spec.zLowFn Λ V γ samples) hb (v := 0) hlt
+  rw [hs]; exact congrArg some (Fin.fin_one_eq_zero s)
+
 end Spec.Factorization
diff --git a/NN/Spec/Core/Tensor/Factorizations.lean b/NN/Spec/Core/Tensor/Factorizations.lean
index 6181663..27c6bc3 100644
--- a/NN/Spec/Core/Tensor/Factorizations.lean
+++ b/NN/Spec/Core/Tensor/Factorizations.lean
@@ -506,6 +506,73 @@ def varNoiseSpec {n : Nat} (evals : Tensor α (.dim n .scalar))
     (V : Tensor α (.dim n (.dim n .scalar))) (γ : α) (ga : Tensor α (.dim n .scalar)) : α :=
   varNoiseFn (toVecFn evals) γ (projFn (toMatFn V) (toVecFn ga))
 
+/-! ## CHD `Z_test`: null-distribution significance thresholds (`interpolatory.py`)
+
+`varNoiseFn` gives the `noise` of the *observed* data. To decide whether that noise is small *enough*
+to signal a real edge, CHD compares it against the null distribution of the **same** statistic under
+random data (`Z_test`): draw `N` standard-Gaussian samples, score each one's `noise`, sort them, and
+take the 5th and 95th percentiles as `Z_low`/`Z_high`. An edge is significant when the observed
+`noise < Z_low` — strictly below the null's lower tail.
+
+The random draws enter the spec as an explicit family `samples : Fin N → Fin n → α` (row `j` a draw);
+the randomness itself is the caller's, exactly as CHD threads a `jax.random` key into `Z_test`. The
+selection guarantees (each threshold lies in `[0,1]`, and `Z_low ≤ Z_high`) are proved over `ℝ` in
+[`NN.Proofs.Tensor.Basic.FactorizationsDecision`](../../../Proofs/Tensor/Basic/FactorizationsDecision.lean),
+reusing the verified `noise ∈ [0,1]` bound for *every* null sample. -/
+
+/-- The `Z_test` null sample of per-draw `noise` levels: each random draw `samples j` is projected
+(`Pga = Vᵀ·sⱼ`) and scored by the **same** `varNoiseFn` as the data. Mirrors
+`noises = vecdot(Pgas_coeffs, Pgas_coeffs) / vecdot(Pgas_coeffs, Pgas)` in `Z_test`. -/
+def sampleNoisesFn {n N : Nat} (Λ : Fin n → α) (V : Fin n → Fin n → α) (γ : α)
+    (samples : Fin N → Fin n → α) : Fin N → α :=
+  fun j => varNoiseFn Λ γ (projFn V (samples j))
+
+/-- `x ≤ y` as a `Bool` via the `Context` order (`x ≤ y` is `¬ y < x`); the sort key for the order
+statistics below. -/
+def leBool (x y : α) : Bool := !ltBool y x
+
+/-- The `k`-th smallest of a finite family `a : Fin N → α`, by sorting the values (ascending, via the
+`Context` order) and indexing. The `getD … 0` fallback is total; for `k < N` it is a genuine order
+statistic (see `kthSmallestFn_mem`/`_mono` in `FactorizationsDecision`). -/
+def kthSmallestFn {N : Nat} (a : Fin N → α) (k : Nat) : α :=
+  (((List.finRange N).map a).mergeSort leBool).getD k 0
+
+/-- The 5th-percentile index of an `N`-sample null distribution (`int(0.05·N)`, i.e. `⌊N/20⌋`). -/
+def zLowIdx (N : Nat) : Nat := N / 20
+
+/-- The 95th-percentile index of an `N`-sample null distribution (`int(0.95·N)`, i.e. `⌊19·N/20⌋`). -/
+def zHighIdx (N : Nat) : Nat := 19 * N / 20
+
+/-- CHD `Z_low`: the 5th percentile of the null `noise` distribution — the significance threshold an
+observed `noise` must beat (`B_samples[int(0.05·N)]`).
-/
+def zLowFn {n N : Nat} (Λ : Fin n → α) (V : Fin n → Fin n → α) (γ : α)
+    (samples : Fin N → Fin n → α) : α :=
+  kthSmallestFn (sampleNoisesFn Λ V γ samples) (zLowIdx N)
+
+/-- CHD `Z_high`: the 95th percentile of the null `noise` distribution (`B_samples[int(0.95·N)]`). -/
+def zHighFn {n N : Nat} (Λ : Fin n → α) (V : Fin n → Fin n → α) (γ : α)
+    (samples : Fin N → Fin n → α) : α :=
+  kthSmallestFn (sampleNoisesFn Λ V γ samples) (zHighIdx N)
+
+/-- CHD's per-kernel significance verdict: the observed `noise` beats the null's lower tail
+(`noise < Z_low`), i.e. the edge is real. This is exactly the validity test `MinNoiseKernelChooser`
+folds over (`noises < Z_lows`). -/
+def zSignificantFn (noise Zlow : α) : Bool := ltBool noise Zlow
+
+/-- Tensor-level `Z_low` threshold from eigenpairs `(evals, V)`, regularization `γ`, and a family of
+null draws (the rows of `S : Tensor (.dim N (.dim n .scalar))`). -/
+def zLowSpec {n N : Nat} (evals : Tensor α (.dim n .scalar))
+    (V : Tensor α (.dim n (.dim n .scalar))) (γ : α)
+    (S : Tensor α (.dim N (.dim n .scalar))) : α :=
+  zLowFn (toVecFn evals) (toMatFn V) γ (toMatFn S)
+
+/-- Tensor-level `Z_high` threshold from eigenpairs `(evals, V)`, regularization `γ`, and a family of
+null draws (the rows of `S`). -/
+def zHighSpec {n N : Nat} (evals : Tensor α (.dim n .scalar))
+    (V : Tensor α (.dim n (.dim n .scalar))) (γ : α)
+    (S : Tensor α (.dim N (.dim n .scalar))) : α :=
+  zHighFn (toVecFn evals) (toMatFn V) γ (toMatFn S)
+
 /-! ## CHD mode kernels (`Modes/kernels.py`)
 
 Everything above takes the kernel matrix `K` as input, assuming it is symmetric positive-semidefinite.
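Reviewer sketch (not part of the patch): the percentile indexing and the sort-then-index order statistic defined in the hunk above can be exercised in plain Lean 4, without the `Context` typeclass or `Spec.Tensor`. The primed names and the toy data below are hypothetical stand-ins introduced only for illustration; the sketch assumes a toolchain recent enough to ship `List.mergeSort` in core, which the patch itself already relies on.

```lean
-- Illustrative mirrors of `zLowIdx` / `zHighIdx`; the primed names are hypothetical.
def zLowIdx' (N : Nat) : Nat := N / 20
def zHighIdx' (N : Nat) : Nat := 19 * N / 20

-- A list-based order statistic in the spirit of `kthSmallestFn`:
-- sort ascending, then index with a `0` fallback for out-of-range `k`.
def kthSmallest' (xs : List Nat) (k : Nat) : Nat :=
  (xs.mergeSort (fun x y => decide (x ≤ y))).getD k 0

-- Toy "noise" levels in percent for N = 20 null draws (made-up data).
def toyNoises : List Nat :=
  [62, 15, 88, 41, 97, 33, 70, 9, 55, 26, 81, 47, 12, 93, 38, 66, 21, 74, 50, 59]

#eval zLowIdx' 20                            -- 1
#eval zHighIdx' 20                           -- 19
#eval kthSmallest' toyNoises (zLowIdx' 20)   -- 12 (second-smallest value)
#eval kthSmallest' toyNoises (zHighIdx' 20)  -- 97 (largest value)
```

With `N = 20` the lower threshold is the second-smallest draw and the upper threshold is the maximum, which is why the `Discovery` checks expect at most one draw below `Z_low` and none above `Z_high`.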
diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean
index 33f44d6..42db8eb 100644
--- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean
+++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean
@@ -358,6 +358,45 @@ kernel, eigendecomposes it, and runs a `find_gamma`-style sweep that feeds the v
 several `γ` straight into `argMinFn`, selecting the least-noise regularization (the smallest `γ`,
 every swept noise landing in `[0,1]` as proved).
 
+# The `Z_test`: a null-distribution significance threshold
+
+The kernel chooser of the previous section asks whether the observed `noise` falls below a threshold
+`Z_low`. Where does `Z_low` come from? It is not a hand-set constant — it is the *5th percentile of the
+null distribution* of the very same `noise` statistic. CHD's `Z_test` (`interpolatory.py`) draws `N`
+standard-Gaussian samples, scores each one's `noise` with the same `varNoiseFn`, sorts the `N` values,
+and reads off the 5th and 95th percentiles as `Z_low` and `Z_high`. An edge is *significant* — a real
+dependency rather than fitting noise — when the observed `noise` falls below `Z_low`, i.e. strictly
+inside the lower tail of what random data would produce.
+
+[`NN.Proofs.Tensor.Basic.FactorizationsDecision`](https://github.com/lean-dojo/TorchLean/blob/main/NN/Proofs/Tensor/Basic/FactorizationsDecision.lean)
+formalizes this statistical layer. The spec `Spec.zLowFn` / `Spec.zHighFn` mirror `Z_test`: the random
+draws are an explicit family `samples : Fin N → Fin n → α` (the caller's randomness, exactly as CHD
+threads a PRNG key), each is scored by `Spec.sampleNoisesFn` (the *same* `varNoiseFn` again), and the
+percentiles are order statistics `Spec.kthSmallestFn` — the `k`-th entry of the list sorted by the
+`Context` order.
Over `ℝ` that sort key is the real `≤` (`leBool_eq_le`), so Mathlib's
+`sortedLE_mergeSort` supplies sortedness and `mergeSort_perm` supplies membership.
+
+The payoff is that the threshold is *well-posed*, and provably so. The keystone is that the `[0,1]`
+bound governing the data noise governs *every null sample too* — it is the same `varNoiseFn`. So:
+
+- `sampleNoisesFn_nonneg` / `_le_one` — each of the `N` null noises is a genuine fraction in `[0,1]`,
+  directly from `varNoiseFn_nonneg` / `varNoiseFn_le_one`.
+- `zLowFn_nonneg` / `zLowFn_le_one` and the `zHighFn` pair — hence each percentile lies in `[0,1]`,
+  because an order statistic is one of the sampled values (`kthSmallestFn_mem`).
+- `zLowFn_le_zHighFn` — and `Z_low ≤ Z_high`, because a 5th percentile never exceeds a 95th. This is
+  *pure order-statistic monotonicity* (`kthSmallestFn_mono`): the underlying list is sorted ascending
+  and `⌊0.05 N⌋ ≤ ⌊0.95 N⌋`. The comparison window `Z_low ≤ noise ≤ Z_high` the loop uses (the
+  "no anomaly" band of `_GraphDiscoveryMain.py`) is therefore non-degenerate.
+
+Finally `zTest_admits_edge` ties the statistical verdict back to the decision layer: when the observed
+`noise` clears `Z_low` (`zSignificantFn = true`), the single-kernel `MinNoiseKernelChooser` admits the
+edge — returns `some 0`. The `noise ≤ 1` ceiling that proof needs is, once more, the verified
+`varNoiseFn_le_one`. The whole statistical decision thus rests on the one spectral bound proved three
+sections ago. The `Discovery` example exhibits the layer end-to-end: it builds the null distribution
+from a real eigendecomposition, checks `0 ≤ Z_low ≤ Z_high ≤ 1`, shows data aligned with the *dominant*
+eigenvector (smallest shrinkage noise) clears the lower tail and is flagged significant, and confirms a
+high noise — and a noise sitting at the upper tail — are both correctly rejected.
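Reviewer sketch (not part of the patch): the three well-posedness facts listed in this section compose into the single chain `0 ≤ Z_low ≤ Z_high ≤ 1`. The snippet below is only an illustration of how the theorems named in this series would be combined. It assumes the `FactorizationsDecision` module is importable as written, that its declarations are visible to the importing file, and that they live in `namespace Spec.Factorization` with the argument orders shown in the diff.

```lean
import NN.Proofs.Tensor.Basic.FactorizationsDecision

open Spec.Factorization in
-- Sketch: package the three threshold bounds into one statement.
-- Assumes the theorem names and signatures exactly as they appear in this patch.
example {n N : Nat} {Λ : Fin n → ℝ} (hΛ : ∀ i, 0 ≤ Λ i) {γ : ℝ} (hγ : 0 < γ)
    (V : Fin n → Fin n → ℝ) (samples : Fin N → Fin n → ℝ) (hN : 0 < N) :
    0 ≤ Spec.zLowFn Λ V γ samples ∧
      Spec.zLowFn Λ V γ samples ≤ Spec.zHighFn Λ V γ samples ∧
      Spec.zHighFn Λ V γ samples ≤ 1 :=
  ⟨zLowFn_nonneg hΛ hγ V samples hN,
   zLowFn_le_zHighFn Λ V γ samples hN,
   zHighFn_le_one hΛ hγ V samples hN⟩
```

Note that only the outer two bounds consume the spectral hypotheses `hΛ` and `hγ`; the middle inequality is pure order-statistic monotonicity and needs just a nonempty sample.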
+
 # The a-posteriori residual certificate
 
 For the iterative routines, the replacement for an impossible a-priori convergence proof is an exact
@@ -549,11 +588,16 @@ is now PSD-verified for *all three* CHD modes — linear, quadratic, and Gaussia
 the *discovery decision layer* on top — the kernel chooser, the activation prune step, the mode
 chooser, and the stopping rule — is now proved sound and complete, with the chooser's correctness
 resting directly on the verified `noise ≤ 1` bound, so the structural decisions are proved selections
-over a statistic whose range was itself proved.
-
-So the CHD foundation is complete, from the kernel build through the regularized solve and the noise
-statistic up to the graph-structure decisions. The two remaining open items are both narrow and
-deliberately scoped: the cyclic-Jacobi convergence *rate* (captured exactly by the a-posteriori
-residual certificate, never by `sorry`) and the `Z_test`'s Gaussian sampling and percentiles — one a
-proof-only gap on a quantity CHD does not need to *run*, the other statistical rather than algebraic
-and exercised numerically.
+over a statistic whose range was itself proved. The `Z_test` *significance thresholds* are now proved
+well-posed too: `Z_low` and `Z_high` are order statistics of the null `noise` distribution, each
+inheriting the `[0,1]` bound from the shared `varNoiseFn`, with `Z_low ≤ Z_high` by order-statistic
+monotonicity — and the verdict `noise < Z_low` is shown to feed `MinNoiseKernelChooser`.
+
+So the CHD foundation is complete, from the kernel build through the regularized solve, the noise
+statistic, and the `Z_test` thresholds up to the graph-structure decisions.
The two remaining open
+items are both narrow and deliberately scoped: the cyclic-Jacobi convergence *rate* (captured exactly
+by the a-posteriori residual certificate, never by `sorry`), and the *distributional* content of the
+`Z_test` — that the draws are Gaussian and the empirical percentile is a calibrated confidence level
+(a probability-theory statement needing `Mathlib.Probability`, distinct from the now-proved
+order-statistic well-posedness). One is a proof-only gap on a quantity CHD does not need to *run*; the
+other is statistical rather than algebraic and exercised numerically.

From b32e2c550084f889767339217400f92827c79034 Mon Sep 17 00:00:00 2001
From: Nicolas Rouquette
Date: Sun, 31 May 2026 20:54:50 -0700
Subject: [PATCH 18/22] Formalize the CHD Z_test distributional layer
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Close the provable half of the Z_test's distributional content (the part
left open after the well-posed thresholds), in a new sorry-free module
NN/Proofs/Tensor/Basic/FactorizationsZTest.lean:

* Finite-sample calibration (counting, no probability theory): a sorted
  list has at most k entries below its k-th element, so via countP
  permutation-invariance over mergeSort the threshold's own empirical
  false-positive rate is bounded exactly — at most ⌊N/20⌋ ≈ 5% of the N
  null draws fall below Z_low (zLow_null_exceedance_le) and ≈ 5% rise
  above Z_high (zHigh_null_exceedance_le). This is the exact,
  non-asymptotic guarantee.

* The Gaussian null law (measure theory): modelling the draws as i.i.d.
  standard Gaussian (nullGaussian = Measure.pi (gaussianReal 0 1)), the
  per-draw noise is measurable (measurable_noiseMap), so its pushforward
  noiseLaw is a probability measure (IsProbabilityMeasure) concentrated
  on [0,1] (noiseLaw_Icc_eq_one) — the verified varNoiseFn ∈ [0,1] bound
  lifted to the law. sampleNoisesFn_eq_noiseMap ties CHD's executable
  statistic to the model.

Scope honesty: the asymptotic frontier — empirical→true quantile
convergence (Glivenko–Cantelli/DKW) and the exchangeability rank rate
k/(N+1) — needs an empirical-process theory absent from Mathlib v4.30.0,
and is flagged rather than stubbed with sorry.

Discovery.lean adds positive/negative #eval checks (exactly 1/20 below
Z_low, 0 above Z_high, 19/20 below the slack Z_high as a negative
control); the blueprint Ch4 chapter gains a "Z_test distributional layer"
section and the "What remains" note now scopes only the asymptotic half.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 NN/Examples/Factorization.lean                |   9 +-
 NN/Examples/Factorization/Discovery.lean      |  43 ++++
 NN/Proofs/Tensor/Basic.lean                   |   1 +
 .../Tensor/Basic/FactorizationsZTest.lean     | 234 ++++++++++++++++++
 .../Ch4_Verification/Factorizations.lean      |  62 ++++-
 5 files changed, 339 insertions(+), 10 deletions(-)
 create mode 100644 NN/Proofs/Tensor/Basic/FactorizationsZTest.lean

diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean
index 386e356..ec34bbf 100644
--- a/NN/Examples/Factorization.lean
+++ b/NN/Examples/Factorization.lean
@@ -100,7 +100,14 @@ factorization misbehaves.
   `Z_low`/`Z_high` (5th/95th percentiles of the per-sample `noise`) are well-posed
   (`0 ≤ Z_low ≤ Z_high ≤ 1`), data aligned with the dominant eigenvector clears the lower tail
   (`noise < Z_low`, **positive**), and a high noise / a noise at the upper tail are rejected
-  (**negative controls**) — feeding `MinNoiseKernelChooser` exactly as in CHD.
+  (**negative controls**) — feeding `MinNoiseKernelChooser` exactly as in CHD.
A final
+  **distributional** sub-block checks the *finite-sample calibration* proved in
+  `FactorizationsZTest`: across the `N = 20` null draws, at most `⌊N/20⌋ ≈ 5%` fall below `Z_low`
+  (`zLow_null_exceedance_le`, here exactly `1/20`) and at most `≈ 5%` rise above `Z_high`
+  (`zHigh_null_exceedance_le`, here `0`); a **negative control** confirms the slack `Z_high`
+  threshold admits `≈ 95%` of the draws, so the `5%` calibration is specific to `Z_low`. (The
+  companion measure-theoretic fact — the i.i.d.-Gaussian null law is a probability measure on
+  `[0,1]`, `noiseLaw_Icc_eq_one` — is noncomputable and lives in the proofs.)
 
 Both **positive** checks (a valid factorization reconstructs to `err ≈ 0`) and **negative controls**
 (the same metric reports a large error / `NaN` when a hypothesis is violated) are included, so a
diff --git a/NN/Examples/Factorization/Discovery.lean b/NN/Examples/Factorization/Discovery.lean
index 4a40538..2acfdd9 100644
--- a/NN/Examples/Factorization/Discovery.lean
+++ b/NN/Examples/Factorization/Discovery.lean
@@ -250,4 +250,47 @@ def obsSignal : Float := Spec.varNoiseFn (Spec.toVecFn evals) gammaZ (Spec.projF
 #eval assertTrue "non-significant kernel is rejected (chooser → none, code -1)"
   (chooserCode (Spec.kernelChooserFn (fun _ : Fin 1 => (0.50 : Float)) (fun _ : Fin 1 => 0.20)) == -1)
 
+/-! ### The distributional layer: finite-sample calibration of the thresholds
+
+The `noise` of each null draw, scored by the same functional as the data (`sampleNoisesFn`). The
+percentile thresholds carry a *non-asymptotic* false-positive guarantee, proved in
+`FactorizationsZTest`: at most `⌊N/20⌋ ≈ 5%` of the `N` draws fall below `Z_low`
+(`zLow_null_exceedance_le`) and at most `N-1-⌊19N/20⌋ ≈ 5%` fall above `Z_high`
+(`zHigh_null_exceedance_le`). On the measure side, modelling the draws as i.i.d.
standard Gaussian
+makes the null law a probability measure on `[0,1]` (`noiseLaw_Icc_eq_one`); that part is
+noncomputable, so it is exercised by the proofs rather than `#eval`. -/
+
+/-- The per-draw `noise` levels of the `Z_test` null sample (`N = 20` draws). -/
+def zNullNoises : Fin 20 → Float :=
+  Spec.sampleNoisesFn (Spec.toVecFn evals) (Spec.toMatFn V) gammaZ zSamples
+
+/-- How many of the 20 null draws score strictly below a threshold (the empirical lower-tail count,
+using the very `ltBool` comparator the `Z_test` decision uses). -/
+def countBelow (thr : Float) : Nat :=
+  ((List.finRange 20).filter (fun j => Spec.ltBool (zNullNoises j) thr)).length
+
+/-- How many of the 20 null draws score strictly above a threshold (the empirical upper-tail count). -/
+def countAbove (thr : Float) : Nat :=
+  ((List.finRange 20).filter (fun j => Spec.ltBool thr (zNullNoises j))).length
+
+#eval IO.println s!"null-draw tail counts: below Z_low = {countBelow zLow} (≤ ⌊20/20⌋ = {Spec.zLowIdx 20}), \
+  above Z_high = {countAbove zHigh} (≤ 19 - {Spec.zHighIdx 20} = {20 - 1 - Spec.zHighIdx 20}), \
+  below Z_high = {countBelow zHigh}"
+
+-- Positive — `zLow_null_exceedance_le`: at most `⌊N/20⌋` (≈ 5%) of the null draws beat `Z_low`, i.e.
+-- the threshold's own empirical false-positive rate is bounded by the 5th-percentile rank.
+#eval assertTrue "≤ 5% of null draws fall below Z_low (zLow_null_exceedance_le)"
+  (decide (countBelow zLow ≤ Spec.zLowIdx 20))
+
+-- Positive — `zHigh_null_exceedance_le`: at most `N-1-⌊19N/20⌋` (≈ 5%) of the null draws exceed
+-- `Z_high`. With `N = 20`, `Z_high` is the top order statistic, so nothing strictly exceeds it.
+#eval assertTrue "≤ 5% of null draws rise above Z_high (zHigh_null_exceedance_le)"
+  (decide (countAbove zHigh ≤ 20 - 1 - Spec.zHighIdx 20))
+
+-- Negative control — the *slack* upper threshold `Z_high` admits far more than 5% of the null mass
+-- below it (≈ 95%), so the 5% lower-tail calibration is specific to `Z_low`, not an artifact of any
+-- threshold: a test against `Z_high` would over-reject the null.
+#eval assertTrue "Z_high is a slack threshold: > 5% of null draws fall below it (calibration is specific to Z_low)"
+  (decide (Spec.zLowIdx 20 < countBelow zHigh))
+
 end NN.Examples.Factorization.Discovery
diff --git a/NN/Proofs/Tensor/Basic.lean b/NN/Proofs/Tensor/Basic.lean
index 6f094ff..b81aa8a 100644
--- a/NN/Proofs/Tensor/Basic.lean
+++ b/NN/Proofs/Tensor/Basic.lean
@@ -14,6 +14,7 @@ public import NN.Proofs.Tensor.Basic.FactorizationsReconstruction
 public import NN.Proofs.Tensor.Basic.FactorizationsSolve
 public import NN.Proofs.Tensor.Basic.FactorizationsVariational
 public import NN.Proofs.Tensor.Basic.FactorizationsDecision
+public import NN.Proofs.Tensor.Basic.FactorizationsZTest
 public import NN.Proofs.Tensor.Basic.FactorizationsKernels
 public import NN.Proofs.Tensor.Basic.FactorizationsOrthonormal
 public import NN.Proofs.Tensor.Basic.FactorizationsJacobi
diff --git a/NN/Proofs/Tensor/Basic/FactorizationsZTest.lean b/NN/Proofs/Tensor/Basic/FactorizationsZTest.lean
new file mode 100644
index 0000000..775bdce
--- /dev/null
+++ b/NN/Proofs/Tensor/Basic/FactorizationsZTest.lean
@@ -0,0 +1,234 @@
+/-
+Copyright (c) 2026 TorchLean
+Released under MIT license as described in the file LICENSE.
+Authors: TorchLean Team
+-/
+
+module
+
+public import NN.Proofs.Tensor.Basic.FactorizationsDecision
+public import NN.Proofs.Tensor.Basic.FactorizationsOrthonormal
+public import Mathlib.Probability.Distributions.Gaussian.Real
+public import Mathlib.MeasureTheory.Constructions.Pi
+public import Mathlib.MeasureTheory.Measure.Map
+public import Mathlib.MeasureTheory.Measure.Typeclasses.Probability
+
+/-!
+# CHD `Z_test`: the distributional layer (`interpolatory.py`)
+
+[`FactorizationsDecision`](./FactorizationsDecision.lean) proved the `Z_test` thresholds are
+*well-posed* — each of `Z_low`/`Z_high` is a genuine order statistic of the per-sample `noise`,
+lying in `[0,1]`, with `Z_low ≤ Z_high`, and the chooser consumes `Z_low`. That is the
+*deterministic* half of the test. This file closes the **distributional** half, in two pieces that
+are honestly provable over Mathlib v4.30.0:
+
+* **Finite-sample calibration (counting).** The operational meaning of "`Z_low` is the 5th
+  percentile" is that the threshold's *own* empirical false-positive rate is controlled: among the
+  `N` null draws, **at most `⌊N/20⌋ ≈ 5%`** score strictly below `Z_low`
+  (`zLow_null_exceedance_le`), and **at most `N-1-⌊19N/20⌋ ≈ 5%`** score strictly above `Z_high`
+  (`zHigh_null_exceedance_le`). These are exact consequences of order-statistic sortedness — a
+  sorted list has at most `k` entries below its `k`-th element — and need no probability theory.
+
+* **The Gaussian null law (measure theory).** CHD draws the null samples i.i.d. standard Gaussian.
+  We model that draw as `nullGaussian n`, the product of `n` standard normals on `Fin n → ℝ`
+  (`Measure.pi (fun _ => gaussianReal 0 1)`), a genuine probability measure. The per-sample `noise`
+  is a *measurable* map (`measurable_noiseMap`), so its **null law** `noiseLaw` is a probability
+  measure (`IsProbabilityMeasure`) **supported in `[0,1]`** (`noiseLaw_Icc_eq_one`) — the verified
+  `varNoiseFn ∈ [0,1]` bound, lifted to the law.
`sampleNoisesFn_eq_noiseMap` identifies CHD's
+  executable per-draw statistic with this measurable map, tying the counting layer to the measure.
+
+Scope honesty: what remains genuinely *research-grade* (beyond Mathlib v4.30.0) is the
+*asymptotic* calibration — that the empirical 5%/95% percentiles converge to the true quantiles of
+`noiseLaw` (Glivenko–Cantelli / DKW), and that, under exchangeability of a fresh null draw with the
+sample, the false-positive rate is exactly the rank level `k/(N+1)`. Those need an empirical-process
+theory Mathlib does not yet carry; we do not stub them with `sorry`. The finite-sample false-positive
+*bound* proved here is the exact, non-asymptotic statement the test actually guarantees.
+-/
+
+@[expose] public section
+
+namespace Spec.Factorization
+
+open Spec.Factorization.Reconstruction
+open MeasureTheory ProbabilityTheory
+
+variable {n : Nat}
+
+/-! ## Finite-sample calibration: order-statistic tail counts
+
+A sorted list has at most `k` entries strictly below its `k`-th element, and at most
+`length - 1 - k` entries strictly above it. Pushed through the sort-is-a-permutation invariance of
+`countP`, this bounds how many of the `N` null draws fall on the wrong side of a percentile
+threshold — the test's empirical false-positive rate. -/
+
+/-- In an ascending-sorted list, at most `k` entries are strictly below a cutoff `c ≤ s[k]`: every
+entry from index `k` onward is `≥ s[k] ≥ c`, so all sub-`c` entries live in the length-`k` prefix.
-/
+private theorem sortedLE_countP_lt_le {s : List ℝ} (hs : s.SortedLE) {k : Nat}
+    (hk : k < s.length) {c : ℝ} (hc : c ≤ s[k]) :
+    s.countP (fun x => decide (x < c)) ≤ k := by
+  conv_lhs => rw [← List.take_append_drop k s, List.countP_append]
+  have hge : ∀ x ∈ s.drop k, c ≤ x := by
+    intro x hx
+    rw [List.mem_iff_getElem] at hx
+    obtain ⟨i, hi, rfl⟩ := hx
+    rw [List.getElem_drop]
+    exact le_trans hc (hs.getElem_le_getElem_of_le (Nat.le_add_right k i))
+  have hdrop : (s.drop k).countP (fun x => decide (x < c)) = 0 := by
+    rw [List.countP_eq_zero]
+    intro x hx
+    simp only [decide_eq_true_eq]
+    exact not_lt.mpr (hge x hx)
+  have htake : (s.take k).countP (fun x => decide (x < c)) ≤ k := by
+    refine le_trans List.countP_le_length ?_
+    rw [List.length_take]; exact Nat.min_le_left _ _
+  rw [hdrop, Nat.add_zero]; exact htake
+
+/-- In an ascending-sorted list, at most `length - 1 - k` entries are strictly above a cutoff
+`s[k] ≤ c`: every entry up to index `k` is `≤ s[k] ≤ c`, so all super-`c` entries live in the
+length-`(length-(k+1))` suffix.
-/ +private theorem sortedLE_countP_gt_le {s : List ℝ} (hs : s.SortedLE) {k : Nat} + (hk : k < s.length) {c : ℝ} (hc : s[k] ≤ c) : + s.countP (fun x => decide (c < x)) ≤ s.length - 1 - k := by + conv_lhs => rw [← List.take_append_drop (k + 1) s, List.countP_append] + have htake : (s.take (k + 1)).countP (fun x => decide (c < x)) = 0 := by + rw [List.countP_eq_zero] + intro x hx + rw [List.mem_iff_getElem] at hx + obtain ⟨i, hi, rfl⟩ := hx + have hi' : i < k + 1 := by + rw [List.length_take] at hi; exact lt_of_lt_of_le hi (Nat.min_le_left _ _) + rw [List.getElem_take] + simp only [decide_eq_true_eq] + exact not_lt.mpr (le_trans (hs.getElem_le_getElem_of_le (Nat.lt_succ_iff.mp hi')) hc) + have hdrop : (s.drop (k + 1)).countP (fun x => decide (c < x)) ≤ s.length - (k + 1) := by + refine le_trans List.countP_le_length ?_ + rw [List.length_drop] + have heq : s.length - (k + 1) = s.length - 1 - k := by rw [Nat.sub_sub, Nat.add_comm] + rw [htake, Nat.zero_add] + exact le_trans hdrop (le_of_eq heq) + +/-- The ascending-`(· ≤ ·)` mergeSort of the family `a`, whose `k`-th entry is `kthSmallestFn a k`. -/ +private theorem kthSmallestFn_eq_getElem {N : Nat} (a : Fin N → ℝ) {k : Nat} + (hks : k < (((List.finRange N).map a).mergeSort (· ≤ ·)).length) : + Spec.kthSmallestFn a k = (((List.finRange N).map a).mergeSort (· ≤ ·))[k] := by + rw [kthSmallestFn_eq_sorted_getD, List.getD_eq_getElem?_getD, List.getElem?_eq_getElem hks, + Option.getD_some] + +/-- **At most `k` of the family's values are strictly below its `k`-th order statistic.** Sorting is +a permutation (so `countP` is unchanged) and the sorted list has at most `k` entries below `s[k]`. 
-/ +theorem kthSmallestFn_strictBelow_count_le {N : Nat} (a : Fin N → ℝ) {k : Nat} (hk : k < N) : + ((List.finRange N).map a).countP (fun x => decide (x < Spec.kthSmallestFn a k)) ≤ k := by + have hlen : (((List.finRange N).map a).mergeSort (· ≤ ·)).length = N := by + rw [List.length_mergeSort, List.length_map, List.length_finRange] + have hks : k < (((List.finRange N).map a).mergeSort (· ≤ ·)).length := by rw [hlen]; exact hk + rw [kthSmallestFn_eq_getElem a hks, + ← List.Perm.countP_eq _ (List.mergeSort_perm ((List.finRange N).map a) (· ≤ ·))] + exact sortedLE_countP_lt_le List.sortedLE_mergeSort hks (le_refl _) + +/-- **At most `N-1-k` of the family's values are strictly above its `k`-th order statistic.** -/ +theorem kthSmallestFn_strictAbove_count_le {N : Nat} (a : Fin N → ℝ) {k : Nat} (hk : k < N) : + ((List.finRange N).map a).countP (fun x => decide (Spec.kthSmallestFn a k < x)) ≤ N - 1 - k := by + have hlen : (((List.finRange N).map a).mergeSort (· ≤ ·)).length = N := by + rw [List.length_mergeSort, List.length_map, List.length_finRange] + have hks : k < (((List.finRange N).map a).mergeSort (· ≤ ·)).length := by rw [hlen]; exact hk + rw [kthSmallestFn_eq_getElem a hks, + ← List.Perm.countP_eq _ (List.mergeSort_perm ((List.finRange N).map a) (· ≤ ·))] + have hcount := sortedLE_countP_gt_le List.sortedLE_mergeSort hks (le_refl _) + rw [hlen] at hcount + exact hcount + +/-! ### The `Z_test` empirical false-positive bounds -/ + +/-- **`Z_low` controls the lower-tail false-positive rate.** At most `⌊N/20⌋ ≈ 5%` of the `N` null +draws score strictly below the empirical `Z_low` threshold — exactly the rank that defines the 5th +percentile. This is the finite-sample, non-asymptotic guarantee the significance test carries. 
-/ +theorem zLow_null_exceedance_le {n N : Nat} (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) + (samples : Fin N → Fin n → ℝ) (hN : 0 < N) : + ((List.finRange N).map (Spec.sampleNoisesFn Λ V γ samples)).countP + (fun x => decide (x < Spec.zLowFn Λ V γ samples)) ≤ Spec.zLowIdx N := by + rw [Spec.zLowFn] + exact kthSmallestFn_strictBelow_count_le _ (zLowIdx_lt hN) + +/-- **`Z_high` controls the upper-tail false-positive rate.** At most `N-1-⌊19N/20⌋ ≈ 5%` of the `N` +null draws score strictly above the empirical `Z_high` threshold. -/ +theorem zHigh_null_exceedance_le {n N : Nat} (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) + (samples : Fin N → Fin n → ℝ) (hN : 0 < N) : + ((List.finRange N).map (Spec.sampleNoisesFn Λ V γ samples)).countP + (fun x => decide (Spec.zHighFn Λ V γ samples < x)) ≤ N - 1 - Spec.zHighIdx N := by + rw [Spec.zHighFn] + exact kthSmallestFn_strictAbove_count_le _ (zHighIdx_lt hN) + +/-! ## The Gaussian null law + +CHD's `Z_test` draws each null sample i.i.d. standard Gaussian. We model one draw as +`nullGaussian n`: the product of `n` standard normals on `Fin n → ℝ`. The per-sample `noise` is a +measurable map, so its pushforward — the null law of the statistic — is a probability measure +concentrated on `[0,1]`. -/ + +noncomputable section + +/-- The per-draw `noise` statistic as a map on raw draws `s : Fin n → ℝ` (one null sample): +`noiseMap Λ V γ s = varNoiseFn Λ γ (Vᵀ·s)`, the same functional `Z_test` scores each draw with. -/ +noncomputable def noiseMap (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) : (Fin n → ℝ) → ℝ := + fun s => Spec.varNoiseFn Λ γ (Spec.projFn V s) + +/-- CHD's executable per-draw null statistic is exactly `noiseMap` applied to that draw. This bridges +the counting layer (`sampleNoisesFn`) to the measure-theoretic model. 
-/ +theorem sampleNoisesFn_eq_noiseMap {N : Nat} (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) + (samples : Fin N → Fin n → ℝ) (j : Fin N) : + Spec.sampleNoisesFn Λ V γ samples j = noiseMap Λ V γ (samples j) := rfl + +/-- A `dotFn` whose entries each depend measurably on a parameter is measurable in that parameter +(it is the finite sum `∑ₖ f k · g k`). -/ +private theorem measurable_dotFn₂ {β : Type*} [MeasurableSpace β] {f g : β → Fin n → ℝ} + (hf : ∀ k, Measurable (fun b => f b k)) (hg : ∀ k, Measurable (fun b => g b k)) : + Measurable (fun b => Spec.dotFn (f b) (g b)) := by + simp_rw [fun b => dotFn_eq_sum (f b) (g b)] + exact Finset.measurable_sum _ (fun k _ => (hf k).mul (hg k)) + +/-- **The per-draw `noise` statistic is measurable.** It is a ratio of finite sums of products of the +(measurable) draw coordinates, hence Borel-measurable. -/ +theorem measurable_noiseMap (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) : + Measurable (noiseMap Λ V γ) := by + have hproj : ∀ k, Measurable (fun s : Fin n → ℝ => Spec.projFn V s k) := fun k => + measurable_dotFn₂ (fun _ => measurable_const) (fun j => measurable_pi_apply j) + have hpc : ∀ k, Measurable + (fun s : Fin n → ℝ => Spec.projFn V s k * Spec.ridgeCoeffFn Λ γ k) := fun k => + (hproj k).mul measurable_const + exact (measurable_dotFn₂ hpc hpc).div (measurable_dotFn₂ hpc hproj) + +/-- The standard Gaussian draw of a single `Z_test` null sample: `n` i.i.d. standard normals on +`Fin n → ℝ`. A genuine probability measure (the product of probability measures). -/ +noncomputable def nullGaussian (n : Nat) : Measure (Fin n → ℝ) := + Measure.pi (fun _ : Fin n => gaussianReal 0 1) + +instance instIsProbabilityMeasureNullGaussian (n : Nat) : IsProbabilityMeasure (nullGaussian n) := by + unfold nullGaussian; infer_instance + +/-- The **null law** of the `Z_test` statistic: the pushforward of the standard-Gaussian draw under +the per-sample `noise`. This is the distribution `Z_low`/`Z_high` are percentiles of. 
-/ +noncomputable def noiseLaw (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) : Measure ℝ := + (nullGaussian n).map (noiseMap Λ V γ) + +/-- **The null law is a probability measure** (pushforward of one under a measurable map). -/ +instance instIsProbabilityMeasureNoiseLaw (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) : + IsProbabilityMeasure (noiseLaw Λ V γ) := + Measure.isProbabilityMeasure_map (measurable_noiseMap Λ V γ).aemeasurable + +/-- **The null `noise` distribution lives entirely in `[0,1]`.** Every draw's statistic is in `[0,1]` +(the verified `varNoiseFn_nonneg`/`varNoiseFn_le_one`), so the law assigns full mass to `[0,1]` — +the percentiles `Z_low`/`Z_high` are therefore percentiles of a genuine `[0,1]`-valued random +variable. -/ +theorem noiseLaw_Icc_eq_one {Λ : Fin n → ℝ} (hΛ : ∀ i, 0 ≤ Λ i) {γ : ℝ} (hγ : 0 < γ) + (V : Fin n → Fin n → ℝ) : + noiseLaw Λ V γ (Set.Icc 0 1) = 1 := by + rw [noiseLaw, Measure.map_apply (measurable_noiseMap Λ V γ) measurableSet_Icc] + have hpre : noiseMap Λ V γ ⁻¹' Set.Icc 0 1 = Set.univ := by + ext s + simp only [Set.mem_preimage, Set.mem_Icc, Set.mem_univ, iff_true] + exact ⟨varNoiseFn_nonneg hΛ hγ _, varNoiseFn_le_one hΛ hγ _⟩ + rw [hpre, measure_univ] + +end + +end Spec.Factorization diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index 42db8eb..cd962d4 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -218,7 +218,9 @@ through its *spectrum*: replacing the data by `ga = V z` makes `V` cancel, `proj (`projFn_mulVec_self`), so `varNoiseFn Λ γ (projFn V (V z)) = varNoiseFn Λ γ z` (`varNoiseFn_projFn_mulVec`). 
This is the deterministic content of "the `Z_test` null distribution depends only on the eigenvalues"; the *distributional* step — Gaussian sampling and the 5%/95% -percentiles — is statistical rather than algebraic and is left to runtime, exercised numerically. +percentiles — is taken up later (*The `Z_test` distributional layer*), where the finite-sample +false-positive rate is bounded and the i.i.d.-Gaussian null law is shown to be a probability measure +on `[0,1]`, leaving only the asymptotic quantile-consistency to runtime. The `Variational` example confirms all four on a concrete SPD kernel: `(K + γI)·yb = -ga` and `yb = -\texttt{solveRidgeSpec}` to machine precision, `noise ∈ [0,1]`, and the spectral invariance @@ -397,6 +399,41 @@ from a real eigendecomposition, checks `0 ≤ Z_low ≤ Z_high ≤ 1`, shows dat eigenvector (smallest shrinkage noise) clears the lower tail and is flagged significant, and confirms a high noise — and a noise sitting at the upper tail — are both correctly rejected. +# The `Z_test` distributional layer + +The section above proved the thresholds *well-posed*; what it deferred was the *distributional* +question — what `Z_low` being "the 5th percentile" actually buys, and what it means that the draws +are Gaussian. `FactorizationsZTest` closes that gap in two honestly-provable halves. + +*Finite-sample calibration (counting).* The operational promise of a 5th-percentile threshold is a +bound on its *own* false-positive rate: of the `N` null draws, only a `5%` minority should beat it. +That is exactly true, and exact (not asymptotic): in an ascending-sorted list at most `k` entries lie +strictly below the `k`-th, so — since sorting is a permutation and `List.countP` is permutation-invariant +— at most `⌊N/20⌋` of the null noises fall below `Z_low` (`zLow_null_exceedance_le`) and at most +`N-1-⌊19N/20⌋` rise above `Z_high` (`zHigh_null_exceedance_le`). 
These rest on the same sortedness +(`List.sortedLE_mergeSort`) that gave `Z_low ≤ Z_high`, now counted rather than compared. The `Discovery` +example makes the numbers concrete: across `N = 20` draws exactly `1` (`= ⌊20/20⌋`) sits below `Z_low` and +`0` above `Z_high`, while the slack `Z_high` admits `19` — a negative control showing the `5%` calibration +is specific to `Z_low`, not an artifact of any threshold. + +*The Gaussian null law (measure theory).* CHD draws each null sample i.i.d. standard Gaussian. We model +one draw as `nullGaussian n := Measure.pi (fun _ => gaussianReal 0 1)`, the product of `n` standard +normals on `Fin n → ℝ` — a genuine probability measure. The per-draw statistic `noiseMap` (the same +`varNoiseFn ∘ projFn` the data is scored by, identified with CHD's `sampleNoisesFn` by +`sampleNoisesFn_eq_noiseMap`) is *measurable* (`measurable_noiseMap`: a ratio of finite sums of products +of the draw coordinates), so its pushforward `noiseLaw` is a probability measure +(`IsProbabilityMeasure`). And because every draw's noise lies in `[0,1]` — the verified +`varNoiseFn_nonneg` / `varNoiseFn_le_one`, now lifted to the law — that law is *concentrated on `[0,1]`*: +`noiseLaw_Icc_eq_one` shows it assigns full mass to `[0,1]`. So `Z_low`/`Z_high` are percentiles of a +bona fide `[0,1]`-valued random variable, not of an unconstrained sample. + +*What is honestly left.* The remaining step is *asymptotic* calibration — that the empirical 5%/95% +percentiles converge to the true quantiles of `noiseLaw` (Glivenko–Cantelli / DKW), and that under +exchangeability of a fresh null draw with the sample the false-positive rate is exactly the rank level +`k/(N+1)`. Both need an empirical-process theory Mathlib v4.30.0 does not carry, so they are stated as +the open frontier, never stubbed with `sorry`. The finite-sample false-positive *bound* above is the +exact, non-asymptotic statement the test actually guarantees. 
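
*A small check of the counting bound.* As a hedged illustration of the counting argument above (a sketch, not part of `FactorizationsZTest`), the block below evaluates the two rank bounds at `N = 20` and checks the sorted-list fact on a toy list with ties. The names `lowIdx` and `highIdx` are hypothetical stand-ins that assume the `⌊N/20⌋` and `⌊19N/20⌋` ranks quoted in the text; the actual `Spec.zLowIdx` / `Spec.zHighIdx` definitions may be written differently.

```lean
-- Hypothetical stand-ins (assumed, not the repo's definitions) for the ranks
-- ⌊N/20⌋ and ⌊19N/20⌋ quoted in the text.
def lowIdx (N : Nat) : Nat := N / 20
def highIdx (N : Nat) : Nat := 19 * N / 20

-- Lower-tail bound at N = 20: at most ⌊20/20⌋ draws strictly below `Z_low`.
#eval lowIdx 20            -- 1
-- Upper-tail bound at N = 20: at most 20 - 1 - ⌊380/20⌋ draws strictly above `Z_high`.
#eval 20 - 1 - highIdx 20  -- 0

-- The sorted-list fact with ties: in the ascending list [1, 3, 3, 3, 5] the
-- entry at index k = 2 is 3, and only one entry is strictly below it (1 ≤ k = 2).
example : ([1, 3, 3, 3, 5] : List Nat).countP (fun x => decide (x < 3)) ≤ 2 := by decide

-- Symmetric upper count: at most length - 1 - k = 5 - 1 - 2 = 2 entries exceed 3.
example : ([1, 3, 3, 3, 5] : List Nat).countP (fun x => decide (3 < x)) ≤ 5 - 1 - 2 := by decide
```

The tied entries are the reason both statements are inequalities rather than equalities: repeated values equal to the `k`-th order statistic are counted on neither strict side, so the strict-below count can fall short of `k`.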
+ # The a-posteriori residual certificate For the iterative routines, the replacement for an impossible a-priori convergence proof is an exact @@ -591,13 +628,20 @@ resting directly on the verified `noise ≤ 1` bound, so the structural decision over a statistic whose range was itself proved. The `Z_test` *significance thresholds* are now proved well-posed too: `Z_low` and `Z_high` are order statistics of the null `noise` distribution, each inheriting the `[0,1]` bound from the shared `varNoiseFn`, with `Z_low ≤ Z_high` by order-statistic -monotonicity — and the verdict `noise < Z_low` is shown to feed `MinNoiseKernelChooser`. +monotonicity — and the verdict `noise < Z_low` is shown to feed `MinNoiseKernelChooser`. The +*distributional* layer of the `Z_test` is now partly proved too: the threshold's finite-sample +false-positive rate is bounded exactly (`≤ 5%` of the null draws beat `Z_low`, +`zLow_null_exceedance_le`; symmetrically for `Z_high`), and — modelling the draws as i.i.d. standard +Gaussian — the null `noise` law is a genuine probability measure concentrated on `[0,1]` +(`noiseLaw_Icc_eq_one`). So the CHD foundation is complete, from the kernel build through the regularized solve, the noise -statistic, and the `Z_test` thresholds up to the graph-structure decisions. The two remaining open -items are both narrow and deliberately scoped: the cyclic-Jacobi convergence *rate* (captured exactly -by the a-posteriori residual certificate, never by `sorry`), and the *distributional* content of the -`Z_test` — that the draws are Gaussian and the empirical percentile is a calibrated confidence level -(a probability-theory statement needing `Mathlib.Probability`, distinct from the now-proved -order-statistic well-posedness). One is a proof-only gap on a quantity CHD does not need to *run*; the -other is statistical rather than algebraic and exercised numerically. +statistic, and the `Z_test` thresholds up to the graph-structure decisions. 
The remaining open items +are both narrow and deliberately scoped: the cyclic-Jacobi convergence *rate* (captured exactly by the +a-posteriori residual certificate, never by `sorry`), and the *asymptotic* half of the `Z_test` — that +the empirical 5%/95% percentiles converge to the true quantiles of the now-proved null law +(Glivenko–Cantelli / DKW), and that an exchangeable fresh draw is rejected at exactly rank rate +`k/(N+1)`. That needs an empirical-process theory `Mathlib.Probability` does not yet carry, distinct +from the finite-sample false-positive bound and probability-measure facts already proved. One is a +proof-only gap on a quantity CHD does not need to *run*; the other is the genuine statistical frontier, +flagged rather than stubbed with `sorry`. From 05ae7235a89f272f5dce2b7439c2f5d169198f3d Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sun, 31 May 2026 21:49:06 -0700 Subject: [PATCH 19/22] Z_test asymptotic calibration step (a): the i.i.d.-draws scaffold MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Lift the single null draw (FactorizationsZTest) to the i.i.d. *sequence* `nullSeqGaussian n := Measure.infinitePi (fun _ : ℕ => nullGaussian n)` on `ℕ → (Fin n → ℝ)` — the infinite product, since `Measure.pi` is finite-index only — and `nullNoise Λ V γ i ω := noiseMap Λ V γ (ω i)`, the same measurable statistic read off coordinate `i`. Proven sorry-free in the new `NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean`: - nullNoise_iIndepFun (← iIndepFun_infinitePi), pairwise nullNoise_pairwise_indepFun in the Pairwise (·⟂ᵢ[μ]·) shape SLLN takes; - nullNoise_hasLaw / nullNoise_identDistrib — each draw has the common law noiseLaw (via measurePreserving_eval_infinitePi + HasLaw.comp); - nullNoise_mem_Icc ([0,1]-valued) and integrable_nullNoise. 
That is exactly the hint/hindep/hident triple `strong_law_ae_real` and the Hoeffding tail consume, so plan steps (b)–(d) become applications of an in-place scaffold. The *uniform* GC / DKW–Massart sharp constant and the exchangeability rank rate k/(N+1) stay research-grade (flagged, not sorry'd). Discovery.lean exercises the computable shadow — the empirical CDF F̂_N(t) = #{i --- NN/Examples/Factorization.lean | 10 +- NN/Examples/Factorization/Discovery.lean | 51 +++++++ NN/Proofs/Tensor/Basic.lean | 1 + .../Basic/FactorizationsZAsymptotic.lean | 133 ++++++++++++++++++ .../Ch4_Verification/Factorizations.lean | 45 ++++-- 5 files changed, 229 insertions(+), 11 deletions(-) create mode 100644 NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean index ec34bbf..e2376fc 100644 --- a/NN/Examples/Factorization.lean +++ b/NN/Examples/Factorization.lean @@ -107,7 +107,15 @@ factorization misbehaves. (`zHigh_null_exceedance_le`, here `0`); a **negative control** confirms the slack `Z_high` threshold admits `≈ 95%` of the draws, so the `5%` calibration is specific to `Z_low`. (The companion measure-theoretic fact — the i.i.d.-Gaussian null law is a probability measure on - `[0,1]`, `noiseLaw_Icc_eq_one` — is noncomputable and lives in the proofs.) + `[0,1]`, `noiseLaw_Icc_eq_one` — is noncomputable and lives in the proofs.) A closing + **asymptotic-scaffold** sub-block corroborates `FactorizationsZAsymptotic` (step (a) of the + asymptotic-calibration plan): the i.i.d. null *sequence* `nullNoise` is proven independent, + identically distributed with law `noiseLaw`, `[0,1]`-valued and integrable (the SLLN's + `hint`/`hindep`/`hident`) — noncomputable, so the `#eval`s exercise its **computable shadow**, the + empirical CDF `F̂_N(t) = #{i 5% of null draws fall below it (calibration is specific to Z_low)" (decide (Spec.zLowIdx 20 < countBelow zHigh)) +/-! 
### The asymptotic-calibration scaffold (step a): the empirical CDF of the null sample + +`FactorizationsZAsymptotic` lifts the single null draw to the i.i.d. *sequence* `nullNoise` under the +product measure `nullSeqGaussian`, proving it independent (`nullNoise_iIndepFun`), identically +distributed with the common law `noiseLaw` (`nullNoise_hasLaw`, `nullNoise_identDistrib`), +`[0,1]`-valued (`nullNoise_mem_Icc`) and integrable (`integrable_nullNoise`) — exactly the three +hypotheses (`hint`/`hindep`/`hident`) the strong law of large numbers consumes. That scaffold is +*noncomputable* (a statement about an infinite product measure), so it cannot be `#eval`'d; what we +exercise here is its **computable shadow**, the empirical CDF of the finite null sample +`F̂_N(t) = #{i < N : noiseᵢ ≤ t} / N`. This is the very object whose almost-sure convergence to +`cdf noiseLaw` *is* the SLLN application (step b of the plan, not yet formalized). At step (a) the +i.i.d. sample alone already gives that `F̂_N` is a bona fide CDF — monotone, valued in `[0,1]`, +saturating to `1` above the support and vanishing below it — which is what we check. -/ + +/-- Empirical CDF of the `N = 20` null noises at a threshold `t`: the fraction of draws scoring `≤ t` +(using the `leBool` comparator the order statistics already use). The computable shadow of the +noncomputable `empCDF` whose consistency is step (b). -/ +def empCdf (t : Float) : Float := + (((List.finRange 20).filter (fun j => Spec.leBool (zNullNoises j) t)).length).toFloat / 20.0 + +#eval IO.println s!"empirical CDF of the null sample: F(0) = {empCdf 0.0}, F(Z_low) = {empCdf zLow}, \ + F(Z_high) = {empCdf zHigh}, F(1) = {empCdf 1.0}" + +-- Positive — `F̂` is valued in `[0,1]` at every threshold (it is a fraction of the 20 draws), the +-- finite-sample image of `nullNoise_mem_Icc` / the law `noiseLaw` being a probability measure. 
+#eval assertTrue "empirical CDF lies in [0,1] across thresholds" + ([0.0, zLow, zHigh, 0.5, 1.0].all + (fun t => Spec.leBool 0.0 (empCdf t) && Spec.leBool (empCdf t) 1.0)) + +-- Positive — `F̂` is monotone nondecreasing: more of the sample falls below a larger threshold. +-- Since `Z_low ≤ Z_high`, `F̂(Z_low) ≤ F̂(Z_high)` — the empirical shadow of `monotone_cdf`. +#eval assertTrue "empirical CDF is monotone: Z_low ≤ Z_high ⇒ F(Z_low) ≤ F(Z_high)" + (Spec.leBool (empCdf zLow) (empCdf zHigh)) + +-- Positive — `F̂` saturates to `1`: every null noise lies in `[0,1]` (`nullNoise_mem_Icc`), so all +-- 20 draws score `≤ 1` and the empirical CDF reaches its full mass there. +#eval assertTrue "empirical CDF reaches 1 at t = 1 (all null noises ≤ 1, nullNoise_mem_Icc)" + (empCdf 1.0 == 1.0) + +-- Positive — `F̂` vanishes below the support: no null noise is negative (`nullNoise_mem_Icc`), so +-- none scores `≤` a negative `t`. +#eval assertTrue "empirical CDF is 0 below the support (no null noise < 0)" + (empCdf (-0.01) == 0.0) + +-- Negative control — `F̂` is *not* the constant function: it genuinely rises from `0` to `1` across +-- the support, so it carries the distributional content the i.i.d. scaffold formalizes. A degenerate +-- (point-mass) sample would have a flat-then-jump CDF; a sample with no spread would not separate +-- these thresholds. This is what makes the consistency target of step (b) non-vacuous. 
+#eval assertTrue "empirical CDF is non-degenerate: F(below support) < F(1) (carries distribution info)" + (Spec.ltBool (empCdf (-0.01)) (empCdf 1.0)) + end NN.Examples.Factorization.Discovery diff --git a/NN/Proofs/Tensor/Basic.lean b/NN/Proofs/Tensor/Basic.lean index b81aa8a..44fb923 100644 --- a/NN/Proofs/Tensor/Basic.lean +++ b/NN/Proofs/Tensor/Basic.lean @@ -15,6 +15,7 @@ public import NN.Proofs.Tensor.Basic.FactorizationsSolve public import NN.Proofs.Tensor.Basic.FactorizationsVariational public import NN.Proofs.Tensor.Basic.FactorizationsDecision public import NN.Proofs.Tensor.Basic.FactorizationsZTest +public import NN.Proofs.Tensor.Basic.FactorizationsZAsymptotic public import NN.Proofs.Tensor.Basic.FactorizationsKernels public import NN.Proofs.Tensor.Basic.FactorizationsOrthonormal public import NN.Proofs.Tensor.Basic.FactorizationsJacobi diff --git a/NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean b/NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean new file mode 100644 index 0000000..0076404 --- /dev/null +++ b/NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean @@ -0,0 +1,133 @@ +/- +Copyright (c) 2026 TorchLean +Released under MIT license as described in the file LICENSE. +Authors: TorchLean Team +-/ + +module + +public import NN.Proofs.Tensor.Basic.FactorizationsZTest +public import Mathlib.Probability.Independence.InfinitePi +public import Mathlib.MeasureTheory.Integral.IntegrableOn + +/-! +# CHD `Z_test`: the asymptotic-calibration scaffold (step a) + +[`FactorizationsZTest`](./FactorizationsZTest.lean) modelled a *single* `Z_test` null draw as +`nullGaussian n` (the product of `n` standard normals on `Fin n → ℝ`) and proved the per-draw +`noise` statistic measurable, with null law `noiseLaw` a probability measure on `[0,1]`. That is +enough for the finite-sample false-positive bound, but the *asymptotic* calibration — +empirical 5%/95% percentiles converging to the true quantiles of `noiseLaw` — needs the whole +**i.i.d. 
sequence** of null draws, not one of them. + +This file builds that sequence and proves it i.i.d.: the scaffold the asymptotic statements +(Glivenko–Cantelli via the SLLN, the Hoeffding per-`t` rate) are applications of. Concretely: + +* **The sequence measure.** `nullSeqGaussian n := Measure.infinitePi (fun _ : ℕ => nullGaussian n)` + on `ℕ → (Fin n → ℝ)` — countably many independent copies of one null draw, a genuine probability + measure (`instIsProbabilityMeasureNullSeqGaussian`). + +* **The `i`-th draw's statistic.** `nullNoise Λ V γ i ω := noiseMap Λ V γ (ω i)` — the same + measurable `noiseMap` from `FactorizationsZTest`, read off the `i`-th coordinate. + +* **i.i.d.** The coordinate evaluations are independent under the product measure, and composing + with the measurable `noiseMap` preserves it (`nullNoise_iIndepFun`, and its pairwise corollary + `nullNoise_pairwise_indepFun` in the exact shape `strong_law_ae_real` consumes). Each draw is + measure-preservingly the same standard-Gaussian draw, so each has the *same* law `noiseLaw` + (`nullNoise_hasLaw`, `nullNoise_identDistrib`). Every draw's noise lies in `[0,1]` + (`nullNoise_mem_Icc`), hence is integrable (`integrable_nullNoise`). + +So `nullNoise` is an i.i.d. real sequence, each with law `noiseLaw`, valued in `[0,1]` and +integrable — exactly the three hypotheses (`hint`/`hindep`/`hident`) the strong law of large +numbers and the Hoeffding tail take. This scaffold is the only genuinely *new* measure-theory +plumbing; the empirical-CDF consistency and concentration statements (steps b–d of the plan) are +applications of it, and the *uniform* Glivenko–Cantelli / DKW–Massart sharp constant and the +exchangeability rank rate remain genuinely research-grade (flagged, never `sorry`'d). +-/ + +@[expose] public section + +namespace Spec.Factorization + +open MeasureTheory ProbabilityTheory + +variable {n : Nat} + +noncomputable section + +/-- The i.i.d. 
null-draw sequence: countably many independent standard-Gaussian draws, one per +`Z_test` null sample. The product of probability measures, hence itself a probability measure. -/ +noncomputable def nullSeqGaussian (n : Nat) : Measure (ℕ → Fin n → ℝ) := + Measure.infinitePi (fun _ : ℕ => nullGaussian n) + +instance instIsProbabilityMeasureNullSeqGaussian (n : Nat) : + IsProbabilityMeasure (nullSeqGaussian n) := by + unfold nullSeqGaussian; infer_instance + +/-- The `i`-th null draw's `noise` statistic: `noiseMap` applied to the `i`-th coordinate of the +i.i.d. sequence. As `i` ranges over `ℕ` this is the i.i.d. real sequence the asymptotic calibration +runs on. -/ +noncomputable def nullNoise (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) : + ℕ → (ℕ → Fin n → ℝ) → ℝ := + fun i ω => noiseMap Λ V γ (ω i) + +/-- Each draw's `noise` is measurable: the measurable `noiseMap` composed with a coordinate +projection. -/ +theorem measurable_nullNoise (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (i : ℕ) : + Measurable (nullNoise Λ V γ i) := + (measurable_noiseMap Λ V γ).comp (measurable_pi_apply i) + +/-- **The null-noise sequence is independent.** The coordinate evaluations of the product measure +are independent (`iIndepFun_infinitePi`), and composing each with the measurable `noiseMap` +preserves independence. -/ +theorem nullNoise_iIndepFun (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) : + iIndepFun (nullNoise Λ V γ) (nullSeqGaussian n) := + iIndepFun_infinitePi (fun _ => measurable_noiseMap Λ V γ) + +/-- The pairwise-independence corollary, in the exact `Pairwise (· ⟂ᵢ[μ] ·) on X` shape the strong +law of large numbers (`strong_law_ae_real`) consumes for its `hindep` hypothesis. 
-/ +theorem nullNoise_pairwise_indepFun (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) : + Pairwise (Function.onFun (· ⟂ᵢ[nullSeqGaussian n] ·) (nullNoise Λ V γ)) := + fun _ _ hij => (nullNoise_iIndepFun Λ V γ).indepFun hij + +/-- **Each draw has the same law, `noiseLaw`.** The `i`-th coordinate projection is measure- +preserving from the product measure onto a single `nullGaussian n` draw, and composing with the +measurable `noiseMap` pushes that law forward to `noiseLaw` — independently of `i`. -/ +theorem nullNoise_hasLaw (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (i : ℕ) : + HasLaw (nullNoise Λ V γ i) (noiseLaw Λ V γ) (nullSeqGaussian n) := by + have hEval := (measurePreserving_eval_infinitePi (fun _ : ℕ => nullGaussian n) i).hasLaw + have hNoise : HasLaw (noiseMap Λ V γ) (noiseLaw Λ V γ) (nullGaussian n) := + { aemeasurable := (measurable_noiseMap Λ V γ).aemeasurable + map_eq := rfl } + exact hNoise.fun_comp hEval + +/-- **The null-noise sequence is identically distributed.** Every draw has the common law +`noiseLaw`, so any two are identically distributed — the `hident` hypothesis of the strong law, +stated against the `0`-th draw. -/ +theorem nullNoise_identDistrib (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (i : ℕ) : + IdentDistrib (nullNoise Λ V γ i) (nullNoise Λ V γ 0) (nullSeqGaussian n) (nullSeqGaussian n) where + aemeasurable_fst := (measurable_nullNoise Λ V γ i).aemeasurable + aemeasurable_snd := (measurable_nullNoise Λ V γ 0).aemeasurable + map_eq := by rw [(nullNoise_hasLaw Λ V γ i).map_eq, (nullNoise_hasLaw Λ V γ 0).map_eq] + +/-- **Every draw's noise lies in `[0,1]`**, pointwise — the verified `varNoiseFn` bound applied to +each coordinate. 
-/ +theorem nullNoise_mem_Icc {Λ : Fin n → ℝ} (hΛ : ∀ i, 0 ≤ Λ i) {γ : ℝ} (hγ : 0 < γ) + (V : Fin n → Fin n → ℝ) (i : ℕ) (ω : ℕ → Fin n → ℝ) : + nullNoise Λ V γ i ω ∈ Set.Icc (0 : ℝ) 1 := + Set.mem_Icc.mpr ⟨varNoiseFn_nonneg hΛ hγ _, varNoiseFn_le_one hΛ hγ _⟩ + +/-- **Each draw's noise is integrable** (bounded in `[0,1]` on the probability space) — the `hint` +hypothesis of the strong law. -/ +theorem integrable_nullNoise {Λ : Fin n → ℝ} (hΛ : ∀ i, 0 ≤ Λ i) {γ : ℝ} (hγ : 0 < γ) + (V : Fin n → Fin n → ℝ) (i : ℕ) : + Integrable (nullNoise Λ V γ i) (nullSeqGaussian n) := + Integrable.of_bound (measurable_nullNoise Λ V γ i).aestronglyMeasurable 1 + (ae_of_all _ fun ω => by + have h := Set.mem_Icc.mp (nullNoise_mem_Icc hΛ hγ V i ω) + rw [Real.norm_eq_abs, abs_le] + exact ⟨by linarith [h.1], h.2⟩) + +end + +end Spec.Factorization diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index cd962d4..5fd0c82 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -427,12 +427,31 @@ of the draw coordinates), so its pushforward `noiseLaw` is a probability measure `noiseLaw_Icc_eq_one` shows it assigns full mass to `[0,1]`. So `Z_low`/`Z_high` are percentiles of a bona fide `[0,1]`-valued random variable, not of an unconstrained sample. -*What is honestly left.* The remaining step is *asymptotic* calibration — that the empirical 5%/95% -percentiles converge to the true quantiles of `noiseLaw` (Glivenko–Cantelli / DKW), and that under -exchangeability of a fresh null draw with the sample the false-positive rate is exactly the rank level -`k/(N+1)`. Both need an empirical-process theory Mathlib v4.30.0 does not carry, so they are stated as -the open frontier, never stubbed with `sorry`. 
The finite-sample false-positive *bound* above is the -exact, non-asymptotic statement the test actually guarantees. +*The i.i.d. scaffold for the asymptotic step.* `FactorizationsZAsymptotic` takes the first concrete +step toward that asymptotic calibration — the *pointwise* half, which a survey of +`Mathlib.Probability` v4.30.0 shows is in fact assemblable sorry-free (a re-scope from the earlier +"absent from Mathlib" note). It lifts the single null draw to the i.i.d. *sequence* +`nullSeqGaussian n := Measure.infinitePi (fun _ : ℕ => nullGaussian n)` on `ℕ → (Fin n → ℝ)`, and +defines `nullNoise Λ V γ i ω := noiseMap Λ V γ (ω i)` — the same measurable `noiseMap`, read off the +`i`-th coordinate. The coordinate projections are independent under the product measure +(`iIndepFun_infinitePi`) and composing with `noiseMap` preserves it (`nullNoise_iIndepFun`, with the +pairwise corollary `nullNoise_pairwise_indepFun`); each projection is measure-preservingly one +standard-Gaussian draw, so every `nullNoise i` has the *same* law `noiseLaw` (`nullNoise_hasLaw`, +`nullNoise_identDistrib`); and every draw lies in `[0,1]` (`nullNoise_mem_Icc`) hence is integrable +(`integrable_nullNoise`). That is exactly the i.i.d.-bounded-integrable triple — `hint`, `hindep`, +`hident` — that the strong law of large numbers (`strong_law_ae_real`) and the Hoeffding tail consume. +This scaffold is the only genuinely new measure-theory plumbing; the empirical-CDF consistency +(Glivenko–Cantelli via the SLLN) and the per-`t` concentration rate `2 exp(-2 N ε²)` (Hoeffding) are +applications of it, deferred as the next increment. 
+ +*What is honestly left.* What stays genuinely research-grade is the *uniform* Glivenko–Cantelli +(`sup_t |F̂_N - cdf| → 0`) and the full *DKW–Massart* inequality with its sharp constant `2` over +the supremum — both need the bracketing / VC-class chaining Mathlib v4.30.0 lacks — and the +*exchangeability rank rate* `k/(N+1)` for a fresh null draw, which needs a symmetric-group +rank-distribution argument also absent. Those are stated as the open frontier, never stubbed with +`sorry`. The finite-sample false-positive *bound* above is the exact, non-asymptotic statement the +test actually guarantees, and the pointwise scaffold is the sorry-free bridge toward the asymptotic +statement. # The a-posteriori residual certificate @@ -641,7 +660,13 @@ are both narrow and deliberately scoped: the cyclic-Jacobi convergence *rate* (c a-posteriori residual certificate, never by `sorry`), and the *asymptotic* half of the `Z_test` — that the empirical 5%/95% percentiles converge to the true quantiles of the now-proved null law (Glivenko–Cantelli / DKW), and that an exchangeable fresh draw is rejected at exactly rank rate -`k/(N+1)`. That needs an empirical-process theory `Mathlib.Probability` does not yet carry, distinct -from the finite-sample false-positive bound and probability-measure facts already proved. One is a -proof-only gap on a quantity CHD does not need to *run*; the other is the genuine statistical frontier, -flagged rather than stubbed with `sorry`. +`k/(N+1)`. The *pointwise* part of that asymptotic step is no longer fully out of reach: its i.i.d. +scaffold is now built and proved sorry-free (`FactorizationsZAsymptotic` — `nullNoise` an independent, +identically-`noiseLaw`-distributed, `[0,1]`-valued, integrable sequence under +`Measure.infinitePi nullGaussian`), exactly the hypotheses the strong law of large numbers and the +Hoeffding tail take, so the empirical-CDF consistency and per-`t` rate are now applications rather than +frontier. 
What stays genuinely research-grade is the *uniform* Glivenko–Cantelli / DKW–Massart sharp +constant (bracketing / VC chaining) and the exchangeability rank rate `k/(N+1)` +(symmetric-group rank distribution) — both absent from `Mathlib.Probability` v4.30.0. One open item is +a proof-only gap on a quantity CHD does not need to *run*; the other is the genuine statistical +frontier, flagged rather than stubbed with `sorry`. From d37099105d0823d279bb13658ae42f06ebedbb87 Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Sun, 31 May 2026 22:08:04 -0700 Subject: [PATCH 20/22] =?UTF-8?q?Z=5Ftest=20asymptotic=20calibration=20ste?= =?UTF-8?q?p=20(b):=20pointwise=20Glivenko=E2=80=93Cantelli=20via=20the=20?= =?UTF-8?q?SLLN?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Build on the step-(a) i.i.d. scaffold (FactorizationsZAsymptotic) to prove the pointwise empirical-CDF consistency theorem, sorry-free over Mathlib v4.30.0: - nullBelow Λ V γ t i ω := (Set.Iic t).indicator 1 (nullNoise Λ V γ i ω), the threshold indicators 1{noiseᵢ ≤ t}, and empCDF, their normalized prefix sum. - nullBelow_pairwise_indepFun / nullBelow_identDistrib / integrable_nullBelow: the indicators inherit the scaffold's i.i.d.-bounded-integrable structure (indicator composed onto each independent, identically-distributed draw). - integral_nullBelow_zero: their common mean is exactly cdf (noiseLaw Λ V γ) t (HasLaw.integral_comp + integral_indicator_one + cdf_eq_real), so empCDF is the Monte-Carlo estimator of the null CDF. - empCDF_tendsto_cdf: strong_law_ae_real on the hint/hindep/hident triple gives ∀ᵐ ω, Tendsto (fun N => empCDF … N t ω) atTop (𝓝 (cdf noiseLaw t)) — pointwise Glivenko–Cantelli at each fixed t. 
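Illustration only: a minimal sketch of the running mean that empCDF formalizes, on a made-up 4-draw sample. The helper name runningEcdf and the sample values are assumptions introduced for this note and are not identifiers from the patch; the sketch uses Lean core only, no Mathlib.

    /-- Hypothetical helper: fraction of the first `N` entries of `xs` that are `≤ t`. -/
    def runningEcdf (xs : List Float) (N : Nat) (t : Float) : Float :=
      ((xs.take N).filter (fun x => decide (x ≤ t))).length.toFloat / N.toFloat

    -- Made-up sample; at t = 0.5 the indicators 1{x ≤ t} are 1, 0, 1, 0.
    #eval ([1, 2, 3, 4] : List Nat).map (fun N => runningEcdf [0.2, 0.7, 0.4, 0.9] N 0.5)

By hand the prefix means are 1, 1/2, 2/3, 1/2. Each is a normalized indicator sum, the same shape as empCDF Λ V γ N t ω, and the strong law describes what this sequence does as N → ∞.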
Discovery.lean adds the matching #eval block exercising the computable shadow: the growing-prefix running mean F̂_N settling toward the full-sample estimate of cdf noiseLaw t (each prefix a valid [0,1] CDF value, cdf 1 = 1 attained at every N, with a non-degeneracy negative control). Aggregate docstring and the Ch4 blueprint updated to describe step (b); the uniform GC / DKW–Massart sharp constant and exchangeability rank rate remain flagged research-grade (no sorry). Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Examples/Factorization.lean | 8 ++ NN/Examples/Factorization/Discovery.lean | 49 ++++++++++ .../Basic/FactorizationsZAsymptotic.lean | 98 ++++++++++++++++++- .../Ch4_Verification/Factorizations.lean | 30 ++++-- 4 files changed, 178 insertions(+), 7 deletions(-) diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean index e2376fc..0244774 100644 --- a/NN/Examples/Factorization.lean +++ b/NN/Examples/Factorization.lean @@ -116,6 +116,14 @@ factorization misbehaves. monotone, saturating to `1` at the top of the `[0,1]` support, vanishing below `0`), with a **negative control** that it is non-degenerate (rises strictly from `0` to `1`, carrying the distributional content whose convergence to `cdf noiseLaw` is the next increment, step (b)). + A final **consistency** sub-block corroborates `empCDF_tendsto_cdf` (step (b)): the empirical CDF + is the SLLN *running mean* of the bounded i.i.d. indicators `1{noiseᵢ ≤ t}`, whose mean is exactly + `cdf noiseLaw t` (`integral_nullBelow_zero`), so almost surely `F̂_N(t) → cdf noiseLaw t` (pointwise + Glivenko–Cantelli). 
The limit needs `N → ∞`, so the `#eval`s watch the **growing-prefix running + mean** `F̂_N` settle toward the full-sample estimate: each prefix is a valid `[0,1]` CDF value + (bounded summands), the limit value `cdf 1 = 1` is attained at every `N`, and a **negative control** + confirms the estimate genuinely moves with `N` (an early prefix differs from the full sample), so + the convergence is a real limit being approached rather than a vacuous constant. Both **positive** checks (a valid factorization reconstructs to `err ≈ 0`) and **negative controls** (the same metric reports a large error / `NaN` when a hypothesis is violated) are included, so a diff --git a/NN/Examples/Factorization/Discovery.lean b/NN/Examples/Factorization/Discovery.lean index 7139904..5b36f4b 100644 --- a/NN/Examples/Factorization/Discovery.lean +++ b/NN/Examples/Factorization/Discovery.lean @@ -344,4 +344,53 @@ def empCdf (t : Float) : Float := #eval assertTrue "empirical CDF is non-degenerate: F(below support) < F(1) (carries distribution info)" (Spec.ltBool (empCdf (-0.01)) (empCdf 1.0)) +/-! ### Pointwise consistency of the empirical CDF (step b): the SLLN running mean + +`FactorizationsZAsymptotic` now proves `empCDF_tendsto_cdf`: for each threshold `t`, the empirical +CDF `empCDF Λ V γ N t` of the i.i.d. null draws converges *almost surely* to the true CDF +`cdf noiseLaw t` as `N → ∞` — the pointwise Glivenko–Cantelli theorem, via the strong law of large +numbers (`strong_law_ae_real`) applied to the bounded i.i.d. indicators `1{noiseᵢ ≤ t}`. The limit +value is pinned by `integral_nullBelow_zero`: the *mean* of the indicator is exactly +`cdf noiseLaw t`, so the empirical CDF is the Monte-Carlo estimator of the null CDF. 
Convergence +needs `N → ∞`, so it is not directly `#eval`-able; the computable shadow is the **running mean** +`F̂_N(t) = (1/N)·#{i < N : noiseᵢ ≤ t}` over growing prefixes `N` of the 20-draw sample, which we +watch settle toward the full-sample estimate `empCdf t` of `cdf noiseLaw t`. -/ + +/-- Running empirical CDF over the first `N ≤ 20` null draws at threshold `t`: the partial mean of the +indicator sequence `1{noiseᵢ ≤ t}` that the strong law averages. As `N → ∞` this is precisely the +quantity `empCDF_tendsto_cdf` sends to `cdf noiseLaw t`; here we watch its growing-`N` prefixes. -/ +def empCdfPrefix (N : Nat) (t : Float) : Float := + (((List.finRange 20).filter + (fun j => decide (j.val < N) && Spec.leBool (zNullNoises j) t)).length).toFloat / N.toFloat + +#eval IO.println s!"running empirical CDF at t = 0.057 (mid-support) over growing prefixes: \ + F̂_5 = {empCdfPrefix 5 0.057}, F̂_10 = {empCdfPrefix 10 0.057}, F̂_15 = {empCdfPrefix 15 0.057}, \ + F̂_20 = {empCdfPrefix 20 0.057} (→ empCdf 0.057 = {empCdf 0.057}, the estimate of cdf noiseLaw 0.057)" + +-- Positive — the full prefix `N = 20` *is* the empirical CDF: `empCDF Λ V γ 20 t` evaluated on this +-- sample. The running mean and the count-fraction `empCdf` coincide at `N = 20` (the shadow of the +-- `empCDF` definition as a normalized indicator sum). +#eval assertTrue "running mean at N = 20 equals the empirical CDF (empCDF is the normalized indicator sum)" + ([0.0, zLow, 0.5, zHigh, 1.0].all (fun t => empCdfPrefix 20 t == empCdf t)) + +-- Positive — every running prefix is a valid CDF value in `[0,1]`: a mean of the `[0,1]`-valued +-- indicators stays in `[0,1]` (the `integrable_nullBelow` / boundedness hypothesis feeding the SLLN). 
+#eval assertTrue "every running prefix mean lies in [0,1] (bounded indicators ⇒ bounded average)" + ([1, 2, 5, 10, 15, 20].all (fun N => + [0.0, zLow, 0.5, zHigh, 1.0].all (fun t => + Spec.leBool 0.0 (empCdfPrefix N t) && Spec.leBool (empCdfPrefix N t) 1.0))) + +-- Positive — the limit value `cdf noiseLaw 1 = 1` is already attained at *every* finite `N`: all +-- indicators `1{noiseᵢ ≤ 1}` are `1` (`nullNoise_mem_Icc`), so each running mean is exactly `1`. The +-- empirical CDF converges to the saturation endpoint trivially there. +#eval assertTrue "running mean saturates to cdf 1 = 1 at every prefix (all noises ≤ 1)" + ([1, 2, 5, 10, 15, 20].all (fun N => empCdfPrefix N 1.0 == 1.0)) + +-- Negative control — consistency is *non-vacuous*: the running estimate genuinely changes with `N` +-- (an early prefix differs from the full sample at some interior threshold), so the convergence +-- `F̂_N → cdf noiseLaw t` is a real limit being approached, not a constant already equal to its limit +-- at `N = 5`. A degenerate (point-mass) sample would make every prefix equal and the SLLN vacuous. +#eval assertTrue "running empirical CDF is non-trivial: an early prefix differs from the full sample" + ([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, zLow].any (fun t => !(empCdfPrefix 5 t == empCdfPrefix 20 t))) + end NN.Examples.Factorization.Discovery diff --git a/NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean b/NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean index 0076404..22053c6 100644 --- a/NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean +++ b/NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean @@ -9,9 +9,12 @@ module public import NN.Proofs.Tensor.Basic.FactorizationsZTest public import Mathlib.Probability.Independence.InfinitePi public import Mathlib.MeasureTheory.Integral.IntegrableOn +public import Mathlib.Probability.StrongLaw +public import Mathlib.Probability.CDF +public import Mathlib.MeasureTheory.Integral.Bochner.Set /-! 
-# CHD `Z_test`: the asymptotic-calibration scaffold (step a) +# CHD `Z_test`: the asymptotic-calibration scaffold and empirical-CDF consistency (steps a–b) [`FactorizationsZTest`](./FactorizationsZTest.lean) modelled a *single* `Z_test` null draw as `nullGaussian n` (the product of `n` standard normals on `Fin n → ℝ`) and proved the per-draw @@ -43,6 +46,15 @@ numbers and the Hoeffding tail take. This scaffold is the only genuinely *new* m plumbing; the empirical-CDF consistency and concentration statements (steps b–d of the plan) are applications of it, and the *uniform* Glivenko–Cantelli / DKW–Massart sharp constant and the exchangeability rank rate remain genuinely research-grade (flagged, never `sorry`'d). + +**Step (b) — pointwise consistency of the empirical CDF** is the first such application, proved +here. Fix a threshold `t`. The threshold indicators `nullBelow Λ V γ t i ω = 𝟙[nullNoise i ω ≤ t]` +inherit the i.i.d. structure (composition with the measurable indicator of `Iic t`), are +`[0,1]`-valued hence integrable, and have common mean `cdf (noiseLaw Λ V γ) t` +(`integral_nullBelow_zero`). The strong law (`strong_law_ae_real`, Etemadi's pairwise form) then +yields `empCDF_tendsto_cdf`: almost surely the empirical CDF `empCDF Λ V γ N t` converges to +`cdf (noiseLaw Λ V γ) t` as `N → ∞` — the pointwise Glivenko–Cantelli theorem. The *uniform* +(sup-norm over `t`) strengthening and the DKW rate are the remaining steps (c)–(d). -/ @[expose] public section @@ -128,6 +140,90 @@ theorem integrable_nullNoise {Λ : Fin n → ℝ} (hΛ : ∀ i, 0 ≤ Λ i) {γ rw [Real.norm_eq_abs, abs_le] exact ⟨by linarith [h.1], h.2⟩) +/-! ## Step (b): pointwise consistency of the empirical CDF (Glivenko–Cantelli via the SLLN) + +Fix a threshold `t`. The *threshold indicators* `nullBelow Λ V γ t i ω = 𝟙[nullNoise i ω ≤ t]` are, +like `nullNoise` itself, i.i.d. 
— composing each independent, identically-distributed draw with the +measurable indicator of `Iic t` preserves both — and `[0,1]`-valued, hence integrable. Their common +mean is exactly the CDF of the null law at `t`, +`∫ ω, nullBelow Λ V γ t 0 ω = (noiseLaw Λ V γ).real (Iic t) = cdf (noiseLaw Λ V γ) t`. The strong law +of large numbers (`strong_law_ae_real`, Etemadi's pairwise-independent form) then gives, almost +surely, `empCDF Λ V γ N t ω → cdf (noiseLaw Λ V γ) t` as `N → ∞`: pointwise consistency of the +empirical distribution function. -/ + +/-- The threshold indicator of the `i`-th null draw at level `t`: `1` if that draw's `noise` is +`≤ t`, else `0`. Normalized sums of these are the empirical CDF, and as an i.i.d. bounded sequence +they are the random variables the strong law runs on. -/ +noncomputable def nullBelow (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) : + ℕ → (ℕ → Fin n → ℝ) → ℝ := + fun i ω => (Set.Iic t).indicator (1 : ℝ → ℝ) (nullNoise Λ V γ i ω) + +/-- The **empirical CDF** of the first `N` null draws at threshold `t`: +`F̂_N(t)(ω) = #{i < N : nullNoise i ω ≤ t} / N`, written as the normalized sum of threshold +indicators so it plugs directly into the strong law. -/ +noncomputable def empCDF (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (N : ℕ) (t : ℝ) + (ω : ℕ → Fin n → ℝ) : ℝ := + (∑ i ∈ Finset.range N, nullBelow Λ V γ t i ω) / (N : ℝ) + +/-- Each threshold indicator is measurable: the measurable indicator of `Iic t` composed with the +measurable `nullNoise`. -/ +theorem measurable_nullBelow (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) (i : ℕ) : + Measurable (nullBelow Λ V γ t i) := + (measurable_const.indicator measurableSet_Iic).comp (measurable_nullNoise Λ V γ i) + +/-- **The threshold-indicator sequence is pairwise independent** — composing each independent +`nullNoise` draw with the measurable indicator of `Iic t` preserves independence. The exact +`hindep` shape `strong_law_ae_real` consumes. 
-/ +theorem nullBelow_pairwise_indepFun (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) : + Pairwise (Function.onFun (· ⟂ᵢ[nullSeqGaussian n] ·) (nullBelow Λ V γ t)) := by + intro i j hij + exact ((nullNoise_iIndepFun Λ V γ).indepFun hij).comp + (measurable_const.indicator measurableSet_Iic) (measurable_const.indicator measurableSet_Iic) + +/-- **The threshold-indicator sequence is identically distributed** — each is the common `nullNoise` +law pushed through the same indicator. The `hident` hypothesis of the strong law. -/ +theorem nullBelow_identDistrib (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) (i : ℕ) : + IdentDistrib (nullBelow Λ V γ t i) (nullBelow Λ V γ t 0) + (nullSeqGaussian n) (nullSeqGaussian n) := + (nullNoise_identDistrib Λ V γ i).comp (measurable_const.indicator measurableSet_Iic) + +/-- **Each threshold indicator is integrable** — it is `[0,1]`-valued on a probability space. The +`hint` hypothesis of the strong law. -/ +theorem integrable_nullBelow (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) (i : ℕ) : + Integrable (nullBelow Λ V γ t i) (nullSeqGaussian n) := + Integrable.of_bound (measurable_nullBelow Λ V γ t i).aestronglyMeasurable 1 + (ae_of_all _ fun ω => by + show ‖(Set.Iic t).indicator (1 : ℝ → ℝ) (nullNoise Λ V γ i ω)‖ ≤ 1 + refine le_trans (norm_indicator_le_norm_self _ _) ?_ + simp) + +/-- **The common mean of the threshold indicators is the null CDF at `t`.** Pushing the indicator of +`Iic t` through the `0`-th draw's law `noiseLaw` (via `HasLaw.integral_comp`) turns the expectation +into `(noiseLaw Λ V γ).real (Iic t) = cdf (noiseLaw Λ V γ) t`. 
-/ +theorem integral_nullBelow_zero (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) : + (nullSeqGaussian n)[nullBelow Λ V γ t 0] = cdf (noiseLaw Λ V γ) t := by + have hf : AEStronglyMeasurable ((Set.Iic t).indicator (1 : ℝ → ℝ)) (noiseLaw Λ V γ) := + (measurable_const.indicator measurableSet_Iic).aestronglyMeasurable + have key := (nullNoise_hasLaw Λ V γ 0).integral_comp hf + rw [integral_indicator_one measurableSet_Iic, ← cdf_eq_real] at key + exact key + +/-- **Pointwise consistency of the empirical CDF (pointwise Glivenko–Cantelli via the SLLN).** For +each fixed threshold `t`, almost surely the empirical CDF `empCDF` of the i.i.d. null draws converges +to the true CDF of the null law `noiseLaw` as the number of draws `N → ∞`. This is step (b) of the +asymptotic-calibration plan — the foundation under the 5%/95% percentile convergence, whose uniform +and concentration refinements are steps (c)–(d). -/ +theorem empCDF_tendsto_cdf (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) : + ∀ᵐ ω ∂(nullSeqGaussian n), + Filter.Tendsto (fun N : ℕ => empCDF Λ V γ N t ω) Filter.atTop + (nhds (cdf (noiseLaw Λ V γ) t)) := by + have hlaw := strong_law_ae_real (nullBelow Λ V γ t) + (integrable_nullBelow Λ V γ t 0) + (nullBelow_pairwise_indepFun Λ V γ t) + (fun i => nullBelow_identDistrib Λ V γ t i) + rw [integral_nullBelow_zero] at hlaw + exact hlaw + end end Spec.Factorization diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index 5fd0c82..0a98163 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -442,7 +442,23 @@ standard-Gaussian draw, so every `nullNoise i` has the *same* law `noiseLaw` (`n `hident` — that the strong law of large numbers (`strong_law_ae_real`) and the Hoeffding tail consume. 
This scaffold is the only genuinely new measure-theory plumbing; the empirical-CDF consistency (Glivenko–Cantelli via the SLLN) and the per-`t` concentration rate `2 exp(-2 N ε²)` (Hoeffding) are -applications of it, deferred as the next increment. +applications of it. + +*Pointwise consistency of the empirical CDF (step b).* The first such application is now proved, +`empCDF_tendsto_cdf`. Fix a threshold `t`. The threshold indicators +`nullBelow Λ V γ t i ω := (Set.Iic t).indicator 1 (nullNoise Λ V γ i ω)` — the events `1{noiseᵢ ≤ t}` +— inherit the scaffold's i.i.d. structure: composing each independent, identically-distributed draw +with the measurable indicator of `Iic t` preserves both (`nullBelow_pairwise_indepFun`, +`nullBelow_identDistrib`), and they are `[0,1]`-valued hence integrable (`integrable_nullBelow`). +Their common mean is pinned by a short `HasLaw.integral_comp` computation that pushes the indicator +through `noiseLaw`: `∫ ω, nullBelow Λ V γ t 0 ω = (noiseLaw Λ V γ).real (Iic t) = cdf (noiseLaw Λ V γ) t` +(`integral_nullBelow_zero`) — so the empirical CDF is literally the Monte-Carlo estimator of the null +CDF. Feeding the `hint`/`hindep`/`hident` triple to `strong_law_ae_real` then yields, almost surely, +`empCDF Λ V γ N t ω → cdf (noiseLaw Λ V γ) t` as `N → ∞`, where +`empCDF Λ V γ N t ω := (∑ i ∈ range N, nullBelow Λ V γ t i ω) / N`. That is the *pointwise* +Glivenko–Cantelli theorem, sorry-free over Mathlib v4.30.0. The executable `Discovery` examples +exercise its computable shadow — the growing-prefix running mean `F̂_N(t)` settling toward the +full-sample estimate of `cdf noiseLaw t`. 
*What is honestly left.* What stays genuinely research-grade is the *uniform* Glivenko–Cantelli (`sup_t |F̂_N - cdf| → 0`) and the full *DKW–Massart* inequality with its sharp constant `2` over @@ -660,12 +676,14 @@ are both narrow and deliberately scoped: the cyclic-Jacobi convergence *rate* (c a-posteriori residual certificate, never by `sorry`), and the *asymptotic* half of the `Z_test` — that the empirical 5%/95% percentiles converge to the true quantiles of the now-proved null law (Glivenko–Cantelli / DKW), and that an exchangeable fresh draw is rejected at exactly rank rate -`k/(N+1)`. The *pointwise* part of that asymptotic step is no longer fully out of reach: its i.i.d. -scaffold is now built and proved sorry-free (`FactorizationsZAsymptotic` — `nullNoise` an independent, +`k/(N+1)`. The *pointwise* part of that asymptotic step is now proved, not merely scaffolded: on the +i.i.d. sequence `FactorizationsZAsymptotic` builds (`nullNoise` an independent, identically-`noiseLaw`-distributed, `[0,1]`-valued, integrable sequence under -`Measure.infinitePi nullGaussian`), exactly the hypotheses the strong law of large numbers and the -Hoeffding tail take, so the empirical-CDF consistency and per-`t` rate are now applications rather than -frontier. What stays genuinely research-grade is the *uniform* Glivenko–Cantelli / DKW–Massart sharp +`Measure.infinitePi nullGaussian`), `empCDF_tendsto_cdf` applies the strong law of large numbers to +the bounded indicators `1{noiseᵢ ≤ t}` — whose mean is exactly `cdf noiseLaw t` +(`integral_nullBelow_zero`) — to give almost-sure convergence `F̂_N(t) → cdf noiseLaw t` for every +fixed `t`, the *pointwise* Glivenko–Cantelli theorem, sorry-free. What stays genuinely research-grade +is the *uniform* Glivenko–Cantelli / DKW–Massart sharp constant (bracketing / VC chaining) and the exchangeability rank rate `k/(N+1)` (symmetric-group rank distribution) — both absent from `Mathlib.Probability` v4.30.0. 
One open item is a proof-only gap on a quantity CHD does not need to *run*; the other is the genuine statistical From aa6a3334b345b9b3966d72cbf32cae34fed914ab Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Mon, 1 Jun 2026 07:46:31 -0700 Subject: [PATCH 21/22] Z_test asymptotic calibration step (c): pointwise DKW-at-a-point via Hoeffding MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Prove the finite-sample companion of step (b)'s almost-sure limit: at a fixed threshold t, the empirical CDF of the i.i.d. null draws concentrates around the true null CDF at the sharp Hoeffding rate. FactorizationsZAsymptotic.lean: - nullBelow_subgaussian / nullBelow_neg_subgaussian: the [0,1]-bounded threshold indicators are sub-Gaussian with variance proxy 1/4 (hasSubgaussianMGF_of_mem_Icc), centered at mean cdf noiseLaw t (integral_nullBelow_eq). - hoeffding_avg_ge: Mathlib's measure_sum_ge_le_of_iIndepFun with ε ↦ N·ε turns the proxy sum N/4 into the sharp exponent -2Nε². - empCDF_upper_tail / empCDF_lower_tail: P(±(empCDF − cdf) ≥ ε) ≤ exp(-2Nε²). - empCDF_concentration: two-sided DKW-at-a-point P(|empCDF − cdf| ≥ ε) ≤ 2·exp(-2Nε²) via measureReal_union_le + le_abs. Discovery.lean: matching #eval block exercising the bound's computable shadows — the tail function 2·exp(-2Nε²) (twice one-sided, decreasing in N and ε, non-vacuous < 1 once 2Nε² > ln 2, trivially 2 at ε = 0) and the observed prefix deviation it governs (every prefix of ≥ 3 draws keeps F̂_N within ε = 0.3 of the full sample; negative control: N = 1, 2 deviate by 0.5 > ε). Docs: aggregate Factorization docstring and Ch4 blueprint updated; uniform DKW–Massart and quantile-transfer step (d) remain flagged research-grade, never sorry'd. 
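Worked arithmetic for the hoeffding_avg_ge bullet above, as a sketch. It assumes the sum bound has the usual sub-Gaussian form ℙ(Σᵢ Yᵢ ≥ s) ≤ exp(−s² / (2·Σᵢ cᵢ)) for independent centered Yᵢ with variance proxies cᵢ; the symbols Yᵢ, s and cᵢ are notation introduced here, not identifiers from the patch.

    cᵢ = ((1 − 0)/2)² = 1/4                                  (indicator bounded in [0,1])
    Σ_{i<N} cᵢ = N/4
    {F̂_N(t) − cdf noiseLaw t ≥ ε} = {Σ_{i<N} Yᵢ ≥ N·ε}       (multiply through by N > 0)
    exp(−(N·ε)² / (2·(N/4))) = exp(−N²ε² / (N/2)) = exp(−2·N·ε²)

The lower tail is the same computation applied to −Yᵢ, and adding the two one-sided bounds gives 2·exp(−2·N·ε²).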
Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Examples/Factorization.lean | 11 + NN/Examples/Factorization/Discovery.lean | 74 +++++++ .../Basic/FactorizationsZAsymptotic.lean | 189 +++++++++++++++++- .../Ch4_Verification/Factorizations.lean | 45 ++++- 4 files changed, 307 insertions(+), 12 deletions(-) diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean index 0244774..b4eb302 100644 --- a/NN/Examples/Factorization.lean +++ b/NN/Examples/Factorization.lean @@ -124,6 +124,17 @@ factorization misbehaves. (bounded summands), the limit value `cdf 1 = 1` is attained at every `N`, and a **negative control** confirms the estimate genuinely moves with `N` (an early prefix differs from the full sample), so the convergence is a real limit being approached rather than a vacuous constant. + A final **concentration** sub-block corroborates `empCDF_concentration` (step (c)): the + Dvoretzky–Kiefer–Wolfowitz inequality *at a single point*, `ℙ(|F̂_N(t) − cdf noiseLaw t| ≥ ε) ≤ + 2·exp(−2·N·ε²)` with the sharp Hoeffding exponent (the threshold indicators are `[0,1]`-bounded, + so sub-Gaussian with proxy `1/4`; the one-sided `empCDF_upper_tail`/`empCDF_lower_tail` give + `exp(−2Nε²)` each, the union the factor `2`). The probability is noncomputable, so the `#eval`s + exercise the bound's two computable shadows: the tail *function* `2·exp(−2Nε²)` (twice the + one-sided tail, decreasing in both `N` and `ε`, non-vacuous `< 1` once `2Nε² > ln 2`), and the + observed deviation it governs — every prefix of `≥ 3` draws keeps `F̂_N` within `ε = 0.3` of the + full-sample estimate uniformly over thresholds, with a **negative control** that the tiniest + prefixes (`N = 1, 2`) deviate by `0.5 > ε`, the honest weak-`N` regime where the `2·exp(−2Nε²)` + bound is still near `2`. 
Both **positive** checks (a valid factorization reconstructs to `err ≈ 0`) and **negative controls** (the same metric reports a large error / `NaN` when a hypothesis is violated) are included, so a diff --git a/NN/Examples/Factorization/Discovery.lean b/NN/Examples/Factorization/Discovery.lean index 5b36f4b..983747d 100644 --- a/NN/Examples/Factorization/Discovery.lean +++ b/NN/Examples/Factorization/Discovery.lean @@ -393,4 +393,78 @@ def empCdfPrefix (N : Nat) (t : Float) : Float := #eval assertTrue "running empirical CDF is non-trivial: an early prefix differs from the full sample" ([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, zLow].any (fun t => !(empCdfPrefix 5 t == empCdfPrefix 20 t))) +/-! ### Pointwise finite-sample concentration (step c): the Hoeffding / DKW-at-a-point bound + +`empCDF_concentration` proves, for the i.i.d. null sequence, that at any fixed threshold `t`, + + ℙ(|F̂_N(t) − cdf noiseLaw t| ≥ ε) ≤ 2·exp(−2·N·ε²), + +the Dvoretzky–Kiefer–Wolfowitz inequality at a single point, with the sharp Hoeffding exponent (the +one-sided `empCDF_upper_tail`/`empCDF_lower_tail` give `exp(−2Nε²)` each; their union the factor `2`). +The probability is noncomputable — a measure on the infinite product of draws — but two computable +shadows make the statement concrete: the tail-bound *function* `2·exp(−2Nε²)`, and the *observed* +deviation of the prefix empirical CDF from the full-sample estimate, which that bound governs. -/ + +/-- The one-sided Hoeffding tail `exp(−2Nε²)` (`empCDF_upper_tail` / `empCDF_lower_tail`). -/ +def oneSidedBound (N : Nat) (ε : Float) : Float := Float.exp (-2.0 * N.toFloat * ε * ε) + +/-- The two-sided DKW-at-a-point bound `2·exp(−2Nε²)` (`empCDF_concentration`). 
-/ +def hoeffdingBound (N : Nat) (ε : Float) : Float := 2.0 * oneSidedBound N ε + +#eval IO.println s!"Hoeffding two-sided bound 2·exp(−2Nε²): (N=20,ε=0.3) = {hoeffdingBound 20 0.3}, \ + (N=20,ε=0.15) = {hoeffdingBound 20 0.15}, (N=10,ε=0.3) = {hoeffdingBound 10 0.3}, \ + trivial (ε=0) = {hoeffdingBound 20 0.0}" + +-- Positive — the two-sided bound is exactly twice the one-sided tail: the union bound assembling +-- `empCDF_concentration` from `empCDF_upper_tail` + `empCDF_lower_tail`, each `exp(−2Nε²)`. +#eval assertTrue "two-sided Hoeffding bound = 2 × one-sided tail (upper + lower)" + ([(5, 0.1), (10, 0.2), (20, 0.3)].all (fun p => hoeffdingBound p.1 p.2 == 2.0 * oneSidedBound p.1 p.2)) + +-- Positive — the bound tightens with more samples: doubling `N` shrinks the tail (the `N`-dependence +-- of the `−2Nε²` exponent, i.e. the finite-sample consistency rate). +#eval assertTrue "Hoeffding bound decreases in N (more draws ⇒ sharper concentration)" + ([0.15, 0.2, 0.3].all (fun ε => Spec.leBool (hoeffdingBound 40 ε) (hoeffdingBound 20 ε) + && Spec.leBool (hoeffdingBound 20 ε) (hoeffdingBound 10 ε))) + +-- Positive — the bound tightens with a looser tolerance `ε` (the `ε²` in the exponent). +#eval assertTrue "Hoeffding bound decreases in ε (larger tolerance ⇒ smaller exceedance probability)" + ([5, 10, 20].all (fun N => Spec.leBool (hoeffdingBound N 0.4) (hoeffdingBound N 0.2))) + +-- Positive — the bound is *non-vacuous* (a genuine probability bound `< 1`) once `2Nε² > ln 2`; at +-- `N = 20`, `ε = 0.3` it is `≈ 0.055`, so the empirical CDF is within 0.3 of the truth w.p. ≥ 0.945. +#eval assertTrue "Hoeffding bound is non-vacuous (< 1) at N = 20, ε = 0.3" + (Spec.ltBool (hoeffdingBound 20 0.3) 1.0) + +-- Negative control — at `ε = 0` the bound is exactly the trivial constant `2` (the vacuous `ℙ ≤ 2`): +-- concentration says nothing without a positive tolerance, so the `ε²` in the exponent does the work. 
+#eval assertTrue "at ε = 0 the Hoeffding bound is the trivial constant 2 (vacuous without tolerance)" + (hoeffdingBound 20 0.0 == 2.0) + +/-! The bound governs the *observed* fluctuation: as the prefix grows, the empirical CDF `F̂_N` +concentrates around the full-sample estimate. With tolerance `ε = 0.3`, every prefix of `≥ 3` draws +stays within `ε` of the full sample uniformly over the threshold grid, while the tiniest prefixes +(`N = 1, 2`) deviate by `0.5` — exactly the weak-`N` regime where `2·exp(−2Nε²)` is still near `2`. -/ + +/-- Threshold grid spanning the tight null-noise band `[0.048, 0.062]` plus the `[0,1]` tails. -/ +def devGrid : List Float := [-0.01, 0.0, 0.05, 0.055, 0.057, 0.06, 0.062, 0.2, 0.5, 1.0] + +/-- Sup-over-the-grid deviation of the prefix-`N` empirical CDF from the full-sample estimate — the +quantity the two-sided bound controls (a computable proxy for `|F̂_N(t) − cdf noiseLaw t|`). -/ +def maxDev (N : Nat) : Float := + (devGrid.map (fun t => Float.abs (empCdfPrefix N t - empCdf t))).foldl max 0.0 + +#eval IO.println s!"max |F̂_N − F̂_20| over the grid: N=1 {maxDev 1}, N=2 {maxDev 2}, N=3 {maxDev 3}, \ + N=5 {maxDev 5}, N=10 {maxDev 10}, N=20 {maxDev 20}" + +-- Positive — concentration: with enough draws the empirical CDF settles within `ε = 0.3` of the +-- full-sample estimate, uniformly over thresholds (the deviation the two-sided bound governs). +#eval assertTrue "empirical CDF concentrates: max deviation ≤ ε = 0.3 for every prefix of ≥ 3 draws" + ([3, 5, 10, 15, 20].all (fun N => Spec.leBool (maxDev N) 0.3)) + +-- Negative control — at the tiniest prefixes (`N = 1, 2`) the empirical CDF still deviates by `0.5 > ε`, +-- so concentration genuinely needs `N` to grow: this is the regime where `2·exp(−2Nε²)` is near `2`, +-- i.e. the bound is honestly vacuous and the empirical CDF has not yet concentrated. 
+#eval assertTrue "concentration needs N to grow: N = 1, 2 prefixes deviate by > ε = 0.3" + ([1, 2].all (fun N => Spec.ltBool 0.3 (maxDev N))) + end NN.Examples.Factorization.Discovery diff --git a/NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean b/NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean index 22053c6..23ad95d 100644 --- a/NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean +++ b/NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean @@ -12,9 +12,10 @@ public import Mathlib.MeasureTheory.Integral.IntegrableOn public import Mathlib.Probability.StrongLaw public import Mathlib.Probability.CDF public import Mathlib.MeasureTheory.Integral.Bochner.Set +public import Mathlib.Probability.Moments.SubGaussian /-! -# CHD `Z_test`: the asymptotic-calibration scaffold and empirical-CDF consistency (steps a–b) +# CHD `Z_test`: asymptotic calibration — i.i.d. scaffold, empirical-CDF consistency and pointwise concentration (steps a–c) [`FactorizationsZTest`](./FactorizationsZTest.lean) modelled a *single* `Z_test` null draw as `nullGaussian n` (the product of `n` standard normals on `Fin n → ℝ`) and proved the per-draw @@ -55,6 +56,20 @@ inherit the i.i.d. structure (composition with the measurable indicator of `Iic yields `empCDF_tendsto_cdf`: almost surely the empirical CDF `empCDF Λ V γ N t` converges to `cdf (noiseLaw Λ V γ) t` as `N → ∞` — the pointwise Glivenko–Cantelli theorem. The *uniform* (sup-norm over `t`) strengthening and the DKW rate are the remaining steps (c)–(d). + +**Step (c) — pointwise finite-sample concentration (DKW-at-a-point via Hoeffding)** quantifies the +rate of that convergence at a fixed `t`. Each threshold indicator `nullBelow Λ V γ t i` is bounded in +`[0,1]`, so — once centered at its mean `cdf (noiseLaw Λ V γ) t` — it has a sub-Gaussian moment +generating function with variance proxy `1/4` (Hoeffding's lemma, `hasSubgaussianMGF_of_mem_Icc`). 
+Mathlib's Hoeffding inequality for sums of independent sub-Gaussians +(`HasSubgaussianMGF.measure_sum_ge_le_of_iIndepFun`) then gives, for every `N ≥ 1` and `ε ≥ 0`, the +one-sided tails `empCDF_upper_tail` / `empCDF_lower_tail` +`ℙ(±(empCDF Λ V γ N t − cdf (noiseLaw Λ V γ) t) ≥ ε) ≤ exp(−2·N·ε²)`, and their union the two-sided +`empCDF_concentration` `ℙ(|empCDF Λ V γ N t − cdf (noiseLaw Λ V γ) t| ≥ ε) ≤ 2·exp(−2·N·ε²)` — the +DKW inequality *at a single point* `t`, with the sharp Hoeffding exponent. This is the finite-sample +companion of step (b)'s almost-sure limit. The *uniform-over-`t`* DKW–Massart bound with the global +constant `2` (the genuine Dvoretzky–Kiefer–Wolfowitz theorem) is the research-grade strengthening +still flagged out of scope, and the quantile-transfer step (d) remains. -/ @[expose] public section @@ -63,6 +78,8 @@ namespace Spec.Factorization open MeasureTheory ProbabilityTheory +open scoped NNReal + variable {n : Nat} noncomputable section @@ -224,6 +241,176 @@ theorem empCDF_tendsto_cdf (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) ( rw [integral_nullBelow_zero] at hlaw exact hlaw +/-! ## Step (c): pointwise finite-sample concentration (DKW-at-a-point via Hoeffding) + +Step (b) is an almost-sure *limit*; step (c) is its quantitative, finite-`N` companion. The threshold +indicators `nullBelow Λ V γ t i` are bounded in `[0,1]`, hence — centered at their common mean +`cdf (noiseLaw Λ V γ) t` — sub-Gaussian with variance proxy `(1/2)² = 1/4` (Hoeffding's lemma). Being +i.i.d., their normalized sum (the empirical CDF) concentrates exponentially: Mathlib's sub-Gaussian +Hoeffding inequality gives both one-sided tails with rate `exp(−2·N·ε²)` and, by a union bound, the +two-sided `2·exp(−2·N·ε²)` — the Dvoretzky–Kiefer–Wolfowitz bound *at the single point* `t`. -/ + +/-- Each threshold indicator is `{0,1}`-valued, hence in `[0,1]` pointwise (as is the empirical CDF). 
-/ +theorem nullBelow_mem_Icc (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) (i : ℕ) + (ω : ℕ → Fin n → ℝ) : nullBelow Λ V γ t i ω ∈ Set.Icc (0 : ℝ) 1 := by + unfold nullBelow + rw [Set.indicator_apply] + split <;> simp [Set.mem_Icc] + +/-- **The threshold indicators are jointly independent** (not just pairwise): composing the i.i.d. +`nullNoise` sequence with the measurable indicator of `Iic t` preserves joint independence. The shape +the Hoeffding sum bound consumes. -/ +theorem nullBelow_iIndepFun (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) : + iIndepFun (nullBelow Λ V γ t) (nullSeqGaussian n) := + (nullNoise_iIndepFun Λ V γ).comp (fun _ => (Set.Iic t).indicator (1 : ℝ → ℝ)) + (fun _ => measurable_const.indicator measurableSet_Iic) + +/-- The common mean of the threshold indicators is the null CDF at `t`, for *every* draw `i` (not just +the `0`-th): identically distributed draws share their integral. -/ +theorem integral_nullBelow_eq (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) (i : ℕ) : + (nullSeqGaussian n)[nullBelow Λ V γ t i] = cdf (noiseLaw Λ V γ) t := by + rw [(nullBelow_identDistrib Λ V γ t i).integral_eq, integral_nullBelow_zero] + +/-- **Hoeffding's lemma for one threshold indicator.** Centered at its mean `cdf (noiseLaw Λ V γ) t`, +the `[0,1]`-valued indicator has a sub-Gaussian MGF with variance proxy `((1-0)/2)² = 1/4`. 
-/ +theorem nullBelow_subgaussian (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) (i : ℕ) : + HasSubgaussianMGF (fun ω => nullBelow Λ V γ t i ω - cdf (noiseLaw Λ V γ) t) + (1 / 4 : ℝ≥0) (nullSeqGaussian n) := by + have hb : ∀ᵐ ω ∂(nullSeqGaussian n), nullBelow Λ V γ t i ω ∈ Set.Icc (0 : ℝ) 1 := + ae_of_all _ (nullBelow_mem_Icc Λ V γ t i) + have h := hasSubgaussianMGF_of_mem_Icc (measurable_nullBelow Λ V γ t i).aemeasurable hb + rw [integral_nullBelow_eq] at h + rwa [show ((‖(1 : ℝ) - 0‖₊) / 2) ^ 2 = (1 / 4 : ℝ≥0) from by + rw [sub_zero, nnnorm_one]; norm_num] at h + +/-- The mean of the *negated*, recentred indicator `cdf (noiseLaw Λ V γ) t − nullBelow` is `0`. -/ +theorem integral_negBelow_eq (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) (i : ℕ) : + (nullSeqGaussian n)[fun ω => cdf (noiseLaw Λ V γ) t - nullBelow Λ V γ t i ω] = 0 := by + rw [integral_sub (integrable_const _) (integrable_nullBelow Λ V γ t i), integral_const, + integral_nullBelow_eq] + simp + +/-- The negated indicator `cdf (noiseLaw Λ V γ) t − nullBelow` lies in `[cdf − 1, cdf]`. -/ +theorem nullBelow_neg_mem_Icc (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) (i : ℕ) + (ω : ℕ → Fin n → ℝ) : + cdf (noiseLaw Λ V γ) t - nullBelow Λ V γ t i ω + ∈ Set.Icc (cdf (noiseLaw Λ V γ) t - 1) (cdf (noiseLaw Λ V γ) t) := by + have h := Set.mem_Icc.mp (nullBelow_mem_Icc Λ V γ t i ω) + rw [Set.mem_Icc] + constructor <;> linarith [h.1, h.2] + +/-- **Hoeffding's lemma for the negated indicator.** `cdf (noiseLaw Λ V γ) t − nullBelow` is +`[cdf − 1, cdf]`-valued (length-`1` interval) and already mean-zero, so it is sub-Gaussian with the +same variance proxy `1/4` — the lower-tail companion of `nullBelow_subgaussian`. 
-/ +theorem nullBelow_neg_subgaussian (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) (i : ℕ) : + HasSubgaussianMGF (fun ω => cdf (noiseLaw Λ V γ) t - nullBelow Λ V γ t i ω) + (1 / 4 : ℝ≥0) (nullSeqGaussian n) := by + have hb : ∀ᵐ ω ∂(nullSeqGaussian n), + (fun ω => cdf (noiseLaw Λ V γ) t - nullBelow Λ V γ t i ω) ω + ∈ Set.Icc (cdf (noiseLaw Λ V γ) t - 1) (cdf (noiseLaw Λ V γ) t) := + ae_of_all _ (nullBelow_neg_mem_Icc Λ V γ t i) + have hmeas : Measurable (fun ω => cdf (noiseLaw Λ V γ) t - nullBelow Λ V γ t i ω) := + measurable_const.sub (measurable_nullBelow Λ V γ t i) + have h := hasSubgaussianMGF_of_mem_Icc hmeas.aemeasurable hb + rw [integral_negBelow_eq] at h + simp only [sub_zero] at h + rwa [show ((‖cdf (noiseLaw Λ V γ) t - (cdf (noiseLaw Λ V γ) t - 1)‖₊) / 2) ^ 2 = (1 / 4 : ℝ≥0) + from by rw [sub_sub_cancel, nnnorm_one]; norm_num] at h + +/-- **Hoeffding's inequality for a normalized i.i.d. proxy-`1/4` sub-Gaussian sum.** If `X` is a +jointly independent sequence on the null-draw space, each centered draw sub-Gaussian with variance +proxy `1/4`, then for `N ≥ 1` and `ε ≥ 0` the empirical average `(∑_{i (1 / 4 : ℝ≥0)) (s := Finset.range N) (fun i _ => hsub i) + (ε := (N : ℝ) * ε) (by positivity) + have hset : {ω | (N : ℝ) * ε ≤ ∑ i ∈ Finset.range N, X i ω} + = {ω | ε ≤ (∑ i ∈ Finset.range N, X i ω) / (N : ℝ)} := by + ext ω + simp only [Set.mem_setOf_eq] + rw [le_div_iff₀ hNR, mul_comm] + rw [hset] at hbase + refine hbase.trans (le_of_eq ?_) + congr 1 + simp only [Finset.sum_const, Finset.card_range, nsmul_eq_mul] + push_cast + field_simp + ring + +/-- **Upper-tail concentration of the empirical CDF (Hoeffding).** For `N ≥ 1`, `ε ≥ 0`, the empirical +CDF overshoots the true null CDF at `t` by `ε` with probability `≤ exp(−2·N·ε²)`. 
-/ +theorem empCDF_upper_tail (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) {N : ℕ} + (hN : 1 ≤ N) {ε : ℝ} (hε : 0 ≤ ε) : + (nullSeqGaussian n).real {ω | ε ≤ empCDF Λ V γ N t ω - cdf (noiseLaw Λ V γ) t} + ≤ Real.exp (-2 * (N : ℝ) * ε ^ 2) := by + have hNR : (0 : ℝ) < N := by exact_mod_cast hN + have hind : iIndepFun (fun i ω => nullBelow Λ V γ t i ω - cdf (noiseLaw Λ V γ) t) + (nullSeqGaussian n) := + (nullBelow_iIndepFun Λ V γ t).comp (fun _ x => x - cdf (noiseLaw Λ V γ) t) + (fun _ => measurable_id.sub_const _) + have htail := hoeffding_avg_ge hind (fun i => nullBelow_subgaussian Λ V γ t i) hN hε + have hrw : ∀ ω, + (∑ i ∈ Finset.range N, (nullBelow Λ V γ t i ω - cdf (noiseLaw Λ V γ) t)) / (N : ℝ) + = empCDF Λ V γ N t ω - cdf (noiseLaw Λ V γ) t := by + intro ω + unfold empCDF + rw [Finset.sum_sub_distrib, Finset.sum_const, Finset.card_range, nsmul_eq_mul, sub_div, + mul_div_cancel_left₀ (cdf (noiseLaw Λ V γ) t) (ne_of_gt hNR)] + simp only [hrw] at htail + exact htail + +/-- **Lower-tail concentration of the empirical CDF (Hoeffding).** Symmetrically, the empirical CDF +undershoots the true null CDF at `t` by `ε` with probability `≤ exp(−2·N·ε²)`. 
-/ +theorem empCDF_lower_tail (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) {N : ℕ} + (hN : 1 ≤ N) {ε : ℝ} (hε : 0 ≤ ε) : + (nullSeqGaussian n).real {ω | ε ≤ cdf (noiseLaw Λ V γ) t - empCDF Λ V γ N t ω} + ≤ Real.exp (-2 * (N : ℝ) * ε ^ 2) := by + have hNR : (0 : ℝ) < N := by exact_mod_cast hN + have hind : iIndepFun (fun i ω => cdf (noiseLaw Λ V γ) t - nullBelow Λ V γ t i ω) + (nullSeqGaussian n) := + (nullBelow_iIndepFun Λ V γ t).comp (fun _ x => cdf (noiseLaw Λ V γ) t - x) + (fun _ => measurable_const.sub measurable_id) + have htail := hoeffding_avg_ge hind (fun i => nullBelow_neg_subgaussian Λ V γ t i) hN hε + have hrw : ∀ ω, + (∑ i ∈ Finset.range N, (cdf (noiseLaw Λ V γ) t - nullBelow Λ V γ t i ω)) / (N : ℝ) + = cdf (noiseLaw Λ V γ) t - empCDF Λ V γ N t ω := by + intro ω + unfold empCDF + rw [Finset.sum_sub_distrib, Finset.sum_const, Finset.card_range, nsmul_eq_mul, sub_div, + mul_div_cancel_left₀ (cdf (noiseLaw Λ V γ) t) (ne_of_gt hNR)] + simp only [hrw] at htail + exact htail + +/-- **Pointwise finite-sample concentration of the empirical CDF (DKW-at-a-point, step (c)).** For +each fixed threshold `t`, every `N ≥ 1` and tolerance `ε ≥ 0`, the empirical CDF of the i.i.d. null +draws deviates from the true null CDF by at least `ε` with probability at most `2·exp(−2·N·ε²)`. +This is the Dvoretzky–Kiefer–Wolfowitz inequality evaluated at a single point, with the sharp +Hoeffding exponent — the finite-sample rate underneath step (b)'s almost-sure limit. The +*uniform-over-`t`* DKW–Massart bound (global constant `2`) is the research-grade strengthening still +flagged out of scope. 
-/ +theorem empCDF_concentration (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (t : ℝ) {N : ℕ} + (hN : 1 ≤ N) {ε : ℝ} (hε : 0 ≤ ε) : + (nullSeqGaussian n).real {ω | ε ≤ |empCDF Λ V γ N t ω - cdf (noiseLaw Λ V γ) t|} + ≤ 2 * Real.exp (-2 * (N : ℝ) * ε ^ 2) := by + have hsplit : {ω | ε ≤ |empCDF Λ V γ N t ω - cdf (noiseLaw Λ V γ) t|} + = {ω | ε ≤ empCDF Λ V γ N t ω - cdf (noiseLaw Λ V γ) t} + ∪ {ω | ε ≤ cdf (noiseLaw Λ V γ) t - empCDF Λ V γ N t ω} := by + ext ω + simp only [Set.mem_setOf_eq, Set.mem_union, le_abs, neg_sub] + rw [hsplit] + refine (measureReal_union_le _ _).trans ?_ + have h1 := empCDF_upper_tail Λ V γ t hN hε + have h2 := empCDF_lower_tail Λ V γ t hN hε + linarith + end end Spec.Factorization diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean index 0a98163..d24a9f5 100644 --- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean +++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean @@ -442,7 +442,7 @@ standard-Gaussian draw, so every `nullNoise i` has the *same* law `noiseLaw` (`n `hident` — that the strong law of large numbers (`strong_law_ae_real`) and the Hoeffding tail consume. This scaffold is the only genuinely new measure-theory plumbing; the empirical-CDF consistency (Glivenko–Cantelli via the SLLN) and the per-`t` concentration rate `2 exp(-2 N ε²)` (Hoeffding) are -applications of it. +applications of it — both now proved. *Pointwise consistency of the empirical CDF (step b).* The first such application is now proved, `empCDF_tendsto_cdf`. Fix a threshold `t`. The threshold indicators @@ -460,13 +460,31 @@ Glivenko–Cantelli theorem, sorry-free over Mathlib v4.30.0. The executable `Di exercise its computable shadow — the growing-prefix running mean `F̂_N(t)` settling toward the full-sample estimate of `cdf noiseLaw t`. 
-*What is honestly left.* What stays genuinely research-grade is the *uniform* Glivenko–Cantelli -(`sup_t |F̂_N - cdf| → 0`) and the full *DKW–Massart* inequality with its sharp constant `2` over -the supremum — both need the bracketing / VC-class chaining Mathlib v4.30.0 lacks — and the -*exchangeability rank rate* `k/(N+1)` for a fresh null draw, which needs a symmetric-group -rank-distribution argument also absent. Those are stated as the open frontier, never stubbed with -`sorry`. The finite-sample false-positive *bound* above is the exact, non-asymptotic statement the -test actually guarantees, and the pointwise scaffold is the sorry-free bridge toward the asymptotic +*Pointwise finite-sample concentration (step c).* Step (b)'s almost-sure limit gains a quantitative, +finite-`N` companion: `empCDF_concentration`, the Dvoretzky–Kiefer–Wolfowitz inequality *at a single +point*. The same threshold indicators are `[0,1]`-bounded, so — once centered at their mean +`cdf (noiseLaw Λ V γ) t` — Hoeffding's lemma (`hasSubgaussianMGF_of_mem_Icc`) makes them sub-Gaussian +with variance proxy `(1/2)² = 1/4` (`nullBelow_subgaussian`, and the mean-zero negated companion +`nullBelow_neg_subgaussian` for the lower tail). Mathlib's Hoeffding bound for sums of independent +sub-Gaussians (`HasSubgaussianMGF.measure_sum_ge_le_of_iIndepFun`), specialised through the +normalized-average lemma `hoeffding_avg_ge` (where the substitution `ε ↦ N·ε` turns the proxy sum +`N/4` into the sharp exponent), gives the one-sided tails `empCDF_upper_tail` / `empCDF_lower_tail`, +`ℙ(±(F̂_N(t) - cdf noiseLaw t) ≥ ε) ≤ exp(-2 N ε²)`; a union bound (`measureReal_union_le`, +`le_abs`) assembles the two-sided `ℙ(|F̂_N(t) - cdf noiseLaw t| ≥ ε) ≤ 2 exp(-2 N ε²)`. That is the +DKW inequality at one point with the sharp Hoeffding exponent — sorry-free over Mathlib v4.30.0. 
The +`Discovery` examples exercise the bound's two computable shadows: the tail *function* `2 exp(-2 N ε²)` +(twice the one-sided tail, decreasing in `N` and `ε`, non-vacuous once `2 N ε² > ln 2`) and the +observed prefix deviation it governs. + +*What is honestly left.* With the pointwise pair (b)–(c) proved, what stays genuinely research-grade +is the *uniform* Glivenko–Cantelli (`sup_t |F̂_N - cdf| → 0`) and the full *DKW–Massart* inequality +with its sharp constant `2` over the supremum — both need the bracketing / VC-class chaining Mathlib +v4.30.0 lacks — together with the *quantile-transfer* step (d) (converting CDF concentration into +convergence of the empirical 5%/95% percentiles to the true quantiles), and the *exchangeability rank +rate* `k/(N+1)` for a fresh null draw, which needs a symmetric-group rank-distribution argument also +absent. Those are stated as the open frontier, never stubbed with `sorry`. The finite-sample +false-positive *bound* above is the exact, non-asymptotic statement the test actually guarantees, and +the pointwise consistency-plus-concentration pair is the sorry-free bridge toward the asymptotic statement. # The a-posteriori residual certificate @@ -682,9 +700,14 @@ identically-`noiseLaw`-distributed, `[0,1]`-valued, integrable sequence under `Measure.infinitePi nullGaussian`), `empCDF_tendsto_cdf` applies the strong law of large numbers to the bounded indicators `1{noiseᵢ ≤ t}` — whose mean is exactly `cdf noiseLaw t` (`integral_nullBelow_zero`) — to give almost-sure convergence `F̂_N(t) → cdf noiseLaw t` for every -fixed `t`, the *pointwise* Glivenko–Cantelli theorem, sorry-free. 
What stays genuinely research-grade +fixed `t`, the *pointwise* Glivenko–Cantelli theorem, sorry-free; and its finite-sample companion +`empCDF_concentration` adds the per-`t` rate `ℙ(|F̂_N(t) - cdf noiseLaw t| ≥ ε) ≤ 2 exp(-2 N ε²)`, +the DKW inequality at one point, from Hoeffding's lemma on the `[0,1]`-bounded indicators +(`nullBelow_subgaussian`) and Mathlib's sub-Gaussian sum bound. What stays genuinely research-grade is the *uniform* Glivenko–Cantelli / DKW–Massart sharp -constant (bracketing / VC chaining) and the exchangeability rank rate `k/(N+1)` -(symmetric-group rank distribution) — both absent from `Mathlib.Probability` v4.30.0. One open item is +constant over the supremum (bracketing / VC chaining), the *quantile-transfer* step that turns this +CDF concentration into convergence of the empirical 5%/95% percentiles, and the exchangeability rank +rate `k/(N+1)` +(symmetric-group rank distribution) — all absent from `Mathlib.Probability` v4.30.0. One open item is a proof-only gap on a quantity CHD does not need to *run*; the other is the genuine statistical frontier, flagged rather than stubbed with `sorry`. From bed4e142d17f99926a06e9dc47cf863201b88b2e Mon Sep 17 00:00:00 2001 From: Nicolas Rouquette Date: Mon, 1 Jun 2026 08:20:36 -0700 Subject: [PATCH 22/22] Z_test asymptotic calibration step (d): quantile transfer (empirical percentile consistency) Inverts steps (b)-(c)'s empirical-CDF convergence into consistency of the empirical percentiles the Z_test chooser thresholds against, sorry-free over Mathlib v4.30.0. NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean: - empCDF_mono: empirical CDF monotone in the threshold. - StraddlesQuantile / IsLowerQuantile: the honest population straddle hypothesis (continuity + strict monotonicity through p at q) and the defining property of a lower empirical p-quantile. - empCDF_eventually_straddle: the transfer engine -- pointwise consistency at q +/- eps makes the empirical CDF eventually straddle p. 
- empQuantile_tendsto: any lower empirical p-quantile converges a.s. to q (sandwich pinned into [q-eps, q+eps], intersected over eps = 1/(m+1) via ae_all_iff). NN/Examples/Factorization/Discovery.lean: matching #eval block -- lower p-quantile reaches level p, monotone in p, in [0,1], median converges (<= 0.02 for N >= 3); negative controls for non-vacuity and straddle-hypothesis sensitivity. Docs: aggregate Factorization.lean bullet, Ch4 Verso blueprint, and CHDTorch plan updated to mark the four-tier asymptotic-calibration plan complete, leaving only the uniform DKW-Massart constant, the zLowFn/zHighFn triangular-array wiring, and the exchangeability rank rate as research-grade. Co-Authored-By: Claude Opus 4.8 (1M context) --- NN/Examples/Factorization.lean | 11 ++ NN/Examples/Factorization/Discovery.lean | 74 +++++++++++ .../Basic/FactorizationsZAsymptotic.lean | 116 +++++++++++++++++- .../Ch4_Verification/Factorizations.lean | 55 ++++++--- 4 files changed, 237 insertions(+), 19 deletions(-) diff --git a/NN/Examples/Factorization.lean b/NN/Examples/Factorization.lean index b4eb302..018b6c9 100644 --- a/NN/Examples/Factorization.lean +++ b/NN/Examples/Factorization.lean @@ -135,6 +135,17 @@ factorization misbehaves. full-sample estimate uniformly over thresholds, with a **negative control** that the tiniest prefixes (`N = 1, 2`) deviate by `0.5 > ε`, the honest weak-`N` regime where the `2·exp(−2Nε²)` bound is still near `2`. + A final **quantile-transfer** sub-block corroborates `empQuantile_tendsto` (step (d)): inverting the + CDF convergence into convergence of the empirical *percentiles* the chooser thresholds against. + Wherever the true CDF strictly straddles a level `p` at its quantile `q` (`StraddlesQuantile`), any + lower empirical `p`-quantile (`IsLowerQuantile`) converges almost surely to `q`. 
The limit is + noncomputable, so the `#eval`s use the **full-sample** quantile `q̂₂₀` as the stand-in for `q` and + the prefix quantile `q̂_N` as its shadow: the lower `p`-quantile reaches level `p` (`p ≤ F̂₂₀(q̂₂₀)`), + is monotone in `p` and lands in `[0,1]`, and the empirical median converges (`|q̂_N − q̂₂₀| ≤ 0.02` + for every prefix of `≥ 3` draws). Two **negative controls** keep it honest: the prefix median + genuinely moves with `N` (non-vacuous limit), and the convergence is hypothesis-sensitive — the + `5%`-tail quantile (flatter CDF, sparser straddle) deviates more at `N = 10` than the well-straddled + median, the empirical signature of `StraddlesQuantile` being a needed hypothesis. Both **positive** checks (a valid factorization reconstructs to `err ≈ 0`) and **negative controls** (the same metric reports a large error / `NaN` when a hypothesis is violated) are included, so a diff --git a/NN/Examples/Factorization/Discovery.lean b/NN/Examples/Factorization/Discovery.lean index 983747d..9dcd1e7 100644 --- a/NN/Examples/Factorization/Discovery.lean +++ b/NN/Examples/Factorization/Discovery.lean @@ -467,4 +467,78 @@ def maxDev (N : Nat) : Float := #eval assertTrue "concentration needs N to grow: N = 1, 2 prefixes deviate by > ε = 0.3" ([1, 2].all (fun N => Spec.ltBool 0.3 (maxDev N))) +/-! ### Quantile transfer (step d): consistency of the empirical percentiles + +`empQuantile_tendsto` inverts steps (b)–(c): wherever the true null CDF is continuous and strictly +increasing through a level `p` at the quantile `q` (`StraddlesQuantile`), *any* lower empirical +`p`-quantile (`IsLowerQuantile`: the CDF is `< p` to its left and `≥ p` to its right) converges almost +surely to `q` as `N → ∞`. This is the honest consistency statement for the 5%/95% percentile +thresholds the `Z_test` chooser uses. 
The limit `q` is noncomputable (a quantile of the law +`noiseLaw`), so — exactly as for steps (b)/(c) — we exercise it through the **full-sample** quantile +`q̂₂₀` standing in for `q`, and watch the prefix-`N` empirical quantile `q̂_N` settle toward it. -/ + +/-- The first `N ≤ 20` null noises, as the candidate set for the prefix empirical quantile. -/ +def prefixNoises (N : Nat) : List Float := + ((List.finRange 20).filter (fun j => decide (j.val < N))).map zNullNoises + +/-- The **lower empirical `p`-quantile** of the first `N` null draws: the smallest sampled noise `v` +whose running empirical CDF `F̂_N(v)` reaches `p` (`min`-fold over the qualifying draws, falling back +to `1`). This is the computable shadow of `IsLowerQuantile (empCDF … N) p` — `inf {t | p ≤ F̂_N t}` +for the right-continuous step CDF — the object `empQuantile_tendsto` drives to the true quantile. -/ +def empQuantilePrefix (N : Nat) (p : Float) : Float := + ((prefixNoises N).filter (fun v => Spec.leBool p (empCdfPrefix N v))).foldl min 1.0 + +/-- The full-sample (`N = 20`) lower `p`-quantile — the computable stand-in for the true quantile `q` +of `noiseLaw` that `empQuantile_tendsto` sends the prefix quantiles to. -/ +def empQuantile20 (p : Float) : Float := empQuantilePrefix 20 p + +/-- Deviation of the prefix-`N` lower `p`-quantile from the full-sample limit stand-in `q̂₂₀` — the +computable proxy for `|q̂_N − q|` that `empQuantile_tendsto` drives to `0`. 
-/ +def quantileDev (N : Nat) (p : Float) : Float := + Float.abs (empQuantilePrefix N p - empQuantile20 p) + +#eval IO.println s!"empirical median (p = 0.5) over growing prefixes: q̂_3 = {empQuantilePrefix 3 0.5}, \ + q̂_5 = {empQuantilePrefix 5 0.5}, q̂_10 = {empQuantilePrefix 10 0.5}, q̂_15 = {empQuantilePrefix 15 0.5}, \ + q̂_20 = {empQuantile20 0.5} (the limit stand-in q)" + +#eval IO.println s!"quantile triple at full sample (q̂₂₀): 5% = {empQuantile20 0.05}, 50% = {empQuantile20 0.5}, \ + 95% = {empQuantile20 0.95}; median dev at N=10 {quantileDev 10 0.5} vs 5%-tail dev at N=10 {quantileDev 10 0.05}" + +-- Positive — `IsLowerQuantile` right-property at the full sample: `p ≤ F̂₂₀(q̂₂₀)`. The lower +-- `p`-quantile genuinely reaches level `p` (here with equality, `p ∈ {0.05, 0.5, 0.95}` being multiples +-- of `1/20`) — the half of `IsLowerQuantile` feeding `empQuantile_tendsto`. +#eval assertTrue "lower p-quantile reaches level p: p ≤ F̂₂₀(q̂₂₀) for p ∈ {0.05, 0.5, 0.95}" + ([0.05, 0.5, 0.95].all (fun p => Spec.leBool p (empCdf (empQuantile20 p)))) + +-- Positive — the empirical quantile is monotone in the level `p` (order statistics are nondecreasing): +-- `q̂₂₀(0.05) ≤ q̂₂₀(0.5) ≤ q̂₂₀(0.95)`, the quantile-function shadow of `monotone_cdf` inverted. +#eval assertTrue "empirical quantile is monotone in p: q̂₂₀(5%) ≤ q̂₂₀(50%) ≤ q̂₂₀(95%)" + (Spec.leBool (empQuantile20 0.05) (empQuantile20 0.5) + && Spec.leBool (empQuantile20 0.5) (empQuantile20 0.95)) + +-- Positive — every empirical quantile is a fraction in `[0,1]` (the percentiles live in the null +-- support, `zLowFn_nonneg`/`_le_one` and friends). 
+#eval assertTrue "empirical quantiles lie in [0,1] for p ∈ {0.05, 0.5, 0.95}" + ([0.05, 0.5, 0.95].all (fun p => + Spec.leBool 0.0 (empQuantile20 p) && Spec.leBool (empQuantile20 p) 1.0)) + +-- Positive — quantile transfer (consistency): the prefix-`N` empirical median settles toward the +-- full-sample limit, within `0.02` for every prefix of `≥ 3` draws — the computable shadow of +-- `empQuantile_tendsto` (almost-sure `q̂_N → q` at the strictly-straddled median). +#eval assertTrue "empirical median converges: |q̂_N − q̂₂₀| ≤ 0.02 for every prefix of ≥ 3 draws" + ([3, 5, 10, 15, 20].all (fun N => Spec.leBool (quantileDev N 0.5) 0.02)) + +-- Negative control — consistency is non-vacuous: the prefix median genuinely *moves* with `N` (some +-- prefix differs from the full-sample limit), so `q̂_N → q̂₂₀` is a real limit being approached, not a +-- value already constant at `N = 3`. +#eval assertTrue "convergence is non-vacuous: some prefix median differs from the full-sample limit" + ([3, 5, 10, 15].any (fun N => !(empQuantilePrefix N 0.5 == empQuantile20 0.5))) + +-- Negative control — the convergence is hypothesis-sensitive: the lower `5%`-tail quantile (where the +-- CDF is flatter and the straddle sparser, fewer draws to pin it) deviates *more* at `N = 10` than the +-- median does — the empirical signature of `StraddlesQuantile` being a genuine, needed hypothesis, not +-- automatic. A flat CDF region (no strict straddle) would defeat consistency entirely. 
+#eval assertTrue "hypothesis-sensitive: the 5%-tail quantile deviates more at N=10 than the well-straddled median" + (Spec.ltBool (quantileDev 10 0.5) (quantileDev 10 0.05)) + end NN.Examples.Factorization.Discovery diff --git a/NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean b/NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean index 23ad95d..52ccb51 100644 --- a/NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean +++ b/NN/Proofs/Tensor/Basic/FactorizationsZAsymptotic.lean @@ -15,7 +15,7 @@ public import Mathlib.MeasureTheory.Integral.Bochner.Set public import Mathlib.Probability.Moments.SubGaussian /-! -# CHD `Z_test`: asymptotic calibration — i.i.d. scaffold, empirical-CDF consistency and pointwise concentration (steps a–c) +# CHD `Z_test`: asymptotic calibration — i.i.d. scaffold, empirical-CDF consistency, pointwise concentration and quantile transfer (steps a–d) [`FactorizationsZTest`](./FactorizationsZTest.lean) modelled a *single* `Z_test` null draw as `nullGaussian n` (the product of `n` standard normals on `Fin n → ℝ`) and proved the per-draw @@ -70,6 +70,18 @@ DKW inequality *at a single point* `t`, with the sharp Hoeffding exponent. This companion of step (b)'s almost-sure limit. The *uniform-over-`t`* DKW–Massart bound with the global constant `2` (the genuine Dvoretzky–Kiefer–Wolfowitz theorem) is the research-grade strengthening still flagged out of scope, and the quantile-transfer step (d) remains. + +**Step (d) — quantile transfer (consistency of the empirical percentiles)** inverts steps (b)–(c): +it carries CDF convergence over to convergence of the empirical *quantiles* — the 5%/95% percentiles +the `Z_test` chooser thresholds against. 
Under the honest hypothesis that the true CDF is continuous +and strictly increasing through the target level `p` at the quantile `q` (`StraddlesQuantile`), the +classical sandwich — pointwise consistency (step (b)) at the two straddle points `q ∓ ε` +(`empCDF_eventually_straddle`) pinning any lower empirical `p`-quantile (`IsLowerQuantile`) into +`[q − ε, q + ε]`, intersected over `ε = 1/(m+1)` via `ae_all_iff` — gives `empQuantile_tendsto`: +almost surely `empQ N → q` as `N → ∞`. This is stated for a generic lower empirical `p`-quantile; the +concrete `zLowFn`/`zHighFn` order statistics instantiate it through the order-statistic count lemmas +with the moving level `p_N = (⌊N/20⌋ + 1)/N → 1/20`, the remaining concrete (triangular-array) bridge. +The *uniform* DKW–Massart sharp constant and the exchangeability rank rate stay research-grade. -/ @[expose] public section @@ -411,6 +423,108 @@ theorem empCDF_concentration (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) have h2 := empCDF_lower_tail Λ V γ t hN hε linarith +/-! ## Step (d): quantile transfer — consistency of the empirical percentiles + +Steps (b)–(c) control the empirical CDF at a *fixed* threshold. Step (d) *inverts* that: it transfers +the convergence of `empCDF` to the convergence of the empirical *quantiles* — the 5%/95% percentiles +`Z_low`/`Z_high` the `Z_test` chooser actually thresholds against. The honest hypothesis under which +this works is that the true CDF is continuous and strictly increasing through the target level `p` at +the quantile `q`, captured by `StraddlesQuantile`: the CDF sits strictly below `p` to the left of `q` +and strictly above `p` to the right. + +The argument is the classical sandwich. For any tolerance `ε > 0`, the straddle gives +`cdf (q − ε) < p < cdf (q + ε)`. Pointwise consistency (step (b)) at the two points `q ∓ ε` then says +that almost surely, eventually `empCDF (q − ε) < p < empCDF (q + ε)`. 
Any *lower empirical +`p`-quantile* `empQ` (CDF strictly below `p` to its left, at least `p` to its right — +`IsLowerQuantile`) is therefore pinned into `[q − ε, q + ε]` once the sandwich holds. Letting `ε` run +over `1/(m+1)` and intersecting the countably many almost-sure events (`ae_all_iff`) yields, almost +surely, `empQ N → q` as `N → ∞`: **consistency of the empirical quantile**. + +This is stated for a *generic* lower empirical `p`-quantile `empQ`; the concrete percentile order +statistics `zLowFn`/`zHighFn` instantiate it through the order-statistic count lemmas +(`kthSmallestFn_strictBelow_count_le` / `kthSmallestFn_strictAbove_count_le`), with the index-driven +level `p_N = (⌊N/20⌋ + 1)/N → 1/20` — a triangular-array (moving-level) refinement that is the +remaining concrete bridge, while the *uniform* DKW–Massart sharp constant and the exchangeability rank +rate stay research-grade and out of scope (flagged, never `sorry`'d). -/ + +/-- **The empirical CDF is monotone in the threshold.** Raising `t` only enlarges `Iic t`, so each +threshold indicator (hence their normalized sum) is nondecreasing — the empirical CDF behaves like a +genuine distribution function in its argument. -/ +theorem empCDF_mono (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) (N : ℕ) (ω : ℕ → Fin n → ℝ) : + Monotone (fun t => empCDF Λ V γ N t ω) := by + intro t t' htt' + have hsum : ∑ i ∈ Finset.range N, nullBelow Λ V γ t i ω + ≤ ∑ i ∈ Finset.range N, nullBelow Λ V γ t' i ω := + Finset.sum_le_sum fun i _ => + Set.indicator_le_indicator_of_subset (Set.Iic_subset_Iic.mpr htt') + (fun _ => zero_le_one) (nullNoise Λ V γ i ω) + simp only [empCDF] + exact div_le_div_of_nonneg_right hsum (by positivity) + +/-- **Population `p`-quantile (continuous, strictly-increasing-through-`p` sense).** `q` straddles +level `p` for the CDF `F` when `F` sits strictly below `p` just left of `q` and strictly above just +right. 
This holds whenever `F` is continuous and strictly monotone at `q` with `F q = p` — the honest +hypothesis the empirical quantile is consistent under. -/ +def StraddlesQuantile (F : ℝ → ℝ) (p q : ℝ) : Prop := + ∀ ε : ℝ, 0 < ε → F (q - ε) < p ∧ p < F (q + ε) + +/-- **Lower empirical `p`-quantile.** `q` is a lower `p`-quantile of the distribution function `F` +when `F` is strictly below `p` to the left of `q` and at least `p` to the right — exactly +`inf {t | p ≤ F t}` for a right-continuous step CDF. The defining property the order-statistic +percentiles satisfy. -/ +def IsLowerQuantile (F : ℝ → ℝ) (p q : ℝ) : Prop := + (∀ t, t < q → F t < p) ∧ (∀ t, q < t → p ≤ F t) + +/-- **Quantile sandwich (the transfer engine).** If the true null CDF straddles level `p` strictly +across `t₁ < t₂` (`cdf t₁ < p < cdf t₂`), then — by pointwise consistency (step (b)) at the two +points — almost surely the empirical CDF eventually straddles `p` the same way: +`empCDF N t₁ < p < empCDF N t₂` for all large `N`. -/ +theorem empCDF_eventually_straddle (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) {p t₁ t₂ : ℝ} + (h1 : cdf (noiseLaw Λ V γ) t₁ < p) (h2 : p < cdf (noiseLaw Λ V γ) t₂) : + ∀ᵐ ω ∂(nullSeqGaussian n), ∀ᶠ N in Filter.atTop, + empCDF Λ V γ N t₁ ω < p ∧ p < empCDF Λ V γ N t₂ ω := by + filter_upwards [empCDF_tendsto_cdf Λ V γ t₁, empCDF_tendsto_cdf Λ V γ t₂] with ω hω1 hω2 + filter_upwards [hω1.eventually_lt_const h1, hω2.eventually_const_lt h2] with N hN1 hN2 + exact ⟨hN1, hN2⟩ + +/-- **Consistency of the empirical quantile (quantile transfer, step (d)).** Fix a target level `p` +and a population quantile `q` straddled by the true null CDF. Then for *any* lower empirical +`p`-quantile `empQ` of the empirical CDF (e.g. the percentile order statistics), almost surely +`empQ N → q` as the number of null draws `N → ∞`. 
This is the honest consistency statement for the
+5%/95% thresholds the `Z_test` chooser uses, inverting steps (b)–(c)'s CDF convergence into quantile
+convergence wherever the CDF is continuous and strictly monotone at the quantile. -/
+theorem empQuantile_tendsto (Λ : Fin n → ℝ) (V : Fin n → Fin n → ℝ) (γ : ℝ) {p q : ℝ}
+    {empQ : ℕ → (ℕ → Fin n → ℝ) → ℝ}
+    (hstr : StraddlesQuantile (cdf (noiseLaw Λ V γ)) p q)
+    (hq : ∀ N ω, IsLowerQuantile (fun t => empCDF Λ V γ N t ω) p (empQ N ω)) :
+    ∀ᵐ ω ∂(nullSeqGaussian n),
+      Filter.Tendsto (fun N => empQ N ω) Filter.atTop (nhds q) := by
+  have key : ∀ m : ℕ, ∀ᵐ ω ∂(nullSeqGaussian n),
+      ∀ᶠ N in Filter.atTop, |empQ N ω - q| ≤ 1 / (m + 1 : ℝ) := by
+    intro m
+    have hε : (0 : ℝ) < 1 / (m + 1 : ℝ) := by positivity
+    obtain ⟨hlt, hgt⟩ := hstr _ hε
+    filter_upwards [empCDF_eventually_straddle Λ V γ hlt hgt] with ω hω
+    filter_upwards [hω] with N hN
+    obtain ⟨hN1, hN2⟩ := hN
+    obtain ⟨hqL, hqR⟩ := hq N ω
+    have hub : empQ N ω ≤ q + 1 / (m + 1 : ℝ) := by
+      by_contra hc
+      exact absurd (hqL _ (not_le.mp hc)) (not_lt.mpr hN2.le)
+    have hlb : q - 1 / (m + 1 : ℝ) ≤ empQ N ω := by
+      by_contra hc
+      exact absurd (hqR _ (not_le.mp hc)) (not_le.mpr hN1)
+    rw [abs_le]
+    constructor <;> linarith
+  filter_upwards [ae_all_iff.mpr key] with ω hω
+  rw [Metric.tendsto_atTop]
+  intro δ hδ
+  obtain ⟨m, hm⟩ := exists_nat_one_div_lt hδ
+  obtain ⟨N₀, hN₀⟩ := Filter.eventually_atTop.mp (hω m)
+  refine ⟨N₀, fun N hN => ?_⟩
+  rw [Real.dist_eq]
+  exact lt_of_le_of_lt (hN₀ N hN) hm
+
 end
 end Spec.Factorization
diff --git a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean
index d24a9f5..d2bfa1a 100644
--- a/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean
+++ b/blueprint/TorchLeanBlueprint/Guide/Ch4_Verification/Factorizations.lean
@@ -441,8 +441,9 @@ standard-Gaussian draw, so every `nullNoise i` has the *same* law `noiseLaw` (`n
 (`integrable_nullNoise`). That is exactly the i.i.d.-bounded-integrable triple — `hint`, `hindep`, `hident` — that the strong law of large numbers (`strong_law_ae_real`) and the Hoeffding tail consume. This scaffold is the only genuinely new measure-theory plumbing; the empirical-CDF consistency
-(Glivenko–Cantelli via the SLLN) and the per-`t` concentration rate `2 exp(-2 N ε²)` (Hoeffding) are
-applications of it — both now proved.
+(Glivenko–Cantelli via the SLLN), the per-`t` concentration rate `2 exp(-2 N ε²)` (Hoeffding), and the
+quantile transfer (consistency of the empirical 5%/95% percentiles) are applications of it — all three
+now proved.

 *Pointwise consistency of the empirical CDF (step b).* The first such application is now proved, `empCDF_tendsto_cdf`. Fix a threshold `t`. The threshold indicators
@@ -476,16 +477,31 @@ DKW inequality at one point with the sharp Hoeffding exponent — sorry-free ove
 (twice the one-sided tail, decreasing in `N` and `ε`, non-vacuous once `2 N ε² > ln 2`) and the observed prefix deviation it governs.

-*What is honestly left.* With the pointwise pair (b)–(c) proved, what stays genuinely research-grade
-is the *uniform* Glivenko–Cantelli (`sup_t |F̂_N - cdf| → 0`) and the full *DKW–Massart* inequality
-with its sharp constant `2` over the supremum — both need the bracketing / VC-class chaining Mathlib
-v4.30.0 lacks — together with the *quantile-transfer* step (d) (converting CDF concentration into
-convergence of the empirical 5%/95% percentiles to the true quantiles), and the *exchangeability rank
-rate* `k/(N+1)` for a fresh null draw, which needs a symmetric-group rank-distribution argument also
-absent. Those are stated as the open frontier, never stubbed with `sorry`. The finite-sample
-false-positive *bound* above is the exact, non-asymptotic statement the test actually guarantees, and
-the pointwise consistency-plus-concentration pair is the sorry-free bridge toward the asymptotic
-statement.
+*Quantile transfer (step d).* Steps (b)–(c) control the empirical CDF at a fixed threshold;
+`empQuantile_tendsto` *inverts* that into convergence of the empirical *percentiles* the `Z_test`
+chooser thresholds against. The honest hypothesis is `StraddlesQuantile`: the true CDF sits strictly
+below the level `p` just left of the quantile `q` and strictly above just right — exactly continuity
+plus strict monotonicity through `p` at `q`. The argument is the classical sandwich: for any tolerance
+`ε`, the straddle gives `cdf (q - ε) < p < cdf (q + ε)`, and pointwise consistency (step b) at the two
+points `q ∓ ε` makes the empirical CDF eventually straddle `p` the same way
+(`empCDF_eventually_straddle`), which pins any lower empirical `p`-quantile (`IsLowerQuantile`, with
+the monotone `empCDF_mono` as the step CDF) into `[q - ε, q + ε]`. Intersecting the countably many
+almost-sure events over `ε = 1/(m+1)` (`ae_all_iff`) yields, almost surely, `empQ N → q` as `N → ∞` —
+consistency of the empirical quantile, sorry-free over Mathlib v4.30.0. It is stated for a generic
+lower empirical `p`-quantile; the `Discovery` examples corroborate it via the full-sample quantile as
+the limit stand-in (the empirical median converges within `0.02` for prefixes of `≥ 3` draws, the
+`5%`-tail quantile visibly slower — the empirical signature of the straddle hypothesis mattering).
+
+*What is honestly left.* With the pointwise pair (b)–(c) and the quantile transfer (d) proved, what
+stays genuinely research-grade is the *uniform* Glivenko–Cantelli (`sup_t |F̂_N - cdf| → 0`) and the
+full *DKW–Massart* inequality with its sharp constant `2` over the supremum — both need the bracketing
+/ VC-class chaining Mathlib v4.30.0 lacks — together with the concrete *triangular-array* bridge
+wiring the order-statistic percentiles `zLowFn`/`zHighFn` into `empQuantile_tendsto` at the moving
+level `p_N = (⌊N/20⌋ + 1)/N → 1/20`, and the *exchangeability rank rate* `k/(N+1)` for a fresh null
+draw, which needs a symmetric-group rank-distribution argument also absent. Those are stated as the
+open frontier, never stubbed with `sorry`. The finite-sample false-positive *bound* above is the
+exact, non-asymptotic statement the test actually guarantees, and the consistency-concentration-
+quantile chain (b)–(d) is the sorry-free bridge toward the asymptotic statement.

 # The a-posteriori residual certificate

@@ -700,14 +716,17 @@ identically-`noiseLaw`-distributed, `[0,1]`-valued, integrable sequence under
 `Measure.infinitePi nullGaussian`), `empCDF_tendsto_cdf` applies the strong law of large numbers to the bounded indicators `1{noiseᵢ ≤ t}` — whose mean is exactly `cdf noiseLaw t` (`integral_nullBelow_zero`) — to give almost-sure convergence `F̂_N(t) → cdf noiseLaw t` for every
-fixed `t`, the *pointwise* Glivenko–Cantelli theorem, sorry-free; and its finite-sample companion
+fixed `t`, the *pointwise* Glivenko–Cantelli theorem, sorry-free; its finite-sample companion
 `empCDF_concentration` adds the per-`t` rate `ℙ(|F̂_N(t) - cdf noiseLaw t| ≥ ε) ≤ 2 exp(-2 N ε²)`, the DKW inequality at one point, from Hoeffding's lemma on the `[0,1]`-bounded indicators
-(`nullBelow_subgaussian`) and Mathlib's sub-Gaussian sum bound.
What stays genuinely research-grade
-is the *uniform* Glivenko–Cantelli / DKW–Massart sharp
-constant over the supremum (bracketing / VC chaining), the *quantile-transfer* step that turns this
-CDF concentration into convergence of the empirical 5%/95% percentiles, and the exchangeability rank
-rate `k/(N+1)`
+(`nullBelow_subgaussian`) and Mathlib's sub-Gaussian sum bound; and `empQuantile_tendsto` *inverts*
+both into the quantile statement itself — wherever the true CDF strictly straddles a level `p` at its
+quantile `q` (`StraddlesQuantile`), the sandwich at `q ∓ ε` (`empCDF_eventually_straddle`) drives any
+lower empirical `p`-quantile to `q` almost surely, the honest consistency of the 5%/95% percentiles.
+What stays genuinely research-grade is the *uniform* Glivenko–Cantelli / DKW–Massart sharp
+constant over the supremum (bracketing / VC chaining), the concrete *triangular-array* bridge wiring
+the `zLowFn`/`zHighFn` order statistics into `empQuantile_tendsto` at the moving level
+`p_N = (⌊N/20⌋ + 1)/N → 1/20`, and the exchangeability rank rate `k/(N+1)`
 (symmetric-group rank distribution) — all absent from `Mathlib.Probability` v4.30.0. One open item is a proof-only gap on a quantity CHD does not need to *run*; the other is the genuine statistical frontier, flagged rather than stubbed with `sorry`.