The simplest method is greedy [[line search]], where we start at some point <math>\boldsymbol{x}_0</math>, pick a direction <math>\boldsymbol{p}_0</math> somehow, then minimize <math>f(\boldsymbol{x}_0 + \boldsymbol{p}_0 \alpha_0)</math>. This has a simple closed-form solution that does not involve matrix inversion:<math display="block">\alpha_0 = \frac{\boldsymbol{p}_0^\mathrm{T}(\boldsymbol{b}-\boldsymbol{Ax}_0) }{\boldsymbol{p}_0^\mathrm{T}\boldsymbol{A}\boldsymbol{p}_0} </math>Geometrically, we start at some point <math>\boldsymbol{x}_0</math> on some ellipsoid, then choose a direction and travel along that direction, until we hit the point where the ellipsoid is minimized in that direction. This is not necessarily the minimum, but it is progress towards it. Visually, it is moving along a line, and stopping as soon as we reach a point tangent to the contour ellipsoid.
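For illustration, the closed-form step can be computed directly. The following is a minimal NumPy sketch of one exact line-search step, assuming <math>\boldsymbol{A}</math> is symmetric positive-definite; the function and variable names are illustrative only.
<syntaxhighlight lang="python">
import numpy as np

def line_search_step(A, b, x, p):
    """One exact line-search step along direction p, using the closed-form alpha above."""
    r = b - A @ x                    # current residual
    alpha = (p @ r) / (p @ (A @ p))  # alpha = p^T (b - A x) / (p^T A p)
    return x + alpha * p

# Example on a small symmetric positive-definite system.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x0 = np.zeros(2)
x1 = line_search_step(A, b, x0, b - A @ x0)  # here p_0 is chosen as the initial residual
</syntaxhighlight>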
We can now repeat this procedure, starting at our new point <math>\boldsymbol{x}_1 = \boldsymbol{x}_0 + \alpha_0\boldsymbol{p}_0 </math>: pick a new direction <math>\boldsymbol{p}_1</math>, minimize along it to obtain <math>\boldsymbol{x}_2</math>, and so on.
We can summarize this as the following algorithm:
Line 33:
\end{align}</math>
where <math>\boldsymbol{p}_0,\boldsymbol{p}_1,\boldsymbol{p}_2,\ldots</math> are to be picked. Notice in particular how the residual is calculated iteratively step-by-step, instead of anew every time:<math display="block">\boldsymbol{r}_{i+1} = \boldsymbol{b} - \boldsymbol{Ax}_{i+1}
= \boldsymbol{b} - \boldsymbol{A} (\boldsymbol{x}_i+\alpha_i\boldsymbol{p}_i) = \boldsymbol{r}_i
-\alpha_i \boldsymbol{A} \boldsymbol{p}_i </math>It is possible for <math>\alpha_i</math> to become zero prematurely, which would cause numerical problems. However, for particular choices of <math>\boldsymbol{p}_0,\boldsymbol{p}_1,\boldsymbol{p}_2,\ldots</math>, this does not occur before convergence, as we will prove below.
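A direct transcription of this loop, with the directions supplied by the caller and the residual updated iteratively rather than recomputed, could look like the following sketch (NumPy; names are illustrative).
<syntaxhighlight lang="python">
import numpy as np

def directional_descent(A, b, x0, directions):
    """Exact line search along a given sequence of directions, updating the residual iteratively."""
    x = np.array(x0, dtype=float)
    r = b - A @ x                      # r_0, computed once
    for p in directions:
        Ap = A @ p
        alpha = (p @ r) / (p @ Ap)     # exact minimizer along p
        x = x + alpha * p
        r = r - alpha * Ap             # r_{i+1} = r_i - alpha_i A p_i, no fresh b - A x
    return x
</syntaxhighlight>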
=== Conjugate directions ===
If the directions <math>\boldsymbol{p}_0,\boldsymbol{p}_1,\boldsymbol{p}_2,\ldots</math> are not picked well, then progress will be slow. In particular, the gradient descent method would be slow. This can be seen in the diagram, where the green line is the result of always picking the local gradient direction: it zig-zags towards the minimum, repeatedly overshooting. In contrast, if we pick the directions to be a set of '''mutually conjugate directions''', then there is no overshoot, and we obtain the global minimum after at most <math>n</math> steps, where <math>n</math> is the number of dimensions.
[[File:Conjugate_Diameters.svg|right|thumb|300x300px|Two conjugate diameters of an [[ellipse]]. Each edge of the bounding [[parallelogram]] is [[Parallel (geometry)|parallel]] to one of the diameters.]]
The concept of conjugate directions came from the classical geometry of the ellipse. For an ellipse, two diameters are conjugate if each is parallel to the tangent line to the ellipse at the endpoints of the other, as in the figure.
Note that we need to scale each direction vector <math>\boldsymbol{p}_i </math> by a scalar <math>t_i </math>, so that <math>\boldsymbol{c} + t_i \boldsymbol{p}_i</math> falls exactly on the ellipsoid.
{| class="wikitable"
|+
!
|0
|1
|2
|<math>\cdots</math>
|-
| 0
|
| <math>\boldsymbol{p}_0^\mathrm{T}\boldsymbol{Ap}_1</math>
| <math>\boldsymbol{p}_0^\mathrm{T}\boldsymbol{Ap}_2</math>
| <math>\cdots</math>
|-
| 1
|
|
| <math>\boldsymbol{p}_1^\mathrm{T}\boldsymbol{Ap}_2</math>
| <math>\cdots</math>
|-
| 2
|
|
|
| <math>\cdots</math>
|-
| <math>\vdots</math>
|
|
|
| <math>\ddots</math>
|}
This resembles the problem of orthonormalization, which requires <math>\boldsymbol{p}_i^\mathrm{T}\boldsymbol{p}_j=0</math> for any <math>i\neq j</math>, and <math>\boldsymbol{p}_i^\mathrm{T}\boldsymbol{p}_i=1</math> for each <math>i</math>. Thus the problem of finding conjugate directions is less constrained than the problem of orthonormalization, so the [[Gram–Schmidt process]] works, with additional degrees of freedom that we can later use to pick directions that simplify the computation (a code sketch of the conjugation follows the list below):
* Arbitrarily set <math>\boldsymbol{p}_0</math>.
* Arbitrarily set <math>\boldsymbol{p}_{1,0}</math>, then modify it to <math>\boldsymbol{p}_1 = \boldsymbol{p}_{1,0} - \frac{\boldsymbol{p}_0^\mathrm{T}\boldsymbol{Ap}_{1,0}}{\boldsymbol{p}_0^\mathrm{T}\boldsymbol{Ap}_0}\boldsymbol{p}_{0} </math>.
* Continue similarly for <math>\boldsymbol{p}_2, \ldots, \boldsymbol{p}_{n-2}</math>, removing from each initial vector <math>\boldsymbol{p}_{k,0}</math> its components that are not conjugate to the previously constructed directions.
* Arbitrarily set <math>\boldsymbol{p}_{n-1,0}</math>, then modify it to <math>\boldsymbol{p}_{n-1} = \boldsymbol{p}_{n-1,0} - \sum_{i = 0}^{n-2}
\frac{\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_{n-1,0}}{\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_i}\boldsymbol{p}_{i} </math>.
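A sketch of this conjugation procedure, assuming the initial vectors <math>\boldsymbol{p}_{k,0}</math> are given and <math>\boldsymbol{A}</math> is symmetric positive-definite (the function name is illustrative):
<syntaxhighlight lang="python">
import numpy as np

def conjugate_directions(A, initial_vectors):
    """Gram–Schmidt-style conjugation: return vectors that are mutually A-conjugate."""
    ps = []
    for v in initial_vectors:
        p = np.array(v, dtype=float)
        for q in ps:
            # Remove v's A-component along q, with coefficient (q^T A v) / (q^T A q)
            # as in the formula above.
            p -= (q @ (A @ v)) / (q @ (A @ q)) * q
        ps.append(p)
    return ps
</syntaxhighlight>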
The most natural choice of <math>\boldsymbol{p}_{k,0}</math> is the gradient. That is, <math>\boldsymbol{p}_{k,0} = \nabla f(\boldsymbol{x}_k)</math>. Since conjugate directions remain conjugate when scaled by any nonzero factor, we scale it by <math>-1/2</math> for notational cleanness, obtaining <math display="block"> \boldsymbol{p}_{k, 0} = \mathbf{r}_k = \mathbf{b} - \mathbf{Ax}_k</math>Thus, we have <math>\boldsymbol{p}_{k} = \boldsymbol{r}_{k} - \sum_{i = 0}^{k-1}
\frac{\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ar}_{k}}{\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_i}\boldsymbol{p}_{i}</math>. Plugging it in, we have the conjugate gradient algorithm:<math display="block">\begin{align}
& \mathbf{r}_0 := \mathbf{b} - \mathbf{A x}_0 \\
& \mathbf{p}_0 := \mathbf{r}_0 \\
& k := 0 \\
& \text{do while }k < n \\
& \qquad \alpha_k := \frac{\mathbf{p}_k^\mathsf{T} \mathbf{r}_k}{\mathbf{p}_k^\mathsf{T} \mathbf{A p}_k} \\
& \qquad \mathbf{x}_{k+1} := \mathbf{x}_k + \alpha_k \mathbf{p}_k \\
& \qquad \text{if } |\alpha_k| \text{ is sufficiently small, then exit loop}\\
& \qquad \mathbf{r}_{k+1} := \mathbf{r}_k - \alpha_k \mathbf{A p}_k \\
& \qquad \mathbf{p}_{k+1} := \boldsymbol{r}_{k+1} - \sum_{i = 0}^{k}
\frac{\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ar}_{k+1}}{\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_i}\boldsymbol{p}_{i}\\
& \qquad k := k + 1 \\
& \text{return } \mathbf{x}_{k+1} \text{ as the result}
\end{align}</math>'''Proposition.''' If at some point <math>\alpha_k = 0</math>, then the algorithm has converged, that is, <math>\nabla f(\mathbf{x}_{k+1}) = 0</math>.
'''Proof.''' By construction, <math>\alpha_k = 0</math> means that <math>\mathbf{x}_{k+1} = \mathbf{x}_k</math>; that is, taking the exact line-search step along <math>\mathbf{p}_k</math> does not move us at all. Since <math>\alpha_k</math> is proportional to <math>\mathbf{p}_k^\mathsf{T}\mathbf{r}_k</math>, which equals <math>\mathbf{r}_k^\mathsf{T}\mathbf{r}_k</math> (Lemma 2 below), this can happen only if <math>\mathbf{r}_k = 0</math>, i.e. the local gradient is already zero.
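As a sketch, the loop above can be transcribed directly, keeping the explicit Gram–Schmidt sum (the simplification comes in the next subsection). This assumes a symmetric positive-definite <math>\boldsymbol{A}</math>; the names and the small guard against a zero direction are illustrative additions.
<syntaxhighlight lang="python">
import numpy as np

def conjugate_gradient_unsimplified(A, b, x0, tol=1e-12):
    """Conjugate gradient with the explicit Gram–Schmidt sum for each new direction."""
    x = np.array(x0, dtype=float)
    r = b - A @ x
    p = r.copy()
    used = []                              # previously used directions p_0, ..., p_k
    for _ in range(len(b)):
        Ap = A @ p
        pAp = p @ Ap
        if pAp == 0:                       # p is zero only when the gradient is already zero
            break
        alpha = (p @ r) / pAp
        x = x + alpha * p
        if abs(alpha) < tol:               # exit condition from the pseudocode above
            break
        r = r - alpha * Ap
        used.append(p)
        # Conjugate the new residual against all previous directions.
        p = r - sum((q @ (A @ r)) / (q @ (A @ q)) * q for q in used)
    return x
</syntaxhighlight>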
=== Simplification ===
This algorithm can be significantly simplified with a few lemmas, resulting in the standard form of the conjugate gradient algorithm.
'''Lemma 1.''' <math> \mathbf{p}_i^T \mathbf{r}_j = 0, \;\forall i < j</math> and <math> \mathbf{r}_i^T \mathbf{r}_j = 0, \;\forall i < j</math>.
'''Proof.''' By the geometric construction, the tangent plane to the ellipsoid at <math> \mathbf{x}_j</math> contains each of the previous conjugate direction vectors <math>\mathbf{p}_0, \mathbf{p}_1, \dots,\mathbf{p}_{j-1}</math>. Further, <math> \mathbf{r}_j</math> is perpendicular to this tangent plane, thus <math> \mathbf{p}_i^T \mathbf{r}_j = 0, \;\forall i < j</math>. The second equation follows because, by construction, each <math> \mathbf{r}_i</math> with <math>i < j</math> lies in the span of <math> \mathbf{p}_0, \mathbf{p}_1, \dots,\mathbf{p}_{i}</math>.
'''Lemma 2.''' <math> \mathbf{p}_k^T \mathbf{r}_k = \mathbf{r}_k^T\mathbf{r}_k</math>.
'''Proof.''' By construction, <math> \mathbf{p}_{k} := \boldsymbol{r}_{k} - \sum_{i = 0}^{k-1}
\frac{\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ar}_{k}}{\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_i}\boldsymbol{p}_{i}</math>; now take the inner product of both sides with <math>\mathbf{r}_k</math> and apply Lemma 1 to eliminate every term of the sum.
'''Lemma 3.''' <math>\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ar}_{k+1} = \begin{cases}
0, \; i < k \\
-\boldsymbol{r}_{k+1}^T \boldsymbol{r}_{k+1} /\alpha_{k}, \; i = k
\end{cases}</math>.
'''Proof.''' By construction, we have <math>\mathbf{r}_{i+1} = \mathbf{r}_i - \alpha_i \mathbf{A p}_i</math>, and since <math>\boldsymbol A</math> is symmetric,<math display="block">\boldsymbol{p}_i^\mathrm{T}\boldsymbol{A}\boldsymbol{r}_{k+1} = \boldsymbol r_{k+1}^T \boldsymbol A \boldsymbol p_i = \boldsymbol r_{k+1}^T \frac{\boldsymbol r_{i} -\boldsymbol r_{i+1}}{\alpha_i}.</math>Now apply Lemma 1.
Plugging Lemmas 1–3 in, we have <math>\alpha_k=\frac{\mathbf{r}_k^{\top} \mathbf{r}_k}{\mathbf{p}_k^{\top} \mathbf{A} \mathbf{p}_k}</math> by Lemma 2, and by Lemma 3 the Gram–Schmidt sum for <math>\mathbf{p}_{k+1}</math> collapses to a single term:<math display="block">\mathbf{p}_{k+1} := \boldsymbol{r}_{k+1} + \frac{\mathbf{r}_{k+1}^{\top} \mathbf{r}_{k+1}}{\alpha_k\, \mathbf{p}_k^{\top} \mathbf{A} \mathbf{p}_k}\mathbf p_k = \boldsymbol{r}_{k+1} + \frac{\mathbf{r}_{k+1}^{\top} \mathbf{r}_{k+1}}{\mathbf{r}_k^{\top} \mathbf{r}_k} \mathbf p_k.</math>This is the standard conjugate gradient algorithm.
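In code, the simplified recurrences give the familiar loop; again a NumPy sketch under the same assumptions (symmetric positive-definite <math>\boldsymbol{A}</math>, illustrative names):
<syntaxhighlight lang="python">
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-12):
    """Standard conjugate gradient for a symmetric positive-definite matrix A."""
    x = np.array(x0, dtype=float)
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    for _ in range(len(b)):
        if rs_old < tol ** 2:              # residual small enough: converged
            break
        Ap = A @ p
        alpha = rs_old / (p @ Ap)          # alpha_k = r_k^T r_k / (p_k^T A p_k)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p      # beta_k = r_{k+1}^T r_{k+1} / (r_k^T r_k)
        rs_old = rs_new
    return x
</syntaxhighlight>
In exact arithmetic this loop reaches the exact solution after at most <math>n</math> iterations, matching the discussion of conjugate directions above.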
==Arnoldi/Lanczos iteration==
{{reflist}}
#{{cite journal|last1 = Hestenes|first1 = M. R.|authorlink1 = David Hestenes|last2 = Stiefel|first2 = E.|authorlink2 = Eduard Stiefel|title = Methods of conjugate gradients for solving linear systems|journal = Journal of Research of the National Bureau of Standards|volume = 49|issue = 6|date=December 1952|page = 409|doi = 10.6028/jres.049.044|url = http://nvlpubs.nist.gov/nistpubs/jres/049/6/V49.N06.A08.pdf}}
#Shewchuk, Jonathan Richard. "''[https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf An introduction to the conjugate gradient method without the agonizing pain]''." (1994)
#{{cite book|last = Saad|first = Y.|title = Iterative methods for sparse linear systems|url = https://archive.org/details/iterativemethods0000saad|url-access = registration|edition = 2nd|chapter = Chapter 6: Krylov Subspace Methods, Part I|publisher = SIAM|year = 2003|isbn = 978-0-89871-534-7}}