# Newton's method in optimization
Newton's method is a powerful optimization technique for finding a local minimum (or maximum) of a scalar function, particularly when the function is twice differentiable and has a vector argument. Unlike simpler methods such as gradient descent, Newton's method uses both the first and second derivatives of the function, which allows it to converge much more quickly once the iterates are close to a solution.
## Newton’s Method Overview
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ (i.e., a function that takes an $n$-dimensional vector as input and returns a scalar), Newton’s method iteratively updates the input vector to minimize the function.
At each step, it uses the following formula to update the vector of variables $x$:
$$x_{k+1} = x_k - H_f(x_k)^{-1} \nabla f(x_k) $$
where:
- $x_k$ is the current estimate of the minimum at iteration $k$,
- $\nabla f(x_k)$ is the gradient of the function at $x_k$, which is an $n$-dimensional vector of partial derivatives,
- $H_f(x_k)$ is the Hessian matrix of second-order partial derivatives of $f$ at $x_k$,
- $H_f(x_k)^{-1} \nabla f(x_k)$ represents the Newton step, which includes solving a system of linear equations to find the direction and magnitude of the update.
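As a concrete sketch of this update, the step can be computed in a few lines of NumPy. The function below assumes the gradient and Hessian are available as callables named `grad` and `hess` (illustrative names, not part of the original text), and solves the linear system rather than inverting the Hessian.

```python
import numpy as np

def newton_step(x, grad, hess):
    """One Newton update: x_new = x - H_f(x)^{-1} grad_f(x).

    `grad(x)` is assumed to return the gradient vector and `hess(x)` the
    Hessian matrix (hypothetical callables used for illustration).
    """
    g = grad(x)
    H = hess(x)
    p = np.linalg.solve(H, g)  # solve H p = g instead of forming H^{-1}
    return x - p
```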
## Newton’s Method: Key Steps
- Initialization: Choose an initial guess $x_0$ for the vector argument.
- Compute the Gradient and Hessian:
  - The gradient $\nabla f(x_k)$ is the vector of first-order partial derivatives:
  $$\nabla f(x_k) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right]^\top$$
  - The Hessian $H_f(x_k)$ is the matrix of second-order partial derivatives:
  $$H_f(x_k) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \dots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \dots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \dots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
- Solve the Linear System: Compute the Newton step $p_k$ by solving the linear system $H_f(x_k)\, p_k = \nabla f(x_k)$, i.e. $$ p_k = H_f(x_k)^{-1} \nabla f(x_k) $$ This provides both the direction and the step size for updating $x_k$.
- Update the Guess: Update the current estimate $x_k$ using: $$ x_{k+1} = x_k - p_k = x_k - H_f(x_k)^{-1} \nabla f(x_k) $$
- Repeat: Repeat the process until convergence, i.e., until $\| \nabla f(x_k) \|$ becomes sufficiently small or the change in $x_k$ is very small.
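Putting the steps together, a minimal sketch of the full iteration might look as follows. The stopping rule, iteration cap, and the convex test function are illustrative assumptions, not part of the original text.

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-8, max_iter=50):
    """Minimize a function via Newton's method, given its gradient and Hessian."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:       # stop when the gradient is nearly zero
            break
        p = np.linalg.solve(hess(x), g)   # Newton step: solve H p = grad f
        x = x - p                         # update x_{k+1} = x_k - p
    return x

# Illustrative test problem: f(x, y) = x^2 + y^2 + exp(x + y), a smooth convex
# function whose unique minimizer is approximately (-0.2836, -0.2836).
def grad_f(v):
    e = np.exp(v[0] + v[1])
    return np.array([2 * v[0] + e, 2 * v[1] + e])

def hess_f(v):
    e = np.exp(v[0] + v[1])
    return np.array([[2 + e, e], [e, 2 + e]])

print(newton_minimize(grad_f, hess_f, x0=[1.0, 1.0]))  # ~ [-0.2836, -0.2836]
```

Starting from $[1, 1]$, the iteration reaches the minimizer to high precision in only a handful of steps, illustrating the convergence behaviour discussed next.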
## Convergence
- Quadratic Convergence: Newton’s method converges quadratically near the minimum, meaning that the number of correct digits roughly doubles at each iteration once the iterates are close to the solution. This makes it faster than first-order methods such as gradient descent, especially near the minimum.
- Requirement for Hessian Inversion: However, a downside is that Newton's method requires computing the Hessian and solving a linear system with it at every step, which can be computationally expensive, especially for high-dimensional problems. In practice the system $H_f(x_k)\, p_k = \nabla f(x_k)$ is solved directly rather than forming $H_f(x_k)^{-1}$ explicitly.
## Challenges and Modifications
- Non-positive Definite Hessian: If the Hessian is not positive definite (which can happen when the function is not convex), the method may fail or converge to a saddle point instead of a minimum. In such cases, modifications such as Hessian damping, trust-region methods, or line search strategies are used to ensure better convergence; a minimal damping sketch follows this list.
- Cost of Hessian Calculation: In high-dimensional optimization problems, computing the Hessian and solving the associated linear system is expensive (scaling as $O(n^3)$ in general). Approximations like quasi-Newton methods (e.g., BFGS) are often used to avoid explicitly calculating the Hessian.
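As an illustration of the first modification, the sketch below (an assumption-laden example, not a prescribed algorithm) adds a multiple of the identity to the Hessian until a Cholesky factorization succeeds, which biases the step toward the gradient-descent direction; trust-region and line-search schemes refine the same idea. For the quasi-Newton route, library routines such as `scipy.optimize.minimize(fun, x0, method='BFGS')` avoid forming the Hessian altogether.

```python
import numpy as np

def damped_newton_step(x, grad, hess, damping=1e-4, growth=10.0):
    """Newton step with simple Hessian regularization (Levenberg-style damping).

    If the Hessian is not positive definite, add lam * I with increasing lam
    until a Cholesky factorization succeeds, then solve for the step.
    """
    g = grad(x)
    H = hess(x)
    lam = 0.0
    while True:
        try:
            # Cholesky succeeds only for (numerically) positive definite matrices.
            L = np.linalg.cholesky(H + lam * np.eye(len(g)))
            break
        except np.linalg.LinAlgError:
            lam = damping if lam == 0.0 else lam * growth
    # Solve (H + lam * I) p = g using the Cholesky factor L L^T.
    p = np.linalg.solve(L.T, np.linalg.solve(L, g))
    return x - p
```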
## Example
Consider a simple quadratic function $f(x) = \frac{1}{2} x^\top A x - b^\top x$, where $A$ is a symmetric positive definite matrix and $b$ is a vector. The gradient and Hessian of this function are:
- Gradient: $\nabla f(x) = A x - b$
- Hessian: $H_f(x) = A$
Using Newton’s method, the update step becomes:
$$x_{k+1} = x_k - A^{-1} (A x_k - b) = A^{-1} b $$
In this case, the method converges in a single step, since a positive definite $A$ is always invertible.
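A quick numerical check of this one-step behaviour, using an arbitrary symmetric positive definite $A$ chosen purely for illustration:

```python
import numpy as np

# Illustrative symmetric positive definite A and right-hand side b.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

x0 = np.zeros(3)                           # arbitrary starting point
x1 = x0 - np.linalg.solve(A, A @ x0 - b)   # one Newton step (the Hessian is A)
x2 = x1 - np.linalg.solve(A, A @ x1 - b)   # a second step leaves x unchanged

print(np.allclose(A @ x1, b))              # True: x1 = A^{-1} b after one step
print(np.allclose(x1, x2))                 # True: already at the minimum
```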
## Summary
Newton’s method is a powerful optimization algorithm for minimizing scalar functions with vector arguments. By using both the gradient and the Hessian, it achieves faster convergence than gradient descent near the minimum. However, the method can be computationally expensive and requires certain conditions (like a positive definite Hessian) to work effectively.