
Gradient and the Hessian Matrix ($\mathbf{H}$)¶

Gradient - Definition¶

For a differentiable function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is a vector-valued function that gives the direction and magnitude of the steepest increase of $f$ at a given point. The gradient is defined using the nabla operator ($\nabla$), a vector of partial derivative operators:

$$\nabla = \begin{bmatrix} \frac{\partial}{\partial x_1} \\ \frac{\partial}{\partial x_2} \\ \vdots \\ \frac{\partial}{\partial x_n} \end{bmatrix}$$

Applying this operator to the scalar function $f$ yields the gradient vector:

$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
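
As a sanity check, the gradient can be approximated by central finite differences. A minimal sketch, assuming NumPy; `grad_fd` and the example function are illustrative choices, not from the text:

```python
import numpy as np

def grad_fd(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# f(x) = x1^2 + 3*x2^2 has the analytic gradient (2*x1, 6*x2)
f = lambda x: x[0]**2 + 3 * x[1]**2
g = grad_fd(f, np.array([1.0, 2.0]))   # close to [2, 12]
```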

Properties¶

  • Direction of Steepest Ascent: At any point $\mathbf{x}$, the vector $\nabla f(\mathbf{x})$ points in the direction where the function $f$ increases most rapidly.
  • Orthogonality to Contours: The gradient vector $\nabla f(\mathbf{x})$ is always perpendicular (orthogonal) to the level sets (contour lines) of the function at that point.
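
The orthogonality property can be verified directly for a circular contour. A small sketch, assuming NumPy; the function $f(x) = x_1^2 + x_2^2$ is an illustrative choice whose level sets are circles:

```python
import numpy as np

# At any point on a circle centered at the origin, the tangent direction
# of the contour is orthogonal to the gradient of f(x) = x1^2 + x2^2.
x = np.array([3.0, 4.0])
grad = 2 * x                       # analytic gradient
tangent = np.array([-x[1], x[0]])  # direction along the circular contour

dot = grad @ tangent               # zero => perpendicular
```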

Hessian - Definition¶

For a twice-differentiable function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian matrix $\mathbf{H}$ is the square matrix of second-order partial derivatives. While the gradient $\nabla f$ defines the local slope, the Hessian captures the local curvature of the function:

$$\mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \dots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \dots \\ \vdots & \vdots & \ddots \end{bmatrix}$$
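
The Hessian can likewise be approximated entry by entry with central differences. A minimal sketch, assuming NumPy; `hessian_fd` and the test function are illustrative choices:

```python
import numpy as np

def hessian_fd(f, x, h=1e-4):
    """Central-difference approximation of the Hessian of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h**2)
    return H

# f(x) = x1^2 + x1*x2 + 2*x2^2 has the constant Hessian [[2, 1], [1, 4]];
# note that the numerical result is also symmetric, as Schwarz's Theorem predicts.
f = lambda x: x[0]**2 + x[0]*x[1] + 2 * x[1]**2
H = hessian_fd(f, np.array([0.5, -1.0]))
```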

Symmetry and Schwarz's Theorem¶

According to Schwarz's Theorem, if the second-order partial derivatives are continuous in a neighborhood of a point, the order of differentiation does not matter. In practical optimization we almost exclusively deal with such smooth functions, so the Hessian is symmetric:

$$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i} \implies H_{ij} = H_{ji} \implies \mathbf{H} = \mathbf{H}^T$$

Geometric Properties¶

The symmetry of the Hessian is mathematically significant for the search geometry (analogous to the symmetry of the PCA covariance matrix, $\mathbf{C} = \mathbf{C}^T$):

  • Real Eigenvalues: The eigenvalues $\lambda_i$ of a symmetric matrix are always real. They represent the curvature magnitudes, the "stiffness" of the landscape along the principal axes.

    • A large $\lambda_i$ indicates a steep, narrow direction (high stiffness).
    • A small $\lambda_i$ indicates a flat, wide direction (low stiffness).
  • Orthogonal Eigenvectors: The principal axes of the curvature are perpendicular to each other. These define the "natural" coordinate system of the local basin.

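Both properties can be verified with a numerical eigendecomposition. A minimal sketch, assuming NumPy; the example Hessian is an illustrative choice:

```python
import numpy as np

# Hessian of an elongated quadratic bowl (illustrative values)
H = np.array([[10.0, 3.0],
              [ 3.0, 2.0]])

# Because H is symmetric, eigh returns real eigenvalues (ascending)
# and orthonormal eigenvectors: the principal axes of the local basin.
lam, V = np.linalg.eigh(H)
```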

The Condition Number ($\kappa$)¶

The efficiency of any optimization algorithm is heavily dictated by the Condition Number of the Hessian, denoted $\kappa(\mathbf{H})$. Formally, for a positive-definite Hessian, the condition number is defined as the ratio of the extremal eigenvalues:

$$\kappa(\mathbf{H}) = \frac{\lambda_{\max}}{\lambda_{\min}}$$

This ratio serves as a quantitative measure of the "difficulty" of the landscape.

  • Well-Conditioned ($\kappa \approx 1$): The eigenvalues are nearly equal, resulting in hyperspherical contours. In this scenario, the negative gradient $-\nabla f$ points directly toward the optimum $\mathbf{x}^*$.
  • Ill-Conditioned ($\kappa \gg 1$): The eigenvalues vary by orders of magnitude (e.g., $10^6$), creating highly elongated, "canyon-like" elliptical contours. Here, the gradient is nearly perpendicular to the direction of the optimum, which causes first-order methods to oscillate across the canyon walls rather than progressing along the floor.
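
The ratio can be computed directly from the eigenvalues. A minimal sketch, assuming NumPy; the two example Hessians are illustrative choices for the well- and ill-conditioned cases:

```python
import numpy as np

# Well-conditioned bowl vs. an ill-conditioned "canyon"
H_sphere = np.diag([1.0, 1.0])
H_canyon = np.diag([1e6, 1.0])

def cond(H):
    lam = np.linalg.eigvalsh(H)   # eigenvalues of a symmetric matrix, ascending
    return lam[-1] / lam[0]       # kappa = lambda_max / lambda_min

k_sphere = cond(H_sphere)         # ~1: hyperspherical contours
k_canyon = cond(H_canyon)         # ~1e6: canyon-like elliptical contours
```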

The Taylor Expansion¶

For a sufficiently smooth function $f$, the value of the function near a point $\mathbf{x}_0$ can be approximated using its derivatives. The second-order Taylor expansion is:

$$f(\mathbf{x}_0 + \Delta \mathbf{x}) = f(\mathbf{x}_0) + \underbrace{\nabla f(\mathbf{x}_0)^T \Delta \mathbf{x}}_{\text{Linear term}} + \underbrace{\frac{1}{2} \Delta \mathbf{x}^T \mathbf{H}(\mathbf{x}_0) \Delta \mathbf{x}}_{\text{Quadratic term}} + \mathcal{O}(\|\Delta \mathbf{x}\|^3)$$

  • The Linear Approximation (First-Order): The term $\nabla f(\mathbf{x}_0)^T \Delta \mathbf{x}$ uses the Gradient. It approximates the function as a flat, tilted plane. Algorithms that only use this (like standard Gradient Descent) assume the slope is constant, which is why they struggle with curved valleys.
  • The Quadratic Approximation (Second-Order): The term $\frac{1}{2} \Delta \mathbf{x}^T \mathbf{H} \Delta \mathbf{x}$ uses the Hessian. It accounts for the curvature, turning the flat plane into a parabolic "bowl."
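
The two truncation levels can be compared numerically. A small sketch, assuming NumPy; the test function $f(x) = e^{x_1} + x_2^2$, the expansion point, and the step are illustrative choices, with gradient and Hessian written out analytically:

```python
import numpy as np

f = lambda x: np.exp(x[0]) + x[1]**2
x0 = np.array([0.0, 1.0])
g = np.array([np.exp(x0[0]), 2 * x0[1]])           # gradient at x0
H = np.array([[np.exp(x0[0]), 0.0], [0.0, 2.0]])   # Hessian at x0

dx = np.array([0.1, -0.1])
first  = f(x0) + g @ dx                   # linear (first-order) model
second = first + 0.5 * dx @ H @ dx        # adds the quadratic curvature term
exact  = f(x0 + dx)

err_first, err_second = abs(first - exact), abs(second - exact)
```

Including the Hessian term shrinks the approximation error by roughly two orders of magnitude for this step size, as the $\mathcal{O}(\|\Delta \mathbf{x}\|^3)$ remainder suggests.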

The Quadratic Approximation¶

In the proximity of a local optimum $\mathbf{x}^*$, most smooth objective functions can be accurately modeled by a second-order Taylor expansion. If $\nabla f(\mathbf{x}^*) = 0$, the function simplifies to a quadratic form:

$$f(\mathbf{x}) \approx f(\mathbf{x}^*) + \frac{1}{2} (\mathbf{x} - \mathbf{x}^*)^T \mathbf{H} (\mathbf{x} - \mathbf{x}^*)$$

This quadratic model reveals the underlying geometry of the search space:

  • Eigenvectors: Define the principal axes of the quadratic basin. These orthogonal vectors determine the orientation of the contours.
  • Eigenvalues: Determine the eccentricity (stretching) of the contours.

Second-Order Optimizer¶

CMA-ES is often called a second-order optimizer because it seeks to "learn" this second-order quadratic term.

  • Local to Global: While your global landscape might be a mess of hills and valleys, any tiny region around a local minimum looks almost exactly like a quadratic function.
  • Neutralizing the Hessian: As we discussed, CMA-ES adapts its covariance matrix $\mathbf{C}$ so that $\mathbf{C} \approx \mathbf{H}^{-1}$.
  • The "Sphering" Effect: In the Taylor expansion, if we replace $\Delta \mathbf{x}$ with our transformed coordinates $\mathbf{C}^{1/2} \mathbf{z}$, the quadratic term becomes $$\frac{1}{2} \mathbf{z}^T (\mathbf{C}^{1/2} \mathbf{H} \mathbf{C}^{1/2}) \mathbf{z} \approx \frac{1}{2} \mathbf{z}^T \mathbf{I} \mathbf{z}$$ This effectively "cancels out" the curvature differences, making the search equally efficient in every direction.

Hessian Preconditioning and Coordinate Transformation¶

In ill-conditioned problems, the Hessian $\mathbf{H}$ creates a landscape where the "cost" of moving in certain directions is vastly higher than in others. CMA-ES treats the covariance matrix $\mathbf{C}$ as a dynamic preconditioner to neutralize this effect.

Congruence Transformation¶

Let $\mathbf{x}$ be our original coordinate. CMA-ES samples points in a "circular" space $\mathbf{z}$ and then transforms them into the "elliptical" problem space using

$$\mathbf{x} = \mathbf{C}^{1/2} \mathbf{z}$$

Now consider the quadratic objective function $f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T \mathbf{H} \mathbf{x}$. To see what the function looks like to the algorithm (in the $\mathbf{z}$-space), we substitute the transformation:

$$f(\mathbf{z}) = \frac{1}{2} (\mathbf{C}^{1/2} \mathbf{z})^T \mathbf{H} (\mathbf{C}^{1/2} \mathbf{z})$$

Using the property $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T \mathbf{A}^T$:

$$f(\mathbf{z}) = \frac{1}{2} \mathbf{z}^T (\mathbf{C}^{1/2})^T \mathbf{H} \mathbf{C}^{1/2} \mathbf{z}$$

Since $\mathbf{C}$ (and its square root) is symmetric, $(\mathbf{C}^{1/2})^T = \mathbf{C}^{1/2}$. This leaves us with

$$f(\mathbf{z}) = \frac{1}{2} \mathbf{z}^T \underbrace{\left( \mathbf{C}^{1/2} \mathbf{H} \mathbf{C}^{1/2} \right)}_{\mathbf{H}_{transformed}} \mathbf{z}$$
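
The substitution can be verified numerically: evaluating $f$ at $\mathbf{C}^{1/2}\mathbf{z}$ must equal evaluating the transformed quadratic at $\mathbf{z}$. A minimal sketch, assuming NumPy; the matrices $\mathbf{H}$, $\mathbf{C}$ and the point $\mathbf{z}$ are illustrative choices:

```python
import numpy as np

H = np.array([[5.0, 1.0], [1.0, 3.0]])   # quadratic objective f(x) = 0.5 x^T H x
C = np.array([[2.0, 0.5], [0.5, 1.0]])   # a symmetric positive-definite covariance

# Symmetric square root of C via its eigendecomposition
lam, V = np.linalg.eigh(C)
C_half = V @ np.diag(np.sqrt(lam)) @ V.T

z = np.array([0.7, -1.2])
x = C_half @ z                            # transform to problem space
f_x = 0.5 * x @ H @ x                     # objective in x-space
H_t = C_half @ H @ C_half                 # H_transformed = C^{1/2} H C^{1/2}
f_z = 0.5 * z @ H_t @ z                   # same value, evaluated in z-space
```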

Note the difference between two kinds of transformation:

  • Similarity Transformation ($P^{-1} A P$): This is used to preserve the eigenvalues of a matrix. It’s like looking at the same linear operator from a different camera angle.
  • Congruence Transformation ($P^T A P$): This is used to change the variables of a quadratic form (our "bowl"). It doesn't preserve eigenvalues; instead, it changes the shape of the bowl.
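
The distinction can be observed numerically: a similarity transformation leaves the spectrum intact while a congruence transformation generally does not. A small sketch, assuming NumPy; $\mathbf{A}$ and $\mathbf{P}$ are arbitrary illustrative choices:

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 2.0]])
P = np.array([[2.0, 0.0], [1.0, 1.0]])   # any invertible, non-orthogonal matrix

similar   = np.linalg.inv(P) @ A @ P     # similarity:  P^{-1} A P
congruent = P.T @ A @ P                  # congruence:  P^T  A P

eig_A = np.sort(np.linalg.eigvals(A).real)
eig_s = np.sort(np.linalg.eigvals(similar).real)    # identical to eig_A
eig_c = np.sort(np.linalg.eigvals(congruent).real)  # generally different
```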

The Geometry of Whitening¶

When CMA-ES converges toward the optimal covariance, it achieves $\mathbf{C} \approx \mathbf{H}^{-1}$ (up to a scalar factor). This relationship creates a perfect symmetry between the landscape and the search distribution:

  1. Inverse Scaling: Where the Hessian has high curvature (steep walls), $\mathbf{C}$ develops small eigenvalues (short axes). Where the Hessian is flat (long valley floor), $\mathbf{C}$ develops large eigenvalues (long axes).
  2. Neutralizing the Hessian: Mathematically, the effective Hessian of the transformed space is given by the congruence transformation $$\mathbf{H}_{transformed} = \mathbf{C}^{1/2} \mathbf{H} \mathbf{C}^{1/2}$$ Substituting $\mathbf{C} = \mathbf{H}^{-1}$ (and thus $\mathbf{C}^{1/2} = \mathbf{H}^{-1/2}$): $$\mathbf{H}_{transformed} \approx \mathbf{H}^{-1/2} \mathbf{H} \mathbf{H}^{-1/2} = \mathbf{H}^{-1/2} (\mathbf{H}^{1/2} \mathbf{H}^{1/2}) \mathbf{H}^{-1/2} = \mathbf{I}$$
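
The whitening identity can be checked numerically. A minimal sketch, assuming NumPy; the ill-conditioned Hessian is an illustrative choice and `sym_sqrt` is a helper defined here, not from the text:

```python
import numpy as np

def sym_sqrt(M):
    """Symmetric square root of a symmetric positive-definite matrix."""
    lam, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(lam)) @ V.T

H = np.array([[100.0, 0.0], [0.0, 1.0]])   # ill-conditioned, kappa = 100
C = np.linalg.inv(H)                        # ideal covariance: C = H^{-1}
C_half = sym_sqrt(C)

H_t = C_half @ H @ C_half                   # effective Hessian after whitening
```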

The Impact on Search Efficiency¶

By transforming the Hessian into the Identity matrix $\mathbf{I}$, the algorithm effectively "whitens" the objective function.

  • Before Preconditioning: The algorithm struggles with a condition number $\kappa \gg 1$, where the gradient direction is misleading and steps are disproportionately effective or ineffective depending on their orientation.
  • After Preconditioning: The search "sees" a perfectly spherical bowl. In this transformed space, every direction is equally promising, and the condition number becomes $\kappa \approx 1$.

This transformation is what allows CMA-ES to maintain consistent convergence rates on functions that would be numerically impossible for standard evolution strategies or gradient-based methods without manual variable scaling.
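
The practical payoff can be seen in a toy comparison of plain gradient descent on the raw and on the whitened quadratic. A sketch, assuming NumPy; the Hessian, step sizes, and iteration count are illustrative choices, and the whitened problem stands in for the transformed space a perfectly adapted search distribution would "see":

```python
import numpy as np

# Gradient descent on f(x) = 0.5 x^T H x; the gradient is H x.
H = np.diag([100.0, 1.0])       # kappa = 100
x = np.array([1.0, 1.0])        # start in the raw coordinates
z = x.copy()                    # same start in whitened coordinates
H_white = np.eye(2)             # effective Hessian after whitening, kappa = 1

lr_raw   = 1.0 / 100.0          # stability limits the step to ~1/lambda_max
lr_white = 1.0                  # kappa = 1 allows a full step

for _ in range(50):
    x = x - lr_raw * (H @ x)            # crawls along the flat direction
    z = z - lr_white * (H_white @ z)    # converges immediately

raw_err, white_err = np.linalg.norm(x), np.linalg.norm(z)
```

The raw iteration shrinks the flat direction only by a factor $(1 - \lambda_{\min}/\lambda_{\max})$ per step, while the whitened problem converges in a single step: the step-size bottleneck imposed by $\kappa$ is exactly what the preconditioning removes.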
