For a differentiable function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is a vector-valued function that represents the direction and magnitude of the steepest increase of $f$ at a given point. The gradient is defined using the nabla operator ($\nabla$), which is a vector of partial derivative operators:
$$\nabla = \begin{bmatrix} \frac{\partial}{\partial x_1} \\ \frac{\partial}{\partial x_2} \\ \vdots \\ \frac{\partial}{\partial x_n} \end{bmatrix}$$
Applying this operator to the scalar function $f$ yields the gradient vector:
$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
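As a quick numerical sketch, the gradient vector can be approximated component-wise with central differences. The function `f` below is a hypothetical test function chosen for illustration, not one from the text:

```python
import numpy as np

# Hypothetical smooth test function f: R^2 -> R
# f(x) = x1^2 + x1*x2 + 2*x2^2, so the analytic gradient is
# [2*x1 + x2, x1 + 4*x2].
def f(x):
    return x[0]**2 + x[0]*x[1] + 2*x[1]**2

def grad_numeric(f, x, h=1e-6):
    """Central-difference approximation of the gradient vector."""
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros(len(x))
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x0 = np.array([1.0, -1.0])
print(grad_numeric(f, x0))   # analytic value at (1, -1): [1, -3]
```

The loop builds one partial derivative per coordinate, mirroring the column vector $\nabla f(\mathbf{x})$ above.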
For a twice-differentiable function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian matrix $\mathbf{H}$ is the square matrix of second-order partial derivatives. While the gradient $\nabla f$ defines the local slope, the Hessian captures the local curvature of the function.
$$\mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \dots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \dots \\ \vdots & \vdots & \ddots \end{bmatrix}$$
According to Schwarz's Theorem, if the second-order partial derivatives are continuous in a neighborhood of a point, the order of differentiation does not matter. In practical optimization, we almost exclusively deal with such smooth functions, meaning the Hessian is symmetric:
$$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i} \implies H_{ij} = H_{ji} \implies \mathbf{H} = \mathbf{H}^T$$
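The Hessian entries can likewise be approximated with central differences, and the symmetry $H_{ij} = H_{ji}$ falls out numerically. Again, `f` is a hypothetical test function introduced purely for illustration:

```python
import numpy as np

# Hypothetical smooth test function; its exact Hessian is [[2, 1], [1, 4]]
# at every point, since f is quadratic.
def f(x):
    return x[0]**2 + x[0]*x[1] + 2*x[1]**2

def hessian_numeric(f, x, h=1e-4):
    """Central-difference approximation of all second partials."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h**2)
    return H

H = hessian_numeric(f, np.array([0.5, 0.5]))
print(H)                        # ~ [[2, 1], [1, 4]]
print(np.allclose(H, H.T))      # symmetry per Schwarz's Theorem
```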
The symmetry of the Hessian is mathematically significant for the search geometry:
- **Real eigenvalues:** The curvature magnitudes are always real numbers. These $\lambda_i$ represent the "stiffness" of the landscape along the principal axes.
- **Orthogonal eigenvectors:** The principal axes of the curvature are perpendicular to each other. These define the "natural" coordinate system of the local basin.
This is analogous to PCA, where the covariance matrix $\mathbf{C}$ exhibits the same symmetry: $$\mathbf{C} = \mathbf{C}^T$$
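Both properties can be verified directly with an eigendecomposition. The 2×2 matrix below is a hypothetical positive-definite Hessian used only for illustration:

```python
import numpy as np

# Hypothetical symmetric, positive-definite Hessian
H = np.array([[2.0, 1.0],
              [1.0, 4.0]])

# np.linalg.eigh is specialized for symmetric matrices: it returns
# real eigenvalues (ascending) and an orthonormal set of eigenvectors.
lam, Q = np.linalg.eigh(H)

print(lam)            # real "stiffness" values along the principal axes
print(Q.T @ Q)        # ~ identity: the eigenvectors are orthogonal
```

The columns of `Q` are the principal axes of the local basin; `Q.T @ Q` returning the identity confirms their orthogonality.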
The efficiency of any optimization algorithm is heavily dictated by the Condition Number of the Hessian, denoted as $\kappa(\mathbf{H})$. Formally, for a positive-definite Hessian, the condition number is defined as the ratio of the extremal eigenvalues:
$$\kappa(\mathbf{H}) = \frac{\lambda_{\max}}{\lambda_{\min}}$$
This ratio serves as a quantitative measure of the "difficulty" of the landscape.
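A small sketch of the definition, using a hypothetical ill-conditioned Hessian where curvature is 100× stronger along one axis:

```python
import numpy as np

# Hypothetical ill-conditioned Hessian: curvature 100x stronger along x1
H = np.array([[100.0, 0.0],
              [0.0,     1.0]])

lam = np.linalg.eigvalsh(H)      # eigenvalues in ascending order
kappa = lam[-1] / lam[0]         # lambda_max / lambda_min
print(kappa)                     # -> 100.0

# For symmetric positive-definite matrices, the spectral (2-norm)
# condition number from np.linalg.cond agrees with this ratio.
print(np.linalg.cond(H, 2))
```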
For a sufficiently smooth function $f$, the value of the function near a point $\mathbf{x}_0$ can be approximated using its derivatives. The second-order Taylor expansion is:
$$f(\mathbf{x}_0 + \Delta \mathbf{x}) = f(\mathbf{x}_0) + \underbrace{\nabla f(\mathbf{x}_0)^T \Delta \mathbf{x}}_{\text{Linear term}} + \underbrace{\frac{1}{2} \Delta \mathbf{x}^T \mathbf{H}(\mathbf{x}_0) \Delta \mathbf{x}}_{\text{Quadratic term}} + \mathcal{O}(\|\Delta \mathbf{x}\|^3)$$
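The cubic remainder term can be observed numerically: for a small step $\Delta \mathbf{x}$, the second-order model should match the true function value up to $\mathcal{O}(\|\Delta \mathbf{x}\|^3)$. The function and evaluation point below are hypothetical choices for illustration:

```python
import numpy as np

# Hypothetical smooth (non-quadratic) test function
def f(x):
    return x[0]**2 + x[0]*x[1] + 2*x[1]**2 + np.sin(x[0])

x0 = np.array([0.3, -0.2])
g = np.array([2*x0[0] + x0[1] + np.cos(x0[0]),   # analytic gradient at x0
              x0[0] + 4*x0[1]])
H = np.array([[2 - np.sin(x0[0]), 1.0],          # analytic Hessian at x0
              [1.0,               4.0]])

dx = np.array([1e-2, -1e-2])
taylor = f(x0) + g @ dx + 0.5 * dx @ H @ dx      # second-order model
err = abs(f(x0 + dx) - taylor)
print(err)   # tiny: on the order of ||dx||^3
```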
In the proximity of a local optimum $\mathbf{x}^*$, most smooth objective functions can be accurately modeled by a second-order Taylor expansion. Since $\nabla f(\mathbf{x}^*) = 0$ at the optimum, the linear term vanishes and the function simplifies to a quadratic form:
$$f(\mathbf{x}) \approx f(\mathbf{x}^*) + \frac{1}{2} (\mathbf{x} - \mathbf{x}^*)^T \mathbf{H} (\mathbf{x} - \mathbf{x}^*)$$
This quadratic model reveals the underlying geometry of the search space:
CMA-ES is often called a second-order optimizer because it implicitly "learns" this quadratic term, without ever computing derivatives.
In ill-conditioned problems, the Hessian $\mathbf{H}$ creates a landscape where the "cost" of moving in certain directions is vastly higher than in others. CMA-ES treats the covariance matrix $\mathbf{C}$ as a dynamic preconditioner to neutralize this effect.
Let $\mathbf{x}$ be our original coordinate. CMA-ES samples points in a "circular" space $\mathbf{z}$ and then transforms them into the "elliptical" problem space using:
$$\mathbf{x} = \mathbf{C}^{1/2} \mathbf{z}$$
Now consider the quadratic objective function $f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T \mathbf{H} \mathbf{x}$. To see what the function looks like to the algorithm (in the $\mathbf{z}$-space), we substitute the transformation:
$$f(\mathbf{z}) = \frac{1}{2} (\mathbf{C}^{1/2} \mathbf{z})^T \mathbf{H} (\mathbf{C}^{1/2} \mathbf{z})$$
Using the property $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T \mathbf{A}^T$:
$$f(\mathbf{z}) = \frac{1}{2} \mathbf{z}^T (\mathbf{C}^{1/2})^T \mathbf{H} \mathbf{C}^{1/2} \mathbf{z}$$
Since $\mathbf{C}$ (and hence its square root) is symmetric, $(\mathbf{C}^{1/2})^T = \mathbf{C}^{1/2}$. This leaves us with:
$$f(\mathbf{z}) = \frac{1}{2} \mathbf{z}^T \underbrace{\left( \mathbf{C}^{1/2} \mathbf{H} \mathbf{C}^{1/2} \right)}_{\mathbf{H}_{\text{transformed}}} \mathbf{z}$$
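The derivation above can be checked numerically: with the ideal covariance $\mathbf{C} = \mathbf{H}^{-1}$, the transformed Hessian $\mathbf{C}^{1/2} \mathbf{H} \mathbf{C}^{1/2}$ collapses to the identity. The Hessian below is a hypothetical ill-conditioned example:

```python
import numpy as np

# Hypothetical ill-conditioned Hessian (condition number 100)
H = np.array([[100.0, 0.0],
              [0.0,     1.0]])

# Ideal covariance C = H^{-1}; symmetric square root via eigendecomposition
C = np.linalg.inv(H)
lam, Q = np.linalg.eigh(C)
C_sqrt = Q @ np.diag(np.sqrt(lam)) @ Q.T

# What the algorithm "sees" in z-space
H_transformed = C_sqrt @ H @ C_sqrt
print(H_transformed)                    # ~ identity matrix
print(np.linalg.cond(H_transformed))    # ~ 1.0: perfectly conditioned
```

Building the square root through `eigh` keeps `C_sqrt` symmetric, matching the $(\mathbf{C}^{1/2})^T = \mathbf{C}^{1/2}$ step in the derivation.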
Note the difference: in the $\mathbf{z}$-space, the algorithm no longer sees $\mathbf{H}$ itself but the transformed Hessian $\mathbf{C}^{1/2} \mathbf{H} \mathbf{C}^{1/2}$.
When CMA-ES converges toward the optimal covariance, it achieves $\mathbf{C} \approx \mathbf{H}^{-1}$ (up to a scalar factor). This relationship creates a perfect symmetry between the landscape and the search distribution:
Substituting $\mathbf{C} = \mathbf{H}^{-1}$ into the transformed Hessian gives $\mathbf{H}^{-1/2} \mathbf{H} \mathbf{H}^{-1/2} = \mathbf{I}$. By transforming the Hessian into the identity matrix $\mathbf{I}$, the algorithm effectively "whitens" the objective function.
This transformation is what allows CMA-ES to maintain consistent convergence rates on functions that would be numerically intractable for standard evolution strategies, or for gradient-based methods without manual variable scaling.
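A minimal sketch of this effect, using plain fixed-step gradient descent as a stand-in for an unpreconditioned method (the setup and names here are illustrative, not CMA-ES itself): on an ill-conditioned quadratic, the unpreconditioned run crawls along the flat direction, while preconditioning with $\mathbf{C} = \mathbf{H}^{-1}$ removes the ill-conditioning entirely.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic f(x) = 0.5 x^T H x
H = np.diag([100.0, 1.0])

def run(precond, iters=50):
    """Fixed-step gradient descent x <- x - step * P * grad f(x)."""
    A = precond @ H                                  # effective Hessian
    step = 1.0 / np.linalg.eigvals(A).real.max()     # largest stable-ish step
    x = np.array([1.0, 1.0])
    for _ in range(iters):
        x = x - step * precond @ (H @ x)             # grad of the quadratic is Hx
    return np.linalg.norm(x)

print(run(np.eye(2)))          # plain: slow progress along the flat axis
print(run(np.linalg.inv(H)))   # preconditioned: effective Hessian is I
```

With the preconditioner, the effective Hessian is the identity, so a unit step lands on the optimum; without it, the step size is capped by $\lambda_{\max}$ and the low-curvature direction decays very slowly.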