While the basic Evolution Strategy we have built is robust, it faces a major hurdle in the real world: ill-conditioning. In many optimization problems, the "valley" leading to the minimum is not a perfect circle; it is a long, narrow, and curved canyon.
In these scenarios, a circular mutation cloud (using the identity matrix as covariance) is highly inefficient. Most mutations will hit the "walls" of the canyon, and only a tiny fraction will move down the "floor." To solve this, we need the algorithm to learn the topology of the landscape as it moves.
```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

# --- 1. DESIGN THE EXTREME LANDSCAPE ---
def create_extreme_valley(angle_deg, scale_ratio):
    angle = np.radians(angle_deg)
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])          # rotation matrix
    S = np.diag([scale_ratio, 1.0])          # 50:1 elongation
    return R @ S @ R.T, angle

# Angle of 60 degrees for a distinct diagonal look
A, rot_angle = create_extreme_valley(angle_deg=60, scale_ratio=50.0)

def fitness(x, y):
    # Quadratic form x^T A x, centered at (0, 0)
    return A[0, 0] * x**2 + (A[0, 1] + A[1, 0]) * x * y + A[1, 1] * y**2

# --- 2. POSITION THE MEAN ON THE VALLEY FLOOR ---
# The valley floor runs along the 'easy' eigenvector (small eigenvalue),
# which is perpendicular to the rotated 'hard' axis.
distance_from_center = 1.2
m = np.array([
    distance_from_center * np.cos(rot_angle + np.pi / 2),
    distance_from_center * np.sin(rot_angle + np.pi / 2)
])
sigma = 0.2

# --- 3. SAMPLING ---
# Basic ES (identity covariance)
C_iso = np.eye(2)
samples_iso = m + sigma * np.random.multivariate_normal([0, 0], C_iso, 50)

# CMA-ES goal: covariance aligned with the inverse of the landscape matrix
C_aligned = np.linalg.inv(A)
C_aligned /= np.sqrt(np.linalg.det(C_aligned))   # normalize to unit area
samples_aligned = m + sigma * np.random.multivariate_normal([0, 0], C_aligned, 50)

# --- 4. VISUALIZATION ---
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 7))
data = [(ax1, samples_iso, C_iso, "Basic ES (Isotropic)"),
        (ax2, samples_aligned, C_aligned, "CMA-ES (Aligned)")]

# Grid setup
lim = 2.0
x_grid = np.linspace(-lim, lim, 150)
y_grid = np.linspace(-lim, lim, 150)
X, Y = np.meshgrid(x_grid, y_grid)
Z = fitness(X, Y)

for ax, samples, C, title in data:
    # Log scale makes the narrow valley floor visible
    ax.contourf(X, Y, np.log1p(Z), levels=30, cmap='plasma', alpha=0.5)
    ax.scatter(samples[:, 0], samples[:, 1], c='white', edgecolors='blue', s=15, alpha=0.8)
    ax.scatter(m[0], m[1], c='red', marker='P', s=150, label="Mean (m)", zorder=6)
    ax.scatter(0, 0, c='lime', marker='*', s=300, label="Target (0,0)", zorder=6)

    # Calculate the search boundary from the eigendecomposition of C
    vals, vecs = np.linalg.eigh(C)
    order = vals.argsort()[::-1]
    v, w = vals[order], vecs[:, order]
    theta = np.degrees(np.arctan2(w[1, 0], w[0, 0]))

    # Full axes of the 2-sigma ellipse (Ellipse expects full width/height)
    width, height = 2 * 2 * sigma * np.sqrt(v)
    ellipse = Ellipse(xy=m, width=width, height=height, angle=theta,
                      edgecolor='cyan', fc='none', lw=2.5, ls='--', label='Search Area')
    ax.add_patch(ellipse)

    ax.set_title(title, fontsize=14)
    ax.set_aspect('equal')
    ax.set_xlim(-lim, lim)
    ax.set_ylim(-lim, lim)
    ax.grid(alpha=0.2)
    ax.legend(loc='upper right')

plt.tight_layout()
plt.show()
```
The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) represents a significant advancement in derivative-free, black-box optimization. It extends the $(\mu/\mu_w, \lambda)$ framework by incorporating a second-order model of the objective function's landscape through an adaptive covariance matrix.
Instead of just moving a fixed-shape cloud, CMA-ES adapts the shape and orientation of the mutation distribution.
CMA-ES utilizes two distinct mechanisms to adapt the search distribution dynamically: adaptation of the covariance matrix $\mathbf{C}$ (shaping and orienting the ellipsoid) and adaptation of the step-size $\sigma$ (scaling it). The table below contrasts the resulting behavior with the basic ES:
| Feature | Basic ES | CMA-ES |
|---|---|---|
| Search Shape | Static Sphere | Adaptive Ellipsoid |
| Feedback | Immediate (Current Gen) | Historical (Evolution Path) |
| Best For | Simple, "Round" Bowls | Complex, Narrow, or Curved Valleys |
| Complexity | Easy to implement | Mathematically more complex |
In the CMA Evolution Strategy, a population of new search points (individuals, offspring) is generated by sampling a multivariate normal distribution:
$$\mathbf{x}_k^{(g+1)} \sim \mathbf{m}^{(g)} + \sigma^{(g)} \mathcal{N}\left(\mathbf{0}, \mathbf{C}^{(g)}\right) \quad \text{for } k = 1, \dots, \lambda$$
where

- $\mathbf{x}_k^{(g+1)}$ is the $k$-th offspring (search point) of generation $g+1$,
- $\mathbf{m}^{(g)}$ is the mean of the search distribution at generation $g$,
- $\sigma^{(g)}$ is the global step-size,
- $\mathbf{C}^{(g)}$ is the covariance matrix, and
- $\lambda$ is the population size (number of offspring).
While the formula looks like standard statistics, it describes a physical "search ellipsoid":

- $\mathbf{m}$ is the center of the ellipsoid,
- $\mathbf{C}$ determines its shape and orientation (which directions are stretched), and
- $\sigma$ scales its overall size.
The goal of CMA-ES is to adapt $\mathbf{C}$ so that the ellipsoid aligns with the "valleys" of the objective function. If the function is a long, narrow ridge, CMA-ES will learn a $\mathbf{C}$ that stretches the ellipsoid along that ridge, allowing the algorithm to take massive steps toward the optimum where other algorithms would be forced to take tiny, inefficient "zig-zag" steps.
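As a NumPy sketch (all numerical values below are illustrative assumptions, not CMA-ES defaults), one generation of sampling is a single vectorized call:

```python
import numpy as np

rng = np.random.default_rng(0)

n, lam = 2, 8                      # dimension, offspring count (lambda)
m = np.zeros(n)                    # distribution mean m^(g)
sigma = 0.5                        # global step-size sigma^(g)
C = np.array([[4.0, 1.5],          # covariance C^(g): an elongated,
              [1.5, 1.0]])         # rotated search ellipsoid

# x_k = m + sigma * N(0, C), for k = 1..lambda
X = m + sigma * rng.multivariate_normal(np.zeros(n), C, size=lam)
print(X.shape)  # (8, 2): one row per offspring
```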
The transition from the current generation to the next begins with the update of the mean $\mathbf{m}^{(g+1)}$. This process involves two sub-steps: Selection of the best performers and their Recombination into a single new center.
From the $\lambda$ offspring sampled in the current generation, we select the $\mu$ best individuals (those with the highest fitness/lowest objective value). The new mean is then calculated as the weighted average of these $\mu$ points:
$$\mathbf{m}^{(g+1)} = \sum_{i=1}^{\mu} w_i \mathbf{x}_{i:\lambda}$$
Where:

- $\mathbf{x}_{i:\lambda}$ denotes the $i$-th best of the $\lambda$ offspring (ranked by objective value), and
- $w_1 \geq w_2 \geq \dots \geq w_\mu > 0$ are recombination weights with $\sum_{i=1}^{\mu} w_i = 1$.
While $\mu$ counts the selected individuals, the variance-effective selection mass ($\mu_{eff}$) quantifies the reliable information gain. It is defined as:
$$\mu_{eff} = \left( \sum_{i=1}^{\mu} w_i^2 \right)^{-1}$$
Interpretation of $\mu_{eff}$:

- If all weights are equal ($w_i = 1/\mu$), then $\mu_{eff} = \mu$: every parent contributes fully.
- If a single weight dominates ($w_1 \approx 1$), then $\mu_{eff} \approx 1$: selection behaves like a single-parent strategy.
- In general, $1 \leq \mu_{eff} \leq \mu$, so it measures the effective number of parents.
Why is the variance $\sum w_i^2$?
For independent random variables $X_i$, the variance of their weighted sum is the sum of their weighted variances. Since $Var(aX) = a^2 Var(X)$:

$$Var\left( \sum w_i X_i \right) = \sum w_i^2 Var(X_i)$$

Assuming standardized mutation steps where $Var(X_i) = 1$, the total variance becomes $\sum w_i^2$.
Why the Inverse? (The Power of Averaging)
In signal processing, variance is noise and averaging is a filter. By averaging $\mu$ samples, we "clean" the search signal. $\mu_{eff}$ is the inverse of the variance because it represents the strength of this filter:
$\mu_{eff}$ essentially tells us: "How many independent random samples would I need to average together to get this same level of noise reduction?"
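A quick numerical check of these extremes (the helper name `mu_eff` is illustrative):

```python
import numpy as np

def mu_eff(w):
    """Variance-effective selection mass: (sum w_i^2)^(-1) for normalized w."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                      # enforce sum(w) = 1
    return 1.0 / np.sum(w ** 2)

print(mu_eff([1, 1, 1, 1]))              # equal weights: mu_eff = mu = 4
print(mu_eff([1, 0, 0, 0]))              # all mass on one parent: mu_eff = 1
print(mu_eff(np.log(5) - np.log([1, 2, 3, 4])))  # log-decreasing weights: between 1 and 4
```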
The efficiency of the search is heavily influenced by the relationship between the number of offspring ($\lambda$), the number of selected parents ($\mu$), and the resulting effective selection mass ($\mu_{eff}$).
The following heuristics are standard in the literature for balancing selection pressure and search stability:

- Population size: $\lambda = 4 + \lfloor 3 \ln n \rfloor$, growing only logarithmically with the dimension.
- Parent number: $\mu = \lfloor \lambda / 2 \rfloor$, i.e. select the better half.
- Weights: decreasing, e.g. $w_i \propto \ln\frac{\lambda+1}{2} - \ln i$, which yields $\mu_{eff} \approx \lambda / 4$.
The update to the distribution mean represents the "Tell" phase of the algorithm. It transitions the search center from the current generation to the next by aggregating the information from the most successful candidates. The final update equation for the mean is expressed as:
$$\mathbf{m}^{(g+1)} = \mathbf{m}^{(g)} + c_m \sum_{i=1}^{\mu} w_i (\mathbf{x}_{i:\lambda} - \mathbf{m}^{(g)})$$
Analysis of the Equation
This formulation is mathematically equivalent to the weighted sum shown previously, but it offers a clearer insight into the algorithm's mechanics:

- The term $(\mathbf{x}_{i:\lambda} - \mathbf{m}^{(g)})$ is the displacement of each selected offspring from the current center, so the update is an explicit step away from $\mathbf{m}^{(g)}$ rather than a fresh average.
- $c_m$ is a learning rate for the mean; with the usual default $c_m = 1$, the equation reduces exactly to the weighted sum $\sum_{i=1}^{\mu} w_i \mathbf{x}_{i:\lambda}$.
This specific structure—defining the update as a displacement from the current mean—is fundamental because it allows the algorithm to decouple the direction of the move from the step-size ($\sigma$) and the covariance shape ($\mathbf{C}$) during subsequent adaptation steps.
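The selection-and-recombination step can be sketched as follows (a minimal illustration with hypothetical offspring and weights, not a library API):

```python
import numpy as np

def update_mean(m, X, fvals, weights, c_m=1.0):
    """Move the mean toward the weighted average of the mu best offspring.

    m:       current mean m^(g), shape (n,)
    X:       offspring, shape (lambda, n)
    fvals:   objective values (lower is better), shape (lambda,)
    weights: recombination weights w_i, length mu, summing to 1
    """
    mu = len(weights)
    best = np.argsort(fvals)[:mu]            # indices of the mu best (x_{i:lambda})
    # m^(g+1) = m + c_m * sum_i w_i (x_{i:lambda} - m)
    return m + c_m * weights @ (X[best] - m)

# Toy check on f(x) = ||x||^2 with hand-picked offspring around m = (2, 2)
m = np.array([2.0, 2.0])
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.5, 2.5],
              [3.0, 1.0], [0.5, 3.5], [2.0, 2.0]])
f = np.sum(X**2, axis=1)
w = np.array([0.5, 0.3, 0.2])
m_new = update_mean(m, X, f, w)
print(m_new)   # the mean moved toward the origin: (1.35, 1.5)
```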
To understand Covariance Matrix Adaptation, one must view the mutation cloud not as a collection of random points, but as a probability distribution that the algorithm is iteratively "fitting" to the search space.
The objective is to update the covariance matrix $\mathbf{C}$ such that the likelihood of generating previously successful steps is increased in future generations.
In modern CMA-ES, the update to $\mathbf{C}$ combines two distinct "memories" of success:
$$\mathbf{C}^{(g+1)} = (1 - c_{1} - c_{\mu}) \mathbf{C}^{(g)} + c_{1} \underbrace{\mathbf{p}_c \mathbf{p}_c^T}_{\mathbf{P}_1} + c_{\mu} \underbrace{\sum_{i=1}^{\mu} w_i \mathbf{y}_i \mathbf{y}_i^T}_{ \mathbf{P}_{\mu}}$$
with the learning rate coefficients:

- $c_1 \approx 2/n^2$: the rate of the Rank-One update (learning from the evolution path $\mathbf{p}_c$), and
- $c_\mu \approx \mu_{eff}/n^2$: the rate of the Rank-$\mu$ update (learning from the current population).
The Rank-$\mu$ update allows the algorithm to extract geometric information from the entire selected population. This mechanism is primarily responsible for adapting the search distribution to the local topology of the objective function.
The Rank-$\mu$ Innovation Matrix is the weighted sample covariance of the successful steps in the current generation. We define the standardized step $\mathbf{y}_{i:\lambda}$ as:
$$\mathbf{y}_{i:\lambda} = \frac{\mathbf{x}_{i:\lambda} - \mathbf{m}^{(g)}}{\sigma^{(g)}}$$

The innovation matrix is then:
$$\mathbf{P}_{\mu} = \sum_{i=1}^{\lambda} w_i \mathbf{y}_{i:\lambda} \mathbf{y}_{i:\lambda}^T$$
Where:

- $\mathbf{y}_{i:\lambda}$ are the standardized steps of the ranked offspring (best first), and
- $w_i$ are the recombination weights; with Active Selection (below), the weights assigned to the worst-ranked individuals are negative.
In modern CMA-ES, we use Active Selection. While the mean $\mathbf{m}$ only follows the best $\mu$ individuals, the covariance matrix $\mathbf{C}$ also learns from the worst.
Weight Constraints: To maintain stability, the positive weights sum to one ($\sum_{w_i > 0} w_i = 1$). When including negative weights, the total sum $\sum_{i=1}^{\lambda} w_i$ is typically close to zero, ensuring the update primarily changes the shape of the distribution rather than just its overall volume.
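A sketch of the Rank-$\mu$ computation (illustrative helper name; uniform positive weights for simplicity, though CMA-ES uses decreasing and possibly negative ones):

```python
import numpy as np

def rank_mu_matrix(X, m, sigma, fvals, weights):
    """Weighted sample covariance of the standardized steps y = (x - m)/sigma."""
    order = np.argsort(fvals)                # rank offspring, best first
    Y = (X[order] - m) / sigma               # standardized steps y_{i:lambda}
    # P_mu = sum_i w_i * y_i y_i^T  (an n x n matrix)
    return np.einsum('i,ij,ik->jk', weights, Y, Y)

rng = np.random.default_rng(2)
m, sigma = np.zeros(2), 1.0
X = rng.multivariate_normal(m, np.eye(2), 12)
f = np.sum(X**2, axis=1)
w = np.full(12, 1 / 12)                      # uniform weights for illustration
P = rank_mu_matrix(X, m, sigma, f, w)
print(P.shape, np.allclose(P, P.T))          # a symmetric 2x2 matrix
```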
The learning rate $c_{\mu}$ acts as a relative weight determining how much the current population's "snapshot" ($\mathbf{P}_{\mu}$) influences the update compared to the long-term evolution path ($\mathbf{P}_1$) and the existing memory ($\mathbf{C}^{(g)}$).
A critical intuition is that $1/c_{\mu}$ represents the Backward Time Horizon (or time constant) of the algorithm. Because the influence of old data fades exponentially, the "age" of the information follows a specific decay pattern.
The Derivation:
The update for the covariance matrix is defined by the learning rate $c_{\mu}$, which determines how much weight is given to new information versus the existing memory.
The current state of the covariance matrix $\mathbf{C}^{(g+1)}$ is a weighted sum of all past innovations $\mathbf{P}_{\mu}$. By expanding the update equation $\mathbf{C}^{(g+1)} = (1 - c_{\mu})\mathbf{C}^{(g)} + c_{\mu}\mathbf{P}_{\mu}^{(g)}$ backwards to the beginning of the search ($g=0$), we get:

$$\mathbf{C}^{(g+1)} = (1-c_{\mu})^{g+1}\mathbf{C}^{(0)} + c_{\mu} \sum_{i=0}^{g} (1-c_{\mu})^i \mathbf{P}_{\mu}^{(g-i)}$$

This shows that the influence of an innovation from $i$ generations ago decays exponentially at a rate of $(1-c_{\mu})^i$. To find the "Backward Time Horizon," we look for the number of generations $\Delta g$ required for the sum of the weights to reach a certain fraction $\alpha$ of the total information. We define $\alpha$ as:

$$\alpha = \sum_{i=0}^{\Delta g-1} c_{\mu}(1-c_{\mu})^i$$
The sum is a finite geometric series. Applying the formula for the sum of a geometric series results in: $$\alpha = 1 - (1-c_{\mu})^{\Delta g}$$
We choose $\alpha \approx 63\% \approx 1 - 1/e$, i.e. the point where approximately $63\%$ of the total weight has been accumulated. This results in

$$(1-c_{\mu})^{\Delta g} \approx e^{-1}$$

and therefore

$$ \Delta g = \frac{-1}{\ln(1-c_{\mu})} $$
For small learning rates ($c_\mu \ll 1$), we use the first-order Taylor expansion of the natural logarithm, $\ln(1-x) \approx -x$.
$$\Delta g \approx \frac{1}{c_{\mu}}$$
Intuition: The "Time Constant"
This approximation confirms that the reciprocal of the learning rate is the time constant of the system. Just as in physics or electronics (RC circuits), the value $1/c$ represents the point where the system has completed roughly 63.2% of its transition to a new state. In CMA-ES, this means that after $1/c_\mu$ generations, the "old" matrix has been 63% replaced by "new" geometric information.
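A two-line check confirms how quickly the approximation $\Delta g \approx 1/c_\mu$ becomes accurate as the learning rate shrinks:

```python
import numpy as np

# Exact backward time horizon (alpha = 1 - 1/e) vs. the 1/c approximation
for c in [0.5, 0.1, 0.01, 0.001]:
    exact = -1.0 / np.log(1.0 - c)
    print(f"c = {c:6.3f}   exact horizon = {exact:8.2f}   1/c = {1/c:8.1f}")
# For c = 0.01 the exact horizon is about 99.5, already very close to 1/c = 100.
```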
To determine the optimal learning rate, we compare the information we gather against the complexity of the model we are trying to build:

- Information per generation: selection provides roughly $\mu_{eff}$ informative samples per generation.
- Model complexity: a symmetric $n \times n$ covariance matrix has on the order of $n^2$ free parameters to estimate.
Therefore, to ensure the matrix is supported by enough data, we set:

$$\text{"Information per generation"} \times \text{"Number of generations in memory"} \approx \text{Needed information}$$

$$\mu_{eff} \cdot \frac{1}{c_{\mu}} \approx n^2$$

Rearranging for the learning rate gives us the heuristic:

$$c_{\mu} \approx \frac{\mu_{eff}}{n^2}$$

The setting of $c_{\mu} \approx \mu_{eff} / n^2$ ensures that we only update the matrix as fast as we can reliably "learn" the landscape.
Discussion: The Balance of Learning
1. Avoiding Rank Deficiency
If we were to set $c_{\mu}$ much higher than $\mu_{eff}/n^2$, we would be trying to solve a system with $n^2$ unknowns using fewer than $n^2$ data points. Mathematically, this leads to a rank-deficient covariance matrix. Geometrically, this means the search ellipsoid would "collapse" or become perfectly flat in certain directions, prematurely excluding parts of the search space where the solution might lie.
2. Scaling with Population Size
This relationship reveals one of the greatest strengths of CMA-ES: Parallelization. If we increase the population size $\lambda$ (and thus $\mu_{eff}$), we can increase $c_{\mu}$ proportionally. This shortens the time horizon $\Delta g$, allowing the algorithm to adapt the coordinate system much faster. In high-dimensional spaces ($n > 100$), a large population is often the only way to make the covariance adaptation happen in a reasonable number of generations.
3. The $n^2$ "Curse"
The $n^2$ in the denominator tells us that the difficulty of learning the landscape's geometry grows quadratically with the dimension. While the Mean Update can react to a gradient almost immediately, the Covariance Update is naturally more conservative, requiring many observations to "verify" that a change in the search shape is statistically justified.
While the Rank-$\mu$ update captures the "spread" of the current population, it has a significant limitation: it is "sign-blind." Because it uses the outer product of mutation vectors, a step in direction $\mathbf{y}$ and a step in the opposite direction $-\mathbf{y}$ contribute identically to the matrix.
The Rank-One Update solves this by observing the Evolution Path—the correlated sequence of moves the distribution mean has taken over many generations.
Instead of looking only at the current generation, the algorithm maintains a strategy state called the evolution path $\mathbf{p}_c^{(g+1)}$. This is an exponentially weighted moving average of the steps taken by the mean $\mathbf{m}$:

$$\mathbf{p}_c^{(g+1)} = (1 - c_c) \mathbf{p}_c^{(g)} + \sqrt{c_c (2 - c_c) \mu_{eff}} \frac{\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}}{\sigma^{(g)}}$$
The normalization factor is essential for the stability of the Rank-One update. Its purpose is to ensure that under random selection (where there is no clear trend), the evolution path $\mathbf{p}_c$ follows the same distribution as a single mutation step: $\mathbf{p}_c \sim \mathcal{N}(\mathbf{0}, \mathbf{C})$.
Derivation from Variance Equilibrium
To keep the algorithm stable, the "energy" (variance) we add in each step must equal the "energy" lost through the decay $(1-c_c)$.
Let $\mathbf{z} = \frac{\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}}{\sigma^{(g)}}$ be the latest step. If selection is random, $\mathbf{z}$ is the weighted mean of $\mu$ samples from $\mathcal{N}(\mathbf{0}, \mathbf{C})$, so by the definition of $\mu_{eff}$ its variance is $Var(\mathbf{z}) = \frac{1}{\mu_{eff}} \mathbf{C}$.
We want the variance of the path to be stationary ($Var(\mathbf{p}_{new}) = Var(\mathbf{p}_{old}) = \mathbf{C}$).
Using the update rule $\mathbf{p}_{new} = (1-c_c)\mathbf{p}_{old} + \alpha \mathbf{z}$:

$$Var(\mathbf{p}) = (1-c_c)^2 Var(\mathbf{p}) + \alpha^2 Var(\mathbf{z})$$
$$\mathbf{C} = (1-c_c)^2 \mathbf{C} + \frac{\alpha^2}{\mu_{eff}} \mathbf{C}$$
Since every term is a scalar multiple of $\mathbf{C}$, we can divide it out and solve for $\alpha$:
$$1 = (1 - 2c_c + c_c^2) + \frac{\alpha^2}{\mu_{eff}}$$
$$2c_c - c_c^2 = \frac{\alpha^2}{\mu_{eff}}$$
$$\alpha = \sqrt{c_c(2-c_c)\mu_{eff}}$$
The factor consists of two distinct parts that serve different scaling needs:

- $\sqrt{c_c(2-c_c)}$ compensates for the shrinkage caused by the decay factor $(1-c_c)$, keeping the variance of the path stationary.
- $\sqrt{\mu_{eff}}$ compensates for the variance reduction from averaging over $\mu$ parents (recall $Var(\mathbf{z}) = \mathbf{C}/\mu_{eff}$), restoring the step to the scale of a single mutation.
If we didn't normalize, the length of $\mathbf{p}_c$ would depend on the population size and the learning rate. When we eventually perform the Rank-One update ($\mathbf{C} \approx \mathbf{p}_c \mathbf{p}_c^T$), a non-normalized path would cause the volume of the search ellipsoid to explode or collapse. By using this factor, the Rank-One update only changes the shape and orientation of the covariance matrix, leaving the global scale to be handled exclusively by the step-size $\sigma$.
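The stationarity claim can be verified empirically: iterating the path update under random selection, the sample covariance of the path should settle at $\mathbf{C}$ (the specific values of $c_c$, $\mu_{eff}$, and $\mathbf{C}$ below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
n, mu_eff, c_c = 2, 4.0, 0.2
C = np.array([[2.0, 0.6], [0.6, 1.0]])
norm = np.sqrt(c_c * (2 - c_c) * mu_eff)       # the normalization factor

# Under random selection the recombined step z is effectively the mean of
# mu_eff samples from N(0, C), i.e. z ~ N(0, C / mu_eff).
N = 60_000
Z = rng.multivariate_normal(np.zeros(n), C / mu_eff, size=N)

p = np.zeros(n)
paths = np.empty((N, n))
for g in range(N):
    p = (1 - c_c) * p + norm * Z[g]            # the p_c update rule
    paths[g] = p

emp = np.cov(paths[1000:].T)                   # discard burn-in
print(np.round(emp, 2))                        # close to C, as predicted
```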
The information from this path is then injected into the covariance matrix as a rank-one update:

$$\mathbf{P}_1 = \mathbf{p}_c^{(g+1)} \left( \mathbf{p}_c^{(g+1)} \right)^T$$
Mathematically, the outer product of a single vector $\mathbf{p}_c$ creates a matrix with a rank of exactly 1. The Rank-One update "stretches" the search ellipsoid specifically along the axis of the evolution path.
The complete CMA-ES update for the covariance matrix integrates the previous state of the search with both immediate population feedback and long-term directional trends.
Substituting our definitions for $\mathbf{P}_1$ and $\mathbf{P}_{\mu}$ back into the primary equation, we arrive at the final update rule:
$$\mathbf{C}^{(g+1)} = \underbrace{(1 - c_1 - c_\mu) \mathbf{C}^{(g)}}_{\text{Decay (Memory)}} + \underbrace{c_1 \mathbf{p}_c^{(g+1)} (\mathbf{p}_c^{(g+1)})^T}_{\text{Rank-One Update}} + \underbrace{c_\mu \sum_{i=1}^{\lambda} w_i \mathbf{y}_{i:\lambda} \mathbf{y}_{i:\lambda}^T}_{\text{Rank-}\mu \text{ Update}}$$
Why this dual approach works:

- The Rank-$\mu$ update exploits the breadth of the current population: it captures the spread of many successful steps within a single generation, which is especially effective with large populations.
- The Rank-One update exploits depth over time: because the evolution path preserves the sign of consecutive steps, it detects correlated movement that the sign-blind Rank-$\mu$ update cannot see, which is especially valuable with small populations.
By balancing these two updates, CMA-ES achieves its "State-of-the-Art" status: it is as efficient as a second-order (Newton-type) optimizer but requires zero derivatives and remains robust against noisy or non-smooth landscapes.
While the Covariance Matrix ($\mathbf{C}$) is highly effective at adapting the shape and orientation of the search distribution, it is fundamentally ill-equipped to manage the overall scale (the step-size $\sigma$). There are two primary reasons why we must decouple the step-size control from the covariance update:
The mathematical logic required to find the optimal "length" of a step is different from the logic used to find the optimal "direction."

- The Population Effect: On fundamental landscapes like the Sphere function ($f(x) = \sum x_i^2$), the optimal step-size is heavily dependent on the effective population size ($\mu_{eff}$).
- The Constraint: The covariance update rule does not scale with $\mu_{eff}$ in a way that matches these theoretical optima. If we relied solely on $\mathbf{C}$ to manage the volume of the search, the algorithm would be unable to exploit the efficiency gains offered by larger populations.
The most practical reason for a separate control is latency. The covariance matrix is a "slow learner" by design, while the step-size needs to be a "fast responder."
The decoupling of step-size control from covariance adaptation is a functional necessity based on their differing mathematical roles. While the covariance matrix $\mathbf{C}$ models the second-order global topology, the step-size $\sigma$ manages the rate of convergence. To address this, CMA-ES introduces Cumulative Step-Size Adaptation (CSA). By isolating the scalar $\sigma$, the algorithm can react to evidence of overshooting or undershooting the optimum on a much shorter time horizon ($O(n)$). This separation ensures that the search remains agile even when the coordinate system adaptation is restricted by a conservative learning rate ($O(n^2)$).
The core idea behind CSA is to analyze the "directionality" of the search path: if consecutive steps consistently point in the same direction, the step-size is too small (the algorithm is covering a long distance in many small increments); if consecutive steps cancel each other out ("zig-zagging"), the step-size is too large.
To measure this, we maintain a second evolution path, $\mathbf{p}_\sigma$, specifically for the step-size. It is similar to the path used for the covariance matrix, but it is "scaled" to remove the influence of the current shape $\mathbf{C}$:
$$\mathbf{p}_\sigma^{(g+1)} = (1 - c_\sigma) \mathbf{p}_\sigma^{(g)} + \sqrt{c_\sigma (2 - c_\sigma) \mu_{eff}} \underbrace{\mathbf{C}^{(g)-1/2} \frac{\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}}{\sigma^{(g)}}}_{\text{Isotropic Step}}$$
By applying the transformation $\mathbf{C}^{-1/2} = \mathbf{Q} \mathbf{\Lambda}^{-1/2} \mathbf{Q}^T$, we transform the elliptical search space back into a sphere. The steps of the transformation are:

1. $\mathbf{Q}^T$ rotates the step into the eigenbasis of $\mathbf{C}$ (de-correlating the coordinates),
2. $\mathbf{\Lambda}^{-1/2}$ rescales each axis by the inverse standard deviation of that eigendirection, and
3. $\mathbf{Q}$ rotates the result back into the original coordinate system.
This ensures that a step of length $\sigma$ has the same weight in the path $\mathbf{p}_\sigma$ regardless of whether it was taken along a "long" or "short" axis of the covariance ellipsoid. In other words, by "whitening" the data with $\mathbf{C}^{-1/2}$ for $\mathbf{p}_\sigma$, CMA-ES ensures that the step-size doesn't get confused by the shape of the ellipsoid. It sees the world as a sphere, allowing it to judge purely whether the steps are "zig-zagging" or "sprinting" in a coordinate-invariant way.
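A minimal check of this whitening transform, built from the eigendecomposition (`inv_sqrt` is an illustrative helper name):

```python
import numpy as np

def inv_sqrt(C):
    """C^{-1/2} via the eigendecomposition C = Q Lambda Q^T."""
    vals, Q = np.linalg.eigh(C)              # rotate into the eigenbasis
    return Q @ np.diag(vals ** -0.5) @ Q.T   # rescale each axis, rotate back

C = np.array([[4.0, 1.5], [1.5, 1.0]])
W = inv_sqrt(C)

# Whitening undoes the shape of C: W C W = C^{-1/2} C C^{-1/2} = I
print(np.allclose(W @ C @ W, np.eye(2)))     # True
```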
The derivation of the normalizing factor $\sqrt{c_\sigma (2 - c_\sigma) \mu_{eff}}$ is analogous to that of the normalizing factor of the evolution path $\mathbf{p}_c$ (see above).
The update rule is derived by comparing the actual length of the evolution path to the expected length of a path generated by purely random, uncorrelated steps.
The Reference Value
In a purely random search (a "random walk"), each step is independent. The expected length of the vector $\mathbf{p}_\sigma$ under these conditions is the expected norm of a standard normal distribution, denoted as $\mathbb{E}\|\mathcal{N}(\mathbf{0}, \mathbf{I})\|$.
Note: $\mathbb{E}\|\mathcal{N}(\mathbf{0}, \mathbf{I})\| \approx \sqrt{n}$ for large $n$, i.e. the expected length of a random vector is approximately $\sqrt{n}$. This serves as the "neutral" baseline: if the path is significantly longer than $\sqrt{n}$, the search is moving in a consistent direction; if it is significantly shorter, the steps are cancelling each other out.
In Practice: To ensure accuracy in lower dimensions, CMA-ES uses the approximation $\mathbb{E}\|\mathcal{N}(\mathbf{0}, \mathbf{I})\| \approx \sqrt{n}(1 - \frac{1}{4n} + \frac{1}{21n^2})$.
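This approximation can be checked against a quick Monte Carlo estimate (the helper name `chi_n` follows the common notation $\chi_n$ but is an assumption of this sketch):

```python
import numpy as np

def chi_n(n):
    """Approximate expected length of an n-dimensional standard normal vector."""
    return np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))

rng = np.random.default_rng(4)
for n in [2, 10, 100]:
    empirical = np.linalg.norm(rng.standard_normal((100_000, n)), axis=1).mean()
    print(f"n = {n:3d}   approx = {chi_n(n):7.3f}   monte carlo = {empirical:7.3f}")
```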
The Comparison
We look at the ratio between the realized path length and the expected path length:
$$\frac{\|\mathbf{p}_\sigma^{(g+1)}\|}{\mathbb{E}\|\mathcal{N}(\mathbf{0}, \mathbf{I})\|}$$
The Exponential Update (Derivation)
To ensure $\sigma$ stays positive and scales multiplicatively, we use the natural logarithm of this ratio. We also introduce a damping parameter $d_\sigma$ and a learning rate $c_\sigma$ to control the update speed:
$$\ln(\sigma^{(g+1)}) = \ln(\sigma^{(g)}) + \frac{c_\sigma}{d_\sigma} \left( \frac{\|\mathbf{p}_\sigma^{(g+1)}\|}{\mathbb{E}\|\mathcal{N}(\mathbf{0}, \mathbf{I})\|} - 1 \right)$$
Applying the exponential function to both sides yields the final update rule:
$$\sigma^{(g+1)} = \sigma^{(g)} \cdot \exp \left( \frac{c_\sigma}{d_\sigma} \left( \frac{\|\mathbf{p}_\sigma^{(g+1)}\|}{\mathbb{E}\|\mathcal{N}(\mathbf{0}, \mathbf{I})\|} - 1 \right) \right)$$
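The update rule can be sketched as a small function. The values of `c_sigma` and `d_sigma` below are illustrative placeholders (the real defaults depend on $n$ and $\mu_{eff}$), and the two test paths are constructed by hand to have lengths $2\chi_n$ and $\chi_n/2$:

```python
import numpy as np

def update_sigma(sigma, p_sigma, n, c_sigma=0.3, d_sigma=1.0):
    """CSA step-size update: compare ||p_sigma|| to its random-walk expectation."""
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))
    return sigma * np.exp((c_sigma / d_sigma)
                          * (np.linalg.norm(p_sigma) / chi_n - 1))

n = 10
chi = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))
long_path  = np.full(n, 2.0 * chi / np.sqrt(n))   # ||p|| = 2*chi: consistent motion
short_path = np.full(n, 0.5 * chi / np.sqrt(n))   # ||p|| = chi/2: zig-zagging
print(update_sigma(1.0, long_path, n) > 1.0)      # True: grow sigma
print(update_sigma(1.0, short_path, n) < 1.0)     # True: shrink sigma
```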
A remarkable property of CSA is that by adapting $\sigma$ such that the length of the evolution path $\|\mathbf{p}_\sigma\|$ remains approximately equal to the expected length $\mathbb{E}\|\mathcal{N}(\mathbf{0}, \mathbf{I})\|$, the algorithm forces successive search steps to become $\mathbf{C}^{-1}$-conjugate.
If we assume the step-size is at an equilibrium, then $\|\mathbf{p}_\sigma^{(g+1)}\|^2 \approx \|\mathbf{p}_\sigma^{(g)}\|^2$. By expanding the recursive definition of $\mathbf{p}_\sigma$, we find that the "old" path and the "new" isotropic step must be orthogonal:
$$(\mathbf{p}_\sigma^{(g)})^T \left( \mathbf{C}^{(g)-1/2} \frac{\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}}{\sigma^{(g)}} \right) \approx 0$$
If we substitute the previous step ($\mathbf{m}^{(g)} - \mathbf{m}^{(g-1)}$) into the path and simplify, we arrive at:
$$(\mathbf{m}^{(g)} - \mathbf{m}^{(g-1)})^T \underbrace{\mathbf{C}^{(g)-1}}_{\text{Metric}} (\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}) \approx 0$$
This identity shows that successive mean steps are conjugate with respect to the metric $\mathbf{C}^{-1}$, precisely the property that second-order methods such as conjugate gradients enforce explicitly. In this sense, CMA-ES performs second-order optimization without derivatives.
In plain English: The Step-Size Control ensures that your next move doesn't "undo" the work of your previous move. It ensures that the algorithm heads directly toward the optimum along the "canyon floor" rather than bouncing off the walls, achieving the efficiency of a Quasi-Newton method without ever calculating a derivative.
CMA-ES uses two separate "memories" (evolution paths) to manage different aspects of the search. While they look similar mathematically, their goals are distinct:

| | $\mathbf{p}_c$ (Covariance Path) | $\mathbf{p}_\sigma$ (Step-Size Path) |
|---|---|---|
| Feeds into | Rank-One update of $\mathbf{C}$ | Step-size $\sigma$ (CSA) |
| Coordinate system | Raw steps (shape-aware) | Whitened by $\mathbf{C}^{-1/2}$ (shape-blind) |
| Question answered | In which direction does the valley run? | Are we over- or under-stepping? |
The Synergy: How They Work Together
We have now covered the entire state-of-the-art process:

1. Sample $\lambda$ offspring from $\mathcal{N}(\mathbf{m}, \sigma^2 \mathbf{C})$.
2. Select the $\mu$ best and recombine them into the new mean $\mathbf{m}^{(g+1)}$.
3. Update the evolution paths $\mathbf{p}_c$ and $\mathbf{p}_\sigma$.
4. Update the covariance matrix $\mathbf{C}$ (Rank-One + Rank-$\mu$).
5. Update the step-size $\sigma$ via CSA.
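Putting the whole process together, here is a minimal, self-contained CMA-ES sketch using standard default parameter formulas from the literature; refinements such as the $h_\sigma$ stall indicator and active (negative) weights are omitted for brevity:

```python
import numpy as np

def cma_es(f, m, sigma, n_gens=250, seed=0):
    """Minimal (mu/mu_w, lambda)-CMA-ES sketch."""
    rng = np.random.default_rng(seed)
    n = len(m)
    # -- standard default parameters --
    lam = 4 + int(3 * np.log(n))                 # offspring per generation
    mu = lam // 2
    w = np.log((lam + 1) / 2) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                  # positive weights, sum to 1
    mu_eff = 1.0 / np.sum(w ** 2)
    c_sigma = (mu_eff + 2) / (n + mu_eff + 5)
    d_sigma = 1 + c_sigma + 2 * max(0.0, np.sqrt((mu_eff - 1) / (n + 1)) - 1)
    c_c = (4 + mu_eff / n) / (n + 4 + 2 * mu_eff / n)
    c_1 = 2 / ((n + 1.3) ** 2 + mu_eff)
    c_mu = min(1 - c_1, 2 * (mu_eff - 2 + 1 / mu_eff) / ((n + 2) ** 2 + mu_eff))
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))

    C, p_sigma, p_c = np.eye(n), np.zeros(n), np.zeros(n)

    for g in range(n_gens):
        # 1. Sample lambda offspring from N(m, sigma^2 C)
        Y = rng.multivariate_normal(np.zeros(n), C, lam)
        X = m + sigma * Y
        # 2. Select the mu best and recombine the mean
        order = np.argsort([f(x) for x in X])
        y_w = w @ Y[order[:mu]]                   # weighted mean step
        m = m + sigma * y_w
        # 3. Update the whitened step-size path, then sigma (CSA)
        vals, Q = np.linalg.eigh(C)
        inv_sqrt_C = Q @ np.diag(vals ** -0.5) @ Q.T
        p_sigma = ((1 - c_sigma) * p_sigma
                   + np.sqrt(c_sigma * (2 - c_sigma) * mu_eff) * inv_sqrt_C @ y_w)
        sigma *= np.exp((c_sigma / d_sigma) * (np.linalg.norm(p_sigma) / chi_n - 1))
        # 4. Update the covariance path and C (rank-one + rank-mu)
        p_c = (1 - c_c) * p_c + np.sqrt(c_c * (2 - c_c) * mu_eff) * y_w
        rank_mu = np.einsum('i,ij,ik->jk', w, Y[order[:mu]], Y[order[:mu]])
        C = ((1 - c_1 - c_mu) * C
             + c_1 * np.outer(p_c, p_c)
             + c_mu * rank_mu)
    return m

# Ill-conditioned quadratic valley (condition number 1000)
f = lambda x: x[0] ** 2 + 1000 * x[1] ** 2
x_best = cma_es(f, m=np.array([3.0, 2.0]), sigma=1.0)
print(f(x_best))   # far below the starting value f(3, 2) = 4009
```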
These must be set by the user based on the objective function and search space:

- the initial mean $\mathbf{m}^{(0)}$ (the starting point of the search),
- the initial step-size $\sigma^{(0)}$ (on the order of the expected distance to the optimum), and
- optionally the population size $\lambda$, if the default is too small for a rugged or noisy problem.
These are calculated automatically by the algorithm based on the dimension $n$ and the population size $\lambda$. Changing these is rare and typically only for specialized research.
Covariance Learning Rates ($c_1$ and $c_\mu$)
These control the "memory" of the covariance matrix. They are scaled by $1/n^2$ because there are approximately $n^2/2$ degrees of freedom in the matrix to be estimated.
Step-Size Control ($c_\sigma$ and $d_\sigma$)
These control the "agility" of the step-size $\sigma$. They operate on a much faster time horizon than the covariance matrix.
Cumulation for Covariance ($c_c$)
This controls the memory of the evolution path $\mathbf{p}_c$. Its backward time horizon $1/c_c$ scales roughly linearly with the dimension ($c_c \approx 4/n$), placing it between the fast step-size control and the slow covariance update.
If the algorithm is failing, don't change the learning rates; change the strategy:

- Increase the population size $\lambda$: this raises $\mu_{eff}$, so the internal defaults adapt faster and more robustly (see the parallelization discussion above).
- Restart from a different initial mean or with a larger initial $\sigma$, so the search does not stall in a local basin.
- Rescale or re-parameterize the search space if the variables differ by many orders of magnitude.