
Covariance Matrix Adaptation Evolution Strategy¶

Moving Toward CMA-ES: The State-of-the-Art¶

While the basic Evolution Strategy we have built is robust, it faces a major hurdle in the real world: Ill-conditioning. In many optimization problems, the "valley" leading to the minimum is not a perfect circle; it is a long, narrow, and curved canyon.

In these scenarios, a circular mutation cloud (using the Identity Matrix) is highly inefficient. Most mutations will hit the "walls" of the canyon, and only a tiny fraction will move down the "floor." To solve this, we need the algorithm to learn the topology of the landscape as it moves.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

# --- 1. DESIGN THE EXTREME LANDSCAPE ---
def create_extreme_valley(angle_deg, scale_ratio):
    angle = np.radians(angle_deg)
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]]) 
    S = np.diag([scale_ratio, 1.0])  # 50:1 Elongation
    return R @ S @ R.T, angle

# Angle of 60 degrees for a distinct diagonal look
A, rot_angle = create_extreme_valley(angle_deg=60, scale_ratio=50.0)

def fitness(x, y):
    # Centered at (0,0)
    return A[0,0]*x**2 + (A[0,1]+A[1,0])*x*y + A[1,1]*y**2

# --- 2. POSITION THE MEAN ON THE VALLEY FLOOR ---
# To be in the valley, the mean must align with the 'easy' eigenvector
distance_from_center = 1.2
m = np.array([
    distance_from_center * np.cos(rot_angle + np.pi/2), 
    distance_from_center * np.sin(rot_angle + np.pi/2)
])
sigma = 0.2

# --- 3. SAMPLING ---
# Basic ES (Identity)
C_iso = np.eye(2)
samples_iso = m + sigma * np.random.multivariate_normal([0, 0], C_iso, 50)

# CMA-ES Goal (Optimal alignment)
C_aligned = np.linalg.inv(A)
C_aligned /= np.sqrt(np.linalg.det(C_aligned)) # Normalize area
samples_aligned = m + sigma * np.random.multivariate_normal([0, 0], C_aligned, 50)

# --- 4. VISUALIZATION ---
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 7))

data = [(ax1, samples_iso, C_iso, "Basic ES (Isotropic)"), 
        (ax2, samples_aligned, C_aligned, "CMA-ES (Aligned)")]

# Grid setup
lim = 2.0
x_grid = np.linspace(-lim, lim, 150)
y_grid = np.linspace(-lim, lim, 150)
X, Y = np.meshgrid(x_grid, y_grid)
Z = fitness(X, Y)

for ax, samples, C, title in data:
    # Using log-scale for better visualization of the narrow floor
    ax.contourf(X, Y, np.log1p(Z), levels=30, cmap='plasma', alpha=0.5)
    ax.scatter(samples[:,0], samples[:,1], c='white', edgecolors='blue', s=15, alpha=0.8)
    ax.scatter(m[0], m[1], c='red', marker='P', s=150, label="Mean (m)", zorder=6)
    ax.scatter(0, 0, c='lime', marker='*', s=300, label="Target (0,0)", zorder=6)
    
    # Calculate search boundary
    vals, vecs = np.linalg.eigh(C)
    order = vals.argsort()[::-1]
    v, w = vals[order], vecs[:, order]
    theta = np.degrees(np.arctan2(w[1, 0], w[0, 0]))
    
    # Visualize the elongation: width/height are the full axes of the 2-sigma contour
    width, height = 2 * 2 * sigma * np.sqrt(v)
    ellipse = Ellipse(xy=m, width=width, height=height, angle=theta, 
                      edgecolor='cyan', fc='none', lw=2.5, ls='--', label='Search Area')
    ax.add_patch(ellipse)
    
    ax.set_title(title, fontsize=14)
    ax.set_aspect('equal')
    ax.set_xlim(-lim, lim)
    ax.set_ylim(-lim, lim)
    ax.grid(alpha=0.2)
    ax.legend(loc='upper right')

plt.tight_layout()
plt.show()

What is CMA-ES?¶

The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) represents a significant advancement in derivative-free, black-box optimization. It extends the $(\mu/\mu_w, \lambda)$ framework by incorporating a second-order model of the objective function's landscape through an adaptive covariance matrix.

Instead of just moving a fixed-shape cloud, CMA-ES adapts the shape and orientation of the mutation distribution.

Key Mechanisms of CMA-ES¶

CMA-ES utilizes two distinct mechanisms to adapt the search distribution dynamically:

  1. Covariance Matrix Adaptation (CMA): This mechanism adjusts the geometric shape of the search distribution. By analyzing the successful steps taken by the population, the algorithm "stretches" the distribution along directions of high improvement and "compresses" it along directions of high cost. This effectively learns a local approximation of the inverse Hessian matrix, allowing the solver to navigate ill-conditioned landscapes—such as narrow, curved valleys—with high efficiency.
  2. Cumulative Step-Size Adaptation (CSA): Replacing simpler heuristics like the 1/5th Success Rule, CSA utilizes an evolution path. This path tracks the cumulative movement of the distribution mean over several generations.
    • Coherent Movement: If the evolution path is long (meaning successive steps are aligned), the algorithm increases the step-size $\sigma$ to accelerate convergence.
    • Incoherent Movement: If the path is short or "zig-zags" (meaning successive steps cancel each other out), the algorithm decreases $\sigma$ to facilitate local refinement or convergence to the optimum.

Theoretical Advantages¶

  • Parameter Autonomy: CMA-ES is designed to be highly autonomous; its internal strategy parameters are typically self-adjusting and require minimal user intervention.
  • Invariance Properties: The algorithm is invariant to strictly monotonic transformations of the objective function value and to orthogonal transformations (rotations) of the search space.
  • Sample Efficiency: It maintains robust performance even with limited population sizes, making it suitable for high-dimensional, non-linear optimization problems.
| Feature | Basic ES | CMA-ES |
| --- | --- | --- |
| Search Shape | Static Sphere | Adaptive Ellipsoid |
| Feedback | Immediate (Current Gen) | Historical (Evolution Path) |
| Best For | Simple, "Round" Bowls | Complex, Narrow, or Curved Valleys |
| Complexity | Easy to implement | Mathematically more complex |

Basic Sampling equation¶

In the CMA Evolution Strategy, a population of new search points (individuals, offspring) is generated by sampling a multivariate normal distribution:

$$\mathbf{x}_k^{(g+1)} \sim \mathbf{m}^{(g)} + \sigma^{(g)} \mathcal{N}\left(\mathbf{0}, \mathbf{C}^{(g)}\right) \quad \text{for } k = 1, \dots, \lambda$$

where

  • $\mathbf{x}_k^{(g+1)} \in \mathbb{R}^n$: The $k$-th offspring in the new generation $(g+1)$.
  • $\mathbf{m}^{(g)} \in \mathbb{R}^n$: The mean value of the search distribution at generation $g$. It represents the center of the current "cloud" of guesses.
  • $\sigma^{(g)} > 0$: The step-size. This is a scalar value that controls the overall "reach" or mutation strength. It determines the total volume of the search space covered.
  • $\mathcal{N}\left(\mathbf{0}, \mathbf{C}^{(g)}\right)$: A multivariate normal distribution with mean $0$ and covariance matrix $\mathbf{C}$.
  • $\mathbf{C}^{(g)} \in \mathbb{R}^{n \times n}$: The covariance matrix. Unlike basic ES where we assume every direction is equal ($\mathbf{C} = \mathbf{I}$), here $\mathbf{C}$ defines the shape and orientation of the search ellipsoid.
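The sampling equation can be sketched directly in NumPy; the values for $n$, $\lambda$, $\mathbf{m}$, $\sigma$, and $\mathbf{C}$ below are illustrative only, not tied to any particular problem:

```python
import numpy as np

rng = np.random.default_rng(0)

n, lam = 2, 10                     # dimension n and offspring count lambda
m = np.array([1.0, -0.5])          # current mean m^(g)
sigma = 0.3                        # step-size sigma^(g)
C = np.array([[2.0, 0.8],          # covariance C^(g): a stretched ellipsoid
              [0.8, 1.0]])

# x_k = m + sigma * N(0, C) for k = 1..lambda
offspring = m + sigma * rng.multivariate_normal(np.zeros(n), C, size=lam)
print(offspring.shape)  # (10, 2)
```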

The Geometric Interpretation¶

While the formula looks like standard statistics, it describes a physical "search ellipsoid":

  • The Center ($\mathbf{m}$): Where the ellipsoid is located in the landscape.
  • The Scale ($\sigma$): How large the ellipsoid is.
  • The Shape ($\mathbf{C}$): How the ellipsoid is stretched and rotated.

The goal of CMA-ES is to adapt $\mathbf{C}$ so that the ellipsoid aligns with the "valleys" of the objective function. If the function is a long, narrow ridge, CMA-ES will learn a $\mathbf{C}$ that stretches the ellipsoid along that ridge, allowing the algorithm to take massive steps toward the optimum where other algorithms would be forced to take tiny, inefficient "zig-zag" steps.

Selection and Recombination: Moving the Mean¶

The transition from the current generation to the next begins with the update of the mean $\mathbf{m}^{(g+1)}$. This process involves two sub-steps: Selection of the best performers and their Recombination into a single new center.

Calculating the New Mean¶

From the $\lambda$ offspring sampled in the current generation, we select the $\mu$ best individuals (those with the highest fitness/lowest objective value). The new mean is then calculated as the weighted average of these $\mu$ points:

$$\mathbf{m}^{(g+1)} = \sum_{i=1}^{\mu} w_i \mathbf{x}_{i:\lambda}$$

Where:

  • $\mathbf{x}_{i:\lambda}$: The $i$-th best individual among the $\lambda$ offspring.
  • $w_i$: The positive recombination weights, which typically sum to one ($\sum w_i = 1$).
  • $w_1 \geq w_2 \geq \dots \geq w_{\mu} > 0$: The weights are non-increasing, ensuring that the "champion" of the generation has more influence on the next mean than the individuals at the bottom of the selection list.
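Selection and recombination can be sketched on a toy sphere objective; the logarithmic weight profile used here is one common choice and is an assumption, not prescribed by the text above:

```python
import numpy as np

rng = np.random.default_rng(1)

lam, mu = 10, 5
offspring = rng.normal(size=(lam, 2))       # lambda candidate points
fitness = (offspring ** 2).sum(axis=1)      # sphere objective: lower is better

# Rank by fitness and keep the mu best: x_{1:lambda}, ..., x_{mu:lambda}
best = offspring[np.argsort(fitness)[:mu]]

# Non-increasing positive weights, normalized to sum to one
w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
w /= w.sum()

# New mean: weighted average of the winners
m_new = w @ best
```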

Variance Effectiveness: $\mu_{eff}$¶

While $\mu$ counts the selected individuals, the variance-effective selection mass ($\mu_{eff}$) quantifies the reliable information gain. It is defined as:

$$\mu_{eff} = \left( \sum_{i=1}^{\mu} w_i^2 \right)^{-1}$$

Interpretation of $\mu_{eff}$:

  • Equal Weights: If all individuals have equal weights ($w_i = 1/\mu$), then $\mu_{eff} = \mu$.
  • Single Winner: If one individual carries all the weight ($w_1 = 1$, others $0$), then $\mu_{eff} = 1$.
  • The "Smoothing" Factor: Typically, $1 \leq \mu_{eff} \leq \mu$. A higher $\mu_{eff}$ indicates a more stable, "averaged" update that is less sensitive to noise, whereas a lower $\mu_{eff}$ indicates a more aggressive search that follows the single best point more closely.
  • Information Gain: It serves as a measure of the "signal-to-noise ratio" of the population update, used to tune the learning rates for the covariance matrix and step-size (see below).
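The two limiting cases can be checked numerically; the helper `mu_eff` is introduced here purely for illustration:

```python
import numpy as np

def mu_eff(w):
    """Variance-effective selection mass: 1 / sum(w_i^2) for weights summing to 1."""
    w = np.asarray(w, dtype=float)
    return 1.0 / np.sum(w ** 2)

mu = 4
print(mu_eff(np.full(mu, 1 / mu)))   # equal weights  -> mu_eff = mu = 4.0
print(mu_eff([1.0, 0, 0, 0]))        # single winner  -> mu_eff = 1.0
```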

The Mathematical Logic¶

Why is the variance $\sum w_i^2$?

For independent random variables $X_i$, the variance of their weighted sum is the sum of their weighted variances. Since $Var(aX) = a^2 Var(X)$:

$$Var\left( \sum w_i X_i \right) = \sum w_i^2 Var(X_i)$$

Assuming standardized mutation steps where $Var(X_i) = 1$, the total variance becomes $\sum w_i^2$.

Why the Inverse? (The Power of Averaging)

In signal processing, variance is noise and averaging is a filter. By averaging $\mu$ samples, we "clean" the search signal. $\mu_{eff}$ is the inverse of the variance because it represents the strength of this filter:

  • High Variance ($\sum w_i^2$ is large): Minimal noise reduction; the update is based on very few points. $\mu_{eff}$ is small.
  • Low Variance ($\sum w_i^2$ is small): Strong noise reduction; the signal is a consensus of many points. $\mu_{eff}$ is large.

$\mu_{eff}$ essentially tells us: "How many independent random samples would I need to average together to get this same level of noise reduction?"

Selection Mass and Weight Heuristics¶

The efficiency of the search is heavily influenced by the relationship between the number of offspring ($\lambda$), the number of selected parents ($\mu$), and the resulting effective selection mass ($\mu_{eff}$).

The following heuristics are standard in the literature for balancing selection pressure and search stability:

  • Optimal $\mu_{eff}$: A value of $\mu_{eff} \approx \lambda/4$ is generally considered a reasonable setting for the selection mass. This provides a balance between exploiting the best found points and maintaining enough diversity to avoid premature convergence.
  • Linear Weighting Scheme: A common, simplified approach for defining the recombination weights is $$w_i \propto \mu - i + 1 \quad \text{for } i = 1 \dots \mu$$ When combined with a selection ratio of $\mu \approx \lambda/2$, this yields an effective selection mass of $\mu_{eff} \approx 3\lambda/8$.
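The value $\mu_{eff} \approx 3\lambda/8$ for the linear scheme can be verified numerically (the population size here is illustrative; the approximation holds in the large-$\lambda$ limit):

```python
import numpy as np

lam = 400
mu = lam // 2                     # selection ratio mu ~ lambda/2

# Linear weighting: w_i proportional to (mu - i + 1), i = 1..mu
w = (mu - np.arange(1, mu + 1) + 1).astype(float)
w /= w.sum()

mu_eff = 1.0 / np.sum(w ** 2)
print(mu_eff, 3 * lam / 8)        # mu_eff approaches 3*lambda/8
```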

The Final Update Equation for the Mean¶

The update to the distribution mean represents the "Tell" phase of the algorithm. It transitions the search center from the current generation to the next by aggregating the information from the most successful candidates. The final update equation for the mean is expressed as:

$$\mathbf{m}^{(g+1)} = \mathbf{m}^{(g)} + c_m \sum_{i=1}^{\mu} w_i (\mathbf{x}_{i:\lambda} - \mathbf{m}^{(g)})$$

  • $c_m$ is a learning rate, usually set to $1$. Lower values can be advantageous on noisy functions.

Analysis of the Equation

This formulation is mathematically equivalent to the weighted sum shown previously, but it offers a clearer insight into the algorithm's mechanics:

  1. The Difference Vector: The term $(\mathbf{x}_{i:\lambda} - \mathbf{m}^{(g)})$ represents the "step" or displacement of the $i$-th best individual from the current mean.
  2. The Innovation: The summation $\sum_{i=1}^{\mu} w_i (\mathbf{x}_{i:\lambda} - \mathbf{m}^{(g)})$ represents the weighted mutation vector. It signifies the collective "direction of success" discovered by the population.
  3. Shift in Distribution: The mean is not simply replaced; it is shifted by the consensus of the winners.

This specific structure—defining the update as a displacement from the current mean—is fundamental because it allows the algorithm to decouple the direction of the move from the step-size ($\sigma$) and the covariance shape ($\mathbf{C}$) during subsequent adaptation steps.
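The equivalence between the displacement form and the plain weighted sum (for $c_m = 1$ and $\sum w_i = 1$) can be checked with random data; all values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

mu, n = 5, 3
m = rng.normal(size=n)                          # current mean m^(g)
x = rng.normal(size=(mu, n))                    # the mu selected points x_{i:lambda}
w = np.array([0.4, 0.25, 0.15, 0.12, 0.08])     # weights summing to one
c_m = 1.0

direct = w @ x                                  # m^(g+1) = sum_i w_i x_{i:lambda}
shifted = m + c_m * (w @ (x - m))               # displacement form
print(np.allclose(direct, shifted))             # True
```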

Adapting the Covariance Matrix¶

To understand Covariance Matrix Adaptation, one must view the mutation cloud not as a collection of random points, but as a probability distribution that the algorithm is iteratively "fitting" to the search space.

The objective is to update the covariance matrix $\mathbf{C}$ such that the likelihood of generating previously successful steps is increased in future generations.

The Covariance Update Equation¶

In modern CMA-ES, the update to $\mathbf{C}$ combines two distinct "memories" of success:

$$\mathbf{C}^{(g+1)} = (1 - c_{1} - c_{\mu}) \mathbf{C}^{(g)} + c_{1} \underbrace{\mathbf{p}_c \mathbf{p}_c^T}_{\mathbf{P}_1} + c_{\mu} \underbrace{\sum_{i=1}^{\mu} w_i \mathbf{y}_i \mathbf{y}_i^T}_{ \mathbf{P}_{\mu}}$$

  • $\mathbf{P}_\mu$ for Rank-$\mu$ Update (the "short-term" memory): This adapts the matrix based on the distribution of the current $\mu$ successful offspring. By looking at the spread of the winners in a single generation, it captures the local "shape" of the landscape. This is particularly powerful with large population sizes ($\lambda$).
  • $\mathbf{P}_1$ for Rank-One Update (the "long-term" memory): This adapts the matrix based on the Evolution Path. It tracks the correlated direction the mean has moved over many generations. It is the "specialty" of CMA-ES, allowing it to learn the orientation of narrow, "cigar-like" valleys even with tiny populations.

with the learning rate coefficients:

  • $c_{\mu}$: The reliance on the current population.
  • $c_1$: The reliance on the evolution path.

The Rank-$\mu$ Update¶

The Rank-$\mu$ update allows the algorithm to extract geometric information from the entire selected population. This mechanism is primarily responsible for adapting the search distribution to the local topology of the objective function.

The Innovation Matrix $\mathbf{P}_{\mu}$¶

The Rank-$\mu$ Innovation Matrix is the weighted sample covariance of the successful steps in the current generation. We define the standardized step $\mathbf{y}_{i:\lambda}$ as:

$$\mathbf{y}_{i:\lambda} = \frac{\mathbf{x}_{i:\lambda} - \mathbf{m}^{(g)}}{\sigma^{(g)}}$$

The innovation matrix is then:

$$\mathbf{P}_{\mu} = \sum_{i=1}^{\lambda} w_i \mathbf{y}_{i:\lambda} \mathbf{y}_{i:\lambda}^T$$

Where:

  • $(\mathbf{x}_{i:\lambda} - \mathbf{m}^{(g)})$ represents the mutation vector of the $i$-th best offspring.
  • The division by $\sigma^{(g)}$ normalizes the update, ensuring the adaptation is scale-invariant.
  • The outer product (the term in brackets) produces a matrix that expands the search distribution specifically along the axis of the successful step.
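Computing $\mathbf{P}_{\mu}$ from standardized steps can be sketched as follows; equal positive weights and random stand-in winners are used for simplicity:

```python
import numpy as np

rng = np.random.default_rng(3)

n, mu = 2, 6
m, sigma = np.zeros(n), 0.5
x_best = rng.normal(scale=sigma, size=(mu, n))   # stand-in for the mu winners
w = np.full(mu, 1 / mu)                          # equal weights for simplicity

# Standardized steps y_i = (x_i - m) / sigma
y = (x_best - m) / sigma

# P_mu = sum_i w_i * y_i y_i^T  (weighted sum of outer products)
P_mu = sum(w_i * np.outer(y_i, y_i) for w_i, y_i in zip(w, y))

# Equivalent vectorized form
P_mu_vec = (y * w[:, None]).T @ y
print(np.allclose(P_mu, P_mu_vec))               # True
```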

Active Covariance Update (The "Avoidance" Strategy)¶

In modern CMA-ES, we use Active Selection. While the mean $\mathbf{m}$ only follows the best $\mu$ individuals, the covariance matrix $\mathbf{C}$ also learns from the worst.

  • Positive Weights ($w_i > 0$): Assigned to the top $\mu \approx \lambda/2$ individuals. These directions are "rewarded" by expanding the matrix toward them.
  • Negative Weights ($w_i < 0$): Assigned to the bottom individuals. These directions are "punished" by shrinking the matrix along those axes, "squashing" the ellipsoid to avoid known bad regions.

Weight Constraints: To maintain stability, the positive weights sum to one ($\sum_{w_i > 0} w_i = 1$). When including negative weights, the total sum $\sum_{i=1}^{\lambda} w_i$ is typically close to zero, ensuring the update primarily changes the shape of the distribution rather than just its overall volume.

The Learning Rate $c_{\mu}$ and the Time Horizon¶

The learning rate $c_{\mu}$ acts as a relative weight determining how much the current population's "snapshot" ($\mathbf{P}_{\mu}$) influences the update compared to the long-term evolution path ($\mathbf{P}_1$) and the existing memory ($\mathbf{C}^{(g)}$).

The 63% Rule and Backward Time Horizon¶

A critical intuition is that $1/c_{\mu}$ represents the Backward Time Horizon (or time constant) of the algorithm. Because the influence of old data fades exponentially, the "age" of the information follows a specific decay pattern.

The Derivation:

The update for the covariance matrix is defined by the learning rate $c_{\mu}$, which determines how much weight is given to new information versus the existing memory.

The current state of the covariance matrix $\mathbf{C}^{(g+1)}$ is a weighted sum of all past innovations $\mathbf{P}_{\mu}$. Expanding the update equation $\mathbf{C}^{(g+1)} = (1 - c_{\mu})\mathbf{C}^{(g)} + c_{\mu}\mathbf{P}_{\mu}^{(g)}$ backwards to the beginning of the search ($g=0$) gives:

$$\mathbf{C}^{(g+1)} = (1-c_{\mu})^{g+1}\mathbf{C}^{(0)} + c_{\mu} \sum_{i=0}^{g} (1-c_{\mu})^i \mathbf{P}_{\mu}^{(g-i)}$$

This shows that the influence of an innovation from $i$ generations ago decays exponentially at a rate of $(1-c_{\mu})^i$. To find the "Backward Time Horizon," we look for the number of generations $\Delta g$ required for the sum of the weights to reach a certain fraction $\alpha$ of the total information. We define $\alpha$ as:

$$\alpha = \sum_{i=0}^{\Delta g-1} c_{\mu}(1-c_{\mu})^i$$

The sum is a finite geometric series. Applying the formula for the sum of a geometric series results in: $$\alpha = 1 - (1-c_{\mu})^{\Delta g}$$

We choose $\alpha \approx 63\% \approx 1 - 1/e$, i.e. the point where approximately $63\%$ of the total weight has been accumulated. This gives $$(1-c_{\mu})^{\Delta g} \approx e^{-1}$$

and therefore $$ \Delta g = \frac{-1}{\ln(1-c_{\mu})} $$

For small learning rates ($c_\mu \ll 1$), we use the first-order Taylor expansion of the natural logarithm, $\ln(1-x) \approx -x$.

$$\Delta g \approx \frac{1}{c_{\mu}}$$

Intuition: The "Time Constant"

This approximation confirms that the reciprocal of the learning rate is the time constant of the system. Just as in physics or electronics (RC circuits), the value $1/c$ represents the point where the system has completed roughly 63.2% of its transition to a new state. In CMA-ES, this means that after $1/c_\mu$ generations, the "old" matrix has been 63% replaced by "new" geometric information.
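The 63% rule is easy to check numerically; the learning rate below is an illustrative value:

```python
import numpy as np

c = 0.01                      # a small learning rate c_mu
dg = round(1 / c)             # predicted backward time horizon

# Fraction of total weight contributed by the last dg generations:
# alpha = sum_{i=0}^{dg-1} c (1 - c)^i = 1 - (1 - c)^dg
alpha = 1 - (1 - c) ** dg
print(alpha)                  # ~0.634, close to 1 - 1/e ~ 0.632
```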

What is a good setting for $c_{\mu}$?¶

To determine the optimal learning rate, we compare the information we gather against the complexity of the model we are trying to build:

  • Information Supply: In each generation, we gather approximately $\mu_{eff}$ independent pieces of information from the successful offspring.
  • Model Complexity: The covariance matrix $\mathbf{C}$ is a symmetric $n \times n$ matrix. To define its shape and orientation, we must estimate roughly $n^2/2$ independent parameters, meaning we need $O(n^2)$ total information.
  • The Time Window: As derived previously, the number of generations that contribute significantly to the current state of $\mathbf{C}$ is the time horizon $\Delta g \approx \frac{1}{c_{\mu}}$.

Therefore, to ensure the matrix is supported by enough data, we set: $$\text{"Information per generation"} \times \text{"Number of generations in memory"} \approx \text{Needed information}$$ $$\mu_{eff} \cdot \frac{1}{c_{\mu}} \approx n^2$$ Rearranging for the learning rate gives us the heuristic: $$c_{\mu} \approx \frac{\mu_{eff}}{n^2}$$ The setting of $c_{\mu} \approx \mu_{eff} / n^2$ ensures that we only update the matrix as fast as we can reliably "learn" the landscape.

Discussion: The Balance of Learning

1. Avoiding Rank Deficiency
If we were to set $c_{\mu}$ much higher than $\mu_{eff}/n^2$, we would be trying to solve a system with $n^2$ unknowns using fewer than $n^2$ data points. Mathematically, this leads to a rank-deficient covariance matrix. Geometrically, this means the search ellipsoid would "collapse" or become perfectly flat in certain directions, prematurely excluding parts of the search space where the solution might lie.

2. Scaling with Population Size
This relationship reveals one of the greatest strengths of CMA-ES: Parallelization. If we increase the population size $\lambda$ (and thus $\mu_{eff}$), we can increase $c_{\mu}$ proportionally. This shortens the time horizon $\Delta g$, allowing the algorithm to adapt the coordinate system much faster. In high-dimensional spaces ($n > 100$), a large population is often the only way to make the covariance adaptation happen in a reasonable number of generations.

3. The $n^2$ "Curse"
The $n^2$ in the denominator tells us that the difficulty of learning the landscape's geometry grows quadratically with the dimension. While the Mean Update can react to a gradient almost immediately, the Covariance Update is naturally more conservative, requiring many observations to "verify" that a change in the search shape is statistically justified.

The Rank-One Update¶

While the Rank-$\mu$ update captures the "spread" of the current population, it has a significant limitation: it is "sign-blind." Because it uses the outer product of mutation vectors, a step in direction $\mathbf{y}$ and a step in the opposite direction $-\mathbf{y}$ contribute identically to the matrix.

The Rank-One Update solves this by observing the Evolution Path—the correlated sequence of moves the distribution mean has taken over many generations.

The Evolution Path ($\mathbf{p}_c$)¶

Instead of looking only at the current generation, the algorithm maintains a strategy state called the evolution path $\mathbf{p}_c^{(g+1)}$. This is an exponentially weighted moving average of the steps taken by the mean $\mathbf{m}$:

$$\mathbf{p}_c^{(g+1)} = (1 - c_c) \mathbf{p}_c^{(g)} + \sqrt{c_c (2 - c_c) \mu_{eff}} \frac{\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}}{\sigma^{(g)}}$$

  • $c_c$: The cumulation learning rate. It determines the memory of the path; the backward time horizon is approximately $1/c_c$.
  • Directional Memory: If the mean moves consistently in one direction, $\mathbf{p}_c$ accumulates and "stretches" in that direction. If the mean zig-zags, the vectors cancel each other out, keeping the path small.
  • Normalization Factor $\sqrt{c_c (2 - c_c) \mu_{eff}}$: This constant ensures that the path stays "in sync" with the current search distribution.
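The cumulation step can be sketched as a small function; the helper name and the toy drift scenario below are illustrative assumptions, not part of the text:

```python
import numpy as np

def update_evolution_path(p_c, m_new, m_old, sigma, c_c, mu_eff):
    """One cumulation step for the evolution path p_c."""
    step = (m_new - m_old) / sigma                 # normalized mean shift
    norm = np.sqrt(c_c * (2 - c_c) * mu_eff)       # variance-preserving factor
    return (1 - c_c) * p_c + norm * step

# A mean drifting steadily in one direction makes the path grow long
p_c = np.zeros(2)
m = np.zeros(2)
for g in range(50):
    m_next = m + np.array([0.1, 0.0])              # consistent moves along x
    p_c = update_evolution_path(p_c, m_next, m, sigma=0.1, c_c=0.1, mu_eff=4.0)
    m = m_next
print(p_c)   # large x component, zero y component
```

With zig-zagging moves the contributions would cancel and the path would stay short, which is exactly the signal the adaptation mechanisms exploit.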

In-Depth: The Normalization Factor $\sqrt{c_c (2 - c_c) \mu_{eff}}$¶

The normalization factor is essential for the stability of the Rank-One update. Its purpose is to ensure that under random selection (where there is no clear trend), the evolution path $\mathbf{p}_c$ follows the same distribution as a single mutation step: $\mathbf{p}_c \sim \mathcal{N}(\mathbf{0}, \mathbf{C})$.

Derivation from Variance Equilibrium

To keep the algorithm stable, the "energy" (variance) we add in each step must equal the "energy" lost through the decay $(1-c_c)$.

Let $\mathbf{z} = \frac{\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}}{\sigma^{(g)}}$ be the latest step. If selection is random, $\mathbf{z}$ is the mean of $\mu$ samples from $\mathcal{N}(\mathbf{0}, \mathbf{C})$, so its variance is $Var(\mathbf{z}) = \frac{1}{\mu_{eff}} \mathbf{C}$.

We want the variance of the path to be stationary ($Var(\mathbf{p}_{new}) = Var(\mathbf{p}_{old}) = \mathbf{C}$).

Using the update rule $\mathbf{p}_{new} = (1-c_c)\mathbf{p}_{old} + \alpha \mathbf{z}$:

$$Var(\mathbf{p}) = (1-c_c)^2 Var(\mathbf{p}) + \alpha^2 Var(\mathbf{z})$$

$$\mathbf{C} = (1-c_c)^2 \mathbf{C} + \frac{\alpha^2}{\mu_{eff}} \mathbf{C}$$

Dividing by $\mathbf{C}$ and solving for $\alpha$:

$$1 = (1 - 2c_c + c_c^2) + \frac{\alpha^2}{\mu_{eff}}$$

$$2c_c - c_c^2 = \frac{\alpha^2}{\mu_{eff}}$$

$$\alpha = \sqrt{c_c(2-c_c)\mu_{eff}}$$

The factor consists of two distinct parts that serve different scaling needs:

  1. The Cumulation Scaling $\sqrt{c_c(2-c_c)}$: This acts as a variance normalizer. It ensures that the "steady-state" length of the evolution path remains constant regardless of the learning rate. As $c_c$ gets smaller, the path integrates more generations of data; this factor scales down each individual step so that the cumulative sum does not grow or shrink, but remains at a unit variance ($\approx 1$).
  2. The Selection Scaling $\sqrt{\mu_{eff}}$: When we move the mean $\mathbf{m}$ using many individuals ($\mu$), the variance of that move decreases (the law of large numbers). We multiply by $\sqrt{\mu_{eff}}$ to "inflate" the vector back to the scale of a single individual mutation.

If we didn't normalize, the length of $\mathbf{p}_c$ would depend on the population size and the learning rate. When we eventually perform the Rank-One update ($\mathbf{C} \approx \mathbf{p}_c \mathbf{p}_c^T$), a non-normalized path would cause the volume of the search ellipsoid to explode or collapse. By using this factor, the Rank-One update only changes the shape and orientation of the covariance matrix, leaving the global scale to be handled exclusively by the step-size $\sigma$.
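The variance-equilibrium algebra behind the normalization factor can be verified numerically over a few illustrative parameter settings:

```python
import numpy as np

# Stationarity check: decayed variance plus injected variance equals one,
# i.e. (1 - c_c)^2 + alpha^2 / mu_eff == 1 with alpha = sqrt(c_c (2 - c_c) mu_eff)
for c_c in (0.05, 0.2, 0.5):
    for mu_eff in (1.0, 4.0, 25.0):
        alpha = np.sqrt(c_c * (2 - c_c) * mu_eff)
        total = (1 - c_c) ** 2 + alpha ** 2 / mu_eff
        assert np.isclose(total, 1.0)
print("variance equilibrium holds")
```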

The Rank-One Innovation Matrix ($\mathbf{P}_1$)¶

The information from this path is then injected into the covariance matrix as a rank-one update:

$$\mathbf{P}_1 = \mathbf{p}_c^{(g+1)} \left( \mathbf{p}_c^{(g+1)} \right)^T$$

Mathematically, the outer product of a single vector $\mathbf{p}_c$ creates a matrix with a rank of exactly 1. The Rank-One update "stretches" the search ellipsoid specifically along the axis of the evolution path.

Summary: The Final Covariance Update¶

The complete CMA-ES update for the covariance matrix integrates the previous state of the search with both immediate population feedback and long-term directional trends.

Substituting our definitions for $\mathbf{P}_1$ and $\mathbf{P}_{\mu}$ back into the primary equation, we arrive at the final update rule:

$$\mathbf{C}^{(g+1)} = \underbrace{(1 - c_1 - c_\mu) \mathbf{C}^{(g)}}_{\text{Decay (Memory)}} + \underbrace{c_1 \mathbf{p}_c^{(g+1)} (\mathbf{p}_c^{(g+1)})^T}_{\text{Rank-One Update}} + \underbrace{c_\mu \sum_{i=1}^{\lambda} w_i \mathbf{y}_{i:\lambda} \mathbf{y}_{i:\lambda}^T}_{\text{Rank-}\mu \text{ Update}}$$

Why this dual approach works:

  1. Learning the Canyon (Rank-One): If you are in a long, narrow valley, your mean will move in the same direction for many generations. The evolution path $\mathbf{p}_c$ will grow long, and the Rank-One update will aggressively stretch $\mathbf{C}$ into a "needle" shape along that path.
  2. Learning the Cross-Section (Rank-$\mu$): Even if you aren't moving much (near the optimum), the spread of your $\mu$ winners tells the algorithm which directions are "safer" than others. The Rank-$\mu$ update uses this population spread to orient the ellipsoid's width.
    Note on the Decay Term:
    You may notice that in some literature (e.g., [Hansen]), the decay term is written more generally as $(1 - c_1 - c_\mu \sum w_i)$.
    • In Theory: If we include the negative weights in the sum ($\sum_{i=1}^{\lambda} w_i$), we can achieve a state of "zero-decay" (stationarity) where the matrix volume remains perfectly constant.
    • In Practice: Most robust implementations use $(1 - c_1 - c_\mu)$, which assumes $\sum w_i^+ = 1$ and ignores negative weights in the decay. This ensures the "forgetting factor" is a stable constant, preventing the search space from exploding if negative "punishments" become too aggressive.
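The combined update can be sketched as follows; negative (active) weights are omitted, and the function name and toy inputs are assumptions for illustration:

```python
import numpy as np

def update_C(C, p_c, y, w, c1, cmu):
    """Combined rank-one + rank-mu covariance update (positive weights only)."""
    rank_one = np.outer(p_c, p_c)              # P_1 = p_c p_c^T
    rank_mu = (y * w[:, None]).T @ y           # P_mu = sum_i w_i y_i y_i^T
    return (1 - c1 - cmu) * C + c1 * rank_one + cmu * rank_mu

rng = np.random.default_rng(4)
n = 2
C = np.eye(n)
p_c = np.array([1.0, 1.0])                     # path pointing along the diagonal
y = rng.multivariate_normal(np.zeros(n), C, size=5)
w = np.full(5, 0.2)

C_new = update_C(C, p_c, y, w, c1=0.2, cmu=0.1)
# The rank-one term stretches C toward the diagonal direction of p_c;
# with these settings C_new stays symmetric positive definite.
print(np.allclose(C_new, C_new.T))             # True
```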

By balancing these two updates, CMA-ES achieves its "State-of-the-Art" status: it is as efficient as a second-order (Newton-type) optimizer but requires zero derivatives and remains robust against noisy or non-smooth landscapes.

Cumulative Step-Size Adaptation (CSA)¶

Motivation: Why a Separate Step-Size Control?¶

While the Covariance Matrix ($\mathbf{C}$) is highly effective at adapting the shape and orientation of the search distribution, it is fundamentally ill-equipped to manage the overall scale (the step-size $\sigma$). There are two primary reasons why we must decouple the step-size control from the covariance update:

1. The Mathematical Mismatch (Optimal Scale)¶

The mathematical logic required to find the optimal "length" of a step is different from the logic used to find the optimal "direction."

  • The Population Effect: On fundamental landscapes like the Sphere function ($f(x) = \sum x_i^2$), the optimal step-size is heavily dependent on the effective population size ($\mu_{eff}$).
  • The Constraint: The covariance update rule does not scale with $\mu_{eff}$ in a way that matches these theoretical optima. If we relied solely on $\mathbf{C}$ to manage the volume of the search, the algorithm would be unable to exploit the efficiency gains offered by larger populations.

2. The Speed of Adaptation (The $n$ vs. $n^2$ Problem)¶

The most practical reason for a separate control is latency. The covariance matrix is a "slow learner" by design, while the step-size needs to be a "fast responder."

  • The $O(n^2)$ Bottleneck: As derived previously, the time horizon for the covariance matrix is proportional to $n^2$. This conservative rate is necessary to reliably estimate the $n^2/2$ degrees of freedom in the matrix without it collapsing.
  • The $O(n)$ Requirement: To achieve optimal progress—for instance, to "zoom in" on a target at a competitive rate—the overall step length must be able to change significantly within only $n$ function evaluations.

Conclusion: Separation of Concerns¶

The decoupling of step-size control from covariance adaptation is a functional necessity based on their differing mathematical roles. While the covariance matrix $\mathbf{C}$ models the second-order global topology, the step-size $\sigma$ manages the rate of convergence. To address this, CMA-ES introduces Cumulative Step-Size Adaptation (CSA). By isolating the scalar $\sigma$, the algorithm can react to evidence of overshooting or undershooting the optimum on a much shorter time horizon ($O(n)$). This separation ensures that the search remains agile even when the coordinate-system adaptation is restricted by a conservative learning rate ($O(n^2)$).

The Intuition: Path Length¶

The core idea behind CSA is to analyze the "directionality" of the search path.

  • Sprinting (Under-sampling): If the algorithm takes several steps in the same direction, the evolution path becomes very long. This suggests the step-size is too small; we are "crawling" when we should be "sprinting." We should increase $\sigma$.
  • Zig-Zagging (Over-sampling): If the algorithm's steps constantly reverse direction or "zig-zag," the evolution path becomes very short (they cancel each other out). This suggests the step-size is too large, causing the algorithm to overshoot the optimum. We should decrease $\sigma$.
  • Optimal Search: If the steps are roughly perpendicular (uncorrelated), the path length matches what we would expect from a random walk. We keep $\sigma$ the same.
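The three regimes above can be made concrete with a small numerical sketch. The helper below is a simplified, hypothetical accumulator (it drops the $\mu_{eff}$ factor and any $\mathbf{C}^{-1/2}$ whitening) that smooths a sequence of steps and reports the resulting path length; the step sequences are scaled to $\sqrt{n}$, the typical length of a standard normal draw, so the three cases are comparable.

```python
import numpy as np

def path_length(steps, c_sigma=0.3):
    # Simplified evolution path: exponential smoothing of the raw steps.
    # (Illustrative only: the mu_eff factor and C^{-1/2} whitening are omitted.)
    p = np.zeros(steps.shape[1])
    for s in steps:
        p = (1 - c_sigma) * p + np.sqrt(c_sigma * (2 - c_sigma)) * s
    return np.linalg.norm(p)

rng = np.random.default_rng(0)
n, T = 10, 200
d = rng.standard_normal(n)
d *= np.sqrt(n) / np.linalg.norm(d)   # step length ~ typical ||N(0,I)|| draw

sprint = np.tile(d, (T, 1))                            # same direction each step
zigzag = np.array([(-1) ** t * d for t in range(T)])   # alternating direction
random_walk = rng.standard_normal((T, n))              # uncorrelated steps

print(path_length(sprint))       # long path  -> increase sigma
print(path_length(zigzag))       # short path -> decrease sigma
print(path_length(random_walk))  # near sqrt(n) -> keep sigma
```

The sprinting sequence produces a path well above $\sqrt{n} \approx 3.16$, the zig-zag sequence one well below it, and the uncorrelated walk lands near the baseline.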

The Conjugate Evolution Path ($p_\sigma$)¶

To measure this, we maintain a second evolution path, $\mathbf{p}_\sigma$, specifically for the step-size. It is similar to the path used for the covariance matrix, but it is "scaled" to remove the influence of the current shape $\mathbf{C}$:

$$\mathbf{p}_\sigma^{(g+1)} = (1 - c_\sigma) \mathbf{p}_\sigma^{(g)} + \sqrt{c_\sigma (2 - c_\sigma) \mu_{eff}} \underbrace{\mathbf{C}^{(g)-1/2} \frac{\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}}{\sigma^{(g)}}}_{\text{Isotropic Step}}$$

By applying the transformation $\mathbf{C}^{-1/2} = \mathbf{Q} \mathbf{\Lambda}^{-1/2} \mathbf{Q}^T$, we transform the elliptical search space back into a sphere. The steps of the transformation are:

  • $\mathbf{Q}^T$: Rotates the coordinate system so that the axes align with the principal components (eigenvectors) of the current distribution.
  • $\mathbf{\Lambda}^{-1/2}$: Scales the axes. It "shrinks" the directions where the variance was high and "stretches" the directions where variance was low, effectively normalizing all axes to a length of 1.
  • $\mathbf{Q}$: Rotates the normalized space back into the original coordinate frame.

This ensures that a step of length $\sigma$ has the same weight in the path $\mathbf{p}_\sigma$ regardless of whether it was taken along a "long" or "short" axis of the covariance ellipsoid. In other words, by "whitening" the data with $\mathbf{C}^{-1/2}$ for $\mathbf{p}_\sigma$, CMA-ES ensures that the step-size doesn't get confused by the shape of the ellipsoid. It sees the world as a sphere, allowing it to judge purely whether the steps are "zig-zagging" or "sprinting" in a coordinate-invariant way.

The derivation for the normalizing factor $\sqrt{c_\sigma (2 - c_\sigma) \mu_{eff}}$ is analogous to that of the normalizing factor of the evolution path (see above).
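As a quick numerical illustration of the whitening transform $\mathbf{C}^{-1/2} = \mathbf{Q} \mathbf{\Lambda}^{-1/2} \mathbf{Q}^T$ (the example matrix is chosen arbitrarily): steps drawn from $\mathcal{N}(\mathbf{0}, \mathbf{C})$ become approximately spherical after multiplication by $\mathbf{C}^{-1/2}$.

```python
import numpy as np

# Example covariance of an elongated, rotated ellipsoid
C = np.array([[4.0, 1.5],
              [1.5, 1.0]])

eigvals, Q = np.linalg.eigh(C)                 # C = Q diag(eigvals) Q^T
C_inv_sqrt = Q @ np.diag(eigvals ** -0.5) @ Q.T

# Whitening: a step drawn from N(0, C) becomes ~N(0, I) after the transform.
rng = np.random.default_rng(1)
steps = rng.multivariate_normal(np.zeros(2), C, size=20000)
whitened = steps @ C_inv_sqrt.T
print(np.cov(whitened.T))                      # close to the identity matrix
```

This is exactly why $\mathbf{p}_\sigma$ can judge "zig-zagging" vs. "sprinting" independently of the ellipsoid's shape.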

The Update Rule for $\sigma$¶

The update rule is derived by comparing the actual length of the evolution path to the expected length of a path generated by purely random, uncorrelated steps.

The Reference Value
In a purely random search (a "random walk"), each step is independent. The expected length of the vector $\mathbf{p}_\sigma$ under these conditions is the expected norm of a standard normal distribution, denoted as $\mathbb{E}\|\mathcal{N}(\mathbf{0}, \mathbf{I})\|$.

Note: $\mathbb{E}\|\mathcal{N}(\mathbf{0}, \mathbf{I})\| = \sqrt{n}\,\bigl(1 + \mathcal{O}(1/n)\bigr)$, i.e. the expected length of a random vector is approximately $\sqrt{n}$. This serves as the "neutral" baseline: if your path is significantly longer than $\sqrt{n}$, you are moving in a consistent direction.
In Practice: To ensure accuracy in lower dimensions, CMA-ES uses the approximation $\mathbb{E}\|\mathcal{N}(\mathbf{0}, \mathbf{I})\| \approx \sqrt{n}(1 - \frac{1}{4n} + \frac{1}{21n^2})$.
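A small sanity check of this approximation against the exact chi-distribution mean, $\mathbb{E}\|\mathcal{N}(\mathbf{0}, \mathbf{I})\| = \sqrt{2}\,\Gamma\bigl(\tfrac{n+1}{2}\bigr)/\Gamma\bigl(\tfrac{n}{2}\bigr)$:

```python
import math

# Compare the exact expected norm of an n-dimensional standard normal
# (the mean of the chi distribution) with the CMA-ES approximation.
rows = []
for n in (2, 5, 10, 50):
    exact = math.sqrt(2) * math.gamma((n + 1) / 2) / math.gamma(n / 2)
    approx = math.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))
    rows.append((n, exact, approx))
    print(n, exact, approx)
```

Even at $n = 2$ the two values agree to about three decimal places, so the closed-form approximation is safe across all practical dimensions.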

The Comparison

We look at the ratio between the realized path length and the expected path length:

$$\frac{\|\mathbf{p}_\sigma^{(g+1)}\|}{\mathbb{E}\|\mathcal{N}(\mathbf{0}, \mathbf{I})\|}$$

  • If this ratio is $> 1$, the steps are correlated (sprinting), and we must increase $\sigma$.
  • If this ratio is $< 1$, the steps are anti-correlated (zig-zagging), and we must decrease $\sigma$.

The Exponential Update (Derivation)

To ensure $\sigma$ stays positive and scales multiplicatively, we use the natural logarithm of this ratio. We also introduce a damping parameter $d_\sigma$ and a learning rate $c_\sigma$ to control the update speed:

$$\ln(\sigma^{(g+1)}) = \ln(\sigma^{(g)}) + \frac{c_\sigma}{d_\sigma} \left( \frac{\|\mathbf{p}_\sigma^{(g+1)}\|}{\mathbb{E}\|\mathcal{N}(\mathbf{0}, \mathbf{I})\|} - 1 \right)$$

Applying the exponential function to both sides yields the final update rule:

$$\sigma^{(g+1)} = \sigma^{(g)} \cdot \exp \left( \frac{c_\sigma}{d_\sigma} \left( \frac{\|\mathbf{p}_\sigma^{(g+1)}\|}{\mathbb{E}\|\mathcal{N}(\mathbf{0}, \mathbf{I})\|} - 1 \right) \right)$$

  • $d_\sigma$ (Damping): Usually close to 1. It ensures that even if a single generation has a very long path, the step-size does not change by an order of magnitude instantly, preventing numerical instability.
  • If the path is longer than expected: The term in the parentheses is positive, the exponential is $>1$, and $\sigma$ increases.
  • If the path is shorter than expected: The term is negative, the exponential is $<1$, and $\sigma$ decreases.
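The path update and the exponential $\sigma$ update can be combined into one illustrative function (the variable names and parameter values below are our own choices for the sketch, not fixed by the tutorial):

```python
import numpy as np

def csa_update(sigma, p_sigma, mean_new, mean_old, C_inv_sqrt,
               mu_eff, c_sigma, d_sigma, n):
    # Whitened, sigma-normalized mean shift (the "isotropic step")
    y = C_inv_sqrt @ (mean_new - mean_old) / sigma
    p_sigma = (1 - c_sigma) * p_sigma \
        + np.sqrt(c_sigma * (2 - c_sigma) * mu_eff) * y
    # Expected norm of N(0, I), with the low-dimension correction
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))
    sigma_new = sigma * np.exp((c_sigma / d_sigma)
                               * (np.linalg.norm(p_sigma) / chi_n - 1))
    return sigma_new, p_sigma

# Usage: a large, consistent mean shift lengthens the path and grows sigma;
# a stalled mean (zero shift) shrinks it.
s_up, _ = csa_update(1.0, np.zeros(2), np.array([3.0, 0.0]), np.zeros(2),
                     np.eye(2), mu_eff=2.0, c_sigma=0.3, d_sigma=1.0, n=2)
s_dn, _ = csa_update(1.0, np.zeros(2), np.zeros(2), np.zeros(2),
                     np.eye(2), mu_eff=2.0, c_sigma=0.3, d_sigma=1.0, n=2)
print(s_up, s_dn)   # s_up > 1 > s_dn
```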

The Result: Self-Induced Conjugacy¶

A remarkable property of CSA is that by adapting $\sigma$ such that the length of the evolution path $\|\mathbf{p}_\sigma\|$ remains approximately equal to the expected length $\mathbb{E}\|\mathcal{N}(\mathbf{0}, \mathbf{I})\|$, the algorithm forces successive search steps to become $\mathbf{C}^{-1}$-conjugate.

The Mathematical Proof¶

If we assume the step-size is at an equilibrium, then $\|\mathbf{p}_\sigma^{(g+1)}\|^2 \approx \|\mathbf{p}_\sigma^{(g)}\|^2$. By expanding the recursive definition of $\mathbf{p}_\sigma$, we find that the "old" path and the "new" isotropic step must be orthogonal:

$$(\mathbf{p}_\sigma^{(g)})^T \left( \mathbf{C}^{(g)-1/2} \frac{\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}}{\sigma^{(g)}} \right) \approx 0$$

If we substitute the previous step ($\mathbf{m}^{(g)} - \mathbf{m}^{(g-1)}$) into the path and simplify, we arrive at:

$$(\mathbf{m}^{(g)} - \mathbf{m}^{(g-1)})^T \underbrace{\mathbf{C}^{(g)-1}}_{\text{Metric}} (\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}) \approx 0$$

Why this matters¶

This identity proves that CMA-ES is performing a second-order optimization.

  1. The Inverse Hessian: Over time, $\mathbf{C}$ adapts to approximate the inverse Hessian matrix ($\mathbf{H}^{-1}$).
  2. Efficiency: When $\mathbf{C} \approx \mathbf{H}^{-1}$, then $\mathbf{C}^{-1} \approx \mathbf{H}$. The formula above then matches the definition of H-conjugate steps.

In plain English: The Step-Size Control ensures that your next move doesn't "undo" the work of your previous move. It ensures that the algorithm heads directly toward the optimum along the "canyon floor" rather than bouncing off the walls, achieving the efficiency of a Quasi-Newton method without ever calculating a derivative.

Recap: The Two Paths of CMA-ES¶

CMA-ES uses two separate "memories" (evolution paths) to manage different aspects of the search. While they look similar mathematically, their goals are distinct.

The Synergy: How They Work Together

  • The Orientation ($\mathbf{p}_c$ and $\mathbf{C}$): $\mathbf{p}_c$ tracks the long-term trend of the mean. If the mean moves consistently in a specific direction, the Rank-one update stretches the ellipsoid into a needle shape along that axis. This builds the "coordinate system" of the landscape.
  • The Scaling ($\mathbf{p}_\sigma$ and $\sigma$): $\mathbf{p}_\sigma$ monitors the efficiency of those steps. By "whitening" the steps with $\mathbf{C}^{-1/2}$, it ignores the shape and looks only at the logic: are we zig-zagging or sprinting? It then scales the volume of the search accordingly.
  • The Result (Conjugacy): Together, they ensure that the search is Conjugate. The algorithm doesn't just find the optimum; it finds the most efficient path to it by learning the curvature of the function and ensuring each step is mathematically orthogonal to the last in the Hessian's metric.

Summary of the "Complete" CMA-ES Loop¶

We have now covered the entire state-of-the-art process:

  1. Selection & Recombination: Move the mean $\mathbf{m}$ to the weighted center of the winners.
  2. Rank-$\mu$ Update: Reshape $\mathbf{C}$ to match the current population spread.
  3. Rank-1 Update: Reshape $\mathbf{C}$ to align with the long-term momentum ($p_c$).
  4. CSA: Adjust the speed $\sigma$ based on the path length ($p_\sigma$).
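The four steps above can be sketched as one compact loop. This is a deliberately minimal $(\mu/\mu_w, \lambda)$-CMA-ES with one common set of default parameters; it omits refinements such as the Heaviside stall indicator for $p_c$ and negative recombination weights, so it is a teaching sketch rather than a reference implementation.

```python
import numpy as np

def cma_es(f, m, sigma, n_gens=300, seed=0):
    """Minimal CMA-ES sketch: selection/recombination, rank-mu and
    rank-one covariance updates, and CSA step-size control."""
    rng = np.random.default_rng(seed)
    n = len(m)
    lam = 4 + int(3 * np.log(n))                 # default population size
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                 # positive recombination weights
    mu_eff = 1.0 / np.sum(w ** 2)

    # Internal parameters (one common set of defaults)
    c_sigma = (mu_eff + 2) / (n + mu_eff + 5)
    d_sigma = 1 + 2 * max(0, np.sqrt((mu_eff - 1) / (n + 1)) - 1) + c_sigma
    c_c = (4 + mu_eff / n) / (n + 4 + 2 * mu_eff / n)
    c_1 = 2 / ((n + 1.3) ** 2 + mu_eff)
    c_mu = min(1 - c_1, 2 * (mu_eff - 2 + 1 / mu_eff) / ((n + 2) ** 2 + mu_eff))
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))

    C, p_sigma, p_c = np.eye(n), np.zeros(n), np.zeros(n)
    for g in range(n_gens):
        # 0. Sample lambda offspring from N(m, sigma^2 C)
        eigvals, Q = np.linalg.eigh(C)
        sqrt_C = Q @ np.diag(np.sqrt(eigvals)) @ Q.T
        inv_sqrt_C = Q @ np.diag(eigvals ** -0.5) @ Q.T
        x = m + sigma * rng.standard_normal((lam, n)) @ sqrt_C.T
        order = np.argsort([f(xi) for xi in x])[:mu]

        # 1. Selection & recombination
        y = (x[order] - m) / sigma               # selected steps
        y_w = w @ y
        m = m + sigma * y_w

        # 4. CSA: conjugate path and step-size
        p_sigma = (1 - c_sigma) * p_sigma \
            + np.sqrt(c_sigma * (2 - c_sigma) * mu_eff) * inv_sqrt_C @ y_w
        sigma *= np.exp((c_sigma / d_sigma)
                        * (np.linalg.norm(p_sigma) / chi_n - 1))

        # 2./3. Rank-mu and rank-one covariance updates
        p_c = (1 - c_c) * p_c + np.sqrt(c_c * (2 - c_c) * mu_eff) * y_w
        rank_mu = sum(wi * np.outer(yi, yi) for wi, yi in zip(w, y))
        C = (1 - c_1 - c_mu) * C + c_1 * np.outer(p_c, p_c) + c_mu * rank_mu
    return m, sigma

# Usage: minimize an ill-conditioned quadratic
f = lambda x: x[0] ** 2 + 50 * x[1] ** 2
m_final, _ = cma_es(f, m=np.array([3.0, 2.0]), sigma=1.0)
print(m_final)   # close to the optimum at (0, 0)
```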

Parameter¶

Strategy Parameters (User-Defined)¶

These must be set by the user based on the objective function and search space.

  • The Initial Mean: $\mathbf{m}^{(0)}$ (Location)
  • The Initial Step-Size: $\sigma^{(0)}$ (Scale)
  • The Population Size: $\lambda$ (Breadth/Robustness)

Internal Parameters (Default/Automated)¶

These are calculated automatically by the algorithm based on the dimension $n$ and the population size $\lambda$. Changing these is rare and typically only for specialized research.

Covariance Learning Rates ($c_1$ and $c_\mu$)

These control the "memory" of the covariance matrix. They are scaled by $1/n^2$ because there are approximately $n^2/2$ degrees of freedom in the matrix to be estimated.

  • $c_1 \approx \frac{2}{n^2}$: The learning rate for the Rank-One update (long-term trends).
  • $c_\mu \approx \frac{\mu_{eff}}{n^2}$: The learning rate for the Rank-$\mu$ update (population spread).

Step-Size Control ($c_\sigma$ and $d_\sigma$)

These control the "agility" of the step-size $\sigma$. They operate on a much faster time horizon than the covariance matrix.

  • $c_\sigma \approx \frac{\mu_{eff} + 2}{n + \mu_{eff} + 3}$: The learning rate for the evolution path $\mathbf{p}_\sigma$. Note it is $O(1/n)$, allowing $\sigma$ to change much faster than $\mathbf{C}$.
  • $d_\sigma \approx 1$: The damping parameter. It ensures $\sigma$ changes remain stable and is typically set to 1 or slightly above.

Cumulation for Covariance ($c_c$)

  • $c_c \approx \frac{4}{n+4}$: Controls the "backward time horizon" for the evolution path $\mathbf{p}_c$.
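Plugging the order-of-magnitude formulas above into code makes the time-scale separation visible: the "memory" (inverse learning rate) of the covariance updates grows like $n^2$, while the step-size and cumulation mechanisms stay at $O(n)$.

```python
import numpy as np

# Illustrative time horizons (inverse learning rates) using the simplified
# order-of-magnitude formulas quoted above; lambda/mu use common defaults.
horizons = {}
for n in (5, 20, 100):
    lam = 4 + int(3 * np.log(n))
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()
    mu_eff = 1 / np.sum(w ** 2)
    c_1 = 2 / n ** 2                           # O(1/n^2): slow C updates
    c_sigma = (mu_eff + 2) / (n + mu_eff + 3)  # O(1/n):   fast sigma updates
    c_c = 4 / (n + 4)                          # O(1/n):   path cumulation
    horizons[n] = (1 / c_1, 1 / c_sigma, 1 / c_c)
    print(n, round(1 / c_1), round(1 / c_sigma, 1), round(1 / c_c, 1))
```

At $n = 100$, the covariance matrix remembers thousands of evaluations while $\sigma$ can turn around within tens, which is exactly the separation of concerns motivated earlier.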

Practical "Tuning" Flowchart¶

If the algorithm is failing, don't change the learning rates; change the strategy:

  1. Stuck in Local Optima? Increase $\lambda$ (Population size).
  2. Converging too slowly on a simple function? Increase $\sigma^{(0)}$.
  3. Converging to a point outside bounds? Implement a boundary handling method (like a penalty function or reflection).
  4. Premature Convergence? Increase $\lambda$ or restart with a larger $\sigma^{(0)}$ (IPOP-CMA-ES strategy).

Literature:¶

  • [Hansen] The CMA Evolution Strategy: A Tutorial, Nikolaus Hansen