Evolution Strategies (ES) are a class of stochastic, population-based optimization algorithms inspired by natural evolution. They are "black-box" optimizers, meaning they treat the objective function as a mystery: they don't require gradients or knowledge of the function’s internal structure—they simply observe the output for a given input.
Mathematically, we seek to minimize an objective function (in general a non-linear, non-convex function) $$f: \mathbb{R}^n \rightarrow \mathbb{R}$$ with respect to a real-valued parameter vector $\mathbf{x} \in \mathbb{R}^n$ (continuous domain). In practice, finding the exact global optimum is usually neither feasible (the problem can be NP-hard) nor necessary. Therefore, we want to find $\mathbf{x}$-values where $f(\mathbf{x})$ is as small as possible.
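To make "black-box" concrete, here is a toy objective (a hypothetical non-convex function, chosen purely for illustration). The optimizer only ever observes the returned value, never the formula inside:

```python
import numpy as np

def f(x):
    # From the optimizer's point of view this body is invisible:
    # it supplies an input vector and observes a single output value.
    x = np.asarray(x, dtype=float)
    return np.sum(x**2) + 10 * np.sin(x[0])  # non-linear, non-convex

value = f([1.0, 2.0])  # the only information the optimizer receives
```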
Here, we focus on the family of Estimation of Distribution Algorithms (EDAs). Unlike traditional Genetic Algorithms, which manipulate individuals directly (via crossover or bit-flips), an EDA maintains a probabilistic model of the search space. In each generation, it "estimates" where the best solutions lie and updates the model's parameters accordingly.
General Framework
Initialization: Choose $\theta^{(0)}$
The optimization then follows a repetitive four-step cycle:
1. Sample: draw $\lambda$ candidate solutions from the distribution defined by $\theta^{(g)}$.
2. Evaluate: compute $f(\mathbf{x}_i)$ for each candidate.
3. Select: keep the $\mu$ best candidates according to their fitness.
4. Update: compute $\theta^{(g+1)} = F_\theta(\theta^{(g)}, \text{selected candidates})$.
The software component solver encapsulates the algorithmic engine: it maintains a probabilistic model of the search space, proposes candidate solutions through sampling, and iteratively refines its internal parameters based on fitness feedback.
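This solver component can be sketched as a minimal ask-tell loop. The class and method names (`SimpleESSolver`, `ask`, `tell`) are illustrative choices, though the ask-tell pattern itself is common in ES libraries:

```python
import numpy as np

class SimpleESSolver:
    """Minimal solver sketch: isotropic Gaussian search model."""
    def __init__(self, m0, sigma, lam=10, mu=2):
        self.m = np.asarray(m0, dtype=float)  # mean of the distribution
        self.sigma = sigma                    # step size
        self.lam = lam                        # offspring per generation
        self.mu = mu                          # survivors per generation

    def ask(self):
        # Sample lam candidates around the current mean
        return self.m + self.sigma * np.random.standard_normal((self.lam, self.m.size))

    def tell(self, candidates, fitnesses):
        # Select the mu best and recombine them into the new mean
        best = np.argsort(fitnesses)[:self.mu]
        self.m = candidates[best].mean(axis=0)

# Usage: minimize the squared distance to a target point
target = np.array([2.0, -3.0])
solver = SimpleESSolver(m0=[-4.0, 4.0], sigma=0.5)
for _ in range(30):
    xs = solver.ask()
    solver.tell(xs, [np.sum((x - target)**2) for x in xs])
```

After a few dozen generations the mean drifts into the neighborhood of the target; without step-size adaptation (covered later) it then hovers there at a precision set by `sigma`.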
Before looking at a specific example, we must define how the algorithm chooses which solutions "win" and how it uses them to move forward. In ES literature, we use two Greek letters to describe the population:
- $\mu$ (mu): the number of parents (survivors) selected in each generation.
- $\lambda$ (lambda): the number of offspring generated in each generation.
There are two primary ways to handle the transition between generations:
- Comma selection, $(\mu, \lambda)$: the $\mu$ survivors are chosen only from the current $\lambda$ offspring; the parents are always discarded.
- Plus selection, $(\mu + \lambda)$: the survivors are chosen from the union of parents and offspring, so a good parent can survive indefinitely (elitism).
While these discrete selection methods were the standard in early Evolution Strategies, modern ES has evolved toward Recombination. Instead of treating the $\mu$ survivors as "individuals" that simply live to the next round, modern strategies treat them as statistical data points. We don't just "pick" the winners; we recombine them to update the center of our search distribution.
In a strategy where $\mu > 1$, we have multiple survivors. Instead of having each survivor start its own independent family, we merge their "knowledge." This process is called Recombination.
In Evolution Strategies, we use Intermediate Recombination. We calculate a single "Center of Mass" from all $\mu$ survivors. This new center becomes the mean $\mathbf{m}^{(g+1)}$ for the entire next generation. Mathematically: $$\mathbf{m}^{(g+1)} = \sum_{i=1}^{\mu} w_i \, \mathbf{x}_{i:\lambda}$$ Where:
- $\mathbf{x}_{i:\lambda}$ is the $i$-th best of the $\lambda$ offspring (ranked by fitness),
- $w_i$ are positive recombination weights with $\sum_{i=1}^{\mu} w_i = 1$.
How much influence should each survivor have? There are two main approaches:
| Strategy | Notation | Description |
|---|---|---|
| Equal Weights | $(\mu / \mu, \lambda)$ | All $\mu$ survivors contribute equally ($w_i = 1/\mu$). The new mean is a simple average. |
| Weighted Recombination | $(\mu / \mu_w, \lambda)$ | The best individual has the highest weight, the second-best slightly less, and so on. |
"Weighted Recombination" is explained in detail later; let's first look at an example.
Recombination is generally more robust than selecting a single winner: averaging the $\mu$ survivors smooths out sampling noise, so a single lucky (or unlucky) sample cannot drag the search center off course on its own.
Mutation is the process of adding stochastic (random) noise to the parameters to explore the search space. In Evolution Strategies, this is typically done using a Gaussian (Normal) Distribution.
In this context, the state of our search is defined by the parameter set $\theta$. In the most general case, $\theta$ is a composite of three elements: $$\theta = \{ \mathbf{m}, \sigma, \mathbf{C} \}$$
The standard mutation for an individual $i$ is the realization of the current model:
$$\mathbf{x}_i = \mathbf{m} + \sigma \cdot \mathcal{N}(\mathbf{0}, \mathbf{C})$$ Where:
- $\mathbf{m}$ is the mean, the current center of the search distribution,
- $\sigma$ is the global step size, controlling the overall mutation strength,
- $\mathcal{N}(\mathbf{0}, \mathbf{C})$ is a multivariate normal sample with zero mean and covariance matrix $\mathbf{C}$, which determines the shape of the distribution.
In many basic Evolution Strategies, we do not yet try to learn the correlations between variables. If we fix the covariance matrix as the Identity matrix ($\mathbf{C} = \mathbf{I}$), our distribution remains perfectly spherical (isotropic). In this case, our state parameters simplify to:
$$\theta = \{ \mathbf{m}, \sigma \}$$
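Sampling from this simplified model is a one-liner in NumPy. The sketch below (with arbitrary values for $\mathbf{m}$ and $\sigma$) draws a large population and checks that its empirical statistics match the model parameters:

```python
import numpy as np

rng = np.random.default_rng(42)

m = np.array([2.0, -3.0])  # distribution mean
sigma = 0.5                # global step size
lam = 1000                 # large sample, so the statistics are visible

# With C = I the mutation is isotropic: x_i = m + sigma * N(0, I)
offspring = m + sigma * rng.standard_normal((lam, m.size))

print(offspring.mean(axis=0))  # close to m
print(offspring.std(axis=0))   # close to sigma
```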
The "intelligence" of the algorithm resides in the update function $F_\theta$. It takes the results from the current generation and calculates the parameters for the next. For the simplified case where $\mathbf{C} = \mathbf{I}$, the function $F_\theta$ is composed of two distinct parts:
1. Mean update (recombination): shift $\mathbf{m}$ to the weighted average of the $\mu$ best offspring.
2. Step-size update: adapt $\sigma$ based on the success of the current generation (discussed under step-size control below).
Now, let's visualize this. Imagine we are searching for the minimum of a function $f(\mathbf{x})$. We use a $(2/2, 10)$ strategy: we create 10 offspring, and the top 2 are averaged to find the next center.
```python
import numpy as np
import matplotlib.pyplot as plt

# 1. Setup Parameters
mu_size = 2       # Number of parents to recombine
lambda_size = 10  # Total offspring per generation
sigma = 0.5
target = np.array([2.0, -3.0])

# Initialize a SINGLE mean (the center of the distribution)
m = np.array([-4.0, 4.0])

def evaluate(x):
    return np.sum((x - target)**2)

# Grid for the landscape visualization
x_grid = np.linspace(-6, 6, 100)
y_grid = np.linspace(-6, 6, 100)
X, Y = np.meshgrid(x_grid, y_grid)
Z = (X - target[0])**2 + (Y - target[1])**2

# Visualization setup
gens_to_plot = [0, 2, 9, 14]
fig, axes = plt.subplots(2, 2, figsize=(12, 11))
axes = axes.flatten()

def plot_generation(ax, gen, target, offspring, current_mean, winners, new_mean, is_first):
    ax.contourf(X, Y, Z, levels=25, cmap='viridis', alpha=0.2)
    ax.scatter(target[0], target[1], c='green', marker='*', s=200, label='Target', zorder=5)
    # Plot all offspring (the "cloud")
    ax.scatter(offspring[:, 0], offspring[:, 1], c='blue', alpha=0.4, s=20, label='Offspring')
    # Plot the current mean (the center)
    ax.scatter(current_mean[0], current_mean[1], edgecolors='blue', marker='o',
               facecolors='none', s=100, linewidth=2, label='Current Mean (m)')
    # Plot the survivors (the best mu)
    ax.scatter(winners[:, 0], winners[:, 1], c='red', marker='x', s=80, label='Best mu=2')
    # Plot the resulting new mean (the average of the survivors)
    ax.scatter(new_mean[0], new_mean[1], c='red', marker='P', s=100, label='New Mean (m_next)', zorder=6)
    ax.set_title(f"Generation {gen}")
    ax.set_aspect('equal')
    if is_first:
        ax.legend(loc='upper right', fontsize=8)

# 2. The Evolution Loop (Ask-Tell)
plot_idx = 0
for gen in range(15):
    # --- ASK ---
    # All offspring are sampled from the same mean m
    noise = np.random.standard_normal((lambda_size, 2))
    offspring = m + sigma * noise

    # --- EVALUATE ---
    costs = np.array([evaluate(child) for child in offspring])

    # --- TELL ---
    # Sort by fitness and pick the mu_size best
    best_indices = np.argsort(costs)[:mu_size]
    winners = offspring[best_indices]

    # RECOMBINATION: Calculate the new mean (Intermediate Recombination)
    # using equal weights (1/mu) for this example
    m_next = np.mean(winners, axis=0)

    # Snapshot plotting
    if gen in gens_to_plot:
        plot_generation(axes[plot_idx], gen, target, offspring, m, winners, m_next, plot_idx == 0)
        plot_idx += 1

    # UPDATE parameter for next generation
    m = m_next

plt.tight_layout()
plt.show()
```
When the solver performs the Tell & Update step, it must decide how much to trust each survivor. While equal weights are simple, they treat the "champion" and the "borderline survivor" as equals.
A common choice in modern ES (and the default in CMA-ES) is logarithmic weighting. This strategy penalizes lower-ranked individuals more harshly than a linear drop-off would. It ensures the "champion" has a significantly stronger influence on the new mean $\mathbf{m}^{(g+1)}$ than the mediocre survivors at the bottom of the top-$\mu$ list.
The raw weights $w'$ are calculated as:
$$w'_i = \ln(\mu + 0.5) - \ln(i) \quad \text{for } i = 1 \dots \mu$$
Normalization:
To ensure the weights sum to 1, we normalize the raw values:
$$w_i = \frac{w'_i}{\sum_{j=1}^{\mu} w'_j}$$
Example Weight Calculation ($\mu=5$)
In this example, we use the common logarithmic weighting scheme: $w'_i = \ln(\mu + 0.5) - \ln(i)$.
| Rank ($i$) | Formula: $\ln(5.5) - \ln(i)$ | Raw Weight ($w'_i$) | Normalized Weight ($w_i$) |
|---|---|---|---|
| 1 (Best) | $\ln(5.5) - \ln(1)$ | $1.7047$ | $0.456$ |
| 2 | $\ln(5.5) - \ln(2)$ | $1.0116$ | $0.271$ |
| 3 | $\ln(5.5) - \ln(3)$ | $0.6061$ | $0.162$ |
| 4 | $\ln(5.5) - \ln(4)$ | $0.3185$ | $0.085$ |
| 5 | $\ln(5.5) - \ln(5)$ | $0.0953$ | $0.026$ |
| Sum | | $3.7362$ | $1.000$ |
Notice that the Best individual has roughly 18 times the influence of the 5th individual. This aggressive weighting allows the search center to shift decisively toward the most promising direction while still benefiting from the "smoothing" effect of the other survivors.
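The weight calculation above is two lines of NumPy, which makes it easy to verify the normalization and the influence ratio for any $\mu$:

```python
import numpy as np

mu = 5
ranks = np.arange(1, mu + 1)

# Raw logarithmic weights: w'_i = ln(mu + 0.5) - ln(i)
w_raw = np.log(mu + 0.5) - np.log(ranks)
# Normalize so the weights sum to 1
w = w_raw / w_raw.sum()

print(np.round(w, 3))  # normalized weights for mu = 5
print(w[0] / w[-1])    # influence ratio of best vs. worst survivor
```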
One of the most critical challenges in ES is choosing the step size $\sigma$ (the mutation strength).
Step-size control allows the algorithm to automatically adapt $\sigma$ during the search. This is often called Meta-Evolution because the algorithm is evolving not just the solution, but also the parameters of the search itself.
One of the oldest and most intuitive methods for controlling step size is the 1/5th Success Rule. It is based on the observation that if more than 20% of mutations are successful (better than the parent), the search is likely in a smooth area and should move faster.
| Observation | Action | Logic |
|---|---|---|
| Success rate > 1/5 | Increase: $\sigma \leftarrow \sigma / 0.85$ (factor $\approx 1.18$) | We are finding better points easily; we should explore further and "sprint." |
| Success rate < 1/5 | Decrease: $\sigma \leftarrow \sigma \cdot 0.85$ | We are missing the "good" areas; we should search more locally to "settle." |
Why $0.85$ for increase/decrease? The constant goes back to Schwefel's analysis of simple model functions (sphere and corridor): a multiplicative factor roughly in $[0.82, 1)$ keeps the success rate near the optimal $1/5$, and $0.85$ is the conventional compromise between fast adaptation and stability.
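The rule is easiest to see in a minimal (1+1)-ES sketch: a single parent, a single mutated child per iteration, and a $\sigma$ adjustment every 20 mutations based on the measured success rate. The sphere objective and the adaptation interval of 20 are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sum(x**2)  # toy sphere objective

x = np.array([5.0, -5.0])  # parent
sigma = 1.0
successes = 0

for t in range(1, 201):
    # Mutate the parent and keep the child only if it is better (elitism)
    child = x + sigma * rng.standard_normal(2)
    if f(child) < f(x):
        x = child
        successes += 1
    # Every 20 mutations, apply the 1/5th success rule
    if t % 20 == 0:
        rate = successes / 20
        sigma = sigma / 0.85 if rate > 0.2 else sigma * 0.85
        successes = 0
```

Because the parent is only ever replaced by a better child, the best-so-far cost never increases, while $\sigma$ automatically shrinks as the search approaches the optimum.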
Modern algorithms like CMA-ES (later in this course) use Evolution Paths. Instead of just looking at the current generation, they track the direction the mean has moved over several generations.
| Scenario | Movement Pattern | Adjustment | Goal |
|---|---|---|---|
| Sprinting | Multiple generations move in the same direction. | Increase $\sigma$ | Reach the neighborhood of the optimum faster. |
| Zig-Zagging | Generations jump back and forth across a valley. | Decrease $\sigma$ | Increase precision and "settle" into the minimum. |
| Stagnation | No offspring are better than the parent. | Increase $\sigma$ | Break out of a local optimum or flat region. |
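The "sprinting" vs. "zig-zagging" distinction can be illustrated with a toy path accumulator. This is a simplified sketch of the evolution-path idea, not the actual CSA update from CMA-ES, and the smoothing factor is an arbitrary choice:

```python
import numpy as np

c = 0.3  # smoothing factor for the exponential average (arbitrary)

def path_length(steps):
    # Accumulate successive mean shifts with exponential smoothing.
    # Aligned steps reinforce the path; alternating steps cancel out.
    p = np.zeros(2)
    for s in steps:
        p = (1 - c) * p + c * np.asarray(s, dtype=float)
    return np.linalg.norm(p)

sprint = [(1.0, 0.0)] * 10               # same direction every generation
zigzag = [(1.0, 0.0), (-1.0, 0.0)] * 5   # back and forth across a valley

# A long accumulated path suggests increasing sigma ("sprinting"),
# a short one suggests decreasing it ("zig-zagging").
print(path_length(sprint), path_length(zigzag))
```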
| Principle | Basic Evolution Strategy | CMA-ES (The Next Step) |
|---|---|---|
| Mutation | Circular: Uses an Identity Matrix. All directions are explored with equal probability. | Ellipsoidal: Learns a Covariance Matrix ($C$) to "stretch" and "rotate" the search toward the goal. |
| Recombination | Weighted Mean: The center of the next generation is the average of the survivors. | Strategic Mean: The mean update is used to calculate the "Evolution Path" and the new matrix. |
| Step-Size Control | 1/5th Success Rule: Reactive adjustment based on the ratio of "better" offspring. | Cumulative Step Adaptation (CSA): Analyzes the search path to detect "sprinting" vs. "zig-zagging." |
| Selection | Rank-based $(\mu/\mu_w, \lambda)$: Only the fitness rank matters, not the absolute values. | Same Logic: Remains a robust, gradient-free black-box optimizer. |