Evolution Strategies (ES) are a class of stochastic, population-based optimization algorithms inspired by natural evolution. They are "black-box" optimizers, meaning they treat the objective function as a mystery: they don't require gradients or knowledge of the function’s internal structure—they simply observe the output for a given input.
Mathematically, we seek to minimize an objective function (in general a non-linear, non-convex function) $$f: \mathbb{R}^n \rightarrow \mathbb{R}$$ with respect to a real-valued parameter vector $\mathbf{x} \in \mathbb{R}^n$ (continuous domain). In practice, finding the exact global optimum is usually neither feasible (the problem can be NP-hard) nor necessary. Therefore, we want to find $\mathbf{x}$-values where $f(\mathbf{x})$ is as small as possible.
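To make "black-box" concrete, here is a toy objective (a hypothetical non-convex function, chosen purely for illustration). The optimizer only ever observes the returned value, never the formula inside:

```python
import numpy as np

def f(x):
    # From the optimizer's point of view this body is invisible:
    # it supplies an input vector and observes a single output value.
    x = np.asarray(x, dtype=float)
    return np.sum(x**2) + 10 * np.sin(x[0])  # non-linear, non-convex

value = f([1.0, 2.0])  # the only information the optimizer receives
```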
Here, we focus on the family of Estimation of Distribution Algorithms (EDAs). Unlike traditional Genetic Algorithms, which manipulate individuals directly (via crossover or bit-flips), an EDA maintains a probabilistic model of the search space. In each generation, it "estimates" where the best solutions lie and updates the model's parameters accordingly.
General Framework
Initialization: Choose $\theta^{(0)}$
The optimization then follows a repetitive four-step cycle:
1. Sample: draw $\lambda$ candidate solutions from the distribution defined by $\theta^{(g)}$.
2. Evaluate: compute $f(\mathbf{x}_i)$ for each candidate.
3. Select: keep the $\mu$ best candidates according to their fitness.
4. Update: compute $\theta^{(g+1)} = F_\theta(\theta^{(g)}, \text{selected candidates})$.
The software component solver encapsulates the algorithmic engine: it maintains a probabilistic model of the search space, proposes candidate solutions through sampling, and iteratively refines its internal parameters based on fitness feedback.
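This solver component can be sketched as a minimal ask-tell loop. The class and method names (`SimpleESSolver`, `ask`, `tell`) are illustrative choices, though the ask-tell pattern itself is common in ES libraries:

```python
import numpy as np

class SimpleESSolver:
    """Minimal solver sketch: isotropic Gaussian search model."""
    def __init__(self, m0, sigma, lam=10, mu=2):
        self.m = np.asarray(m0, dtype=float)  # mean of the distribution
        self.sigma = sigma                    # step size
        self.lam = lam                        # offspring per generation
        self.mu = mu                          # survivors per generation

    def ask(self):
        # Sample lam candidates around the current mean
        return self.m + self.sigma * np.random.standard_normal((self.lam, self.m.size))

    def tell(self, candidates, fitnesses):
        # Select the mu best and recombine them into the new mean
        best = np.argsort(fitnesses)[:self.mu]
        self.m = candidates[best].mean(axis=0)

# Usage: minimize the squared distance to a target point
target = np.array([2.0, -3.0])
solver = SimpleESSolver(m0=[-4.0, 4.0], sigma=0.5)
for _ in range(30):
    xs = solver.ask()
    solver.tell(xs, [np.sum((x - target)**2) for x in xs])
```

After a few dozen generations the mean drifts into the neighborhood of the target; without step-size adaptation (covered later) it then hovers there at a precision set by `sigma`.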
Before looking at a specific example, we must define how the algorithm chooses which solutions "win" and how it uses them to move forward. In ES literature, we use two Greek letters to describe the population:
- $\mu$ (mu): the number of parents (survivors) selected in each generation.
- $\lambda$ (lambda): the number of offspring generated in each generation.
There are two primary ways to handle the transition between generations:
- Comma selection, $(\mu, \lambda)$: the $\mu$ survivors are chosen only from the current $\lambda$ offspring; the parents are always discarded.
- Plus selection, $(\mu + \lambda)$: the survivors are chosen from the union of parents and offspring, so a good parent can survive indefinitely (elitism).
While these discrete selection methods were the standard in early Evolution Strategies, modern ES has evolved toward Recombination. Instead of treating the $\mu$ survivors as "individuals" that simply live to the next round, modern strategies treat them as statistical data points. We don't just "pick" the winners; we recombine them to update the center of our search distribution.
In a strategy where $\mu > 1$, we have multiple survivors. Instead of having each survivor start its own independent family, we merge their "knowledge." This process is called Recombination.
In Evolution Strategies, we use Intermediate Recombination. We calculate a single "Center of Mass" from all $\mu$ survivors. This new center becomes the mean $\mathbf{m}^{(g+1)}$ for the entire next generation. Mathematically: $$\mathbf{m}^{(g+1)} = \sum_{i=1}^{\mu} w_i \, \mathbf{x}_{i:\lambda}$$ Where:
- $\mathbf{x}_{i:\lambda}$ is the $i$-th best of the $\lambda$ offspring (ranked by fitness),
- $w_i$ are positive recombination weights with $\sum_{i=1}^{\mu} w_i = 1$.
How much influence should each survivor have? There are two main approaches:
| Strategy | Notation | Description |
|---|---|---|
| Equal Weights | $(\mu / \mu, \lambda)$ | All $\mu$ survivors contribute equally ($w_i = 1/\mu$). The new mean is a simple average. |
| Weighted Recombination | $(\mu / \mu_w, \lambda)$ | The best individual has the highest weight, the second-best slightly less, and so on. |
"Weighted Recombination" is explained in detail later; let's first look at an example.
Recombination is generally more robust than selecting a single winner: averaging the $\mu$ survivors smooths out sampling noise, so a single lucky (or unlucky) sample cannot drag the search center off course on its own.
Mutation is the process of adding stochastic (random) noise to the parameters to explore the search space. In Evolution Strategies, this is typically done using a Gaussian (Normal) Distribution.
In this context, the state of our search is defined by the parameter set $\theta$. In the most general case, $\theta$ is a composite of three elements: $$\theta = \{ \mathbf{m}, \sigma, \mathbf{C} \}$$
The standard mutation for an individual $i$ is the realization of the current model:
$$\mathbf{x}_i = \mathbf{m} + \sigma \cdot \mathcal{N}(\mathbf{0}, \mathbf{C})$$ Where:
- $\mathbf{m}$ is the mean, the current center of the search distribution,
- $\sigma$ is the global step size, controlling the overall mutation strength,
- $\mathcal{N}(\mathbf{0}, \mathbf{C})$ is a multivariate normal sample with zero mean and covariance matrix $\mathbf{C}$, which determines the shape of the distribution.
In many basic Evolution Strategies, we do not yet try to learn the correlations between variables. If we fix the covariance matrix as the Identity matrix ($\mathbf{C} = \mathbf{I}$), our distribution remains perfectly spherical (isotropic). In this case, our state parameters simplify to:
$$\theta = \{ \mathbf{m}, \sigma \}$$
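Sampling from this simplified model is a one-liner in NumPy. The sketch below (with arbitrary values for $\mathbf{m}$ and $\sigma$) draws a large population and checks that its empirical statistics match the model parameters:

```python
import numpy as np

rng = np.random.default_rng(42)

m = np.array([2.0, -3.0])  # distribution mean
sigma = 0.5                # global step size
lam = 1000                 # large sample, so the statistics are visible

# With C = I the mutation is isotropic: x_i = m + sigma * N(0, I)
offspring = m + sigma * rng.standard_normal((lam, m.size))

print(offspring.mean(axis=0))  # close to m
print(offspring.std(axis=0))   # close to sigma
```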
The "intelligence" of the algorithm resides in the update function $F_\theta$. It takes the results from the current generation and calculates the parameters for the next. For the simplified case where $\mathbf{C} = \mathbf{I}$, the function $F_\theta$ is composed of two distinct parts:
1. Mean update (recombination): shift $\mathbf{m}$ to the weighted average of the $\mu$ best offspring.
2. Step-size update: adapt $\sigma$ based on the success of the current generation (discussed under step-size control below).
Now, let's visualize this. Imagine we are searching for the minimum of a function $f(\mathbf{x})$. We use a $(2/2, 10)$ strategy: we create 10 offspring, and the top 2 are averaged to find the next center.
```python
import numpy as np
import matplotlib.pyplot as plt

# 1. Setup Parameters
mu_size = 2       # Number of parents to recombine
lambda_size = 10  # Total offspring per generation
sigma = 0.5
target = np.array([2.0, -3.0])

# Initialize a SINGLE mean (the center of the distribution)
m = np.array([-4.0, 4.0])

def evaluate(x):
    return np.sum((x - target)**2)

# Grid for the landscape visualization
x_grid = np.linspace(-6, 6, 100)
y_grid = np.linspace(-6, 6, 100)
X, Y = np.meshgrid(x_grid, y_grid)
Z = (X - target[0])**2 + (Y - target[1])**2

# Visualization setup
gens_to_plot = [0, 2, 9, 14]
fig, axes = plt.subplots(2, 2, figsize=(12, 11))
axes = axes.flatten()

def plot_generation(ax, gen, target, offspring, current_mean, winners, new_mean, is_first):
    ax.contourf(X, Y, Z, levels=25, cmap='viridis', alpha=0.2)
    ax.scatter(target[0], target[1], c='green', marker='*', s=200, label='Target', zorder=5)
    # Plot all offspring (the "cloud")
    ax.scatter(offspring[:, 0], offspring[:, 1], c='blue', alpha=0.4, s=20, label='Offspring')
    # Plot the current mean (the center)
    ax.scatter(current_mean[0], current_mean[1], edgecolors='blue', marker='o',
               facecolors='none', s=100, linewidth=2, label='Current Mean (m)')
    # Plot the survivors (the best mu)
    ax.scatter(winners[:, 0], winners[:, 1], c='red', marker='x', s=80, label='Best mu=2')
    # Plot the resulting new mean (the average of the survivors)
    ax.scatter(new_mean[0], new_mean[1], c='red', marker='P', s=100, label='New Mean (m_next)', zorder=6)
    ax.set_title(f"Generation {gen}")
    ax.set_aspect('equal')
    if is_first:
        ax.legend(loc='upper right', fontsize=8)

# 2. The Evolution Loop (Ask-Tell)
plot_idx = 0
for gen in range(15):
    # --- ASK ---
    # All offspring are sampled from the same mean m
    noise = np.random.standard_normal((lambda_size, 2))
    offspring = m + sigma * noise

    # --- EVALUATE ---
    costs = np.array([evaluate(child) for child in offspring])

    # --- TELL ---
    # Sort by fitness and pick the mu_size best
    best_indices = np.argsort(costs)[:mu_size]
    winners = offspring[best_indices]

    # RECOMBINATION: Calculate the new mean (Intermediate Recombination)
    # using equal weights (1/mu) for this example
    m_next = np.mean(winners, axis=0)

    # Snapshot plotting
    if gen in gens_to_plot:
        plot_generation(axes[plot_idx], gen, target, offspring, m, winners, m_next, plot_idx == 0)
        plot_idx += 1

    # UPDATE parameter for next generation
    m = m_next

plt.tight_layout()
plt.show()
```
When the solver performs the Tell & Update step, it must decide how much to trust each survivor. While equal weights are simple, they treat the "champion" and the "borderline survivor" as equals.
A common choice in modern ES (and the default in CMA-ES) is logarithmic weighting. This strategy penalizes lower-ranked individuals more harshly than a linear drop-off would. It ensures the "champion" has a significantly stronger influence on the new mean $\mathbf{m}^{(g+1)}$ than the mediocre survivors at the bottom of the top-$\mu$ list.
The raw weights $w'$ are calculated as:
$$w'_i = \ln(\mu + 0.5) - \ln(i) \quad \text{for } i = 1 \dots \mu$$
Normalization:
To ensure the weights sum to 1, we normalize the raw values:
$$w_i = \frac{w'_i}{\sum_{j=1}^{\mu} w'_j}$$
Example Weight Calculation ($\mu=5$)
In this example, we use the common logarithmic weighting scheme: $w'_i = \ln(\mu + 0.5) - \ln(i)$.
| Rank ($i$) | Formula: $\ln(5.5) - \ln(i)$ | Raw Weight ($w'_i$) | Normalized Weight ($w_i$) |
|---|---|---|---|
| 1 (Best) | $\ln(5.5) - \ln(1)$ | $1.7047$ | $0.456$ |
| 2 | $\ln(5.5) - \ln(2)$ | $1.0116$ | $0.271$ |
| 3 | $\ln(5.5) - \ln(3)$ | $0.6061$ | $0.162$ |
| 4 | $\ln(5.5) - \ln(4)$ | $0.3185$ | $0.085$ |
| 5 | $\ln(5.5) - \ln(5)$ | $0.0953$ | $0.026$ |
| Sum | | $3.7362$ | $1.000$ |
Notice that the Best individual has roughly 18 times the influence of the 5th individual. This aggressive weighting allows the search center to shift decisively toward the most promising direction while still benefiting from the "smoothing" effect of the other survivors.
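The weight calculation above is two lines of NumPy, which makes it easy to verify the normalization and the influence ratio for any $\mu$:

```python
import numpy as np

mu = 5
ranks = np.arange(1, mu + 1)

# Raw logarithmic weights: w'_i = ln(mu + 0.5) - ln(i)
w_raw = np.log(mu + 0.5) - np.log(ranks)
# Normalize so the weights sum to 1
w = w_raw / w_raw.sum()

print(np.round(w, 3))  # normalized weights for mu = 5
print(w[0] / w[-1])    # influence ratio of best vs. worst survivor
```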
One of the most critical challenges in ES is choosing the step size $\sigma$ (the mutation strength).
Step-size control allows the algorithm to automatically adapt $\sigma$ during the search. This is often called Meta-Evolution because the algorithm is evolving not just the solution, but also the parameters of the search itself.
One of the oldest and most intuitive methods for controlling step size is the 1/5th Success Rule. It is based on the observation that if more than 20% of mutations are successful (better than the parent), the search is likely in a smooth area and should move faster.
| Observation | Action | Logic |
|---|---|---|
| Success rate > 1/5 | Increase: $\sigma \leftarrow \sigma / 0.85$ (factor $\approx 1.18$) | We are finding better points easily; we should explore further and "sprint." |
| Success rate < 1/5 | Decrease: $\sigma \leftarrow \sigma \cdot 0.85$ | We are missing the "good" areas; we should search more locally to "settle." |
Why $0.85$ for increase/decrease? The constant goes back to Schwefel's analysis of simple model functions (sphere and corridor): a multiplicative factor roughly in $[0.82, 1)$ keeps the success rate near the optimal $1/5$, and $0.85$ is the conventional compromise between fast adaptation and stability.
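The rule is easiest to see in a minimal (1+1)-ES sketch: a single parent, a single mutated child per iteration, and a $\sigma$ adjustment every 20 mutations based on the measured success rate. The sphere objective and the adaptation interval of 20 are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sum(x**2)  # toy sphere objective

x = np.array([5.0, -5.0])  # parent
sigma = 1.0
successes = 0

for t in range(1, 201):
    # Mutate the parent and keep the child only if it is better (elitism)
    child = x + sigma * rng.standard_normal(2)
    if f(child) < f(x):
        x = child
        successes += 1
    # Every 20 mutations, apply the 1/5th success rule
    if t % 20 == 0:
        rate = successes / 20
        sigma = sigma / 0.85 if rate > 0.2 else sigma * 0.85
        successes = 0
```

Because the parent is only ever replaced by a better child, the best-so-far cost never increases, while $\sigma$ automatically shrinks as the search approaches the optimum.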
Modern algorithms like CMA-ES (later in this course) use Evolution Paths. Instead of just looking at the current generation, they track the direction the mean has moved over several generations.
| Scenario | Movement Pattern | Adjustment | Goal |
|---|---|---|---|
| Sprinting | Multiple generations move in the same direction. | Increase $\sigma$ | Reach the neighborhood of the optimum faster. |
| Zig-Zagging | Generations jump back and forth across a valley. | Decrease $\sigma$ | Increase precision and "settle" into the minimum. |
| Stagnation | No offspring are better than the parent. | Increase $\sigma$ | Break out of a local optimum or flat region. |
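The "sprinting" vs. "zig-zagging" distinction can be illustrated with a toy path accumulator. This is a simplified sketch of the evolution-path idea, not the actual CSA update from CMA-ES, and the smoothing factor is an arbitrary choice:

```python
import numpy as np

c = 0.3  # smoothing factor for the exponential average (arbitrary)

def path_length(steps):
    # Accumulate successive mean shifts with exponential smoothing.
    # Aligned steps reinforce the path; alternating steps cancel out.
    p = np.zeros(2)
    for s in steps:
        p = (1 - c) * p + c * np.asarray(s, dtype=float)
    return np.linalg.norm(p)

sprint = [(1.0, 0.0)] * 10               # same direction every generation
zigzag = [(1.0, 0.0), (-1.0, 0.0)] * 5   # back and forth across a valley

# A long accumulated path suggests increasing sigma ("sprinting"),
# a short one suggests decreasing it ("zig-zagging").
print(path_length(sprint), path_length(zigzag))
```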
| Principle | Basic Evolution Strategy | CMA-ES (The Next Step) |
|---|---|---|
| Mutation | Circular: Uses an Identity Matrix. All directions are explored with equal probability. | Ellipsoidal: Learns a Covariance Matrix ($C$) to "stretch" and "rotate" the search toward the goal. |
| Recombination | Weighted Mean: The center of the next generation is the average of the survivors. | Strategic Mean: The mean update is used to calculate the "Evolution Path" and the new matrix. |
| Step-Size Control | 1/5th Success Rule: Reactive adjustment based on the ratio of "better" offspring. | Cumulative Step Adaptation (CSA): Analyzes the search path to detect "sprinting" vs. "zig-zagging." |
| Selection | Rank-based $(\mu/\mu_w, \lambda)$: Only the fitness rank matters, not the absolute values. | Same Logic: Remains a robust, gradient-free black-box optimizer. |