Conjugate directions¶

Definition¶

Two directions $\mathbf{d}_1$ and $\mathbf{d}_2$ are conjugate with respect to a symmetric, positive-definite Hessian matrix $H$ if:

$$\mathbf{d}_2^T \mathbf{H} \mathbf{d}_1 = 0$$

In [1]:
import numpy as np
import matplotlib.pyplot as plt

# 1. Define a rotated Hessian
theta = np.radians(30)
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, -s], [s, c]])
Lambda = np.diag([1.0, 8.0])
H = R @ Lambda @ R.T

def f(x, y):
    return 0.5 * (H[0,0]*x**2 + (H[0,1]+H[1,0])*x*y + H[1,1]*y**2)

def grad_f(p):
    return H @ p

# 2. Starting point P0
P0 = np.array([3.5, -0.5])

# 3. Step 1: Steepest Descent Direction
d1 = -grad_f(P0)
alpha1 = -(grad_f(P0).dot(d1)) / (d1.dot(H @ d1))
P1 = P0 + alpha1 * d1

# 4. Step 2: H-Conjugate Direction (The efficient path)
w = H @ d1
d2_conj = np.array([-w[1], w[0]])
alpha2 = -(grad_f(P1).dot(d2_conj)) / (d2_conj.dot(H @ d2_conj))
P2_conj = P1 + alpha2 * d2_conj

# 5. Comparison: Negative Gradient at P1
d2_grad = -grad_f(P1)
# Scale it for visibility, usually used with a small step size in GD
P2_grad = P1 + d2_grad * 0.4 

# 6. Visualization
plt.figure(figsize=(10, 8))
x, y = np.meshgrid(np.linspace(-4, 4, 150), np.linspace(-4, 4, 150))
plt.contour(x, y, f(x, y), levels=25, cmap='viridis', alpha=0.3)

# Path 1
plt.annotate('', xy=P1, xytext=P0, arrowprops=dict(arrowstyle='->', color='black', lw=2.5))
# Conjugate Step
plt.annotate('', xy=P2_conj, xytext=P1, arrowprops=dict(arrowstyle='->', color='red', lw=2.5))
# Pure Gradient Step from P1
plt.annotate('', xy=P2_grad, xytext=P1, arrowprops=dict(arrowstyle='->', color='blue', lw=2.5, ls='--'))

plt.plot(P0[0], P0[1], 'ko', label='Start ($p_0$)')
plt.plot(P1[0], P1[1], 'go', label='Line Min ($p_1$)')
plt.plot(0, 0, 'X', color='gold', markersize=12, label='Optimum')

# Labels
plt.text(P2_conj[0]+0.2, P2_conj[1]+0.2, 'Conjugate Step', color='red', fontweight='bold')
plt.text(P2_grad[0]-0.5, P2_grad[1]-0.5, 'Negative Gradient', color='blue', fontweight='bold')

plt.title('Comparison: Conjugate Direction vs. Pure Negative Gradient at $p_1$')
plt.legend(loc='upper left')
plt.axis('equal')
plt.grid(alpha=0.2)
#plt.savefig('conjugate_vs_gradient.png')
plt.show()

The visualization demonstrates the critical difference between Steepest Descent (following the gradient) and Conjugate Directions on a quadratic "valley" with a high aspect ratio (a narrow ellipse).

  1. The Black Arrow ($\mathbf{d}_1$): This represents the first step starting from $\mathbf{p}_0$. The direction $\mathbf{d}_1$ is the negative gradient at the starting point ($-\nabla f(\mathbf{p}_0)$). Notice that it starts out perpendicular to the contour line at $\mathbf{p}_0$.
  2. The Green Point ($\mathbf{p}_1$): This is the result of a Line Search along $\mathbf{d}_1$. You travel in a straight line until you reach the lowest possible altitude in that direction. Geometrically, this is the point where the path becomes tangent to a contour line.
  3. The Blue Dashed Arrow: This shows the negative gradient at $\mathbf{p}_1$. As the math proves, it is exactly $90^\circ$ (orthogonal) to the black arrow. Note that it points across the valley, not toward the optimum. If you followed this, you would begin the inefficient "zig-zag" behavior.
  4. The Red Arrow ($\mathbf{d}_2$): This is the H-Conjugate direction. It is "tilted" by the Hessian to account for the shape of the valley. Unlike the gradient, this direction points directly to the optimum.
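The claim in point 4 — that the conjugate step lands exactly on the optimum — can be checked numerically. The sketch below is a standalone check (not part of the plotting cell above) that re-creates the same quadratic and start point and verifies two-step convergence:

```python
import numpy as np

# Same quadratic as in the cell above: H = R diag(1, 8) R^T, rotated by 30 degrees
theta = np.radians(30)
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, -s], [s, c]])
H = R @ np.diag([1.0, 8.0]) @ R.T
grad = lambda p: H @ p                 # minimum of f(p) = 0.5 p^T H p is the origin

P0 = np.array([3.5, -0.5])

# Step 1: exact line search along the steepest-descent direction
d1 = -grad(P0)
a1 = -(grad(P0) @ d1) / (d1 @ H @ d1)
P1 = P0 + a1 * d1

# Step 2: exact line search along the H-conjugate direction
w = H @ d1
d2 = np.array([-w[1], w[0]])           # perpendicular to H d1  =>  d2^T H d1 = 0
a2 = -(grad(P1) @ d2) / (d2 @ H @ d2)
P2 = P1 + a2 * d2

print(np.allclose(P2, 0.0))            # two conjugate steps land on the optimum
```

In 2D the gradient at $\mathbf{p}_2$ is orthogonal to both (linearly independent) search directions, so it must be the zero vector — the red arrow ends at the minimum.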

The Change in the Gradient ($\Delta \nabla f$)¶

In a quadratic model, the gradient at any vector $\mathbf{x}$ is defined by:

$$\nabla f(\mathbf{x}) = \mathbf{H}\mathbf{x} + \mathbf{b}$$

When we move from point $\mathbf{p}_1$ in direction $\mathbf{d}_2$ to a new point $\mathbf{p}_2 = \mathbf{p}_1 + \alpha \mathbf{d}_2$, the gradient changes. The difference between the new and the old gradient is:

$$\Delta \nabla f = \nabla f(\mathbf{p}_2) - \nabla f(\mathbf{p}_1) = \mathbf{H}(\mathbf{p}_2 - \mathbf{p}_1) = \alpha \mathbf{H} \mathbf{d}_2$$

Thus, the term $\mathbf{H} \mathbf{d}_2$ describes how the gradient evolves as we travel along $\mathbf{d}_2$.
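A quick numerical check of this identity, as a minimal sketch with an arbitrary symmetric positive-definite $\mathbf{H}$ chosen for illustration (note that $\mathbf{b}$ cancels in the difference, so it is set to zero here for brevity):

```python
import numpy as np

# Illustrative symmetric positive-definite Hessian (not from the text)
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
p1 = np.array([1.0, -2.0])
d2 = np.array([0.5, 1.0])
alpha = 0.3

p2 = p1 + alpha * d2
delta_grad = H @ p2 - H @ p1                   # grad f(x) = H x  (b = 0 here)
print(np.allclose(delta_grad, alpha * H @ d2)) # -> True
```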

The Condition for "Non-Interference"¶

Suppose we have already minimized the function along direction $\mathbf{d}_1$, reaching point $\mathbf{p}_1$. At this minimum on the $\mathbf{d}_1$-line, the directional derivative along $\mathbf{d}_1$ is zero:

$$\mathbf{d}_1^T \nabla f(\mathbf{p}_1) = 0$$

To ensure that our next step along $\mathbf{d}_2$ does not "undo" this progress, we require the slope in direction $\mathbf{d}_1$ to remain zero at the new point $\mathbf{p}_2$:

$$\mathbf{d}_1^T \nabla f(\mathbf{p}_2) = 0$$

The Bridge to Conjugacy¶

Substituting the expression for the updated gradient:

$$\nabla f(\mathbf{p}_2) = \nabla f(\mathbf{p}_1) + \alpha \mathbf{H} \mathbf{d}_2$$

Multiplying from the left by $\mathbf{d}_1^T$:

$$\mathbf{d}_1^T \nabla f(\mathbf{p}_2) = \underbrace{\mathbf{d}_1^T \nabla f(\mathbf{p}_1)}_{= 0} + \alpha \mathbf{d}_1^T \mathbf{H} \mathbf{d}_2$$

For our requirement to be satisfied, the second term must vanish:

$$\alpha (\mathbf{d}_1^T \mathbf{H} \mathbf{d}_2) = 0$$

Since the step size $\alpha$ is non-zero for any non-trivial step, this forces $\mathbf{d}_1^T \mathbf{H} \mathbf{d}_2 = 0$. And since $\mathbf{H}$ is symmetric ($\mathbf{H} = \mathbf{H}^T$), we can transpose this scalar to reach the standard definition:

$$\mathbf{d}_2^T \mathbf{H} \mathbf{d}_1 = 0$$
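The derivation can be verified numerically. The sketch below constructs a direction that is $\mathbf{H}$-conjugate to $\mathbf{d}_1$ by Gram-Schmidt orthogonalization in the $\mathbf{H}$ inner product (an assumed construction, not taken from the text) and confirms that the slope along $\mathbf{d}_1$ stays zero for any step size $\alpha$ — not only for the line-search optimum:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random symmetric positive-definite H (illustrative 3-D example)
A = rng.standard_normal((3, 3))
H = A @ A.T + 3 * np.eye(3)

d1 = rng.standard_normal(3)
# Make d2 H-conjugate to d1: remove its H-component along d1 (Gram-Schmidt in the H inner product)
v = rng.standard_normal(3)
d2 = v - (d1 @ H @ v) / (d1 @ H @ d1) * d1
assert abs(d2 @ H @ d1) < 1e-10

grad = lambda p: H @ p                 # quadratic with b = 0 for brevity
p0 = rng.standard_normal(3)
a1 = -(grad(p0) @ d1) / (d1 @ H @ d1)  # exact line search along d1
p1 = p0 + a1 * d1
print(abs(d1 @ grad(p1)) < 1e-10)      # at the line minimum, slope along d1 is zero

for alpha in (0.1, 1.0, -2.5):         # the step size along d2 is arbitrary
    p2 = p1 + alpha * d2
    print(abs(d1 @ grad(p2)) < 1e-10)  # slope along d1 remains zero
```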

Summary:
To prevent a step along $\mathbf{d}_2$ from disturbing the minimization achieved along $\mathbf{d}_1$, the change in the gradient ($\mathbf{H}\mathbf{d}_2$) must be orthogonal to the previous direction ($\mathbf{d}_1$). In a quadratic landscape, this decoupling allows for exact convergence in $n$ steps for $n$ dimensions.

What does this mean intuitively?¶

Imagine two faders on a mixing console ($\mathbf{d}_1$ and $\mathbf{d}_2$):

  • Standard Orthogonal Directions: Moving fader $\mathbf{d}_2$ slightly shifts the setting of fader $\mathbf{d}_1$. You have to constantly "zig-zag" back and forth to keep both at their ideal levels.
  • Conjugate Directions: The faders are mathematically "wired" via the Hessian $\mathbf{H}$ such that moving $\mathbf{d}_2$ leaves the value of $\mathbf{d}_1$ untouched.

The directions are decoupled. Any movement along $\mathbf{d}_2$ generates a gradient change that is perpendicular to $\mathbf{d}_1$, allowing you to move "sideways" relative to the first direction without climbing back up the slope you just descended.

The Global Convergence Property¶

The true power of this decoupling is realized when we consider all $n$ dimensions of a problem.

If we choose a sequence of directions $\{\mathbf{d}_1, \mathbf{d}_2, \dots, \mathbf{d}_n\}$ that are all mutually conjugate (meaning $\mathbf{d}_i^T \mathbf{H} \mathbf{d}_j = 0$ for all $i \neq j$), then:

  1. Permanent Minimization: After performing a line search along $\mathbf{d}_1$, the gradient in that direction becomes zero. Because subsequent steps $\mathbf{d}_2, \mathbf{d}_3, \dots$ are conjugate to $\mathbf{d}_1$, the gradient component in the $\mathbf{d}_1$ direction stays zero for the remainder of the optimization.
  2. $n$-Step Convergence: Each new conjugate step "clears" one more dimension of the problem without re-introducing error into the dimensions already optimized.
  3. The Result: After exactly $n$ such line-search steps, the gradient has no non-zero components left in any direction. This means the gradient $\nabla f$ must be the zero vector, and the global minimum of an $n$-dimensional quadratic function is reached.
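This $n$-step property can be demonstrated directly. The sketch below builds $n$ mutually conjugate directions by Gram-Schmidt orthogonalization in the $\mathbf{H}$ inner product (one standard way to obtain such a set; the conjugate gradient method constructs them more cheaply from successive gradients) and performs one exact line search per direction:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)            # symmetric positive definite
b = rng.standard_normal(n)

grad = lambda x: H @ x + b             # gradient of f(x) = 0.5 x^T H x + b^T x
x_star = np.linalg.solve(H, -b)        # exact minimizer for comparison

# Build mutually conjugate directions from the standard basis
# via Gram-Schmidt in the H inner product
D = []
for e in np.eye(n):
    d = e.copy()
    for dj in D:
        d -= (dj @ H @ e) / (dj @ H @ dj) * dj
    D.append(d)

# n exact line searches, one per conjugate direction
x = rng.standard_normal(n)
for d in D:
    alpha = -(grad(x) @ d) / (d @ H @ d)
    x = x + alpha * d

print(np.allclose(x, x_star))          # minimum reached after exactly n steps
```

Because each line search zeros the gradient component along one direction and conjugacy keeps it zero, no component survives after $n$ steps — regardless of the starting point.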

Why this matters for non-quadratic functions (and CMA-ES)¶

In the real world, functions are rarely perfectly quadratic. However:

  • Locally Quadratic: Near a minimum, almost all smooth functions look like a quadratic "bowl" (this is the basis of the Taylor expansion).
  • CMA-ES Application: While CMA-ES doesn't do exact line searches, it uses the Evolution Path to "build" these conjugate directions. By adapting the Covariance Matrix $\mathbf{C}$ to match $\mathbf{H}^{-1}$, it turns the complex, tilted landscape into a simple "round" one where every step taken is effectively conjugate to the last. This is why CMA-ES can solve highly "ill-conditioned" problems (narrow, tilted valleys) where other methods get stuck zig-zagging forever.
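The "rounding" effect of adapting $\mathbf{C}$ toward $\mathbf{H}^{-1}$ can be illustrated with a small sketch. This is not the CMA-ES update itself, only the end state it aims for: when steps are drawn as $\mathbf{C}^{1/2}\mathbf{z}$ with $\mathbf{C}^{1/2} = \mathbf{H}^{-1/2}$, the effective Hessian seen by the sampler becomes the identity, so every direction looks the same:

```python
import numpy as np

# Ill-conditioned quadratic: a narrow, tilted valley
theta = np.radians(30)
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, -s], [s, c]])
H = R @ np.diag([1.0, 100.0]) @ R.T

# C = H^{-1}: compute C^{1/2} = H^{-1/2} via the eigendecomposition of H
eigvals, V = np.linalg.eigh(H)
C_sqrt = V @ np.diag(1.0 / np.sqrt(eigvals)) @ V.T

# Effective Hessian in the transformed coordinates: C^{1/2} H C^{1/2} = I
H_eff = C_sqrt @ H @ C_sqrt
print(np.round(np.linalg.cond(H), 1))  # 100.0 (ill-conditioned)
print(np.allclose(H_eff, np.eye(2)))   # True: the transformed bowl is round
```

In the round landscape the negative gradient points straight at the optimum, which is why an ideally adapted covariance removes the zig-zagging entirely.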