import numpy as np
import matplotlib.pyplot as plt
# 1. Define a rotated Hessian
theta = np.radians(30)
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, -s], [s, c]])
Lambda = np.diag([1.0, 8.0])
H = R @ Lambda @ R.T
def f(x, y):
    return 0.5 * (H[0,0]*x**2 + (H[0,1]+H[1,0])*x*y + H[1,1]*y**2)

def grad_f(p):
    return H @ p
# 2. Starting point P0
P0 = np.array([3.5, -0.5])
# 3. Step 1: Steepest Descent Direction
d1 = -grad_f(P0)
alpha1 = -(grad_f(P0).dot(d1)) / (d1.dot(H @ d1))
P1 = P0 + alpha1 * d1
# 4. Step 2: H-Conjugate Direction (The efficient path)
w = H @ d1
d2_conj = np.array([-w[1], w[0]])  # perpendicular to H @ d1, hence H-conjugate to d1
alpha2 = -(grad_f(P1).dot(d2_conj)) / (d2_conj.dot(H @ d2_conj))
P2_conj = P1 + alpha2 * d2_conj
# 5. Comparison: Negative Gradient at P1
d2_grad = -grad_f(P1)
# Scaled for visibility only; in gradient descent this direction would be taken with a small step size
P2_grad = P1 + d2_grad * 0.4
# 6. Visualization
plt.figure(figsize=(10, 8))
x, y = np.meshgrid(np.linspace(-4, 4, 150), np.linspace(-4, 4, 150))
plt.contour(x, y, f(x, y), levels=25, cmap='viridis', alpha=0.3)
# Path 1
plt.annotate('', xy=P1, xytext=P0, arrowprops=dict(arrowstyle='->', color='black', lw=2.5))
# Conjugate Step
plt.annotate('', xy=P2_conj, xytext=P1, arrowprops=dict(arrowstyle='->', color='red', lw=2.5))
# Pure Gradient Step from P1
plt.annotate('', xy=P2_grad, xytext=P1, arrowprops=dict(arrowstyle='->', color='blue', lw=2.5, ls='--'))
plt.plot(P0[0], P0[1], 'ko', label='Start ($p_0$)')
plt.plot(P1[0], P1[1], 'go', label='Line Min ($p_1$)')
plt.plot(0, 0, 'X', color='gold', markersize=12, label='Optimum')
# Labels
plt.text(P2_conj[0]+0.2, P2_conj[1]+0.2, 'Conjugate Step', color='red', fontweight='bold')
plt.text(P2_grad[0]-0.5, P2_grad[1]-0.5, 'Negative Gradient', color='blue', fontweight='bold')
plt.title('Comparison: Conjugate Direction vs. Pure Negative Gradient at $p_1$')
plt.legend(loc='upper left')
plt.axis('equal')
plt.grid(alpha=0.2)
#plt.savefig('conjugate_vs_gradient.png')
plt.show()
The visualization demonstrates the critical difference between Steepest Descent (following the gradient) and Conjugate Directions on a quadratic "valley" with a high aspect ratio (a narrow ellipse).
In a quadratic model $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T \mathbf{H} \mathbf{x} + \mathbf{b}^T \mathbf{x}$, the gradient at any point $\mathbf{x}$ is:
$$\nabla f(\mathbf{x}) = \mathbf{H}\mathbf{x} + \mathbf{b}$$
(In the script above $\mathbf{b} = \mathbf{0}$, so the optimum sits at the origin.)
When we move from point $\mathbf{p}_1$ in direction $\mathbf{d}_2$ to a new point $\mathbf{p}_2 = \mathbf{p}_1 + \alpha \mathbf{d}_2$, the gradient changes. The difference between the new and the old gradient is:
$$\Delta \nabla f = \nabla f(\mathbf{p}_2) - \nabla f(\mathbf{p}_1) = \mathbf{H}(\mathbf{p}_2 - \mathbf{p}_1) = \alpha \mathbf{H} \mathbf{d}_2$$
Thus, the term $\mathbf{H} \mathbf{d}_2$ describes how the gradient evolves as we travel along $\mathbf{d}_2$. Because $\mathbf{H}$ is constant, this change depends only on the direction and step length, not on the starting point.
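This identity is easy to check numerically with the same rotated Hessian the script builds; the direction and step length below are arbitrary choices, purely for illustration:

```python
import numpy as np

# Same rotated Hessian as in the script above
theta = np.radians(30)
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, -s], [s, c]])
H = R @ np.diag([1.0, 8.0]) @ R.T

grad = lambda p: H @ p  # b = 0 in this example

p1 = np.array([3.5, -0.5])
d2 = np.array([0.3, 1.1])   # arbitrary direction
alpha = 0.7                 # arbitrary step length
p2 = p1 + alpha * d2

# The gradient change equals alpha * H @ d2, regardless of where p1 is
delta = grad(p2) - grad(p1)
print(np.allclose(delta, alpha * H @ d2))  # → True
```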
Suppose we have already minimized the function along direction $\mathbf{d}_1$, reaching point $\mathbf{p}_1$. At this minimum on the $\mathbf{d}_1$-line, the directional derivative along $\mathbf{d}_1$ is zero:
$$\mathbf{d}_1^T \nabla f(\mathbf{p}_1) = 0$$
To ensure that our next step along $\mathbf{d}_2$ does not "undo" this progress, we require the slope in direction $\mathbf{d}_1$ to remain zero at the new point $\mathbf{p}_2$:
$$\mathbf{d}_1^T \nabla f(\mathbf{p}_2) = 0$$
Substituting the expression for the updated gradient:
$$\nabla f(\mathbf{p}_2) = \nabla f(\mathbf{p}_1) + \alpha \mathbf{H} \mathbf{d}_2$$
and multiplying by $\mathbf{d}_1^T$:
$$\mathbf{d}_1^T \nabla f(\mathbf{p}_2) = \underbrace{\mathbf{d}_1^T \nabla f(\mathbf{p}_1)}_{= 0} + \alpha \mathbf{d}_1^T \mathbf{H} \mathbf{d}_2$$
For our requirement to be satisfied, the second term must vanish:
$$\alpha (\mathbf{d}_1^T \mathbf{H} \mathbf{d}_2) = 0$$
Since this must hold for any nonzero step $\alpha$, and $\mathbf{H}$ is symmetric ($\mathbf{H} = \mathbf{H}^T$), we can transpose the scalar product to reach the standard definition:
$$\mathbf{d}_2^T \mathbf{H} \mathbf{d}_1 = 0$$
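This is exactly why the script picks `d2_conj` perpendicular to `H @ d1`: in two dimensions, any vector orthogonal to $\mathbf{H}\mathbf{d}_1$ is automatically conjugate to $\mathbf{d}_1$. A quick numerical check:

```python
import numpy as np

# Same rotated Hessian as in the script above
theta = np.radians(30)
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, -s], [s, c]])
H = R @ np.diag([1.0, 8.0]) @ R.T

d1 = -(H @ np.array([3.5, -0.5]))  # steepest-descent direction at p0
w = H @ d1
d2 = np.array([-w[1], w[0]])       # rotate H @ d1 by 90°: orthogonal to it by construction

# Conjugacy: d2ᵀ H d1 = d2 · (H d1) = 0 up to floating-point rounding
print(d2 @ H @ d1)
```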
Summary:
To prevent a step along $\mathbf{d}_2$ from disturbing the minimization achieved along $\mathbf{d}_1$, the change in the gradient ($\mathbf{H}\mathbf{d}_2$) must be orthogonal to the previous direction ($\mathbf{d}_1$). In a quadratic landscape, this decoupling allows for exact convergence in $n$ steps for $n$ dimensions.
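For the 2-D example in the script this is easy to verify: one exact steepest-descent line search followed by one exact conjugate line search lands precisely on the optimum at the origin:

```python
import numpy as np

# Same setup as the script above
theta = np.radians(30)
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, -s], [s, c]])
H = R @ np.diag([1.0, 8.0]) @ R.T
grad = lambda p: H @ p

p0 = np.array([3.5, -0.5])

# Step 1: exact line search along the steepest-descent direction
d1 = -grad(p0)
a1 = -(grad(p0) @ d1) / (d1 @ H @ d1)
p1 = p0 + a1 * d1

# Step 2: exact line search along the H-conjugate direction
w = H @ d1
d2 = np.array([-w[1], w[0]])
a2 = -(grad(p1) @ d2) / (d2 @ H @ d2)
p2 = p1 + a2 * d2

print(p2)  # ≈ [0, 0]: two steps suffice in two dimensions
```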
Imagine two faders on a mixing console, one for $\mathbf{d}_1$ and one for $\mathbf{d}_2$: once conjugacy holds, the faders are decoupled. Any movement along $\mathbf{d}_2$ produces a gradient change that is orthogonal to $\mathbf{d}_1$, so you can move "sideways" relative to the first direction without climbing back up the slope you just descended.
The true power of this decoupling is realized when we consider all $n$ dimensions of a problem.
If we choose a sequence of directions $\{\mathbf{d}_1, \mathbf{d}_2, \dots, \mathbf{d}_n\}$ that are all mutually conjugate (meaning $\mathbf{d}_i^T \mathbf{H} \mathbf{d}_j = 0$ for all $i \neq j$), then each exact line search preserves optimality along every previous direction, and the minimizer of the quadratic is reached after at most $n$ line searches.
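A minimal sketch of this $n$-step property: build mutually $\mathbf{H}$-conjugate directions by Gram–Schmidt-style conjugation of the standard basis, then do one exact line search per direction. The matrix, linear term, and starting point below are random placeholders, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)      # symmetric positive definite
b = rng.standard_normal(n)
grad = lambda p: H @ p + b       # gradient of f(p) = 0.5 pᵀHp + bᵀp

# Gram–Schmidt with the H-inner product: conjugate each basis vector
# against all previously accepted directions
D = []
for e in np.eye(n):
    d = e.copy()
    for dj in D:
        d -= (dj @ H @ e) / (dj @ H @ dj) * dj
    D.append(d)

# n exact line searches, one per conjugate direction
p = rng.standard_normal(n)
for d in D:
    alpha = -(grad(p) @ d) / (d @ H @ d)
    p += alpha * d

# The exact minimizer solves H p = -b
print(np.allclose(p, np.linalg.solve(H, -b)))  # → True
```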
In the real world, functions are rarely perfectly quadratic. However, near a minimum a smooth function is well approximated by its second-order Taylor expansion, so the conjugate-direction idea carries over: nonlinear conjugate gradient methods rebuild the search directions from gradients alone and restart periodically to cope with the changing curvature.
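As one illustration, here is a minimal Fletcher–Reeves-style sketch on a convex but non-quadratic function. The test function, backtracking constants, and restart rule are all illustrative assumptions, not the only (or best) choices:

```python
import numpy as np

# Hypothetical smooth, non-quadratic convex test function:
# f(x) = 0.5 xᵀHx + 0.1 Σ x_i⁴  (quadratic near 0, quartic farther out)
rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)

f = lambda x: 0.5 * x @ H @ x + 0.1 * np.sum(x**4)
grad = lambda x: H @ x + 0.4 * x**3

def fletcher_reeves(x, iters=100):
    g = grad(x)
    d = -g
    for k in range(iters):
        # Backtracking (Armijo) line search replaces the exact quadratic step
        alpha = 1.0
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        x = x + alpha * d
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)      # Fletcher–Reeves coefficient
        d = -g_new + beta * d
        if k % n == n - 1 or d @ g_new >= 0:  # periodic restart / descent safeguard
            d = -g_new
        g = g_new
    return x

x = fletcher_reeves(rng.standard_normal(n))
print(np.linalg.norm(grad(x)))  # small: converged near the minimum at 0
```

The restart every $n$ iterations mirrors the theory above: conjugacy is only meaningful relative to one fixed Hessian, so once the local quadratic model drifts, the accumulated directions are discarded.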