Eigen-Decomposition on the Example of a Covariance Matrix
Synthetic Data Generation
We start by generating a 2D Gaussian distribution with a known mean $\boldsymbol \mu$ and covariance $\boldsymbol \Sigma$. This simulates noisy sensor measurements.
```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(41)

# True parameters
od = 0.8  # off-diagonal covariance entry
Sigma = np.array([[2, od], [od, 0.6]])
mu = np.array([2., 1.])
nb_data = 40

# Sample the data
X = np.random.multivariate_normal(mu, Sigma, nb_data)

# Visualization
plt.figure(figsize=(5, 5))
plt.scatter(X[:, 0], X[:, 1], c='b', marker='x', label="Data")
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title("Sampled 2D Data")
plt.legend()
plt.show()
```
In reality, we don't know the true $\boldsymbol \mu$ or $\boldsymbol \Sigma$. We must estimate them from the $n$ samples, component by component.
The Empirical Mean ($\bar{x}$):$$\bar{x}_i = \frac{1}{n} \sum_{k=1}^n x_i^{(k)}$$
The Empirical Covariance:
To get an unbiased estimator, we use $n-1$ in the denominator. This is called Bessel's Correction; it compensates for the fact that we are measuring spread around the empirical mean $\bar{x}$ rather than the true population mean $\mu$, which slightly underestimates the true variance.
The covariance between dimension $i$ and dimension $j$ is:$$\text{cov}(x_i, x_j) = \frac{1}{n-1} \sum_{k=1}^n (x_i^{(k)} - \bar{x}_i)(x_j^{(k)} - \bar{x}_j)$$
To compute this efficiently for all dimensions at once, we center the data matrix $X$ by subtracting the mean (using NumPy broadcasting):
$$\tilde{\mathbf{X}} = \mathbf{X} - \mathbf{1}\bar{\mathbf{x}}^T$$
where $\mathbf{1} \in \mathbb{R}^n$ is a column vector of ones, so that each row of $\tilde{\mathbf{X}}$ is one centered sample.
Then, the empirical covariance matrix $\mathbf{C}$ is found via the matrix product:
$$\mathbf{C} = \frac{1}{n-1} \tilde{\mathbf{X}}^T \tilde{\mathbf{X}}$$
$$ \mathbf{C} = \frac{1}{n-1} \tilde{\mathbf{X}}^T \cdot \tilde{\mathbf{X}} = \frac{1}{n-1} \begin{pmatrix} \tilde x_1^{(1)}& \tilde x_1^{(2)} & \tilde x_1^{(3)} & \dots \\ \tilde x_2^{(1)}& \tilde x_2^{(2)} & \tilde x_2^{(3)} & \dots\\ \dots & \dots & \dots & \dots \\ \end{pmatrix} \cdot \begin{pmatrix} \tilde x_1^{(1)}& \tilde x_2^{(1)} & \dots \\ \tilde x_1^{(2)}& \tilde x_2^{(2)} & \dots \\ \tilde x_1^{(3)}& \tilde x_2^{(3)} & \dots \\ \dots & \dots & \dots \\ \end{pmatrix} = \frac{1}{n-1} \begin{pmatrix} \sum_k \tilde x_1^{(k)} \tilde x_1^{(k)} & \sum_k \tilde x_1^{(k)} \tilde x_2^{(k)} & \dots \\ \sum_k \tilde x_2^{(k)} \tilde x_1^{(k)} & \sum_k \tilde x_2^{(k)} \tilde x_2^{(k)} & \dots \\ \dots & \dots & \dots \\ \end{pmatrix} $$
```python
n = X.shape[0]
mean_empirical = X.mean(axis=0)
X_centered = X - mean_empirical

# Manual computation
C = (1 / (n - 1)) * X_centered.T @ X_centered

# NumPy built-in (should match):
# cov_mat_np = np.cov(X, rowvar=False)

print(f"Empirical Covariance Matrix:\n{C}")
```
Definition: A matrix $\mathbf{C}$ is symmetric if it is equal to its transpose ($\mathbf{C} = \mathbf{C}^T$).
In the world of Covariance Matrices, symmetry is guaranteed because the covariance between random variable $\mathbf{X}_1$ and $\mathbf{X}_2$ is identical to the covariance between $\mathbf{X}_2$ and $\mathbf{X}_1$.
Definition: A square matrix $\mathbf{C} \in \mathbb{R}^{n \times n}$ is positive-definite if for all $\mathbf v \in \mathbb{R}^{n} $ and $\mathbf v \neq 0$ holds $$ \mathbf{v}^T \mathbf{C} \mathbf{v} > 0 $$
Let $\mathbf{y} = \mathbf{x} - \boldsymbol \mu $ be a random vector (centered at zero) drawn from our data distribution. The covariance is defined as the expected value:$$\mathbf{C} = \mathbb{E}[\mathbf{y}\mathbf{y}^T]$$
Multiplying from the left by an arbitrary vector $\mathbf{v}^T$ and from the right by $\mathbf{v}$:
$$\mathbf{v}^T \mathbf{C} \mathbf{v} = \mathbf{v}^T \mathbb{E}[\mathbf{y}\mathbf{y}^T] \mathbf{v}$$
Since $\mathbf{v}$ is a constant vector, we can move it inside the expectation: $$\mathbb{E}[\mathbf{v}^T \mathbf{y}\mathbf{y}^T \mathbf{v}]$$
Notice that $\mathbf{v}^T \mathbf{y}$ is a scalar (let's call it $a$). Then $\mathbf{y}^T \mathbf{v}$ is also $a$. So the equation becomes:
$$\mathbb{E}[a \cdot a] = \mathbb{E}[a^2]$$
Because $a^2$ is the square of a number, it is always $\ge 0$. Therefore, its average (Expectation) is also $\ge 0$.
The expression $\mathbf{v}^T \mathbf{C} \mathbf{v}$ represents the variance of the data projected onto the direction $\mathbf{v}$. Since variance cannot be negative, the matrix must be positive semi-definite. If the data has any spread at all in that direction, the variance is $> 0$, making it positive definite.
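As a quick numerical sanity check, we can feed many random directions $\mathbf{v}$ through the quadratic form. A minimal sketch, reusing the true `Sigma` values from above as a stand-in for `C`:

```python
import numpy as np

rng = np.random.default_rng(0)
# Any valid covariance matrix works here; we reuse the true Sigma as a stand-in
C = np.array([[2.0, 0.8], [0.8, 0.6]])

# v^T C v is the variance of the data projected onto direction v
for _ in range(1000):
    v = rng.normal(size=2)
    assert v @ C @ v > 0  # strictly positive: C is positive-definite

print("All quadratic forms were positive.")
```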
For a square matrix $\mathbf{C}$, a non-zero vector $\mathbf{q}$ is an eigenvector if the transformation of $\mathbf{q}$ by $\mathbf{C}$ results only in a change of scale, not direction:
$$\mathbf{C}\mathbf{q} = \lambda\mathbf{q}$$
Remember: the eigenvectors $\mathbf{q}$ are not rotated by $\mathbf{C}$; they are only scaled by $\lambda$.
$$\mathbf{C}\mathbf{q} = \lambda \mathbf{q} \implies \mathbf{q}^T \mathbf{C} \mathbf{q} = \lambda (\mathbf{q}^T \mathbf{q})$$ Since $\mathbf{q}^T \mathbf{C} \mathbf{q} > 0$ and $\mathbf{q}^T \mathbf{q} > 0$, it follows that $\lambda > 0$.
To find $\lambda$, we rearrange the equation:$$(\mathbf{C} - \lambda \mathbb{I}) \mathbf{q} = \mathbf{0}$$
This equation has a non-zero solution for $\mathbf{q}$ only if the matrix $(\mathbf{C} - \lambda \mathbb{I})$ is singular, meaning its determinant is zero:
$$\det(\mathbf{C} - \lambda \mathbb{I}) = 0$$
Mathematical Example:
For a matrix $\mathbf{C} = \begin{pmatrix} 3 & 1 \\ 0 & 2 \end{pmatrix}$:
$$\det \begin{pmatrix} 3-\lambda & 1 \\ 0 & 2-\lambda \end{pmatrix} = (3-\lambda)(2-\lambda) - 0 = 0$$
Results: $\lambda_1 = 3, \lambda_2 = 2$.
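We can verify this with NumPy. Note that this example matrix is not symmetric, so we use `np.linalg.eig` rather than `eigh`:

```python
import numpy as np

C_example = np.array([[3., 1.], [0., 2.]])
eigvals, eigvecs = np.linalg.eig(C_example)

print(sorted(eigvals))  # expect eigenvalues 2 and 3
```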
What happens if we repeatedly apply the covariance matrix $\mathbf{C}$ to random vector (not an eigenvector) and rescale it to length 1?
```python
# 1. Define the transformation: apply C, then optionally rescale to unit length
def transform_and_rescale(C_matrix, y, rescale=True):
    y = C_matrix.dot(y)
    if rescale:
        y = y / np.linalg.norm(y)
    return y

# 2. Setup context (assuming C is our empirical covariance matrix from before)
# If you don't have C defined, uncomment the next two lines:
# Sigma = np.array([[2, 0.8], [0.8, 0.6]])
# C = Sigma

# 3. Initialize the vector
y = np.array([1., -1])  # starting with a slight offset from the axis
y = y / np.linalg.norm(y)

# 4. Plotting
fig, ax = plt.subplots(figsize=(5, 5))
ax.set_xlim(-1.2, 1.2)
ax.set_ylim(-1.2, 1.2)
ax.axhline(0, color='grey', lw=1)
ax.axvline(0, color='grey', lw=1)

# Plot the starting vector in red
ax.quiver(0, 0, y[0], y[1], color="red", scale=1, scale_units='xy',
          angles='xy', label="Start", zorder=3)

# 5. Repeatedly transform and plot the "path" toward the eigenvector
for i in range(10):
    y = transform_and_rescale(C, y)
    alpha_val = (i + 1) / 10  # make later iterations darker
    ax.quiver(0, 0, y[0], y[1], alpha=alpha_val, color="black",
              scale=1, scale_units='xy', angles='xy')

plt.title("Convergence toward the Dominant Eigenvector")
plt.legend()
plt.grid(True, linestyle=':', alpha=0.6)
plt.show()
```
The result:
No matter where you start (unless you are perfectly aligned with a smaller eigenvector), repeated transformation by $C$ will "pull" the vector toward the eigenvector with the largest eigenvalue.
In the context of our data, this direction is the axis of maximum variance.
| Term | Symbol | Meaning in Data Science |
|---|---|---|
| Covariance Matrix | $\mathbf{C}$ | Describes the "shape" and spread of the data cloud. |
| Eigenvector | $\bf{q}$ | The "Principal Axes" (directions of maximum/minimum spread). |
| Eigenvalue | $\lambda_i$ | The amount of variance captured along a specific eigenvector. |
| Trace($C$) | $\sum_i \lambda_i$ | The total variance in the dataset (sum of the diagonal). |
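The last row of the table — total variance equals the sum of the eigenvalues — is easy to check numerically. A sketch, again using the true `Sigma` values as a stand-in for `C`:

```python
import numpy as np

C = np.array([[2.0, 0.8], [0.8, 0.6]])
eigvals = np.linalg.eigvalsh(C)  # eigenvalues of a symmetric matrix

# Total variance: sum of the diagonal equals sum of the eigenvalues
assert np.isclose(np.trace(C), eigvals.sum())
print(f"trace = {np.trace(C):.2f}, sum of eigenvalues = {eigvals.sum():.2f}")
```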
Premise: Let $\mathbf{C} \in \mathbb{R}^{n \times n}$ be a symmetric matrix ($\mathbf{C} = \mathbf{C}^T$). Let $\mathbf{q}_1, \mathbf{q}_2$ be eigenvectors with corresponding distinct eigenvalues $\lambda_1, \lambda_2$ (where $\lambda_1 \neq \lambda_2$).
Starting from the two eigen-equations $\mathbf{C}\mathbf{q}_1 = \lambda_1 \mathbf{q}_1$ and $\mathbf{C}\mathbf{q}_2 = \lambda_2 \mathbf{q}_2$.
From the first eigen-equation:
$$(\lambda_1 \mathbf{q}_1)^T \cdot \mathbf{q}_2 = (\mathbf{C}\mathbf{q}_1)^T \cdot \mathbf{q}_2$$ By the property of transposes $(AB)^T = B^T A^T$: $$\lambda_1 \mathbf{q}_1^T \mathbf{q}_2 = \mathbf{q}_1^T \mathbf{C}^T \mathbf{q}_2$$
Apply the symmetry property. Since $\mathbf{C} = \mathbf{C}^T$: $$\lambda_1 \mathbf{q}_1^T \mathbf{q}_2 = \mathbf{q}_1^T \mathbf{C} \mathbf{q}_2$$
From the second eigen equation: $$\mathbf{q}_1^T \cdot (\lambda_2 \mathbf{q}_2) = \mathbf{q}_1^T \cdot (\mathbf{C} \mathbf{q}_2) $$ $$\lambda_2 \mathbf{q}_1^T \mathbf{q}_2 = \mathbf{q}_1^T \mathbf{C} \mathbf{q}_2$$
Subtracting the second result from the first: $$(\lambda_1 - \lambda_2) (\mathbf{q}_1^T \mathbf{q}_2) = 0$$
Conclusion: Since we assumed the eigenvalues are distinct ($\lambda_1 - \lambda_2 \neq 0$), the scalar product must be zero:$$\mathbf{q}_1^T \mathbf{q}_2 = 0$$This proves that the eigenvectors $\mathbf{q}_1$ and $\mathbf{q}_2$ are orthogonal.
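This orthogonality can be confirmed numerically; `np.linalg.eigh` returns the eigenvectors of a symmetric matrix as the columns of an orthonormal matrix:

```python
import numpy as np

C = np.array([[2.0, 0.8], [0.8, 0.6]])
eigvals, Q = np.linalg.eigh(C)  # columns of Q are the eigenvectors

# q1^T q2 should be (numerically) zero
assert abs(Q[:, 0] @ Q[:, 1]) < 1e-12
# and Q^T Q should be the identity
assert np.allclose(Q.T @ Q, np.eye(2))
print("Eigenvectors are orthonormal.")
```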
Eigen-decomposition is the process of "deconstructing" a square matrix into a set of characteristic directions and scales. In the context of data science and robotics, it reveals the hidden geometric structure of a linear transformation or a covariance matrix.
1. The Starting Point: The Vector Level
By definition, for an individual eigenvector $\mathbf{q}_i$ and its eigenvalue $\lambda_i$, we have:
$$\mathbf{C}\mathbf{q}_i = \lambda_i \mathbf{q}_i$$
This tells us that $\mathbf{C}$ acts on $\mathbf{q}_i$ just like a simple scalar multiplication (no rotation).
2. Scaling to the Matrix Level
If our matrix $\mathbf{C}$ is $n \times n$ and has $n$ linearly independent eigenvectors, we can "stack" these individual equations side-by-side.
Let $\mathbf{Q}$ be the matrix where each column is one of these eigenvectors:
$$\mathbf{Q} = \begin{bmatrix} | & | & & | \\ \mathbf{q}_1 & \mathbf{q}_2 & \dots & \mathbf{q}_n \\ | & | & & | \end{bmatrix}$$
If we multiply $\mathbf{C}$ by this entire matrix $\mathbf{Q}$, the rules of matrix multiplication tell us:
$$\mathbf{C}\mathbf{Q} = \begin{bmatrix} | & | & & | \\ \mathbf{C}\mathbf{q}_1 & \mathbf{C}\mathbf{q}_2 & \dots & \mathbf{C}\mathbf{q}_n \\ | & | & & | \end{bmatrix}$$
Using our definition from Step 1, we can replace each $\mathbf{C}\mathbf{q}_i$ with $\lambda_i\mathbf{q}_i$:
$$\mathbf{C}\mathbf{Q} = \begin{bmatrix} | & | & & | \\ \lambda_1\mathbf{q}_1 & \lambda_2\mathbf{q}_2 & \dots & \lambda_n\mathbf{q}_n \\ | & | & & | \end{bmatrix}$$
3. Factoring out the Eigenvalues
Notice that the right side can be rewritten as the product of $\mathbf{Q}$ and a diagonal matrix $\mathbf{\Lambda}$:
$$\mathbf{C}\mathbf{Q} = \mathbf{Q}\mathbf{\Lambda}$$
Where $\mathbf{\Lambda} = \text{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$.
4. Completing the Factorization
Since we assumed the eigenvectors are linearly independent, the matrix $\mathbf{Q}$ is guaranteed to be invertible. We can therefore multiply both sides by $\mathbf{Q}^{-1}$ from the right:
$$\mathbf{C}\mathbf{Q}\mathbf{Q}^{-1} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{-1}$$$$\mathbf{C} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{-1}$$
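A short sketch verifying the factorization on the (non-symmetric) example matrix from earlier, where $\mathbf{Q}^{-1}$ really is needed:

```python
import numpy as np

C = np.array([[3., 1.], [0., 2.]])  # example matrix from above
eigvals, Q = np.linalg.eig(C)       # columns of Q = eigenvectors
Lam = np.diag(eigvals)

# Reassemble: C should equal Q Lambda Q^{-1}
C_rebuilt = Q @ Lam @ np.linalg.inv(Q)
assert np.allclose(C, C_rebuilt)
print("C = Q Lambda Q^{-1} holds numerically.")
```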
Key Requirements (The "Catch")
Not every matrix can be factorized this way. For this math to work, two conditions must be met: the matrix must be square, and it must have $n$ linearly independent eigenvectors (such a matrix is called diagonalizable).
If a matrix $\mathbf{C} \in \mathbb{R}^{n \times n}$ has $n$ linearly independent eigenvectors, we can group them into a matrix $\mathbf{Q}$ and the eigenvalues into a diagonal matrix $\mathbf{\Lambda}$ (Lambda):
$$\mathbf{C} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{-1}$$
The Connection to the Sandwich Product:
If $\mathbf{C}$ is a symmetric matrix (like a covariance matrix), its eigenvectors are orthogonal (see above), meaning $\mathbf{Q}^{-1} = \mathbf{Q}^T$. This reveals that:
$$\mathbf{C} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^T$$
Notice the structure!
This is the exact same sandwich product we used for rotations. This tells us that any covariance matrix can be viewed as a simple diagonal matrix $\mathbf{\Lambda}$ (where variances are independent) that has been actively rotated into its current position by the eigenvector matrix $\mathbf{Q}$.
Summary Table:
| Concept | Geometric Meaning | Role in PCA |
|---|---|---|
| Eigenvector | An axis that is not rotated by the matrix. | A Principal Component (a new feature axis). |
| Eigenvalue | The stretch factor along an eigenvector. | The "importance" or variance of that axis. |
| Diagonalization | Rotating the space to align with eigenvectors. | Decoupling features to make them uncorrelated. |
Why Factorize? The Change of Basis Perspective
To understand why we write $\mathbf{C} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{-1}$, think of it as a three-step pipeline for processing data or robot states. Instead of applying a complex transformation $\mathbf{C}$ directly, we "simplify" the space first.
Computational Advantage: In Applied CS, factorization is used for efficiency. If you need to calculate the state of a system 100 steps into the future, calculating $\mathbf{C}^{100}$ is computationally expensive and prone to rounding errors. However, using the factorized form:
$$\mathbf{C}^{100} = (\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{-1})^{100} = \mathbf{Q}\mathbf{\Lambda}^{100}\mathbf{Q}^{-1}$$
Since $\mathbf{\Lambda}$ is diagonal, $\mathbf{\Lambda}^{100}$ is just each diagonal element $\lambda_i$ raised to the power of 100. This turns a massive matrix-matrix multiplication problem into a simple element-wise power problem.
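A sketch of this trick, comparing the eigen-decomposition route against NumPy's `matrix_power` (which performs repeated matrix multiplication):

```python
import numpy as np

C = np.array([[2.0, 0.8], [0.8, 0.6]])
eigvals, Q = np.linalg.eigh(C)  # symmetric, so Q^{-1} = Q^T

# Element-wise power on the eigenvalues instead of 100 matrix products
C_100 = Q @ np.diag(eigvals**100) @ Q.T

# Compare against repeated matrix multiplication
assert np.allclose(C_100, np.linalg.matrix_power(C, 100))
print("Eigen-decomposition power matches matrix_power.")
```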
For a symmetric positive-definite covariance matrix $\mathbf{C} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^T$, the inverse $\mathbf{C}^{-1}$ is calculated by applying the inverse property of matrix products $(ABC)^{-1} = C^{-1}B^{-1}A^{-1}$:
$$\mathbf{C}^{-1} = \left( \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^T \right)^{-1} = (\mathbf{Q}^T)^{-1} \mathbf{\Lambda}^{-1} \mathbf{Q}^{-1} = \mathbf{Q}\mathbf{\Lambda}^{-1}\mathbf{Q}^T$$
where $\mathbf{\Lambda}^{-1} = \text{diag}(1/\lambda_1, \dots, 1/\lambda_n)$; this exists because every eigenvalue of a positive-definite matrix is strictly positive.
The same decomposition, with normalized eigenvectors $\mathbf{q}_i^T \mathbf{q}_j = \delta_{ij}$ in $\mathbf{Q}$, also defines the square root of $\mathbf{C}$:
$$\mathbf{C}^{\frac{1}{2}} = \mathbf{Q} \mathbf{\Lambda}^{\frac{1}{2}} \mathbf{Q}^T$$
where $\mathbf{\Lambda}^{\frac{1}{2}}$ is the diagonal matrix containing the square roots of the (positive) eigenvalues $\sqrt{\lambda_i}$.
This definition is consistent because multiplying the matrix by itself retrieves the original covariance matrix $\mathbf{C}$:
$$\mathbf{C} = \mathbf{C}^{\frac{1}{2}} \mathbf{C}^{\frac{1}{2}} = \mathbf{Q} \mathbf{\Lambda}^{\frac{1}{2}} \mathbf{Q}^T \mathbf{Q} \mathbf{\Lambda}^{\frac{1}{2}} \mathbf{Q}^T = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^T$$
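Both the inverse formula and the square-root formula can be verified in a few lines (a sketch, using the same stand-in covariance matrix):

```python
import numpy as np

C = np.array([[2.0, 0.8], [0.8, 0.6]])
eigvals, Q = np.linalg.eigh(C)  # eigenvalues are positive (C is PD)

C_sqrt = Q @ np.diag(np.sqrt(eigvals)) @ Q.T
C_inv  = Q @ np.diag(1.0 / eigvals) @ Q.T

# The square root squares back to C; the inverse matches np.linalg.inv
assert np.allclose(C_sqrt @ C_sqrt, C)
assert np.allclose(C_inv, np.linalg.inv(C))
print("Square root and inverse formulas check out.")
```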
Beyond data science, eigen-decomposition is the primary tool used to determine if a system—such as a self-balancing robot or an autonomous vehicle—will remain under control or spiral into chaos.
The Mathematical Intuition
Imagine a robot’s state (position and velocity) is described by a vector $\mathbf{x}$. The way this state changes over time is often modeled by a system of linear equations:
$$\mathbf{x}_{t+1} = \mathbf{Fx}_t$$
If we apply this repeatedly, the state at any time $n$ is $\mathbf{x}_n = \mathbf{F}^n \mathbf{x}_0$. This is where eigen-decomposition ($\mathbf{F} = \mathbf{VDV}^{-1}$) becomes powerful. It allows us to view the system not as a tangled mess of variables, but as a set of independent "modes":$$\mathbf{x}_n = \mathbf{V} \mathbf{D}^n \mathbf{V}^{-1} \mathbf{x}_0$$
Why the Eigenvalues ($\lambda$) Dictate Reality
The diagonal matrix $\mathbf{D}^n$ contains the eigenvalues raised to the power of $n$ ($\lambda_1^n, \lambda_2^n, \dots$). This reveals the "long-term fate" of the robot: if all $|\lambda_i| < 1$, every mode decays and the system settles (stable); if any $|\lambda_i| > 1$, that mode grows exponentially (unstable); $|\lambda_i| = 1$ marks the boundary between the two (marginally stable).
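A small simulation sketch with two hypothetical system matrices (illustrative numbers, not taken from a real robot model): one with all $|\lambda| < 1$, one with a single $|\lambda| > 1$:

```python
import numpy as np

x0 = np.array([1.0, 1.0])

# Hypothetical system matrices (illustrative only)
F_stable   = np.array([[0.9, 0.1], [0.0, 0.8]])   # |lambda| < 1: modes decay
F_unstable = np.array([[1.05, 0.1], [0.0, 0.8]])  # one |lambda| > 1: blow-up

def state_after(F, n, x0):
    # x_n = V D^n V^{-1} x_0, computed via the eigen-decomposition of F
    D, V = np.linalg.eig(F)
    return (V @ np.diag(D**n) @ np.linalg.inv(V) @ x0).real

print(np.linalg.norm(state_after(F_stable, 100, x0)))    # shrinks toward 0
print(np.linalg.norm(state_after(F_unstable, 100, x0)))  # grows large
```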
Example: The "Segway" Problem
Consider a self-balancing robot. The matrix $\mathbf{F}$ captures the physics of gravity. Previously, we used $\mathbf{x}_k = \mathbf{F}\mathbf{x}_{k-1} + \mathbf{G}\mathbf{u}_{k-1}$ for passive prediction. However, if $\mathbf{F}$ has an eigenvalue $\lambda > 1$ (e.g., $\lambda = 1.05$), our model simply predicts a crash as errors grow by 5% every step.
To prevent this, the computer scientist implements a feedback loop by defining the control input as $\mathbf{u}_k = -\mathbf{K}\mathbf{x}_k$.
Here, $\mathbf{K}$ is the Gain Matrix—the software logic that determines how strongly the robot reacts to its current state. By substituting this back into our prediction model, we "rewrite" the physics:
$$\mathbf{x}_k = (\mathbf{F} - \mathbf{GK})\mathbf{x}_{k-1}$$
By choosing a $\mathbf{K}$ that shifts all eigenvalues of this new matrix $(\mathbf{F} - \mathbf{GK})$ into the unit circle ($|\lambda| < 1$), we turn an unstable "falling" system into a stable "self-correcting" one.
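A minimal sketch of this idea; the matrices and the gain $\mathbf{K}$ below are hand-picked illustrative numbers (a real design would use pole placement or LQR):

```python
import numpy as np

# Hypothetical open-loop dynamics and input matrix (illustrative only)
F = np.array([[1.05, 0.1], [0.0, 0.8]])
G = np.array([[0.0], [1.0]])

print(np.abs(np.linalg.eigvals(F)).max())  # > 1: open loop is unstable

# Hand-picked gain that places both closed-loop eigenvalues at 0.5
K = np.array([[3.025, 0.85]])

F_closed = F - G @ K
print(np.abs(np.linalg.eigvals(F_closed)).max())  # < 1: closed loop is stable
```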
Key Takeaway: In AI and Robotics, eigen-decomposition isn't just about "summarizing" data (like PCA); it is about designing the future. It allows us to verify if our software's logic ($\mathbf{K}$) will result in a smooth trajectory or a catastrophic crash.