Principal Component Analysis (PCA)¶

In PCA, we treat the Covariance Matrix $\mathbf{C}$ as a Rank 2 tensor that describes the physical "shape" of our data cloud. By applying eigen-decomposition, we perform a passive rotation to find a new coordinate system where the data is most simply described.

Why Eigen-decomposition works for PCA¶

When we decompose $\mathbf{C} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^T$, we are essentially "un-sandwiching" the matrix to find its core properties:

Orthogonality: Because $\mathbf{C}$ is a symmetric matrix, its eigenvectors are always perpendicular (orthogonal) to each other. This is a massive advantage: it means our new Principal Components ($\text{PC}_1, \text{PC}_2, \dots$) form a perfect, rigid coordinate grid—just like the original $X$ and $Y$ axes, but rotated.
Variance Mapping: The eigenvalues $\lambda_i$ correspond exactly to the amount of variance in the data along the direction of their respective eigenvectors.The largest eigenvalue $\lambda_1$ points to the direction of maximum spread.The smallest eigenvalue points to the direction of least information.
Total Variance (The Trace): One of the most beautiful properties of this transformation is that the Trace is invariant under rotation. The sum of the diagonal of $\mathbf{C}$ (original variances) is equal to the sum of the eigenvalues (variances in the new system):$$\sum \text{Var}(\text{Original Axes}) = \sum \lambda_i$$This proves that PCA doesn't lose any information; it just "repackages" it.

In [1]:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
import matplotlib.transforms as transforms

def draw_confidence_ellipse(mu, sigma, ax, n_std=2.0, facecolor='none', **kwargs):
    """
    Utility to create a plot of the covariance confidence ellipse.
    """
    # 1. Compute Eigen-decomposition of the covariance matrix
    vals, vecs = np.linalg.eigh(sigma)
    order = vals.argsort()[::-1]
    vals, vecs = vals[order], vecs[:, order]
    
    # 2. Calculate the angle of rotation
    theta = np.degrees(np.arctan2(*vecs[:, 0][::-1]))
    
    # 3. Width and height are standard deviations (sqrt of eigenvalues)
    width, height = 2 * n_std * np.sqrt(vals)
    
    # 4. Create the ellipse
    ell = Ellipse(xy=mu, width=width, height=height, angle=theta, 
                  facecolor=facecolor, **kwargs)
    return ax.add_patch(ell)

# --- Parameters & Data ---
np.random.seed(42)
od = 0.8
Sigma = np.array([[2, od], [od, 0.6]])
mu = np.array([2., 1.])
nb_data = 100
X = np.random.multivariate_normal(mu, Sigma, nb_data)

# --- Statistics ---
mean_emp = X.mean(axis=0)
cov_mat = np.cov(X, rowvar=False)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

# --- Visualization ---
fig, ax = plt.subplots(figsize=(7, 7))

# 1. Plot the sampled data
ax.scatter(X[:, 0], X[:, 1], c='b', marker='x', alpha=0.6, label="Sampled Data")

# 2. Plot the TRUE underlying distribution (The Ellipse)
# We draw the 2-sigma ellipse (approx 95% of data)
draw_confidence_ellipse(mu, Sigma, ax, n_std=2.0, edgecolor='black', 
                         linestyle='--', label=r'True Dist. (2$\sigma$)')

# 3. Plot the Eigenvectors (Empirical)
colors = ['r', 'g']
for i in range(2):
    std_dev = np.sqrt(eig_vals[i])
    v = eig_vecs[:, i] * std_dev
    ax.quiver(mean_emp[0], mean_emp[1], v[0], v[1], 
              color=colors[i], angles='xy', scale_units='xy', scale=1, 
              label=fr'PC {i+1} ($\sigma$)')

# Formatting
ax.axhline(0, color='grey', lw=0.5)
ax.axvline(0, color='grey', lw=0.5)
ax.set_aspect('equal')
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
ax.set_title('PCA and Underlying Gaussian Distribution')
ax.legend()
plt.grid(True, alpha=0.3)
plt.show()

No description has been provided for this image

The Data Cloud (Blue Crosses): These are your $n=40$ measurements. Their elongated shape shows correlation: when $x_1$ increases, $x_2$ usually does too. PCA aims to "unmix" this relationship.
The Ground Truth (Dashed Ellipse): This is the 2-sigma contour ($2\sigma$) of the true Gaussian distribution. It represents the "hidden" model that generated the data. About 95% of all potential points fall within this boundary.
The Principal Components (Arrows): These are the Eigenvectors of the covariance matrix, centered at the empirical mean:
- Red Arrow ($PC_1$): Points in the direction of maximum variance. This is the "main axis" of your data.
- Green Arrow ($PC_2$): Points in the direction of minimum variance, perpendicular to the first.
- Length: The arrows are scaled by $\sqrt{\lambda}$ (standard deviation). A longer arrow means more "information" or spread is captured along that axis.

The visualization summarizes the Eigen-decomposition of a Covariance Matrix:

Direction: The eigenvectors (the arrows) define the orientation of the data cloud.
Magnitude: The eigenvalues (captured here as standard deviations via sqrt) define the "stretch" or width of the distribution.
Orthogonality: Because the covariance matrix is symmetric, the eigenvectors are guaranteed to be orthogonal (perpendicular), creating a new, natural coordinate system for your data.

The Takeaway

The plot shows that PCA has found a new coordinate system. Instead of using $x_1$ and $x_2$, we can describe the data more efficiently using the red and green axes. In dimensionality reduction, we would keep the red axis ($PC_1$) and discard the green one ($PC_2$) because it contains less variance.

Dimensionality Reduction¶

To project the data onto the first Principal Component, we mathematically "collapse" every 2D point onto the line defined by the first eigenvector.

The Projection Step

If $\mathbf{v}_1$ is our first eigenvector (the red arrow), the 1D projection $z^{(k)}$ of a centered data point $\tilde{\mathbf{x}}^{(k)}$ is calculated via the dot product:$$z^{(k)} = \tilde{\mathbf{x}}^{(k)} \cdot \mathbf{v}_1$$This gives us a single coordinate representing the point's position along the axis of maximum variance.

In [2]:

# 1. Center the data (X is your original 40x2 array)
mean_emp = X.mean(axis=0)
X_centered = X - mean_emp

# 2. Get the first eigenvector (the direction of the red arrow)
# Ensure you use the eigenvector corresponding to the largest eigenvalue
v1 = eig_vecs[:, 0] 

# 3. Project the 2D data onto the 1D line
# This results in 40 scalar values
z1 = X_centered @ v1

# 4. Reconstruct for visualization
# This maps the 1D values back into 2D space along the PC1 axis
X_reconstructed = np.outer(z1, v1) + mean_emp

# --- Visualization ---
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], c='b', marker='x', alpha=0.3, label="Original 2D Data")
plt.scatter(X_reconstructed[:, 0], X_reconstructed[:, 1], c='r', label="Projected 1D Data")

# Draw the axis line
plt.plot(X_reconstructed[:, 0], X_reconstructed[:, 1], color='r', alpha=0.2)

plt.gca().set_aspect('equal')
plt.legend()
plt.title("Projection onto the 1st Principal Component")
plt.show()

Why this matters

Compression: We now only need one number ($z_1$) per data point instead of two, yet we've preserved the maximum possible variance.
Information Loss: The distance of the black dashed lines represents the "error" or information lost. Because we chose the eigenvector with the largest eigenvalue, this error is mathematically minimized.

What is happening mathematically?

We are performing a change of basis.

The original data uses the standard basis $\mathbf{e}_1 = [1,0]$ and $\mathbf{e}_2 = [0,1]$.
The projection uses the Eigenvector basis. By calculating $\tilde{\mathbf{X}}\mathbf{v}_1$, we find the coordinates of our data points relative to that red arrow.

PCA as a Dimensionality Reduction Tool¶

The "magic" of PCA happens when we decide to ignore the eigenvectors with small eigenvalues.

If we have a 100-dimensional dataset, but the first 3 eigenvalues capture 95% of the total variance, we can "project" our data onto those 3 eigenvectors and discard the rest. We effectively simplify our "world" from 100 dimensions down to 3, while keeping almost all the important structural information.

The Scree Plot: Deciding What to Keep¶

A Scree Plot is a diagnostic tool used to visualize the importance of each Principal Component. It plots the eigenvalues (variance) against the component number.

In a tensor context, the scree plot tells us which "dimensions" of the Rank 2 tensor are physically significant and which are just noise. We look for the "elbow" of the plot—the point where the variance drops off sharply.

Python Code: Generating a Scree Plot This script calculates PCA on a synthetic dataset and visualizes how much "information" each eigenvalue captures.

In [3]:

import numpy as np
import matplotlib.pyplot as plt

# 1. Create synthetic data with 5 dimensions
# (Only 2 dimensions will have significant variance)
np.random.seed(42)
n_samples = 100
data = np.dot(np.random.randn(n_samples, 2), [[2, 0.5], [0.5, 1]]) # Main signal
noise = np.random.normal(0, 0.2, (n_samples, 3))                 # Noise
X = np.hstack([data, noise])

# 2. PCA Steps
X_centered = X - np.mean(X, axis=0)
cov_matrix = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort eigenvalues in descending order
eigenvalues = eigenvalues[::-1]
total_variance = np.sum(eigenvalues)
explained_variance_ratio = eigenvalues / total_variance

# 3. Plotting the Scree Plot
plt.figure(figsize=(8, 5))
components = np.arange(1, len(eigenvalues) + 1)

# Bar plot for individual variance
plt.bar(components, explained_variance_ratio, alpha=0.7, color='blue', label='Individual Variance')

# Step plot for cumulative variance
plt.step(components, np.cumsum(explained_variance_ratio), where='mid', color='red', label='Cumulative Variance')

plt.xlabel('Principal Component Index')
plt.ylabel('Proportion of Explained Variance')
plt.title('Scree Plot: Finding the "Elbow"')
plt.xticks(components)
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

Interpretation¶

The Bars: Show how much of the "Total Variance" (the Trace) is captured by each eigenvector.
The Line: Shows the cumulative total. If the first two components reach 90%, it means you can describe the 5D dataset using only 2 dimensions with minimal information loss.
The "Elbow": The point where the curve flattens out. In a tensor interpretation, this is where we distinguish the Signal (the primary axes of the tensor) from the Noise (the tiny, insignificant fluctuations).

Dimensionality Reduction: Truncating the Tensor¶

Dimensionality reduction is the ultimate goal of PCA. From a tensor perspective, we are identifying which dimensions of the space are physically meaningful (the signal) and which are just background fluctuations (the noise).

By looking at the Scree Plot, we can decide to keep only the top $k$ eigenvectors. We then form a projection matrix $\mathbf{V}$ using these $k$ vectors. When we project our data:$$\mathbf{X}_{reduced} = \mathbf{X}_{centered} \mathbf{V}$$We are performing a passive rotation into a new coordinate system and then simply "ignoring" the axes that don't carry enough variance.

The Geometry of Truncation When we perform PCA, we aren't just "deleting" columns of data; we are choosing a new subspace that best represents the "energy" or "mass" of our data cloud.

Constructing the Projection Matrix ($\mathbf{V}$)

After sorting our eigenvalues ($\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$), we select the first $k$ eigenvectors to form our projection matrix $\mathbf{V}$. This matrix is a $d \times k$ operator.

Each column of $\mathbf{V}$ is a unit vector (a direction) in the original $d$-dimensional space.
These $k$ vectors are all orthogonal to one another, defining a perfectly flat $k$-dimensional "hyperplane" that passes through the center of the data.

The Projection Equation: $\mathbf{X}_{reduced} = \mathbf{X}_{centered} \mathbf{V}$

This multiplication is a Passive Transformation. For every data point (row) in $\mathbf{X}_{centered}$:

The dot product between the data point and the first eigenvector gives its new "PC1 coordinate."
The dot product with the second eigenvector gives the "PC2 coordinate," and so on.

By doing this, we are effectively looking at the data from the perspective of the "best" $k$ axes and ignoring the remaining $d-k$ axes.

Why it is "Truncating the Tensor"

In its full form, the covariance matrix (our Rank 2 tensor) is:$$\mathbf{C} = \sum_{i=1}^{d} \lambda_i \mathbf{v}_i \mathbf{v}_i^T$$When we reduce dimensions, we are approximating the tensor by cutting off the end of this sum:$$\mathbf{C}_{approx} \approx \sum_{i=1}^{k} \lambda_i \mathbf{v}_i \mathbf{v}_i^T$$We are claiming that the "physical relationship" described by the tensor is almost entirely contained within those first $k$ directions. The remaining variance (the sum of the ignored eigenvalues) is treated as isotropic noise—shaking that doesn't follow a meaningful pattern.

Summary of Information Loss

The difference between $\mathbf{X}_{centered}$ and $\mathbf{X}_{reconstructed}$ is the Reconstruction Error.

If the eigenvalues we discarded are near zero, the error is negligible. In tensor terms, we have successfully removed the "dimensions of least resistance," keeping only the axes where the data actually lives.

Measuring Information Retention

In the tensor perspective, the Trace of the covariance matrix represents the "Total Energy" or "Total Variance" of the system. Since the Trace is invariant under rotation, the sum of our eigenvalues must equal the total variance of the original data.When we reduce the dimensions from $d$ to $k$, we can calculate the percentage of information kept using this formula:$$\text{Explained Variance} = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{d} \lambda_j}$$

PCA: Summary¶

Step in PCA	Tensor Interpretation
Calculate $\mathbf{C}$	Defining the Rank 2 "relationship" between all variables.
Find Eigenvectors	Finding the "Natural Axes" (characteristic directions) of the data cloud.
Sort Eigenvalues	Ranking axes by how much "information" (variance) they carry.
Project Data	A Change of Basis (Passive Rotation) to the new coordinate system.
Dimension Reduction	Truncating the tensor to its most significant physical dimensions.