import numpy as np
import pickle
import os
import urllib.request
import tarfile
%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
The following Python code downloads the image data and reshapes it.
# --- 1. Configuration and Paths ---
url = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
# We define a local folder to store the data
# (expanduser turns "~" into the real home directory; os functions
# do not expand it on their own)
data_dir = os.path.expanduser("~/tmp/cifar_data")
tar_filename = "cifar-10-python.tar.gz"
tar_path = os.path.join(data_dir, tar_filename)
# This is the folder created after extraction
batch_folder = os.path.join(data_dir, "cifar-10-batches-py")

# --- 2. Automated Download ---
if not os.path.exists(data_dir):
    os.makedirs(data_dir)
if not os.path.exists(tar_path):
    print(f"Downloading CIFAR-10 from {url}...")
    # This is the Python equivalent of 'wget'
    urllib.request.urlretrieve(url, tar_path)
    print("Download finished.")
else:
    print("Archive already exists, skipping download.")

# --- 3. Automated Extraction ---
if not os.path.exists(batch_folder):
    print("Extracting archive...")
    # This is the Python equivalent of 'tar xf'
    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall(path=data_dir)
    print("Extraction finished.")
else:
    print("Data already extracted.")
# --- 4. Loading the Data (Your Original Logic) ---
def load_cifar_batch(file_path):
    with open(file_path, 'rb') as f:
        # encoding='bytes' is required for CIFAR-10 pickle files
        data_dict = pickle.load(f, encoding='bytes')
    return data_dict
# Load the first batch
batch_1_path = os.path.join(batch_folder, "data_batch_1")
cifar_10 = load_cifar_batch(batch_1_path)
X_original = cifar_10[b'data'] # Shape: (10000, 3072)
# --- 5. Pre-processing for PCA ---
# Reshape to (N, Channels, Height, Width) -> (10000, 3, 32, 32)
X_reshaped = X_original.reshape((-1, 3, 32, 32))
# Transpose to (N, Height, Width, Channels) for visualization -> (10000, 32, 32, 3)
X_images = X_reshaped.transpose(0, 2, 3, 1)
# Back to design matrix for PCA (N, 3072)
X_data = X_images.reshape(-1, 3072)
print(f"Successfully loaded data matrix with shape: {X_data.shape}")
def show(i, autoscale=False, title="Sample Image"):
    plt.figure(figsize=(1.5, 1.5))
    image = i.reshape((32, 32, 3))
    if autoscale:
        # Min-Max Scaling: (value - min) / (max - min)
        # This maps the tensor values to the [0, 1] range required by imshow
        m, M = image.min(), image.max()
        # Adding a small epsilon prevents division by zero if the image is a solid color
        plt.imshow((image - m) / (M - m + 1e-8))
    else:
        plt.imshow(image)
    plt.title(title)
    plt.axis('off')
    plt.show()
show(X_data[12])
Before we apply the PCA algorithm, we need to transform the images. In machine learning, this is often called mean subtraction and length normalization.

Mathematically, mean subtraction is a vector translation. X_data.mean(axis=0) is the "average image" of the entire dataset: it captures the shared characteristics of all CIFAR-10 images (common background colors, average lighting, etc.). Subtracting it shifts the entire data cloud so that its centroid (the average image) sits at the origin $(0, 0, \dots, 0)$.
X_data = X_data - X_data.mean(axis=0)
The second step is a scaling operation that acts on each image vector individually: a unit-norm scaling that ensures every image vector has a length (magnitude) of exactly $1$.
X_data = X_data / np.sqrt((X_data ** 2).sum(axis=1))[:,None]
X_data.dtype, X_data.shape
After these two steps, X_data has a specific structure: the data cloud is centered at the origin, and every image vector lies on the unit sphere in $\mathbb{R}^{3072}$.
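As a quick sanity check, the effect of these two steps can be sketched on synthetic data (X_toy stands in for X_data; the shapes are illustrative only):

```python
import numpy as np

# Synthetic stand-in for X_data: 100 "images" with 12 features each
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 12))

# Step 1: mean subtraction -- every feature (column) now has zero mean,
# i.e. the centroid of the data cloud sits at the origin
X_toy = X_toy - X_toy.mean(axis=0)
assert np.allclose(X_toy.mean(axis=0), 0.0)

# Step 2: unit-norm scaling -- every sample (row) now has length 1,
# i.e. every point lies on the unit sphere
X_toy = X_toy / np.sqrt((X_toy ** 2).sum(axis=1))[:, None]
assert np.allclose(np.linalg.norm(X_toy, axis=1), 1.0)
```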
Image compression is a direct application of truncating the eigenbasis. Instead of keeping all $3072$ pixels ($32 \times 32 \times 3$), we keep only the $k$ most significant eigenvectors (the ones with the largest eigenvalues).

The process: each color channel is treated as its own dataset of $1024$-dimensional vectors, and we keep n_components instead of $1024$ ($32^2$).

X = X_data.copy()
X = X.reshape(-1, 32*32, 3)
n_components = 1000  # number of components to compute; we truncate further for reconstruction below
# do a pca compression for a given channel
def compress_channel(channel, n_components=n_components):
    # Initialize the PCA model to find the 'n_components' most important
    # directions (eigenvectors)
    pca = PCA(n_components=n_components)
    # Calculate the covariance matrix and its eigen-decomposition
    pca.fit(channel)
    # Project the data into the reduced subspace by dotting it with the
    # 'loadings' (eigenvectors). We spell the projection out for educational
    # clarity; pca.transform(channel) is equivalent up to mean-centering.
    compressed_channel = np.dot(channel, pca.components_.T)
    # Alternatively, both steps can be done at once with fit_transform:
    # 1. Fit: calculate the covariance matrix and its eigen-decomposition.
    # 2. Transform: project the pixels onto the new principal component axes.
    # compressed_channel = pca.fit_transform(channel)
    return compressed_channel, pca
1. The Design Matrix ($\mathbf{X}$):
The channel input is a 2D array $\mathbf{X} \in \mathbb{R}^{N \times D}$.
2. The Covariance and Eigen-decomposition
The call pca.fit() calculates the covariance matrix $\mathbf{C}$:
$$\mathbf{C} = \frac{1}{N-1} \mathbf{X}^T \mathbf{X}$$
It then solves the eigenvalue problem:
$$\mathbf{C}\mathbf{q} = \lambda \mathbf{q}$$
This produces $D$ eigenvectors ($\mathbf{q}$). The eigenvectors with the $k$ largest eigenvalues ($\lambda$) are the principal components. These are stored in pca.components_.
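The claim that pca.components_ holds the top eigenvectors of $\mathbf{C}$ can be verified on synthetic data (a sketch; eigenvectors are only defined up to sign, hence the comparison of absolute values):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic centered design matrix with clearly separated variances
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 6)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])
X_toy = X_toy - X_toy.mean(axis=0)

# Covariance matrix C = X^T X / (N - 1) and its eigen-decomposition
C = X_toy.T @ X_toy / (X_toy.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(C)          # eigh returns ascending order
order = np.argsort(eigvals)[::-1]             # re-sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# sklearn stores the top-k eigenvectors as the rows of pca.components_
k = 3
pca_toy = PCA(n_components=k).fit(X_toy)
for j in range(k):
    assert np.allclose(np.abs(pca_toy.components_[j]), np.abs(eigvecs[:, j]))
# ...and the corresponding eigenvalues as explained_variance_
assert np.allclose(pca_toy.explained_variance_, eigvals[:k])
```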
3. The "Compression" (Projection)
In the code, we use np.dot(channel, pca.components_.T).
Let's look at the shapes to see the math:
- channel is $\mathbf{X}$ of shape $(N, D)$.
- pca.components_.T is a matrix $\mathbf{Q}$ of shape $(D, k)$, where $k$ is n_components; its columns are the $k$ principal components.

The multiplication projects the images onto the principal components:
$$\mathbf{Z} = \mathbf{X} \mathbf{Q}$$
PCA is simply a linear map: $$\text{Pixels} \xrightarrow{\text{Matrix } \mathbf{Q}} \text{Principal Components}$$
4. The Cumulative Explained Variance (see the plot below)
Mathematically, the total variance is the trace (sum of diagonal elements) of the covariance matrix:
$$\text{Total Variance} = \sum_{i=1}^{D} \lambda_i$$
The explained_variance_ratio_ of component $j$ is:
$$\frac{\lambda_j}{\sum_{i=1}^{D} \lambda_i}$$
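This ratio is easy to verify numerically; a sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(150, 8))
X_toy = X_toy - X_toy.mean(axis=0)

# Keep all D components so the eigenvalue list is complete
pca_toy = PCA().fit(X_toy)
lam = pca_toy.explained_variance_             # the eigenvalues lambda_i

# explained_variance_ratio_ is lambda_j / sum(lambda_i)
assert np.allclose(pca_toy.explained_variance_ratio_, lam / lam.sum())

# The cumulative curve is just the running sum; keeping all D components
# retains 100% of the variance
assert np.isclose(np.cumsum(pca_toy.explained_variance_ratio_)[-1], 1.0)
```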
pca_info = []
# loop over all three channels
for c in range(3):
    pca_info.append(compress_channel(X[:, :, c]))
plt.figure(figsize=(8, 5))
colors = ['red', 'green', 'blue']
for i, (_, pca) in enumerate(pca_info):
    cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
    plt.plot(cumulative_variance, color=colors[i], label=f'Channel {i}')
# Draw a line at 99% to see where it crosses
plt.axhline(y=0.99, color='gray', linestyle='--', alpha=0.3)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('How much "Image Energy" are we keeping?')
plt.legend()
plt.show()
We choose the number of components used for the reconstruction of the images in the decoder:
k_prime = nb_components_for_reconstruction = 400
# 1. Total original bytes (Raw Pixels)
X_bytes = X.nbytes
# 2. Compute compressed bytes based on truncated shapes
# N = number of images, D = original dimensions per channel (1024)
N, D = X.shape[0], 1024
# We store N images, but each has only k_prime components instead of D pixels
# We multiply by 8 because PCA outputs are float64 (8 bytes per element)
# for each channel ( * 3)
coords_bytes = N * k_prime * 8 * 3
# We store the "Dictionary" (Eigenvectors)
# Each eigenvector is D pixels long, and we keep k_prime of them for each channel (3)
eigenvector_bytes = k_prime * D * 8 * 3
compressed_bytes = (coords_bytes + eigenvector_bytes)
# Calculations
space_saved = 1.0 - (compressed_bytes / X_bytes)
compression_ratio = X_bytes / compressed_bytes
print(f"Space Saved: {space_saved:.2%}")
print(f"Compression Ratio: {compression_ratio:.2f}x")
The reconstruction process is a Change of Basis that maps our low-dimensional coordinates back into the original pixel space. By choosing a specific $k'$, we determine the "rank" of our approximation.
# reconstruct the data using the first nb_components components
def reconstruct(compressed_channel, pca, nb_components):
    return np.dot(compressed_channel[:, :nb_components], pca.components_[:nb_components])
In the function reconstruct(compressed_channel, pca, nb_components), we work with the following matrices:
- compressed_channel[:, :nb_components]: the truncated score matrix $\mathbf{Z}'$, containing the first $k'$ columns of the projected data.
- pca.components_[:nb_components]: the truncated basis matrix $\mathbf{Q}'^T$, containing the first $k'$ eigenvectors (principal components).

The approximation $\mathbf{\hat{X}}$ is calculated via the dot product:
$$\mathbf{\hat{X}} = \mathbf{Z}' \mathbf{Q}'^T$$
Mathematically, for a single image, this reconstruction is a weighted sum of patterns:
$$\text{Image}_i \approx z_{i,1}\mathbf{q}_1 + z_{i,2}\mathbf{q}_2 + \dots + z_{i,k'}\mathbf{q}_{k'}$$
By increasing $k'$, you add more "building blocks" (eigenvectors), which adds finer detail to the image. If $k'$ is low, the reconstruction only captures the global structure (low-frequency information). As $k'$ approaches $D$, the reconstruction becomes a perfect replica of the original.
Let's look at the shapes of the tensors in the reconstruct function:
- compressed_channel[:, :nb_components] is $\mathbf{Z}'$, the first $k'$ columns of $\mathbf{Z}$ (the projected data). Its shape is $(N, k')$; it records how much of each basis pattern is present in each image.
- pca.components_[:nb_components] is the reduced eigenvector matrix $\mathbf{Q}'^T$. Its shape is $(k', D)$; each row is a principal component (a 1024-pixel pattern).
The np.dot of these two matrices calculates the reconstructed images:
$$\mathbf{\hat{X}} = \mathbf{Z}' \mathbf{Q}'^T$$
The result $\mathbf{\hat{X}}$ has the shape $(N, 1024)$: we have moved from the small $k'$-dimensional space back into the large $D$-dimensional pixel space.
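The matrix product and the weighted sum of eigenvector patterns give the same reconstruction; a small sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(50, 16))
X_toy = X_toy - X_toy.mean(axis=0)

k_prime_toy = 4
pca_toy = PCA(n_components=k_prime_toy).fit(X_toy)
Z_toy = X_toy @ pca_toy.components_.T          # scores Z', shape (50, k')

# Matrix form: X_hat = Z' Q'^T
X_hat_toy = Z_toy @ pca_toy.components_

# Weighted-sum form for a single image i:
# image_i ~ z_{i,1} q_1 + ... + z_{i,k'} q_{k'}
i = 7
weighted_sum = sum(Z_toy[i, j] * pca_toy.components_[j] for j in range(k_prime_toy))
assert np.allclose(X_hat_toy[i], weighted_sum)
```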
reconstructed = []
for compressed, pca in pca_info:
    reconstructed.append(reconstruct(compressed, pca, k_prime))
hat_X = np.concatenate(reconstructed, axis=1)
hat_X = hat_X.reshape(-1,3,1024).transpose(0,2,1)
hat_X[4].shape
X.shape
i=12
show(X[i], True, "original image")
show(hat_X[i], True, "compressed image")
Beyond simple dimensionality reduction, PCA can be used for Image Whitening to decorrelate pixels and standardize feature variance. By setting whiten=True in the PCA constructor, each principal component is scaled by the inverse square root of its eigenvalue, ensuring that every transformed feature has a variance of exactly 1.0. This process effectively removes redundant spatial information and acts like a high-pass filter, highlighting structural edges while suppressing global lighting gradients.
For machine learning models like CNNs, this "leveled" input space can help optimizers converge faster by preventing high-variance features from dominating the learning process. However, because standard PCA whitening discards the natural spatial locality of pixels, the resulting images appear as abstract, "ghostly" versions of the original. In modern deep learning, this global preprocessing step is frequently substituted by simpler channel-wise normalization or internal Batch Normalization layers.
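A minimal sketch of whitening on synthetic data (the same whiten=True flag works on a CIFAR-10 channel matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

# Correlated synthetic features (a random linear mix)
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))

pca_toy = PCA(n_components=5, whiten=True)
Z_white = pca_toy.fit_transform(X_toy)

# Each component is rescaled by 1/sqrt(eigenvalue), so every transformed
# feature has unit variance and the features are decorrelated
assert np.allclose(Z_white.std(axis=0, ddof=1), 1.0)
assert np.allclose(np.cov(Z_white, rowvar=False), np.eye(5), atol=1e-6)
```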