While the element-wise definition ($cov(x_i, x_j)$) is useful for understanding the relationship between two specific variables, we rarely calculate the matrix element-by-element in practice. Instead, we use vector algebra to compute the entire matrix at once.
Given a set of $m$ data points (samples) $\{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(m)}\}$, where each $\mathbf{x}^{(i)} \in \mathbb{R}^n$, and the empirical mean vector:$$\mathbf{\bar{x}} = \frac{1}{m} \sum_{i=1}^{m} \mathbf{x}^{(i)}$$The Empirical Covariance Matrix $\mathbf{C}$ is defined as the average of the "spread" of the data around the mean:$$\mathbf{C} = \frac{1}{m-1} \sum_{i=1}^{m} (\mathbf{x}^{(i)} - \mathbf{\bar{x}})(\mathbf{x}^{(i)} - \mathbf{\bar{x}})^T$$
To see why this matrix form matches the element-wise definition, consider the $(j, k)$-th entry of the sum above. Let $\mathbf{v}^{(i)} = \mathbf{x}^{(i)} - \mathbf{\bar{x}}$ denote the $i$-th centered sample.
The product of a column vector and a row vector—the outer product—results in a matrix:$$\mathbf{v}^{(i)} (\mathbf{v}^{(i)})^T = \begin{pmatrix} v_1^{(i)} \\ \vdots \\ v_n^{(i)} \end{pmatrix} \begin{pmatrix} v_1^{(i)} & \dots & v_n^{(i)} \end{pmatrix} = \begin{pmatrix} v_1^{(i)}v_1^{(i)} & \dots & v_1^{(i)}v_n^{(i)} \\ \vdots & \ddots & \vdots \\ v_n^{(i)}v_1^{(i)} & \dots & v_n^{(i)}v_n^{(i)} \end{pmatrix}$$
Comparing this to the element-wise definition: the $(j, k)$-th entry of $\mathbf{v}^{(i)} (\mathbf{v}^{(i)})^T$ is $v_j^{(i)} v_k^{(i)} = (x_j^{(i)} - \bar{x}_j)(x_k^{(i)} - \bar{x}_k)$, so averaging these entries over all $m$ samples recovers exactly $cov(x_j, x_k)$. Each outer product $(\mathbf{x}^{(i)} - \mathbf{\bar{x}})(\mathbf{x}^{(i)} - \mathbf{\bar{x}})^T$ therefore represents the contribution of a single sample to the covariance matrix.
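The outer-product sum can be computed directly in NumPy (a minimal sketch; the data here is random and purely illustrative) and checked against `np.cov`:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3                      # m samples, n variables (illustrative sizes)
X = rng.normal(size=(m, n))       # each row is one sample x^(i)

x_bar = X.mean(axis=0)            # empirical mean vector

# Accumulate the outer products (x^(i) - x_bar)(x^(i) - x_bar)^T
C = np.zeros((n, n))
for i in range(m):
    v = X[i] - x_bar              # centered sample v^(i)
    C += np.outer(v, v)           # rank-1 contribution of sample i
C /= m - 1

# np.cov treats rows as variables by default, so pass rowvar=False
assert np.allclose(C, np.cov(X, rowvar=False))
```

The explicit loop mirrors the sum in the definition; the data-matrix forms below replace it with a single matrix product.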
If we organize our $m$ measurements (centered by subtracting the mean $\bar{\mathbf{x}}$) as rows in an $m \times n$ matrix $\tilde{\mathbf{X}}$:
$$\tilde{\mathbf{X}} = \begin{pmatrix} — & (\mathbf{x}^{(1)} - \bar{\mathbf{x}})^T & — \\ — & (\mathbf{x}^{(2)} - \bar{\mathbf{x}})^T & — \\ & \vdots & \\ — & (\mathbf{x}^{(m)} - \bar{\mathbf{x}})^T & — \end{pmatrix}$$
The covariance matrix $\mathbf{C}$ (which must be $n \times n$) is then:
$$\mathbf{C} = \frac{1}{m-1} \tilde{\mathbf{X}}^T \tilde{\mathbf{X}}$$
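With samples as rows, the formula above is one centering step and one matrix product (a sketch with random illustrative data):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 4
X = rng.normal(size=(m, n))          # samples as rows: m x n

X_tilde = X - X.mean(axis=0)         # subtract the mean of each variable (column)
C = X_tilde.T @ X_tilde / (m - 1)    # n x n covariance matrix

assert np.allclose(C, np.cov(X, rowvar=False))
```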
Alternatively, we can organize each centered data vector as a column of an $n \times m$ matrix $\hat{\mathbf{X}}$; the covariance matrix is then:
$$\mathbf{C} = \frac{1}{m-1} \hat{\mathbf{X}} \hat{\mathbf{X}}^T$$
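The column-wise layout only swaps the transpose (again a sketch with random data); note that it matches NumPy's default convention, where `np.cov` treats each row as a variable:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 100
X_hat = rng.normal(size=(n, m))                     # samples as columns: n x m
X_hat = X_hat - X_hat.mean(axis=1, keepdims=True)   # center each variable (row)

C = X_hat @ X_hat.T / (m - 1)                       # n x n covariance matrix

# np.cov's default (rowvar=True) assumes exactly this layout
assert np.allclose(C, np.cov(X_hat))
```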
The empirical covariance can be expressed at different levels of abstraction. Each form is mathematically equivalent but serves a different purpose in implementation and theory.
| Level | Formula | Best used for... |
|---|---|---|
| Scalar | $cov(x,y) = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})$ | Understanding the statistical relationship between two specific variables. |
| Outer Product | $\mathbf{C} = \frac{1}{m-1} \sum_{i=1}^{m} \mathbf{v}^{(i)}(\mathbf{v}^{(i)})^T$ | Theoretical derivations, Eigen-analysis, and Rank-1/Rank-$\mu$ updates in CMA-ES. |
| Data Matrix | $\mathbf{C} = \frac{1}{m-1} \tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$ | High-performance computation and software implementations (assuming samples as rows). |
Note on Notation: In the Data Matrix form above, $\tilde{\mathbf{X}}$ is an $m \times n$ matrix where each row is a centered observation vector $\mathbf{v}^{(i)} = (\mathbf{x}^{(i)} - \bar{\mathbf{x}})$.
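As a consistency check across the levels in the table, the scalar formula for two variables can be compared against the corresponding entry of the data-matrix form (random illustrative data; variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 200
X = rng.normal(size=(m, 2))          # two variables: x = X[:, 0], y = X[:, 1]

# Scalar form: cov(x, y)
x, y = X[:, 0], X[:, 1]
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (m - 1)

# Data-matrix form: C = X_tilde^T X_tilde / (m - 1)
X_tilde = X - X.mean(axis=0)
C = X_tilde.T @ X_tilde / (m - 1)

# The off-diagonal entry of C is exactly the scalar covariance
assert np.isclose(cov_xy, C[0, 1])
```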