These notes give a brief introduction to the procedure of principal component analysis. The reader is encouraged to consult additional resources for a deeper understanding of the method.

Principal Component Analysis

Principal component analysis (PCA) is a method for reducing the dimensionality of a dataset of correlated variables, while retaining as much as possible of the variance present in the dataset. This is a rather short note on a field with many applications, and is loosely based on [1]. For our application, the interest lies in classification with a reduced number of features.

Suppose that $\mathbf{x}$ is a vector of $p$ random variables (which in our case correspond to the $p$ feature images). We are looking for a number of uncorrelated variables $z_1, z_2, \dots$, which we will call the principal components of $\mathbf{x}$. In addition to being uncorrelated with each other, each variable $z_k$ will be a linear combination of the elements of $\mathbf{x}$. The first one, $z_1$, will be the first principal component, and will account for most of the variance in $\mathbf{x}$. The next principal component, $z_2$, will account for most of the remaining variance in $\mathbf{x}$, constrained on being uncorrelated with $z_1$. We continue this until we have found $m \leq p$ principal components that account for most of the variance in $\mathbf{x}$.

We start with the first PC,

$$z_1 = \mathbf{a}_1^T \mathbf{x} = \sum_{i=1}^{p} a_{1i} x_i,$$

which has a variance

$$\operatorname{var}(z_1) = \mathbf{a}_1^T \boldsymbol{\Sigma} \mathbf{a}_1.$$

This is the variance we want to maximize, but in order to obtain a finite solution, we constrain the optimization by the unit length condition

$$\mathbf{a}_1^T \mathbf{a}_1 = 1.$$
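As a quick numerical sanity check, the identity $\operatorname{var}(\mathbf{a}^T \mathbf{x}) = \mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a}$ can be verified on simulated data. A minimal sketch in NumPy; the data matrix and the direction vector below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))   # 500 samples of p = 3 features (simulated)
a = np.array([0.6, 0.8, 0.0])   # a direction satisfying a^T a = 1

S = np.cov(X, rowvar=False)     # sample covariance matrix, shape (p, p)
z = X @ a                       # projection z = a^T x for every sample

# The direct variance estimate matches the quadratic form a^T S a.
print(np.var(z, ddof=1), a @ S @ a)
```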

It turns out that, for $k = 1, \dots, p$, $\mathbf{a}_k$ will be an eigenvector of $\boldsymbol{\Sigma}$ corresponding to the $k$th largest eigenvalue $\lambda_k$.

Here, $\boldsymbol{\Sigma}$ is the covariance matrix of $\mathbf{x}$, such that $\Sigma_{ik} = \operatorname{cov}(x_i, x_k)$. For a dataset with $n$ samples $x_{ij}$, $j = 1, \dots, n$, for all features $i = 1, \dots, p$, the elements in the covariance matrix can be estimated as

$$\hat{\Sigma}_{ik} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{ij} - \hat{\mu}_i)(x_{kj} - \hat{\mu}_k),$$

where $\hat{\mu}_i$ is the sample mean of the $i$th feature,

$$\hat{\mu}_i = \frac{1}{n} \sum_{j=1}^{n} x_{ij}.$$
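Translated literally into code, the elementwise estimate is a double loop over feature pairs. A straightforward (unoptimized) sketch, assuming the data are stored in an $(n, p)$ array:

```python
import numpy as np

def cov_elementwise(X):
    """Estimate the covariance matrix element by element from (n, p) data."""
    n, p = X.shape
    mu = X.mean(axis=0)  # sample mean of each feature
    S = np.empty((p, p))
    for i in range(p):
        for k in range(p):
            S[i, k] = np.sum((X[:, i] - mu[i]) * (X[:, k] - mu[k])) / (n - 1)
    return S
```

Up to floating-point error, this agrees with the built-in `np.cov(X, rowvar=False)`.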

In the rest of the derivation, I will drop the hat notation on the sample mean and covariance, as the distinction is more important in the actual implementation than in the derivation of the principal components.

Arranging the feature samples and sample means into vectors $\mathbf{x}_j$ and $\boldsymbol{\mu}$ of size $p$, the estimate of the covariance matrix can be written as

$$\boldsymbol{\Sigma} = \frac{1}{n-1} \sum_{j=1}^{n} (\mathbf{x}_j - \boldsymbol{\mu})(\mathbf{x}_j - \boldsymbol{\mu})^T.$$
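The outer-product form maps directly onto a vectorized computation: stacking the centered samples as the rows of a matrix $D$ gives $\sum_j (\mathbf{x}_j - \boldsymbol{\mu})(\mathbf{x}_j - \boldsymbol{\mu})^T = D^T D$. A sketch under the same $(n, p)$ layout as before:

```python
import numpy as np

def cov_outer(X):
    """Covariance estimate via the sum of outer products."""
    n = X.shape[0]
    D = X - X.mean(axis=0)      # center every sample: rows are (x_j - mu)^T
    return (D.T @ D) / (n - 1)  # equals the sum of the n outer products
```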

Continuing with the variance optimization, we use the technique of Lagrange multipliers to incorporate the unit length constraint, that is, we will maximize the expression

$$\mathbf{a}_1^T \boldsymbol{\Sigma} \mathbf{a}_1 - \lambda (\mathbf{a}_1^T \mathbf{a}_1 - 1).$$

Computing the gradient of this expression w.r.t. $\mathbf{a}_1$, and setting it equal to zero, yields the equation

$$\boldsymbol{\Sigma} \mathbf{a}_1 - \lambda \mathbf{a}_1 = \mathbf{0},$$

or

$$(\boldsymbol{\Sigma} - \lambda I_p) \mathbf{a}_1 = \mathbf{0},$$

where $I_p$ is the $p \times p$ identity matrix. From this we realize that $\lambda$ is an eigenvalue of $\boldsymbol{\Sigma}$, and $\mathbf{a}_1$ is the corresponding eigenvector. Furthermore, $\lambda$ must be the largest eigenvalue $\lambda_1$: since $\operatorname{var}(z_1) = \mathbf{a}_1^T \boldsymbol{\Sigma} \mathbf{a}_1 = \lambda \mathbf{a}_1^T \mathbf{a}_1 = \lambda$, maximizing the variance subject to the constraint of unit length coefficients is equivalent to choosing the largest eigenvalue.
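In practice, then, the first principal direction comes straight out of an eigendecomposition of the covariance matrix. A minimal sketch using NumPy's symmetric eigensolver (the function name is mine):

```python
import numpy as np

def first_pc(X):
    """Return the largest eigenvalue of cov(X) and its unit eigenvector."""
    S = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues ascending
    return eigvals[-1], eigvecs[:, -1]    # last pair = largest eigenvalue
```

The returned eigenvalue is itself the maximized variance $\mathbf{a}_1^T \boldsymbol{\Sigma} \mathbf{a}_1 = \lambda_1$.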

In general, the $k$th principal component of $\mathbf{x}$ is $z_k = \mathbf{a}_k^T \mathbf{x}$, where $\mathbf{a}_k$ is the eigenvector of the covariance matrix $\boldsymbol{\Sigma}$ of $\mathbf{x}$ corresponding to the $k$th largest eigenvalue $\lambda_k$.
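Putting the pieces together, here is a sketch of the full reduction used for classification with fewer features; the function and variable names are my own, not from [1]:

```python
import numpy as np

def pca_transform(X, m):
    """Project (n, p) data X onto its first m principal components."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]  # eigenvalue indices, descending
    A = eigvecs[:, order[:m]]          # columns are a_1, ..., a_m
    Z = (X - mu) @ A                   # scores z_k = a_k^T (x - mu)
    return Z, eigvals[order[:m]]
```

The $m$ columns of `Z` then replace the original $p$ features as input to the classifier.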