These notes give a brief introduction to the procedure of principal component analysis. The reader is encouraged to consult additional resources for a deeper understanding of the method.

Principal Component Analysis

Principal component analysis (PCA) is a method for reducing the dimensionality of a dataset of correlated variables, while retaining as much as possible of the variance present in the dataset. This is a rather short note on a field with many applications, and is loosely based on [1]. For our application, the interest lies in classification with a reduced number of features.

Suppose that $\mathbf{x}$ is a vector of $p$ random variables (which in our case correspond to the $p$ feature images). We are looking for a number of uncorrelated variables $z_1, z_2, \dots$, which we will call the principal components of $\mathbf{x}$. In addition to being uncorrelated with each other, each variable $z_k$ will be a linear combination of the elements of $\mathbf{x}$. The first one, $z_1$, will be the first principal component, and will account for most of the variance in $\mathbf{x}$. The next principal component, $z_2$, will account for most of the remaining variance in $\mathbf{x}$, constrained on being uncorrelated with $z_1$. We continue this until we have found $m \leq p$ principal components that account for most of the variance in $\mathbf{x}$.

We start with the first PC,

$$z_1 = \mathbf{a}_1^T \mathbf{x} = \sum_{i=1}^{p} a_{1i} x_i,$$

which has a variance

$$\operatorname{var}(z_1) = \mathbf{a}_1^T \boldsymbol{\Sigma} \mathbf{a}_1.$$

This is the variance we want to maximize, but in order to obtain a finite solution, we constrain the optimization by the unit length condition

$$\mathbf{a}_1^T \mathbf{a}_1 = 1.$$
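As a quick numerical sanity check, the identity $\operatorname{var}(\mathbf{a}^T \mathbf{x}) = \mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a}$ can be verified on simulated data. A minimal sketch in NumPy; the data matrix and the direction vector below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))   # 500 samples of p = 3 features (simulated)
a = np.array([0.6, 0.8, 0.0])   # a direction satisfying a^T a = 1

S = np.cov(X, rowvar=False)     # sample covariance matrix, shape (p, p)
z = X @ a                       # projection z = a^T x for every sample

# The direct variance estimate matches the quadratic form a^T S a.
print(np.var(z, ddof=1), a @ S @ a)
```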

It turns out that, for $k = 1, \dots, p$, $\mathbf{a}_k$ will be an eigenvector of $\boldsymbol{\Sigma}$ corresponding to the $k$th largest eigenvalue $\lambda_k$.

Here, $\boldsymbol{\Sigma}$ is the covariance matrix of $\mathbf{x}$, such that $\Sigma_{ik} = \operatorname{cov}(x_i, x_k)$. For a dataset with $n$ samples $x_{ij}$, $j = 1, \dots, n$, for all features $i = 1, \dots, p$, the elements in the covariance matrix can be estimated as

$$\hat{\Sigma}_{ik} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{ij} - \hat{\mu}_i)(x_{kj} - \hat{\mu}_k),$$

where $\hat{\mu}_i$ is the sample mean of the $i$th feature,

$$\hat{\mu}_i = \frac{1}{n} \sum_{j=1}^{n} x_{ij}.$$
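Translated literally into code, the elementwise estimate is a double loop over feature pairs. A straightforward (unoptimized) sketch, assuming the data are stored in an $(n, p)$ array:

```python
import numpy as np

def cov_elementwise(X):
    """Estimate the covariance matrix element by element from (n, p) data."""
    n, p = X.shape
    mu = X.mean(axis=0)  # sample mean of each feature
    S = np.empty((p, p))
    for i in range(p):
        for k in range(p):
            S[i, k] = np.sum((X[:, i] - mu[i]) * (X[:, k] - mu[k])) / (n - 1)
    return S
```

Up to floating-point error, this agrees with the built-in `np.cov(X, rowvar=False)`.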

In the rest of the derivation, I will drop the hat notation on the sample mean and covariance, as the distinction is more important in the actual implementation than in the derivation of the principal components.

Arranging the feature samples and sample means into vectors $\mathbf{x}_j$ and $\boldsymbol{\mu}$ of size $p$, the estimate of the covariance matrix can be written as

$$\boldsymbol{\Sigma} = \frac{1}{n-1} \sum_{j=1}^{n} (\mathbf{x}_j - \boldsymbol{\mu})(\mathbf{x}_j - \boldsymbol{\mu})^T.$$
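The outer-product form maps directly onto a vectorized computation: stacking the centered samples as the rows of a matrix $D$ gives $\sum_j (\mathbf{x}_j - \boldsymbol{\mu})(\mathbf{x}_j - \boldsymbol{\mu})^T = D^T D$. A sketch under the same $(n, p)$ layout as before:

```python
import numpy as np

def cov_outer(X):
    """Covariance estimate via the sum of outer products."""
    n = X.shape[0]
    D = X - X.mean(axis=0)      # center every sample: rows are (x_j - mu)^T
    return (D.T @ D) / (n - 1)  # equals the sum of the n outer products
```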

Continuing with the variance optimization, we use the technique of Lagrange multipliers to incorporate the unit length constraint, that is, we will maximize the expression

$$\mathbf{a}_1^T \boldsymbol{\Sigma} \mathbf{a}_1 - \lambda (\mathbf{a}_1^T \mathbf{a}_1 - 1).$$

Computing the gradient of this expression w.r.t. $\mathbf{a}_1$, and setting it equal to zero, yields the equation

$$\boldsymbol{\Sigma} \mathbf{a}_1 - \lambda \mathbf{a}_1 = \mathbf{0},$$

or

$$(\boldsymbol{\Sigma} - \lambda I_p) \mathbf{a}_1 = \mathbf{0},$$

where $I_p$ is the $p \times p$ identity matrix. From this we realize that $\lambda$ is an eigenvalue of $\boldsymbol{\Sigma}$, and $\mathbf{a}_1$ is the corresponding eigenvector. Furthermore, $\lambda$ must be the largest eigenvalue $\lambda_1$: since $\operatorname{var}(z_1) = \mathbf{a}_1^T \boldsymbol{\Sigma} \mathbf{a}_1 = \lambda \mathbf{a}_1^T \mathbf{a}_1 = \lambda$, maximizing the variance subject to the constraint of unit length coefficients is equivalent to choosing the largest eigenvalue.
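In practice, then, the first principal direction comes straight out of an eigendecomposition of the covariance matrix. A minimal sketch using NumPy's symmetric eigensolver (the function name is mine):

```python
import numpy as np

def first_pc(X):
    """Return the largest eigenvalue of cov(X) and its unit eigenvector."""
    S = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues ascending
    return eigvals[-1], eigvecs[:, -1]    # last pair = largest eigenvalue
```

The returned eigenvalue is itself the maximized variance $\mathbf{a}_1^T \boldsymbol{\Sigma} \mathbf{a}_1 = \lambda_1$.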

In general, the $k$th principal component of $\mathbf{x}$ is $z_k = \mathbf{a}_k^T \mathbf{x}$, where $\mathbf{a}_k$ is the eigenvector of the covariance matrix $\boldsymbol{\Sigma}$ of $\mathbf{x}$ corresponding to the $k$th largest eigenvalue $\lambda_k$.
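Putting the pieces together, here is a sketch of the full reduction used for classification with fewer features; the function and variable names are my own, not from [1]:

```python
import numpy as np

def pca_transform(X, m):
    """Project (n, p) data X onto its first m principal components."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]  # eigenvalue indices, descending
    A = eigvecs[:, order[:m]]          # columns are a_1, ..., a_m
    Z = (X - mu) @ A                   # scores z_k = a_k^T (x - mu)
    return Z, eigvals[order[:m]]
```

The $m$ columns of `Z` then replace the original $p$ features as input to the classifier.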