About Principal Component Analysis Functions

Data sets often contain a large number of correlated or redundant variables. This not only makes the data analysis computationally inefficient, but can also cause numerical problems (e.g. in matrix inversion steps). What is required in such cases is a method for compressing the data into a smaller number of orthogonal variables. The methods for doing this are collectively referred to as factor analysis methods. The most fundamental of these is principal component analysis (PCA), which compresses the data to its most dominant factors.

PCA takes an n-dimensional data space and defines a new set of orthogonal axes (i.e. new variables) that describe the variance in the data in an optimal way. The new variables are optimal in the sense that the first one describes the maximum amount of variation possible (i.e. the maximum variance), the second describes the maximum amount of the remaining variation, etc. These are termed the principal components of the data set.
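
The construction of the new axes can be illustrated with a minimal sketch in Python with NumPy. This is not the implementation used by the functions described here; it assumes rows are measurements and columns are variables, uses hypothetical toy data, and finds the axes from an eigendecomposition of the covariance matrix, which is one common way to do it.

```python
import numpy as np

# Hypothetical toy data: 200 measurements (rows) of 3 variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])

# Mean-center each variable, then form the sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder to descending variance.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
variances = eigvals[order]      # variance described by each new variable
axes = eigvecs[:, order]        # columns are the new orthogonal axes

# Coordinates of the data along the new axes (the new variables).
new_vars = Xc @ axes
```

In this sketch `variances` is sorted from largest to smallest, and the columns of `new_vars` are mutually uncorrelated, which is the sense in which the new variables are optimal.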

In data sets in which the starting variables are interdependent, or correlated, the variance described by the higher principal components is close to zero (usually just noise), so these components can be discarded. The underlying, orthogonal (uncorrelated) variables are termed latent variables, and the number required to describe the data is the rank of the data.
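
As a hedged illustration, the following Python/NumPy sketch builds five observed variables from two latent variables plus a small amount of noise, then counts how many principal components describe a non-negligible share of the variance. The mixing matrix, the noise level, and the `1e-4` threshold are all assumptions chosen for the example, not values prescribed by the method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical latent variables, mixed into five observed variables.
latent = rng.normal(size=(100, 2))
mixing = np.array([[1.0, 0.5, -0.3,  0.8, 0.0],
                   [0.0, 1.0,  0.7, -0.4, 0.6]])
X = latent @ mixing + 1e-3 * rng.normal(size=(100, 5))  # small added noise

# Variance described by each principal component, largest first.
Xc = X - X.mean(axis=0)
variances = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
explained = variances / variances.sum()

# Illustrative cutoff (an assumption): keep components above 0.01% of variance.
n_components = int(np.sum(explained > 1e-4))
```

Here `n_components` comes out as 2, matching the number of latent variables used to generate the data; the remaining three components carry only the added noise.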

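For many data sets, the svd function can be used to calculate scores (the coordinates of each measurement along the principal components) and loadings (the weights of the original variables in each component), but in general this is undesirable for two reasons: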

• With Singular Value Decomposition (SVD), all the scores must be calculated, even though the higher ones are not needed. For large and highly redundant data sets, SVD is therefore computationally inefficient and may fail.

• The svd function requires that the number of columns in the data matrix be larger than the number of rows. This is a problem if there are more measurements (rows) than variables (columns).
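
For comparison, here is a minimal sketch of obtaining scores and loadings from a full SVD. It uses NumPy's `numpy.linalg.svd` as a stand-in for the svd function mentioned above; that substitution is an assumption, and NumPy's routine does not share the shape restriction described in the second point. The data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))      # hypothetical data: 50 measurements, 4 variables
Xc = X - X.mean(axis=0)           # mean-center each variable

# Thin SVD: Xc = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                    # one column of scores per component
loadings = Vt.T                   # one column of loadings per component

# All four components were computed, even if only the first k are wanted.
k = 2
X_approx = scores[:, :k] @ loadings[:, :k].T
```

Note that the decomposition produces every component before any can be discarded, which is the inefficiency described in the first point.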

The NIPALS (Nonlinear Iterative Partial Least Squares) algorithm calculates the scores and loadings iteratively, one principal component at a time, and avoids these problems. It is numerically very stable, works with data sets of any size, and calculates only the desired number of principal components.
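
The iteration can be sketched in Python with NumPy as follows. This is a textbook-style sketch of NIPALS for PCA, not the implementation behind the functions documented here; the function name `nipals_pca`, its parameters, the starting vector, and the convergence test are all assumptions made for illustration. It assumes the data are not degenerate (the residual matrix is not all zeros).

```python
import numpy as np

def nipals_pca(X, n_components, tol=1e-10, max_iter=500):
    """Sketch of NIPALS PCA: returns scores T (n x k) and loadings P (m x k)."""
    E = X - X.mean(axis=0)             # residual matrix, starts as centered data
    n, m = E.shape
    T = np.zeros((n, n_components))
    P = np.zeros((m, n_components))
    for a in range(n_components):
        # Start from the residual column with the largest variance.
        t = E[:, np.argmax(E.var(axis=0))].copy()
        for _ in range(max_iter):
            p = E.T @ t / (t @ t)      # regress residual columns on the scores
            p /= np.linalg.norm(p)     # normalize the loading vector
            t_new = E @ p              # updated scores
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        T[:, a], P[:, a] = t, p
        E = E - np.outer(t, p)         # deflate: remove the extracted component
    return T, P

# Usage on hypothetical data: extract only the first two components.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
T, P = nipals_pca(X, n_components=2)
```

Only two components are computed in the usage example; the loop never touches the remaining three. Up to the sign of each component, the loadings should agree with the first two right singular vectors of the centered data.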