Principal Component Analysis
Principal Component Analysis (PCA) transforms a set of correlated variables into a set of uncorrelated variables called principal components. The first principal component captures the largest share of the variability in the dataset, and each succeeding component accounts for as much of the remaining variability as possible.
It is a powerful dimensionality reduction technique that can usually express most of the variability in fewer principal components than there are features or predictors. It greatly aids visualization, because a dataset with a large number of predictors can be plotted in 2-D or 3-D space using 2 or 3 principal components. It also reveals correlations among the predictors and their influence on the variability of the dataset.
To transform data with PCA and export the output, use the mlearn app and follow the instructions below.
Calculate Principal Components
To describe the basics of calculating PCA with Singular Value Decomposition (SVD), we’ll use a dataset that has only two features or predictors, X and Y. The center of the data is calculated by averaging each feature and is used as the origin of the 2-D plot. First, the data is projected onto an arbitrary line that passes through the origin. From the projected data, the sum of squared distances between the projections and the origin is calculated. The line is then rotated, and the sum of squared distances is recalculated at each orientation. The best fit line is the one for which the sum of squared distances is the maximum. This best fit line is called Principal Component 1 or PC1, and it is a linear combination of features X and Y.
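The rotating-line search described above can be sketched in a few lines of plain Python. This is only an illustration: the toy values in xs and ys, the 0.1-degree step, and the helper name sum_sq_projections are assumptions made for the example, and real implementations use SVD rather than a brute-force search.

```python
import math

# Hypothetical toy data with two features, X and Y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

# Center the data so that the mean of each feature becomes the origin.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
centered = [(x - mean_x, y - mean_y) for x, y in zip(xs, ys)]


def sum_sq_projections(angle):
    """Sum of squared distances between the origin and each point's
    projection onto the line through the origin at `angle` (radians)."""
    ux, uy = math.cos(angle), math.sin(angle)  # unit vector along the line
    return sum((px * ux + py * uy) ** 2 for px, py in centered)


# Rotate the line in 0.1-degree steps over half a turn and keep the
# orientation with the largest sum of squared distances.
angles = [math.radians(step / 10) for step in range(1800)]
best_angle = max(angles, key=sum_sq_projections)

pc1_direction = (math.cos(best_angle), math.sin(best_angle))
pc1_sum_sq = sum_sq_projections(best_angle)
print(round(math.degrees(best_angle), 1), pc1_direction, pc1_sum_sq)
```

Because the toy points lie close to the line Y = X, the search settles on a direction of roughly 45 degrees.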
The unit vector along PC1 is called the Singular Vector or Eigenvector for PC1. The proportions of X and Y that make up the Eigenvector are called the Loading Scores. The sum of squared distances for the best fit line is called the Eigenvalue for PC1. For this two-feature dataset, PC2 is simply the line that passes through the origin and is perpendicular to PC1. As with PC1, the Eigenvalue for PC2 is the sum of squared distances between the data projected on PC2 and the origin. The Principal Component scatter plot is drawn by rotating the data so that PC1 and PC2 form the orthogonal coordinate axes and projecting the data onto them. The factor map is created by using the Eigenvectors of PC1 and PC2 as orthogonal coordinate axes and showing the loading scores of the feature variables.
Scree plots graphically represent the percentage of variation captured by each PC. The variation for a PC is calculated by dividing its sum of squared distances by (sample size – 1). Because (sample size – 1) is the same constant for every PC, the percentages can equally be computed from the Eigenvalues directly.
With three features, say X, Y and Z, PC1 is still the best fit line through the origin of the dataset. PC2 is the next best fit line through the origin that is perpendicular to PC1. PC3 is the best fit line through the origin that is perpendicular to both PC1 and PC2. The same pattern continues for datasets with more features.
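As a rough sketch of the same quantities obtained directly from SVD, the NumPy example below uses a small made-up dataset. The array values and the variable names are assumptions for illustration, and "Eigenvalue" follows the definition used in this text (the sum of squared distances, i.e. the squared singular value).

```python
import numpy as np

# Hypothetical toy data: rows are samples, columns are features X and Y.
data = np.array([[1.0, 1.2], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.1]])

# Center the data so that the mean of each feature becomes the origin.
centered = data - data.mean(axis=0)

# SVD of the centered data; the rows of vt are unit-length singular vectors.
u, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

eigenvectors = vt                    # row 0 is PC1, row 1 is PC2
loading_scores = vt[0]               # proportions of X and Y in PC1
eigenvalues = singular_values ** 2   # sum of squared distances per PC
scores = centered @ vt.T             # data in PC1-PC2 coordinates

print("PC1 loading scores:", loading_scores)
print("Eigenvalues:", eigenvalues)
```

The columns of scores are the coordinates that a Principal Component scatter plot would display, and the two rows of eigenvectors are perpendicular to each other.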
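The percentages behind a scree plot can be sketched as follows. The random data, the seed, and the variable names are assumptions made for the example; the point is only that dividing by (sample size – 1) does not change the percentages.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 samples, 4 features with different spreads.
data = rng.normal(size=(50, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
centered = data - data.mean(axis=0)
n_samples = centered.shape[0]

# Squared singular values are the sums of squared distances per PC.
singular_values = np.linalg.svd(centered, compute_uv=False)
sum_sq = singular_values ** 2

# Variation per PC, and the percentage each PC captures.
variation = sum_sq / (n_samples - 1)
pct_from_variation = 100 * variation / variation.sum()

# The constant (n_samples - 1) cancels, so the sums of squares give
# the same percentages.
pct_from_sum_sq = 100 * sum_sq / sum_sq.sum()

print(np.round(pct_from_variation, 1))
```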
To demonstrate how to apply PCA to a dataset, we will use the ‘wine’ data from the app. To load the data, open mlearn, click the ‘Data’ tab, select ‘wine’ in ‘load data from app’ under ‘Data Source’ and click ‘Submit Change’. The data can also be downloaded from its source and imported into the app by using the ‘Data’ tab, selecting ‘desktop’ under ‘Data Source’ and identifying the appropriate ‘file type’.
The data originated from a chemical analysis of wines grown in the same region in Italy but derived from 3 different cultivars. The analysis determined the amounts of 13 constituents found in each type of wine. Our goal is to transform the constituents into Principal Components, to evaluate whether the data can be expressed in fewer dimensions, and to determine which variables contribute to the Principal Components. In addition to dimensionality reduction, this helps identify the constituents that are most influential in determining the ‘wine source’ or, in general, the response variable. This knowledge can then be used in Regression or Classification analysis of the data.
To transform the data, click the ‘PCA’ tab, select all columns except ‘source’ in ‘select columns’ under ‘Variables’, and click ‘Submit Change’. This provides the scree plot (first 10 PCs) under ‘Plot’ and the values of the PCs.
The scree plot shows that about 60% of the variation in the dataset can be accounted for by the first 2 PCs alone. As mentioned before, the Principal Component plot is drawn by rotating the data so that PC1 is the vertical axis and PC2 is the horizontal axis and projecting the data onto them. To create the following plot, select ‘Scatter’ in ‘select plot’, set ‘color variable’ to ‘source’ and set ‘point size’ to 7.
To create the factor map, select ‘Correlation’ in ‘select plot’ and click ‘Submit Change’. This plot shows the loading scores of the feature variables as described above.
The squared loading scores are called cos2 or quality of representation. For each variable, the cos2 values summed over all PCs equal one. A higher cos2 value means that the variable is better represented by the Principal Component. The following plot shows the cos2 values for PC1: flavonoids is the variable best represented by PC1, while the representation of ash is almost negligible. To create the plot, select ‘Quality’ in ‘select plot’ and ‘PC1’ in ‘select component’.
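A minimal sketch of one common way to compute cos2 is given below. It assumes standardized variables and the convention in which each loading is scaled by the square root of its PC's variance, so that it equals the correlation between the variable and the PC. Whether mlearn uses exactly this convention is an assumption, and the random data and variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 100 samples, 3 features, two of them correlated.
base = rng.normal(size=(100, 1))
data = np.hstack([
    base + 0.3 * rng.normal(size=(100, 1)),
    base + 0.6 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

# Standardize each variable to zero mean and unit variance.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
n_samples = z.shape[0]

_, s, vt = np.linalg.svd(z, full_matrices=False)
pc_variance = s ** 2 / (n_samples - 1)

# Variable coordinates: rows are variables, columns are PCs. Each entry
# is the correlation between a variable and a PC (assumed convention).
coords = vt.T * np.sqrt(pc_variance)
cos2 = coords ** 2

print(np.round(cos2, 3))
print(cos2.sum(axis=1))  # per-variable sum over all PCs
```

Under these assumptions each row of cos2 sums to one, which matches the property stated above.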
The percentage contribution of each variable to a PC can be derived from the cos2 values. It is calculated for each PC as follows:
contribution of a variable (%) = (cos2 of the variable for the PC) × 100 / (sum of cos2 of all variables for the PC)
Variables that contribute the most to the first PCs, i.e. PC1, PC2 etc., are the most important in explaining the variability of the data. Conversely, variables with the smallest contributions to the first PCs are less relevant in explaining the variability and may be redundant. Plot the contributions for PC1 by selecting ‘Contribution’ in ‘select plot’ and ‘PC1’ in ‘select component’.
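The contribution formula can be sketched as follows, again assuming standardized variables and correlation-scaled loadings for cos2. The data and names are hypothetical, and the app's exact convention is not confirmed here.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 80 samples, 4 features.
data = rng.normal(size=(80, 4)) @ rng.normal(size=(4, 4))

# Standardize, then decompose with SVD.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
n_samples = z.shape[0]
_, s, vt = np.linalg.svd(z, full_matrices=False)
pc_variance = s ** 2 / (n_samples - 1)

# cos2: rows are variables, columns are PCs (assumed convention).
cos2 = (vt.T * np.sqrt(pc_variance)) ** 2

# Contribution (%) = cos2 of a variable x 100 / sum of cos2 for the PC.
contrib = 100 * cos2 / cos2.sum(axis=0)

# Variables ranked by their contribution to PC1, largest first.
ranking_pc1 = np.argsort(contrib[:, 0])[::-1]
print(np.round(contrib[:, 0], 1), ranking_pc1)
```

Each column of contrib sums to 100, so the values for a given PC can be read directly as percentages.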
Finally, the data for the PCA ‘Scatter’ plot can be downloaded by clicking ‘Download Data’ below the ‘Data PCA’ table.