PCA in R

What is PCA?

Principal Component Analysis (PCA) is a useful technique for exploratory data analysis. It helps describe the variation between samples when each sample is represented by many variables, that is, when you have a wide dataset. PCA reduces the dimensionality of the dataset and lets you explain most of the variability using fewer variables. The underlying mathematics is quite involved (I will not explain the maths in this post), but in simple terms PCA helps you identify groups of samples that are similar to one another and separate them from samples that are very different. PCA is a linear transformation that fits the data to a new coordinate system in such a way that the greatest variance lies along the first coordinate, and each subsequent coordinate is orthogonal to the previous ones and captures less variance. PCA transforms a set of x correlated variables over y samples into a set of p uncorrelated principal components over the same samples.
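To make this concrete, here is a minimal sketch using the built-in USArrests data (an illustrative choice of mine; any numeric data frame would do). It checks that the component scores returned by prcomp() are ordered by decreasing variance and are uncorrelated with each other.

```r
# Illustration only: USArrests ships with base R (datasets package)
pca <- prcomp(USArrests, scale. = TRUE)

# Variance of the scores along each principal component (largest first)
score_var <- apply(pca$x, 2, var)
print(round(score_var, 3))

# Correlations between the component scores: off-diagonal entries are ~0
score_cor <- cor(pca$x)
print(round(score_cor, 3))
```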

What are Eigenvalues and Eigenvectors?

An eigenvector defines the direction of a new dimension, and its eigenvalue is the number that tells how much variance the data has in that direction. The eigenvector with the highest eigenvalue is therefore the first principal component. The number of eigenvalues and eigenvectors equals the number of dimensions in the data.
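The link between eigenvalues and principal components can be checked directly. The sketch below (again on the built-in USArrests data, chosen only for illustration) compares an eigen decomposition of the correlation matrix with the output of prcomp() on standardized data.

```r
# Eigen decomposition of the correlation matrix (illustration on USArrests)
e <- eigen(cor(USArrests))

# PCA on the same data, standardized to unit variance
pca <- prcomp(USArrests, scale. = TRUE)

# Eigenvalues are the variances along each principal component
print(e$values)
print(pca$sdev^2)

# Eigenvectors match the prcomp() rotation columns up to sign,
# so the differences of absolute values are ~0
print(round(abs(e$vectors) - abs(unname(pca$rotation)), 8))
```

The sign of an eigenvector is arbitrary, which is why the comparison uses absolute values.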

###So let's try to calculate and plot some PCA

There are two general methods to perform PCA in R:

*Spectral decomposition, which examines the covariance/correlation between variables

*Singular value decomposition, which examines the covariance/correlation between individuals

#The built-in R function princomp() uses spectral decomposition and prcomp() uses singular value decomposition.

The function calls look like this:

prcomp(x, scale = FALSE)

where x is a numeric matrix or data frame, and scale is a logical value indicating whether the variables should be scaled to unit variance before the analysis (TRUE) or not (FALSE).

princomp(x, cor = FALSE, scores = TRUE)

x is a numeric matrix or data frame; cor is a logical value which, if TRUE, makes the analysis use the correlation matrix (equivalent to standardizing the variables) instead of the covariance matrix; and scores is a logical value which, if TRUE, calculates the coordinates of each observation on each principal component.
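As a rough check of how the two functions relate, the sketch below runs both on the built-in USArrests data (an illustrative choice). On unscaled data the standard deviations differ only because prcomp() divides by n - 1 and princomp() divides by n; on standardized data they agree.

```r
n <- nrow(USArrests)

# Unscaled analysis: both functions work from the covariance structure
p_svd  <- prcomp(USArrests)     # singular value decomposition, divisor n - 1
p_spec <- princomp(USArrests)   # spectral decomposition, divisor n

# The standard deviations differ only by the factor sqrt((n - 1) / n)
print(p_svd$sdev * sqrt((n - 1) / n))
print(unname(p_spec$sdev))

# Standardized analysis: scale. = TRUE in prcomp() corresponds to
# cor = TRUE in princomp()
s_svd  <- prcomp(USArrests, scale. = TRUE)
s_spec <- princomp(USArrests, cor = TRUE, scores = TRUE)
print(s_svd$sdev)
print(unname(s_spec$sdev))
```

Note that the full argument name in prcomp() is scale. (with a trailing dot); writing scale = TRUE works through R's partial argument matching.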

###For this we will use the factoextra package

#### Loading the library (if you don't have the package, you can install it with install.packages("factoextra"), or install the development version from GitHub with devtools::install_github("kassambara/factoextra"))

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
library(factoextra)
## Warning: package 'factoextra' was built under R version 3.6.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

####For the analysis we are going to use the decathlon2 and USArrests datasets

data("decathlon2")
head(decathlon2)
##           X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
## SEBRLE    11.04      7.58    14.83      2.07 49.81        14.69  43.75
## CLAY      10.76      7.40    14.26      1.86 49.37        14.05  50.72
## BERNARD   11.02      7.23    14.25      1.92 48.93        14.99  40.87
## YURKOV    11.34      7.09    15.19      2.10 50.42        15.31  46.26
## ZSIVOCZKY 11.13      7.30    13.48      2.01 48.62        14.17  45.67
## McMULLEN  10.83      7.31    13.76      2.13 49.91        14.38  44.41
##           Pole.vault Javeline X1500m Rank Points Competition
## SEBRLE          5.02    63.19  291.7    1   8217    Decastar
## CLAY            4.92    60.15  301.5    2   8122    Decastar
## BERNARD         5.32    62.77  280.1    4   8067    Decastar
## YURKOV          4.72    63.44  276.4    5   8036    Decastar
## ZSIVOCZKY       4.42    55.37  268.0    7   8004    Decastar
## McMULLEN        4.42    56.37  285.1    8   7995    Decastar

####Now compute the PCA and store the result in a variable (here we are using res.pca)

res.pca <- prcomp(decathlon2[,1:12], scale = TRUE) # columns 1 to 12 are numeric; the 13th column (Competition) is categorical, so it is left out
#let's see the results of our PCA with the summary function
summary(res.pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     2.2726 1.3218 1.2907 1.04961 0.78882 0.76774 0.6302
## Proportion of Variance 0.4304 0.1456 0.1388 0.09181 0.05185 0.04912 0.0331
## Cumulative Proportion  0.4304 0.5760 0.7148 0.80663 0.85848 0.90760 0.9407
##                            PC8     PC9    PC10    PC11     PC12
## Standard deviation     0.53611 0.45157 0.38071 0.27450 0.005121
## Proportion of Variance 0.02395 0.01699 0.01208 0.00628 0.000000
## Cumulative Proportion  0.96465 0.98164 0.99372 1.00000 1.000000

#####So, we obtained 12 principal components, PC1 to PC12. Each component explains a percentage of the total variation in the dataset. PC1 explains 43% of the total variance and PC2 explains about 14.6%. So, by knowing the position of a sample in relation to PC1 and PC2 we can see how it relates to the other samples, since PC1 and PC2 together explain more than 50% of the variance in the dataset (57.6%).

#now draw a scree plot, which shows the percentage of variance explained by each principal component (or eigenvalue)
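The "Proportion of Variance" and "Cumulative Proportion" rows of summary() can be reproduced by hand from the component standard deviations. This sketch uses the built-in USArrests data so that it runs without factoextra; the object name pca is mine, and the same lines should apply to res.pca.

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eigenvalues <- pca$sdev^2

# Share of the total variance carried by each component, and its running total
prop_var <- eigenvalues / sum(eigenvalues)
cum_var  <- cumsum(prop_var)
print(round(rbind(prop_var, cum_var), 4))

# The same numbers as reported by summary()
print(summary(pca)$importance)
```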
fviz_eig(res.pca)

Individual PCA

##graph of individuals, which groups individuals with similar profiles or athletic records
fviz_pca_ind(res.pca, repel = TRUE) # the repel argument helps avoid overlapping text labels
## now let's say you want to color the individuals to better distinguish them (here by cos2, the quality of representation)
fviz_pca_ind(res.pca, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)
##graph of variables. Positively correlated variables point to the same side of the plot; negatively correlated variables point to opposite sides
fviz_pca_var(res.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE
             )
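The "contrib" and "cos2" values used for coloring can be computed by hand from a prcomp() result. The sketch below follows the usual definitions (coordinates are loadings times component standard deviations, cos2 is the squared coordinate, contrib is the percentage share within a component). I believe this is what factoextra reports through get_pca_var(), but treat it as an assumption rather than a description of the package internals. It uses the built-in USArrests data so it runs on its own.

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates: loadings multiplied by the component standard deviations
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2: squared coordinates (quality of representation on each component)
var_cos2 <- var_coord^2

# contrib: each variable's percentage share of a component's variance
var_contrib <- sweep(var_cos2, 2, colSums(var_cos2), `/`) * 100

print(round(var_cos2[, 1:2], 3))
print(round(var_contrib[, 1:2], 1))
```

With standardized data, the cos2 values of a variable add up to 1 across all components, and the contributions within a component add up to 100.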

PCA biplot

##we can also draw a biplot and represent both individuals and variables in a single plot
fviz_pca_biplot(res.pca, repel = TRUE)
##using a categorical variable to color individuals as groups
groups <- as.factor(decathlon2$Competition)
fviz_pca_ind(res.pca,
             col.ind = groups, # color by groups
             palette = c("#00AFBB",  "#FC4E07"),
             addEllipses = TRUE, # Concentration ellipses
             ellipse.type = "confidence",
             legend.title = "Groups",
             repel = TRUE
             )
##let's try the same with the USArrests dataset
#load the dataset
data("USArrests")
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
View(USArrests)
res.pca2 <- princomp(USArrests) # PCA using the princomp function; princomp() has no scale argument (use cor = TRUE to standardize), so the results below are for the unscaled covariance matrix

fviz_eig(res.pca2) #plotting scree plot to observe the variance
summary(res.pca2)
## Importance of components:
##                            Comp.1      Comp.2      Comp.3       Comp.4
## Standard deviation     82.8908472 14.06956001 6.424204055 2.4578367034
## Proportion of Variance  0.9655342  0.02781734 0.005799535 0.0008489079
## Cumulative Proportion   0.9655342  0.99335156 0.999151092 1.0000000000
biplot(res.pca2, scale = 1) #biplot using the base R function
fviz_pca_biplot(res.pca2, repel = TRUE, addEllipses = TRUE) #biplot using the factoextra package to show the relationship between the states based on the variables Murder, Assault, Rape and UrbanPop
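The summary above shows the first component carrying about 96.6% of the variance. A likely reason is that the analysis ran on the raw covariance matrix, where Assault has by far the largest variance. The sketch below compares that with a standardized analysis using cor = TRUE; the object names raw and std are mine.

```r
# Variances of the original variables: Assault dominates
vars <- sapply(USArrests, var)
print(round(vars, 1))

raw <- princomp(USArrests)               # covariance matrix (unscaled)
std <- princomp(USArrests, cor = TRUE)   # correlation matrix (standardized)

# Proportion of variance explained by each component in both analyses
prop_raw <- raw$sdev^2 / sum(raw$sdev^2)
prop_std <- std$sdev^2 / sum(std$sdev^2)
print(round(prop_raw, 4))
print(round(prop_std, 4))

# Loadings of the first unscaled component: almost entirely Assault
print(round(unclass(raw$loadings)[, 1], 3))
```

When variables are measured on very different scales, as here, standardizing usually gives a more balanced picture.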

Reference:

1. https://www.datacamp.com/community/tutorials/pca-analysis-r

2. http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/