**What is PCA?**

Principal Component Analysis (PCA) is a useful technique for exploratory data analysis. It helps describe the variation among samples when each sample is represented by many variables, that is, when you have a wide dataset. PCA reduces the dimensionality of the dataset and lets you explain its variability using fewer variables. The underlying mathematics is fairly involved (I will not explain the maths in this post), but in simple terms PCA helps you identify groups of samples that are similar to one another and separate them from samples that are very different. PCA is a linear transformation that fits the data to a new coordinate system in such a way that the greatest variance lies along the first coordinate, and each subsequent coordinate is orthogonal to the previous ones and captures less variance. PCA transforms a set of x correlated variables over y samples into a set of p uncorrelated principal components over the same samples.
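To make the description above a little more concrete, here is a minimal sketch in base R. It assumes the built-in *USArrests* data (50 states, 4 variables) purely for illustration, and it only checks the three properties mentioned: decreasing variance, uncorrelated components and orthogonal axes.

```r
# Minimal sketch on the built-in USArrests data (an assumption for illustration)
pca <- prcomp(USArrests, scale. = TRUE)    # centre and standardise, then rotate

scores    <- pca$x                         # samples expressed in the new coordinates
variances <- apply(scores, 2, var)         # variance along each principal component

print(round(variances, 3))                 # decreases from PC1 to PC4
print(round(cor(scores), 3))               # off-diagonals ~0: components are uncorrelated
print(round(crossprod(pca$rotation), 3))   # identity matrix: the new axes are orthogonal
```

The number of components here equals the number of variables (4); dimensionality reduction comes from keeping only the first few.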

**What are eigenvalues and eigenvectors?**

An eigenvector defines the direction of a new dimension, and its eigenvalue is the number that tells you how much variance the data has in that direction. The eigenvector with the highest eigenvalue is therefore the first principal component. The number of eigenvalues and eigenvectors equals the number of dimensions (variables) in the data.
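As a rough illustration of this link, the sketch below computes the eigenvalues and eigenvectors of the covariance matrix by hand and compares them with what *prcomp()* returns. The choice of the built-in *USArrests* data is my assumption for the example, not part of the analysis in this post.

```r
X   <- scale(USArrests)                  # centre and standardise the variables
eig <- eigen(cov(X))                     # eigenvalues and eigenvectors of the covariance matrix

pca <- prcomp(USArrests, scale. = TRUE)

print(eig$values)                        # eigenvalue = variance along each direction
print(pca$sdev^2)                        # the same numbers from prcomp()

# Eigenvectors match the prcomp() loadings up to an arbitrary sign flip
vec_diff <- abs(eig$vectors) - abs(unname(pca$rotation))
print(max(abs(vec_diff)))
```

There are four eigenvalue/eigenvector pairs because the data has four variables.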

### **So let's calculate and plot some PCA**

There are two general methods to perform PCA in R:

* Spectral decomposition, which examines the covariances/correlations between variables

* Singular value decomposition, which examines the covariances/correlations between individuals

R's built-in function **princomp()** uses spectral decomposition and **prcomp()** uses singular value decomposition.
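In practice the two functions give very similar answers. The sketch below is a small comparison on the built-in *USArrests* data (my choice for illustration); it assumes only base R and is not meant as a full account of the numerical differences between the two methods.

```r
n <- nrow(USArrests)

# Standardised data: both routes give the same standard deviations
p_svd  <- prcomp(USArrests, scale. = TRUE)    # singular value decomposition
p_spec <- princomp(USArrests, cor = TRUE)     # eigen-decomposition of the correlation matrix
print(rbind(prcomp = p_svd$sdev, princomp = unname(p_spec$sdev)))

# Loadings agree up to the sign of each component
load_diff <- abs(unname(p_svd$rotation)) - abs(unname(unclass(p_spec$loadings)))
print(max(abs(load_diff)))

# Unscaled data: princomp() divides by n, prcomp() by n - 1
raw_svd  <- prcomp(USArrests)
raw_spec <- princomp(USArrests)
print(raw_svd$sdev * sqrt((n - 1) / n))       # rescaled prcomp() standard deviations
print(unname(raw_spec$sdev))                  # princomp() standard deviations
```

The remaining difference on unscaled data is only the divisor used for the variance, so the proportions of variance are identical either way.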

*The functions are called as follows:*

### prcomp(x, scale = FALSE)

##### where x is a numeric matrix or data frame, and scale is a logical value indicating whether the variables should be scaled to unit variance before the analysis (TRUE) or not (FALSE). The argument's full name is *scale.* (with a trailing dot); *scale* works through R's partial argument matching.

### princomp(x, cor = FALSE, scores = TRUE)

##### where x is a numeric matrix or data frame, cor is a logical value which, if TRUE, makes the analysis use the correlation matrix instead of the covariance matrix (equivalent to standardising the variables first), and scores is a logical value which, if TRUE, calculates the coordinates of each sample on each principal component.
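To see why the scaling option matters, here is a small sketch on the built-in *USArrests* data (again my choice for illustration). It compares the proportion of variance explained with and without scaling; the helper names *prop_raw* and *prop_scaled* are just for this example.

```r
print(sapply(USArrests, var))             # Assault has by far the largest variance

raw    <- prcomp(USArrests, scale. = FALSE)
scaled <- prcomp(USArrests, scale. = TRUE)

prop_raw    <- raw$sdev^2 / sum(raw$sdev^2)         # proportion of variance, unscaled
prop_scaled <- scaled$sdev^2 / sum(scaled$sdev^2)   # proportion of variance, scaled

print(round(prop_raw, 3))                 # PC1 dominates because of Assault
print(round(prop_scaled, 3))              # variance is spread more evenly
```

Without scaling, the variable measured on the largest scale drives the first component, so scaling is usually a sensible default when variables have different units.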

### For this we will use the factoextra package

#### Loading the libraries (if you don't have the package, you can install it with the *install.packages()* function, or install the development version from GitHub using *devtools::install_github("kassambara/factoextra")*)

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.6.3

library(factoextra)

## Warning: package 'factoextra' was built under R version 3.6.3

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

#### For the analysis we are going to use the *decathlon2* and *USArrests* datasets

data("decathlon2")
head(decathlon2)

##           X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
## SEBRLE    11.04      7.58    14.83      2.07 49.81        14.69  43.75
## CLAY      10.76      7.40    14.26      1.86 49.37        14.05  50.72
## BERNARD   11.02      7.23    14.25      1.92 48.93        14.99  40.87
## YURKOV    11.34      7.09    15.19      2.10 50.42        15.31  46.26
## ZSIVOCZKY 11.13      7.30    13.48      2.01 48.62        14.17  45.67
## McMULLEN  10.83      7.31    13.76      2.13 49.91        14.38  44.41
##           Pole.vault Javeline X1500m Rank Points Competition
## SEBRLE          5.02    63.19  291.7    1   8217    Decastar
## CLAY            4.92    60.15  301.5    2   8122    Decastar
## BERNARD         5.32    62.77  280.1    4   8067    Decastar
## YURKOV          4.72    63.44  276.4    5   8036    Decastar
## ZSIVOCZKY       4.42    55.37  268.0    7   8004    Decastar
## McMULLEN        4.42    56.37  285.1    8   7995    Decastar

#### Now compute the PCA and store the result in a variable (here we are using res.pca)

# the 13th (last) column is a character column, so it is left out
res.pca <- prcomp(decathlon2[, 1:12], scale = TRUE)

# let's look at the PCA results with the summary function
summary(res.pca)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     2.2726 1.3218 1.2907 1.04961 0.78882 0.76774 0.6302
## Proportion of Variance 0.4304 0.1456 0.1388 0.09181 0.05185 0.04912 0.0331
## Cumulative Proportion  0.4304 0.5760 0.7148 0.80663 0.85848 0.90760 0.9407
##                            PC8     PC9    PC10    PC11     PC12
## Standard deviation     0.53611 0.45157 0.38071 0.27450 0.005121
## Proportion of Variance 0.02395 0.01699 0.01208 0.00628 0.000000
## Cumulative Proportion  0.96465 0.98164 0.99372 1.00000 1.000000

##### So, we obtained 12 principal components, PC1 to PC12. Each component explains a percentage of the total variation in the dataset. PC1 explains about 43% of the total variance and PC2 about 14.6%. Because PC1 and PC2 together explain more than half of the variance in the dataset (57.6%), knowing the position of a sample relative to PC1 and PC2 already tells us a good deal about how it relates to the other samples.
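The percentages in the summary can be reproduced by hand from the standard deviations. The sketch below simply copies the rounded "Standard deviation" row printed above into a vector (so the results are approximate) and applies the usual formula: variance of a component divided by the total variance.

```r
# Standard deviations copied from the summary(res.pca) output above (rounded values)
sdev <- c(2.2726, 1.3218, 1.2907, 1.04961, 0.78882, 0.76774,
          0.6302, 0.53611, 0.45157, 0.38071, 0.27450, 0.005121)

eigenvalues <- sdev^2                          # variance of each component
prop        <- eigenvalues / sum(eigenvalues)  # proportion of variance
cum         <- cumsum(prop)                    # cumulative proportion

print(round(sum(eigenvalues), 2))   # about 12: one unit of variance per scaled variable
print(round(prop[1:2], 4))          # about 0.43 and 0.146, as in the summary
print(round(cum[2], 4))             # about 0.576 for PC1 and PC2 together
```

On a real result object the same calculation would use *res.pca$sdev* instead of the typed-in vector.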

# now draw a scree plot, which shows the percentage of variance explained by each principal component (or eigenvalue)
fviz_eig(res.pca)

#### Individual PCA

# graph of individuals, which groups individuals with similar profiles (athletic records)
fviz_pca_ind(res.pca, repel = TRUE) # the repel argument helps avoid overlapping text labels

# now let's say you want to colour the individuals to distinguish them better; here they are coloured by cos2, the quality of representation
fviz_pca_ind(res.pca, col.ind = "cos2", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)
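For readers wondering what cos2 measures, here is a rough base R sketch. It assumes cos2 is the squared coordinate of an individual on a component divided by its squared distance from the origin, which is my understanding of the usual definition; the built-in *USArrests* data and the name *quality_12* are only for illustration.

```r
pca   <- prcomp(USArrests, scale. = TRUE)
coord <- pca$x                              # coordinates of individuals on each component

# Share of each individual's squared distance from the origin captured by each component
cos2 <- coord^2 / rowSums(coord^2)

# Quality of representation on the PC1-PC2 plane
quality_12 <- rowSums(cos2[, 1:2])
print(round(head(sort(quality_12, decreasing = TRUE)), 3))
```

A value close to 1 means the individual is well represented on the plotted plane; a low value means its position on the plot should be read with caution.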

# graph of variables: positively correlated variables point to the same side of the plot, negatively correlated variables point to opposite sides
fviz_pca_var(res.pca,
             col.var = "contrib", # color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE
)
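The variable plot and the "contrib" colouring can also be approximated by hand. The sketch below, on the built-in *USArrests* data, assumes that a variable's coordinate is its loading times the component's standard deviation and that its contribution is its share of the squared coordinates on a component; these are the definitions as I understand them, and the names *var_coord* and *contrib* are just for this example.

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates: loading times the component's standard deviation
var_coord <- sweep(pca$rotation, 2, pca$sdev, "*")

# For scaled data these equal the correlations between variables and components
cor_check <- cor(USArrests, pca$x)
print(round(var_coord[, 1:2], 3))

# Contribution (in %) of each variable to each component
var_cos2 <- var_coord^2
contrib  <- sweep(var_cos2, 2, colSums(var_cos2), "/") * 100
print(round(contrib[, 1:2], 1))
```

Variables whose coordinates have the same sign on a component point to the same side of the plot, which is how the positive and negative correlations show up visually.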

**PCA biplot**

# we can also draw a biplot, which represents both individuals and variables in a single plot
fviz_pca_biplot(res.pca, repel = TRUE)

# use a categorical variable to color individuals by group
groups <- as.factor(decathlon2$Competition)
fviz_pca_ind(res.pca,
             col.ind = groups, # color by groups
             palette = c("#00AFBB", "#FC4E07"),
             addEllipses = TRUE, # concentration ellipses
             ellipse.type = "confidence",
             legend.title = "Groups",
             repel = TRUE
)

# let's try the same with the USArrests dataset
# load the dataset
data("USArrests")
head(USArrests)

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

View(USArrests)

res.pca2 <- princomp(USArrests) # PCA using the princomp function; princomp() has no scale argument, so this runs on the unscaled covariance matrix (use cor = TRUE to standardise the variables)

fviz_eig(res.pca2) #plotting scree plot to observe the variance

summary(res.pca2)

## Importance of components:
##                            Comp.1      Comp.2      Comp.3       Comp.4
## Standard deviation     82.8908472 14.06956001 6.424204055 2.4578367034
## Proportion of Variance  0.9655342  0.02781734 0.005799535 0.0008489079
## Cumulative Proportion   0.9655342  0.99335156 0.999151092 1.0000000000

biplot(res.pca2, scale = 1) # biplot using the base R function; scale is a number between 0 and 1 (1 is the default)

fviz_pca_biplot(res.pca2, repel = TRUE, addEllipses = TRUE) # biplot using the factoextra package, showing the relationships between the states based on the variables Murder, Assault, Rape and UrbanPop

Reference:

1. https://www.datacamp.com/community/tutorials/pca-analysis-r