Data Visualization with ggplot2 (Part 1)

ggplot2 is based on the grammar of graphics (the "gg" in its name), which means you can build almost any graph from a few basic components:

  1. a data set
  2. a set of geoms (geometric objects), which represent the data points
  3. a coordinate system

You map variables in your data to visual properties such as position, color, and size with aes (aesthetic mappings), which is what turns these components into a clear graphical visualization.
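As a rough illustration of how these components fit together, the sketch below builds one plot piece by piece. It assumes the ggplot2 package is installed; the variable name p is only a placeholder chosen for this example.

```r
library(ggplot2)

# 1. data set + aesthetic mapping: which columns go on which axis
p <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Petal.Length))

# 2. geom: draw every observation as a point
p <- p + geom_point()

# 3. coordinate system: Cartesian is already the default, written out here for clarity
p <- p + coord_cartesian()

# printing the object draws the plot
p
```

Leaving out coord_cartesian() should give the same picture, since ggplot2 falls back to Cartesian coordinates when no coordinate system is specified.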

For this purpose we will use the famous "iris" dataset.

You can load the iris dataset directly in R (R ships with built-in datasets that you can use to experiment with code).

data(iris)

 

#this will load the iris data set

head(iris)

#this will show the first six rows of the data, including the column names
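Besides head(), a few other base R functions give a quick overview of a dataset before plotting it. The sketch below only uses functions that come with R itself.

```r
data(iris)

# structure: column names, types, and the first few values of each column
str(iris)

# number of rows and columns (150 observations of 5 variables)
dim(iris)

# the three species stored in the Species factor
levels(iris$Species)

# summary statistics for one numeric column
summary(iris$Sepal.Length)
```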

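#in ggplot2, you build a plot by adding functions (layers, labels, and so on) together with the "+" sign; for example,

Let's say we want to draw a scatter plot:

#load ggplot2 first (run install.packages("ggplot2") once if it is not installed yet)
library(ggplot2)

ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length)) + geom_point()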


#here ggplot() is the function, iris is the data, and aes() sets which variables go on the x-axis and y-axis; because we are drawing a scatter plot, the geom we add is geom_point()
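Because every "+" returns a new plot object, a plot can also be stored in a variable and extended step by step. The following is a minimal sketch of that idea; the variable names (base_plot, scatter, labelled) and the label texts are made up for this example.

```r
library(ggplot2)

# start with data and aesthetics only: this draws empty axes, no points yet
base_plot <- ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length))

# add a layer of points
scatter <- base_plot + geom_point()

# add axis labels, a title, and a different theme on top
labelled <- scatter +
  labs(title = "Iris: petal length vs. sepal length",
       x = "Sepal length (cm)", y = "Petal length (cm)") +
  theme_minimal()

labelled
```

Keeping the intermediate objects around makes it easy to try different geoms or themes on the same base plot.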

Let’s say we want to color the data points according to the species,

ggplot(data = iris, aes(x= Sepal.Length, y = Petal.Length)) + geom_point(aes(color= Species))

#For this, put aes(color = Species) inside geom_point()

Let's say you want to change the size of the points for better visibility. For that, add one more argument to geom_point(): size = the size you want. Because this is a fixed value rather than a column of the data, it goes outside aes():

ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length)) + geom_point(aes(color = Species), size = 2)
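One detail worth knowing about size: a value placed inside aes() is treated as a mapping from the data, while a value placed outside aes() is treated as a fixed setting. The sketch below contrasts the two; the variable names fixed_size and mapped_size are placeholders for this example.

```r
library(ggplot2)

# fixed size: a constant outside aes(), so every point gets size 2 and no size legend is drawn
fixed_size <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(aes(color = Species), size = 2)

# mapped size: a data column inside aes(), so point size varies with Petal.Width
# and a size legend is added next to the color legend
mapped_size <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(aes(color = Species, size = Petal.Width))

fixed_size
mapped_size
```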

#Let's say we now want to draw a density plot,

ggplot(data = iris, aes(x = Sepal.Width, fill = Species)) + geom_density(stat = "density") +
xlab("Sepal Width") + ylab("Density") + ggtitle("Density Plot")

#geom_density() will allow you to plot a density plot
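To get a feeling for what a density curve represents, the same kind of kernel density estimate can be computed directly with base R's density() function. This is only a sketch for one species; the variable names are chosen for this example, and the area under the curve is approximated with a simple sum.

```r
# sepal widths of one species only
setosa_width <- iris$Sepal.Width[iris$Species == "setosa"]

# kernel density estimate on an evenly spaced grid of x values
dens <- density(setosa_width)

# a density curve encloses an area of about 1;
# approximate it as (sum of heights) * (grid spacing)
area <- sum(dens$y) * diff(dens$x[1:2])
area

# base R plot of the estimated curve
plot(dens, main = "Sepal width density (setosa)")
```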

 

Correlation Using ggcorrplot

#A hypothetical data frame; let's say,

A <- c(2, 4, 6, 8, 12, 5, 7, 8)
B <- c(4, 5, 6, 2, 9, 13, 2, 6)
C <- c(3, 5, 7, 2, 4, 8, 3, 6)
D <- c(4, 6, 2, 5, 7, 4, 6, 9)

DD <- data.frame(A,B,C, D)

#correlation with the ggcorrplot package: installing the package and loading the library

install.packages("ggcorrplot")
library("ggcorrplot")

#computing the correlation matrix using the pearson method and assigning the result to the variable CORT

CORT <- cor(DD, method = "pearson")
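Before plotting, it can help to look at the correlation matrix itself. The sketch below rebuilds the same hypothetical data frame and checks a few properties that any correlation matrix should have.

```r
A <- c(2, 4, 6, 8, 12, 5, 7, 8)
B <- c(4, 5, 6, 2, 9, 13, 2, 6)
C <- c(3, 5, 7, 2, 4, 8, 3, 6)
D <- c(4, 6, 2, 5, 7, 4, 6, 9)
DD <- data.frame(A, B, C, D)

CORT <- cor(DD, method = "pearson")

# one row and one column per variable, rounded for easier reading
round(CORT, 2)

# the matrix is symmetric: cor(A, B) is the same as cor(B, A)
isSymmetric(CORT)

# each variable correlates perfectly with itself, so the diagonal is 1
diag(CORT)

# a single cell equals the pairwise correlation of the two columns
all.equal(CORT["A", "B"], cor(DD$A, DD$B))
```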

#now plotting with ggcorrplot

ggcorrplot(CORT, hc.order = TRUE)

#computing the matrix of correlation p-values

p.mat <- cor_pmat(DD)
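To make the idea of a p-value matrix more concrete, the sketch below builds one by hand with base R's cor.test(), one pair of columns at a time. The helper name pairwise_pvalues is hypothetical, and the assumption that cor_pmat() gives comparable numbers is just that, an assumption; treat this as an illustration rather than a description of the package internals.

```r
A <- c(2, 4, 6, 8, 12, 5, 7, 8)
B <- c(4, 5, 6, 2, 9, 13, 2, 6)
C <- c(3, 5, 7, 2, 4, 8, 3, 6)
D <- c(4, 6, 2, 5, 7, 4, 6, 9)
DD <- data.frame(A, B, C, D)

# hypothetical helper: p-value of the Pearson correlation test for every pair of columns
pairwise_pvalues <- function(df) {
  n <- ncol(df)
  p <- matrix(0, n, n, dimnames = list(names(df), names(df)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      # fill both halves, since the test result does not depend on the order of the pair
      p[i, j] <- p[j, i] <- cor.test(df[[i]], df[[j]])$p.value
    }
  }
  p  # the diagonal is left at 0 in this sketch
}

p_manual <- pairwise_pvalues(DD)
round(p_manual, 3)
```

A small p-value for a pair suggests that its correlation is unlikely to be due to chance alone; with only eight observations per variable, few pairs are likely to reach that level.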

#Plotting with p-values (non-significant coefficients are crossed out)

ggcorrplot(CORT, hc.order = TRUE, p.mat = p.mat)

#plotting with ggcorrplot using custom colors (low, mid, high) and the p-values

ggcorrplot(CORT, hc.order = TRUE, p.mat = p.mat, colors = c("red", "green", "blue"))

#plotting only the lower half

ggcorrplot(CORT, hc.order = TRUE, p.mat = p.mat, type = "lower")

#with the correlation coefficients shown as labels (lab = TRUE)

ggcorrplot(CORT, hc.order = TRUE, p.mat = p.mat, type = "lower", lab = TRUE)

#with method = "circle"

ggcorrplot(CORT, hc.order = TRUE, p.mat = p.mat, type = "lower", method = "circle")

#with a black outline around each cell

ggcorrplot(CORT, hc.order = TRUE, p.mat = p.mat, type = "lower", outline.color = "black")
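The individual options can also be combined in a single call. The sketch below is one possible combination, assuming the ggcorrplot and ggplot2 packages are installed; the variable name combined and the title text are made up for this example. Since ggcorrplot() returns a ggplot object, ordinary ggplot2 functions such as ggtitle() can be added with "+".

```r
library(ggplot2)
library(ggcorrplot)

A <- c(2, 4, 6, 8, 12, 5, 7, 8)
B <- c(4, 5, 6, 2, 9, 13, 2, 6)
C <- c(3, 5, 7, 2, 4, 8, 3, 6)
D <- c(4, 6, 2, 5, 7, 4, 6, 9)
DD <- data.frame(A, B, C, D)

CORT <- cor(DD, method = "pearson")
p.mat <- cor_pmat(DD)

# lower half only, coefficients as labels, black cell outlines,
# and non-significant cells left blank instead of crossed out
combined <- ggcorrplot(CORT, hc.order = TRUE, type = "lower", lab = TRUE,
                       p.mat = p.mat, insig = "blank",
                       outline.color = "black") +
  ggtitle("Correlation of A, B, C and D")

combined
```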

Installing R

To install R, go to the R Project website:

https://www.r-project.org/

Then follow the CRAN link in the download section and choose a mirror, either one in your country or the worldwide cloud mirror, and download the version for your operating system:

https://cloud.r-project.org/ (for download)

Then follow the suggested installation procedure to install the base program.

For an integrated development environment, you can download RStudio from

https://www.rstudio.com/

and install it.
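Once R is installed, a quick check from the R console can confirm which version is running and whether the packages used in these posts are available. The helper name missing_packages is made up for this sketch, and the install line is left commented out so that nothing is downloaded by accident.

```r
# print the version of R that is currently running
cat(R.version.string, "\n")

# hypothetical helper: return the names of packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

to_install <- missing_packages(c("ggplot2", "ggcorrplot", "corrplot"))

if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them from CRAN
} else {
  message("All packages are available.")
}
```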

Doing correlation in R using corrplot

#Creating an imaginary data frame

Let's say

A <- c(2, 4, 6, 8, 12, 5, 7, 8)
B <- c(4, 5, 6, 2, 9, 13, 2, 6)
C <- c(3, 5, 7, 2, 4, 8, 3, 6)
D <- c(4, 6, 2, 5, 7, 4, 6, 9)

DD <- data.frame(A,B,C, D)

#correlation with the corrplot package: installing the package and loading the library
install.packages("corrplot")

library("corrplot")

#computing the correlation matrix and plotting it
#we assign the result of the correlation to the variable CORT

#you can use any of the methods pearson, kendall, or spearman; we are using pearson

CORT <- cor(DD, method = "pearson")

#for reference, the usage of cor() is

cor(x, y = NULL, use = "everything",
    method = c("pearson", "kendall", "spearman"))
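To see how the method and use arguments of cor() behave, the sketch below compares the three methods on the vectors A and B and then shows what happens when a value is missing. The variable name A_missing is made up for this example.

```r
A <- c(2, 4, 6, 8, 12, 5, 7, 8)
B <- c(4, 5, 6, 2, 9, 13, 2, 6)

# the same pair of vectors under the three available methods
pearson  <- cor(A, B, method = "pearson")   # linear association
spearman <- cor(A, B, method = "spearman")  # rank-based
kendall  <- cor(A, B, method = "kendall")   # rank-based, from concordant and discordant pairs
c(pearson = pearson, spearman = spearman, kendall = kendall)

# with a missing value, the default use = "everything" returns NA
A_missing <- A
A_missing[2] <- NA
cor(A_missing, B)

# use = "complete.obs" drops the incomplete pair before computing
cor(A_missing, B, use = "complete.obs")
```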

#now plotting with corrplot; the default plot draws circles in a matrix with a color key on the right
corrplot(CORT)

[Figure: default corrplot of CORT]

*Here, if you compare the circles with the color key on the right, the more intense the blue, the stronger the positive correlation, while the more intense the red, the stronger the negative correlation.
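The same information that the colors convey can be read off the matrix directly. The sketch below lists every pair of variables once, sorted from the most positive to the most negative correlation; the variable name pair_table and the column names var1, var2, and r are chosen for this example.

```r
A <- c(2, 4, 6, 8, 12, 5, 7, 8)
B <- c(4, 5, 6, 2, 9, 13, 2, 6)
C <- c(3, 5, 7, 2, 4, 8, 3, 6)
D <- c(4, 6, 2, 5, 7, 4, 6, 9)
DD <- data.frame(A, B, C, D)
CORT <- cor(DD, method = "pearson")

# turn the matrix into a long table: one row per cell (columns Var1, Var2, Freq)
pair_table <- as.data.frame(as.table(CORT))

# keep each pair only once and drop the diagonal
pair_table <- pair_table[as.integer(pair_table$Var1) < as.integer(pair_table$Var2), ]
names(pair_table) <- c("var1", "var2", "r")

# sort so the most positively correlated pair comes first
pair_table <- pair_table[order(pair_table$r, decreasing = TRUE), ]
pair_table
```

The pairs at the top of the table correspond to the most intense blue in the plot, and the pairs at the bottom to the colors closest to red.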

#you can easily change your plot by changing the following arguments

corrplot(CORT, method = "color", order = "AOE", tl.col = "black", tl.cex = 0.8, addCoef.col = "black")

#where you can choose any of the following methods (we chose color),

method = c("circle", "square", "ellipse", "number", "shade",
  "color", "pie")

or any of the following orders,

order: character, the ordering method of the correlation matrix.

  • "original" for original order (default).
  • "AOE" for the angular order of the eigenvectors.
  • "FPC" for the first principal component order.
  • "hclust" for the hierarchical clustering order.
  • "alphabet" for alphabetical order.
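To give an idea of what reordering a correlation matrix means, the sketch below reorders CORT by hierarchical clustering using only base R. It treats 1 minus the correlation as a distance, which is one common choice; the exact order corrplot produces for order = "hclust" may differ, so this is an illustration rather than a reproduction of the package's behavior.

```r
A <- c(2, 4, 6, 8, 12, 5, 7, 8)
B <- c(4, 5, 6, 2, 9, 13, 2, 6)
C <- c(3, 5, 7, 2, 4, 8, 3, 6)
D <- c(4, 6, 2, 5, 7, 4, 6, 9)
DD <- data.frame(A, B, C, D)
CORT <- cor(DD, method = "pearson")

# highly correlated variables get a small distance
dist_mat <- as.dist(1 - CORT)

# hierarchical clustering with complete linkage
hc <- hclust(dist_mat, method = "complete")

# order of the variables along the dendrogram
ord <- hc$order
colnames(CORT)[ord]

# apply the same order to rows and columns so the matrix stays symmetric
CORT_ordered <- CORT[ord, ord]
round(CORT_ordered, 2)
```

Reordering only changes where each variable sits in the plot; the correlation values themselves stay the same.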

tl.col = changes the color of the text labels,

tl.cex = adjusts the size of the text labels,

addCoef.col = sets the color of the correlation coefficients printed on the plot


The corrplot above shows the full matrix. Let's say you only want the lower half or the upper half; then add one more argument, type = "lower" for the lower half or type = "upper" for the upper half:

corrplot(CORT, method = "color", order = "AOE", type = "lower", tl.col = "black", tl.cex = 0.8, addCoef.col = "black")
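Showing only one half loses no information, because a correlation matrix is symmetric. The base R sketch below blanks out the upper triangle to show which cells remain in a "lower" plot; the variable name lower_only is made up for this example.

```r
A <- c(2, 4, 6, 8, 12, 5, 7, 8)
B <- c(4, 5, 6, 2, 9, 13, 2, 6)
C <- c(3, 5, 7, 2, 4, 8, 3, 6)
D <- c(4, 6, 2, 5, 7, 4, 6, 9)
DD <- data.frame(A, B, C, D)
CORT <- cor(DD, method = "pearson")

# every value above the diagonal has a mirror image below it
all.equal(CORT[upper.tri(CORT)], t(CORT)[upper.tri(CORT)])

# keep the diagonal and the lower triangle, blank out the rest
lower_only <- CORT
lower_only[upper.tri(lower_only)] <- NA
round(lower_only, 2)
```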

[Figures: lower corrplot and upper corrplot]
corrplot(corr, method = c("circle", "square", "ellipse", "number", "shade",
  "color", "pie"), type = c("full", "lower", "upper"), add = FALSE,
  col = NULL, bg = "white", title = "", is.corr = TRUE, diag = TRUE,
  outline = FALSE, mar = c(0, 0, 0, 0), addgrid.col = NULL,
  addCoef.col = NULL, addCoefasPercent = FALSE, order = c("original",
  "AOE", "FPC", "hclust", "alphabet"), hclust.method = c("complete", "ward",
  "ward.D", "ward.D2", "single", "average", "mcquitty", "median", "centroid"),
  addrect = NULL, rect.col = "black", rect.lwd = 2, tl.pos = NULL,
  tl.cex = 1, tl.col = "red", tl.offset = 0.4, tl.srt = 90,
  cl.pos = NULL, cl.lim = NULL, cl.length = NULL, cl.cex = 0.8,
  cl.ratio = 0.15, cl.align.text = "c", cl.offset = 0.5, number.cex = 1,
  number.font = 2, number.digits = NULL, addshade = c("negative",
  "positive", "all"), shade.lwd = 1, shade.col = "white", p.mat = NULL,
  sig.level = 0.05, insig = c("pch", "p-value", "blank", "n", "label_sig"),
  pch = 4, pch.col = "black", pch.cex = 3, plotCI = c("n", "square",
  "circle", "rect"), lowCI.mat = NULL, uppCI.mat = NULL, na.label = "?",
  na.label.col = "black", win.asp = 1, ...)

#The full usage of corrplot() is listed above; you can play with these arguments to change your plots accordingly.

corrplot(CORT, method = "pie", order = "AOE", type = "upper", tl.col = "black", tl.cex = 0.8, addCoef.col = "black")

I hope this is useful. Please comment and share the post. If you have any questions or suggestions, please leave a comment.

Next time we will plot corrplot with p-values to show significance.

Author : Saurav Das (https://twitter.com/Moutain_Soul or https://www.facebook.com/saurav12das)