Functions in Python

A function is a reusable block of code that performs a specific action. Here we define two different functions.

  1. The first function greets the user by name and age after asking for them.
  2. The second function calculates the square of a number.
#defining a function
#function_1
def say_hi(name, age):
    print("Hello " + name + ", you are " + age)  #age is a string returned by input(), so it concatenates directly

#asking user for their name and age: input
name = input("What is your name? ")
age = input("What is your age? ")

#calling the function
say_hi(name, age)

#function_2
def sq(num):
    return num*num

#calling the function and printing the result
print(sq(5))

PCA in R

What is PCA?

Principal Component Analysis (PCA) is a useful technique for exploratory data analysis. It helps describe the variation among samples when each sample is represented by many variables, i.e. when you have a wide dataset. PCA reduces the dimensionality of the dataset and allows you to explain the variability using fewer variables. Although the underlying mathematics is quite complex (I will not explain the maths in this post), in simple terms it helps you identify or group samples that are similar to one another and separate them from samples that are very different. PCA is a type of linear transformation that fits the data to a new coordinate system in such a way that the greatest variance lies along the first coordinate, and each subsequent coordinate is orthogonal to the previous one and explains less variation. PCA transforms a set of x correlated variables measured over y samples into a set of p uncorrelated principal components over the same samples.

What are Eigenvalues and Eigenvectors?

An eigenvector defines the direction of a dimension, and the eigenvalue is the number that explains the variance in the data along that direction. The eigenvector with the highest eigenvalue is therefore the first principal component. There are as many eigenvalues and eigenvectors as there are dimensions in the data.
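As a quick illustration (a minimal sketch using the built-in iris measurements; any numeric data would do), the eigenvalues and eigenvectors of the correlation matrix are exactly what prcomp() reports as component variances and loadings:

#eigen decomposition of the correlation matrix of the four numeric iris columns
e <- eigen(cor(iris[, 1:4]))
e$values   #eigenvalues: the variance along each principal component
e$vectors  #eigenvectors: the directions (loadings) of the components

#the same variances come out of prcomp() on the scaled data
p <- prcomp(iris[, 1:4], scale = TRUE)
p$sdev^2   #matches e$values (up to numerical precision)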

So let's try to calculate and plot some PCA

There are two general methods to perform PCA in R:

  • Spectral decomposition, which examines the covariances/correlations between variables.

  • Singular value decomposition, which examines the covariances/correlations between individuals.

The built-in R function princomp() uses spectral decomposition, and prcomp() uses singular value decomposition.

The functions are called as follows:

prcomp(x, scale = FALSE)

where x is a numeric matrix or data frame, and scale is a logical value indicating whether the variables should be scaled to have unit variance before the analysis (TRUE) or not (FALSE).

princomp(x, cor = FALSE, scores = TRUE)

where x is a numeric matrix or data frame, cor is a logical value which, if TRUE, makes the analysis use the correlation matrix (i.e. the data are centred and scaled first), and scores is a logical value which, if TRUE, calculates the coordinates of each observation on each principal component.
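As a small sketch of the two calls side by side (using the built-in USArrests data, which we also come back to later):

p1 <- prcomp(USArrests, scale = TRUE)  #singular value decomposition of the centred and scaled data
p2 <- princomp(USArrests, cor = TRUE)  #eigen decomposition of the correlation matrix
p1$sdev  #standard deviations of the components
p2$sdev  #essentially the same values (princomp divides by n rather than n - 1)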

For this we will use the factoextra package

Loading the library (if you don't have the package, you can simply use the install.packages() function to install it, or you can use devtools to install it from GitHub with devtools::install_github("kassambara/factoextra"))

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
library(factoextra)
## Warning: package 'factoextra' was built under R version 3.6.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

For the analysis we are going to use the decathlon2 and USArrests datasets

data("decathlon2")
head(decathlon2)
##           X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
## SEBRLE    11.04      7.58    14.83      2.07 49.81        14.69  43.75
## CLAY      10.76      7.40    14.26      1.86 49.37        14.05  50.72
## BERNARD   11.02      7.23    14.25      1.92 48.93        14.99  40.87
## YURKOV    11.34      7.09    15.19      2.10 50.42        15.31  46.26
## ZSIVOCZKY 11.13      7.30    13.48      2.01 48.62        14.17  45.67
## McMULLEN  10.83      7.31    13.76      2.13 49.91        14.38  44.41
##           Pole.vault Javeline X1500m Rank Points Competition
## SEBRLE          5.02    63.19  291.7    1   8217    Decastar
## CLAY            4.92    60.15  301.5    2   8122    Decastar
## BERNARD         5.32    62.77  280.1    4   8067    Decastar
## YURKOV          4.72    63.44  276.4    5   8036    Decastar
## ZSIVOCZKY       4.42    55.37  268.0    7   8004    Decastar
## McMULLEN        4.42    56.37  285.1    8   7995    Decastar

Now compute the PCA and store the result in a variable (here we are using res.pca)

res.pca <- prcomp(decathlon2[, 1:12], scale = TRUE) #here we omit the last (13th) column, which is a character column
#let's look at the results of our PCA with the summary function
summary(res.pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     2.2726 1.3218 1.2907 1.04961 0.78882 0.76774 0.6302
## Proportion of Variance 0.4304 0.1456 0.1388 0.09181 0.05185 0.04912 0.0331
## Cumulative Proportion  0.4304 0.5760 0.7148 0.80663 0.85848 0.90760 0.9407
##                            PC8     PC9    PC10    PC11     PC12
## Standard deviation     0.53611 0.45157 0.38071 0.27450 0.005121
## Proportion of Variance 0.02395 0.01699 0.01208 0.00628 0.000000
## Cumulative Proportion  0.96465 0.98164 0.99372 1.00000 1.000000

So we obtained 12 principal components, PC1 to PC12. Each component explains a percentage of the total variation in the dataset. PC1 explains 43% of the total variance and PC2 explains about 14.5%. So, by knowing the position of a sample in relation to PC1 and PC2, we can see how it relates to other samples, since PC1 and PC2 together explain more than half of the variance in the dataset (57.6%).
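If you want those percentages directly (a small sketch using the res.pca object created above), they can be recovered from the standard deviations stored in the prcomp result:

#proportion of variance explained by each principal component
var_explained <- res.pca$sdev^2 / sum(res.pca$sdev^2)
round(100 * var_explained, 1) #should match the "Proportion of Variance" row in the summary above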

#now draw a scree plot, which shows the percentage of variance explained by each principal component (or eigenvalue)
fviz_eig(res.pca)

Individual PCA

##graph of individuals, which groups individuals with similar profiles or athletic records
fviz_pca_ind(res.pca, repel = TRUE) #repel = TRUE avoids overlapping text labels
##now let's say you want to add some colour to better distinguish the individuals
fviz_pca_ind(res.pca, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)
##graph of variables. Positively correlated variables point to the same side of the plot; negatively correlated variables point to opposite sides
fviz_pca_var(res.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE
             )

PCA biplot

##we can also draw a biplot to represent both individuals and variables in a single plot
fviz_pca_biplot(res.pca, repel = TRUE)
##using a categorical variable to colour individuals as groups
groups <- as.factor(decathlon2$Competition)
fviz_pca_ind(res.pca,
             col.ind = groups, # color by groups
             palette = c("#00AFBB",  "#FC4E07"),
             addEllipses = TRUE, # Concentration ellipses
             ellipse.type = "confidence",
             legend.title = "Groups",
             repel = TRUE
             )
##let's try the same with the USArrests dataset
#for loading the dataset
data("USArrests")
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
View(USArrests)
res.pca2 <- princomp(USArrests) # PCA using princomp(); note princomp() has no scale argument; use cor = TRUE to work on the correlation matrix (the summary below uses the default covariance matrix)

fviz_eig(res.pca2) #plotting scree plot to observe the variance
summary(res.pca2)
## Importance of components:
##                            Comp.1      Comp.2      Comp.3       Comp.4
## Standard deviation     82.8908472 14.06956001 6.424204055 2.4578367034
## Proportion of Variance  0.9655342  0.02781734 0.005799535 0.0008489079
## Cumulative Proportion   0.9655342  0.99335156 0.999151092 1.0000000000
biplot(res.pca2, scale = TRUE) #biplot using the base R function
fviz_pca_biplot(res.pca2, repel = TRUE, addEllipses = TRUE) #biplot using the factoextra package to show the relationships between the states based on the variables Murder, Assault, Rape and UrbanPop

Reference:

1. https://www.datacamp.com/community/tutorials/pca-analysis-r

2. http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/

Soil Health Gap

Title: Soil Health Gap: A concept to establish a benchmark for soil health management

Abstract

Growing calls and the need for sustainable agriculture have brought deserved attention to soil and to efforts towards improving or maintaining soil health. Numerous research and field experiments report soil health in terms of physicochemical and biological indicators, and identify different management practices that can improve it. However, the questions remain: how much has cultivated land degraded since the dawn of agriculture, and what is the maximum or realistically attainable soil health goal? Determining a benchmark that defines the true magnitude of degradation and simultaneously sets potential soil health goals will optimize efforts to improve soil health using different practices. In this paper, we discuss a new term, "Soil Health Gap", defined as the difference between soil health in an undisturbed native soil and current soil health in a cropland in a given agroecosystem. Soil Health Gap can be determined based on a general or specific soil property such as soil carbon. Soil organic carbon was measured in native grassland, no-till, conventionally tilled, and subsoil-exposed farmlands. The Soil Health Gap based on soil organic carbon was in the order no-till < conventional till < subsoil-exposed farmland, and consequently the maximum attainable soil health goal with the introduction of conservation practices would vary with the existing management practice or condition. Soil Health Gap establishes a benchmark for soil health management decisions and goals and can be scaled up from site-specific to regional to global scales.

Link: https://www.sciencedirect.com/science/article/pii/S2351989420305680?via%3Dihub

Extracting Table from Webpage and Analyzing using R

Saurav Das

March 7, 2020

Extracting the coronavirus cases from the website "https://www.worldometers.info/coronavirus/"

loading the library

library(rvest)
## Loading required package: xml2
library(xml2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

extracting the data from webpage

webpage_url <- "https://www.worldometers.info/coronavirus/"
webpage <- xml2::read_html(webpage_url)
ExOffndrsRaw <- rvest::html_table(webpage)[[1]] %>% 
  tibble::as_tibble(.name_repair = "unique") # repair the repeated columns
CV <- ExOffndrsRaw %>% dplyr::glimpse(45)
## Observations: 106
## Variables: 9
## $ `Country,Other`    <chr> "China", "S. ...
## $ TotalCases         <chr> "80,703", "7,...
## $ NewCases           <chr> "+52", "+272"...
## $ TotalDeaths        <chr> "3,098", "50"...
## $ NewDeaths          <int> 28, 2, 49, NA...
## $ TotalRecovered     <chr> "57,333", "13...
## $ ActiveCases        <chr> "20,272", "7,...
## $ `Serious,Critical` <chr> "5,264", "36"...
## $ `Tot Cases/1M pop` <dbl> 56.1, 142.6, ...
str(CV)
## Classes 'tbl_df', 'tbl' and 'data.frame':    106 obs. of  9 variables:
##  $ Country,Other   : chr  "China" "S. Korea" "Iran" "Italy" ...
##  $ TotalCases      : chr  "80,703" "7,313" "6,566" "5,883" ...
##  $ NewCases        : chr  "+52" "+272" "+743" "" ...
##  $ TotalDeaths     : chr  "3,098" "50" "194" "233" ...
##  $ NewDeaths       : int  28 2 49 NA NA NA NA 7 1 NA ...
##  $ TotalRecovered  : chr  "57,333" "130" "2,134" "589" ...
##  $ ActiveCases     : chr  "20,272" "7,133" "4,238" "5,061" ...
##  $ Serious,Critical: chr  "5,264" "36" "" "567" ...
##  $ Tot Cases/1M pop: num  56.1 142.6 78.2 97.3 12.2 ...
View(CV)

Plotting

loading ggplot2 for plotting

library(ggplot2)
CV %>% ggplot(aes(`Country,Other`, TotalCases)) + geom_bar(stat = "identity")

You can see the x-axis is overcrowded and we can't really figure out what is going on, so we need to modify the plot. If you pay close attention to the y-axis, the values also seem a little off: some of the TotalCases values contain commas in the data frame, which need to be removed before we can use the column as a continuous y-axis.

So for that I used the mutate function from the dplyr package together with gsub to remove the commas, and created one extra column named "Total_cases".

CV <- mutate(CV, Total_cases = as.numeric(gsub(",", "", CV$TotalCases))) #remove the commas and convert to numeric
str(CV)
## Classes 'tbl_df', 'tbl' and 'data.frame':    106 obs. of  10 variables:
##  $ Country,Other   : chr  "China" "S. Korea" "Iran" "Italy" ...
##  $ TotalCases      : chr  "80,703" "7,313" "6,566" "5,883" ...
##  $ NewCases        : chr  "+52" "+272" "+743" "" ...
##  $ TotalDeaths     : chr  "3,098" "50" "194" "233" ...
##  $ NewDeaths       : int  28 2 49 NA NA NA NA 7 1 NA ...
##  $ TotalRecovered  : chr  "57,333" "130" "2,134" "589" ...
##  $ ActiveCases     : chr  "20,272" "7,133" "4,238" "5,061" ...
##  $ Serious,Critical: chr  "5,264" "36" "" "567" ...
##  $ Tot Cases/1M pop: num  56.1 142.6 78.2 97.3 12.2 ...
##  $ Total_cases     : num  80703 7313 6566 5883 1018 ...
View(CV)

Modifying plot

Here I modified the x-axis for better readability by rotating the axis text to a 45-degree angle.

CV %>% ggplot(aes(`Country,Other`, Total_cases)) + 
  geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

However, the x-axis is still quite tough to read, so we can remove the countries that have reported only a few cases, say fewer than 15. I will use the filter function from dplyr to remove those.

CV %>% filter(Total_cases >= 15) %>% ggplot(aes(as.character(`Country,Other`), Total_cases)) + 
  geom_bar(stat = "identity") +theme(axis.text.x = element_text(angle = 45,hjust = 1))

Let's scale the axis and transform it for better comparison. I scaled the y-axis with log10; you can transform it as you see fit.

CV %>% filter(Total_cases > 15) %>% ggplot(aes(`Country,Other`, Total_cases)) + 
  geom_bar(stat = "identity") +theme(axis.text.x = element_text(angle = 45, hjust = 1)) + scale_y_continuous(trans = "log10")

Hypothesis Testing

Hypothesis testing is one of the key methods of inferential statistics.

It is based on the idea that we can define and describe a population based on the random samples we have collected.

So, if you want to define a hypothesis:

What is a Hypothesis?

It is an assumption about a population based on sampling, which may or may not be true.

There are two types of statistical hypothesis:

  • Null Hypothesis (H0): the hypothesis stating that there is no effect of your treatment or test.
  • Alternative Hypothesis (H1): the hypothesis stating that there is an observable effect of the test or treatment on the population.

Point to remember: a hypothesis is always about the population parameters, not about the sample values or statistics.

Hypothesis testing involves 5 steps:

  1. Define your hypotheses for the experiment: the null hypothesis and the alternative hypothesis.
  2. Define the level of significance (the alpha value, which in statistics is often taken as 0.05; based on your experiment you can choose your own level of significance, against which you will accept or reject your hypothesis).
  3. Take your sample (take a random sample from your population and choose the statistical test based on the type and distribution of your sample).
  4. Compute the p-value.
  5. Decide (use the p-value to accept or reject your hypothesis).

So, let's say we want to find out the effect of manure on maize yield. We planted maize at 5 different plots across our state (replications) and applied our treatment (manure) to see the effect. For this we set up two hypotheses:

  1. Null hypothesis: there is no effect of manure application on maize yield.
  2. Alternative hypothesis: there is an effect of manure on maize yield.

So, at the end of the season we harvested maize from our five plots, determined the mean yield, used a t-test to determine the p-value, and found that our p-value is 0.03.

So, what does this mean?

As we have taken our level of significance (alpha value) as 0.05 and our p-value is less than it, we can reject the null hypothesis and accept the alternative hypothesis. This means there is a significant effect of manuring on maize yield, i.e. the yield increased after the application of manure.

What if we had got a p-value of 0.08 instead, what would that mean?

Now the p-value is greater than the level of significance (alpha value), so we cannot reject the null hypothesis, meaning there is no significant effect of manuring on maize yield.
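As a minimal sketch of how such a test might look in R (the yield numbers below are made up purely for illustration, treating it as a simple two-sample comparison of control vs manured plots):

#hypothetical maize yields (t/ha) from 5 control plots and 5 manured plots
control <- c(5.1, 4.8, 5.3, 5.0, 4.9)
manure  <- c(5.6, 5.9, 5.4, 6.1, 5.7)
t.test(manure, control) #compare the reported p-value with the chosen alpha (0.05)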

When we talk about hypotheses, we should also know about type-1 and type-2 errors.

What is a type-1 error?

A type-1 error, also known as a false positive, happens when the null hypothesis is true but is rejected.

What is a type-2 error?

A type-2 error happens when the null hypothesis is false but you fail to reject it.

We will discuss type-1 and type-2 errors further in the next blog post.

Cowplot

Ever wondered whether you can arrange multiple plots in different alignments as plot grids in R?

  • cowplot is a package that can do that for you.
  • Let's make some plots and see how we can arrange them using cowplot.
  • For plotting we are going to use the built-in R dataset mtcars:
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

For plotting we will use the ggplot2 package (if you already have the package, you can simply load it using the library function; otherwise you have to install it with install.packages("ggplot2") and then load the library).

library(ggplot2)

1st plot: mpg vs cyl

ggplot(mtcars, aes(cyl, mpg)) + geom_point()

2nd plot: disp vs hp

ggplot(mtcars, aes(hp, disp)) + geom_point()

3rd plot: hp vs cyl

ggplot(mtcars, aes(cyl, hp)) + geom_point()

Now let's say we want all three graphs in one plot or grid. Before beginning, let's give each plot a unique name so that we can order/arrange them accordingly.

P <- ggplot(mtcars, aes(cyl, mpg)) + geom_point()
Q <- ggplot(mtcars, aes(hp, disp)) + geom_point()
R <- ggplot(mtcars, aes(cyl, hp)) + geom_point()

Now that we have named all three graphs, we can use the cowplot package to arrange them. For that we need to load the cowplot library; if you don't have it yet, you have to install the package first and then load the library.

library(cowplot)

Now we will use the plot_grid function to arrange the graphs:

1. Arranging the plot

plot_grid(P, Q, R, labels = "auto") #the "auto" argument labels the panels automatically; you can also supply your own labels

  2. Now let's say you don't want automatic labels and want to name the panels specifically. We will set the labels to P, Q and R:
plot_grid(P, Q, R, labels = c("P", "Q", "R"))
  3. Now let's say you want all three graphs in one single strip. We will use the ncol argument to do that, making three columns to fit the three graphs in one row:
plot_grid(P, Q, R, labels=c("P", "Q", "R"), ncol=3)
  4. Now let's say you want to change the label size for all the graphs. For that you can use the label_size argument and modify the size accordingly:
plot_grid(P, Q, R, labels=c("P", "Q", "R"), ncol=3, label_size = 12)
  5. Now how about arranging all three graphs vertically rather than horizontally? For that we will use the nrow argument instead of ncol:
plot_grid(P, Q, R, labels=c("P", "Q", "R"), nrow=3)

Hope this is helpful. If you have any more specific questions, type them in the comments.

P value (Part 1)

People often think of the "p-value" as a "probability"; the two are related but not exactly the same.

Let's say you flip a coin two times:

  1. On the first flip you have a 50% chance of getting heads and a 50% chance of getting tails.
  2. On the second flip you again have a 50% chance of heads and a 50% chance of tails.

So, the question now is:

a. What is the probability of getting two heads in a row?

or

b. What is the p-value for getting two heads in a row?

Now, let's break down the questions:

a. What is the probability of getting two heads in a row?

After two flips there are four possible outcomes, each equally likely. Because they are equally likely, we can use the following formula to calculate the probability:

Number of outcomes with two heads / total number of outcomes = 1/4 = 0.25

So, there is a 25% chance of getting two heads after two flips.

So if the question is the probability of getting two tails, then it will again be 1/4 = 0.25,

meaning a 25% chance of getting two tails.

What about one tail and one head, what is the likelihood of that?

2/4 = 0.50

So we have twice the chance of getting one head and one tail compared to getting two heads or two tails.

b. So what is the p-value for HH?

P-value definition: a p-value is the probability that random chance generated the data, or something else that is equally likely or rarer.

(The p-value ranges from 0 to 1; in statistics, if it is less than the alpha of 0.05 the result is considered significant, and if it is greater than 0.05 it is not.)

So, it consists of three parts:

  1. The probability that random chance generated the data (the probability of the event)

so this is the probability of getting two heads (HH) in our two random flips, which is 0.25.

  2. Something else that is equally rare

which is getting two tails (TT); this is just as rare as getting two heads (HH), so it is also 0.25.

  3. Anything rarer

There is no outcome rarer than getting HH, so this part is 0.

So, the p-value for getting two heads (HH) = 0.25 + 0.25 = 0.50, which is different from the probability of getting two heads, which is 0.25.
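In R this is just a couple of binomial probabilities (a small sketch; dbinom() gives the probability of a given number of heads in two fair flips):

p_HH <- dbinom(2, size = 2, prob = 0.5) #probability of two heads = 0.25
p_TT <- dbinom(0, size = 2, prob = 0.5) #probability of two tails (equally rare) = 0.25
p_HH + p_TT                             #p-value for HH = 0.5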

How about making this more complex?

It is easy to enumerate each outcome with a coin, but consider human heights: is it easy to list every possible outcome? Definitely not; how many decimal places would you need to record each height accurately? That is not possible every time. So instead we use a density plot or distribution plot.


So, from a study, "Height of nations: a socioeconomic analysis of cohort differences and patterns among women in 54 low- to middle-income countries,"

we find that the heights of Brazilian women (between 15 and 49 years old) measured in 1996 mostly lie between

142 cm (4.6 ft) and 169 cm (5.5 ft).

So, the area under the curve shows the distribution, or the probability of someone having a height in that range.

If we break down and analyse the density plot:

  1. 95% of the women have heights between 142 and 169 cm, i.e. there is a 95% probability that each time you measure a Brazilian woman's height it will be between 142 and 169 cm.

  2. There is a 2.5% chance that each time you measure a Brazilian woman her height will be less than 142 cm.

  3. There is a 2.5% chance that each time you measure a Brazilian woman her height will be more than 169 cm.

So, what will be the p-value for someone who is 142 cm tall?

To calculate:

  1. There is a 2.5% chance that someone will be 142 cm or shorter = 0.025 (the probability of the event).
  2. There is a 2.5% chance that someone will be 169 cm or taller = 0.025 (the equally rare event on the other side).
  3. There is nothing rarer than that = 0.

So, the p-value = 0.025 + 0.025 = 0.05

(so in statistics, it is significant)
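As a rough sketch in R, assuming purely for illustration that the heights are approximately normal with mean 155.5 cm and standard deviation 6.9 cm (numbers chosen so that about 95% of heights fall between 142 and 169 cm; they are not taken from the study):

p_lower <- pnorm(142, mean = 155.5, sd = 6.9)     #P(height <= 142 cm), roughly 0.025
p_upper <- 1 - pnorm(169, mean = 155.5, sd = 6.9) #P(height >= 169 cm), roughly 0.025
p_lower + p_upper                                 #p-value for 142 cm, roughly 0.05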

How about: what is the p-value for someone who is between 155.4 and 156 cm tall?

so to calculate:

  1. The probability of the event, i.e. of a person having a height between 155.4 and 156 cm, is 4%, or 0.04.
  2. For the rarer or equally extreme values:

48% of people are taller than 156 cm and 48% of people are shorter than 155.4 cm, which together gives 0.48 + 0.48 = 0.96.

So, the p-value = 0.04 + 0.96 = 1

So, in statistics it is not significant; measuring someone between 155.4 and 156 cm is not a surprising result, even though the probability of that exact event is small.

Reference:

Josh Starmer, StatQuest video: https://www.youtube.com/watch?v=5Z9OIYA8He8&t=92s

Time Series Data Analysis – 1

Time series data are a sequence of data points listed in time order. Examples of time series data include daily temperature or precipitation for a year, or historical data recorded month-wise or day-wise over several years. Time series are mostly used in econometrics, mathematical finance, weather forecasting, etc. Time series analysis and forecasting use different statistics and models to extract meaningful information or to predict future values based on the data. A time series can be:

  1. Regular: the data have a fixed interval between observations.
  2. Irregular: there is no fixed interval between observations.
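As a small sketch of the difference (using the xts package, which is loaded below; the dates and values here are made up):

library(xts)
#regular series: one observation per day, fixed spacing
regular <- xts(rnorm(10), order.by = as.Date("2020-01-01") + 0:9)
#irregular series: observations at uneven intervals
irregular <- xts(rnorm(4),
                 order.by = as.Date(c("2020-01-01", "2020-01-03", "2020-01-10", "2020-02-02")))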

loading the library

library(ggplot2)
library(dplyr)

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
##     filter, lag

## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

library(lubridate)

##
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
##
##     date

library(xts)

## Loading required package: zoo

##
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
##
##     as.Date, as.Date.numeric

##
## Attaching package: 'xts'

## The following objects are masked from 'package:dplyr':
##
##     first, last

Data

I accessed the data from NOAA (National Oceanic and Atmospheric Administration). I used the Clemson Florence 30-year weather data. You can simply import it with the native RStudio import function. I used the variable FL to name the data frame.

getwd()

## [1] "C:/Users/saura/Desktop"

FL <- read.csv("C:/Users/saura/Desktop/Florence_30years_Weather_Data.csv", header = TRUE)
str(FL)

## 'data.frame':    10456 obs. of  9 variables:
##  $ NAME     : Factor w/ 1 level "Clemson PDREC": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE : num  34.3 34.3 34.3 34.3 34.3 ...
##  $ LONGITUDE: num  -79.7 -79.7 -79.7 -79.7 -79.7 ...
##  $ ELEVATION: int  125 125 125 125 125 125 125 125 125 125 ...
##  $ DATE     : Factor w/ 10456 levels "1/1/1991","1/1/1992",..: 840 782 4155 294 207 517 488 3848 477 3439 ...
##  $ PRCP_inch: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Avg_T_F  : Factor w/ 76 levels ".","0","19","20",..: 4 6 5 3 4 15 5 7 7 9 ...
##  $ Max_T_F  : int  32 35 33 26 28 51 31 34 35 39 ...
##  $ Min_T_F  : int  8 10 11 12 12 12 13 13 13 13 ...

Converting into Date variable or POSIXct format

You can see the DATE column is in factor format, so before we can use it for analysis we need to change it to a Date/POSIXct format. Changing the factor variable DATE into a date variable:

FL$DATE <- parse_date_time(FL$DATE, orders = c("ymd", "dmy", "mdy"))
str(FL)

## 'data.frame':    10456 obs. of  9 variables:
##  $ NAME     : Factor w/ 1 level "Clemson PDREC": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE : num  34.3 34.3 34.3 34.3 34.3 ...
##  $ LONGITUDE: num  -79.7 -79.7 -79.7 -79.7 -79.7 ...
##  $ ELEVATION: int  125 125 125 125 125 125 125 125 125 125 ...
##  $ DATE     : POSIXct, format: "2018-01-07" "2018-01-05" ...
##  $ PRCP_inch: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Avg_T_F  : Factor w/ 76 levels ".","0","19","20",..: 4 6 5 3 4 15 5 7 7 9 ...
##  $ Max_T_F  : int  32 35 33 26 28 51 31 34 35 39 ...
##  $ Min_T_F  : int  8 10 11 12 12 12 13 13 13 13 ...

Now you can see the DATE column has been changed from factor to POSIXct format.

Extracting the year and month from the DATE variable

I used the mutate function to store the Year and Month in two new columns in the data frame, and named the new data frame FL2.

FL2 <- FL %>%
  mutate(Year = year(DATE), Month = month(DATE))

plotting the data and doing the analysis

I grouped the data by year and then plotted the mean of the maximum temperature per year. I used the default smoothing method of geom_smooth (LOESS, a local polynomial regression) to see the trend.

FL2 %>% group_by(Year) %>% na.omit() %>% summarise(Mean_Max = mean(Max_T_F)) %>%
ggplot(aes(Year, Mean_Max)) + geom_line(linetype = 2, size = 1) + geom_smooth() +
  scale_x_continuous(breaks = seq(1990, 2019, by = 1)) + xlab("Year") + ylab("Mean Maximum Temperature (°F)") +
  theme_bw() + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

You can see the x-axis labels overlap each other; you can simply rotate the text to solve the issue. Running the code again with the angle argument of element_text inside theme:

FL2 %>% group_by(Year) %>% na.omit() %>% summarise(Mean_Max = mean(Max_T_F)) %>%
ggplot(aes(Year, Mean_Max)) + geom_line(linetype = 2, size = 1) + geom_smooth() +
  scale_x_continuous(breaks = seq(1990, 2019, by = 1)) + xlab("Year") + ylab("Mean Maximum Temperature (°F)") +
  theme_bw() + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14)) +
  theme(axis.text.x = element_text(angle = 45))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Boxplot

Let's draw a boxplot to see more detail of the temperature profile.

FL2 %>% ggplot(aes(as.factor(Year), Max_T_F)) + geom_boxplot(fill = "lightgreen") +
  theme_bw() + ylab("Maximum Temperature (°F)") + xlab("Year") +
  theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14)) + theme(axis.text.x = element_text(angle = 45))

filtering the data for specific months

Let's say I want to filter the data for a specific month, or for the growing season of a particular crop, which runs from May to September. I will filter the months from the dataset using the filter function of dplyr. I can filter the months directly by extracting them from the DATE column with the month() function of lubridate, or use the Month column we created earlier with mutate.

FL2 %>% group_by(Year) %>% na.omit() %>% filter(month(DATE) >= 5 & month(DATE) <= 9) %>% summarise(Mean_SeasonalT = mean(Max_T_F)) %>%
  ggplot(aes(Year, Mean_SeasonalT)) + geom_line(size = 0.8) + scale_x_continuous(breaks = seq(1990, 2019, by = 1)) +
  geom_point() + geom_smooth() + ylab("Mean of Maximum Seasonal Temperature (°F)") +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12)) +
  theme(axis.text.x = element_text(angle = 45))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Minimum Temperature

Now let's plot the minimum temperature for Florence over the years. This time I am just using the Month column to filter the data.

FL2 %>% group_by(Year) %>% na.omit() %>% filter(Month >= 5 & Month <= 9) %>%
  summarise(Mean_SeasonalT = mean(Min_T_F)) %>%
  ggplot(aes(Year, Mean_SeasonalT)) + geom_line(size = 0.8) +
  scale_x_continuous(breaks = seq(1990, 2019, by = 1)) +
  geom_point() + geom_smooth() + ylab("Mean of Minimum Seasonal Temperature (°F)") +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12)) +
  theme(axis.text.x = element_text(angle = 45))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Precipitation

Let's make another line plot, this time for the mean seasonal precipitation per year.

FL2 %>% group_by(Year) %>% na.omit() %>% filter(month(DATE) >= 5 & month(DATE) <= 9) %>%
  summarise(Mean_SeasonalT = mean(PRCP_inch)) %>%
  ggplot(aes(Year, Mean_SeasonalT)) + geom_line(size = 0.8) +
  scale_x_continuous(breaks = seq(1990, 2019, by = 1)) +
  geom_point() + geom_smooth() + ylab("Mean Seasonal Precipitation (inch)") +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12)) +
  theme(axis.text.x = element_text(angle = 45))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Data for last 10 years

Let's say I now only want to plot the last 10 years, to see the changing temperature trend in the growing season from May to September. This time I will filter on both Year and Month. I used the facet_wrap function to show each year separately.

FL2 %>% na.omit() %>% group_by(Year, Month) %>%
  filter(Month >= 5 & Month <= 9, Year >= 2008 & Year <= 2019) %>%
  summarise(Mean_MaxT = mean(Max_T_F)) %>% ggplot(aes(Month, Mean_MaxT)) +
  geom_line() + geom_point() + facet_wrap(~Year) + ylab("Mean Maximum Temperature (°F)")

Now look at the x-axis: the months appear as the numeric values 5 to 9, but I want to show them as month names, so I will use the scale_x_continuous function with the labels argument, like this:

FL2 %>% na.omit() %>% group_by(Year, Month) %>%
  filter(Month >= 5 & Month <= 9, Year >= 2008 & Year <= 2019) %>%
  summarise(Mean_MaxT = mean(Max_T_F)) %>% ggplot(aes(Month, Mean_MaxT)) +
  geom_line() + geom_point() + facet_wrap(~Year) + ylab("Mean Maximum Temperature (°F)") + scale_x_continuous(labels = c("5" = "May", "6" = "June", "7" = "July", "8" = "August", "9" = "September")) + theme(axis.text.x = element_text(angle = 45))

Truncated Mean

Before going into the trimmed mean or truncated mean, let's have a quick look at the most common terms used in descriptive and summary statistics (there are a whole lot of other terms):

Mean: the calculated central (average) value of the dataset

Median: the value lying at the midpoint of the dataset

Mode: the value with the highest frequency (the most frequently occurring value)

Range: the distance between the smallest and largest values of a dataset

Let's say, for example, we have a dataset:

A = {2, 5, 7, 11, 14, 17, 17, 2, 18, 2, 20, 24, 25, 27, 32}

So, if we calculate these using base R functions, we get something like this:

> mean(A)
[1] 14.86667
> median(A)
[1] 17
> mode(A)
[1] "numeric"
> range(A)
[1]  2 32
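Note that base R's mode() returns the storage mode of the object ("numeric" here), not the statistical mode. A small sketch of one way to get the most frequent value, using table() (this one-liner is just for illustration, not a standard function):

A <- c(2, 5, 7, 11, 14, 17, 17, 2, 18, 2, 20, 24, 25, 27, 32)
tab <- table(A)
as.numeric(names(tab)[tab == max(tab)]) #statistical mode: 2, which occurs three times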

So, what is a Trimmed Mean?

Similar to the mean and median, the truncated or trimmed mean is also a measure of central tendency. It calculates the mean after discarding the samples at the extreme high and low ends (outliers). It is a robust statistical method.

So, let's say we have a dataset (source: Wikipedia):

X = { 92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, −40, 101, 86, 85, 15, 89, 89, 28, −5, 41 }

You can see that most of the values lie roughly between 10 and 101, while the negative values and 1053 are extreme outliers. So let's see what we get if we calculate the mean vs the trimmed mean (with 5% trimming):

> mean(X)
[1] 101.5
> mean(X, trim = 0.05)
[1] 56.5

So, what does this trim value mean? How do you choose the trim value?

If you choose a 20% trim (trim = 0.2) on a dataset of N = 20 (20% of 20 = 4), it will remove the four lowest and four highest values from the dataset, i.e. the lowest 20% and the highest 20% of the values.

Let's consider a dataset,

M = {20, 33, 22, 36, 38, 45, 67, 39, 12, 15, 19, 28, 10, 44, 56, 63, 72, 30, 42, 48}

First let's sort the dataset in ascending order:

sort(M)
10 12 15 19 20 22 28 30 33 36 38 39 42 44 45 48 56 63 67 72

So, if we now trim it by 20%, it is like removing the first four and last four values, so the trimmed dataset will be:

(dropping 10, 12, 15, 19 from the lower end and 56, 63, 67, 72 from the upper end)

M = { 20 22 28 30 33 36 38 39 42 44 45 48 }
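A quick sketch to check this in R (with M as defined above; the trim argument of mean() does the sorting and trimming for us):

M <- c(20, 33, 22, 36, 38, 45, 67, 39, 12, 15, 19, 28, 10, 44, 56, 63, 72, 30, 42, 48)
mean(M)             #ordinary mean: 36.95
mean(M, trim = 0.2) #20% trimmed mean: about 35.42, the average of the 12 middle values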

People generally choose a 20% trim, but you can choose other trim values based on your data distribution.

What are the advantages of the truncated mean?

a. As it is less sensitive to outliers, it gives reasonable estimates of central tendency when the sample distribution is skewed or uneven.

b. The standard error of the trimmed mean is less affected by outliers.

Which statistical tests can you use with the truncated mean?

You can use trimmed means instead of means in a t-test. However, the calculation of the standard error differs from the traditional t-test formula, because the values are no longer independent after trimming. An adjusted standard error calculation for the trimmed mean was originally proposed by Karen Yuen in 1974, and it involves "winsorization". In winsorization, instead of removing observations as in trimming, we replace the extreme values with the nearest values that are retained. So for the dataset M, the 20% winsorized sample will be:

M = { 20 20 20 20 20 22 28 30 33 36 38 39 42 44 45 48 48 48 48 48 }
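A small base R sketch of winsorizing a vector (the helper below is written just for illustration; packages such as DescTools also provide a ready-made Winsorize() function):

#replace the k smallest values with the (k+1)-th smallest,
#and the k largest with the (k+1)-th largest
winsorize <- function(x, trim = 0.2) {
  x <- sort(x)
  k <- floor(length(x) * trim)
  if (k > 0) {
    x[1:k] <- x[k + 1]
    x[(length(x) - k + 1):length(x)] <- x[length(x) - k]
  }
  x
}
winsorize(M, trim = 0.2)        #20 20 20 20 20 22 ... 48 48 48 48 48
mean(winsorize(M, trim = 0.2))  #the 20% winsorized mean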

Yuen’s approach:

Trimmed mean with confidence interval: trimci(x, tr = 0.2, alpha = 0.05)

Yuen t-test for 2 independent groups: yuen(x, y, tr = 0.2)

Yuen t-test for 2 dependent groups: yuend(x, y, tr = 0.2)

Further Readings:

  1. Trimmed Mean: https://garstats.wordpress.com/2017/11/28/trimmed-means/
  2. Karen K. Yuen. The two-sample trimmed t for unequal population variances, Biometrika, Volume 61, Issue 1, 1 April 1974, Pages 165–170, https://doi.org/10.1093/biomet/61.1.165
  3. Trimmed mean: https://medium.com/@HollyEmblem/when-to-use-a-trimmed-mean-fd6aab347e46