def translate(phrase):
    translation = ""
    for letter in phrase:
        if letter.lower() in "aeiou":
            if letter.isupper():
                translation = translation + "M"
            else:
                translation = translation + "m"
        else:
            translation = translation + letter
    return translation

print(translate(input("Enter a phrase: ")))
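For reference, here is a non-interactive version of the same vowel-replacement logic (repeated so the snippet stands alone), with a couple of sample calls:

```python
def translate(phrase):
    """Replace every vowel with M (if uppercase) or m (if lowercase)."""
    translation = ""
    for letter in phrase:
        if letter.lower() in "aeiou":
            translation += "M" if letter.isupper() else "m"
        else:
            translation += letter
    return translation

print(translate("Hello"))  # Hmllm
print(translate("AEiou"))  # MMmmm
```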

## Function in Python

A function is a reusable block of code that performs a specific action. Here we define two different functions.

- The first greets the user by name after asking for their name and age.
- The second calculates the square of a number.

# function_1: greet the user by name and age
def say_hi(name, age):
    print("Hello " + name + ", you are " + age)

# ask the user for their name and age
name = input("What is your name? ")
age = input("What is your age? ")
say_hi(name, age)  # calling the function

# function_2: calculate the square of a number
def sq(num):
    return num * num

print(sq(5))  # calling the function and printing the result
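One detail worth noting: input() always returns a string, which is why "you are " + age works here; concatenating an int that way would raise a TypeError. A small non-interactive sketch using an f-string sidesteps the issue (the name "Ada" and age 36 are just example values):

```python
def say_hi(name, age):
    # f-strings format any type, avoiding the str + int concatenation pitfall
    return f"Hello {name}, you are {age}"

def sq(num):
    return num * num

print(say_hi("Ada", 36))  # works even though 36 is an int
print(sq(5))              # 25
```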

## PCA in R

**What is PCA?**

Principal Component Analysis is a useful technique for exploratory data analysis. It helps in better describing the variation in samples when each sample is represented by many variables, i.e. when you have a wide dataset. PCA reduces the dimensionality of the dataset and allows you to explain the variability using fewer variables. Though the underlying mathematics is quite complex (I will not explain the maths in this post), in simple terms it helps you identify or group samples that are similar to one another, and separate them from those that are very different. PCA is a type of linear transformation that fits the dataset to a new coordinate system in such a way that the most significant variance is found on the first coordinate, and each subsequent coordinate is orthogonal to the last and explains less variation. PCA transforms a set of x correlated variables over y samples into a set of p uncorrelated principal components over the same samples.

**What is Eigenvalue and Eigenvectors?**

An eigenvector defines the direction of a dimension, and its eigenvalue is the number that explains the variance of the data in that direction. The eigenvector with the highest eigenvalue is therefore the first principal component. There are as many eigenvalue/eigenvector pairs as the data has dimensions.
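To make that concrete, here is a small pure-Python sketch (the 2x2 covariance matrix is hypothetical, chosen only for illustration) that computes the eigenvalues of a symmetric 2x2 matrix and the share of variance the first principal component would explain:

```python
import math

# covariance matrix of a toy 2-variable dataset (hypothetical numbers)
cov = [[2.0, 0.8],
       [0.8, 0.6]]

# eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]]:
#   lambda = (a + c)/2 +/- sqrt(((a - c)/2)^2 + b^2)
a, b, c = cov[0][0], cov[0][1], cov[1][1]
mid = (a + c) / 2
half = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = mid + half, mid - half  # lam1 >= lam2

# proportion of total variance explained by the first principal component
explained = lam1 / (lam1 + lam2)
print(lam1, lam2, explained)
```

Here the first eigenvalue dominates, so the first principal component alone captures most of the variance, which is exactly why PCA can summarise a wide dataset with few components.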

### So let's try to calculate and plot some PCA

There are two general methods to perform PCA in R:

- Spectral decomposition, which examines the covariances/correlations between variables
- Singular value decomposition, which examines the covariances/correlations between individuals

The built-in R function **princomp()** uses spectral decomposition, and **prcomp()** uses singular value decomposition.

*The functions are called as:*

### prcomp(x, scale = FALSE)

where x is a numeric matrix or data frame, and scale is a logical value indicating whether the variables should be scaled to have unit variance before the analysis.

### princomp(x, cor = FALSE, scores = TRUE)

where x is a numeric matrix or data frame, cor is a logical value which, if TRUE, centers and scales the data before the analysis, and scores is a logical value which, if TRUE, calculates the coordinates on each principal component.

### For this we will use the factoextra package

#### Loading the library (if you don't have the package, you can simply install it with install.packages("factoextra"), or you can install it from GitHub using *devtools::install_github("kassambara/factoextra")*)

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.6.3

library(factoextra)

## Warning: package 'factoextra' was built under R version 3.6.3

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

#### For the analysis we are going to use the *decathlon2* and *USArrests* datasets

data("decathlon2")

head(decathlon2)

## X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
## SEBRLE 11.04 7.58 14.83 2.07 49.81 14.69 43.75
## CLAY 10.76 7.40 14.26 1.86 49.37 14.05 50.72
## BERNARD 11.02 7.23 14.25 1.92 48.93 14.99 40.87
## YURKOV 11.34 7.09 15.19 2.10 50.42 15.31 46.26
## ZSIVOCZKY 11.13 7.30 13.48 2.01 48.62 14.17 45.67
## McMULLEN 10.83 7.31 13.76 2.13 49.91 14.38 44.41
## Pole.vault Javeline X1500m Rank Points Competition
## SEBRLE 5.02 63.19 291.7 1 8217 Decastar
## CLAY 4.92 60.15 301.5 2 8122 Decastar
## BERNARD 5.32 62.77 280.1 4 8067 Decastar
## YURKOV 4.72 63.44 276.4 5 8036 Decastar
## ZSIVOCZKY 4.42 55.37 268.0 7 8004 Decastar
## McMULLEN 4.42 56.37 285.1 8 7995 Decastar

#### Now compute the PCA and store the result in a variable (here we are using res.pca)

res.pca <- prcomp(decathlon2[, 1:12], scale = TRUE) # here we omitted the 13th column, which is a character column

summary(res.pca) # let's see the results of our PCA with the summary function

## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.2726 1.3218 1.2907 1.04961 0.78882 0.76774 0.6302
## Proportion of Variance 0.4304 0.1456 0.1388 0.09181 0.05185 0.04912 0.0331
## Cumulative Proportion 0.4304 0.5760 0.7148 0.80663 0.85848 0.90760 0.9407
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.53611 0.45157 0.38071 0.27450 0.005121
## Proportion of Variance 0.02395 0.01699 0.01208 0.00628 0.000000
## Cumulative Proportion 0.96465 0.98164 0.99372 1.00000 1.000000

So, we obtained 12 principal components, PC1 to PC12. Each of these components explains a percentage of the total variation in the dataset: PC1 explains 43% of the total variance and PC2 explains about 14.5%. So, by knowing the position of a sample in relation to PC1 and PC2, we can gauge its relation to other samples, as PC1 and PC2 together explain more than half of the variance in the dataset (57.6%).
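Those proportions come straight from the component standard deviations: square each one, then divide by the total variance (12 here, since each of the 12 scaled variables contributes a variance of 1). A quick arithmetic check in Python using the SDs printed above:

```python
# standard deviations reported by summary(res.pca) above
sdev = [2.2726, 1.3218, 1.2907, 1.04961, 0.78882, 0.76774,
        0.6302, 0.53611, 0.45157, 0.38071, 0.27450, 0.005121]

variances = [s ** 2 for s in sdev]
total = sum(variances)                 # ~12 for 12 scaled variables
prop = [v / total for v in variances]  # proportion of variance per PC

print(round(prop[0], 4))            # ~0.4304 (PC1)
print(round(prop[0] + prop[1], 3))  # ~0.576 (cumulative PC1 + PC2)
```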

Now draw a scree plot, which shows the percentage of variance explained by each principal component (or eigenvalue):

fviz_eig(res.pca)

#### Individual PCA

Graph of individuals, which will group the individuals with similar profiles or athletic records:

fviz_pca_ind(res.pca, repel = TRUE) # repel helps avoid overlapping text

Now let's say you want to add some color to better distinguish the groups:

fviz_pca_ind(res.pca, col.ind = "cos2", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)

Graph of variables. Positively correlated variables point to the same side of the plot; negatively correlated variables oppose each other:

fviz_pca_var(res.pca, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE) # color by contributions to the PC

**PCA biplot**

We can also draw a biplot and represent both individuals and variables in a single plot:

fviz_pca_biplot(res.pca, repel = TRUE)

Using a categorical variable to color individuals as groups:

groups <- as.factor(decathlon2$Competition)

fviz_pca_ind(res.pca, col.ind = groups, palette = c("#00AFBB", "#FC4E07"), addEllipses = TRUE, ellipse.type = "confidence", legend.title = "Groups", repel = TRUE) # color by groups, with concentration ellipses

Let's try the same with the USArrests dataset. Loading the dataset:

data("USArrests")

head(USArrests)

## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7

View(USArrests)

res.pca2 <- princomp(USArrests) # PCA using the princomp function; note princomp takes cor = TRUE (not scale) to scale the variables, so here they are left unscaled

fviz_eig(res.pca2) #plotting scree plot to observe the variance

summary(res.pca2)

## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 82.8908472 14.06956001 6.424204055 2.4578367034
## Proportion of Variance 0.9655342 0.02781734 0.005799535 0.0008489079
## Cumulative Proportion 0.9655342 0.99335156 0.999151092 1.0000000000

biplot(res.pca2, scale = TRUE) #biplot using R base function

fviz_pca_biplot(res.pca2, repel = TRUE, addEllipses = TRUE) # biplot using the factoextra package, showing the relationships between the states based on the variables Murder, Assault, Rape and UrbanPop

Reference:

1. https://www.datacamp.com/community/tutorials/pca-analysis-r

## Soil Health Gap

**Title:** Soil Health Gap: A concept to establish a benchmark for soil health management

#### Abstract

Growing calls and the need for sustainable agriculture have brought deserved attention to soil and to efforts towards improving or maintaining soil health. Numerous research and field experiments report soil health in terms of physicochemical and biological indicators, and identify different management practices that can improve it. However, the questions remain: how much of cultivated land has degraded since the dawn of agriculture, and what is the maximum or realistically attainable soil health goal? Determination of a benchmark that defines the true magnitude of degradation and simultaneously sets potential soil health goals will optimize efforts in improving soil health using different practices. In this paper, we discuss a new term, “*Soil Health Gap*”, defined as the difference between soil health in an undisturbed native soil and current soil health in a cropland in a given agroecosystem. Soil Health Gap can be determined based on a general or specific soil property such as soil carbon. Soil organic carbon was measured at native grassland, no-till, conventionally tilled, and subsoil-exposed farmlands. The Soil Health Gap based on soil organic carbon was in the order no-till < conventional till < subsoil-exposed farmland; subsequently, the maximum attainable soil health goal with the introduction of conservation practices would vary with the existing management practice or condition. Soil Health Gap establishes a benchmark for soil health management decisions and goals and can be scaled up from site-specific to regional to global scale.

Link: https://www.sciencedirect.com/science/article/pii/S2351989420305680?via%3Dihub

## Extracting Table from Webpage and Analyzing using R

### Saurav Das

###### March 7, 2020

### Extracting the coronavirus cases from website “https://www.worldometers.info/coronavirus/”

Loading the libraries:

library(rvest)

## Loading required package: xml2

library(xml2)

library(dplyr)

## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
## filter, lag

## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union

### extracting the data from webpage

webpage_url <- "https://www.worldometers.info/coronavirus/"

webpage <- xml2::read_html(webpage_url)

ExOffndrsRaw <- rvest::html_table(webpage)[[1]] %>% tibble::as_tibble(.name_repair = "unique") # repair the repeated columns

CV <- ExOffndrsRaw %>% dplyr::glimpse(45)

## Observations: 106
## Variables: 9
## $ `Country,Other` <chr> "China", "S. ...
## $ TotalCases <chr> "80,703", "7,...
## $ NewCases <chr> "+52", "+272"...
## $ TotalDeaths <chr> "3,098", "50"...
## $ NewDeaths <int> 28, 2, 49, NA...
## $ TotalRecovered <chr> "57,333", "13...
## $ ActiveCases <chr> "20,272", "7,...
## $ `Serious,Critical` <chr> "5,264", "36"...
## $ `Tot Cases/1M pop` <dbl> 56.1, 142.6, ...

str(CV)

## Classes 'tbl_df', 'tbl' and 'data.frame': 106 obs. of 9 variables:
## $ Country,Other : chr "China" "S. Korea" "Iran" "Italy" ...
## $ TotalCases : chr "80,703" "7,313" "6,566" "5,883" ...
## $ NewCases : chr "+52" "+272" "+743" "" ...
## $ TotalDeaths : chr "3,098" "50" "194" "233" ...
## $ NewDeaths : int 28 2 49 NA NA NA NA 7 1 NA ...
## $ TotalRecovered : chr "57,333" "130" "2,134" "589" ...
## $ ActiveCases : chr "20,272" "7,133" "4,238" "5,061" ...
## $ Serious,Critical: chr "5,264" "36" "" "567" ...
## $ Tot Cases/1M pop: num 56.1 142.6 78.2 97.3 12.2 ...

View(CV)

### Plotting

loading ggplot2 for plotting

library(ggplot2)

CV %>% ggplot(aes(`Country,Other`, TotalCases)) + geom_bar(stat = "identity")

You can see the x-axis is overcrowded and we can't really figure out what is going on, so we need to modify the plot. If you pay close attention to the y-axis, the values also seem a little off: some of the TotalCases values contain commas in the data frame, which need to be removed before we can use the column as a continuous y-axis.

So for that I used the mutate function from the dplyr package with gsub to remove the commas, and created one extra column named “Total_cases”.

CV <- mutate(CV, Total_cases = as.numeric(gsub(",", "", TotalCases)))

str(CV)

## Classes 'tbl_df', 'tbl' and 'data.frame': 106 obs. of 10 variables:
## $ Country,Other : chr "China" "S. Korea" "Iran" "Italy" ...
## $ TotalCases : chr "80,703" "7,313" "6,566" "5,883" ...
## $ NewCases : chr "+52" "+272" "+743" "" ...
## $ TotalDeaths : chr "3,098" "50" "194" "233" ...
## $ NewDeaths : int 28 2 49 NA NA NA NA 7 1 NA ...
## $ TotalRecovered : chr "57,333" "130" "2,134" "589" ...
## $ ActiveCases : chr "20,272" "7,133" "4,238" "5,061" ...
## $ Serious,Critical: chr "5,264" "36" "" "567" ...
## $ Tot Cases/1M pop: num 56.1 142.6 78.2 97.3 12.2 ...
## $ Total_cases : num 80703 7313 6566 5883 1018 ...
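The comma-stripping step boils down to the same idea as this little Python helper (the function name to_number is mine, just for illustration):

```python
def to_number(s):
    """Parse a count like '80,703' (or '' for a missing value) into a number."""
    s = s.strip().replace(",", "")  # drop thousands separators
    return int(s) if s else None    # empty string means missing

print(to_number("80,703"))  # 80703
print(to_number(""))        # None (missing)
```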

View(CV)

### Modifying plot

Here I modified the x-axis for better readability: I rotated the axis text to a 45° angle.

CV %>% ggplot(aes(`Country,Other`, Total_cases)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

However, the x-axis is still quite tough to read, so we can remove the countries which have reported only a few cases, say fewer than 15. I will use the filter function from dplyr to remove those.

CV %>% filter(Total_cases >= 15) %>% ggplot(aes(as.character(`Country,Other`), Total_cases)) + geom_bar(stat = "identity") +theme(axis.text.x = element_text(angle = 45,hjust = 1))

Let's scale the axis and transform it for better comparison. I scaled the y-axis with log10; you can transform it as you see fit.

CV %>% filter(Total_cases > 15) %>% ggplot(aes(`Country,Other`, Total_cases)) + geom_bar(stat = "identity") +theme(axis.text.x = element_text(angle = 45, hjust = 1)) + scale_y_continuous(trans = "log10")

## Hypothesis Testing

Hypothesis testing is one of the key methods of inferential statistics.

**It is based on the idea that we can define and describe a population based on the random samples we have collected.**

So, if you want to define hypothesis:

**What is Hypothesis?**

It is an assumption about a population based on the sampling, which may or may not be true.

There are two types of statistical hypothesis:

- **Null Hypothesis (H0)** – the hypothesis that there is no effect of your treatment or test.
- **Alternative Hypothesis (H1)** – there is an observable effect of the test or treatment on the population.

Point to remember: **a hypothesis is always about the population parameter, not the sample values or statistics**.

Hypothesis testing involves 5 steps:

1. **Define your hypotheses** for the experiment: the null hypothesis and the alternative hypothesis.
2. **Define the level of significance** (the alpha value, often taken as 0.05 in statistics; based on your experiment you can choose your own level of significance, against which you will reject or fail to reject your hypothesis).
3. **Sample** (take a random sample from your population and choose the statistical test based on the type and distribution of your sample).
4. **Compute the p-value.**
5. **Decide** (use the p-value to reject or fail to reject your null hypothesis).

**So, let's say we want to find out the effect of manure on maize yield. We planted maize at 5 different plots (replications) across our state and applied our treatment (manure) to see the effect. For this we set up two hypotheses:**

- Null hypothesis: there is no effect of manure application on maize yield.
- Alternative hypothesis: there is an effect of manure on maize yield.

**At the end of the season we harvested maize from our five plots, determined the mean yields, and used a t-test to determine the p-value; we found our p-value is 0.03.**

**So, now what does that mean?**

As we have taken our level of significance (alpha value) as 0.05 and our p-value is less than that, we can reject the null hypothesis and accept the alternative hypothesis. It means there is a significant effect of manuring on maize yield, i.e. the yield increased after the application of manure.

**Let's say we got a p-value of 0.08; then what does that mean?**

Now the p-value is greater than the level of significance (alpha value), thus we cannot reject the null hypothesis; it means there is no significant effect of manuring on maize yield.
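This accept/reject logic can be sketched end to end in Python. Note this is not the t-test from the example but a permutation test, and the yield numbers are made up for illustration; under the null hypothesis the group labels are exchangeable, so the p-value is the fraction of random relabelings whose mean difference is at least as extreme as the observed one:

```python
import random

def permutation_test(control, treated, n_perm=10_000, seed=0):
    """Two-sample permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = control + treated
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel under the null hypothesis
        t = pooled[:len(treated)]
        c = pooled[len(treated):]
        diff = sum(t) / len(t) - sum(c) / len(c)
        if abs(diff) >= abs(observed):
            count += 1
    return count / n_perm

# hypothetical maize yields (t/ha) from 5 control and 5 manured plots
control = [5.1, 4.8, 5.3, 5.0, 4.9]
manure = [5.9, 6.1, 5.7, 6.0, 5.8]
p = permutation_test(control, manure)
print(p)  # a small p-value: reject the null at alpha = 0.05
```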

So, when we describe hypothesis testing, we should also know about type-1 and type-2 errors.

**What is a type-1 error?**

A **type-1 error** is also known as a false positive; it happens when the null hypothesis is true but is rejected.

**What is type-2 error?**

A **type-2 error** happens when the null hypothesis is false and you fail to reject it.

We will discuss type-1 and type-2 errors further in the next blog post.

## Cowplot

Ever wondered whether you can arrange multiple plots in different alignments as plot grids in R?

- Cowplot is something which can do that for you.
- Let's make some plots and see how we can arrange them using cowplot.
- For plotting we are going to use the inbuilt R dataset mtcars.

mtcars

## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

For plotting we will use the ggplot2 package. (If you already have the package, you can simply load it with the library function; otherwise you have to install it with install.packages("ggplot2") and then load the library.)

**library(ggplot2)**

1st plot: mpg vs cyl

**ggplot(mtcars, aes(cyl, mpg)) + geom_point()**

2nd plot: disp vs hp

**ggplot(mtcars, aes(hp, disp)) + geom_point()**

3rd plot: hp vs cyl

**ggplot(mtcars, aes(cyl, hp)) + geom_point()**

Now let's say we want all three graphs in one plot, or in a grid. Before beginning, let's give a unique name to each plot so that we can order/arrange them accordingly.

P <- ggplot(mtcars, aes(cyl, mpg)) + geom_point()

Q <- ggplot(mtcars, aes(hp, disp)) + geom_point()

R <- ggplot(mtcars, aes(cyl, hp)) + geom_point()

Now that we have named all three graphs, we can use the cowplot package to arrange them. For that, load the cowplot library if you have already installed the package; if you don't have it, install the package first and then load the library.

library(cowplot)

Now we will use plot_grid function to arrange the graphs accordingly

1. Arranging the plot

plot_grid(P, Q, R, labels = "auto") # you can name the labels yourself, or pass "auto" to label them automatically

- Now let's say you don't want auto labels; you want to name them specifically. So we will change the labels to P, Q and R.

plot_grid(P, Q, R, labels = c("P", "Q", "R"))

- Now let's say you want all three graphs in one single strip. We will use the ncol argument to do that: three columns fit the three graphs in one row.

plot_grid(P, Q, R, labels=c("P", "Q", "R"), ncol=3)

- Now let's say you want to change the label size for all the graphs. For that you can use the label_size argument and modify the size accordingly.

plot_grid(P, Q, R, labels=c("P", "Q", "R"), ncol=3, label_size = 12)

- How about plotting all three graphs in a vertical direction rather than horizontal? For that we will use the nrow argument instead of ncol.

plot_grid(P, Q, R, labels=c("P", "Q", "R"), nrow=3)

Hope this will be helpful. If you have any more specific questions, type them in the comments.

## P value (Part 1)

People often think of the “p-value” as a “probability”; they are related but not exactly the same.

Let's say you flip a coin two times:

- On the first flip you will get either a Head or a Tail, each with 50% probability.
- On the second flip you will again get either a Head or a Tail, each with 50% probability.

So, if the question now is

a. **what is the probability of getting two heads in a row ?**

or

b. **What is the p-value for getting 2 heads in a row ?**

Now, lets breakdown the questions:

**a. What is the probability of getting two heads in a row**?

After two flips there are four outcomes, each of which is equally likely. Since the outcomes are equally likely, we can use the following formula to calculate the probability:

Number of times two heads occurred / total outcomes = 1/4 = 0.25

So, there is a 25% chance of getting two heads after two flips.

If the question is the probability of getting two tails, then it will again be: 1/4 = 0.25,

meaning a 25% chance of getting two tails.

What about one tail and one head, what is the likelihood?

2/4 = 0.50

So we have twice the chance of getting one head and one tail compared to two heads or two tails.

**b. So what is the p-value for HH?**

P value definition: A p-value is the **probability that random chance generated the data or something else that is equal or rarer**.

(The p-value ranges from 0 to 1; in statistics, if it is less than alpha = 0.05 it is called significant, and if it is more than 0.05 it is not significant.)

So, it consists of three parts:

1. **The probability that random chance generated the data**

So it will be like the probability of getting two heads (HH) in our two random flips, which is 0.25.

2. **something else that is equal**

which is like getting two Tails (TT), which is equal to getting two heads (HH) and is also 0.25

3. **Anything rarer**

There is no event rarer than getting HH, so this part is 0.

**So, the p-value for getting two heads (HH) = 0.25 + 0.25 + 0 = 0.50**, **which is different from the probability of getting two heads, which is 0.25.**
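A short enumeration in Python makes the distinction between the probability and the p-value concrete:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

outcomes = list(product("HT", repeat=2))          # HH, HT, TH, TT - equally likely
counts = Counter(o.count("H") for o in outcomes)  # outcomes grouped by number of heads
n = len(outcomes)

prob_hh = Fraction(counts[2], n)  # probability of two heads = 1/4

# p-value for HH: sum the probability of every grouping that is as likely
# as HH or rarer (here: HH itself and TT; one-head outcomes are more likely)
p_value = sum(Fraction(c, n) for c in counts.values()
              if Fraction(c, n) <= prob_hh)

print(prob_hh)  # 1/4
print(p_value)  # 1/2
```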

**How about making this more complex?**

It's easy to enumerate each outcome with a coin, but consider human heights: is it easy to consider every possible outcome? Definitely not. How many decimal places would you need to calculate that accurately? It is not possible every time. So instead we use a density plot, or distribution plot.

So, from a study “**Height of nations: a socioeconomic analysis of cohort differences and patterns among women in 54 low- to middle-income countries**.”

we found that the height of Brazilian women (between 15 and 49 years old), measured in 1996, mostly lies between

142 cm (4.6 ft) and 169 cm (5.5 ft).

So, area under the curve shows the distribution or probability of someone having heights in that range.

If we break down and analyse the density plot:

1. 95% of the women have a height between 142 and 169 cm, i.e. there is a 95% probability that each time you measure a Brazilian woman's height it will be between 142 and 169 cm.
2. There is a 2.5% chance that each time you measure a Brazilian woman, her height will be less than 142 cm.
3. There is a 2.5% chance that each time you measure a Brazilian woman, her height will be more than 169 cm.

So, what will be the p-value for someone who is 142 cm tall?

To calculate:

- There is a 2.5% chance that someone will be 142 cm or shorter = 0.025 (the probability of the event).
- There is a 2.5% chance that someone will be 169 cm or taller = 0.025 (the equally rare event).
- There is no rarer event than that = 0.

So, the p-value = 0.025 +0.025 = 0.05

(so in statistics, it is significant)

How about, **what is the p-value for someone who is between 155.4 and 156 cm tall ?**

so to calculate:

- The probability of the event, a person between 155.4 and 156 cm tall, is 4%, or 0.04.
- For the equally rare or rarer events:

48% of people are taller than 156 cm and 48% of people are shorter than 155.4 cm, which gives 0.48 + 0.48 = 0.96.

So, the p value is = 0.04 + 0.96 = 1
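Both height p-values are just the same three-part sum as in the coin example (a quick arithmetic check):

```python
# p-value for 142 cm: event + equally rare opposite tail + nothing rarer
p_142 = 0.025 + 0.025 + 0.0   # -> 0.05

# p-value for the 155.4-156 cm band: the narrow event itself plus both
# (much more likely) tails, so the sum is essentially 1
p_band = 0.04 + 0.48 + 0.48

print(p_142, p_band)
```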

So, in statistics it's not significant: measuring someone between 155.4 and 156 cm is not statistically significant, even though the probability of the event itself is small.

Reference:

Josh Starmer, StatQuest video: https://www.youtube.com/watch?v=5Z9OIYA8He8&t=92s

## Time Series Data Analysis – 1

Time series data are sequences of data points listed in time order. Examples of time series data include daily temperature or precipitation for a year, or historical data spanning several years by month or by day. Time series are mostly used in econometrics, mathematical finance, weather forecasting, etc. Time series analysis and forecasting use different statistical and modelling techniques to extract meaningful information or to predict future values from the data. A time series can be:

1. Regular: the data have a fixed interval between observations.
2. Irregular: there is no fixed interval between observations.
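The regular/irregular distinction is easy to check programmatically; a Python sketch with made-up dates:

```python
from datetime import date, timedelta

# a regular daily series: one fixed gap between consecutive observations
regular = [date(2020, 1, 1) + timedelta(days=i) for i in range(5)]
gaps = {(b - a).days for a, b in zip(regular, regular[1:])}
print(gaps == {1})  # a single gap size means a regular time series

# an irregular series: observation dates with varying gaps
irregular = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 5)]
gaps2 = {(b - a).days for a, b in zip(irregular, irregular[1:])}
print(len(gaps2) > 1)  # mixed gap sizes mean an irregular series
```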

### loading the library

library(ggplot2)

library(dplyr)

## Attaching package: ‘dplyr’

## The following objects are masked from ‘package:stats’:
##
## filter, lag

## The following objects are masked from ‘package:base’:
##
## intersect, setdiff, setequal, union

library(lubridate)

## Attaching package: ‘lubridate’

## The following object is masked from ‘package:base’:
##
## date

library(xts)

## Loading required package: zoo

## Attaching package: ‘zoo’

## The following objects are masked from ‘package:base’:
##
## as.Date, as.Date.numeric

## Attaching package: ‘xts’

## The following objects are masked from ‘package:dplyr’:
##
## first, last

## Data

I accessed the data from NOAA (National Oceanic and Atmospheric Administration); I used the Clemson_Florence 30-year weather data. *You can simply import it using RStudio's native import function.* I used the variable FL to name the data frame.

getwd()

## [1] "C:/Users/saura/Desktop"

FL <- read.csv("C:/Users/saura/Desktop/Florence_30years_Weather_Data.csv", header = TRUE)

str(FL)

## ‘data.frame’: 10456 obs. of 9 variables:
## $ NAME : Factor w/ 1 level “Clemson PDREC”: 1 1 1 1 1 1 1 1 1 1 …
## $ LATITUDE : num 34.3 34.3 34.3 34.3 34.3 …
## $ LONGITUDE: num -79.7 -79.7 -79.7 -79.7 -79.7 …
## $ ELEVATION: int 125 125 125 125 125 125 125 125 125 125 …
## $ DATE : Factor w/ 10456 levels “1/1/1991”,“1/1/1992”,..: 840 782 4155 294 207 517 488 3848 477 3439 …
## $ PRCP_inch: num 0 0 0 0 0 0 0 0 0 0 …
## $ Avg_T_F : Factor w/ 76 levels “.”,“0”,“19”,“20”,..: 4 6 5 3 4 15 5 7 7 9 …
## $ Max_T_F : int 32 35 33 26 28 51 31 34 35 39 …
## $ Min_T_F : int 8 10 11 12 12 12 13 13 13 13 …

### Converting into Date variable or POSIXct format

You can see that DATE is in factor format, so before we can use it for analysis, we need to convert it to a Date/POSIXct variable. Changing the factor variable DATE into a date variable:

FL$DATE <- parse_date_time(FL$DATE, orders = c("ymd", "dmy", "mdy"))
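parse_date_time tries each format in `orders` until one succeeds. Roughly the same idea in Python (the format list is my approximation, not lubridate's exact matching rules):

```python
from datetime import datetime

def parse_date(s, formats=("%Y-%m-%d", "%d-%m-%Y", "%m/%d/%Y")):
    """Try several date formats in order, like parse_date_time's orders."""
    for fmt in formats:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue  # wrong format, try the next one
    raise ValueError(f"unrecognised date: {s!r}")

print(parse_date("1/7/2018").date())   # 2018-01-07 (month/day/year)
print(parse_date("2018-01-07").date())  # 2018-01-07 (year-month-day)
```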

str(FL)

## ‘data.frame’: 10456 obs. of 9 variables:
## $ NAME : Factor w/ 1 level “Clemson PDREC”: 1 1 1 1 1 1 1 1 1 1 …
## $ LATITUDE : num 34.3 34.3 34.3 34.3 34.3 …
## $ LONGITUDE: num -79.7 -79.7 -79.7 -79.7 -79.7 …
## $ ELEVATION: int 125 125 125 125 125 125 125 125 125 125 …
## $ DATE : POSIXct, format: “2018-01-07” “2018-01-05” …
## $ PRCP_inch: num 0 0 0 0 0 0 0 0 0 0 …
## $ Avg_T_F : Factor w/ 76 levels “.”,“0”,“19”,“20”,..: 4 6 5 3 4 15 5 7 7 9 …
## $ Max_T_F : int 32 35 33 26 28 51 31 34 35 39 …
## $ Min_T_F : int 8 10 11 12 12 12 13 13 13 13 …

Now you can see the DATE has been changed from factor to POSIXct format.

### Extracting the year and month from the DATE variable

I used the mutate function to store the Year and Month in two new columns in the data frame, and named the new data frame FL2.

FL2 <- FL %>% mutate(Year = year(DATE), Month = month(DATE))

### plotting the data and doing the analysis

I grouped the data by year and then plotted the mean of the maximum temperature per year. I used geom_smooth() with its default LOESS (local polynomial) regression to see the trend.

FL2 %>% group_by(Year) %>% na.omit() %>% summarise(Mean_Max = mean(Max_T_F)) %>%
ggplot(aes(Year, Mean_Max)) + geom_line(linetype = 2, size = 1) + geom_smooth() +
scale_x_continuous(breaks = seq(1990, 2019, by = 1)) + xlab("Year") + ylab("Mean Maximum Temperature (°F)") +
theme_bw() + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14))

## `geom_smooth()` using method = ‘loess’ and formula ‘y ~ x’

You can see the x-axis labels overlap each other. You can simply rotate the text to solve the issue. Running the code again with an angle set via element_text() inside theme():

FL2 %>% group_by(Year) %>% na.omit() %>% summarise(Mean_Max = mean(Max_T_F)) %>%
ggplot(aes(Year, Mean_Max)) + geom_line(linetype = 2, size = 1) + geom_smooth() +
scale_x_continuous(breaks = seq(1990, 2019, by = 1)) + xlab("Year") + ylab("Mean Maximum Temperature (°F)") +
theme_bw() + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14)) +
theme(axis.text.x = element_text(angle = 45))

## `geom_smooth()` using method = ‘loess’ and formula ‘y ~ x’

#### boxplot

Let's plot a boxplot to see more details of the temperature profile.

FL2 %>% ggplot(aes(as.factor(Year), Max_T_F)) + geom_boxplot(fill = "lightgreen") +
theme_bw() + ylab("Maximum Temperature (°F)") + xlab("Year") +
theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14)) +
theme(axis.text.x = element_text(angle = 45))

### filtering the data for specific months

Let's say I want to filter the data for specific months, for example the growing season of a particular crop, which runs from May to September. I will filter the months from the dataset using dplyr's filter function. I can filter the months directly by extracting them from the DATE column with lubridate's month() function, or use the Month column we created earlier with mutate.

FL2 %>% group_by(Year) %>% na.omit() %>% filter(month(DATE) >= 5 & month(DATE) <= 9) %>% summarise(Mean_SeasonalT
=
mean(Max_T_F)) %>%

ggplot(aes(Year,
Mean_SeasonalT)) + geom_line(size = 0.8) + scale_x_continuous(breaks = seq(1990, 2019, by = 1)) +

geom_point() + geom_smooth() + ylab(“Mean of Maximum Seasonal Temperature (°F)”) +

theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12)) +

theme(axis.text.x = element_text(angle = 45))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

### Minimum Temperature

Now let's plot the minimum temperature for Florence over the years. This time I am just using the Month column to filter the data.

FL2 %>% group_by(Year) %>% na.omit() %>% filter(Month >= 5 & Month <= 9) %>%
  summarise(Mean_SeasonalT = mean(Min_T_F)) %>%
  ggplot(aes(Year, Mean_SeasonalT)) + geom_line(size = 0.8) +
  scale_x_continuous(breaks = seq(1990, 2019, by = 1)) +
  geom_point() + geom_smooth() + ylab("Mean of Minimum Seasonal Temperature (°F)") +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12)) +
  theme(axis.text.x = element_text(angle = 45))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

### Precipitation

Let's make another line plot, this time for the mean seasonal precipitation per year.

FL2 %>% group_by(Year) %>% na.omit() %>% filter(month(DATE) >= 5 & month(DATE) <= 9) %>%
  summarise(Mean_SeasonalT = mean(PRCP_inch)) %>%
  ggplot(aes(Year, Mean_SeasonalT)) + geom_line(size = 0.8) +
  scale_x_continuous(breaks = seq(1990, 2019, by = 1)) +
  geom_point() + geom_smooth() + ylab("Mean Seasonal Precipitation (inch)") +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12)) +
  theme(axis.text.x = element_text(angle = 45))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

### Data for last 10 years

Let's say now I only want to plot the recent years (2008 to 2019) to see the changing temperature trend in the growing season from May to September. I will filter both Year and Month to get the data, and I used the facet_wrap() function to see each year separately.

FL2 %>% na.omit() %>% group_by(Year, Month) %>%
  filter(Month >= 5 & Month <= 9, Year >= 2008 & Year <= 2019) %>%
  summarise(Mean_MaxT = mean(Max_T_F)) %>% ggplot(aes(Month, Mean_MaxT)) +
  geom_line() + geom_point() + facet_wrap(~Year) + ylab("Mean Maximum Temperature (°F)")

Now look at the x-axis: the months appear as the numeric values 5 to 9, but I want to show them as month names, so I will use the scale_x_continuous() function with the labels argument, like this:

FL2 %>% na.omit() %>% group_by(Year, Month) %>%
  filter(Month >= 5 & Month <= 9, Year >= 2008 & Year <= 2019) %>%
  summarise(Mean_MaxT = mean(Max_T_F)) %>% ggplot(aes(Month, Mean_MaxT)) +
  geom_line() + geom_point() + facet_wrap(~Year) + ylab("Mean Maximum Temperature (°F)") +
  scale_x_continuous(labels = c("5" = "May", "6" = "June", "7" = "July", "8" = "August", "9" = "September")) +
  theme(axis.text.x = element_text(angle = 45))

## Truncated Mean

Before going to the Trimmed Mean (or Truncated Mean), let's have a quick look at the most common terms used in descriptive and summary statistics (there are plenty of others):

**Mean**: the average, a measure of the central value of the dataset

**Median**: the value lying at the midpoint of the sorted dataset

**Mode**: the value with the highest frequency (occurring most often)

**Range**: the spread between the smallest and largest values of a dataset

Let's say, for example, we have a dataset:

A = {2, 5, 7, 11, 14, 17, 17, 2, 18, 2, 20, 24, 25, 27, 32}

So, if we calculate using base R functions, it will be something like this:

> mean(A)

[1] 14.86667

> median(A)

[1] 17

> mode(A)

[1] "numeric"

> range(A)

[1]  2 32
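Note that base R's `mode()` reports the storage mode of the object ("numeric"), not the statistical mode, and `range()` returns both endpoints rather than their difference. A minimal helper for the statistical mode (the name `stat_mode` is just an illustration, not a base function):

```r
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]  # most frequent value
}
A <- c(2, 5, 7, 11, 14, 17, 17, 2, 18, 2, 20, 24, 25, 27, 32)
stat_mode(A)  # 2, which occurs three times
```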

**So, what is the Trimmed Mean?**

Similar to the mean and median, the Truncated Mean or Trimmed Mean is also a measure of central tendency. It calculates the mean after discarding the samples at the extreme high and low ends (outliers), which makes it a robust statistic.

So, let's say we have a dataset (source: Wikipedia):

X = { 92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, −40, 101, 86, 85, 15, 89, 89, 28, −5, 41 }

So you can see that most of the values lie between 10 and 101, while the negative values and 1053 are extreme outliers. Let's see what we get if we calculate the mean vs the trimmed mean (with a 5% trim):

> mean(X)

[1] 101.5

> mean(X, trim = 0.05)

[1] 56.5
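We can verify the 5% trim by hand: with N = 20, a 5% trim drops one value from each end of the sorted data (here −40 and 1053), and averaging the remaining 18 values reproduces R's answer:

```r
X <- c(92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, -40, 101, 86, 85, 15, 89, 89, 28, -5, 41)
mean(X)               # 101.5
mean(X, trim = 0.05)  # 56.5
mean(sort(X)[2:19])   # 56.5, the same result computed by hand
```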

**So, what does this trim value mean? How do we choose a trim value?**

If you choose a 20% trim (trim = 0.2) on a dataset of N = 20, then since 20% of 20 = 4, it removes the four lowest and the four highest values from the dataset, i.e. the lowest 20% and the highest 20% of the values.

Let's consider a dataset:

M = {20, 33, 22, 36, 38, 45, 67, 39, 12, 15, 19, 28, 10, 44, 56, 63, 72, 30, 42, 48}

First, let's sort the dataset in ascending order:

sort(M)

10 12 15 19 20 22 28 30 33 36 38 39 42 44 45 48 56 63 67 72

So, if we now trim it by 20%, we remove the first four and the last four values:

**10 12 15 19** 20 22 28 30 33 36 38 39 42 44 45 48 **56 63 67 72**

so the trimmed dataset will be:

M = { 20 22 28 30 33 36 38 39 42 44 45 48 }

People generally choose a 20% trim, but you can choose other trim values based on your data distribution.
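Base R's `mean()` does this trimming for us via the `trim` argument; on dataset M it averages exactly the middle 12 sorted values:

```r
M <- c(20, 33, 22, 36, 38, 45, 67, 39, 12, 15, 19, 28, 10, 44, 56, 63, 72, 30, 42, 48)
mean(M, trim = 0.2)  # drops the 4 smallest and 4 largest of the 20 values
mean(sort(M)[5:16])  # identical result computed by hand
```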

**What are the advantages of the Truncated Mean?**

a. As it is less sensitive to outliers, it gives reasonable estimates of central tendency when the sample distribution is skewed or uneven.

b. The standard error of the trimmed mean is less affected by outliers.

**What statistical tests can you use with the truncated mean?**

You can use trimmed means instead of means in a t-test. However, the calculation of the standard error differs from the traditional t-test formula, because the values are no longer independent after trimming. An adjusted standard-error calculation for the trimmed mean was originally proposed by Karen Yuen in 1974, and it involves "winsorization". In winsorization, instead of removing observations as in trimming, we replace the extreme values with the nearest values that remain. So for dataset M, the 20% winsorized sample will be:

M = { **20 20 20 20** 20 22 28 30 33 36 38 39 42 44 45 48 **48 48 48 48** }
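A sketch of 20% winsorization in base R: after sorting, the four lowest values are replaced by the fifth-lowest (20) and the four highest by the fifth-highest (48):

```r
M <- c(20, 33, 22, 36, 38, 45, 67, 39, 12, 15, 19, 28, 10, 44, 56, 63, 72, 30, 42, 48)
s <- sort(M)
k <- floor(0.2 * length(s))                           # 4 values replaced at each end
s[1:k] <- s[k + 1]                                    # lowest 4 become 20
s[(length(s) - k + 1):length(s)] <- s[length(s) - k]  # highest 4 become 48
mean(s)                                               # the 20% winsorized mean
```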

**Yuen's approach:**

Trimmed mean with confidence interval: trimci(x, tr = 0.2, alpha = 0.05)

Yuen t-test for 2 independent groups: yuen(x, y, tr = 0.2)

Yuen t-test for 2 dependent groups: yuend(x, y, tr = 0.2)

Further Readings:

- Trimmed Mean: https://garstats.wordpress.com/2017/11/28/trimmed-means/
- Karen K. Yuen. The two-sample trimmed t for unequal population variances, Biometrika, Volume 61, Issue 1, 1 April 1974, Pages 165–170, https://doi.org/10.1093/biomet/61.1.165
- Trimmed mean: https://medium.com/@HollyEmblem/when-to-use-a-trimmed-mean-fd6aab347e46