Title: Soil Health Gap: A concept to establish a benchmark for soil health management

Abstract

Growing calls and the need for sustainable agriculture have brought deserved attention to soil and to efforts towards improving or maintaining soil health. Numerous research and field experiments report soil health in terms of physicochemical and biological indicators, and identify different management practices that can improve it. However, the questions remain: how much has cultivated land degraded since the dawn of agriculture? What is the maximum or realistically attainable soil health goal? Determining a benchmark that defines the true magnitude of degradation and simultaneously sets potential soil health goals will optimize efforts to improve soil health using different practices. In this paper, we discuss a new term, "Soil Health Gap", defined as the difference between soil health in an undisturbed native soil and current soil health in a cropland in a given agroecosystem. Soil Health Gap can be determined based on a general or specific soil property such as soil carbon. Soil organic carbon was measured at native grassland, no-till, conventionally tilled, and subsoil-exposed farmlands. Soil Health Gap based on soil organic carbon was in the order no-till < conventional till < subsoil-exposed farmland, and consequently the maximum attainable soil health goal with the introduction of conservation practices would vary with the existing management practice or condition. Soil Health Gap establishes a benchmark for soil health management decisions and goals and can be scaled up from site-specific to regional to global scales.

You can see the x-axis is overcrowded and we can't really figure out what is going on, so we need to modify the plot. If you pay close attention to the y-axis, the values also seem a little off: some of the values in the Total Case column of the data frame contain commas, which need to be fixed before we can use them as a continuous y-axis.

So for that I used the mutate function from the dplyr package together with gsub to remove the commas, and created one extra column named "Total_cases".
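A minimal sketch of that step; the data frame and its values here are made up for illustration, but the mutate/gsub pattern is the one described above:

```r
library(dplyr)

# Hypothetical stand-in for the case-count data: note the commas in the
# "Total Case" column, which make it a character column
covid <- data.frame(
  Country      = c("A", "B", "C"),
  `Total Case` = c("1,234", "567", "12,345"),
  check.names  = FALSE
)

# gsub() strips the commas; as.numeric() makes the values usable as a
# continuous y-axis
covid <- covid %>%
  mutate(Total_cases = as.numeric(gsub(",", "", `Total Case`)))
```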

However, if you look at the x-axis it is still quite tough to read, so we can remove the countries that have reported only a few cases, say 15 or fewer. I will use the filter function from dplyr to remove those.
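A sketch of the filtering step, again with a made-up data frame; the threshold of 15 is the one mentioned above and is purely illustrative:

```r
library(dplyr)

# Hypothetical cleaned data with a numeric Total_cases column
covid <- data.frame(
  Country     = c("A", "B", "C"),
  Total_cases = c(1234, 12, 12345)
)

# Keep only countries reporting more than 15 cases so the x-axis stays
# readable
covid_major <- covid %>% filter(Total_cases > 15)
```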

Hypothesis testing is one of the key methods of inferential statistics.

It is based on the idea that we can define and describe a population based on the random samples we have collected.

So, if you want to define a hypothesis:

What is a Hypothesis?

It is an assumption about a population based on sampling, which may or may not be true.

There are two types of statistical hypotheses:

Null Hypothesis (H0) – the hypothesis stating that there is no effect of your treatment or test.

Alternative Hypothesis (H1) – there is an observable effect of the test or treatment on the population.

Point to remember: a hypothesis is always about the population parameter, not the sample values or statistics.

Hypothesis testing involves 5 steps:

1. Define your hypotheses for the experiment: the null hypothesis and the alternative hypothesis.

2. Define the level of significance (the alpha value, which in statistics is often taken as 0.05; based on your experiment you can choose your own alpha value, against which you will reject or fail to reject your hypothesis).

3. Sample (take a random sample from your population and choose the statistical test based on the type and distribution of your sample).

4. Compute the p-value.

5. Decide (use the p-value to reject or fail to reject the null hypothesis).

So, let's say we want to find out the effect of manure on maize yield. We planted maize at 5 different plots (replications) across our state and applied the manure treatment to see the effect. For this we set up two hypotheses:

Null hypothesis: there is no effect of manure application on maize yield.

Alternative hypothesis: there is an effect of manure on maize yield.

So, at the end of the season we harvested maize from our five plots, computed the mean yields, and used a t-test to determine the p-value; we found our p-value is 0.03.
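The workflow above can be sketched in R with t.test(); the yield numbers below are entirely made up for illustration and are not the experiment's data:

```r
# Hypothetical maize yields (t/ha) from five plots per treatment
control <- c(5.1, 4.8, 5.0, 4.9, 5.2)  # without manure
manure  <- c(5.9, 6.1, 5.7, 6.0, 5.8)  # with manure

# Welch two-sample t-test; compare the p-value against alpha = 0.05
result <- t.test(manure, control)
result$p.value
```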

So, what does that mean?

As we have taken our level of significance (alpha value) as 0.05 and our p-value is less than that, we can reject the null hypothesis and accept the alternative hypothesis. It means there is a significant effect of manuring on maize yield, or that the yield has increased after the application of manure.

Let's say we got a p-value of 0.08; then what does that mean?

Now the p-value is greater than the level of significance (alpha value), so we cannot reject the null hypothesis; it means there is no significant effect of manuring on maize yield.

When we describe hypothesis testing, we should also know about type-1 and type-2 errors.

What is a type-1 error?

A type-1 error, also known as a false positive, happens when the null hypothesis is true but is rejected.

What is a type-2 error?

A type-2 error happens when the null hypothesis is false and you fail to reject it.

We will discuss type-1 and type-2 errors further in the next blog post.

#For plotting we will use the ggplot2 package (if you already have the package, you can simply load it with the library function; otherwise install it with install.packages("ggplot2") and then load the library)

library(ggplot2)

1st plot: mpg vs cyl

ggplot(mtcars, aes(cyl, mpg)) + geom_point()

2nd plot: disp vs hp

ggplot(mtcars, aes(hp, disp)) + geom_point()

3rd plot: hp vs cyl

ggplot(mtcars, aes(cyl, hp)) + geom_point()

Now let's say we want all three graphs in one plot, or in a grid. Before beginning, let's give a unique name to each plot so that we can order/arrange them accordingly.
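Since the naming step itself is not shown, here is a sketch: each of the three plots above is assigned to a name; P, Q and R are the names used with plot_grid later in the post:

```r
library(ggplot2)

# Assign each plot to a name so cowplot can arrange them later
P <- ggplot(mtcars, aes(cyl, mpg)) + geom_point()
Q <- ggplot(mtcars, aes(hp, disp)) + geom_point()
R <- ggplot(mtcars, aes(cyl, hp)) + geom_point()
```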

Now that we have named all three graphs, we can use the cowplot package to arrange them. For that, load the cowplot library if you have already installed the package; otherwise install the package first and then load the library.

library(cowplot)

Now we will use the plot_grid function to arrange the graphs.

1. Arranging the plot

plot_grid(P, Q, R, labels = "auto") # you can set the labels yourself, or pass "auto" to label them automatically

Now let's say you don't want auto labels and want to name them specifically. We will set the labels to P, Q and R to name them accordingly.

plot_grid(P, Q, R, labels = c("P", "Q", "R"))

Now let's say you want all three graphs in one single strip. We will use the ncol argument to do that: three columns to fit the three graphs in one row.

plot_grid(P, Q, R, labels=c("P", "Q", "R"), ncol=3)

Now let's say you want to change the label size for all the graphs. For that you can use the label_size argument and modify the size accordingly.

plot_grid(P, Q, R, labels=c("P", "Q", "R"), ncol=3, label_size = 12)

How about plotting the three graphs vertically rather than horizontally? For that we will use the "nrow" argument instead of ncol.

plot_grid(P, Q, R, labels=c("P", "Q", "R"), nrow=3)

Hope this is helpful. If you have more specific questions, type them in the comments.

People often think of the "p-value" as a "probability"; they are related, but not completely the same.

Let's say you flip a coin two times:

On the first flip you will get either heads or tails, each with 50% probability.

On the second flip you will again get either heads or tails, each with 50% probability.

So, if the question now is

a. What is the probability of getting two heads in a row?

or

b. What is the p-value for getting two heads in a row?

Now, let's break down the questions:

a. What is the probability of getting two heads in a row?

After two flips there are four possible outcomes, each equally likely. Since they are equally likely, we can use the following formula to calculate the probability:

Number of times two heads occurred / total outcomes = 1/4 = 0.25

So, there is a 25% chance of getting two heads after two flips.

If the question is the probability of getting two tails, it will again be 1/4 = 0.25,

meaning a 25% chance of getting two tails.

What about one head and one tail, what is the likelihood?

2/4 = 0.50

So we have twice the chance of getting one head and one tail compared to two heads or two tails.

b. So what is the p-value for HH?

P-value definition: a p-value is the probability that random chance generated the data, or something else that is equal or rarer.

(The p-value ranges from 0 to 1; in statistics, if it is less than alpha (0.05) it is significant, and if it is more than alpha it is not significant.)

So, it consists of three parts:

1. The probability that random chance generated the data

This is the probability of getting two heads (HH) in our two random flips, which is 0.25.

2. Something else that is equal

Getting two tails (TT) is equally rare as getting two heads (HH), and is also 0.25.

3. Any rarer event

There is no event rarer than getting HH, so this is 0.

So, the p-value for getting two heads (HH) = 0.25 + 0.25 + 0 = 0.50, which is different from the probability of getting two heads, which is 0.25.
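The coin-flip arithmetic above is small enough to check by enumerating the outcomes directly:

```r
# The four equally likely outcomes of two coin flips
outcomes <- c("HH", "HT", "TH", "TT")

# Probability of HH: favorable outcomes over total outcomes
p_HH <- sum(outcomes == "HH") / length(outcomes)

# p-value for HH: probability of HH, plus equally rare outcomes (TT),
# plus rarer outcomes (there are none, so + 0)
pval_HH <- p_HH + sum(outcomes == "TT") / length(outcomes) + 0
```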

How about making this more complex?

It is easy to enumerate each outcome with a coin, but consider human heights: is it easy to list every possible outcome? Definitely not; how many decimal places would you need to record each height exactly? That is not practical. So instead we use a density plot, or distribution plot:

So, from the study "Height of nations: a socioeconomic analysis of cohort differences and patterns among women in 54 low- to middle-income countries,"

we find that the height of Brazilian women (between 15 and 49 years old), measured in 1996, mostly lies between

142 cm (4.6 ft) and 169 cm (5.5 ft).

So, the area under the curve shows the distribution, or the probability of someone having a height in a given range.

Let's break down and analyse the density plot:

1. 95% of the women have a height between 142 and 169 cm; in other words, there is a 95% probability that a randomly measured Brazilian woman's height will be between 142 and 169 cm.

2. There is a 2.5% chance that each time you measure a Brazilian woman's height it will be less than 142 cm.

3. There is a 2.5% chance that each time you measure a Brazilian woman's height it will be more than 169 cm.

So, what will be the p-value for someone who is 142 cm tall?

To calculate:

there is a 2.5% chance that someone will be 142 cm or shorter = 0.025 (the probability of the event)

there is a 2.5% chance that someone will be 169 cm or taller = 0.025 (the equally rare event at the other extreme)

there is no event rarer than that = 0

So, the p-value = 0.025 + 0.025 = 0.05

(so in statistics, it is significant)

How about: what is the p-value for someone who is between 155.4 and 156 cm tall?

To calculate:

the probability of the event, a person with a height between 155.4 and 156 cm, is 4% or 0.04

for the rarer or more extreme events: 48% of people are taller than 156 cm and 48% of people are shorter than 155.4 cm, which equals 0.48 + 0.48 = 0.96

So, the p-value = 0.04 + 0.96 = 1

So, in statistics it is not significant: measuring someone between 155.4 and 156 cm is not significant, even though the probability of the event itself is small.
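The tail probabilities above can be sketched with pnorm(), assuming heights are roughly normal. The mean and standard deviation below are back-calculated so that 142 cm and 169 cm sit near the 2.5th and 97.5th percentiles; they are assumptions for illustration, not values from the cited study:

```r
# Assumed normal fit: mean and sd chosen so ~95% of heights fall in
# the 142-169 cm range discussed above (NOT taken from the study)
mu    <- 155.5
sigma <- 6.9

p_short <- pnorm(142, mean = mu, sd = sigma)      # P(height <= 142), ~0.025
p_tall  <- 1 - pnorm(169, mean = mu, sd = sigma)  # P(height >= 169), ~0.025

# p-value for someone 142 cm tall: that tail plus the equally rare
# opposite tail
pval_142 <- p_short + p_tall
```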

Time series data are a sequence of data points listed in time order. Examples include daily temperature or precipitation for a year, or historical data spanning several years by month or by day. Time series are mostly used in econometrics, mathematical finance, weather forecasting, etc. Time series analysis and forecasting use different statistical methods and models to extract meaningful information or to predict future values from the data. A time series can be:

1. Regular time series: the data have a specific interval between observations.

2. Irregular time series: there is no fixed interval between observations.

Loading the libraries

library(ggplot2)
library(dplyr)

## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##     filter, lag
## The following objects are masked from 'package:base':
##     intersect, setdiff, setequal, union

library(lubridate)

## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##     date

library(xts)

## Loading required package: zoo
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##     as.Date, as.Date.numeric
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##     first, last

Data

I accessed the data from NOAA (the National Oceanic and Atmospheric Administration): 30 years of Clemson_Florence weather data. You can simply import it with the native RStudio import function; I used the variable FL to name the data frame.

You can see the Date is in factor format, so before we can use it for analysis we need to change it to a Date/POSIXct format, i.e., change the factor variable into a date variable.
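A sketch of the conversion, assuming the imported data frame is FL and its date column is called DATE (the names used later in the post); the dates here are placeholders:

```r
# Stand-in for the imported NOAA data: DATE arrives as a factor
FL <- data.frame(DATE = factor(c("1990-01-01", "1990-01-02")))

# Convert factor -> character -> POSIXct
FL$DATE <- as.POSIXct(as.character(FL$DATE), format = "%Y-%m-%d")
class(FL$DATE)
```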

Now you can see the DATE has been changed from factor to POSIXct format.

### Extracting the year and month from the Date variable

I used the mutate function to store the Year and Month in two new columns in the data frame, and named the new data frame FL2.
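A sketch of that mutate step with lubridate's year() and month() helpers; the two dates are placeholders for the real NOAA records:

```r
library(dplyr)
library(lubridate)

# Stand-in for the converted data frame
FL <- data.frame(DATE = as.POSIXct(c("1990-01-15", "1995-06-20")))

# Store the year and month in two new columns, saved as FL2
FL2 <- FL %>%
  mutate(Year  = year(DATE),
         Month = month(DATE))
```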

I grouped the data by year and then plotted the mean maximum temperature per year. I used the default regression model of the geom_smooth function, LOESS (local polynomial regression), to see the trend.
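A sketch of the grouping and plotting described above; Max_T_F is the maximum-temperature column used later in the post, and the tiny data frame here is a stand-in for the NOAA import:

```r
library(dplyr)
library(ggplot2)

# Stand-in for FL2 with a handful of made-up readings
FL2 <- data.frame(
  Year    = rep(1990:1993, each = 2),
  Max_T_F = c(70, 74, 71, 75, 69, 73, 72, 76)
)

# Yearly mean of maximum temperature, with a LOESS trend line
p <- FL2 %>%
  group_by(Year) %>%
  summarise(Mean_MaxT = mean(Max_T_F)) %>%
  ggplot(aes(Year, Mean_MaxT)) +
  geom_point() +
  geom_smooth()  # defaults to LOESS for small datasets
```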

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

You can see the x-axis labels overlap each other; you can simply rotate the text to solve the issue by running the code again with the angle argument of element_text inside theme.
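A generic sketch of that theme adjustment (shown here on a built-in dataset rather than the weather data); 45 degrees is a common choice:

```r
library(ggplot2)

# Rotate x-axis text via theme() and element_text() to avoid overlap
p <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```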

Let's say I want to filter the data for specific months, for example the growing season of a particular crop, which runs from May to September. I will filter the months from the dataset using the filter function of dplyr. I could even filter directly by extracting months from the DATE column with the month() function of lubridate, or we can use the Month column we created earlier with mutate.

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Data for the last 10 years

Let's say I now only want plots for the last 10 years, to see the changing temperature trend in the growing season from May to September. I will filter both Year and Month to get the data, and use the facet_wrap function to show each year separately.

FL2 %>%
  na.omit() %>%
  group_by(Year, Month) %>%
  filter(Month >= 5 & Month <= 9, Year >= 2008 & Year <= 2019) %>%
  summarise(Mean_MaxT = mean(Max_T_F)) %>%
  ggplot(aes(Month, Mean_MaxT)) +
  geom_line() +
  geom_point() +
  facet_wrap(~Year) +
  ylab("Mean Maximum Temperature (°F)")

Now look at the x-axis: the months appear as the numeric values 5 to 9, but I want to show them as words, so I will use the scale_x_continuous function with the labels argument to change that. It will look like:
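A sketch of that relabelling with scale_x_continuous(); a small stand-in data frame is used here in place of the full FL2 pipeline:

```r
library(ggplot2)

# Stand-in monthly means for the growing season (made-up values)
d <- data.frame(Month = 5:9, Mean_MaxT = c(85, 90, 93, 92, 87))

# Map the numeric breaks 5-9 to month names
p <- ggplot(d, aes(Month, Mean_MaxT)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 5:9,
                     labels = c("May", "Jun", "Jul", "Aug", "Sep"))
```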

Before going to the Trimmed Mean (or Truncated Mean), let's have a quick view of the most common terms used in descriptive and summary statistics (there are plenty of others):

Mean: the central (average) value of the dataset

Median: the value lying at the midpoint of the dataset

Mode: the value with the highest frequency (the most frequent value)

Range: the difference between the highest and smallest values of a dataset

Similar to the mean and median, the Truncated Mean or Trimmed Mean is also a measure of central tendency. It calculates the mean after discarding the samples at the extreme high and low ends (outliers). It is a robust statistical method.

So, let's say we have a dataset: (source: Wikipedia)

You can see the majority of the numbers (95%) lie between 15 and 92, while the negative values and 1053 are extreme outliers. So, let's see what we get if we calculate the mean vs the trimmed mean (with a 5% trim).
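The Wikipedia values are not reproduced here, so this sketch uses a made-up vector with one extreme outlier at each end to show the same effect, using R's built-in trim argument to mean():

```r
# Hypothetical data with extreme outliers at both ends
x <- c(-90, 5, 10, 15, 20, 25, 30, 40, 45, 1053)

mean(x)              # 115.3: dragged up by the large outlier
mean(x, trim = 0.1)  # 23.75: drops the lowest and highest 10% first
```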

So, what does this trim value mean? How do you choose the trim value?

If you choose a 20% trim (trim = 0.2) on a dataset of N = 20, then since 20% of 20 = 4, it will remove the four lowest values and the four highest values from the dataset; that is, the lowest 20% and the highest 20% of values.

People generally choose a 20% trim, but you can choose other trim values based on your data distribution.

What is the advantage of the truncated mean?

a. As it is less sensitive to outliers, it gives reasonable estimates of central tendency when the sample distribution is skewed or uneven.

b. The standard error of the trimmed mean is less affected by outliers.

What statistical tests can you use with the truncated mean?

You can use trimmed means instead of means in a t-test. However, the calculation of the standard error differs from the traditional t-test formula because the values are no longer independent after trimming. An adjusted standard error for the trimmed mean was originally proposed by Karen Yuen in 1974 and involves "winsorization". In winsorization, instead of removing observations as in trimming, we replace the extreme values with the nearest retained values. So for dataset M, the 20% winsorized sample will be:
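Since the winsorized sample for dataset M is not shown here, a hand-rolled sketch of 20% winsorization (not Yuen's full test) illustrates the idea: the lowest 20% of values are replaced by the smallest retained value and the highest 20% by the largest retained value:

```r
# Replace the lowest/highest trim fraction of values with the nearest
# retained value instead of dropping them
winsorize <- function(x, trim = 0.2) {
  x <- sort(x)
  k <- floor(length(x) * trim)
  if (k > 0) {
    x[1:k] <- x[k + 1]                                    # low end
    x[(length(x) - k + 1):length(x)] <- x[length(x) - k]  # high end
  }
  x
}

# Hypothetical data, same shape as the trimmed-mean example
w <- winsorize(c(-90, 5, 10, 15, 20, 25, 30, 40, 45, 1053))
```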

Karen K. Yuen. The two-sample trimmed t for unequal population variances, Biometrika, Volume 61, Issue 1, 1 April 1974, Pages 165–170, https://doi.org/10.1093/biomet/61.1.165