Proso Millet

It’s been long, I wanted to write something on the crop I am working on.

It is Proso millet.

Like all other grass family members, it is from Poaceae family. Genus is Panicum. Scientific name is Panicum miliaceum.

It is is mostly cultivated in USA, China, India, Niger, Nepal, Africa, Russia, Ukraine, Belarus, Middle East, Turkey, Poland, Germany, Switzerland, and Romania. Radioactive dating showed its origin was around 10,000 BP in China. It was introduced in the USA by German-Russian Immigrant in 1875.

clip_image001

Fig. 1 Countries growing proso millet around the world

Proso millet is the best suited crop in the semi-arid region of the world. where annual rainfall is pretty low. Here in the western Nebraska and the Central Great Plains where annual rainfall is < 100mm, proso millet stand out as an important rotational crop with winter wheat. High water use efficiency (WUE) with shallow root system (~90 – 92 cm root) helps it thrive and adapt only using the minimal water which is available through the spring rain. It also helps in conservation of subsurface water for the deep-rooted crop like corn, wheat, sunflower. Cultivating proso in the summer fallow also helps in controlling the annual grass weeds. Short growing season of proso millet also makes it ideal and helps it fit perfectly in the winter wheat – proso millet rotation.

clip_image004

Fig. 2 Map of Central Great Plains (Map Source: https://bit.ly/2EC6UXc)

Proso millet is also a miracle grain with multitudinous health benefits.

· It is gluten free,

· It has low glycemic index,

· It increases blood HDL level

· It has high antioxidant

· It has high mineral content

· Essential amino acid index is higher than wheat

Quinoa and Proso millet

Though its abundant health and environmental benefits, it is mostly used as bird seeds in USA. However, due to recent upsurge in health consciousness among people and rapid wave of ancient grains in market proso millet did get some attentions. However still far behind the imported ancient grain like “quinoa”. Though if you compare the nutritional profile of proso millet and quinoa, both are in neck to neck fight with similar carbohydrate, protein, vitamin, amino acids and mineral quantities. Both quinoa and proso millets are gluten free. Moreover, proso millet is locally grown seeds and increased market share can improve the local farm economy. While rapid increase in the export of quinoa has created imbalance in local economy and also impacting the ecosystem of the importing country due to conversion of forest lands to farmland to meet the demands. But, despite all the facts, proso millet is still struggling to get into the human food market where quinoa has established itself as a golden seed of health benefits.

New hybrid varieties with high nutritional value and extension regarding different prospects and perspectives of proso millet is the need of time to expand and develop new market for proso millet.

So, What we do ?

we are trying to develop new varieties of proso millet which will be more marketable and test the developed varieties in multi location for the environment and genotype interaction and try to select the best varieties which gives significant yield in different environment and locations.

proso millet

For Further Readings: Beyond Bird Feed: Proso Millet for Human Health and Environment

https://www.mdpi.com/2077-0472/9/3/64/htm

Odds Ratio

What is odd ratio ?

  • Odds ratio (OR) is statistical quantifier which measures the association between an exposure and an outcome. OR represents the odds of an outcome in present of particular exposure with comparison to odds in the absence of that exposure. In a simple definition you can how likely or probability of occurrence of an event in presence or absence of a particular exposure.
  • OR are mostly used in case control, cross sectional and cohort studies.

Where and When to use it?

  • OR is used to measure the relative odds of occurrence of an outcome especially disease or disorder in a given exposure. For an example, how likely the chance of occurrence of cancer in presence and absence of smoking as a factor.
  • OR can also be used to measure the risk factor for a particular outcome.

OR = 1 Exposure doesn’t affect the odds of outcome.

OR >1 Exposure associated with higher odds of outcome

OR <1 Exposure associated with lower odds of outcome

in rare outcomes OR = RR (relative risk) where the incidence of disease is < 10%

Then What is Confidence Interval (CI), which is always calculated with OR ?

  • 95% CI is used to estimate the precision of OR. A large CI indicates low level of precision for OR while a small CI indicates high precision of OR.
  • Remember, CI is not measure of statistical significance.
  • Its give you the range where your prediction will be in 95% of the time when you are estimating a population based on your sample. (CI is calculated with bootstrapping)

Now lets see, how you can calculate OR and CI using R :

Lets Say, There is a particular disease “X” and I want to check does it have any relationship with Demographic factor like Rural and Urban or simply does livelihood somehow effect the occurrence of the disease. So, I did sampling and collected information from both rural and urban population and it was like:

Positive Negative
Rural 65 55
Urban 46 34

So, all total I collected 200 people’s information, where 120 people were from rural area where 65 were found positive for the disease and 55 were negative. From urban I collected 80 samples and I found 46 were positive and 34 were negative for the disease. So, now lets do the OR test.

So, the question will be like what is the odds of having the disease for someone living in rural area and urban area ?

For this, I am going to use “epiR” package.

>install.packages(“epiR”)
>library(epiR)

#Lets create a matrix of 2 by 2 and give the dataset a name of X
X <- matrix(c(65, 55, 46, 34), nrow = 2, byrow = TRUE)
X

     [,1] [,2]
[1,]   65   55
[2,]   46   34

#Lets chane the row names and colnames as according to the example
rownames(X) <- c(“Positive”, “Negative”)
colnames(X) <- c(“Rural”, “Urban”)
X

Rural Urban
Positive    65    55
Negative    46    34

#OR test
epi.2by2(X, method = “cohort.count”) (I used cohort count, here you can use Wald test or Fischer test accordingly)

             Outcome +    Outcome -      Total        Inc risk *        Odds
Exposed +           65           55              120              54.2                 1.18
Exposed -           46           34                80              57.5                 1.35
Total                    111           89               200             55.5                 1.25

Point estimates and 95 % CIs:
-------------------------------------------------------------------
Inc risk ratio                                            0.94 (0.73, 1.21)
Odds ratio                                                 0.87 (0.49, 1.55)
Attrib risk *                                              -3.33 (-17.36, 10.70)
Attrib risk in population *                    -2.00 (-14.84, 10.84)
Attrib fraction in exposed (%)              -6.15 (-36.33, 17.34)
Attrib fraction in population (%)          -3.60 (-19.96, 10.52)
-------------------------------------------------------------------
 X2 test statistic: 0.216 p-value: 0.642
 Wald confidence limits
 * Outcomes per 100 population units

interpretation : so the odds ratio tells us that the odds are 0.87 times great that someone living in rural areas will have X disease compare to urban areas. However, p value is above o.05 which indicate the results are not statistically significant.

You can even calculate the same in classical way by dividing odds of each case lets say:

chances or odds of having someone the disease living in rural area will be = 65/55 = 1.18

and similarly for someone living in urban area will be = 46/34 = 1.35

so the odds ratio will be = 1.18 / 1.35 = 0.87

For further Read: You can checkout,

Szumilas, M. (2010). Explaining odds ratios. Journal of the Canadian academy of child and adolescent psychiatry, 19(3), 227. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/

Linear Model

Linear models describes a continuous variable (response/ dependent variable ~y) as a function of one or more predictor or explanatory variables (independent ~x). Linear regression is the statistical method used to create a linear model. It can help you to understand and predict the behavior of a complex system based on the predictor functions.

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (= value of y when x = 0).

There are several types of linear regression:

  • Simple linear regression: models using only one predictor
  • Multiple linear regression: models using multiple predictors
  • Multivariate linear regression: models for multiple response variables

We will consider simple linear regression in this article.

Lets say, children height and age are related, like higher the age height will be more. So, age is the explanatory variable (~x) which can define the height as a dependent variable (~y). So based on this you can make a model, where by knowing the age of a child you can predict the height of that child.

For analysis of a model, you need fit a line between the two or more variables. The line is generally fitted with sum of least square values. To define this, lets say,

For hypothetical dataset, there are two variables, income and life Expectancy for a dataset df.

Our hypothesis is that, income rate can effect the life expectancy of a person. Higher the income rate, higher will be the life expectancy. So, income is independent and Life expectancy is dependent variable.

So, lets do the modeling and will also explain the fitting of line while going through the model.

Creating the data frame:

df <- data.frame(Income = c(1000, 1500, 2000, 2500, 3000, 4000, 5000, 10000),
                  Life_Expectancy = c(55, 60, 65, 67, 69, 72, 76, 80))

image

So, first lets plot a scatter plot and understand how lines are fitted to your data, (we will be using ggplot2 package for this)

> ggplot(df, aes(Income, Life_Expectancy)) + geom_point(size = 5, col = “blue”) + theme_classic()

image

Now lets put a “line of best fit” – line of best fit is the straight line with least square method to see the trend.

But how to put a line, and where to put, horizontal, vertical ? how to find the optimal or best fit line ?

For starting, lets put a horizontal line on average value of y axis ( y = m = 68). To see how best fitted the line is, determine the distance of each data point from the line (which is known as residuals). For the first point it will be (m – y1), for the second data point (m – y2), third point will be (m – y3). Similarly you can find distance for each point from the horizontal line. But, here’s a catch, for the data points (y5, y6, y7, y8) which are above the horizontal line, they have greater values than the line. This will produce negative results, which is not good as it will substract the total value and show you a best fit line which is not true. So, for taking residuals, the values are squared before doing the summation. square ensures each term is positive. This is known as “Sum of Square Residuals”

so, the equation for “Sum of Square Residuals” will be:

image 

image

Lets say in this case, the sum is around = 30, and now again we have rotated the line and sum comes to  = 16; so, which will be the best fit? Its 16. the concept is that we have minimize the sum of squared values to minimize the value of intercept and slope to better predict the response. So, the line is fitted with “Least Squared Values”

fitting a line

Now, lets test how good fit our data and do the linear regression.

So, how to test it ?

We will calculate the coefficient of determination or R2

R2 = Explained variation of the model / total variation of the model

or [ss(total) – ss(res)]/ ss(total)

The areas of the blue squares represent the squared residuals with respect to the linear regression. The areas of the red squares represent the squared residuals with respect to the average value.

R^{2}=1-{\frac {\color {blue}{SS_{\text{res}}}}{\color {red}{SS_{\text{tot}}}}}

Now, for doing linear regression in R, you can use

lreg <- lm(Life_Expectancy ~ Income, data = df) #where lm is the linear model and Life_Expectancy (Dependent) and Income (Independent) are two variables.

summary(lreg) #to check the summary of our results

Call:
lm(formula = Life_Expectancy ~ Income, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-6.457 -3.000  1.427  2.685  4.573 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.896e+01  2.479e+00  23.783 3.63e-07 ***
Income      2.492e-03  5.484e-04   4.545  0.00391 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.19 on 6 degrees of freedom
Multiple R-squared:  0.7749,	Adjusted R-squared:  0.7374 
F-statistic: 20.66 on 1 and 6 DF,  p-value: 0.003913

The coefficient of determination (denoted by R2) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable. R2 value lies in between “0” and “1”.

  • An R2 of 0 means that the dependent variable cannot be predicted from the independent variable.
  • An R2 of 1 means the dependent variable can be predicted without any error from the independent variable.
    (or the response can be completely defined from the predictors values)
  • An R2 between 0 and 1 indicates the extent to which the dependent variable is predictable. An R2 of 0.10 means that 10 percent of the variance in Y is predictable from X; an R2 of 0.20 means that 20 percent is predictable; and so on.

“Our R2 value was 0.77, which means 77% response Life Expectancy (dependent variable) can be explained by Income value (Independent variable). So, there is a relationship of income rate and life expectancy, 77% of higher life expectancy are related to high income rate.

If yo plot our data with :

ggplot(df, aes(Income, Life_Expectancy)) + geom_point(size = 5, col = “blue”) + theme_classic() +
   geom_smooth(method = “lm”)

(This plot is with 95% CI value) if you want to remove, you have add one argument se = FALSE, like

image

ggplot(df, aes(Income, Life_Expectancy)) + geom_point(size = 5, col = “blue”) + theme_classic() +
   geom_smooth(method = “lm”, se = FALSE)

image

Next part, I will discuss more on this along with p value, adjusted r values. How to show values on your plot with R and all.

Reading:

1. https://www.datacamp.com/community/tutorials/linear-regression-R

2. https://stattrek.com/statistics/dictionary.aspx?definition=coefficient_of_determination

Importing Data into R/ R-studio

Importing data is the basic step before starting the analysis, but it can be frustrating sometimes.

So, lets see, how we can import our data into R- environment.

1. In R –studio, you can go to environment on the right side of the panel and click on the import dataset and then you can choose one out of the three options like: text to import the data (csv files), or you can choose excel to import the excel files or other statistical files (SAS, SPSS, Stata etc).

Screen_Shot_2018-10-31_at_9.24.22_PM.png

2. You can also go the file section of the R- studio on the top menu bar and select the import function to import the data.

Screen_Shot_2018-10-31_at_9.28.55_PM.png

3. You can also import your .txt file using the read.table() function.

df <- read.table(“file name”, header = FALSE)

as a default, header is always set at TRUE. header will show you the header with variables name.

4. You can read .csv file with read.table () or read.csv() or read.csv2()

Note: You could land yourself in trouble if you have saved your file in Byte Order Mark (BOM). if you have done this, then you have to add an extra argument “ fileEncoding = “UTF-8-BOM” to your function

for read.table() = you have to specify the separator character (for csv generally separators are “,” or “’;”

example,

df <- read.table(“file name”,
                  header = FALSE,
                  sep = “,”)

read.csv() = for file with “,” as separator

read.csv2() = for files with “;” as separator

df <- read.csv(“file name”,
                header = FALSE)

df <- read.csv2(“file name”,
                header= FALSE)

“ Note that if you get a warning message that reads like “incomplete final line found by readTableHeader”, you can try to go and “stand” on the cell that contains the last value (c in this case) and press ENTER. This will normally fix the warning because the message indicates that the last line of the file doesn’t end with an End Of Line (EOL) character, which can be a linefeed or a carriage return and linefeed. Don’t forget to save the file to make sure that your changes are saved!

Pro-Tip: use a text editor like NotePad to make sure that you add an EOL character without adding new rows or columns to your data.”

5. If your data file is separated with characters other than tab, comma and semicolon, you can use read.delim() or read.delim2()

df <- read.delim(“file name, sep=”$”)

df <- read.delim2(“file name”, sep=”$”)

You can get more info on other types of file: like xml, json from https://www.datacamp.com/community/tutorials/r-data-import-tutorial

Day 2 of 100 days of Code: Bar plot 2: legend and labels in ggplot

#Day 2 of 100days of code
#Lets improve the bar plots and look into the other features we can use
#I will use the irish database for the plots

data(“iris”)
View(iris)
head(iris)

 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

#I will use ggplot2, dplyr and reshape2 for plotting, filtering and data arrangements
#you can use the install.packages function to install the pacakges

install.packages(“ggplot2”)
install.packages(“dplyr”)
install.packages(“reshape2”)


#then load the library with library function

library(ggplot2)

library(dplyr)

library(reshape2)
#lets do some data reshaping with the reshape2 package
#I will use melt function with Species ID to arrange the data in column according to their species and variables

iris2 <- melt(iris, ID = “species”)
head(iris2)

Species     variable value
1  setosa Sepal.Length   5.1
2  setosa Sepal.Length   4.9
3  setosa Sepal.Length   4.7
4  setosa Sepal.Length   4.6
5  setosa Sepal.Length   5.0
6  setosa Sepal.Length   5.4

#lets say

I <- iris2


#plotting: will plot a bar graph where the values will be side by side or in dodge position

I %>% ggplot(aes(x = Species, y = value)) + geom_bar(position = “dodge”, stat = “identity”, aes(fill = variable))

image

#now lets say I am not happy with the legend position and I want it on the top of the plot

I %>% ggplot(aes(x = Species, y = value)) +
   geom_bar(position = “dodge”, stat = “identity”, aes(fill = variable)) +
   theme(legend.position = “top”)

image

#now lets say I am not happy with the legend position and I want it on the bottom of the plot

I %>% ggplot(aes(x = Species, y = value)) +
   geom_bar(position = “dodge”, stat = “identity”, aes(fill = variable)) +
   theme(legend.position = “bottom”)

image

#now lets say I am not happy with the legend position and I want it on some specified place(you can use coordinate to change the positon also)

I %>% ggplot(aes(x = Species, y = value)) +
   geom_bar(position = “dodge”, stat = “identity”, aes(fill = variable)) +
   theme(legend.position = c(0.1, 0.9))

image

#lets say I want to change the variable in legend to attributes

I %>% ggplot(aes(x = Species, y = value)) +
   geom_bar(position = “dodge”, stat = “identity”, aes(fill = variable)) +
   theme(legend.position = c(0.1, 0.9)) +
                                    labs(fill = “Attributes”)

image

#lets say i want to add plot tittle and change the x and y legends

I %>% ggplot(aes(x = Species, y = value)) +
   geom_bar(position = “dodge”, stat = “identity”, aes(fill = variable)) +
   theme(legend.position = c(0.1, 0.9)) +
   labs(title = “Iris Dataset”, fill = “Attributes”, x = “Flower species”, y = “Length and width parameter”)

image

#By default the title are place in left alignment in ggplot, lets say I want the title to be center placed

I %>% ggplot(aes(x = Species, y = value)) +
   geom_bar(position = “dodge”, stat = “identity”, aes(fill = variable)) +
   theme(legend.position = c(0.1, 0.9), plot.title = element_text(hjust = 0.5)) +
   labs(title = “Iris Dataset”, fill = “Attributes”, x = “Flower species”, y = “Length and width parameter”)

image

#you can also change the appearance of main title and axis labels by modifying the theme arguments under the element_text () function:

  • family : font family
  • face : font face. “plain”, “italic”, “bold” and “bold.italic”
  • color : text color
  • size : text size in pts
  • hjust : horizontal justification (in [0, 1])
  • vjust : vertical justification (in [0, 1])
  • line height : line height. In multi-line text, the line height argument is used to change the spacing between lines.
  • color : an alias for color

For example:

I %>% ggplot(aes(x = Species, y = value)) +
   geom_bar(position = “dodge”, stat = “identity”, aes(fill = variable)) +
   theme(legend.position = c(0.1, 0.9), plot.title = element_text(hjust = 0.5, size = 14, color = “red”, face = “bold”), axis.title.x = element_text(color = “red”)) +
   labs(title = “Iris Dataset”, fill = “Attributes”, x = “Flower species”, y = “Length and width parameter”)

image

#Lets say you want to plot a stack bar , you have to change the position from dodge to stack:

I %>% ggplot(aes(x = Species, y = value)) +
   geom_bar(position = “stack”, stat = “identity”, aes(fill = variable)) +
   theme(legend.position = c(0.1, 0.9), plot.title = element_text(hjust = 0.5, size = 14, color = “red”, face = “bold”), axis.title.x = element_text(color = “red”)) +
   labs(title = “Iris Dataset”, fill = “Attributes”, x = “Flower species”, y = “Length and width parameter”)

image

#play with your codes and have fun

Day 1 of 100days of code: Bar plot

#Blog post Feb 11,2019
#Day 1 of 100 ; 100 days code challenge
#I will try to plot some bar plots using dplyr and ggplot package and gapminder dataset
#installing the gapminder package for the gapminder dataset

install.packages(“gapminder”)

#loading the library and data

library(ggplot2) #as I already have installed the packages
library(dplyr)
library(gapminder)
data(“gapminder”)

#to view the headings of the data
head(gapminder)
#lets assign a varible for the gapminder data for ease of access

g <- gapminder

#lets view the head of the data

head(gapminder)

 A tibble: 396 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.

#Lets plot a bar chart for Life Expectancy for Asian countries

# %>% = pipe function can be used for augmenting multiple arguments in single line of code

g %>% filter(continent==”Asia” & year == 2007) %>% #we filtered the data only for Asia continent and year 2007 using filter function from dplyr pacakge
   head(5) %>%
   ggplot(aes(x= country, y =lifeExp)) +
   geom_bar(stat = “identity”, position = “identity”)

#the above code will produce something like this:

image

#lets twitch some of the codes: you can see the scale of axis y is only upto 60 years but the data goes beyond 70 years, so we want the axis to show the values upto 80, so what we need to do is; insert one more argument of scale_y_continuous, and we also want to change the label of y axis; now the code will look like this:

g %>% filter(continent==”Asia” & year == 2007) %>%
   head(5) %>%
   ggplot(aes(x= country, y =lifeExp)) +
   geom_bar(stat = “identity”, position = “identity”) +
   scale_y_continuous(name=”Life Expectancy”, limits=c(0, 80))

image

#Now lets say, I am not quite happy with the theme, I want something classical, so there is also a classical theme for the ggplot graphs, now the code will look like this:

g %>% filter(continent==”Asia” & year == 2007) %>%
   head(5) %>%
   ggplot(aes(x= country, y =lifeExp)) +
   geom_bar(stat = “identity”, position = “identity”) +
   scale_y_continuous(name=”Life Expectancy”, limits=c(0, 80)) +
  theme_classic()

image

#now say, I am not happy with color of the bars, I want to change them. So, what I need to do is , fill it with some different colors, you can use color function to see what are the options available for you.

colors()

 [1] "white"                "aliceblue"            "antiquewhite"         "antiquewhite1"       
  [5] "antiquewhite2"        "antiquewhite3"        "antiquewhite4"        "aquamarine"          
  [9] "aquamarine1"          "aquamarine2"          "aquamarine3"          "aquamarine4"         
 [13] "azure"                "azure1"               "azure2"               "azure3"              
 [17] "azure4"               "beige"                "bisque"               "bisque1" 

#now lets change the color of the plot, I really like red colors, so included one fill function in geom_bar argument

g %>% filter(continent==”Asia” & year == 2007) %>%
   head(5) %>%
   ggplot(aes(x= country, y =lifeExp)) +
   geom_bar(stat = “identity”, position = “identity”, fill = “red”) +
   scale_y_continuous(name=”Life Expectancy”, limits=c(0, 80)) +
   theme_classic()

image

#now lets say, you are not happy with the width of the bars, you need smaller bars in the plot

g %>% filter(continent==”Asia” & year == 2007) %>%
   head(5) %>%
   ggplot(aes(x= country, y =lifeExp)) +
   geom_bar(stat = “identity”, position = “identity”, fill = “red”, width = 0.5) +
   scale_y_continuous(name=”Life Expectancy”, limits=c(0, 80)) +
   theme_classic()

image

#now lets say, I want to see the 5 countries from Europe with their life Expectancy, so I will filter my data for Europe. and this time may be I will use something purple for the bars. so the codes will be like;

g %>% filter(continent==”Europe” & year == 2007) %>%
   head(5) %>%
   ggplot(aes(x= country, y =lifeExp)) +
   geom_bar(stat = “identity”, position = “identity”, fill = “orchid”, width = 0.5) +
   scale_y_continuous(name=”Life Expectancy”, limits=c(0, 80)) +
   theme_classic()

image

Thank you. Hope this will help you out.

Mountain Kanchenjunga and Kanchenjunga Demon

The white clips of Mountain Kanchenjunga (Photo Copyright@SauravDas)

Kanchenjunga, the third highest mountain of the world, lying in between Nepal and Sikkim (India). The clips are surrounded with mythical stories.

The valley of Kanchenjunga once said to be home to a mountain deity, called Dzo-nga (Kanchenjunga Demon). He is a yeti or big footed snowman. In, 1925. Tombazi, a Greek photographer who was taking part in a British geological expedition in the Himalayas, saw a big footed creature moving across some lower slopes. In his notes: He said the creature was like a human being. “It stopped occasionally to uproot or pull at some dwarf rhododendron bushes,” he said. “It showed up dark against the snow and, as far as I could make out wore no clothes.” He found 15 footprints before losing the trail in the snow. In 1951, two British mountaineers, Eric Shipton and Micheal Ward, photographed footprints, each one was about 13 inches wide and 18 inches long (https://bit.ly/2tGojYj). For generations, there have been legends recounted by the inhabitants of the areas surrounding Mount Kanchenjunga, both in Sikkim and in Nepal, that there is a valley of immortality hidden on its slopes. These stories are well known to both the original inhabitants of the area, the Lepcha people, and those of the Tibetan Buddhist cultural tradition. In Tibetan, this valley is known as Beyul Demoshong. In 1962 a Tibetan Lama by the name of Tulshuk Lingpa led over 300 followers into the high snow slopes of Kanchenjunga to ‘open the way’ to Beyul Demoshong. The story of this expedition is recounted in the 2011 book A Step Away from Paradise (Wikipedia).

Error-Series: Just to Plot a Single Graph–How Much Errors you can make

Learning is fun; Sometime its a pain too; So I thought, I will post all the trial and errors done by me in a error series posts. Though, when you finally come up with a plot, It feels good.

Once I was trying my best to plot a stack bar for microbiological data. and the codes and errors goes like this.  (This is that plot – that came after hundreds of errors; though the data frame was as simple as it could be)

Domain  Rice  Tea  Soil

Archaea value value value

image

> Domain.Abundance <- read.csv(“C:/Users/sdas4/Desktop/Host Dependency Paper/Domain Abundance.csv”)

> DM <- Domain.Abundance

> DM %>% ggplot(aes(y= “domain”)) + geom_bar()

Error: stat_count() must not be used with a y aesthetic.

> DM %>% ggplot(domain, aes()) + geom_bar()

Error in ggplot.default(., domain, aes()) : object ‘domain’ not found

> DM %>% ggplot(aes(domain)) + geom_bar()

> DM %>% ggplot(aes(x=”environment”)) + geom_bar(aes((color))

+ DM %>% ggplot(aes(x=”environment”)) + geom_bar(aes((color))

Error: unexpected symbol in:

“DM %>% ggplot(aes(x=”environment”)) + geom_bar(aes((color))

DM”

> DM %>% ggplot(aes(x=”environment”)) + geom_bar(aes((color))

+ DM %>% ggplot(aes(x=”environment”)) + geom_bar(aes((color)))

Error: unexpected symbol in:

“DM %>% ggplot(aes(x=”environment”)) + geom_bar(aes((color))

DM”

> DM %>% ggplot(aes(x=”environment”)) + geom_bar(aes(color = domain))

> DM %>% ggplot(aes(x=”environment”)) + geom_bar(position = “stack”)

> DM %>% ggplot(aes(x=domain)) + geom_bar(position = “stack”)

> DM %>% ggplot(aes(x=domain)) + geom_bar(position = “stack”, aes(color = “Environment”))

> DM %>% ggplot(aes(x=domain)) + geom_bar(position = “stack”, aes(color = “Environment”))

Error in FUN(X[[i]], …) : object ‘domain’ not found

> DM %>% ggplot(aes(Environment)) + geom_bar(position = “stack”, aes(color = “Environment”))

> DM %>% ggplot(aes(Environment)) + geom_bar(position = “stack”, aes())

> DM %>% ggplot(aes(x = “Environment”)) + geom_bar(aes(fill = c(“Archaea”, “Bacteria”, “Eukaryota”, “Viruses” “other.sequences”))

Error: unexpected string constant in “DM %>% ggplot(aes(x = “Environment”)) + geom_bar(aes(fill = c(“Archaea”, “Bacteria”, “Eukaryota”, “Viruses” “other.sequences””

> )))

Error: unexpected ‘)’ in “)”

> DM %>% ggplot(aes(Environment)) + geom_bar(aes(fill = c(“Archaea”, “Bacteria”, “Eukaryota”, “Viruses” “other.sequences”)))

Error: unexpected string constant in “DM %>% ggplot(aes(Environment)) + geom_bar(aes(fill = c(“Archaea”, “Bacteria”, “Eukaryota”, “Viruses” “other.sequences“”

> DM <- Domain.Abundance

> DM %>% ggplot(aes(Environment)) + geom_bar(aes(fill = c(“Archaea”, “Bacteria”, “Eukaryota”, “Viruses” “other.sequences”)))

Error: unexpected string constant in “DM %>% ggplot(aes(Environment)) + geom_bar(aes(fill = c(“Archaea”, “Bacteria”, “Eukaryota”, “Viruses” “other.sequences””

> DM %>% ggplot(aes(Environment)) + geom_bar(aes(color = c(“Archaea”, “Bacteria”, “Eukaryota”, “Viruses” “other.sequences”)))

Error: unexpected string constant in “DM %>% ggplot(aes(Environment)) + geom_bar(aes(color = c(“Archaea”, “Bacteria”, “Eukaryota”, “Viruses” “other.sequences””

> DM %>% ggplot(aes(Environment)) + geom_bar(aes(color = c(“Archaea”, “Bacteria”, “Eukaryota”, “Viruses”, “other.sequences”)))

Error: Aesthetics must be either length 1 or the same as the data (3): colour

> DM %>% ggplot(Environment, aes()) + geom_bar(stat = “identity”, position = “fill”)

Error in ggplot.default(., Environment, aes()) :

object ‘Environment’ not found

> DM %>% ggplot(“Environment”, aes()) + geom_bar(stat = “identity”, position = “fill”)

Error: Mapping should be created with `aes() or `aes_()`.

> install.packages(“reshape2”)

> library(reshape2)

Warning message:

package ‘reshape2’ was built under R version 3.5.2

> mytable <- melt(DM, id.vars = c(“Environment”), value.name = “proportion”)

> View(mytable)

> mytable %>% ggplot() + geom_bar(aes(x = “Environment” y = proportion, fill = variable, stat = “identity”))

Error: unexpected symbol in “mytable %>% ggplot() + geom_bar(aes(x = “Environment” y”

> mytable %>% ggplot() + geom_bar(aes(x = “Environment”, y = proportion, fill = variable, stat = “identity”))

Warning: Ignoring unknown aesthetics: stat

Error: stat_count() must not be used with a y aesthetic.

> ggplot() + geom_bar(aes(mytable$Environment,mytable$proportion, fill = mytable$variable, data = mytable, stat = “identity”))

Warning: Ignoring unknown aesthetics: data, stat

Don’t know how to automatically pick scale for object of type data.frame. Defaulting to continuous.

Error: Aesthetics must be either length 1 or the same as the data (15): data

> View(mytable)

> ggplot(mytable, aes(x = Environment, y= proportion, fill= variable)) + geom_bar()

Error: stat_count() must not be used with a y aesthetic.

> ggplot(mytable, aes(x = Environment, y= proportion, fill= variable)) + geom_bar(aes(stat = “identity”))

Warning: Ignoring unknown aesthetics: stat

Error: stat_count() must not be used with a y aesthetic.

> ggplot(mytable, aes(Environment, proportion, fill= variable)) + geom_bar(aes(stat = “identity”))

Warning: Ignoring unknown aesthetics: stat

Error: stat_count() must not be used with a y aesthetic.

> ggplot(mytable, aes(x = “Environment”, y = “proportion”)) + geom_bar(aes(stat = “identity”))

Warning: Ignoring unknown aesthetics: stat

Error: stat_count() must not be used with a y aesthetic.

> ggplot(mytable, aes(x = “Environment”, y = “proportion”)) + geom_bar(aes(stat = “identity”, fill = variable))

Warning: Ignoring unknown aesthetics: stat

Error: stat_count() must not be used with a y aesthetic.

> P <- ggplot(mytable, aes())

> P + geom_bar()

Error: stat_count requires the following missing aesthetics: x

> P + geom_bar(aes(x = Environment))

> P + geom_bar(aes(x = Environment, fill = variable))

> P + geom_bar(aes(x = Environment, fill = variable, ylab = “abundance”))

Warning: Ignoring unknown aesthetics: ylab

Error: stat_count() must not be used with a y aesthetic.

> P + geom_bar(aes(x = Environment, fill = variable, ylab = abundance))

Warning: Ignoring unknown aesthetics: ylab

Error in FUN(X[[i]], …) : object ‘abundance’ not found

> P + geom_bar(aes(x = Environment, fill = variable))

> myTable <- prop.table(mytable, 1)

Error in margin.table(x, margin) : ‘x’ is not an array

> myTable <- prop.table(mytable)

Error in FUN(X[[i]], …) :

only defined on a data frame with all numeric variables

> myTable <- prop.table(mytable, 3)

Error in margin.table(x, margin) : ‘x’ is not an array

> myTable <- prop.table(mytable)

Error in FUN(X[[i]], …) :

only defined on a data frame with all numeric variables

> myTable <- prop.table(mytable$proportion)

> View(myTable)

> myTable <- prop.table(mytable$Environment, mytable$proportion)

Error in margin.table(x, margin) : ‘x’ is not an array

> mytable %>% mutate(perentage = proportion/mean(proportion)*100)

Environment variable proportion perentage

1 Tea Archaea 6 3.937542e-02

2 Rice Archaea 0 0.000000e+00

3 Soil Archaea 1622 1.064449e+01

> P + geom_bar(aes(x = Environment, y= percentage, fill = variable))

Error in FUN(X[[i]], …) : object ‘percentage’ not found

> mytable %>% mutate(perentage = proportion/mean(proportion)*100)

Environment variable proportion perentage

1 Tea Archaea 6 3.937542e-02

2 Rice Archaea 0 0.000000e+00

3 Soil Archaea 1622 1.064449e+01

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)

Environment variable proportion percentage

1 Tea Archaea 6 3.937542e-02

2 Rice Archaea 0 0.000000e+00

3 Soil Archaea 1622 1.064449e+01

> P + geom_bar(aes(x = Environment, y= percentage, fill = variable))

Error in FUN(X[[i]], …) : object ‘percentage’ not found

> P + geom_bar(aes(x = Environment, y= percentage, fill = variable))

Error in FUN(X[[i]], …) : object ‘percentage’ not found

> P + geom_bar(aes(x = Environment, y= “percentage”, fill = variable))

Error: stat_count() must not be used with a y aesthetic.

> P <- ggplot(mytable, aes(Environment, percentage))

> P + geom_bar(aesfill = variable))

Error: unexpected ‘)’ in “P + geom_bar(aesfill = variable))”

> P + geom_bar(aes(fill = variable))

Error in FUN(X[[i]], …) : object ‘percentage’ not found

> View(mytable)

> mytable %>% mutate(percentage = proportion/mean(proportion)*100) %>% ggplot(aes(Environment, percentage)) + geom_bar(aes(fill = variable))

Error: stat_count() must not be used with a y aesthetic.

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)

Environment variable proportion percentage

1 Tea Archaea 6 3.937542e-02

2 Rice Archaea 0 0.000000e+00

3 Soil Archaea 1622 1.064449e+01

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(Environment, percentage, color = variable)) + geom_bar()

Error: stat_count() must not be used with a y aesthetic.

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(Environment, color = variable)) + geom_bar()

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot() + geom_bar(aes(x = “Environment”, y = “percentage”))

Error: stat_count() must not be used with a y aesthetic.

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(x=Environment, y=percentage)) + geom_bar(stat = “identity”, position = “stack”)

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(x=Environment, y=percentage, color = variable)) + geom_bar(stat = “identity”, position = “stack”)

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(x=Environment, y=percentage, fill = variable)) + geom_bar(stat = “identity”, position = “stack”)

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(x=Environment, y=percentage, fill = variable, ylab(percetage of abundance))) + geom_bar(stat = “identity”, position = “stack”)

Error: unexpected symbol in “mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(x=Environment, y=percentage, fill = variable, ylab(percetage of”

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(x=Environment, y=percentage, fill = variable)) + geom_bar(stat = “identity”, position = “stack”) + labs(y= “Percentage of Abundance”)

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(x=Environment, y=proportion, fill = variable)) + geom_bar(stat = “identity”, position = “stack”) + labs(y= “Percentage of Abundance”)

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(x=Environment, y=proportion, fill = variable)) + geom_bar(stat = “identity”, position = “stack”) + labs(y= “Abundance”)

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(x=Environment, y=proportion, fill = variable)) + geom_bar(stat = “identity”, position = “stack”) + labs(y= “Abundance”, colour = “Domain”)

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(x=Environment, y=proportion, fill = variable)) + geom_bar(stat = “identity”, position = “stack”) + labs(y= “Abundance”, legend = “Domain”)

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(x=Environment, y=proportion, fill = variable)) + geom_bar(stat = “identity”, position = “stack”) + labs(y= “Abundance”, Colour = “Domain”)

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(x=Environment, y=proportion, fill = variable, trans = log10(proportion))) + geom_bar(stat = “identity”, position = “stack”) + labs(y= “Abundance”, fill = “Domain”)

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(x=Environment, y=log10(proportion), fill = variable, )) + geom_bar(stat = “identity”, position = “stack”) + labs(y= “Abundance”, fill = “Domain”)

Warning message:

Removed 4 rows containing missing values (geom_bar).

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(x=Environment, y=log10(proportion), fill = variable, )) + geom_bar(stat = “identity”, position = “stack”) + labs(y= “Abundance(log transformed)”, fill = “Domain”)

The successful code:

> mytable %>% mutate(percentage = proportion/mean(proportion)*100)%>% ggplot(aes(x=Environment, y=log10(proportion), fill = variable, )) + geom_bar(stat = “identity”, position = “stack”) + labs(y= “Abundance (log10 transformed)”, fill = “Domain”)

Combining Data Frames with join function from”dplyr”

In real life, the data frames you want to work on often comes as separate files and sometime you want to combine them to do your analysis, but often we face lots of trouble in merging the different data sets, lets see how we can use the “dplyr’ package to merge the data sets in your way

Lets create some data sets to try this:

# Data frame A
A <- data.frame(ID = c(“A”, “B”, “C”, “D”), Y = 1:4)
View(A)
#Data frame B
B <- data.frame(ID = c(“A”, “C”, “D”, “E”), Y = 5:8)
View(B)

Now lets join this two data frame with the join function from “dplyr” package:

#loading the library

library (dplyr)

#Lets call the new data set as “C” and we will do the full join by their “ID”

C <- full_join (A, B, by = “ID”)

# so now the two data sets are joined with their ID, the join function will give you a warning message like:

The joining function will give you a warning message like:

“Warning message:
Column ID joining factors with different levels, coercing to character vector”

#as there are some element which are not in the either of data sets and they are being filled with NA as a default function of coercing from R. the new data set will look like:

ID Y.X Y.Y
1 A 1 5
2 B 2 NA
3 C 3 6
4 D 4 7
5 E NA 8

#Now lets say you want to remove all the NA’s in the new data set “C” with “Zero”

for this;

C[is.na(C)] <- 0
View(C)

This function will replace all the NA’s with zero.

Understanding your data – part 1

To understand the data structure and type, summary statistics and visualization of the distribution of data is much important

For basic, A data can be of two types, “Categorical/Qualitative” or “Numerical/Quantative”

Categorical can be of two types – a. Ordinal or b. Non-ordinal

Numerical data can be of two types – a. Discrete or b. Continuous

For an example, let say we measure the plants height from a experimental field which is accordingly:

P <- c(10.2, 10.5, 11.5, 12.6, 14.3, 12.8, 9.4, 13.2, 15.6, 13.9, 14.8, 16.2, 12.5)

So, if I ask, what is the data type ?

its obvious its numerical, but is it discrete or continuous ?

So, for discrete data: that can only takes certain values, like number of students in a class : it may be 6 or 7 or 9

but for continuous data, they can take any value over a range, for an example, the height of the student of the class:

that may be like: 6.6, 6.7,6.8,6.9 so on

Thus, the above data frame is an example of continuous value

if you want to summarise the data with two values we can take mean and the standard deviation for the above data frame:

for calculating mean:

mean(P)


12.88462

For standard deviation:

sd(P)


2.088798

So, you can now tell the average height of the (n = 13) plant samples was 12.88 with standard deviation of 2.088

or you can also use the summary function to understand the data range like:

summary(P)

which will provide the following results
Min. 1st Qu. Median Mean 3rd Qu. Max.

9.40 11.50 12.80 12.88 14.30 16.20

So, the data is ranging with a minimum of 9.4 to maximum 16.2, a median of 12.80 and mean of 12.88

You can also visualize the distribution of data with histogram:

A histogram is a representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable (quantitative variable). It differs from a bar graph, in the sense that a bar graph relates two variables, but a histogram relates only one. ” (Source: Wikipedia)

For drawing a histogram:

we can put,

hist(P, col = “grey”)

hist function will draw the histogram and col argument will fill the columns with grey color