It's been a while, so I wanted to write something about the crop I am working on: proso millet.
Like all other members of the grass family, it belongs to the family Poaceae; the genus is Panicum, and the scientific name is Panicum miliaceum.
It is mostly cultivated in the USA, China, India, Niger, Nepal, Russia, Ukraine, Belarus, the Middle East, Turkey, Poland, Germany, Switzerland, Romania, and parts of Africa. Radiocarbon dating places its origin around 10,000 BP in China. It was introduced to the USA by German-Russian immigrants in 1875.
Fig. 1 Countries growing proso millet around the world
Proso millet is one of the best-suited crops for the semi-arid regions of the world, where annual rainfall is low. Here in western Nebraska and the Central Great Plains, where annual rainfall is < 100 mm, proso millet stands out as an important rotational crop with winter wheat. Its high water use efficiency (WUE) and shallow root system (~90–92 cm) help it thrive using only the minimal water available from spring rain. It also helps conserve subsurface water for deep-rooted crops like corn, wheat, and sunflower. Cultivating proso in the summer fallow also helps control annual grass weeds, and its short growing season makes it fit perfectly into the winter wheat – proso millet rotation.
Proso millet is also a miracle grain with a multitude of health benefits:
· It is gluten free
· It has a low glycemic index
· It increases blood HDL levels
· It is high in antioxidants
· It has a high mineral content
· Its essential amino acid index is higher than that of wheat
Quinoa and Proso millet
Despite its abundant health and environmental benefits, it is mostly used as bird seed in the USA. With the recent upsurge in health consciousness and the rapid wave of ancient grains in the market, proso millet has gotten some attention, but it still lags far behind imported ancient grains like "quinoa". Yet if you compare the nutritional profiles of proso millet and quinoa, the two are neck and neck, with similar carbohydrate, protein, vitamin, amino acid, and mineral content, and both are gluten free. Moreover, proso millet is locally grown, so an increased market share can improve the local farm economy, while the rapid increase in quinoa exports has unbalanced local economies and impacted ecosystems in the producing countries, where forest land is converted to farmland to meet demand. Despite all this, proso millet is still struggling to enter the human food market, where quinoa has established itself as a golden seed of health benefits.
New hybrid varieties with high nutritional value, along with extension work on the different prospects and perspectives of proso millet, are the need of the hour to expand and develop new markets for it.
So, what do we do?
We are trying to develop new, more marketable varieties of proso millet, test them at multiple locations for genotype-by-environment interaction, and select the varieties that yield well across different environments and locations.
The odds ratio (OR) is a statistical measure of the association between an exposure and an outcome. The OR represents the odds of an outcome occurring in the presence of a particular exposure, compared with the odds in the absence of that exposure. Simply put, it tells you how likely an event is to occur in the presence versus the absence of a particular exposure.
ORs are mostly used in case-control, cross-sectional, and cohort studies.
Where and When to use it?
The OR is used to measure the relative odds of an outcome, especially a disease or disorder, given an exposure. For example: how likely is cancer to occur in the presence versus the absence of smoking?
The OR can also be used to measure the risk associated with a particular outcome.
OR = 1: the exposure does not affect the odds of the outcome.
OR > 1: the exposure is associated with higher odds of the outcome.
OR < 1: the exposure is associated with lower odds of the outcome.
For rare outcomes, where the incidence of disease is < 10%, the OR approximates the relative risk (RR).
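To make the definition concrete, here is a minimal sketch with a hypothetical 2×2 exposure-outcome table (the counts below are invented for illustration, not data from any real study):

```r
# Hypothetical 2x2 table of counts (made up for illustration):
#              Outcome +   Outcome -
# Exposed          20          80
# Unexposed        10          90
exposed_pos   <- 20; exposed_neg   <- 80
unexposed_pos <- 10; unexposed_neg <- 90

odds_exposed   <- exposed_pos / exposed_neg       # odds of the outcome when exposed: 0.25
odds_unexposed <- unexposed_pos / unexposed_neg   # odds when unexposed: ~0.11

or <- odds_exposed / odds_unexposed               # odds ratio = (a*d) / (b*c)
or                                                # 2.25: higher odds with exposure
```

Since the OR here is greater than 1, this (hypothetical) exposure would be associated with higher odds of the outcome.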
Then what is the confidence interval (CI), which is always reported along with the OR?
The 95% CI estimates the precision of the OR: a wide CI indicates low precision, while a narrow CI indicates high precision.
Remember, the CI is not a measure of statistical significance.
It gives you the range that will contain the true population value 95% of the time when you estimate it from a sample. (CIs can also be estimated by bootstrapping.)
Now let's see how you can calculate the OR and CI using R.
Let's say there is a particular disease "X" and I want to check whether it has any relationship with a demographic factor such as living in a rural versus an urban area; simply put, does livelihood somehow affect the occurrence of the disease? So I sampled and collected information from both rural and urban populations, and it looked like this:
In total I collected information from 200 people: 120 from the rural area, of whom 65 tested positive for the disease and 55 negative, and 80 from the urban area, of whom 46 were positive and 34 negative. So now let's run the OR test.
So the question is: what are the odds of having the disease for someone living in a rural area compared with an urban area?
For this, I am going to use the "epiR" package.
# Let's create a 2-by-2 matrix and give the dataset the name X
library(epiR)
X <- matrix(c(65, 55, 46, 34), nrow = 2, byrow = TRUE)
X
     [,1] [,2]
[1,]   65   55
[2,]   46   34
# Let's set the row and column names according to the example
# (rows = exposure groups, columns = outcome status, the layout epi.2by2 expects)
rownames(X) <- c("Rural", "Urban")
colnames(X) <- c("Positive", "Negative")
X
      Positive Negative
Rural       65       55
Urban       46       34
epi.2by2(X, method = "cohort.count") # I used cohort.count here; other method options include case.control and cross.sectional
Outcome + Outcome - Total Inc risk * Odds
Exposed + 65 55 120 54.2 1.18
Exposed - 46 34 80 57.5 1.35
Total 111 89 200 55.5 1.25
Point estimates and 95 % CIs:
Inc risk ratio 0.94 (0.73, 1.21)
Odds ratio 0.87 (0.49, 1.55)
Attrib risk * -3.33 (-17.36, 10.70)
Attrib risk in population * -2.00 (-14.84, 10.84)
Attrib fraction in exposed (%) -6.15 (-36.33, 17.34)
Attrib fraction in population (%) -3.60 (-19.96, 10.52)
X2 test statistic: 0.216 p-value: 0.642
Wald confidence limits
* Outcomes per 100 population units
Interpretation: the odds ratio tells us that the odds of having disease X for someone living in a rural area are 0.87 times those for someone living in an urban area. However, the p-value is above 0.05, which indicates the result is not statistically significant.
You can also calculate the same thing the classical way by dividing the odds for each group:
the odds of having the disease for someone living in the rural area = 65/55 ≈ 1.18,
the odds for someone living in the urban area = 46/34 ≈ 1.35,
and the odds ratio = 1.18/1.35 ≈ 0.87.
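Putting the classical calculation together, the epi.2by2 numbers above can be reproduced in base R with a standard Wald interval on the log odds ratio. This is just a sketch of that textbook formula (no packages required):

```r
# Counts from the example: rural 65 positive / 55 negative,
# urban 46 positive / 34 negative
rural_pos <- 65; rural_neg <- 55
urban_pos <- 46; urban_neg <- 34

# Odds of disease in each group, and their ratio
odds_rural <- rural_pos / rural_neg            # ~1.18
odds_urban <- urban_pos / urban_neg            # ~1.35
or <- odds_rural / odds_urban                  # ~0.87, matching epi.2by2

# Wald 95% CI: normal approximation on the log-odds-ratio scale
se_log_or <- sqrt(1/rural_pos + 1/rural_neg + 1/urban_pos + 1/urban_neg)
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log_or)
round(ci, 2)                                   # ~ (0.49, 1.55)
```

The interval contains 1, which is consistent with the non-significant p-value reported above.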
Linear models describe a continuous variable (the response or dependent variable, y) as a function of one or more predictor or explanatory variables (the independent variables, x). Linear regression is the statistical method used to fit a linear model, and it can help you understand and predict the behavior of a complex system from its predictors.
A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).
There are several types of linear regression:
Simple linear regression: models using only one predictor
Multiple linear regression: models using multiple predictors
Multivariate linear regression: models for multiple response variables
We will consider simple linear regression in this article.
Let's say children's height and age are related: the higher the age, the greater the height. Age is then the explanatory variable (x) that defines height as the dependent variable (y). Based on this you can build a model in which, knowing a child's age, you can predict that child's height.
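As a minimal sketch of this idea (the ages and heights below are invented for illustration, not real measurements), such a model can be fitted and used for prediction in R:

```r
# Made-up ages (years) and heights (cm), for illustration only
age    <- c(2, 4, 6, 8, 10, 12)
height <- c(86, 102, 115, 128, 138, 149)

fit <- lm(height ~ age)   # model height as a linear function of age
coef(fit)                 # intercept and slope of the fitted line

# Predict the height of a hypothetical 7-year-old
pred7 <- predict(fit, newdata = data.frame(age = 7))
pred7                     # ~119.7 cm for this invented data
```

The positive slope reflects the assumption that height increases with age over this range.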
To analyze a model, you need to fit a line through the variables; the line is generally fitted by the method of least squares. To illustrate, let's say
for a hypothetical dataset df there are two variables, income and life expectancy.
Our hypothesis is that income affects a person's life expectancy: the higher the income, the higher the life expectancy. So income is the independent variable and life expectancy is the dependent variable.
Let's do the modeling, and I will explain the fitting of the line as we go through the model.
Now let's add a "line of best fit" – the straight line, found by the least-squares method, that shows the trend.
But how do we place the line, and where: horizontal, tilted? How do we find the optimal, best-fit line?
To start, let's draw a horizontal line at the average value of the y axis (y = m = 68). To see how well the line fits, determine the distance of each data point from the line (these distances are known as residuals). For the first point it is (m − y1), for the second (m − y2), for the third (m − y3), and so on for every point. But here's a catch: the data points (y5, y6, y7, y8) that lie above the horizontal line have values greater than the line's, so their differences are negative. That is a problem, because they would subtract from the total and make a poorly fitting line look good. So the residuals are squared before summation; squaring ensures each term is positive. The result is known as the "sum of squared residuals".
So the equation for the sum of squared residuals is: SSres = (m − y1)² + (m − y2)² + ⋯ + (m − yn)².
Let's say in this case the sum comes to about 30; we then rotate the line and the sum comes to 16. So which is the better fit? The one with 16. The idea is that we minimize the sum of squared residuals, adjusting the intercept and slope to better predict the response. This is why the fitted line is said to be found by "least squares".
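The steps above can be sketched in base R (again with a small made-up data set): the sum of squared residuals around a horizontal line at the mean of y is never smaller than the one around the least-squares line that lm() finds.

```r
# Small made-up data set, for illustration only
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(62, 64, 66, 67, 69, 70, 72, 74)

# Candidate 1: a horizontal line at the mean of y
ssr_mean <- sum((y - mean(y))^2)

# Candidate 2: the least-squares line fitted by lm()
fit <- lm(y ~ x)
ssr_fit <- sum(residuals(fit)^2)

ssr_mean   # residual sum of squares for the flat line
ssr_fit    # smaller: lm() minimizes the sum of squared residuals
```

Rotating the line toward the trend in the data always reduces the squared-residual total, and lm() finds the rotation (slope) and height (intercept) that make it as small as possible.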
Now let's test how well the line fits our data and do the linear regression.
So, how do we test it?
We will calculate the coefficient of determination, or R².
R² = explained variation of the model / total variation of the model,
or R² = [SS(total) − SS(res)] / SS(total).
The areas of the blue squares represent the squared residuals with respect to the linear regression. The areas of the red squares represent the squared residuals with respect to the average value.
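As a quick sanity check of this formula (with a small made-up data set, not the df from the example), you can compute R² by hand in base R and compare it with what summary() reports:

```r
# Made-up data for illustration only
x <- c(1, 2, 3, 4, 5, 6)
y <- c(3, 5, 6, 9, 10, 13)

fit <- lm(y ~ x)
ss_total <- sum((y - mean(y))^2)    # total variation around the mean
ss_res   <- sum(residuals(fit)^2)   # variation left over after the fit

r2_manual <- (ss_total - ss_res) / ss_total
all.equal(r2_manual, summary(fit)$r.squared)  # the two agree
```

This is exactly the ratio of the red-square area explained by the fit to the total red-square area in the picture above.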
Now, to do linear regression in R, you can use:
lreg <- lm(Life_Expectancy ~ Income, data = df) # lm fits the linear model; Life_Expectancy (dependent) and Income (independent) are the two variables
summary(lreg) # to check the summary of our results
Call:
lm(formula = Life_Expectancy ~ Income, data = df)

Residuals:
   Min     1Q Median     3Q    Max
-6.457 -3.000  1.427  2.685  4.573

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.896e+01  2.479e+00  23.783 3.63e-07 ***
Income      2.492e-03  5.484e-04   4.545  0.00391 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.19 on 6 degrees of freedom
Multiple R-squared: 0.7749, Adjusted R-squared: 0.7374
F-statistic: 20.66 on 1 and 6 DF, p-value: 0.003913
The coefficient of determination (denoted R²) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable. The R² value lies between 0 and 1.
An R2 of 0 means that the dependent variable cannot be predicted from the independent variable.
An R2 of 1 means the dependent variable can be predicted without any error from the independent variable.
(that is, the response can be completely determined from the predictor values)
An R2 between 0 and 1 indicates the extent to which the dependent variable is predictable. An R2 of 0.10 means that 10 percent of the variance in Y is predictable from X; an R2 of 0.20 means that 20 percent is predictable; and so on.
Our R² value was 0.77, which means that 77% of the variation in life expectancy (the dependent variable) can be explained by income (the independent variable). So there is a relationship between income and life expectancy.
Importing data is the first step of any analysis, but it can be frustrating at times.
So let's see how we can import our data into the R environment.
1. In RStudio, you can go to the Environment pane on the right, click "Import Dataset", and choose one of the options: Text (for CSV files), Excel (for Excel files), or other statistical formats (SAS, SPSS, Stata, etc.).
2. You can also go to the File menu at the top of RStudio and select the import function to import the data.
3. You can import a .txt file using the read.table() function.
By default, header is set to TRUE, which tells R that the first row contains the variable names.
4. You can read a .csv file with read.table(), read.csv(), or read.csv2().
Note: you could land in trouble if your file was saved with a Byte Order Mark (BOM). If so, you have to add the extra argument fileEncoding = "UTF-8-BOM" to your function.
For read.table(), you have to specify the separator character (for CSV files the separator is generally "," or ";").
"Note that if you get a warning message that reads like "incomplete final line found by readTableHeader", you can try to go and "stand" on the cell that contains the last value (c in this case) and press ENTER. This will normally fix the warning, because the message indicates that the last line of the file doesn't end with an End Of Line (EOL) character, which can be a line feed or a carriage return plus line feed. Don't forget to save the file to make sure your changes are kept!
Pro tip: use a text editor like Notepad to make sure you add an EOL character without adding new rows or columns to your data."
5. If your data file is separated by a character other than tab, comma, or semicolon, you can use read.delim() or read.delim2().
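As a self-contained sketch of points 3–5 (it writes a tiny CSV to a temporary file instead of assuming any real data file on your machine), the functions above can be exercised like this:

```r
# Write a tiny CSV to a temporary file, then read it back two ways
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score", "a,1", "b,2", "c,3"), tmp)

df1 <- read.csv(tmp)                             # comma-separated, header = TRUE by default
df2 <- read.table(tmp, header = TRUE, sep = ",") # same result via read.table
# read.csv2() / read.delim2() would instead expect ";" and tab separators

str(df1)  # 3 observations of 2 variables: id, score
```

Both calls produce the same data frame here; the difference is only in their default separator and decimal settings.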
# Blog post, Feb 11, 2019
# Day 1 of 100; 100-day code challenge
# I will try to plot some bar charts using the dplyr and ggplot2 packages and the gapminder dataset
# Install the gapminder package for the gapminder dataset

# Load the libraries and the data
library(ggplot2) # I already have the packages installed
library(dplyr)
library(gapminder)
data("gapminder")

# To view the first rows of the data
head(gapminder)
# Let's assign a variable to the gapminder data for ease of access
g <- gapminder
# Let's view the head of the data
A tibble: 396 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# Let's plot a bar chart of life expectancy for Asian countries
# %>% is the pipe operator, which chains multiple steps in a single expression
g %>%
  filter(continent == "Asia" & year == 2007) %>% # keep only Asia and year 2007 (filter from the dplyr package)
  head(5) %>%                                    # just the first five countries
  ggplot(aes(x = country, y = lifeExp)) +
  geom_bar(stat = "identity", position = "identity")
# The above code will produce something like this:
# Let's tweak the code a little: the y-axis only runs up to 60 years but the data goes beyond 70, so we want the axis to show values up to 80. To do that, add a scale_y_continuous layer; we also want to change the y-axis label. The code now looks like this:
g %>%
  filter(continent == "Asia" & year == 2007) %>%
  head(5) %>%
  ggplot(aes(x = country, y = lifeExp)) +
  geom_bar(stat = "identity", position = "identity") +
  scale_y_continuous(name = "Life Expectancy", limits = c(0, 80))
# Now let's say I am not quite happy with the theme and want something more classical; ggplot also has a classic theme. The code now looks like this:
g %>%
  filter(continent == "Asia" & year == 2007) %>%
  head(5) %>%
  ggplot(aes(x = country, y = lifeExp)) +
  geom_bar(stat = "identity", position = "identity") +
  scale_y_continuous(name = "Life Expectancy", limits = c(0, 80)) +
  theme_classic()
# Now say I am not happy with the color of the bars and want to change them. We can fill the bars with a different color; run colors() to see the options available.
# I really like red, so I added a fill argument to geom_bar:
g %>%
  filter(continent == "Asia" & year == 2007) %>%
  head(5) %>%
  ggplot(aes(x = country, y = lifeExp)) +
  geom_bar(stat = "identity", position = "identity", fill = "red") +
  scale_y_continuous(name = "Life Expectancy", limits = c(0, 80)) +
  theme_classic()
# Now suppose you are not happy with the width of the bars and want narrower bars in the plot:
g %>%
  filter(continent == "Asia" & year == 2007) %>%
  head(5) %>%
  ggplot(aes(x = country, y = lifeExp)) +
  geom_bar(stat = "identity", position = "identity", fill = "red", width = 0.5) +
  scale_y_continuous(name = "Life Expectancy", limits = c(0, 80)) +
  theme_classic()
# Now let's say I want to see five countries from Europe with their life expectancy, so I will filter the data for Europe; this time I will use a shade of purple for the bars:
g %>%
  filter(continent == "Europe" & year == 2007) %>%
  head(5) %>%
  ggplot(aes(x = country, y = lifeExp)) +
  geom_bar(stat = "identity", position = "identity", fill = "orchid", width = 0.5) +
  scale_y_continuous(name = "Life Expectancy", limits = c(0, 80)) +
  theme_classic()
Kanchenjunga, the third-highest mountain in the world, lies between Nepal and Sikkim (India). The peak is wrapped in mythical stories.
The valley of Kanchenjunga is said to have once been home to a mountain deity called Dzo-nga (the Kanchenjunga Demon), a yeti or big-footed snowman. In 1925, Tombazi, a Greek photographer taking part in a British geological expedition in the Himalayas, saw a big-footed creature moving across some lower slopes. In his notes he said the creature looked like a human being. "It stopped occasionally to uproot or pull at some dwarf rhododendron bushes," he said. "It showed up dark against the snow and, as far as I could make out, wore no clothes." He found 15 footprints before losing the trail in the snow. In 1951, two British mountaineers, Eric Shipton and Michael Ward, photographed footprints, each about 13 inches wide and 18 inches long (https://bit.ly/2tGojYj). For generations, the inhabitants of the areas surrounding Mount Kanchenjunga, both in Sikkim and in Nepal, have recounted legends of a valley of immortality hidden on its slopes. These stories are well known both to the original inhabitants of the area, the Lepcha people, and to those of the Tibetan Buddhist cultural tradition. In Tibetan, this valley is known as Beyul Demoshong. In 1962 a Tibetan lama by the name of Tulshuk Lingpa led over 300 followers into the high snow slopes of Kanchenjunga to 'open the way' to Beyul Demoshong. The story of this expedition is recounted in the 2011 book A Step Away from Paradise (Wikipedia).
Learning is fun; sometimes it's a pain too. So I thought I would post all my trials and errors in an error-series of posts. When you finally come up with a plot, it feels good.
Once I was trying my best to plot a stacked bar chart for microbiological data, and the code and errors went like this. (This is the plot that came after hundreds of errors, though the data frame was as simple as it could be.)
In real life, the data frames you want to work on often come as separate files, and sometimes you want to combine them for your analysis. Merging different data sets can cause a lot of trouble, so let's see how we can use the "dplyr" package to merge data sets the way we want.
Let's create some data sets to try this:
# Data frame A
A <- data.frame(ID = c("A", "B", "C", "D"), Y = 1:4)
View(A)
# Data frame B
B <- data.frame(ID = c("A", "C", "D", "E"), Y = 5:8)
View(B)
Now let's join these two data frames with the join functions from the "dplyr" package:
# Load the library
library(dplyr)
# Let's call the new data set "C" and do a full join by "ID"
C <- full_join(A, B, by = "ID")
# The two data sets are now joined on their ID. The join function will give you a warning message like:
# Warning message: Column `ID` joining factors with different levels, coercing to character vector
# IDs that appear in only one of the data sets have the other data set's columns filled with NA by default. (Also note that because both frames have a column named Y, the joined result will contain Y.x and Y.y.) The new data set will look like:
# Now let's say you want to replace all the NAs in the new data set "C" with zeros
C[is.na(C)] <- 0
View(C)
This replaces every NA with zero.
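If you prefer to avoid a package dependency, base R's merge() performs the same kinds of joins. Here is a sketch using data frames like A and B above (note I gave B's value column a different name, Z, so the two value columns don't collide in the result); the all / all.x / all.y arguments pick the join type:

```r
A <- data.frame(ID = c("A", "B", "C", "D"), Y = 1:4)
B <- data.frame(ID = c("A", "C", "D", "E"), Z = 5:8)

full  <- merge(A, B, by = "ID", all = TRUE)    # like dplyr::full_join: all IDs A-E
left  <- merge(A, B, by = "ID", all.x = TRUE)  # like dplyr::left_join: keep A's rows
inner <- merge(A, B, by = "ID")                # like dplyr::inner_join: common IDs only

full  # combinations missing from one side are filled with NA
```

As with full_join, the NA cells (for example Z where ID is "B") can then be zeroed out with the same C[is.na(C)] <- 0 trick.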