Linear Model

A linear model describes a continuous variable (the response or dependent variable, ~y) as a function of one or more predictor or explanatory variables (independent variables, ~x). Linear regression is the statistical method used to fit a linear model. It can help you understand and predict the behavior of a complex system from the predictor variables.

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).

There are several types of linear regression:

  • Simple linear regression: models using only one predictor
  • Multiple linear regression: models using multiple predictors
  • Multivariate linear regression: models for multiple response variables

We will consider simple linear regression in this article.

Let's say that children's height and age are related: the older the child, the taller the child tends to be. Age is then the explanatory variable (~x), and height is the dependent variable (~y). Based on this, you can build a model that predicts a child's height from the child's age.

To analyze a model, you need to fit a line through the data. The line is usually fitted with the least squares method, which minimizes the sum of squared residuals. To illustrate this, let's take an example.

Consider a hypothetical dataset df with two variables, Income and Life_Expectancy.

Our hypothesis is that income can affect a person's life expectancy: the higher the income, the higher the life expectancy. So Income is the independent variable and Life_Expectancy is the dependent variable.

So, let's build the model; the fitting of the line is explained along the way.

Creating the data frame:

df <- data.frame(Income = c(1000, 1500, 2000, 2500, 3000, 4000, 5000, 10000),
                  Life_Expectancy = c(55, 60, 65, 67, 69, 72, 76, 80))

So, first let's draw a scatter plot to see how a line can be fitted to the data (we will use the ggplot2 package for this):

library(ggplot2)

ggplot(df, aes(Income, Life_Expectancy)) + geom_point(size = 5, col = "blue") + theme_classic()

Now let's add a “line of best fit”: the straight line, found with the least squares method, that shows the trend in the data.

But how do we place the line, and where: horizontal, vertical? How do we find the optimal, best-fit line?

To start, let's put a horizontal line at the average value of the y axis (y = m = 68). To see how well the line fits, measure the vertical distance of each data point from the line (these distances are known as residuals). For the first point it is (m – y1), for the second (m – y2), for the third (m – y3), and so on for every point. But here's a catch: the data points above the horizontal line (y5, y6, y7, y8) have greater values than the line, so their distances are negative. If you simply add the distances, the negative and positive values cancel out, and the total can look small even when the line fits poorly. So the residuals are squared before they are summed; squaring ensures that each term is positive. The result is known as the “Sum of Squared Residuals”.
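The cancellation described above can be checked directly on the article's data frame. This is a minimal sketch; the variable names d, sum_signed and sum_squared are chosen here for illustration and are not part of the original article.

```r
# Data frame from the article
df <- data.frame(Income = c(1000, 1500, 2000, 2500, 3000, 4000, 5000, 10000),
                 Life_Expectancy = c(55, 60, 65, 67, 69, 72, 76, 80))

m <- mean(df$Life_Expectancy)    # height of the horizontal line
d <- m - df$Life_Expectancy      # signed distances (m - y_i)

sum_signed  <- sum(d)            # positive and negative distances cancel
sum_squared <- sum(d^2)          # every squared term is non-negative

print(c(mean = m, signed = sum_signed, squared = sum_squared))
# mean is 68, the signed sum is 0, the squared sum is 468
```

The signed sum is zero for a line through the mean no matter how far the points lie from it, which is why it cannot be used to judge the fit.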

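So, the equation for the “Sum of Squared Residuals” is:

SS_{\text{res}}=\sum _{i=1}^{n}({\hat {y}}_{i}-y_{i})^{2}

where y_i is the observed value and \hat{y}_i is the value of the line at the same x. For the horizontal line above, \hat{y}_i = m, which gives the terms (m – y_i).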

Let's say that for this line the sum is around 30, and that after rotating the line the sum comes to 16. Which line fits better? The one with 16. The idea is to choose the intercept and slope that minimize the sum of squared residuals, so that the line predicts the response as well as possible. This is why the line is said to be fitted by “least squares”.
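The numbers 30 and 16 above are only illustrative. As a rough sketch of the same comparison on the article's data, the code below computes the sum of squared residuals for the horizontal line and for the least-squares line, using the standard closed-form estimates for simple linear regression. The helper ssr() and the names a_hat and b_hat are assumptions made for this example.

```r
# Data frame from the article
df <- data.frame(Income = c(1000, 1500, 2000, 2500, 3000, 4000, 5000, 10000),
                 Life_Expectancy = c(55, 60, 65, 67, 69, 72, 76, 80))
x <- df$Income
y <- df$Life_Expectancy

# Sum of squared residuals for the line y = a + b * x (hypothetical helper)
ssr <- function(a, b, x, y) sum((y - (a + b * x))^2)

# Closed-form least-squares estimates of slope and intercept
b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a_hat <- mean(y) - b_hat * mean(x)

ssr_flat <- ssr(mean(y), 0, x, y)    # horizontal line at the mean
ssr_fit  <- ssr(a_hat, b_hat, x, y)  # least-squares line

print(c(flat = ssr_flat, fitted = ssr_fit))
# flat is 468; fitted is smaller (about 105.3)
```

No other straight line gives a smaller sum for this data than the one defined by a_hat and b_hat.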

fitting a line

Now, let's run the linear regression and test how well the line fits our data.

So, how do we test it?

We will calculate the coefficient of determination, or R2:

R2 = Explained variation of the model / total variation of the model

or [ss(total) – ss(res)]/ ss(total)

The areas of the blue squares represent the squared residuals with respect to the linear regression. The areas of the red squares represent the squared residuals with respect to the average value.

R^{2}=1-{\frac {\color {blue}{SS_{\text{res}}}}{\color {red}{SS_{\text{tot}}}}}

Now, to do linear regression in R, you can use:

# lm() fits the linear model; Life_Expectancy is the dependent variable and Income the independent variable
lreg <- lm(Life_Expectancy ~ Income, data = df)

summary(lreg)  # check the summary of the results

Call:
lm(formula = Life_Expectancy ~ Income, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-6.457 -3.000  1.427  2.685  4.573 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.896e+01  2.479e+00  23.783 3.63e-07 ***
Income      2.492e-03  5.484e-04   4.545  0.00391 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.19 on 6 degrees of freedom
Multiple R-squared:  0.7749,	Adjusted R-squared:  0.7374 
F-statistic: 20.66 on 1 and 6 DF,  p-value: 0.003913
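Besides reading the summary, the fitted intercept and slope can be pulled out of the model and used for prediction. The sketch below uses coef() and predict() from base R; the income value 6000 is a made-up input for illustration, not a value from the article.

```r
# Data frame and model from the article
df <- data.frame(Income = c(1000, 1500, 2000, 2500, 3000, 4000, 5000, 10000),
                 Life_Expectancy = c(55, 60, 65, 67, 69, 72, 76, 80))
lreg <- lm(Life_Expectancy ~ Income, data = df)

coefs <- coef(lreg)             # named vector: "(Intercept)" and "Income"
a <- coefs[["(Intercept)"]]     # intercept a in Y = a + bX
b <- coefs[["Income"]]          # slope b in Y = a + bX

new_income <- data.frame(Income = 6000)       # hypothetical new observation
pred <- predict(lreg, newdata = new_income)   # same value as a + b * 6000

print(pred)   # about 73.9
```

The column name in newdata has to match the predictor name used in the formula (Income here).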

The coefficient of determination (denoted by R2) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable. The R2 value lies between 0 and 1.

  • An R2 of 0 means that the dependent variable cannot be predicted from the independent variable.
  • An R2 of 1 means the dependent variable can be predicted without any error from the independent variable.
    (that is, the response is completely determined by the predictor values)
  • An R2 between 0 and 1 indicates the extent to which the dependent variable is predictable. An R2 of 0.10 means that 10 percent of the variance in Y is predictable from X; an R2 of 0.20 means that 20 percent is predictable; and so on.

Our R2 value was 0.77, which means that 77% of the variance in Life Expectancy (the dependent variable) can be explained by Income (the independent variable). So income and life expectancy are related: about 77% of the variation in life expectancy across the observations is accounted for by the linear relationship with income.
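The value 0.77 can be reproduced by hand from the formula R2 = 1 - SS(res)/SS(total). This is a small sketch; the names ss_res, ss_tot, r2_manual and r2_lm are introduced here for illustration.

```r
# Data frame and model from the article
df <- data.frame(Income = c(1000, 1500, 2000, 2500, 3000, 4000, 5000, 10000),
                 Life_Expectancy = c(55, 60, 65, 67, 69, 72, 76, 80))
lreg <- lm(Life_Expectancy ~ Income, data = df)

# Squared residuals with respect to the regression line
ss_res <- sum(residuals(lreg)^2)

# Squared residuals with respect to the average value
ss_tot <- sum((df$Life_Expectancy - mean(df$Life_Expectancy))^2)

r2_manual <- 1 - ss_res / ss_tot
r2_lm     <- summary(lreg)$r.squared   # value reported by summary()

print(c(manual = r2_manual, lm = r2_lm))   # both about 0.775
```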

If you plot our data with:

ggplot(df, aes(Income, Life_Expectancy)) + geom_point(size = 5, col = "blue") + theme_classic() +
   geom_smooth(method = "lm")

(This plot includes the 95% confidence interval band.) If you want to remove it, add the argument se = FALSE:

ggplot(df, aes(Income, Life_Expectancy)) + geom_point(size = 5, col = "blue") + theme_classic() +
   geom_smooth(method = "lm", se = FALSE)

In the next part, I will discuss this further, along with p-values and adjusted R2 values, and how to show these values on your plot in R.

Reading:

1. https://www.datacamp.com/community/tutorials/linear-regression-R

2. https://stattrek.com/statistics/dictionary.aspx?definition=coefficient_of_determination

Importing Data into R/RStudio

Importing data is the basic first step before any analysis, but it can sometimes be frustrating.

So, let's see how we can import our data into the R environment.

1. In RStudio, go to the Environment pane on the right side and click on Import Dataset. You can then choose one of three kinds of option: text, to import text data (csv files); Excel, to import Excel files; or other statistical files (SAS, SPSS, Stata, etc.).

2. You can also go to the File menu on the top menu bar of RStudio and select the import function to import the data.

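3. You can also import your .txt file using the read.table() function.

df <- read.table("file name", header = FALSE)

By default, read.table() sets header to FALSE (unless the first row has one fewer field than the other rows). Set header = TRUE when the first line of the file contains the variable names, so that they are used as the column names.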
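To see what the header argument changes, here is a minimal sketch that writes a small whitespace-separated text file to a temporary location and reads it both ways. The file contents and the names txt, with_header and no_header are made up for this example.

```r
# Write a small example file (hypothetical contents) to a temporary path
txt <- tempfile(fileext = ".txt")
writeLines(c("Income Life_Expectancy",
             "1000 55",
             "1500 60"), txt)

# header = TRUE: the first line supplies the column names
with_header <- read.table(txt, header = TRUE)

# header = FALSE: the first line is treated as data, columns are named V1, V2
no_header <- read.table(txt, header = FALSE)

print(names(with_header))   # "Income" "Life_Expectancy"
print(names(no_header))     # "V1" "V2"
print(nrow(with_header))    # 2
print(nrow(no_header))      # 3
```

With header = FALSE the name row ends up inside the data, so the columns can no longer be read as numbers.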

4. You can read a .csv file with read.table(), read.csv() or read.csv2().

Note: You could land yourself in trouble if your file was saved with a Byte Order Mark (BOM). If so, you have to add the extra argument fileEncoding = "UTF-8-BOM" to your function.
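As a sketch of the BOM case, the code below creates a small CSV file that starts with the three UTF-8 BOM bytes and reads it back with fileEncoding = "UTF-8-BOM". The file contents and the names path, bom and df_bom are assumptions made for this example.

```r
# Build a small CSV file that begins with a UTF-8 Byte Order Mark
path <- tempfile(fileext = ".csv")
bom  <- as.raw(c(0xEF, 0xBB, 0xBF))   # the three BOM bytes
writeBin(c(bom, charToRaw("Income,Life_Expectancy\n1000,55\n1500,60\n")), path)

# fileEncoding = "UTF-8-BOM" strips the BOM while reading
df_bom <- read.csv(path, fileEncoding = "UTF-8-BOM")

print(names(df_bom))   # "Income" "Life_Expectancy"
```

Without the argument, the BOM can end up attached to the first column name, which makes that column awkward to refer to.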

For read.table(), you have to specify the separator character (for csv files the separator is generally “,” or “;”).

For example:

df <- read.table("file name",
                 header = FALSE,
                 sep = ",")

read.csv(): for files with “,” as the separator

read.csv2(): for files with “;” as the separator (and “,” as the decimal mark)

df <- read.csv("file name",
               header = FALSE)

df <- read.csv2("file name",
                header = FALSE)
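The three functions can be compared on small temporary files. This is a minimal sketch with made-up file contents; here the files do have a header row, so header = TRUE is used.

```r
# A comma-separated file and a semicolon-separated file (hypothetical contents)
csv_comma <- tempfile(fileext = ".csv")
writeLines(c("Income,Life_Expectancy", "1000,55", "1500,60"), csv_comma)

csv_semi <- tempfile(fileext = ".csv")
writeLines(c("Income;Life_Expectancy", "1000;55,5", "1500;60,0"), csv_semi)

# read.csv(): sep = "," and dec = "." by default
df_comma <- read.csv(csv_comma, header = TRUE)

# read.table(): the separator has to be given explicitly
df_table <- read.table(csv_comma, header = TRUE, sep = ",")

# read.csv2(): sep = ";" and dec = "," by default
df_semi <- read.csv2(csv_semi, header = TRUE)

print(df_comma)
print(df_semi)   # Life_Expectancy is read as 55.5 and 60.0
```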

“ Note that if you get a warning message that reads like “incomplete final line found by readTableHeader”, you can try to go and “stand” on the cell that contains the last value (c in this case) and press ENTER. This will normally fix the warning because the message indicates that the last line of the file doesn’t end with an End Of Line (EOL) character, which can be a linefeed or a carriage return and linefeed. Don’t forget to save the file to make sure that your changes are saved!

Pro-Tip: use a text editor like NotePad to make sure that you add an EOL character without adding new rows or columns to your data.”

5. If the fields in your data file are separated by a character other than tab, comma or semicolon, you can use read.delim() or read.delim2() with the sep argument:

df <- read.delim("file name", sep = "$")

df <- read.delim2("file name", sep = "$")
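A small sketch of the same idea on temporary files with “$” as the separator. The file contents are made up for illustration; read.delim2() differs from read.delim() in that it expects “,” as the decimal mark.

```r
# File with "$" as the field separator and "." as the decimal mark
path_dot <- tempfile(fileext = ".txt")
writeLines(c("Income$Life_Expectancy", "1000$55.5", "1500$60.0"), path_dot)

# Same data, but with "," as the decimal mark
path_comma <- tempfile(fileext = ".txt")
writeLines(c("Income$Life_Expectancy", "1000$55,5", "1500$60,0"), path_comma)

# Both functions use header = TRUE by default
df_dot   <- read.delim(path_dot, sep = "$")
df_comma <- read.delim2(path_comma, sep = "$")

print(df_dot)
print(df_comma)   # same values as df_dot
```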

You can get more information on other file types, such as XML and JSON, from https://www.datacamp.com/community/tutorials/r-data-import-tutorial