Linear Model

A linear model describes a continuous response (dependent) variable y as a function of one or more predictor (explanatory/independent) variables x. Linear regression is the statistical method used to create a linear model, and it can help you understand and predict the behaviour of a complex system from its predictor variables.

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).
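For example (with made-up numbers just to illustrate the equation): if a = 50 and b = 0.004, then at X = 2000 the line predicts Y = 50 + 0.004 × 2000 = 58.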

There are several types of linear regression:

  • Simple linear regression: models using only one predictor
  • Multiple linear regression: models using multiple predictors
  • Multivariate linear regression: models for multiple response variables

We will consider simple linear regression in this article.

Let's say children's height and age are related: the higher the age, the greater the height. Age is then the explanatory variable (x), which can explain height as the dependent variable (y). Based on this you can build a model where, by knowing the age of a child, you can predict that child's height, as sketched below.
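A minimal sketch of that idea in R (the ages and heights below are made-up illustration values, not real data):

# Hypothetical example: predict height (cm) from age (years)
children <- data.frame(age    = c(2, 4, 6, 8, 10),
                       height = c(85, 100, 115, 128, 138))

fit_children <- lm(height ~ age, data = children)  # height as a function of age
predict(fit_children, data.frame(age = 7))         # predicted height for a 7-year-old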

To analyse such a model, you need to fit a line between the two (or more) variables. The line is generally fitted by the method of least squares. To see what this means, let's work through an example.

For a hypothetical dataset df, there are two variables: income and life expectancy.

Our hypothesis is that income affects a person's life expectancy: the higher the income, the higher the life expectancy. So income is the independent variable and life expectancy is the dependent variable.

So let's do the modelling, and along the way I will explain how the line is fitted.

Creating the data frame:

df <- data.frame(Income = c(1000, 1500, 2000, 2500, 3000, 4000, 5000, 10000),
                  Life_Expectancy = c(55, 60, 65, 67, 69, 72, 76, 80))

  Income Life_Expectancy
1   1000              55
2   1500              60
3   2000              65
4   2500              67
5   3000              69
6   4000              72
7   5000              76
8  10000              80

First, let's make a scatter plot to understand how lines are fitted to the data (we will use the ggplot2 package for this):

library(ggplot2)  # for ggplot()
ggplot(df, aes(Income, Life_Expectancy)) + geom_point(size = 5, col = "blue") + theme_classic()

[Scatter plot of Life_Expectancy against Income]

Now let's add a "line of best fit": the straight line, found by the least squares method, that shows the trend.

But how do we place the line, and where? Horizontal? Tilted? How do we find the optimal, best-fitting line?

To start, let's put a horizontal line at the average value of the y axis (y = m = 68). To see how well the line fits, determine the distance of each data point from the line; these distances are known as residuals. For the first point it is (m − y1), for the second (m − y2), for the third (m − y3), and so on for every point. But here's a catch: the data points above the horizontal line (y5, y6, y7, y8) have values greater than the line, so their residuals are negative. If we simply summed the residuals, the negative values would cancel the positive ones and could make a poorly fitting line look good. So before summation, each residual is squared; squaring ensures every term is positive. The result is known as the "sum of squared residuals".

So the equation for the sum of squared residuals is:

SS(res) = (m − y1)² + (m − y2)² + … + (m − yn)² = Σ (m − yi)²

where m is the value predicted by the line (here, the horizontal line at the mean) and y1 … yn are the observed values.

Let's say that for the horizontal line the sum is around 30. Now we rotate the line a little and the sum comes to 16. So which is the better fit? The one with sum 16. The idea is that we adjust the intercept and the slope to minimise the sum of squared residuals; the line that achieves this minimum is the line fitted by "least squares". (The 30 and 16 here are made-up numbers used only to illustrate the idea; a sketch of this comparison in R follows below.)
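Here is a minimal sketch of this comparison in R, using the df created above (fit and the ss_ names are just illustrative variable names):

# Sum of squared residuals for the horizontal line at the mean of y (m = 68 for this data)
m <- mean(df$Life_Expectancy)
ss_mean <- sum((df$Life_Expectancy - m)^2)

# Sum of squared residuals for the least-squares line (fitted formally later in this article)
fit <- lm(Life_Expectancy ~ Income, data = df)
ss_fit <- sum(residuals(fit)^2)

ss_mean  # total variation around the mean
ss_fit   # smaller: the fitted line is closer to the points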

[Figure: fitting a line]

Now let's do the linear regression and test how well the line fits our data.

So, how do we test it?

We will calculate the coefficient of determination, or R².

R² = explained variation of the model / total variation of the model

or, equivalently,

R² = [SS(total) − SS(res)] / SS(total)

Here SS(res) is the sum of squared residuals around the regression line, and SS(total) is the sum of squared residuals around the average value of y. Dividing through, the same formula can be written as:

R² = 1 − SS(res) / SS(total)
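Plugging in the hypothetical sums from the rotation example above (SS(total) = 30 for the horizontal line at the mean, SS(res) = 16 for the rotated line) gives R² = 1 − 16/30 ≈ 0.47, i.e. that line would explain roughly 47% of the variation. (Again, 30 and 16 are illustrative numbers only.)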

Now, to do linear regression in R, you can use:

lreg <- lm(Life_Expectancy ~ Income, data = df)  # lm() fits the linear model: Life_Expectancy (dependent) as a function of Income (independent)

summary(lreg)  # check the summary of the results

Call:
lm(formula = Life_Expectancy ~ Income, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-6.457 -3.000  1.427  2.685  4.573 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.896e+01  2.479e+00  23.783 3.63e-07 ***
Income      2.492e-03  5.484e-04   4.545  0.00391 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.19 on 6 degrees of freedom
Multiple R-squared:  0.7749,	Adjusted R-squared:  0.7374 
F-statistic: 20.66 on 1 and 6 DF,  p-value: 0.003913
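As a minimal check (ss_res and ss_tot are just illustrative variable names, and Income = 3500 is an arbitrary example value), you can reproduce the reported R² from the sums of squares and use the model for prediction:

ss_res <- sum(residuals(lreg)^2)                                   # residual sum of squares
ss_tot <- sum((df$Life_Expectancy - mean(df$Life_Expectancy))^2)   # total sum of squares
1 - ss_res/ss_tot                           # matches Multiple R-squared: 0.7749

coef(lreg)                                  # intercept and slope of the fitted line
predict(lreg, data.frame(Income = 3500))    # predicted life expectancy at an income of 3500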

The coefficient of determination (denoted R²) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable. The R² value lies between 0 and 1.

  • An R² of 0 means that the dependent variable cannot be predicted from the independent variable.
  • An R² of 1 means the dependent variable can be predicted without any error from the independent variable (the response is completely determined by the predictor values).
  • An R² between 0 and 1 indicates the extent to which the dependent variable is predictable: an R² of 0.10 means that 10 percent of the variance in Y is predictable from X, an R² of 0.20 means that 20 percent is predictable, and so on.

Our R² value was 0.77, which means that 77% of the variation in Life_Expectancy (the dependent variable) can be explained by Income (the independent variable). So there is a relationship between income and life expectancy: most of the variation in life expectancy in our data is accounted for by income.

If you plot the data with:

ggplot(df, aes(Income, Life_Expectancy)) + geom_point(size = 5, col = "blue") + theme_classic() +
   geom_smooth(method = "lm")

you get the fitted line with a 95% confidence band around it:

[Scatter plot with the fitted regression line and a 95% confidence band]

If you want to remove the band, add the argument se = FALSE, like this:

ggplot(df, aes(Income, Life_Expectancy)) + geom_point(size = 5, col = "blue") + theme_classic() +
   geom_smooth(method = "lm", se = FALSE)

[Scatter plot with the fitted regression line, no confidence band]
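As a side note, if you prefer base R graphics to ggplot2, a minimal equivalent sketch is:

plot(Life_Expectancy ~ Income, data = df, pch = 19, col = "blue")   # scatter plot
abline(lreg)                                                        # add the fitted least-squares line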

In the next part, I will discuss this further, along with the p-value and the adjusted R² value, and how to display these values on your plot in R.

Reading:

1. https://www.datacamp.com/community/tutorials/linear-regression-R

2. https://stattrek.com/statistics/dictionary.aspx?definition=coefficient_of_determination
