A linear model describes a continuous variable (the response or dependent variable, y) as a function of one or more predictor or explanatory variables (the independent variables, x). Linear regression is the statistical method used to create a linear model. It can help you understand and predict the behavior of a complex system based on the predictor variables.

A linear regression line has an equation of the form **Y = a + bX**, where *X* is the explanatory variable and *Y* is the dependent variable. The slope of the line is *b*, and *a* is the intercept (= value of *y* when *x* = 0).

There are several types of linear regression:

- **Simple linear regression:** models using only one predictor
- **Multiple linear regression:** models using multiple predictors
- **Multivariate linear regression:** models for multiple response variables
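As a rough illustration of how these three types differ in R's formula syntax, here is a minimal sketch. It assumes the built-in `mtcars` dataset that ships with R (not part of this article's example), and the object names are chosen here purely for illustration.

```r
# mtcars ships with base R (datasets package); used here only as an assumption
# for illustration, it is not the dataset discussed in this article.

# Simple linear regression: one response, one predictor
simple_fit <- lm(mpg ~ wt, data = mtcars)

# Multiple linear regression: one response, several predictors
multiple_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Multivariate linear regression: several responses bound together with cbind()
multivariate_fit <- lm(cbind(mpg, qsec) ~ wt, data = mtcars)

print(coef(simple_fit))        # intercept and one slope
print(coef(multiple_fit))      # intercept and two slopes
print(coef(multivariate_fit))  # one column of coefficients per response
```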

We will consider simple linear regression in this article.

Let's say children's height and age are related: the higher the age, the greater the height. Age is then the explanatory variable (x), which is used to explain height, the dependent variable (y). Based on this you can build a model in which, knowing the age of a child, you can predict the height of that child.

To analyse such a model, you need to fit a line between the two (or more) variables. The line is generally fitted by the method of least squares, that is, by minimizing the sum of squared residuals. To illustrate this, consider the following example.

For a hypothetical dataset df, there are two variables, Income and Life_Expectancy.

Our hypothesis is that income can affect the life expectancy of a person: the higher the income, the higher the life expectancy. So Income is the independent variable and Life_Expectancy is the dependent variable.

So, let's build the model and explain how the line is fitted along the way.

Creating the data frame:

df <- data.frame(Income = c(1000, 1500, 2000, 2500, 3000, 4000, 5000, 10000),
                 Life_Expectancy = c(55, 60, 65, 67, 69, 72, 76, 80))

First, let's draw a scatter plot to understand how lines are fitted to the data (we will use the ggplot2 package for this):

library(ggplot2)
ggplot(df, aes(Income, Life_Expectancy)) + geom_point(size = 5, col = "blue") + theme_classic()

Now let's add a “line of best fit”: the straight line, found by the least squares method, that shows the trend.

But how do we place the line, and where: horizontal, vertical? How do we find the optimal, best-fitting line?

To start, let's put a horizontal line at the average value of the y axis (y = m = 68). To see how well the line fits, determine the distance of each data point from the line (these distances are known as residuals). For the first point it is (m – y1), for the second (m – y2), for the third (m – y3), and so on for every point. But here's the catch: the data points above the horizontal line (y5, y6, y7, y8) have greater values than the line, so their residuals are negative. If the residuals were simply added up, the negative values would cancel the positive ones and make a poorly fitting line look like a good fit. So each residual is squared before the summation, which ensures every term is positive. The result is known as the “Sum of Squared Residuals”.

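So, the equation for the “Sum of Squared Residuals” of this horizontal line is:

SS = (m – y1)^{2} + (m – y2)^{2} + … + (m – y8)^{2}

For any other line, m is replaced by the value the line predicts at each point.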
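A minimal sketch of this calculation in R, assuming the `df` data frame defined earlier (the names `m`, `res`, `sum_raw` and `ss_mean` are chosen here for illustration). It shows that the raw residuals from the horizontal line cancel out, while their squares do not:

```r
# Data from the example above
df <- data.frame(Income = c(1000, 1500, 2000, 2500, 3000, 4000, 5000, 10000),
                 Life_Expectancy = c(55, 60, 65, 67, 69, 72, 76, 80))

# Horizontal line at the average of the response
m <- mean(df$Life_Expectancy)

# Residuals: distance of each point from the line, written as (m - y)
res <- m - df$Life_Expectancy

# Raw residuals cancel each other; squared residuals cannot
sum_raw <- sum(res)
ss_mean <- sum(res^2)

print(m)
print(res)
print(sum_raw)
print(ss_mean)
```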

Suppose that in this case the sum is around 30, and after rotating the line the sum comes to 16. Which line fits better? The one with 16. The idea is to choose the intercept and slope that minimize the sum of squared residuals, so that the line predicts the response as well as possible. This is why the line is said to be fitted by “least squares”.
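To make the idea of minimizing concrete, here is a small sketch that compares the horizontal line at the mean with the least-squares line for the example data. It uses the standard closed-form estimates for simple linear regression (slope = covariation of x and y divided by variation of x; intercept = mean(y) minus slope times mean(x)). The helper `ssr()` and the other names are illustrative assumptions, not part of any package.

```r
df <- data.frame(Income = c(1000, 1500, 2000, 2500, 3000, 4000, 5000, 10000),
                 Life_Expectancy = c(55, 60, 65, 67, 69, 72, 76, 80))
x <- df$Income
y <- df$Life_Expectancy

# Sum of squared residuals for a candidate line y = a + b * x
ssr <- function(a, b) sum((y - (a + b * x))^2)

# Closed-form least-squares estimates for simple linear regression
b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a_hat <- mean(y) - b_hat * mean(x)

ss_flat <- ssr(mean(y), 0)    # horizontal line at the mean of y
ss_best <- ssr(a_hat, b_hat)  # least-squares line

print(c(intercept = a_hat, slope = b_hat))
print(c(flat = ss_flat, best = ss_best))
```

Any other intercept or slope passed to `ssr()` should give a value at least as large as `ss_best`.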

Now let's fit the linear regression and test how well it fits our data.

So, how do we test it?

We will calculate the coefficient of determination, or R^{2}:

R^{2} = explained variation / total variation

or [ss(total) – ss(res)] / ss(total)

The areas of the blue squares represent the squared residuals with respect to the linear regression. The areas of the red squares represent the squared residuals with respect to the average value.
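The formula above can be checked by hand. The sketch below assumes the `df` data frame from earlier and uses `lm()`, R's built-in linear model function, only to obtain the residuals of the regression line. The names `fit`, `ss_total`, `ss_res` and `r_squared` are chosen here for illustration.

```r
df <- data.frame(Income = c(1000, 1500, 2000, 2500, 3000, 4000, 5000, 10000),
                 Life_Expectancy = c(55, 60, 65, 67, 69, 72, 76, 80))

fit <- lm(Life_Expectancy ~ Income, data = df)
y <- df$Life_Expectancy

# Squared residuals with respect to the average value
ss_total <- sum((y - mean(y))^2)

# Squared residuals with respect to the regression line
ss_res <- sum(residuals(fit)^2)

# Coefficient of determination
r_squared <- (ss_total - ss_res) / ss_total
print(r_squared)
```

The result should agree with the value reported by `summary(fit)$r.squared`.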

Now, to run a linear regression in R, you can use lm():

# lm fits the linear model; Life_Expectancy is the dependent and Income the independent variable
lreg <- lm(Life_Expectancy ~ Income, data = df)

summary(lreg) # to check the summary of our results

Call:
lm(formula = Life_Expectancy ~ Income, data = df)

Residuals:
   Min     1Q Median     3Q    Max
-6.457 -3.000  1.427  2.685  4.573

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.896e+01  2.479e+00  23.783 3.63e-07 ***
Income      2.492e-03  5.484e-04   4.545  0.00391 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.19 on 6 degrees of freedom
Multiple R-squared:  0.7749, Adjusted R-squared:  0.7374
F-statistic: 20.66 on 1 and 6 DF,  p-value: 0.003913
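The numbers in this summary can also be pulled out of the fitted object and used for prediction. The following is a sketch under the assumption that `lreg` is fitted as above; the income value of 3500 is a made-up input chosen only for illustration.

```r
df <- data.frame(Income = c(1000, 1500, 2000, 2500, 3000, 4000, 5000, 10000),
                 Life_Expectancy = c(55, 60, 65, 67, 69, 72, 76, 80))
lreg <- lm(Life_Expectancy ~ Income, data = df)

# Intercept (a) and slope (b) of the fitted line Y = a + bX
coefs <- coef(lreg)
print(coefs)

# R-squared as reported in the summary
r2 <- summary(lreg)$r.squared
print(r2)

# Predict life expectancy for a hypothetical income of 3500
new_data <- data.frame(Income = 3500)
pred <- unname(predict(lreg, newdata = new_data))

# The same prediction computed directly from Y = a + bX
manual <- coefs[["(Intercept)"]] + coefs[["Income"]] * 3500
print(c(predict = pred, manual = manual))
```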

The **coefficient of determination** (denoted by R^{2}) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable. The value of R^{2} lies between 0 and 1.

- An R^{2} of 0 means that the dependent variable cannot be predicted from the independent variable.
- An R^{2} of 1 means the dependent variable can be predicted without any error from the independent variable (or the response can be completely defined from the predictor values).
- An R^{2} between 0 and 1 indicates the extent to which the dependent variable is predictable. An R^{2} of 0.10 means that 10 percent of the variance in *Y* is predictable from *X*; an R^{2} of 0.20 means that 20 percent is predictable; and so on.

Our R^{2} value was 0.77, which means that about 77% of the variation in Life Expectancy (the dependent variable) can be explained by Income (the independent variable). So there is a relationship between income and life expectancy in this dataset: higher income is associated with higher life expectancy.

If you plot our data with:

ggplot(df, aes(Income, Life_Expectancy)) + geom_point(size = 5, col = "blue") + theme_classic() +
  geom_smooth(method = "lm")

(This plot includes the 95% confidence interval.) If you want to remove it, add the argument se = FALSE, like this:

ggplot(df, aes(Income, Life_Expectancy)) + geom_point(size = 5, col = "blue") + theme_classic() +
  geom_smooth(method = "lm", se = FALSE)
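For readers who want the numbers behind the shaded band, the sketch below computes a 95% confidence interval for the fitted line with base R's `predict()`. It assumes the same `df` and model as above; the income grid is an arbitrary choice for illustration, and the result should correspond closely to the band that `geom_smooth(method = "lm")` draws by default.

```r
df <- data.frame(Income = c(1000, 1500, 2000, 2500, 3000, 4000, 5000, 10000),
                 Life_Expectancy = c(55, 60, 65, 67, 69, 72, 76, 80))
lreg <- lm(Life_Expectancy ~ Income, data = df)

# Income values at which to evaluate the line (arbitrary illustrative grid)
grid <- data.frame(Income = seq(1000, 10000, by = 1500))

# Fitted values with lower and upper 95% confidence limits for the mean response
band <- predict(lreg, newdata = grid, interval = "confidence", level = 0.95)

print(cbind(grid, band))
```

The returned matrix has the columns `fit`, `lwr` and `upr`, one row per income value.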

In the next part, I will discuss this further, along with p-values and adjusted R^{2} values, and show how to display these values on your plot in R.

Reading:

1. https://www.datacamp.com/community/tutorials/linear-regression-R

2. https://stattrek.com/statistics/dictionary.aspx?definition=coefficient_of_determination