Linear regression is a well-known supervised machine learning algorithm and one of the oldest, most rigorously studied forms of regression analysis. It models the linear relationship between a dependent variable and one or more independent variables.

In this post, you will learn the basics of linear regression, its assumptions, and implementation using Excel, R, and Python.

**What is a Linear Regression?**

Linear regression is a supervised machine learning technique used to predict a continuous numerical target variable.

Linear regression is useful for finding the linear relationship between the inputs (independent variables) and the target (dependent variable). The purpose of linear regression is to find the best-fit line, also referred to as the regression line, that can accurately predict the output for a continuous dependent variable.

Linear regression can be applied to numerical data. If there are categorical features like city or season, convert them to dummy variables first.

**Linear Regression vs Logistic Regression**

Beginners are often confused between linear regression and logistic regression, perhaps because of the word “regression” in both names. But they are quite different in their application.

Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems. Logistic regression predicts a categorical dependent variable based on the independent variables, and its output varies between 0 and 1.

**Simple Linear Regression Vs Multiple Linear Regression**

Simple linear regression is the simplest form of linear regression: when there is a single independent variable, the model is referred to as a simple linear regression. For example, the relationship between height and weight.

When there are multiple input variables, the regression model is called multiple linear regression. For example, predicting house prices based on the multiple inputs like locality, area, amenities, etc.

**Simple Linear Regression Model Representation**

In simple linear regression, there is only one input variable and one output variable.

Given an independent variable (x) and a target variable (y), the simple linear regression model can be represented as:

**y = B0 + B1 * x**

where B0 represents the intercept and B1 is a coefficient describing the linear relationship between x and y. Both x and y should be numeric variables.
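To make the formula concrete, here is a quick sketch in Python (the x and y values are made up for illustration) that estimates B0 and B1 with the usual least-squares formulas:

```python
import numpy as np

# Hypothetical data: any small numeric sample works here
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.1, 6.2, 7.9, 10.1])

# Least-squares estimates: B1 = cov(x, y) / var(x), B0 = mean(y) - B1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)  # intercept and slope of the best-fit line
```

The fitted line y = b0 + b1 * x then predicts y for any new x.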

**Multiple Linear Regression Model Representation**

Real-world problems are rarely so simple that you can predict the target using only a single independent variable. Several factors usually need to be combined to get the prediction close to the actual value.

Multiple independent variables are linearly combined to establish a relationship with the dependent variable.

Given independent variables (x1, x2, x3, …, xn) and a target variable (y), the multiple linear regression equation can be represented as:

**y = B0 + B1 * x1 + B2 * x2 + B3 * x3 + … + Bn * xn**

where B0 represents the intercept and B1, B2, …, Bn are the coefficients describing the linear relationship of each input with the dependent variable.
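As a sketch of how these coefficients can be estimated, the snippet below solves the least-squares problem with NumPy on made-up data where the true coefficients (B0 = 1, B1 = 2, B2 = 3) are known:

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 exactly
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

# Prepend a column of ones so the intercept B0 is estimated too
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)  # approximately [1., 2., 3.]
```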

**Linear Regression Assumptions**

A few assumptions are made when we use linear regression to model the relationship between the independent variables and the dependent variable.

- Linearity: The relationship between input and the mean of the target variable is linear.
- Homoscedasticity: The variance of the residuals is the same for any value of X. A scatter plot of residuals vs. predicted values should show a random spread; if specific patterns appear, the data is heteroscedastic.
- Zero/Little Multicollinearity: There should be no strong linear relationships among the independent variables; each predictor should carry information not already captured by the others.
- Normality: For any fixed value of X, Y is normally distributed.
- No Autocorrelation: Residuals should not be correlated with each other. Autocorrelation, also known as serial correlation, is the correlation between values of the same variable at different timestamps and is mostly found in time-series data.
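To illustrate the autocorrelation assumption, a common check is the Durbin-Watson statistic on the residuals. The helper below is a minimal hand-rolled version (not from any library used in this post), shown on made-up residual sequences:

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest no autocorrelation,
    values toward 0 positive, and values toward 4 negative autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Alternating residuals indicate negative autocorrelation (statistic near 4)
print(durbin_watson([1, -1, 1, -1]))
# Identical residuals indicate strong positive autocorrelation (statistic 0)
print(durbin_watson([1, 1, 1, 1]))
```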

**Boston Housing dataset**

For this study, we use the Boston Housing dataset, which you can download from the UCI Machine Learning Repository. The dataset is nice and clean, with no missing values.

The Boston Housing dataset is small, with 506 observations, and contains information about houses in Boston. A regression model is trained to predict the selling price of a house from the input features.

**Implementation of Linear Regression in Excel**

There are various tools like Minitab, Excel, R, SAS, and Python that you can leverage to implement linear regression.

Excel is widely available software from Microsoft that supports various data analysis functions. You need to enable the “Analysis ToolPak” add-in to perform regression analysis.

There are 14 variables in this dataset. Our goal is to predict the median value of homes using the independent variables.

Pause here for a moment to think about each variable’s logical relationship with the median value, MEDV.

- CRIM – per capita crime rate by town
- INDUS – the proportion of non-retail business acres per town
- CHAS – Charles River dummy variable (1 if the tract bounds river otherwise 0)
- NOX – nitric oxides concentration (parts per 10 million)
- RM – the average number of rooms per dwelling
- AGE – the proportion of owner-occupied units built prior to 1940
- DIS – weighted distances to five Boston employment centres
- RAD – index of accessibility to radial highways
- TAX – full-value property-tax rate per $10,000
- PTRATIO – pupil-teacher ratio by town
- B – 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT – % lower status of the population
- MEDV – Median value of owner-occupied homes in $1000’s
- ZN – the proportion of residential land zoned for lots over 25,000 sq.ft.

Do you see any relationship? At a glance, we can say that CRIM, INDUS, NOX, and AGE are negatively correlated with MEDV, whereas RM and DIS have a positive correlation with the target variable.

Go to the “Data Analysis” option under the “Data” tab and choose “Regression”.

Select the input variables and the target variable: columns A to M are the input variables, and column N is the target variable. Do not forget to check the “Labels” option on the regression panel.

Press “OK” and the regression analysis is done. Now we will visit each section of the output to deepen our understanding.

“Regression Statistics” tells how well the model captures the relationship between the independent variables and the target variable.

**Multiple R** – also known as the correlation coefficient, it tells the strength of the linear relationship. It varies between −1 and +1 and is equal to the square root of R-square.

**R-Square** – tells how close the data are to the fitted regression line. It is also known as the **“coefficient of determination”**. R-square is always between 0 and 1.

R-square = Explained variation / Total variation

Explained variation is the sum of the squares of the differences between each predicted y-value and the mean of y. Total variation is the sum of the squares of the differences between each observed y-value and the mean of y. From the ANOVA table, Explained variation = 31243.14662 and Total variation = 42716.29542.
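Plugging the ANOVA sums of squares above into the ratio reproduces the R-square Excel reports:

```python
# R-square from the ANOVA sums of squares reported by Excel
explained_variation = 31243.14662  # Regression SS
total_variation = 42716.29542      # Total SS

r_square = explained_variation / total_variation
print(round(r_square, 4))  # ~0.7314
```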

**Adjusted R-square** – penalizes R-square for every new variable added to the model. It only increases if the new predictor enhances the model.

Adjusted R-square = 1 – ( (n – 1) / (n – k – 1) ) * (1 – R-square)

where n is the number of observations (n = 506) and k is the number of independent variables used in the model (k = 13).
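With these values, the adjusted R-square works out as:

```python
n, k = 506, 13  # observations and predictors from the Boston dataset
r_square = 31243.14662 / 42716.29542  # from the ANOVA table

adj_r_square = 1 - ((n - 1) / (n - k - 1)) * (1 - r_square)
print(round(adj_r_square, 4))  # ~0.7243
```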

**Standard Error** – the standard deviation of the observed y-values about the predicted y-values for a given x-value.

Standard Error = SQRT(Unexplained variation / (n – k – 1))

From the ANOVA table, Unexplained variation = 11473.14919 and n – k – 1 = 492.
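Plugging those numbers in:

```python
import math

unexplained_variation = 11473.14919  # Residual SS from the ANOVA table
residual_df = 492                    # residual degrees of freedom

standard_error = math.sqrt(unexplained_variation / residual_df)
print(round(standard_error, 3))  # ~4.829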
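Plugging those numbers in:

```python
import math

unexplained_variation = 11473.14919  # Residual SS from the ANOVA table
residual_df = 492                    # residual degrees of freedom

standard_error = math.sqrt(unexplained_variation / residual_df)
print(round(standard_error, 3))  # ~4.829
```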

**Observations** – Number of observations.

ANOVA, or Analysis of Variance, is another key analysis provided by Excel. We have already discussed how you can derive R-square, adjusted R-square, and the standard error from the ANOVA report.

**Regression df** – regression degrees of freedom. There are 14 estimated parameters (13 predictors plus the intercept), so the regression degrees of freedom are 14 – 1 = 13.

**Regression SS** – also known as explained variation, SS stands for Sum of Squares

**Residual df** – total degree of freedom minus regression degree of freedom (505 – 13 = 492)

**Residual SS** – also known as unexplained variation

**Total df** – total variance has n – 1 degree of freedom. In this case, there were 506 observations so the total degree of freedom is 505.

**Regression MS** – regression mean square (Regression SS / Regression df)

**Residual MS** – residual mean squared error (Residual SS / Residual degrees of freedom)

**F** – overall F test statistic for the null hypothesis (Regression MS / Residual MS)

**Significance F** – the p-value associated with the overall F test
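Using the sums of squares from the ANOVA table, the mean squares and the F statistic can be reproduced directly:

```python
# Sums of squares and degrees of freedom from the ANOVA table
regression_ss, regression_df = 31243.14662, 13
residual_ss, residual_df = 11473.14919, 492

regression_ms = regression_ss / regression_df
residual_ms = residual_ss / residual_df
f_stat = regression_ms / residual_ms
print(round(f_stat, 2))  # ~103.06
```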

Now, moving to the last part of the report generated by Excel.

**Coefficients** – the values for the regression equation for predicting the dependent variable from the independent variables. These coefficients provide the values of B0, B1, B2, etc. in our linear equation.

Now, compare the sign of these coefficients with our initial observation about the independent variables. For example, RM (average number of rooms) has a positive sign which means house price value increases with RM whereas CRIM (per capita crime) has a negative effect on housing prices.

**Standard Error** – is an estimate of the standard deviation of the coefficient.

**t Stat** – equal to the coefficient divided by its standard error

**P-value** – coefficients with p-values less than alpha (alpha = 0.05) are statistically significant. Variables whose p-values are greater than alpha can be candidates for elimination when building a linear regression model.

Using coefficients, we can easily build a linear regression equation.

`MEDV = 34.02 - 0.11 * CRIM + 0.04 * ZN - 0.04 * INDUS + 3.05 * CHAS -16.75 * NOX + 4.11 * RM - 0.01 * AGE - 1.49 * DIS + 0.27 * RAD - 0.011 * TAX -0.93 * PTRATIO + 0.01 * B - 0.46 * LSTAT`
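Wrapped as a Python function, the fitted equation looks like this (the feature values you pass in are up to you; the all-zeros call below is just a sanity check that returns the intercept):

```python
# The Excel-fitted equation as a function of the 13 input features
def predict_medv(CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT):
    return (34.02 - 0.11 * CRIM + 0.04 * ZN - 0.04 * INDUS + 3.05 * CHAS
            - 16.75 * NOX + 4.11 * RM - 0.01 * AGE - 1.49 * DIS + 0.27 * RAD
            - 0.011 * TAX - 0.93 * PTRATIO + 0.01 * B - 0.46 * LSTAT)

# With every feature at zero, the prediction is just the intercept
print(predict_medv(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))  # 34.02
```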

**Implementation of Linear Regression in R**

Now, we will implement a linear regression using the R language.

# Load Libraries

library(car)

# Read Boston Housing

housing <- read.csv("..data/housing.csv")

str(housing)

Split the data into train and test; you can experiment with different split ratios. Here, we train the model with 80% of the samples and test with the remaining 20%, which we use to assess the model’s performance.

You can use the “lm” function to train a model on the data. We didn’t perform any variable selection technique here.

set.seed(12345)

# Split the data into train and test

s <- sample(1:nrow(housing), 0.8 * nrow(housing))

train <- housing[s, ]

test <- housing[-s,]

# Try linear model using all features

fit <- lm(MEDV ~ ., data = train)

summary(fit)

Use the “summary” function to print the model summary. The R-square of this model comes out to 0.7381, which is close to the linear regression model we built in Excel.

So far, we haven’t validated the assumptions of the linear regression model. R’s “plot” function generates diagnostic plots.

# Plot

plot(fit, which=1)

plot(fit, which=2)

plot(fit, which=3)

plot(fit, which=4)

The Residuals vs Fitted plot shows slight curvature, suggesting a curved relationship between the response and the predictors that our base linear regression fails to capture.

The Normal Q-Q plot suggests a few observations violate the normality assumption; ideally, the points should fall on a straight line.

The Cook’s distance plot flags influential outliers among the observations. These points can negatively impact model performance. You can remove one point at a time and rebuild the model; we leave this analysis for your experimentation.

The diagnostic plots suggest that this linear regression suffers from heteroskedasticity, a non-normal residual distribution, and the presence of outliers.

Okay, so what do we do now? We will analyze the model further and try to improve it.

sort(vif(fit), decreasing = TRUE)

# Remove TAX

fit <- lm(MEDV ~ . - TAX, data = train)

sort(vif(fit), decreasing = TRUE)

# Remove NOX

fit <- lm(MEDV ~ . - TAX - NOX, data = train)

sort(vif(fit), decreasing = TRUE)

summary(fit)

The “vif” function from the “car” package estimates multicollinearity. A VIF below 4 suggests low multicollinearity; higher values indicate strong multicollinearity for a predictor or group of predictors.
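If you prefer Python, the same idea can be hand-rolled: regress each predictor on the others and compute 1 / (1 − R²). This is a sketch on synthetic data (not the car package), where the third column is deliberately almost a copy of the first and should get a huge VIF:

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress column j on the remaining
    columns (plus an intercept) and return 1 / (1 - R^2_j)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.01 * rng.normal(size=200)  # nearly a copy of x1 -> high VIF
vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])
```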

fit <- lm(MEDV ~ . - TAX - NOX - CHAS - RAD, data = train)

summary(fit)

After all this pruning, R-square and adjusted R-square changed only slightly, which gives us confidence that we eliminated only insignificant variables. That’s great.

The diagnostic plots suggested non-linearity, outliers, and a non-normal residual distribution. To correct that, we will analyze the distribution of the target variable.

hist(housing$MEDV, breaks = 50)

hist(log(housing$MEDV), breaks = 50)

The histogram of the target variable MEDV shows a right-skewed distribution, so a log transformation is appropriate for MEDV. You can perform a similar distribution analysis for all the independent variables.
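The effect of a log transform on a right-skewed variable can be sketched in Python with synthetic lognormal data standing in for MEDV (the skewness helper is hand-rolled, not from a library):

```python
import numpy as np

def skewness(a):
    """Sample skewness: the third standardized moment."""
    a = np.asarray(a, dtype=float)
    return np.mean((a - a.mean()) ** 3) / a.std() ** 3

# Synthetic right-skewed (lognormal) values standing in for house prices
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=3.0, sigma=0.4, size=1000)

# The raw values are clearly right-skewed; the log values are roughly symmetric
print(round(skewness(prices), 2), round(skewness(np.log(prices)), 2))
```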

fit <- lm(log(MEDV) ~ . - TAX - NOX - RAD, data = train)

summary(fit)

R-square and adjusted R-square improved to 0.758 and 0.7521 respectively.

layout(matrix(c(1, 2, 3, 4), 2, 2))

plot(fit)

There is minor improvement in these plots, but we still see issues with our linear regression model. You can carry on your own research to improve it further; let me know your findings in the comments.

# Root-mean-square error on the test set

rmse <- sqrt(mean((exp(predict(fit, test)) - test$MEDV)^2))

print(rmse)

Root-mean-square error measures the differences between the values predicted by a model and the values actually observed.
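A hand-rolled Python equivalent of that RMSE computation, shown on made-up actual/predicted values:

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-square error between observed and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

print(rmse([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]))  # ~0.612
```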

**Implementation of Linear Regression in Python**

Here is a snippet implementing the linear regression model in Python, using the “sklearn” library to build the model.

# Import Libraries

import pandas as pd

import numpy as np

from sklearn import metrics

# Import library for Linear Regression

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split

# Read Boston Housing Price Data

housing = pd.read_csv("..data/housing.csv")

# Split target variable and independent variables

X = housing.iloc[:, 0:13]

y = housing.iloc[:, 13:]

# Split into training and testing data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

# Create a Linear regressor

lm = LinearRegression()

# Train the model using the training sets

lm.fit(X_train, y_train)

# Check coefficients

coeffcients = pd.DataFrame([X_train.columns, lm.coef_[0]]).T

coeffcients = coeffcients.rename(columns={0: 'Attribute', 1: 'Coefficients'})

print(coeffcients)

# Model prediction on train data

y_pred = lm.predict(X_train)

print('R^2:', metrics.r2_score(y_train, y_pred))

print('Adjusted R^2:', 1 - (1 - metrics.r2_score(y_train, y_pred)) * (len(y_train) - 1) / (len(y_train) - X_train.shape[1] - 1))

print('MAE:', metrics.mean_absolute_error(y_train, y_pred))

print('MSE:', metrics.mean_squared_error(y_train, y_pred))

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, y_pred)))
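Since the snippet above depends on a local housing CSV, here is a self-contained sanity check of the same sklearn workflow on synthetic, noiseless data where the true coefficients are known and should be recovered exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data generated from y = 5 + 2*x1 - 3*x2 with no noise,
# so the fitted model should recover the coefficients exactly
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 5 + 2 * X[:, 0] - 3 * X[:, 1]

lm = LinearRegression()
lm.fit(X, y)

r2 = r2_score(y, lm.predict(X))
print(np.round(lm.coef_, 2), round(lm.intercept_, 2), round(r2, 4))
```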

**Conclusion**

That’s it for now. We have covered a lot in this post and tried to answer various questions related to linear regression.

Feel free to post your queries; we will try our best to answer them.
