Linear Regression in Python using StatsModels & Scikit Learn

Learn what regression analysis is and how to implement it in Python in this definitive step-by-step mini-course.

If you’re interested, you can follow along with the video and the Linear Regression in Python Jupyter Notebook.

What Is Regression Analysis?

Regression analysis is a common statistical method used in finance to determine the relationship between variables. The process helps us understand which factors matter, which are irrelevant, and how they affect each other.

Let’s cover the key terms:

  • Dependent variable: This is the target response variable we’re trying to predict or understand.
  • Independent variable(s): These are the independent input factors that we think might influence the dependent variable.

For instance, if we want to predict the price of homes, the home price would be the dependent variable, and the factors we believe influence that price would be the independent variables.

Examples of independent variables or factors influencing the home price might be square feet, the number of rooms, garage, finished basement, etc.

Why Is It Called Regression?

The term “regression” was used by Francis Galton in his 1886 paper “Regression towards mediocrity in hereditary stature.” — in other words, regression towards the mean. So the terms and words might sound complex, but the process is, in fact, quite simple.

Now that we understand what regression analysis is and why it’s called regression, let’s cover the various types of regression analysis, starting with the simplest: linear regression.

What Is Linear Regression?

The formal definition is that linear regression is a statistical method for modeling linear relationships between a dependent response variable and one or more independent explanatory variables. Simple linear regression predicts the response variable using one explanatory variable. Multiple linear regression predicts it using numerous explanatory variables.

Let’s break this down. Linear regression assumes there’s a linear relationship between the predicted variable and the independent variable or variables. In the home example above, we expect larger houses to have higher prices. This is what I mean by a linear relationship.

Using only the size or square feet of the house to predict the price of a home is called simple linear regression. We’re only using one independent variable to predict our target. It’s a very simple linear model.

But typically, when we want to predict the house price, we would use more than one independent variable. We would create a multiple linear regression model using several factors that we believe affect the price of homes, e.g., square footage, number of rooms, number of bathrooms, acreage, etc.

And this brings me to a critical point.

When Do You Use Linear Regression?

We want to use linear regression when there’s a linear relationship. To clarify, let’s cover another example. The relationship between the S&P 500 index price and the GDP. This is also called the Warren Buffett indicator. In essence, as companies sell more goods, their stock prices should go up. Let’s see if this relationship is genuine.

I’m using data from the Federal Reserve’s FRED database. Notice how as GDP rises, so does the S&P 500’s price. There is a linear relationship.

Date        SP500    GDP
2012-10-01  1444.49  16420.386
2013-04-01  1562.17  16699.551
2013-07-01  1614.96  16911.068
2013-10-01  1695.00  17133.114
2014-04-01  1885.52  17462.703
2020-04-01  2470.50  19477.444
2020-07-01  3115.86  21138.574
2020-10-01  3380.80  21477.597
2021-04-01  4019.87  22740.959
2021-07-01  4319.94  23173.496

Let’s look at the points on a scatter plot.
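If you want to reproduce the chart yourself, here’s a minimal matplotlib sketch (the original charts were built with plotly); it assumes the table above has been loaded into a pandas DataFrame named df with gdp and sp500 columns, which is an assumption on my part:

import matplotlib.pyplot as plt

# assumes df holds the table above with 'gdp' and 'sp500' columns
plt.scatter(df['gdp'], df['sp500'])
plt.xlabel('GDP')
plt.ylabel('S&P 500')
plt.show()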

Notice how the data points look like a line? Well, let’s fit the best fit line through the points. We’ll cover why it’s the best fit line in a moment.

Notice how the straight line, also known as the regression line, fits the data nicely. Speaking of best fit and fitting, how do we fit a regression line through these values?

Determining Best Fit Line (OLS)

Ordinary least squares (OLS) is a linear least squares method used to estimate unknown model parameters.

Let’s break this down.

First, we want to draw a straight line through our data points. Do you remember the linear equation:

y = mx + b

That’s the equation for a line.

  • y is the S&P500 price or the value we want to predict
  • x is the GDP
  • m and b are model parameters that we’re trying to fit

When we say fit, we estimate the values that make the best fit line by minimizing the error.

There’s no magic here. Look at the following graph.

The error is the difference between the actual GDP vs. S&P 500 data points and the corresponding predictions on the best fit line. The OLS method takes these differences, squares them, and adds them up, which is known as the squared error.

SquaredError = (a_1 - p_1)^2 + (a_2 - p_2)^2 + …

  • a is the actual
  • p is the predicted

We find the line that minimizes the squared residuals. Here’s an example of the residuals for clarity:

Date        Residual
2012-10-01  68.719524
2013-04-01  83.436612
2013-07-01  58.213944
2013-10-01  56.357922
2014-04-01  125.317399

The issue with squared error is that more points will lead to a higher squared error even if the line fits better. The solution is the MSE, or mean squared error, which simply divides the squared error by the number of observations: MSE = SquaredError / n.

But this still doesn’t tell us how far off a typical observation is; remember, we squared the differences. We can take the square root of the MSE to get back to a meaningful number that tells us roughly how much our predictions stray from the actual values. This is the RMSE, or root mean squared error: RMSE = sqrt(MSE).
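Here’s a small sketch of all three measures, using the first three actual S&P 500 values and the predictions implied by the residuals in the table above (the predicted array is derived by me, so treat it as illustrative):

import numpy as np

# actual S&P 500 values and residual-implied predictions from the table above
actual = np.array([1444.49, 1562.17, 1614.96])
predicted = np.array([1375.77, 1478.73, 1556.75])

squared_error = np.sum((actual - predicted) ** 2)  # sum of squared differences
mse = squared_error / len(actual)                  # mean squared error
rmse = np.sqrt(mse)                                # root mean squared error
print(squared_error, mse, rmse)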

The reason we square the differences is that it accomplishes two things:

  • It makes every difference positive.
  • It punishes increasingly poor predictions more heavily.

So what’s the formula for our best fit line?

Let’s take a look at the statsmodels Python output below. Can you figure it out?

Dep. Variable:     y                  R-squared:           0.941
Model:             OLS                Adj. R-squared:      0.938
Method:            Least Squares      F-statistic:         305.0
Date:              Sun, 14 Nov 2021   Prob (F-statistic):  3.68e-13
Time:              10:25:40           Log-Likelihood:      -139.94
No. Observations:  21                 AIC:                 283.9
Df Residuals:      19                 BIC:                 286.0
Df Model:          1
Covariance Type:   nonrobust

          coef        std err   t         P>|t|   [0.025      0.975]
const    -4680.4714   410.339   -11.406   0.000   -5539.322   -3821.621
x1        0.3688      0.021     17.465    0.000   0.325       0.413

If you said:

y = mx + b
s&p500 = m * gdp + b
y_hat = 0.3688 * gdp - 4680.4714

You would be correct.

Also, since we’ll use this notation later on, we can rearrange the formula, turning m and b into b_1 and b_0:

y = mx + b to y = b_0 + b_1 * X_1

ŷ, or “Y hat,” simply means it’s a predicted or expected value. The X’s are the distinct independent variables, and the b’s are the regression coefficients. Each regression coefficient represents the change in Y relative to the change in the respective independent variable.

In other words, for every one unit of GDP, we expect to add 0.3688 points to the S&P500. If the GDP is zero, the S&P500 is predicted to be -4680.4714; obviously, that would never happen but remember, we’re modeling here, and there are no perfect models!
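As a quick sanity check, here’s a back-of-the-envelope prediction using the fitted coefficients; the GDP value here is purely hypothetical:

# prediction with the fitted coefficients; the GDP input is hypothetical
gdp = 20_000  # GDP in billions of dollars
sp500_pred = 0.3688 * gdp - 4680.4714
print(sp500_pred)  # roughly 2695.5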

And while this post is about linear regression, perhaps the regression problem shouldn’t be modeled linearly?

Polynomial Regression

Look at the chart below. Do you notice how the straight regression line doesn’t fit the data?

We can model curved relationships using polynomial regression, which we’ll briefly touch on. If you want to learn more about polynomial regression, let me know in the comments below, and I’ll write about it.

In short, we can use polynomial formulas, such as y = b_0 + b_1*x + b_2*x^2 + b_3*x^3, to model curved relationships.

We can fit a third-degree polynomial to get a curve that fits our data, as seen in the following figure.
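If you’d like to try this yourself, here’s a minimal sketch using numpy.polyfit; the arrays below are synthetic data I made up for illustration, not the GDP data from this post:

import numpy as np

# synthetic curved data, purely for illustration
x = np.linspace(0, 10, 50)
y = 0.5 * x**3 - 2 * x**2 + x + np.random.normal(scale=5, size=x.size)

coefs = np.polyfit(x, y, deg=3)  # estimate the four polynomial coefficients
poly = np.poly1d(coefs)          # callable polynomial
y_fit = poly(x)                  # fitted values along the curve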

With a general understanding of polynomial regression, let’s determine whether our simple linear regression model above performs well.

Understanding Regression Performance

To keep it simple, we’ll discuss r-squared. R-squared, r2, or the coefficient of determination, is the proportion of the variation explained by the predictor variables in the observed data. The higher the r-squared, the better.

Revisiting our example, the r-squared is 0.941 in our model. This means that 94% of the variation in the observed data can be explained by our formula, and 6% cannot.

An r-squared of 94% is pretty high, so in our model the predictor variable explains the dependent variable quite well over the historical data.
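For intuition, r-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. Here’s a small sketch using the five actual values and residual-implied predictions from the tables above; because it only uses five of the 21 observations, the result won’t match the full-sample 0.941:

import numpy as np

# the first five actual values and residual-implied predictions from above
actual = np.array([1444.49, 1562.17, 1614.96, 1695.00, 1885.52])
predicted = np.array([1375.77, 1478.73, 1556.75, 1638.64, 1760.20])

ss_res = np.sum((actual - predicted) ** 2)      # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(r_squared)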

We can see this relationship visually in the graphs above, where the straight line fits the data.

This does not necessarily mean our model has future predictive power. We’ll cover how to assess that in the cross-validation section.

There are many other metrics in the output summary below, such as the standard error of each coefficient, which is calculated from the covariance matrix of the estimated coefficients using the mean squared error of the residuals, but we’ll save those for another time.

Dep. Variable:     y                  R-squared:           0.941
Model:             OLS                Adj. R-squared:      0.938
Method:            Least Squares      F-statistic:         305.0
Date:              Sun, 14 Nov 2021   Prob (F-statistic):  3.68e-13
Time:              10:25:40           Log-Likelihood:      -139.94
No. Observations:  21                 AIC:                 283.9
Df Residuals:      19                 BIC:                 286.0
Df Model:          1
Covariance Type:   nonrobust

          coef        std err   t         P>|t|   [0.025      0.975]
const    -4680.4714   410.339   -11.406   0.000   -5539.322   -3821.621
x1        0.3688      0.021     17.465    0.000   0.325       0.413

For now, let’s see how I created this simple linear regression model in Python. We’ll change up the data to make it more interesting.

Simple Linear Regression in Python

Let’s perform a regression analysis on the money supply and the S&P 500 price.

The Federal Reserve controls the money supply in three ways:

  • Reserve ratios – The share of deposits banks must hold in reserve, which limits how much they can lend out
  • Discount rate – The rate at which banks can borrow from the Fed
  • Open-market operations – Buying and selling government securities (the big one).
Date        SP500    CURCIR
2012-10-01  1444.49  1134.623
2013-04-01  1562.17  1178.421
2013-07-01  1614.96  1196.581
2013-10-01  1695.00  1213.425
2014-04-01  1885.52  1270.128
2020-04-01  2470.50  1887.380
2020-07-01  3115.86  1978.268
2020-10-01  3380.80  2040.201
2021-04-01  4019.87  2154.819
2021-07-01  4319.94  2186.128

Let’s use statsmodels to implement linear regression.

import statsmodels.api as sm

# Regress the S&P 500 price on currency in circulation, with an intercept term
model = sm.OLS(df['sp500'].values, sm.add_constant(df['curcir'].values)).fit()
model.summary()
Dep. Variable:     y                  R-squared:           0.921
Model:             OLS                Adj. R-squared:      0.917
Method:            Least Squares      F-statistic:         220.7
Date:              Sun, 14 Nov 2021   Prob (F-statistic):  6.51e-12
Time:              10:25:40           Log-Likelihood:      -143.10
No. Observations:  21                 AIC:                 290.2
Df Residuals:      19                 BIC:                 292.3
Df Model:          1
Covariance Type:   nonrobust

          coef        std err   t        P>|t|   [0.025      0.975]
const    -1083.6461   242.885   -4.462   0.000   -1592.010   -575.283
x1        2.2610      0.152     14.857   0.000   1.942       2.580

Our new linear model using the currency in circulation performs worse than our GDP model when comparing the r-squared value.

Multiple Linear Regression in Python

Multiple linear regression is just like simple linear regression, except it uses two or more independent variables (features) instead of just one.

Let’s check out the data now that we have two variables for input features.

Date        SP500    GDP        CURCIR
2012-10-01  1444.49  16420.386  1134.623
2013-04-01  1562.17  16699.551  1178.421
2013-07-01  1614.96  16911.068  1196.581
2013-10-01  1695.00  17133.114  1213.425
2014-04-01  1885.52  17462.703  1270.128
2020-04-01  2470.50  19477.444  1887.380
2020-07-01  3115.86  21138.574  1978.268
2020-10-01  3380.80  21477.597  2040.201
2021-04-01  4019.87  22740.959  2154.819
2021-07-01  4319.94  23173.496  2186.128

The statsmodels Python implementation is simple. We just pass multiple columns of independent variables instead of a single one.

model = sm.OLS(df2['sp500'].values,
                sm.add_constant(df2[['gdp', 'curcir']].values)).fit()
model.summary()
Dep. Variable:     y                  R-squared:           0.955
Model:             OLS                Adj. R-squared:      0.950
Method:            Least Squares      F-statistic:         192.8
Date:              Sun, 14 Nov 2021   Prob (F-statistic):  6.97e-13
Time:              10:25:40           Log-Likelihood:      -137.06
No. Observations:  21                 AIC:                 280.1
Df Residuals:      18                 BIC:                 283.3
Df Model:          2
Covariance Type:   nonrobust

          coef        std err   t        P>|t|   [0.025      0.975]
const    -3407.5306   648.810   -5.252   0.000   -4770.629   -2044.432
x1        0.2297      0.061     3.741    0.001   0.101       0.359
x2        0.9064      0.381     2.381    0.029   0.107       1.706

Linear regression in Python is so easy, isn’t it :)?

We can see that having two variables improved the regression model. Our predicted values should be improved with a higher R-squared value.

Notice there’s one more coefficient in the coefficients section of the regression results.

y = b_0 + b_1* X_1 + b_2 * X_2
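Plugging the fitted coefficients from the summary above into that formula gives a quick prediction; the input values here are hypothetical:

# hypothetical inputs: GDP of 20,000 and currency in circulation of 2,000 (billions)
gdp, curcir = 20_000, 2_000
sp500_pred = -3407.5306 + 0.2297 * gdp + 0.9064 * curcir
print(sp500_pred)  # roughly 2999.3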

How do we visualize this? With the simple linear regression model, we drew the best-fit regression line through the observed data. Let’s think through this.

Visualizing Multiple Linear Regression

We can perform a simple linear regression for each variable and graph them separately, like the charts below.

Having two separate charts is nice, but each line is just the best fit line for an independent simple linear regression model like the ones we covered above. To see both variables together, we need more than two columns, or two dimensions.

It’s time to put on our 3d glasses.

Let’s create a 3d graph of the multiple linear regression model, where the y-values are the S&P 500, and the x and z values are GDP and currency in circulation, respectively. At first, visualizing three dimensions feels strange, but it becomes natural over time.
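Here’s a rough matplotlib sketch of such a chart (the post’s interactive version uses plotly); it assumes the multiple regression DataFrame is named df2 and holds gdp, curcir, and sp500 columns, and it reuses the fitted coefficients from the summary above:

import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection='3d')

# observed points: GDP and currency in circulation on the horizontal axes,
# the S&P 500 on the vertical axis
ax.scatter(df2['gdp'], df2['curcir'], df2['sp500'])

# predicted values from the fitted coefficients in the summary above
y_hat = -3407.5306 + 0.2297 * df2['gdp'] + 0.9064 * df2['curcir']
ax.plot(df2['gdp'], df2['curcir'], y_hat, color='red')

ax.set_xlabel('GDP')
ax.set_ylabel('Currency in Circulation')
ax.set_zlabel('S&P 500')
plt.show()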

The straight line moves up and to the right, my favorite direction (trading joke). We can see that as both GDP and currency in circulation increase, so does the S&P 500 price.

Why don’t we add some random data to see how that affects our model? Let’s add a random column, with values chosen at random from 1, 20, and 1000, to our multiple linear regression model.

import numpy as np

np.random.seed(1337)  # used to replicate randomness
df['rand'] = np.random.choice([1, 1000, 20], df.shape[0])
df
Date        sp500    gdp       curcir       rand
2011-12-01  1244.58  132896.0  1212.514299  1000
2012-02-01  1324.09  133512.0  1251.693145  1
2012-03-01  1374.09  133752.0  1288.542566  1
2012-05-01  1405.82  133934.0  1310.079326  1000
2012-06-01  1278.04  134007.0  1322.014087  20
2021-03-01  3901.82  144057.0  3719.651686  1000
2021-04-01  4019.87  144326.0  3812.148288  1
2021-06-01  4202.04  145902.0  3887.058083  1000
2021-07-01  4319.94  146993.0  3917.892592  1000
2021-09-01  4524.09  147553.0  3945.046886  1

We know this is random and won’t help our regression model. Let’s see how it performs.

Dep. Variable:     y                  R-squared:           0.955
Model:             OLS                Adj. R-squared:      0.948
Method:            Least Squares      F-statistic:         121.4
Date:              Sun, 14 Nov 2021   Prob (F-statistic):  1.11e-11
Time:              10:25:40           Log-Likelihood:      -137.06
No. Observations:  21                 AIC:                 282.1
Df Residuals:      17                 BIC:                 286.3
Df Model:          3
Covariance Type:   nonrobust

The r-squared didn’t improve, which should be obvious — we added random data. But how do we know if a feature is statistically significant? How do we know this new input feature helps our predicted value? Let’s dive deeper.

Well, there’s more to it than this, but a good rule of thumb is that if the p-value is 0.05 or lower, the coefficient and independent variable are said to be statistically significant.

In other words, if the p-value is small and the increase in r2 is large, add the variable to the input features; otherwise, discard.

We can see that the p-value for x3, our random data, is 0.785, so we should remove it from our model, even if it had nudged r-squared up, which it didn’t.

          coef        std err   t        P>|t|   [0.025      0.975]
const    -2956.6462   531.689   -5.561   0.000   -4017.066   -1896.226
x1        0.0133      0.004     3.252    0.002   0.005       0.021
x2        2.2467      0.074     30.404   0.000   2.099       2.394
x3        0.0115      0.042     0.274    0.785   -0.072      0.095

There’s another issue that we need to discuss.

Multicollinearity in Regression

Multi-what? When we perform linear regression, the independent variables should be … well… independent. We should understand that a regression coefficient represents the change in the predicted response for each 1 unit change in the independent variable, holding all other independent variables constant.

There are different types of multicollinearity and other problems it causes, but in short, when the independent variables are correlated, you can’t trust the p-values to identify statistically significant variables.

So how do we know if the independent features are independent?

We can detect multicollinearity in our model data with VIF.

Variance Inflation Factor

Variance inflation factor or VIF detects multicollinearity in regression analysis.

A VIF of 1 indicates two variables are not correlated, a VIF greater than 1 and less than 5 indicates a moderate correlation, and a VIF of 5 or above indicates a high correlation.
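Under the hood, a feature’s VIF is 1 / (1 - R^2), where R^2 comes from regressing that feature on all of the other features. Here’s a minimal sketch of that idea; the simple_vif helper is a name I made up for illustration, and its numbers may differ slightly from statsmodels depending on how the intercept is handled:

import pandas as pd
from sklearn.linear_model import LinearRegression

def simple_vif(X: pd.DataFrame, column: str) -> float:
    """Regress one feature on the rest and return 1 / (1 - r2)."""
    others = X.drop(columns=[column])
    r2 = LinearRegression().fit(others, X[column]).score(others, X[column])
    return 1.0 / (1.0 - r2)

# e.g., simple_vif(df[['gdp', 'curcir', 'rand']], 'gdp')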

We can use statsmodels to determine the VIF for each feature.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF for each independent variable
vif = pd.DataFrame()
df2 = df[['gdp', 'curcir', 'rand']]
vif['feature'] = df2.columns
vif['VIF'] = [variance_inflation_factor(df2.values, i) for i in range(len(df2.columns))]
vif
feature   VIF
gdp       30.449119
curcir    30.837657
rand      1.529648

We can see that GDP and CURCIR both have very high VIFs, which means they are highly correlated with each other. So how do we handle it when two or more variables have a high VIF? In this case, we should add more data if possible (our model has only 21 data points) and then reevaluate. If they are still correlated, we should drop the feature that improves the model the least.

While the S&P 500 started in 1957, FRED has historical data beginning in 2011. You could substitute the S&P 500 index with the Wilshire 5000 Total Market Full Cap Index (WILL5000INDFC) with data starting in 1980.

Because this is getting a little long, I’ll let you try that on your own. Spoiler alert: it bests GDP historically.

Before we jump into the next section, take a second to think about how many future values we’ve actually predicted so far.

None…

Cross-Validation

The goal of a regression model in most cases is to predict future values. We’ve used all of the data so far when building/training our linear regression model, so we’re effectively overfitting: we’re building a model on observed data and then asking how well it predicts that same historical data.

If we use our linear regression model with next quarter’s GDP to predict the future S&P 500 price, then we’re finally making a prediction.

We should break the data up into a training set and a test set, or better yet, multiple training and test sets. We’ll use different slices of history (the training sets) to make predictions about other periods in history (the test sets).

This would help us determine if the currency in circulation or GDP was better for predicting equity prices. As we saw, GDP was the winner in the first example, and currency in circulation bested GDP over a more extended period, but what about in the middle?

It’s plain to see that this type of train/test split is more robust and often produces a better regression model, leading to a more accurate predicted response. This is common practice in scientific computing and machine learning. The one concern with machine learning models is that they are prone to overfitting, which we’ll discuss in a bit.

Linear Regression in Machine Learning (Walk-Through)

Now that you have a better understanding of regression and some of the pitfalls, it’s time to connect the dots (pun intended).

I will use sklearn, another popular data science library, to create the training and test data splits. I’ll also use the linear regression model from sklearn, but linear regression is available in both packages, so you can use either. We’ll need to import quite a few more libraries, and this time, instead of using plotly, we’ll use matplotlib in conjunction with seaborn.

We’ll first grab the required python modules.

Get Required Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Next, let’s organize the columns.

columns = ['gdp','curcir', 'rand', 'sp500']
df = df[columns]
df.head()

            gdp        curcir    rand   sp500
2012-10-01  16420.386  1134.623  1000   1444.49
2013-04-01  16699.551  1178.421  1      1562.17
2013-07-01  16911.068  1196.581  1      1614.96
2013-10-01  17133.114  1213.425  1000   1695.00
2014-04-01  17462.703  1270.128  20     1885.52

Scale & Normalize Data

Machine learning algorithms generally work better when features share a similar scale. Let’s scale the variables to a range between 0 and 1 using sklearn.preprocessing.MinMaxScaler.

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[columns])
df2 = pd.DataFrame(scaled)
df2.columns = columns  # the scaler returns an ndarray, so restore the column names
df2.head()
    gdp       curcir    rand      sp500
0   0.000000  0.000000  1.000000  0.000000
1   0.041339  0.041653  0.000000  0.040926
2   0.072660  0.058923  0.000000  0.059285
3   0.105541  0.074942  1.000000  0.087120
4   0.154346  0.128868  0.019019  0.153378

Notice that all of the original features and the target are now scaled between zero and one. Also, remember that overfitting thing? We just did it…

Overfitting means that our model fits too closely to a particular set of data and may fail to predict observed values reliably.

In the above case, we fit the scaler on the entire data set, test data included. We can’t do that, because we’d be making predictions on data that was used to prepare our regression model.

We should only ever use MinMaxScaler.fit_transform on training data and MinMaxScaler.transform on test data. The reason is that we can’t base our scaling on the test data; we should only fit the scaler on the training data.

Split Data Into Train & Test Sets

There are other ways to overfit, too. We’ll discuss a few more ways shortly. For now, let’s separate our data into training and testing sets. We’ll train on 70% of the data and test on the remaining 30%. We’ll also scale our data properly instead of overfitting like we did above.

Always remember to only call transform and not fit_transform on the test data. You should never fit to testing data!

# Create training data.
train_size = 0.7
df_train, df_test = train_test_split(df,
                                    train_size=train_size,
                                    test_size=round(1-train_size,2),
                                    shuffle=False
                                    )
# Scale the test and train data.
scaler = MinMaxScaler()
df_train[columns] = scaler.fit_transform(df_train[columns])
df_test[columns] = scaler.transform(df_test[columns])  # transform only; never fit on the test data

# Separate into training and testing sets
y_train = df_train.pop('sp500') # one column
X_train = df_train # three columns

y_test = df_test.pop('sp500')
X_test = df_test
print(X_test.head())
print(y_test.head())

            gdp       curcir    rand
2019-07-01  1.044311  1.034622  1.000000
2019-10-01  1.083220  1.072790  1.000000
2020-04-01  0.627877  1.272009  0.019019
2020-07-01  0.969050  1.425591  0.019019
2020-10-01  1.038680  1.530246  0.019019

y_test (sp500)
2019-07-01  1.026850
2019-10-01  1.010580
2020-04-01  0.693203
2020-07-01  1.129228
2020-10-01  1.308229

Notice that some of the test values are greater than one because the scaler was fit only on the training data.

Now let’s fix our multicollinearity issue identified by VIF.

Recursive Feature Elimination

Instead of manually removing features, imagine if we had numerous features and weren’t sure which ones to eliminate. Machine learning to the rescue.

Recursive feature elimination does just that. It’s simple to do: we furnish a hyperparameter specifying the number of features we want to keep, and it does the hard work for us. (A hyperparameter is a setting we choose for the algorithm, rather than a value the model learns from the data.)

Let’s see it in action.

from sklearn.feature_selection import RFE
lm = LinearRegression()
rfe = RFE(lm, n_features_to_select=1)
rfe = rfe.fit(X_train, y_train)
print(X_train.columns)
print(rfe.support_)
print(rfe.ranking_)
Index(['gdp', 'curcir', 'rand'], dtype='object')
[ True False False]
[1 2 3]

Notice that our n_features_to_select hyperparameter was set to one, causing RFE to select only GDP. We can also see the rankings are 1, 2, and 3 for GDP, currency in circulation, and our random variable, respectively.

Create Linear Regression Model

Let’s now understand a little more about what we did above, and create another linear regression model below.

We’ll create a LinearRegression object and fit the training data to it. I’ll then use that trained LinearRegression object to predict the y values, compare y_pred to the actual values (y_test), and print out our r2. Note that sklearn expects the features as a 2d array of shape (n_samples, n_features), so we reshape the single GDP column with reshape(-1, 1). We didn’t need to do this above because RFE took care of it for us.

lm = LinearRegression()
# Only use GDP, as determined by RFE & VIF
# sklearn expects a 2d array, so reshape the single column
lm.fit(X_train['gdp'].values.reshape(-1,1), y_train)

# Use test data for prediction
y_pred = lm.predict(X_test['gdp'].values.reshape(-1,1))

r2 = r2_score(y_test, y_pred)
print(r2)
0.4838240551775319

RFE selects the best features recursively and applies the LinearRegression model to it. With this in mind, we should — and will — get the same answer for both linear regression models.

y_pred = rfe.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(r2)
0.4838240551775319

I wanted to show you both ways of creating a LinearRegression model. Keep in mind that RFE can be used with all sorts of estimators such as a decision tree or random forest.

Cross-Validation Using K-Folds in Python

Instead of splitting the data into one train set and one test set, we can slice the data into multiple periods and create multiple training and test sets. Let’s use four folds as an example. We’ll create a KFold object with four splits; for each split, it returns the indices of the training and test observations.

Because our 14 training observations aren’t evenly divisible by 4, the first two test folds take four elements each and the last two take three.

kf = KFold(n_splits = 4)
for train_index, test_index in kf.split(X_train):
    print("Train: ", train_index,"\nTest:  ", test_index, "\n\n")
Train:  [ 4  5  6  7  8  9 10 11 12 13]
Test:   [0 1 2 3]

Train:  [ 0  1  2  3  8  9 10 11 12 13]
Test:   [4 5 6 7]

Train:  [ 0  1  2  3  4  5  6  7 11 12 13]
Test:   [ 8  9 10]

Train:  [ 0  1  2  3  4  5  6  7  8  9 10]
Test:   [11 12 13]

Notice how we now have four groups of test and train data. We can quickly estimate our r2 for each test group.

Analyze The Results

scores = cross_val_score(lm, X_train, y_train, scoring='r2', cv=kf)
scores
array([0.61332189, 0.05476058, 0.48435364, 0.91848863])

An Overfitting Conclusion

We see that the original linear regression model, which we thought was terrific, turns out not to be that great at predicting future S&P 500 prices on unseen data. There is some predictive power, but it isn’t enough for me to put my money behind it.

The good news is that you now have everything you need to perform simple and multiple linear regression in Python to create even better predictive models — for the markets or whatever you choose.

I hope you enjoyed, and if you have any questions, please let me know in the comments below.
