/ DATA SCIENCE

Autoregressive Model Python

Autoregression, or an autoregressive model, is a type of predictive modeling that uses linear regression on past values to predict the next value in a time series.

You’ll learn what autoregression is and how to forecast the price of Bitcoin using an AR model in this post. All the code is on the Analyzing Alpha GitHub Repository

Prerequisites

Table of Contents

What is Autoregression?

Autoregressive models forecast a series based solely on the past values of the series. An AR model that uses the past price lag is an autoregressive model of order one, or AR(1) for short. An AR model that uses the past two lags is an AR(2) model, and an AR(p) model is an AR process of p lags.

We can see that clearly with the following formula, which we’ll say Bitcoin’s returns where:

$% r_t = \beta_0 + \phi r_{t-1} + \varepsilon_t $%

  • $%r_t $% is the return we’re predicting.
  • $%\phi $%, the Greek letter Phi, is the slope coefficient.
  • $%r_{t-1} is the previous period’s return.
  • $%\varepsilon $% is the greek letter Epsilon is the error.

We can make a more generalized autoregression model of order p where p is the number of lagged coefficients.

$% r_t = \beta_0 + \phi r_{t-1} + \phi r_{t-2} + \varepsilon_{t-2} \dots + \varepsilon_{t-1} $%

An AR(1) model is:

  • white noise when $% \phi = 0 $%
  • is a random walk when $% \phi = 1 $%
  • is a random walk with drift when $% \phi = 1 and \beta \not 0 $%
  • mean reverting when $% phi < 1 $%

Since stationarity is an assumption with many of the time series forecasting models, including autoregression, $%phi $% will always be between -1 and 1.

Speaking of stationary time series, let’s dig into some code.

First, let’s get the minute price data for Bitcoin and resample it into 4-hour bars. Then, as we’ve learned in previous lessons, we’ll transform it using both log to reduce variance and differencing to remove trend. We’ll then validate that the return series is stationary using an Augmented Dickey-Fuller test.

import urllib
import pandas as pd
import zipfile

repo = 'https://raw.githubusercontent.com/leosmigel/analyzingalpha/master/'
file = 'time-series-analysis-with-python/btc_price_history.zip'
url = repo + file

with urllib.request.urlopen(url) as f:
  btc_price_history = pd.read_csv(f, compression='gzip',
                                  parse_dates=['btc_price_history.csv'])
  btc_price_history = btc_price_history.rename(
      columns={'btc_price_history.csv':'date'}
  )
  btc_price_history.set_index('date', inplace=True)
btc_price_history.dropna(inplace=True)

btc_log = np.log(btc_price_history.resample('4H', closed='left') \
    .agg({'close':'last'})).diff(1).dropna()

print(btc_log)
make_plot(btc_log)
                        close
date                         
2020-04-01 04:00:00 -0.007523
2020-04-01 08:00:00 -0.005499
2020-04-01 12:00:00 -0.008979
2020-04-01 16:00:00 -0.004729
2020-04-01 20:00:00  0.069989

Bitcoin 4-Hour Log Returns

Bitcoin’s returns look stationary. Let’s verify this assumption by performing an Augmented Dickey-Fuller (ADF) test.

from statsmodels.tsa.stattools import adfuller

t_stat, p_value, _, _, critical_values, _  = adfuller(btc_log.values, autolag='AIC')
print(f'ADF Statistic: {t_stat:.2f}')
print(f'p-value: {p_value:.2f}')
for key, value in critical_values.items():
     print('Critial Values:')
     print(f'   {key}, {value:.2f}')
ADF Statistic: -9.84
p-value: 0.00
Critial Values:
   1%, -3.43
Critial Values:
   5%, -2.86
Critial Values:
   10%, -2.57

The ADF test statistic gives us -9.84 with a critical value for 1% at -3.43. This means Bitcoin’s returns are stationary, and we can say that with significant confidence.

With the stationary assumption met, let’s move on to autocorrelation.

What is Autocorrelation?

Autocorrelation, also known as serial correlation, refers to the linear relationship between lagged values of a time series. More specifically, the error term for a period is correlated with the error from another period.

Think about it this way – if the price of coffee has increased over the past year, it’ll likely increase next month, too.

There are two ways past prices affect the future price of coffee:

  • Directly
  • Indirectly

Let’s pretend it’s December.

A poor Brazilian coffee harvest in October affects January’s price. Additionally, increased demand in the holiday shopping season this month directly affects the price of coffee in January. These month’s price changes directly affect January’s future price and are aptly named direct effects.

Alternatively, the price in October affects November’s prices, which affects the prices in December that affect January’s price – all prices have an indirect “chaining” effect on next month’s price – named indirect effects.

We care about autocorrelation and the above effects when discussing autoregressive models because we want to create the best possible model by picking the optimal lag value p. In other words, what lagged values are most correlated with our predicted value, and are they statistically significant?

Let’s analyze the autocorrelation of Bitcoin’s returns with lags up to 10.

corrmat = btc_log.copy()
for i in np.arange(1,11):
    name = 'p' + str(i)
    corrmat[name] = cormat['close'].shift(i)
corrmat.dropna(inplace=True)
corrmat.corr()

It looks like the correlation is pretty cold, and the heatmap isn’t very helpful. A better way is to use autocorrelation and partial autocorrelation functions and associated plots.

ACF and PACF Plot

The blue shaded area on the plots is the Barlett confidence intervals at 95% confidence.

For our purposes, we want to understand if a past lag affects the future price. With this in mind, we want to analyze the direct effect and not the indirect effect. This means that we’ll want to use the PACF for our lags and not the ACF.

In other words, according to the partial autocorrelation plot below, if the most recent past price is positive or negative, we can expect prices to mean revert on the four-hour timeframe more often than not. The second lag is within the confidence intervals, so it’s not significant. And the third lag is above the confidence bands so we can expect there to be a slightly positive correlation.

Now keep in mind – the correlation value according to the partial autocorrelation function is low for past returns is low. This does make sense. In trading, there is almost zero obvious alpha.

Optimal Lag Value

That brings us to selecting the optimal time lag for our autoregressive model. We see that lag 1, lag 3, lag 6, lag 12, lag 15, lag 23, lag 24, and lag 26 are outside the blue-shaded region and are statistically significant. So what lag should we use?

It’s a balance between underfitting and overfitting. Let’s use the time lag variables 3 and 24 for our autoregressive models.

Forecasting with Autoregressive Models

We now have stationary data and an optimal lag parameter selected, now it’s time to use this modeling technique to forecast Bitcoin’s future returns.

We’ll split our time series data into training data and test data, and use 80% of the past data for training and save 20% for the predicted values.

We’ll first import ARIMA from the statsmodels library; then, we’ll set a rolling window and create two variables – one for each AR model. The first variable in the order tuples selects the p values.

from time import time
from statsmodels.tsa.arima.model import ARIMA
s = rolling_window = int(len(btc_log) * 0.8)
order3 = (3,0,0)
order24 = (24,0,0)

Now we’ll create a model, then train it, and then print out the summary statistics. We’ll also show how the AR(24) model is much more computationally complex than the AR(3) model.

# Create and fit the model, and print out summary stats and duration
start = time()
mod = ARIMA(endog=s, order=order3)
res = mod.fit()
end = time()
print(res.summary())
print(f"It took {round(end - start,2)} seconds to fit the AR3 model")
          SARIMAX Results                                
==============================================================================
Dep. Variable:                  close   No. Observations:                 1751
Model:                 ARIMA(3, 0, 0)   Log Likelihood                5087.324
Date:                Mon, 17 May 2021   AIC                         -10164.649
Time:                        12:05:41   BIC                         -10137.309
Sample:                    04-01-2020   HQIC                        -10154.542
                         - 01-17-2021                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0010      0.000      2.912      0.004       0.000       0.002
ar.L1         -0.0272      0.020     -1.391      0.164      -0.066       0.011
ar.L2         -0.0051      0.014     -0.377      0.706      -0.032       0.022
ar.L3          0.0801      0.018      4.534      0.000       0.045       0.115
sigma2         0.0002   2.75e-06     63.615      0.000       0.000       0.000
===================================================================================
Ljung-Box (L1) (Q):                   0.01   Jarque-Bera (JB):              4535.14
Prob(Q):                              0.94   Prob(JB):                         0.00
Heteroskedasticity (H):               1.64   Skew:                            -0.21
Prob(H) (two-sided):                  0.00   Kurtosis:                        10.87
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
It took 0.28 seconds to fit the AR3 model.
start = time()
# Create and fit the model
mod = ARIMA(endog=s, order=order24)
res = mod.fit()
end = time()
print(res.summary())
print(f"It took {round(end - start,2)} seconds to fit the AR24 model")
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                  close   No. Observations:                 1751
Model:                ARIMA(24, 0, 0)   Log Likelihood                5102.688
Date:                Mon, 17 May 2021   AIC                         -10153.375
Time:                        13:05:58   BIC                         -10011.209
Sample:                    04-01-2020   HQIC                        -10100.823
                         - 01-17-2021                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0010      0.000      3.202      0.001       0.000       0.002
ar.L1         -0.0304      0.021     -1.470      0.141      -0.071       0.010
ar.L2          0.0024      0.017      0.142      0.887      -0.031       0.036
ar.L3          0.0942      0.020      4.764      0.000       0.055       0.133
ar.L4          0.0291      0.017      1.739      0.082      -0.004       0.062
ar.L5         -0.0150      0.019     -0.800      0.423      -0.052       0.022
ar.L6         -0.0787      0.022     -3.641      0.000      -0.121      -0.036
ar.L7         -0.0284      0.019     -1.457      0.145      -0.067       0.010
ar.L8          0.0182      0.020      0.927      0.354      -0.020       0.057
ar.L9          0.0430      0.020      2.159      0.031       0.004       0.082
ar.L10         0.0050      0.020      0.245      0.807      -0.035       0.045
ar.L11         0.0193      0.018      1.046      0.296      -0.017       0.055
ar.L12        -0.0415      0.021     -1.952      0.051      -0.083       0.000
ar.L13        -0.0410      0.023     -1.783      0.075      -0.086       0.004
ar.L14        -0.0288      0.020     -1.430      0.153      -0.068       0.011
ar.L15        -0.0010      0.022     -0.045      0.964      -0.045       0.043
ar.L16         0.0194      0.020      0.978      0.328      -0.019       0.058
ar.L17         0.0021      0.024      0.090      0.929      -0.044       0.049
ar.L18        -0.0114      0.023     -0.505      0.613      -0.056       0.033
ar.L19        -0.0372      0.020     -1.878      0.060      -0.076       0.002
ar.L20        -0.0425      0.023     -1.872      0.061      -0.087       0.002
ar.L21        -0.0065      0.021     -0.310      0.757      -0.048       0.035
ar.L22         0.0208      0.020      1.022      0.307      -0.019       0.061
ar.L23         0.0090      0.021      0.422      0.673      -0.033       0.051
ar.L24        -0.0423      0.019     -2.196      0.028      -0.080      -0.005
sigma2         0.0002   2.97e-06     57.984      0.000       0.000       0.000
===================================================================================
Ljung-Box (L1) (Q):                   0.05   Jarque-Bera (JB):              4832.85
Prob(Q):                              0.82   Prob(JB):                         0.00
Heteroskedasticity (H):               1.63   Skew:                            -0.20
Prob(H) (two-sided):                  0.00   Kurtosis:                        11.13
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
It took 24.63 seconds to fit the AR24 model.

In the SARIMAX Results tables, we see a lot of statistical data. I’ve created another blog post to cover how to interpret ARIMA results.

We’ll use three as our p-value in our autoregressive model as the BIC is lower than the AR(24).

Let’s create a function to predict the return and apply it to our data set. We’ll also need to shift the output by one as our prediction is for the next period.

def pred_price(train, oder):
    mod = ARIMA(endog=train, order=order)
    res = mod.fit()
    pred = res.forecast()
    return(pred[0])
btc_log['pred'] = btc_log['close'].rolling(rolling_window) \
                    .apply(lambda x: pred_price(x, order3))
btc_log['pred'] = btc_log['pred'].shift(1)

We can see the prediction isn’t that great in the following plot. Regardless, let’s take it to the next step and implement a basic trading strategy.

Autoregressive Bitcoin Trading Strategy

Now it’s time to trade the autoregressive model. We’ll buy when the predicted price is above the prior prediction.

df['signal'] = np.where((df['pred'] > df['pred'].shift(1)) & \
                        (df['close'] > df['pred'].shift(-1)), 1, -1)
df['returns'] = df['signal'] * df['close']

fig = go.Figure(data=go.Scatter(
                    y=np.exp(df['close'].cumsum()),
                    x=df.index,
                    name="Actual")
               )
fig.add_trace(
        go.Scatter(
            y=np.exp(df['returns'].cumsum()),
            x=df.index,
            name="Strategy")
    )                
fig.update_layout(title=dict(text="Bitcoin AR Strategy Returns",font=dict(size=24)),
                 legend=dict(font=dict(size=20)),
                  width=2000,
                  height=1250)

fig.update_xaxes(tickfont=dict(size=20, color="#434"))
fig.update_yaxes(tickfont=dict(size=20, color="#434"))

fig.show()

BTC Autoregressive Log Returns

While these results don’t beat our benchmark, let’s try to improve them by trying out other autoregressive models.

The Bottom Line

In this tutorial, you predicted the future returns of Bitcoin using an ARIMA model of order 3 with moderate success.

You learned that autoregression models try to predict a dependent variable using linear regression against past lagged variables or independent variables. AR models assume stationarity as the historical distribution needs to resemble the future distribution.

leo

Leo Smigel

Based in Pittsburgh, Analyzing Alpha is a blog by Leo Smigel exploring what works in the markets.