Time Series Analysis with Python Made Easy

A time series is a sequence of observations taken at successive moments in time. The observations are either uniformly spaced at a specific frequency, such as hourly, or irregularly spaced, as in a phone call log.

Having an expert understanding of time series data and how to manipulate it is required for investing and trading research. This tutorial will focus on analyzing stock data using time series analysis with Python and Pandas. All code and associated data can be found in the Analyzing Alpha Github. You can also open this file directly on Google Colab.

Understanding Datetimes and Timedeltas

It’s critical to understand the difference between a moment, duration, and period in time before we can fully understand time series analysis in Python.

Type Description Examples
Date (Moment) Day of the year 2019-09-30, September 30th, 2019
Time (Moment) Single point in time 6 hours, 6.5 minutes, 6.09 seconds, 6 milliseconds
Datetime (Moment) Combination of date and time 2019-09-30 06:00:00, September 30th, 2019 at 6:00
Duration Difference between two moments 2 days, 4 hours, 10 seconds
Period Grouping of time 2019Q3, January
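These concepts map directly onto the Pandas types we'll use throughout: Timestamp for moments, Timedelta for durations, and Period for groupings of time. A minimal sketch:

```python
import pandas as pd

moment = pd.Timestamp('2019-09-30 06:00:00')  # a datetime (moment)
duration = pd.Timedelta(days=2, hours=4)      # a duration
period = pd.Period('2019Q3', freq='Q')        # a period (grouping of time)

print(moment + duration)                      # durations shift moments
print(period.start_time, period.end_time)     # a period spans a range of moments
```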

Python’s Datetime Module

datetime supplies classes to enable date and time manipulation in both simple and complex ways.

Creating Moments

Dates, datetimes, and times are each a separate class, and we can create them in a variety of ways, including directly and through parsing strings.

import datetime
date = datetime.date(2019,9,30)
datetime1 = datetime.datetime(2019,9,30,6,30,9,123456)
datetime2_string = "10 03 2019 13:37:00"
datetime2 = datetime.datetime.strptime(datetime2_string,
                                       '%m %d %Y %X')
datetime3_string = "Thursday, October 03, 19 1:37 PM"
datetime3 = datetime.datetime.strptime(datetime3_string,
                                       '%A, %B %d, %y %I:%M %p')
time = datetime.time(6,30,9,123456)
now = datetime.datetime.today()
today = datetime.date.today()
print(type(date))
print(type(datetime1))
print(datetime1)
print(datetime2)
print(datetime3)
print(type(time))
print(now)
<class 'datetime.date'>
<class 'datetime.datetime'>
2019-09-30 06:30:09.123456
2019-10-03 13:37:00
2019-10-03 13:37:00
<class 'datetime.time'>
2019-10-06 22:54:04.786039

Creating Durations

timedeltas represent durations in time. They can be added to or subtracted from moments in time.

from datetime import timedelta
daysdelta = timedelta(days=5)
alldelta = timedelta(days=1, seconds=2, microseconds=3, milliseconds=4,
                     minutes=5, hours=6, weeks=7)
future = now + daysdelta
past = now - alldelta
print(type(future))
print(future)
print(type(past))
print(past)
<class 'datetime.datetime'>
2019-10-12 12:43:26.337336
<class 'datetime.datetime'>
2019-08-18 06:38:24.333333

Accessing Datetime Attributes

Class and object attributes can help us isolate the information we want to see. I’ve listed the most common ones below, but you can find the exhaustive list in the datetime module’s documentation.

Class / Object Attribute Description
Shared Class Attributes class.min Earliest representable date, datetime, time
  class.max Latest representable date, datetime, time
  class.resolution The smallest difference between two dates, datetimes, or times
Date / Datetime object.year Returns year
  object.month Returns month of year (1 - 12)
  object.day Returns day of month (1-31)
Time / Datetime object.hour Returns hour (0-23)
  object.minute Returns minute (0-59)
  object.second Returns second (0-59)
print(datetime.datetime.min)
print(datetime.datetime.max)
0001-01-01 00:00:00
9999-12-31 23:59:59.999999
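As a quick self-contained illustration of these attributes (using an example datetime rather than data from this tutorial):

```python
import datetime

dt = datetime.datetime(2019, 9, 30, 6, 30, 9)

# Shared class attributes
print(datetime.date.min)         # earliest representable date
print(datetime.date.resolution)  # smallest difference between two dates

# Object attributes
print(dt.year, dt.month, dt.day)      # date components
print(dt.hour, dt.minute, dt.second)  # time components
```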

Time Series in Pandas: Moments in Time

Pandas was developed by Wes McKinney at the hedge fund AQR to enable quick analysis of financial data. It builds on NumPy and supports vectorized operations, enabling fast manipulation and analysis of time series data.

Timestamps: Moments in Time

Timestamp extends NumPy’s datetime64 and is used to represent datetime data in Pandas, so we generally don’t need Python’s standard-library datetime. Let’s create a Timestamp now using to_datetime and pass in the above example data.

import pandas as pd
print(pd.to_datetime('September 30th, 2019'))
print(pd.to_datetime('2019-09-30 06:30:06'))
print(pd.to_datetime('September 30th, 2019 06:09.0006'))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
2019-09-30 00:00:00
2019-09-30 00:00:00
2019-10-06 06:00:00
2019-09-30 06:30:06
2019-09-30 06:09:00

Timedeltas in Pandas: Durations of Time

Timedelta is used to represent durations of time internally in Pandas.

timestamp1 = pd.to_datetime('September 30th, 2019 06:09.0006')
timestamp2 = pd.to_datetime('October 2nd, 2019 06:09.0006')
delta = timestamp2 - timestamp1
print(type(timestamp1))
print(type(delta))
print(delta)
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>
2 days 00:00:00

Creating a Time Series in Pandas

Let’s get Apple’s stock history provided by an Intrinio developer sandbox.

import pandas as pd
import urllib.request
url = "https://raw.githubusercontent.com/leosmigel/analyzingalpha/master/ ...
with urllib.request.urlopen(url) as f:
  apple_price_history = pd.read_csv(f)

apple_price_history[['open', 'high', 'low', 'close', 'volume']].head()
        open	high	low	close	volume
0	28.75	28.87	28.75	28.75	2093900
1	27.38	27.38	27.25	27.25	785200
2	25.37	25.37	25.25	25.25	472000
3	25.87	26.00	25.87	25.87	385900
4	26.63	26.75	26.63	26.63	327900

Let’s review the dataframe’s data types (dtypes) to see if we have any datetime information.

print(apple_price_history.dtypes)
id               int64
date            object
open           float64
high           float64
low            float64
close          float64
volume           int64
adj_open       float64
adj_high       float64
adj_low        float64
adj_close      float64
adj_volume       int64
intraperiod       bool
frequency       object
security_id      int64
dtype: object

Notice that the date column containing our date information is a pandas object. We could have told pandas to parse_dates and read the column in as a date, but we can also adjust it after the fact.

Let’s change our dataframe’s RangeIndex into a DatetimeIndex. Then, for good measure, let’s read the data in with a DatetimeIndex directly from read_csv.

apple_price_history['date'] = pd.to_datetime(apple_price_history['date'])
print(apple_price_history.dtypes)
id                      int64
date           datetime64[ns]
open                  float64
high                  float64
low                   float64
close                 float64
volume                  int64
adj_open              float64
adj_high              float64
adj_low               float64
adj_close             float64
adj_volume              int64
intraperiod              bool
frequency              object
security_id             int64
dtype: object
apple_price_history.set_index('date', inplace=True)
print(apple_price_history[['open', 'high', 'low', 'close']].head())
             open   high    low  close
1980-12-12  28.75  28.87  28.75  28.75
1980-12-15  27.38  27.38  27.25  27.25
1980-12-16  25.37  25.37  25.25  25.25
1980-12-17  25.87  26.00  25.87  25.87
1980-12-18  26.63  26.75  26.63  26.63
print(apple_price_history.index[:10])
DatetimeIndex(['1980-12-12', '1980-12-15', '1980-12-16', '1980-12-17',
               '1980-12-18', '1980-12-19', '1980-12-22', '1980-12-23',
               '1980-12-24', '1980-12-26'],
              dtype='datetime64[ns]', name='date', freq=None)
import numpy as np
import urllib.request

names = ['open', 'high', 'low', 'close', 'volume']
url = 'https://raw.githubusercontent.com/leosmigel/analyzingalpha/master/ ...
with urllib.request.urlopen(url) as f:
  apple_price_history = pd.read_csv(f,
                                    parse_dates=['date'],
                                    index_col='date',
                                    usecols=['date', 'adj_open', 'adj_high',
                                             'adj_low', 'adj_close', 'adj_volume'])
apple_price_history.columns = names
open      float64
high      float64
low       float64
close     float64
volume      int64
dtype: object
DatetimeIndex(['1980-12-12', '1980-12-15', '1980-12-16', '1980-12-17',
               ...],
              dtype='datetime64[ns]', name='date', freq=None)
                open      high       low     close     volume
1980-12-12  0.410073  0.411785  0.410073  0.410073  117258400
1980-12-15  0.390532  0.390532  0.388678  0.388678   43971200
1980-12-16  0.361863  0.361863  0.360151  0.360151   26432000
1980-12-17  0.368995  0.370849  0.368995  0.368995   21610400
1980-12-18  0.379835  0.381546  0.379835  0.379835   18362400

Adding Datetimes from Strings

Frequently, dates will be in a string format that Pandas can’t read directly. We can use datetime.strptime to parse a string into a date (dt.strftime does the reverse, formatting a date as a string). We used strptime when creating the S&P 500 dataset.

sp500.loc[:,'date'].apply(lambda x: datetime.strptime(x,'%Y-%m-%d'))
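For a whole column, pd.to_datetime with an explicit format string accomplishes the same thing and is usually faster than applying strptime row by row (the date strings below are hypothetical):

```python
import pandas as pd

s = pd.Series(['09/30/2019', '10/01/2019', '10/02/2019'])
parsed = pd.to_datetime(s, format='%m/%d/%Y')  # parse using an explicit format
print(parsed.dt.day_name())
```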

Time Series Selection

We can now easily select and slice dates using the index with loc.

Datetime Selection by Day, Month, or Year

print(apple_price_history.loc['2018'].head())
print(apple_price_history.loc['2018-06'].head())
print(apple_price_history.loc['2018-06-01':'2018-06-05'])
print(apple_price_history.loc['2018-06-01'])
             open        high         low       close    volume
2018-01-02  165.657452  167.740826  164.781266  167.701884  25555934
2018-01-03  167.964740  169.931290  167.409823  167.672678  29517899
2018-01-04  167.974475  168.879867  167.526647  168.451510  22434597
2018-01-05  168.850661  170.729592  168.470981  170.369382  23660018
2018-01-08  169.736582  170.963241  169.327695  169.736582  20567766
                  open        high         low       close    volume
2018-06-01  184.471622  186.697946  184.234938  186.678320  23442510
2018-06-04  188.047203  189.798784  187.767539  188.238552  26266174
2018-06-05  189.450430  190.309049  188.758629  189.690843  21565963
2018-06-06  190.004852  190.446427  188.326867  190.348300  20933619
2018-06-07  190.505304  190.564181  188.734097  189.838035  21347180
                  open        high         low       close    volume
2018-06-01  184.471622  186.697946  184.234938  186.678320  23442510
2018-06-04  188.047203  189.798784  187.767539  188.238552  26266174
2018-06-05  189.450430  190.309049  188.758629  189.690843  21565963
open      1.844716e+02
high      1.866979e+02
low       1.842349e+02
close     1.866783e+02
volume    2.344251e+07
Name: 2018-06-01 00:00:00, dtype: float64

Using the Datetime Accessor

The dt accessor provides datetime properties and methods that can be used on a Series’ datetime elements, as documented in the Series API documentation.

Property Description
Series.dt.date Returns numpy array of python datetime.date objects (namely, the date part of Timestamps without timezone information).
Series.dt.time Returns numpy array of datetime.time.
Series.dt.timetz Returns numpy array of datetime.time also containing timezone information.
Series.dt.year The year of the datetime.
Series.dt.month The month as January=1, December=12.
Series.dt.day The days of the datetime.
Series.dt.hour The hours of the datetime.
Series.dt.minute The minutes of the datetime.
Series.dt.second The seconds of the datetime.
Series.dt.microsecond The microseconds of the datetime.
Series.dt.nanosecond The nanoseconds of the datetime.
Series.dt.week The week ordinal of the year.
Series.dt.weekofyear The week ordinal of the year.
Series.dt.dayofweek The day of the week with Monday=0, Sunday=6.
Series.dt.weekday The day of the week with Monday=0, Sunday=6.
Series.dt.dayofyear The ordinal day of the year.
Series.dt.quarter The quarter of the date.
Series.dt.is_month_start Indicates whether the date is the first day of the month.
Series.dt.is_month_end Indicates whether the date is the last day of the month.
Series.dt.is_quarter_start Indicator for whether the date is the first day of a quarter.
Series.dt.is_quarter_end Indicator for whether the date is the last day of a quarter.
Series.dt.is_year_start Indicate whether the date is the first day of a year.
Series.dt.is_year_end Indicate whether the date is the last day of the year.
Series.dt.is_leap_year Boolean indicator if the date belongs to a leap year.
Series.dt.daysinmonth The number of days in the month.
Series.dt.days_in_month The number of days in the month.
Series.dt.tz Return timezone, if any.
Method Description
Series.dt.to_period(self, *args, **kwargs) Cast to PeriodArray/Index at a particular frequency.
Series.dt.to_pydatetime(self) Return the data as an array of native Python datetime objects.
Series.dt.tz_localize(self, *args, **kwargs) Localize tz-naive Datetime Array/Index to tz-aware Datetime Array/Index.
Series.dt.tz_convert(self, *args, **kwargs) Convert tz-aware Datetime Array/Index from one time zone to another.
Series.dt.normalize(self, *args, **kwargs) Convert times to midnight.
Series.dt.strftime(self, *args, **kwargs) Convert to Index using specified date_format.
Series.dt.round(self, *args, **kwargs) Perform round operation on the data to the specified freq.
Series.dt.floor(self, *args, **kwargs) Perform floor operation on the data to the specified freq.
Series.dt.ceil(self, *args, **kwargs) Perform ceil operation on the data to the specified freq.
Series.dt.month_name(self, *args, **kwargs) Return the month names of the DateTimeIndex with specified locale.
Series.dt.day_name(self, *args, **kwargs) Return the day names of the DateTimeIndex with specified locale.


dates = ['2019-01-01', '2019-04-02', '2019-07-03']
df = pd.Series(dates, dtype='datetime64[ns]')
print(df.dt.quarter)
print(df.dt.day_name())
0    1
1    2
2    3
dtype: int64
0      Tuesday
1      Tuesday
2    Wednesday
dtype: object

DatetimeIndex includes most of the same properties and methods as the dt accessor.

print(apple_price_history.index.quarter)
print(apple_price_history.index.day_name())
Int64Index([4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
            3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
           dtype='int64', name='date', length=9789)
Index(['Friday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
       'Monday', 'Tuesday', 'Wednesday', 'Friday',
       'Wednesday', 'Thursday', 'Friday', 'Monday', 'Tuesday', 'Wednesday',
       'Thursday', 'Friday', 'Monday', 'Tuesday'],
      dtype='object', name='date', length=9789)
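These properties also make date-based filtering concise. For example, a boolean mask built from the dt accessor (shown with a small hypothetical series rather than the Apple data):

```python
import pandas as pd

s = pd.Series(pd.date_range('2019-01-01', periods=120, freq='D'))
month_starts = s[s.dt.is_month_start]  # keep only the first day of each month
print(month_starts.dt.strftime('%Y-%m-%d').tolist())
```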

Frequency Selection

Time series can be associated with a frequency in Pandas when it is uniformly spaced.

date_range is a function that allows us to create a sequence of evenly spaced dates.

dates = pd.date_range('2019-01-01', '2019-12-31', freq='D')
print(dates)
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06', '2019-01-07', '2019-01-08',
               '2019-01-09', '2019-01-10',
               ...
               '2019-12-22', '2019-12-23', '2019-12-24', '2019-12-25',
               '2019-12-26', '2019-12-27', '2019-12-28', '2019-12-29',
               '2019-12-30', '2019-12-31'],
              dtype='datetime64[ns]', length=365, freq='D')

Instead of specifying both a start and an end date, we can supply a start date, the number of periods, and a frequency.

dates = pd.date_range('2019-01-01', periods=6, freq='M')
hours = pd.date_range('2019-01-01', periods=24, freq='H')
DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
               '2019-05-31', '2019-06-30'],
              dtype='datetime64[ns]', freq='M')
DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 01:00:00',
               '2019-01-01 02:00:00', '2019-01-01 03:00:00',
               '2019-01-01 04:00:00', '2019-01-01 05:00:00',
               '2019-01-01 06:00:00', '2019-01-01 07:00:00',
               '2019-01-01 08:00:00', '2019-01-01 09:00:00',
               '2019-01-01 10:00:00', '2019-01-01 11:00:00',
               '2019-01-01 12:00:00', '2019-01-01 13:00:00',
               '2019-01-01 14:00:00', '2019-01-01 15:00:00',
               '2019-01-01 16:00:00', '2019-01-01 17:00:00',
               '2019-01-01 18:00:00', '2019-01-01 19:00:00',
               '2019-01-01 20:00:00', '2019-01-01 21:00:00',
               '2019-01-01 22:00:00', '2019-01-01 23:00:00'],
              dtype='datetime64[ns]', freq='H')

asfreq returns a dataframe or series with a new specified frequency. New rows will be added for moments that are missing in the data and filled with NaN or using a method we specify. We often need to provide an offset alias to get the desired time-frequency.

Offset Aliases

Alias Description
B business day frequency
C custom business day frequency
D calendar day frequency
W weekly frequency
M month end frequency
SM semi-month end frequency (15th and end of month)
BM business month end frequency
CBM custom business month end frequency
MS month start frequency
SMS semi-month start frequency (1st and 15th)
BMS business month start frequency
CBMS custom business month start frequency
Q quarter end frequency
BQ business quarter end frequency
QS quarter start frequency
BQS business quarter start frequency
A, Y year end frequency
BA, BY business year end frequency
AS, YS year start frequency
BAS, BYS business year start frequency
BH business hour frequency
H hourly frequency
T, min minutely frequency
S secondly frequency
L, ms milliseconds
U, us microseconds
N nanoseconds
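As a small sketch of how these aliases plug into date_range, take the semi-month alias from the table:

```python
import pandas as pd

# 'SM' generates the 15th and the last day of each month
sm = pd.date_range('2019-01-01', periods=4, freq='SM')
print(sm.tolist())
```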
apple_yearly_history = apple_price_history.asfreq('BA')
print(apple_yearly_history.head())
               open      high       low     close    volume
1980-12-31  0.488522  0.488522  0.486810  0.486810   8937600
1981-12-31  0.315649  0.317361  0.315649  0.315649  13664000
1982-12-31  0.427903  0.433180  0.426048  0.426048  12415200
1983-12-30  0.347742  0.356585  0.345888  0.347742  22965600
1984-12-31  0.415351  0.417205  0.415351  0.415351  51940000
apple_monthly_history = apple_price_history.asfreq('BM')
print(type(apple_monthly_history))
print(apple_monthly_history.head())
<class 'pandas.core.frame.DataFrame'>
                open      high       low     close      volume
1980-12-31  0.488522  0.488522  0.486810  0.486810   8937600.0
1981-01-30  0.406507  0.406507  0.402942  0.402942  11547200.0
1981-02-27  0.377981  0.381546  0.377981  0.377981   3690400.0
1981-03-31  0.353020  0.353020  0.349454  0.349454   3998400.0
1981-04-30  0.404796  0.408219  0.404796  0.404796   3152800.0

Filling Data

asfreq also lets us provide a filling method to replace the NaN values.

print(apple_price_history['close'].asfreq('H').head())
print(apple_price_history['close'].asfreq('H', method='ffill').head())
1980-12-12 00:00:00    0.410073
1980-12-12 01:00:00         NaN
1980-12-12 02:00:00         NaN
1980-12-12 03:00:00         NaN
1980-12-12 04:00:00         NaN
Freq: H, Name: close, dtype: float64
1980-12-12 00:00:00    0.410073
1980-12-12 01:00:00    0.410073
1980-12-12 02:00:00    0.410073
1980-12-12 03:00:00    0.410073
1980-12-12 04:00:00    0.410073
Freq: H, Name: close, dtype: float64

Resampling: Upsampling & Downsampling

resample returns a resampling object, very similar to a groupby object, on which to run various aggregations.

We often need to lower (downsampling) or increase (upsampling) the frequency of our time series data. If we have daily or monthly sales data, it may be useful to downsample it into quarterly data. Alternatively, we may want to upsample our data to match the frequency of another series we’re using to make predictions. Upsampling is less common, and it requires interpolation.
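Since upsampling isn’t shown on the Apple data below, here’s a minimal sketch with a made-up monthly series, downsampled to quarterly and upsampled to daily with linear interpolation filling the new rows:

```python
import pandas as pd

# Hypothetical month-start series (illustrative values, not the Apple data)
monthly = pd.Series([100.0, 106.0, 112.0],
                    index=pd.date_range('2019-01-01', periods=3, freq='MS'))

# Downsampling: aggregate to a lower frequency
quarterly = monthly.resample('QS').mean()

# Upsampling: new daily rows are NaN, so interpolate to fill them
daily = monthly.resample('D').interpolate()
print(quarterly)
print(daily.head())
```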

apple_monthly_history = apple_price_history.resample('BM')
print(type(apple_monthly_history))
print(apple_monthly_history.agg({'high':'max', 'low':'min'})[:5])
<class 'pandas.core.resample.DatetimeIndexResampler'>
                high       low
1980-12-31  0.515337  0.360151
1981-01-30  0.495654  0.402942
1981-02-27  0.411785  0.338756
1981-03-31  0.385112  0.308375
1981-04-30  0.418917  0.345888

We can now use all of the index properties and methods we discovered above.

print(apple_price_history.index.dayofweek)
print(apple_price_history.index.week)
print(apple_price_history.index.year)
print(apple_price_history.index.day_name())

Int64Index([4, 0, 1, 2, 3, 4, 0, 1, 2, 4,
            2, 3, 4, 0, 1, 2, 3, 4, 0, 1],
           dtype='int64', name='date', length=9789)
Int64Index([50, 51, 51, 51, 51, 51, 52, 52, 52, 52,
            37, 37, 37, 38, 38, 38, 38, 38, 39, 39],
           dtype='int64', name='date', length=9789)
Int64Index([1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980,
            2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019],
           dtype='int64', name='date', length=9789)
Index(['Friday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
       'Monday', 'Tuesday', 'Wednesday', 'Friday',
       'Wednesday', 'Thursday', 'Friday', 'Monday', 'Tuesday', 'Wednesday',
       'Thursday', 'Friday', 'Monday', 'Tuesday'],
      dtype='object', name='date', length=9789)
ts = pd.to_datetime('2019-09-30 06:54:32.54321')
print(ts.floor('H'))
print(ts.round('min'))
print(ts.ceil('S'))
print(ts.to_period('D').end_time)
2019-09-30 06:00:00
2019-09-30 06:55:00
2019-09-30 06:54:33
2019-09-30 23:59:59.999999999

Rolling Windows: Smoothing & Moving Averages

rolling allows us to split the data into aggregated windows and apply a function, such as the mean or sum, over them.

A typical example of this in trading is using the 50-day and 200-day moving averages to enter and exit an asset.

Let’s calculate these for Apple. Notice that we need 50 days of data before we can calculate the 50-day rolling mean.

%matplotlib inline
import matplotlib.pyplot as plt

apple_price_history['rolling_50'] = apple_price_history['close'].rolling(50).mean()
apple_price_history['rolling_200'] = apple_price_history['close'].rolling(200).mean()
apple_price_history_recent = apple_price_history[-2000:]
apple_price_history_recent[['close', 'rolling_50', 'rolling_200']] \
    .plot(title='Apple vs. 200SMA', figsize=(32,18))

Apple vs. 200-day & 50-day SMA chart
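The same idea can be turned into a simple crossover signal. A sketch on synthetic data (the random walk below stands in for real prices):

```python
import numpy as np
import pandas as pd

# Synthetic price series standing in for real data
rng = np.random.default_rng(0)
close = pd.Series(100 + np.cumsum(rng.normal(0.1, 1.0, 500)),
                  index=pd.date_range('2018-01-01', periods=500, freq='B'))

sma50 = close.rolling(50).mean()    # first 49 values are NaN
sma200 = close.rolling(200).mean()  # first 199 values are NaN

# Long (1) when the 50-day average is above the 200-day, else flat (0)
signal = (sma50 > sma200).astype(int)
print(signal.value_counts())
```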

Visualizing Time Series Data using Matplotlib

Matplotlib makes it easy to visualize our Pandas time series data. Seaborn adds additional options and helps us make our graphs look prettier. Let’s import matplotlib and seaborn to try out a few basic examples. This quick summary isn’t an in-depth guide on Python Visualization.

Line Plot

lineplot draws a standard line plot. It works similarly to the dataframe.plot method we’ve been using above.

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(32,18))
sns.lineplot(x=apple_price_history.index, y='close', data=apple_price_history,
             ax=ax).set_title("Apple Stock Price History")

Apple Stock Price History Line Chart


Box Plot

boxplot enables us to group and understand distributions in our data. It’s often beneficial for seasonal data.

apple_price_recent_history = apple_price_history[-800:].copy()
apple_price_recent_history['quarter'] = apple_price_recent_history.index.year.astype(str) \
                                      + apple_price_recent_history.index.quarter.astype(str)

sns.set(rc={'figure.figsize':(18, 9)})
sns.boxplot(data=apple_price_recent_history, x='quarter', y='close')

Apple Stock Price History Boxplot

Analyzing Time Series Data in Pandas

Time series analysis methods can be divided into two classes:

  1. Frequency domain methods
  2. Time-domain methods

Frequency-domain methods analyze how a signal is distributed over a range of frequencies, while time-domain methods analyze how a signal changes over a specified period of time, such as the prior 100 seconds.

Time Series Trend, Seasonality & Cyclicality

Time series data can be decomposed into four components:

  1. Trend
  2. Seasonality
  3. Cyclicality
  4. Noise

Not all time series have trend, seasonality, or cyclicality. Moreover, there must be enough data to support that seasonality, cyclicality, or a trend exists, and that what you’re seeing is not just a random pattern.

Time series data will often exhibit gradual variability in addition to higher-frequency variability such as seasonality and noise. An easy way to visualize these trends is with rolling means at different time scales. Let’s import Apple’s revenue data to review seasonality and trend.


Trends occur when there is an increasing or decreasing slope in the time series. Amazon’s sales growth would be an example of an upward trend. Trends do not have to be linear; they can be deterministic, where the trend is a function of time, or stochastic, where the trend is random.


Seasonality occurs when there is a distinct repeating pattern of peaks and troughs observed at regular intervals within a year. Apple’s sales peak in Q4 would be an example of seasonality in its revenue numbers.


Cyclicality occurs when there is a distinct repeating pattern of peaks and troughs observed at irregular intervals. The business cycle exhibits cyclicality.

Let’s analyze Apple’s revenue history and see if we can decompose it visually and programmatically.

import urllib.request
import pandas as pd
from scipy import stats

url = 'https://raw.githubusercontent.com/leosmigel/analyzingalpha/master/ ...
with urllib.request.urlopen(url) as f:
  apple_revenue_history = pd.read_csv(f, index_col=0)
apple_revenue_history['quarter'] = apple_revenue_history['fiscal_year'].apply(str) \
                                   + apple_revenue_history['fiscal_period'].str.upper()
slope, intercept, r_value, p_value, std_err = stats.linregress(apple_revenue_history.index,
                                                               apple_revenue_history['value'])
apple_revenue_history['line'] = slope * apple_revenue_history.index + intercept

Time Series Trend Graph with Trend Line

fig = plt.figure(figsize=(32,18))
ax1 = fig.add_subplot(1,1,1)

apple_revenue_history.plot(y='value', x='quarter', title='Apple Quarterly Revenue 2010-2018', ax=ax1)
apple_revenue_history.plot(y='line', x='quarter', title='Apple Quarterly Revenue 2010-2018', ax=ax1)

Apple Quarterly Revenue with Trendline

Time Series Stacked Graph for Cycle Analysis

fig = plt.figure(figsize=(32,18))
ax1 = fig.add_subplot(1,1,1)
legend = []
yticks = np.linspace(apple_revenue_history['value'].min(), apple_revenue_history['value'].max(), 10)
for year in apple_revenue_history['fiscal_year'].unique():
  apple_revenue_year = apple_revenue_history[apple_revenue_history['fiscal_year'] == year]
  apple_revenue_year.plot(y='value', x='fiscal_period', title='Apple Quarterly Revenue 2010-2018',
                          ax=ax1,yticks=yticks, sharex=True, sharey=True)

Apple Annual Revenue Stacked

Decomposing Time Series Data using StatsModel

statsmodels enables us to statistically decompose a time series into its components.

from statsmodels.tsa.seasonal import seasonal_decompose
from dateutil.parser import parse

apple_revenue_history['date'] = pd.to_datetime(apple_revenue_history['quarter']).dt.to_period('Q')
apple_revenue_history.set_index('date', inplace=True)
apple_revenue_history.index = apple_revenue_history.index.to_timestamp(freq='Q')

result_add = seasonal_decompose(apple_revenue_history['value'])
plt.rcParams.update({'figure.figsize': (32,18)})
result_add.plot().suptitle('Decompose (Additive)', fontsize=18)

Apple Revenue Decomposed

Time Series Stationarity

Time series data is different from more traditional classification and regression predictive modeling problems: the data is ordered, and it needs to be stationary for summary statistics to be meaningful.

Stationarity is an assumption underlying many statistical procedures used in time series analysis, and non-stationary data is often transformed into stationary data.

Stationarity is sometimes categorized into the following:

  • Stationary Process/Model: A stationary series of observations, or a model that generates one.
  • Trend Stationary: Stationary once a deterministic trend is removed.
  • Seasonal Stationary: Stationary once seasonality is removed.
  • Strictly Stationary: The joint distribution of observations is invariant to shifts in time (the mathematical definition of a stationary process).

In a stationary time series, the mean and standard deviation are constant over time, and there is no seasonality, cyclicality, or other time-dependent structure. It’s often easier to understand stationarity by first looking at how it can be violated.

# Stationary
vol = .002
df1 = pd.DataFrame(np.random.normal(size=200) * vol)
df1.plot(title='Stationary')

Stationary Time Series

df2 = pd.DataFrame(np.random.random(size=200) * vol).cumsum()
df2.plot(title='Not Stationary: Mean Not Constant')

Not Stationary: Mean Not Constant

Trending Time Series

df3 = pd.DataFrame(np.random.normal(size=200) * vol * np.logspace(1,2,num=200, dtype=int))
df3.plot(title='Not Stationary: Volatility Not Constant')

Not Stationary: Volatility Not Constant

Volatile Time Series

df4 = pd.DataFrame(np.random.normal(size=200) * vol)
df4['cyclical'] = df4.index.values % 20
df4[0] = df4[0] + df4['cyclical']
df4[0].plot(title='Not Stationary: Cyclical')

Not Stationary: Cyclical

Cyclical Time Series

How to Test for Stationarity

We can test for stationarity by visually inspecting graphs as we did above; by splitting a series into multiple sections and comparing summary statistics such as mean, variance, and correlation; or by using more advanced methods like the Augmented Dickey-Fuller test.

The Augmented Dickey-Fuller test’s null hypothesis is that a unit root is present. If the time series has a unit root, it has some time-dependent structure, meaning the time series is not stationary.

The more negative the test statistic, the more likely it is that we have a stationary time series. In general, if the p-value > 0.05, the data has a unit root and is not stationary. Let’s use statsmodels to examine this.

import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller

df1 = pd.DataFrame(np.random.normal(size=200))
result = adfuller(df1[0].values, autolag='AIC')
print(f'ADF Statistic: {result[0]:.2f}')
print(f'p-value: {result[1]:.2f}')
for key, value in result[4].items():
     print(f'Critical Values: {key}, {value:.2f}')

df2 = pd.DataFrame(np.random.normal(size=200)).cumsum()
result = adfuller(df2[0].values, autolag='AIC')
print(f'ADF Statistic: {result[0]:.2f}')
print(f'p-value: {result[1]:.2f}')
for key, value in result[4].items():
     print(f'Critical Values: {key}, {value:.2f}')
ADF Statistic: -11.14
p-value: 0.00
Critical Values: 1%, -3.46
Critical Values: 5%, -2.88
Critical Values: 10%, -2.57
ADF Statistic: -0.81
p-value: 0.81
Critical Values: 1%, -3.47
Critical Values: 5%, -2.88
Critical Values: 10%, -2.58

Running the examples prints p-values of 0.00 (stationary) and 0.81 (non-stationary), respectively.

In the first series, we can reject the null hypothesis that a unit root is present at the 1% significance level, so the series is stationary. In the second series, we cannot reject the null hypothesis; the test statistic doesn’t even meet the 10% threshold, so the series is not stationary.

How to Handle Non-Stationary Time Series

If there is a clear trend and seasonality in a time series, we can model these components, remove them from the observations, and then train models on the residuals.

Detrending a Time Series

There are multiple methods to remove the trend component from a time series.

  • Subtract best fit line
  • Subtract using a decomposition
  • Subtract using a filter

Best Fit Line Using SciPy

Detrend from SciPy allows us to remove the trend by subtracting the best fit line.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import signal

vol = 2
df = pd.DataFrame(np.random.random(size=200) * vol).cumsum()
detrend = signal.detrend(df[0].values)

Trend vs. Detrended Time Series

Decompose and Extract Using StatsModels

seasonal_decompose returns an object with seasonal, trend, and resid attributes that we can subtract from our series values.

from statsmodels.tsa.seasonal import seasonal_decompose
from dateutil.parser import parse

dates = pd.date_range('2019-01-01', periods=200, freq='D')
df = pd.DataFrame(np.random.random(200)).cumsum()
df.set_index(dates, inplace=True)

Trending Time Series

decompose = seasonal_decompose(df[0], model='additive', extrapolate_trend='freq')
df[0] = df[0] - decompose.trend

Detrended Time Series
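The third approach from the list above, subtracting using a filter, isn’t shown in the examples; here’s one common sketch using a centered moving average as the filter (the window length of 21 is an arbitrary choice):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.random(200)).cumsum()

# Estimate the trend with a centered moving-average filter, then subtract it
trend = df[0].rolling(window=21, center=True, min_periods=1).mean()
detrended = df[0] - trend
print(detrended.abs().mean())
```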


Leo Smigel

Based in Pittsburgh, Analyzing Alpha is a blog by Leo Smigel exploring what works in the markets.