
The S&P 500 Historical Components & Changes

You can pay to get the S&P 500 historical constituents, or you can use various free sources. Getting an accurate list of the S&P 500 components over time is critical when developing an equity investing or trading strategy, because backtesting against only today's members introduces survivorship bias.
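To see why survivorship bias matters, consider a toy backtest: averaging returns over only today's members silently drops the losers that were removed along the way. The numbers below are made up purely for illustration:

```python
import pandas as pd

# Toy example: five stocks, two of which were later removed from the index.
# Returns are hypothetical, for illustration only.
returns = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "total_return": [0.40, 0.25, 0.10, -0.60, -0.80],
    "still_in_index": [True, True, True, False, False],
})

# Averaging over the full point-in-time universe vs. only the survivors
full_mean = returns["total_return"].mean()
survivor_mean = returns.loc[returns["still_in_index"], "total_return"].mean()
print(full_mean, survivor_mean)  # the survivors-only mean is biased upward
```

Filtering to survivors turns a losing universe into a winning one, which is exactly the distortion a point-in-time constituents list prevents.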

Getting the S&P500 Historical Constituents

Multiple paid and free data providers offer the S&P500 constituents list. Finding the components of other indices can be more difficult and generally requires a paid source. Below are the best free and paid resources that I’ve found for S&P500 constituents. I added Analyzing Alpha’s files created from Wikipedia for convenience:

Source                                                        Paid or Free
Siblis Research S&P500 Historical Components                  Paid
Norgate Data Historical Index Constituents                    Paid
iShares Core S&P500 ETF                                       Free
Wikipedia List of S&P500 Companies                            Free
Analyzing Alpha (Components without History from Wikipedia)   Free
Analyzing Alpha (Components with History from Wikipedia)      Free

Download the S&P 500 Historical Components

If you’re just here for the CSV data, please use the following:

  1. CSV File of SP500 Constituents
  2. CSV File of SP500 Historical Changes (Tidy)

Creating your Own S&P500 Components List

I’m going to show you how to create your own S&P500 constituents dataset using Python by web scraping Wikipedia, as it provides more historical data. If you’re following along with me, a Jupyter notebook is the best tool to use for this sort of data manipulation and cleanup.

Scraping The Constituents with Pandas

pandas.read_html enables us to scrape a web page for HTML tables and load them into DataFrames. I import the required libraries and grab the data.

import pandas as pd
from datetime import datetime
import numpy as np
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')

If you check out the Wikipedia List of S&P500 Companies, you’ll notice there is a table containing the current S&P500 components and a second table listing the historical changes. Let’s grab the first table and use pandas to manipulate it into the format we want. Iterative data tasks like these are usually best performed in a Jupyter notebook. Where the date was missing, I filled in the first day of S&P500 trading, and I verified the date format using a regular expression.

# Get current S&P table and set header column
sp500 = data[0].loc[1:,[0,1,6,7]]
columns = ['added_ticker', 'name', 'date', 'cik']
sp500.columns = columns
sp500.loc[sp500['date'].isnull(), 'date'] = '1957-01-01'

# One date is in the wrong format; correct it.
sp500.loc[~sp500['date'].str.match(r'\d{4}-\d{2}-\d{2}'), 'date'] = '1985-01-01'
# Validate that every date parses (raises ValueError if one doesn't)
sp500['date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
sp500 = pd.melt(sp500, id_vars=['date', 'name', 'cik'], value_vars=['added_ticker'])
sp500.head()
	date            name                cik         variable        value
0	1957-01-01	3M Company          0000066740	added_ticker	MMM
1	1964-03-31	Abbott Laboratories 0000001800	added_ticker	ABT
2	2012-12-31	AbbVie Inc.         0001551152	added_ticker	ABBV
3	2018-05-31	ABIOMED Inc         0000815094	added_ticker	ABMD
4	2011-07-06	Accenture plc       0001467373	added_ticker	ACN
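The variable and value columns in the output above come from pd.melt, which unpivots the ticker column into tidy (variable, value) pairs. Here is a minimal sketch with a one-row toy frame (the data is illustrative):

```python
import pandas as pd

# pd.melt unpivots the ticker column into (variable, value) pairs --
# id_vars are kept as identifier columns, value_vars get stacked.
wide = pd.DataFrame({
    "date": ["1957-01-01"],
    "name": ["3M Company"],
    "added_ticker": ["MMM"],
})
tidy = pd.melt(wide, id_vars=["date", "name"], value_vars=["added_ticker"])
print(list(tidy.columns))  # ['date', 'name', 'variable', 'value']
```

Keeping added and removed tickers in the same (variable, value) shape is what lets us concatenate the additions and deletions into one history later.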

With the current table organized in the manner we want, it’s time to work on the historical adjustments, again using pandas to wrangle the data into the format we want. This table needed a few more steps as it wasn’t formatted as nicely. I had to shift some of the cells to the right, transposing with pandas.DataFrame.T, and then forward-fill the missing dates. I then dedupe the records that appear in both the current S&P500 listing and the S&P500 historical changes using pandas.DataFrame.duplicated.

# Get S&P500 adjustments table and set columns
sp500_adjustments = data[1]
sp500_adjustments = sp500_adjustments[2:].copy()
columns = ['date', 'added_ticker', 'added_name', 'removed_ticker', 'removed_name', 'reason']
sp500_adjustments.columns = columns
# Rows whose first cell isn't a date are continuation rows; shift them right
updates = sp500_adjustments[~sp500_adjustments['date'].str.contains(',')].T.shift(1).T
sp500_adjustments.loc[~sp500_adjustments['date'].str.contains(','), 'date'] = np.nan
sp500_adjustments.update(updates)
# Forward-fill dates for rows that inherit the date of the row above
sp500_adjustments['date'] = sp500_adjustments['date'].ffill()
sp500_additions = sp500_adjustments[~sp500_adjustments['added_ticker'].isnull()]
sp500_additions = sp500_additions[['date', 'added_ticker', 'added_name']]
sp500_additions.rename(columns={'added_name': 'name'}, inplace=True)
sp500_additions = pd.melt(sp500_additions, id_vars=['date','name'], value_vars=['added_ticker'])
sp500_deletions = sp500_adjustments[~sp500_adjustments['removed_ticker'].isnull()]
sp500_deletions = sp500_deletions[['date', 'removed_ticker', 'removed_name']]
sp500_deletions.rename(columns={'removed_name': 'name'}, inplace=True)
sp500_deletions = pd.melt(sp500_deletions, id_vars=['date','name'], value_vars=['removed_ticker'])
sp500_history = pd.concat([sp500_deletions, sp500_additions])
sp500_history.head()

Now that we’ve formatted both the sp500 and sp500_history DataFrames into the tidy format, let’s concatenate and dedupe.

df = pd.concat([sp500, sp500_history], ignore_index=True)
df['date'] = pd.to_datetime(df['date'])
df.sort_values(by='cik', ascending=False, inplace=True)
deduped_df = df[~df.duplicated(['date', 'variable', 'value'])].copy()
deduped_df.sort_values(by='date',inplace=True)
deduped_df.to_csv("sp500_history.csv")
print(deduped_df.head())
        cik	        date	        name	                      value	variable
183	0000072741	1957-01-01	Eversource Energy             ES	added_ticker
228	0000874766	1957-01-01	Hartford Financial Svc.Gp.    HIG	added_ticker
435	0001113169	1957-01-01	T. Rowe Price Group           TROW	added_ticker
349	0001111711	1957-01-01	NiSource Inc.                 NI	added_ticker
185	0001109357	1957-01-01	Exelon Corp.                  EXC	added_ticker
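With the change log in tidy form, reconstructing point-in-time membership is straightforward. The helper below is my own sketch, not from the original post; the toy change log mimics the shape of sp500_history.csv, and the simple set logic doesn't handle a ticker that is removed and later re-added:

```python
import pandas as pd

def members_as_of(history: pd.DataFrame, as_of: str) -> set:
    """Reconstruct index membership on a date from a tidy change log.

    Note: a ticker removed and later re-added would need event ordering
    per ticker; this sketch only takes set difference.
    """
    h = history[history["date"] <= pd.Timestamp(as_of)]
    added = set(h.loc[h["variable"] == "added_ticker", "value"])
    removed = set(h.loc[h["variable"] == "removed_ticker", "value"])
    return added - removed

# Toy change log in the same shape as sp500_history.csv
history = pd.DataFrame({
    "date": pd.to_datetime(["2000-01-01", "2000-01-01",
                            "2005-06-30", "2005-06-30"]),
    "name": ["Alpha Co", "Beta Co", "Gamma Co", "Beta Co"],
    "variable": ["added_ticker", "added_ticker",
                 "added_ticker", "removed_ticker"],
    "value": ["AAA", "BBB", "CCC", "BBB"],
})

print(sorted(members_as_of(history, "2004-12-31")))  # ['AAA', 'BBB']
print(sorted(members_as_of(history, "2006-01-01")))  # ['AAA', 'CCC']
```

Swap the toy frame for the real sp500_history.csv to get the index membership for any backtest date.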

Creating a unique list of the S&P500 constituents without the change history is easy:

deduped_df.sort_values(by='cik', ascending=False, inplace=True)
deduped_df = deduped_df[~deduped_df.duplicated('value')]
# discovery has 2 share classes listed
deduped_df = deduped_df[~deduped_df.duplicated('cik')]
deduped_df.sort_values(by='value', inplace=True)
deduped_df.drop(['date', 'variable'], axis=1, inplace=True)
deduped_df.rename(columns={'value':'ticker'}, inplace=True)
deduped_df.to_csv("sp500_constituents.csv")
deduped_df.head()
        cik             name                          ticker
12	0001090872	Agilent Technologies Inc      A
31	0000006201	American Airlines Group       AAL
8	0001158449	Advance Auto Parts            AAP
49	0000320193	Apple Inc.                    AAPL
2	0001551152	AbbVie Inc.                   ABBV

You can download the example source code and S&P500 data on the Analyzing Alpha Github.


Leo Smigel

Based in Pittsburgh, Analyzing Alpha is a blog by Leo Smigel exploring what works in the markets.