You can pay for the S&P 500 historical constituents or use various free sources. Because of survivorship bias, an accurate list of the S&P 500 components over time is critical when developing a trading strategy.
Getting the S&P500 Historical Constituents
Multiple paid and free data providers offer the S&P 500 constituents list. Finding the components of other indices is more complex and generally requires a paid source. Below I list the best free and paid resources I’ve found for S&P 500 constituents, along with Analyzing Alpha’s files created from Wikipedia for convenience:
Download the S&P 500 Historical Components
If you’re just here for the CSV data, please use the following files, offered freely under this Creative Commons license:
Creating your Own S&P500 Components List
I will show you how to create your own S&P 500 constituents dataset with Python by scraping Wikipedia, which provides more S&P 500 history than most free sources. If you’re following along, a Jupyter notebook is the best tool for this sort of data manipulation and cleanup.
You’ll notice that Wikipedia stays up-to-date and includes the S&P 500 additions and deletions for 2022.
Scraping The Constituents with Pandas
pandas.read_html scrapes a web page for HTML tables and returns them as a list of dataframes. I import the required libraries and grab the data.
import datetime as dt
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
data = pd.read_html(url)
If you check out the Wikipedia List of S&P500 Companies, you’ll notice there is a table containing the current S&P500 components and a table listing the historical changes.
Let’s grab the first table and use pandas to manipulate it into the format we want. Iterative data tasks like these are usually best performed in a Jupyter notebook. For rows with missing data, I set the date to the first day of S&P 500 trading, and I used a regex to verify the date format.
# Get current S&P table and set header column
sp500 = data[0].iloc[:, [0,1,6,7]]
sp500.columns = ['ticker', 'name', 'date' , 'cik']
# Get rows where date is missing or not formatted correctly.
mask = sp500['date'].str.strip().str.fullmatch(r'\d{4}-\d{2}-\d{2}')
mask = ~mask.fillna(False).astype(bool)
sp500[mask].head()
     ticker  name                    date                     cik
7    AMD     Advanced Micro Devices  NaN                      2488
51   T       AT&T                    1983-11-30 (1957-03-04)  732717
126  ED      Consolidated Edison     NaN                      1047862
130  GLW     Corning                 NaN                      24741
138  DHR     Danaher Corporation     NaN                      313616
139  DRI     Darden Restaurants      NaN
Fill The Missing Data
Next, we’ll use zfill to zero-fill the cik code, since it’s a ten-digit string rather than an integer, and set all missing dates to 1900-01-01. Hopefully, the community can help fill in these gaps!
# Work on a copy and fill the missing dates with a sentinel value.
current = sp500.copy()
current.loc[mask, 'date'] = '1900-01-01'
# Parse the dates and zero-fill the CIK to ten characters.
current.loc[:, 'date'] = pd.to_datetime(current['date'])
current.loc[:, 'cik'] = current['cik'].apply(str).str.zfill(10)
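Before moving on, it’s worth verifying that the cleanup behaved as expected. This is a minimal sanity-check sketch on a toy frame that mirrors the steps above; the tickers and values are illustrative, not the full table:

```python
import pandas as pd

# Toy frame mirroring the cleaned-up table (values are illustrative).
current = pd.DataFrame({
    'ticker': ['MMM', 'AMD'],
    'name': ['3M', 'Advanced Micro Devices'],
    'date': ['1976-08-09', '1900-01-01'],  # missing dates already set to the sentinel
    'cik': [66740, 2488],
})
current['date'] = pd.to_datetime(current['date'])
current['cik'] = current['cik'].apply(str).str.zfill(10)

# Every date should parse, and every CIK should be a ten-character string.
assert current['date'].notna().all()
assert (current['cik'].str.len() == 10).all()
print(current['cik'].tolist())  # ['0000066740', '0000002488']
```

Keeping cik as a string matters: as an integer, the leading zeros would be lost the moment the column round-trips through a CSV.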
With the current table organized, it’s time to work on the historical adjustments, again using pandas to wrangle the data. We’ll create a dataframe for additions and one for removals, then concatenate them.
# Get the adjustments dataframe and rename columns
adjustments = data[1]
columns = ['date', 'ticker_added','name_added', 'ticker_removed', 'name_removed', 'reason']
adjustments.columns = columns
# Create additions dataframe.
additions = adjustments[~adjustments['ticker_added'].isnull()][['date','ticker_added', 'name_added']]
additions.columns = ['date','ticker','name']
additions['action'] = 'added'
# Create removals dataframe.
removals = adjustments[~adjustments['ticker_removed'].isnull()][['date','ticker_removed','name_removed']]
removals.columns = ['date','ticker','name']
removals['action'] = 'removed'
# Merge the additions and removals into one dataframe.
historical = pd.concat([additions, removals])
historical.head()
   date                ticker  name           action
0  September 20, 2021  MTCH    Match Group    added
1  September 20, 2021  CDAY    Ceridian       added
2  September 20, 2021  BRO     Brown & Brown  added
3  August 30, 2021     TECH    Bio-Techne     added
4  July 21, 2021       MRNA    Moderna        added
Now that we have both the current and historical data, let’s add any tickers that are in the S&P 500 index but missing from Wikipedia’s change history.
missing = current[~current['ticker'].isin(historical['ticker'])].copy()
missing['action'] = 'added'
missing = missing[['date','ticker','name','action', 'cik']]
missing.loc[:, 'cik'] = missing['cik'].apply(str).str.zfill(10)
missing.head()
   date        ticker  name                 action  cik
0  1976-08-09  MMM     3M                   added   0000066740
1  1964-03-31  ABT     Abbott Laboratories  added   0000001800
6  1997-05-05  ADBE    Adobe                added   0000796343
9  1998-10-02  AES     AES Corp             added   0000874761
10 1999-05-28  AFL     Aflac                added   0000004977
Merge and Dedup the Data
We’ll now merge the historical changes with the current S&P 500 companies and then dedupe the result.
sp500_history = pd.concat([historical, missing])
sp500_history = sp500_history.sort_values(by=['date','ticker'], ascending=[False, True])
sp500_history = sp500_history.drop_duplicates(subset=['date','ticker'])
sp500_history
     date                 ticker  name                    action   cik
112  September 8, 2016    CHTR    Charter Communications  added    NaN
112  September 8, 2016    EMC     EMC Corporation         removed  NaN
113  September 6, 2016    MTD     Mettler Toledo          added    NaN
113  September 6, 2016    TYC     Tyco International      removed  NaN
208  September 5, 2012    LYB     LyondellBasell          added    NaN
...  ...                  ...     ...                     ...      ...
484  1900-01-01 00:00:00  WAT     Waters Corporation      added    0001000697
493  1900-01-01 00:00:00  WHR     Whirlpool Corporation   added    0000106640
483  1900-01-01 00:00:00  WM      Waste Management        added    0000823768
491  1900-01-01 00:00:00  WRK     WestRock                added    0001732845
492  1900-01-01 00:00:00  WY      Weyerhaeuser            added    0000106535
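One caveat worth knowing: the merged date column mixes Wikipedia’s long-form strings (e.g. "September 8, 2016") with the timestamps we created, so the sort above is lexicographic for the string rows, and recent pandas versions refuse to compare the two types at all. A minimal sketch of the fix, on toy frames standing in for the real ones, is to parse the strings to datetimes before concatenating:

```python
import pandas as pd

# Toy stand-in for the Wikipedia change log, which stores dates as strings.
historical = pd.DataFrame({
    'date': ['September 8, 2016', 'January 4, 2021'],
    'ticker': ['CHTR', 'TSLA'],
    'action': ['added', 'added'],
})
# Parse the strings so every row carries a real timestamp.
historical['date'] = pd.to_datetime(historical['date'])

# Toy stand-in for the back-filled rows, which already carry timestamps.
missing = pd.DataFrame({
    'date': [pd.Timestamp('1976-08-09')],
    'ticker': ['MMM'],
    'action': ['added'],
})

# With a uniform datetime column, the descending sort is chronological.
sp500_history = pd.concat([historical, missing])
sp500_history = sp500_history.sort_values(by=['date', 'ticker'],
                                          ascending=[False, True])
print(sp500_history['ticker'].tolist())  # ['TSLA', 'CHTR', 'MMM']
```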
Export Data to CSV
And finally, we’ll export both dataframes to CSV files for download, which you can find, along with the notebook, on the Analyzing Alpha GitHub.
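The export itself is a pair of to_csv calls. This sketch uses toy stand-ins for the two dataframes and illustrative filenames (the notebook on GitHub may name them differently); the one detail that matters is reading cik back with dtype=str so the leading zeros survive the round trip:

```python
import pandas as pd

# Toy stand-ins for the dataframes built above (filenames are illustrative).
current = pd.DataFrame({'ticker': ['MMM'], 'name': ['3M'],
                        'date': ['1976-08-09'], 'cik': ['0000066740']})
sp500_history = pd.DataFrame({'date': ['1976-08-09'], 'ticker': ['MMM'],
                              'name': ['3M'], 'action': ['added'],
                              'cik': ['0000066740']})

current.to_csv('sp500_current.csv', index=False)
sp500_history.to_csv('sp500_history.csv', index=False)

# Round-trip check: keep cik as a string so the leading zeros survive.
check = pd.read_csv('sp500_history.csv', dtype={'cik': str})
print(check['cik'].iloc[0])  # 0000066740
```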