Pandas was developed by Wes McKinney at the hedge fund AQR to enable quick analysis of financial data. It extends NumPy with labeled, vectorized operations that make manipulating financial information fast.
I’ll be using company data provided by an Intrinio developer sandbox. If you want to follow along, you can find the code and the data on the Analyzing Alpha Github Repo.
import pandas as pd
import numpy as np
import urllib.request
url = ('https://raw.githubusercontent.com/leosmigel'
       '/analyzingalpha/master/2019-09-30-data-manipulation-with-python/companies.csv')
with urllib.request.urlopen(url) as f:
    companies = pd.read_csv(f, index_col='id')
companies[['name', 'sector']].head()
name sector
id
1 Apple Inc Consumer Goods
2 American Express Co Financial
3 Boeing Co Industrial Goods
4 Caterpillar Inc Industrial Goods
5 Cisco Systems Inc Technology
Making Your Data Tidy
While it’s not needed for these simple examples, I want to introduce tidy data. When working with large datasets, and especially for factor analysis, you’ll want to make your life easier and tidy your dataset using pandas.melt. Tidy data is easier to analyze and usually more performant. For those who are interested, there’s a great article on Medium on tidying datasets in Python.
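As a quick sketch of what melting looks like, here’s a tiny wide-format frame being reshaped so each row holds one observation (the revenue numbers are made up for illustration):

```python
import pandas as pd

# A small wide-format frame: one row per company, one column per metric.
# The revenue figures are hypothetical, used only to show the reshape.
wide = pd.DataFrame({
    'name': ['Apple Inc', 'Boeing Co'],
    'employees': [132000, 153027],
    'revenue': [260174, 76559],
})

# melt turns each metric column into (variable, value) rows,
# giving one observation per row -- the tidy shape.
tidy = pd.melt(wide, id_vars='name', var_name='metric', value_name='value')
print(tidy)
```

Two companies times two metrics yields four tidy rows, which is the shape grouping and plotting functions tend to prefer.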
String Operations in Pandas
Often, we’ll need to manipulate string data that is formatted incorrectly. Pandas provides vectorized string operations through the .str accessor. For most string operations, you’ll need to be familiar with regular expressions.
Method | Description |
---|---|
pandas.Series.str.split | Splits string on specified delimiter |
pandas.Series.str.replace | Replaces string on match of string or regex |
pandas.Series.str.extract | Extracts string on regex group match |
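Here’s a quick sketch of split and replace from the table, using made-up ticker strings for illustration:

```python
import pandas as pd

# Hypothetical ticker strings with an exchange suffix.
tickers = pd.Series(['AAPL:US', 'BA:US', 'CAT:US'])

# split breaks each string on the delimiter; .str[0] keeps the symbol part
symbols = tickers.str.split(':').str[0]

# replace with a regex strips the ':US' suffix instead
stripped = tickers.str.replace(r':US$', '', regex=True)

print(symbols.tolist())   # ['AAPL', 'BA', 'CAT']
print(stripped.tolist())  # ['AAPL', 'BA', 'CAT']
```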
Let’s perform an example extract operation by smushing some of our existing data together.
companies_smushed = pd.DataFrame()
companies_smushed['sector_employees'] = companies['sector'] + \
companies['employees'].astype(str)
print(companies[['sector', 'employees']].head())
print(companies_smushed.head())
companies_smushed = companies_smushed['sector_employees'].str.extract(r'(\D+)(\d+)')
companies_smushed.head()
sector employees
0 Consumer Goods 132000
1 Financial 59000
2 Industrial Goods 153027
3 Industrial Goods 104000
4 Technology 74200
sector_employees
0 Consumer Goods 132000
1 Financial 59000
2 Industrial Goods 153027
3 Industrial Goods 104000
4 Technology 74200
0 1
0 Consumer Goods 132000
1 Financial 59000
2 Industrial Goods 153027
3 Industrial Goods 104000
4 Technology 74200
Grouping Operations: Split-Apply-Combine
Grouping operations can be broken down into three steps:
- Split the data into groups
- Apply a function to each group
- Combine the groups into a result
Before we use pandas to group and modify our data, let’s look at how we could accomplish counting the number of companies in each sector using python.
sector_count = {}
for sector in companies['sector']:
if sector_count.get(sector):
sector_count.update({sector: sector_count[sector] + 1})
else:
sector_count.update({sector: 1})
print(sector_count)
{'Consumer Goods': 4,
'Financial': 5,
'Industrial Goods': 5,
'Technology': 5,
'Basic Materials': 2,
'Services': 3,
'Healthcare': 4}
Grouping
pandas.DataFrame.groupby gives us a better way to group data. groupby returns a DataFrameGroupBy or a SeriesGroupBy object. These objects can be thought of as a group blueprint. You can see below that sector_group.groups returns a dictionary whose keys are the sectors and whose values are the associated rows. For example, the Financial group contains rows [1, 6, 8, 14, 23].
sector_group = companies.groupby(by='sector')
print(sector_group)
print(sector_group.groups)
print(sector_group.get_group('Financial'))
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2a7853b828>
{'Basic Materials': Int64Index([5, 27], dtype='int64'),
'Consumer Goods': Int64Index([0, 15, 20, 22], dtype='int64'),
'Financial': Int64Index([1, 6, 8, 14, 23], dtype='int64'),
'Healthcare': Int64Index([13, 18, 21, 24], dtype='int64'),
'Industrial Goods': Int64Index([2, 3, 7, 12, 17], dtype='int64'),
'Services': Int64Index([9, 16, 26], dtype='int64'),
'Technology': Int64Index([4, 10, 11, 19, 25], dtype='int64')}
id name ... industry_group employees
1 2 American Express Co ... Credit Services 59000
6 25 Visa Inc ... Credit Services 17000
8 8 Goldman Sachs Group Inc ... Investment Brokerage - National 36600
14 13 JPMorgan Chase & Co ... Money Center Banks 256105
23 22 Travelers Companies Inc ... Property & Casualty Insurance 30400
[5 rows x 7 columns]
To show that the groupby object is just a blueprint, we can replicate get_group by using iloc.
print(companies.iloc[[1,6,8,14,23],[1,5,6]])
name industry_group employees
1 American Express Co Credit Services 59000
6 Visa Inc Credit Services 17000
8 Goldman Sachs Group Inc Investment Brokerage - National 36600
14 JPMorgan Chase & Co Money Center Banks 256105
23 Travelers Companies Inc Property & Casualty Insurance 30400
In summary, groupby creates a blueprint that enables us to run many useful operations on the group. Using a groupby object is efficient because a single grouping supports many subsequent calculations.
Alternatively, we can use the power of pandas and use boolean indexing and an aggregation method to return the number of companies in each sector.
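A sketch of that alternative, using a toy sector column so the snippet stands alone:

```python
import pandas as pd

# A toy sector column for illustration.
sectors = pd.Series(['Financial', 'Technology', 'Financial', 'Healthcare'])

# value_counts aggregates per unique value in one call
print(sectors.value_counts().to_dict())

# boolean indexing plus an aggregation counts a single sector
print((sectors == 'Financial').sum())  # 2
```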
Aggregation
Aggregation takes values and returns a value of a lesser dimension — for example, a function that takes a series and returns a single scalar. Every groupby aggregation, implicitly or explicitly, has a grouping column, an aggregation column, and an aggregating function. Common aggregation functions are shown below:
Function | Description |
---|---|
mean() | Compute mean of groups |
sum() | Compute sum of group values |
size() | Compute group sizes |
count() | Compute count of group |
std() | Standard deviation of groups |
var() | Compute variance of groups |
sem() | Standard error of the mean of groups |
describe() | Generates descriptive statistics |
first() | Compute first of group values |
last() | Compute last of group values |
nth() | Take nth value, or a subset if n is a list |
min() | Compute min of group values |
max() | Compute max of group values |
print(sector_group.size())
print(sector_group.size().sum())
sector
Basic Materials 2
Consumer Goods 4
Financial 5
Healthcare 4
Industrial Goods 5
Services 3
Technology 5
dtype: int64
28
We can also group by multiple columns. Here we select the employees column and sum it within each sector and industry group.
companies.groupby(['sector', 'industry_group'])['employees'].sum()
sector industry_group
Basic Materials Major Integrated Oil & Gas 119600
Consumer Goods Beverages - Soft Drinks 62600
Electronic Equipment 132000
Personal Products 92000
Textile - Apparel Footwear & Accessories 73100
Financial Credit Services 76000
Investment Brokerage - National 36600
Money Center Banks 256105
Property & Casualty Insurance 30400
Healthcare Drug Manufacturers - Major 296500
Health Care Plans 300000
Industrial Goods Aerospace/Defense Products & Services 393027
Diversified Machinery 376516
Farm & Construction Machinery 104000
Services Discount, Variety Stores 2200000
Home Improvement Stores 413000
Restaurants 210000
Technology Business Software & Services 131000
Information Technology Services 350600
Networking & Communication Devices 74200
Semiconductor - Broad Line 107400
Telecom Services - Domestic 144500
Name: employees, dtype: int64
pandas.core.groupby.DataFrameGroupBy.agg allows us to perform multiple aggregations at once, including user-defined aggregations.
def half(column):
    return column.sum() / 2

def total(column):
    return column.sum()

companies.groupby(['sector']).agg({'employees': [np.min, half, total]})
employees
amin half total
sector
Basic Materials 48600 59800.0 119600
Consumer Goods 62600 179850.0 359700
Financial 17000 199552.5 399105
Healthcare 69000 298250.0 596500
Industrial Goods 93516 436771.5 873543
Services 210000 1411500.0 2823000
Technology 74200 403850.0 807700
Transformation
Transformation returns manipulated data in the same size as the passed data. There are many methods and functions at your disposal to transform data, as shown below.
Method / Function | Description |
---|---|
Series.map | Substitute each value for another |
Series.apply | Invoke a function elementwise on a series |
DataFrame.applymap | Apply a function elementwise to a dataframe |
DataFrame.apply | Invoke a function on each column or row |
Series.transform | Invoke a function returning series of transformed values |
DataFrame.transform | Invoke a function returning dataframe of transformed values |
pandas.melt | Reshape a dataframe. Useful for graphing, vectorized operations, and tidying data. |
pandas.pivot | Reshape a dataframe. Essentially an unmelt operation. |
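The “unmelt” behavior of pandas.pivot from the table above can be sketched by round-tripping a small tidy frame (a toy frame, for illustration):

```python
import pandas as pd

# A tidy frame: one observation per row.
tidy = pd.DataFrame({
    'name': ['Apple Inc', 'Apple Inc', 'Boeing Co', 'Boeing Co'],
    'metric': ['employees', 'sector', 'employees', 'sector'],
    'value': [132000, 'Consumer Goods', 153027, 'Industrial Goods'],
})

# pivot spreads the metric rows back out into columns -- the unmelt
wide = tidy.pivot(index='name', columns='metric', values='value')
print(wide)
```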
In the below example, we pass a defined function find_percent and return the transformed series.
def find_percent(column):
    return column / float(column.sum())

companies.groupby('sector').agg({'employees': 'sum'}).transform(find_percent)
employees
sector
Basic Materials 0.020003
Consumer Goods 0.060159
Financial 0.066749
Healthcare 0.099763
Industrial Goods 0.146098
Services 0.472141
Technology 0.135086
As with agg, we can pass multiple functions, including lambda functions.
companies.groupby('sector').agg({'employees':'sum'}).transform([lambda x: x / x.sum()])
employees
<lambda>
sector
Basic Materials 0.020003
Consumer Goods 0.060159
Financial 0.066749
Healthcare 0.099763
Industrial Goods 0.146098
Services 0.472141
Technology 0.135086
While we can’t pass multiple functions to apply the way we can with transform, apply hands the function the entire group, so it can access other columns — something transform can’t do. Again, it’s best to know what’s available to you and use the best tool for the job.
companies.groupby('sector').apply(lambda x: x['employees'] * 2)
sector
Basic Materials 5 97200
27 142000
Consumer Goods 0 264000
15 125200
20 146200
22 184000
Financial 1 118000
6 34000
8 73200
14 512210
23 60800
Healthcare 13 270200
18 138000
21 184800
24 600000
Industrial Goods 2 306054
3 208000
7 566000
12 480000
17 187032
Services 9 826000
16 420000
26 4400000
Technology 4 148400
10 701200
11 214800
19 262000
25 289000
Name: employees, dtype: int64
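To make the “access other columns” point concrete, here’s a sketch with a small inline frame where the applied function needs two columns at once:

```python
import pandas as pd

# A toy frame standing in for the companies data.
df = pd.DataFrame({
    'sector': ['Financial', 'Financial', 'Technology'],
    'name': ['American Express Co', 'Visa Inc', 'Cisco Systems Inc'],
    'employees': [59000, 17000, 74200],
})

# apply receives each group as a dataframe, so the function can read
# both the employees and name columns; transform sees one column at a time.
avg_per_company = df.groupby('sector').apply(
    lambda g: g['employees'].sum() / g['name'].count()
)
print(avg_per_company)
```

Financial averages (59000 + 17000) / 2 = 38000 employees per company; Technology has a single company, so its average is its own head count.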
Filtration
Filter returns a subset of the original data. It works similarly to boolean indexing, except that instead of operating on individual rows, it operates on whole groups. The filter function must return a single True or False for each group. The Services sector is the only sector that has more than 1,000,000 employees.
companies.groupby('sector').filter(
lambda x: x['employees'].sum() > 1000000
)[['name', 'employees']]
name employees
id
9 Home Depot Inc 413000
15 McDonalds Corp 210000
27 Walmart Inc 2200000
Pivot Tables
We can produce the same data in a different format by using pandas.pivot_table.
companies.pivot_table(columns='sector', values='employees', aggfunc='sum')
sector Basic Materials Consumer Goods ...
employees 119600 359700
Joining Data
We can quickly join two dataframes through join, merge, and concat. Merge is the underlying function for all join/merge behavior. Join has slightly different defaults and is provided as a convenience method.
Method / Function | Description |
---|---|
pandas.DataFrame.join | Join dataframe on index or key column |
pandas.DataFrame.merge | Merge dataframe using database-style joins |
pandas.concat | Concatenate dataframes along a particular axis |
Let’s see an example in action by bringing in the security data and joining on the id column.
import pandas as pd
import numpy as np
import urllib.request
url = ('https://raw.githubusercontent.com/leosmigel'
       '/analyzingalpha/master/data-manipulation-with-python/securities.csv')
with urllib.request.urlopen(url) as f:
    securities = pd.read_csv(f, index_col='id')
print(securities.head())
securities_companies = companies.join(securities)
securities_companies[['name', 'ticker', 'figi']].head(7)
ticker currency figi
id
1 AAPL USD BBG000B9Y5X2
2 AXP USD BBG000BCR153
3 BA USD BBG000BCSV38
4 CAT USD BBG000BF0LJ6
5 CSCO USD BBG000C3JBN9
name ticker figi
id
1 Apple Inc AAPL BBG000B9Y5X2
2 American Express Co AXP BBG000BCR153
3 Boeing Co BA BBG000BCSV38
4 Caterpillar Inc CAT BBG000BF0LJ6
5 Cisco Systems Inc CSCO BBG000C3JBN9
6 Chevron Corp CVX BBG000K4NHJ5
25 Visa Inc V BBG000PSL0X0
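Since merge is the underlying function, the join above can be reproduced by merging on the shared index. A sketch with toy frames whose values mirror the sample output:

```python
import pandas as pd

# Toy frames mirroring the first two rows of companies and securities.
companies = pd.DataFrame(
    {'name': ['Apple Inc', 'American Express Co']},
    index=pd.Index([1, 2], name='id'))
securities = pd.DataFrame(
    {'ticker': ['AAPL', 'AXP'],
     'figi': ['BBG000B9Y5X2', 'BBG000BCR153']},
    index=pd.Index([1, 2], name='id'))

# join is a convenience wrapper; merge makes the keys and join type explicit
merged = companies.merge(securities, left_index=True, right_index=True,
                         how='left')
print(merged[['name', 'ticker', 'figi']])
```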