/ DATA SCIENCE

Data Manipulation with Python using Pandas

Pandas was developed at hedge fund AQR by Wes McKinney to enable quick analysis of financial data. Pandas is an extension of NumPy that supports vectorized operations enabling fast manipulation of financial information.

I’ll be using company data provided by an Intrinio developer sandbox. If you want to follow along, you can find the code and the data on the Analyzing Alpha Github Repo.

import pandas as pd
import numpy as np
import urllib.request
url= 'https://raw.githubusercontent.com/leosmigel ...
  /analyzingalpha/master/data-manipulation-with-python/companies.csv'
with urllib.request.urlopen(url) as f:
  companies = pd.read_csv(f, index_col='id')
companies[['name', 'sector']].head()
        name	          sector
id		
1	Apple Inc         Consumer Goods
2	American          Express Co Financial
3	Boeing Co         Industrial Goods
4	Caterpillar Inc	  Industrial Goods
5	Cisco Systems Inc Technology

Making Your Data Tidy

While it’s not needed for these simple examples, I want to introduce Tidy Data. When working with large datasets, and especially for factor analysis, you’ll want to make your life easier and tidy your dataset using pandas.melt. It makes the data easier to analyze and usually more performant. For those who are interested, there’s a great article on medium on Tidying Dataset in Python.

String Operations in Pandas

Often, we’ll need to manipulate string data that are formatted incorrectly. Pandas provides vectorized string operations. For most string operations, you’ll need to be familiar with regular expressions.

MethodDescription
pandas.Series.str.splitSplits string on specified delimiter
pandas.Series.str.replaceReplaces string on match of string or regex
pandas.Series.str.extractExtracts string on regex group match

Let’s perform an example extract operation by smushing some of our existing data together.

companies_smushed = pd.DataFrame()
companies_smushed['sector_employees'] = companies['sector'] + \
                                        companies['employees'].astype(str)
print(companies[['sector', 'employees']].head())
print(companies_smushed.head())
companies_smushed = companies_smushed['sector_employees'].str.extract(r'(\D+)(\d+)')
companies_smushed.head()
    sector              employees
0   Consumer Goods      132000
1   Financial           59000
2   Industrial Goods    153027
3   Industrial Goods    104000
4   Technology          74200

    sector_employees
0   Consumer Goods132000
1   Financial59000
2   Industrial Goods153027
3   Industrial Goods104000
4   Technology74200

0   1
0   Consumer Goods    132000
1   Financial         59000
2   Industrial Goods  153027
3   Industrial Goods  104000
4   Technology        74200

Grouping Operations: Split-Apply-Combine

Grouping operations can be broken down into three steps:

  1. Split the data into groups
  2. Apply a function to each group
  3. Combine the groups into a result

Before we use pandas to group and modify our data, let’s look at how we could accomplish counting the number of companies in each sector using python.

sector_count = {}
for sector in companies['sector']:
  if sector_count.get(sector):
    sector_count.update({sector: sector_count[sector] + 1})
  else:
    sector_count.update({sector: 1})
print(sector_count)
{'Consumer Goods': 4,
'Financial': 5,
'Industrial Goods': 5,
'Technology': 5,
'Basic Materials': 2,
'Services': 3,
'Healthcare': 4}

Grouping

pandas.Dataframe.groupby gives us a better way to group data. groupby returns a DataFrameGroupBy or a SeriesGroupBy object. These objects can be thought of the group. You can see below that sector_group.groups returns a dictionary of key/value pairs being sectors and their associated rows. See below that the financial group contains rows ([1, 6, 8, 14, 23]) respectively.

sector_group = companies.groupby(by='sector')
print(sector_group)
print(sector_group.groups)
print(sector_group.get_group('Financial'))
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2a7853b828>

{'Basic Materials': Int64Index([5, 27], dtype='int64'),
'Consumer Goods': Int64Index([0, 15, 20, 22], dtype='int64'),
'Financial': Int64Index([1, 6, 8, 14, 23], dtype='int64'),
'Healthcare': Int64Index([13, 18, 21, 24], dtype='int64'),
'Industrial Goods': Int64Index([2, 3, 7, 12, 17], dtype='int64'),
'Services': Int64Index([9, 16, 26], dtype='int64'),
'Technology': Int64Index([4, 10, 11, 19, 25], dtype='int64')}

    id                     name  ...                   industry_group employees
1    2      American Express Co  ...                  Credit Services     59000
6   25                 Visa Inc  ...                  Credit Services     17000
8    8  Goldman Sachs Group Inc  ...  Investment Brokerage - National     36600
14  13      JPMorgan Chase & Co  ...               Money Center Banks    256105
23  22  Travelers Companies Inc  ...    Property & Casualty Insurance     30400

[5 rows x 7 columns]

To show that the groupby object is just a blueprint, we can replicate get_group by using iloc.

print(companies.iloc[[1,6,8,14,23],[1,5,6]])
                       name                   industry_group  employees
1       American Express Co                  Credit Services      59000
6                  Visa Inc                  Credit Services      17000
8   Goldman Sachs Group Inc  Investment Brokerage - National      36600
14      JPMorgan Chase & Co               Money Center Banks     256105
23  Travelers Companies Inc    Property & Casualty Insurance      30400

In summary, groupby creates a blueprint that enables us to run many useful operations on the group. Using a groupby object is efficient as it allows us to have a one-to-many relationship in regards to calculating group values.

Alternatively, we can use the power of Pandas and use boolean indexing and an aggregation method to return the number of companies in each sector.

Aggregation

Aggregation takes the values and returns a value of a lesser dimension. For example, a function that takes a series and returns a single scalar value. Every GroupBy aggregation either implicitly or explicitly has a grouping column, an aggregation column, and a function column. Common aggregation functions are shown below:

FunctionDescription
mean()Compute mean of groups
sum()Compute sum of group values
size()Compute group sizes
count()Compute count of group
std()Standard deviation of groups
var()Compute variance of groups
sem()Standard error of the mean of groups
describe()Generates descriptive statistics
first()Compute first of group values
last()Compute last of group values
nth()Take nth value, or a subset if n is a list
min()Compute min of group values
max()Compute max of group values

Let’s think back to our original example of getting the total number of companies in each sector.

print(sector_group.size())
print(sector_group.size().sum())
sector
Basic Materials     2
Consumer Goods      4
Financial           5
Healthcare          4
Industrial Goods    5
Services            3
Technology          5
dtype: int64
28

We can also group by multiple columns. Notice that the aggregate function was called on the employees column automatically as it’s the only number-type column.

companies.groupby(['sector', 'industry_group'])['employees'].sum()
sector            industry_group                          
Basic Materials   Major Integrated Oil & Gas                   119600
Consumer Goods    Beverages - Soft Drinks                       62600
                  Electronic Equipment                         132000
                  Personal Products                             92000
                  Textile - Apparel Footwear & Accessories      73100
Financial         Credit Services                               76000
                  Investment Brokerage - National               36600
                  Money Center Banks                           256105
                  Property & Casualty Insurance                 30400
Healthcare        Drug Manufacturers - Major                   296500
                  Health Care Plans                            300000
Industrial Goods  Aerospace/Defense Products & Services        393027
                  Diversified Machinery                        376516
                  Farm & Construction Machinery                104000
Services          Discount, Variety Stores                    2200000
                  Home Improvement Stores                      413000
                  Restaurants                                  210000
Technology        Business Software & Services                 131000
                  Information Technology Services              350600
                  Networking & Communication Devices            74200
                  Semiconductor - Broad Line                   107400
                  Telecom Services - Domestic                  144500
Name: employees, dtype: int64

pandas.core.groupby.DataFrame.agg allows us to perform multiple aggregations at once including user-defined aggregations.

def half(column):
  return column.sum() / 2

def total(column):
  return column.sum()

companies.groupby(['sector']).agg({'employees': [np.min, half, total]})
                        employees
                  amin	  half	        total
sector			
Basic Materials   48600	  59800.0       119600
Consumer Goods    62600	  179850.0	359700
Financial         17000	  199552.5	399105
Healthcare        69000	  298250.0	596500
Industrial Goods  93516	  436771.5	873543
Services          210000  1411500.0	2823000
Technology        74200	  403850.0	807700

Transformation

Transformation returns manipulated data in the same size as the passed data. There are many methods and functions at your disposal to transform data, as shown below.

Method / FunctionDescription
Series.mapSubstitute each value for another
Series.applyInvoke a function elementwise on a series
DataFrame.applymapApply a function elementwise to a dataframe
DataFrame.applyInvoke a function on each column or row
Series.transformInvoke a function returning series of transformed values
DataFrame.transformInvoke a function returning dataframe of transformed values
pandas.meltReshape a dataframe. Useful for graphing, vectorized operations, and tidying data.
pandas.pivotReshape a dataframe. Essentially an unmelt operation.

In the below example, we pass a defined function find_percent and return the transformed series.

def find_percent(column):
  return column / float(column.sum())
companies.groupby('sector').agg({'employees': 'sum'}).transform(find_percent)
	employees
sector	
Basic Materials	  0.020003
Consumer Goods	  0.060159
Financial	  0.066749
Healthcare	  0.099763
Industrial Goods  0.146098
Services	  0.472141
Technology	  0.135086

As with agg, we can pass multiple functions, including lambda functions.

companies.groupby('sector').agg({'employees':'sum'}).transform([lambda x: x / x.sum()])
	  employees
                    <lambda>
sector	
Basic Materials     0.020003
Consumer Goods      0.060159
Financial           0.066749
Healthcare          0.099763
Industrial Goods    0.146098
Services            0.472141
Technology          0.135086

While we can’t pass multiple functions to apply as we can with transform, we can access other columns using apply where we are limited with transform. Again, it’s best to know what’s available to you and use the best tool for the job at that current moment.

companies.groupby('sector').apply(lambda x: x['employees'] * 2)
sector              
Basic Materials   5       97200
                  27     142000
Consumer Goods    0      264000
                  15     125200
                  20     146200
                  22     184000
Financial         1      118000
                  6       34000
                  8       73200
                  14     512210
                  23      60800
Healthcare        13     270200
                  18     138000
                  21     184800
                  24     600000
Industrial Goods  2      306054
                  3      208000
                  7      566000
                  12     480000
                  17     187032
Services          9      826000
                  16     420000
                  26    4400000
Technology        4      148400
                  10     701200
                  11     214800
                  19     262000
                  25     289000
Name: employees, dtype: int64

Filtration

Filter returns a subset of the original data. It works similar to boolean indexing except instead of working on individual rows, it works on individual groups. Filter must return a True or False value for the group as a whole. The services sector is the only sector that has more than 1,000,000 employees.

companies.groupby('sector').filter(
    lambda x: x['employees'].sum() > 1000000
    )[['name', 'employees']]
name	employees
id		
9	Home Depot Inc    413000
15	McDonalds Corp    210000
27	Walmart Inc       2200000

Pivot Tables

We can produce the same data in a different format by using pandas.pivot_table.

companies.pivot_table(columns='sector', values='employees', aggfunc='sum')
sector    Basic Materials Consumer Goods ...
employees 119600          359700

Joining Data

We can quickly join two dataframes through join, merge, and concat. Merge is the underlying function for all join/merge behavior. Join has slightly different defaults and is provided as a convenience method.

Method / FunctionDescription
pandas.DataFrame.joinJoin dataframe on on index, or key column
pandas.DataFrame.mergeMerge dataframe using database-style joins
pandas.concatConcatenate dataframes along a particular axis

Let’s see an example in action by bringing in the security data and joining on the id column.

import pandas as pd
import numpy as np
import urllib.request
url= 'https://raw.githubusercontent.com/leosmigel ... 
  /analyzingalpha/master/data-manipulation-with-python/securities.csv'
with urllib.request.urlopen(url) as f:
  securities = pd.read_csv(f, index_col='id')
print(securities.head())
securities_companies = companies.join(securities)
securities_companies[['name', 'ticker', 'figi']].head(7)
 ticker currency          figi
id                              
1    AAPL      USD  BBG000B9Y5X2
2     AXP      USD  BBG000BCR153
3      BA      USD  BBG000BCSV38
4     CAT      USD  BBG000BF0LJ6
5    CSCO      USD  BBG000C3JBN9
    name                ticker    figi
id			
1   Apple Inc           AAPL  BBG000B9Y5X2
2   American Express Co	AXP   BBG000BCR153
3   Boeing Co           BA    BBG000BCSV38
4   Caterpillar Inc     CAT   BBG000BF0LJ6
5   Cisco Systems Inc   CSCO  BBG000C3JBN9
6   Chevron Corp        CVX   BBG000K4NHJ5
25  Visa Inc            V     BBG000PSL0X0
leo

Leo Smigel

Based in Pittsburgh, Analyzing Alpha is a blog by Leo Smigel exploring what works in the markets.