Data Manipulation with Python using Pandas

Pandas was developed at the hedge fund AQR by Wes McKinney to enable quick analysis of financial data. It is an extension of NumPy that supports vectorized operations, enabling fast manipulation of tabular data.

I’ll be using company data provided by an Intrinio developer sandbox. If you want to follow along, you can find the code and the data on the Analyzing Alpha Github Repo.

import pandas as pd
import numpy as np
import urllib.request
url = ('https://raw.githubusercontent.com/leosmigel'
       '/analyzingalpha/master/2019-09-30-data-manipulation-with-python/companies.csv')
with urllib.request.urlopen(url) as f:
  companies = pd.read_csv(f, index_col='id')
companies[['name', 'sector']].head()
                  name            sector
id
1            Apple Inc    Consumer Goods
2  American Express Co         Financial
3            Boeing Co  Industrial Goods
4      Caterpillar Inc  Industrial Goods
5    Cisco Systems Inc        Technology

Making Your Data Tidy

While it’s not needed for these simple examples, I want to introduce Tidy Data. When working with large datasets, and especially for factor analysis, you’ll want to make your life easier and tidy your dataset using pandas.melt. Tidy data is easier to analyze and usually more performant. For those who are interested, there’s a great article on Medium on Tidying Dataset in Python.
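As a quick illustration of what melting looks like, here’s a sketch on a small, made-up wide-format table (the column names and values are hypothetical, not from the Intrinio data):

```python
import pandas as pd

# Hypothetical wide-format data: one column per measured variable
wide = pd.DataFrame({
    'name': ['Apple Inc', 'Boeing Co'],
    'revenue': [260174, 76559],
    'employees': [132000, 153027],
})

# pandas.melt reshapes it into tidy long format:
# one row per (company, variable) observation
tidy = pd.melt(wide, id_vars='name', var_name='variable', value_name='value')
print(tidy)
```

Each metric now lives in its own row, which is the shape most grouping and plotting tools expect.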

String Operations in Pandas

Often, we’ll need to manipulate string data that is formatted incorrectly. Pandas provides vectorized string operations for this purpose. For most string operations, you’ll need to be familiar with regular expressions.

Method                     Description
pandas.Series.str.split    Splits string on specified delimiter
pandas.Series.str.replace  Replaces string on match of string or regex
pandas.Series.str.extract  Extracts string on regex group match
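Here’s a quick sketch of split and replace on a made-up series of exchange-prefixed tickers (the data is hypothetical, not from our dataset):

```python
import pandas as pd

# Hypothetical exchange-prefixed tickers
s = pd.Series(['NYSE:AXP', 'NASDAQ:AAPL'])

# split breaks each string on the delimiter into separate columns
print(s.str.split(':', expand=True))

# replace strips the exchange prefix via a regex match
tickers = s.str.replace(r'^\w+:', '', regex=True)
print(tickers)

# extract pulls out both regex groups at once
parts = s.str.extract(r'(\w+):(\w+)')
print(parts)
```

All three run elementwise over the whole series, so there’s no need for an explicit Python loop.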

Let’s perform an example extract operation by smushing some of our existing data together.

companies_smushed = pd.DataFrame()
companies_smushed['sector_employees'] = companies['sector'] + \
                                        companies['employees'].astype(str)
print(companies[['sector', 'employees']].head())
print(companies_smushed.head())
companies_smushed = companies_smushed['sector_employees'].str.extract(r'(\D+)(\d+)')
companies_smushed.head()
             sector  employees
0    Consumer Goods     132000
1         Financial      59000
2  Industrial Goods     153027
3  Industrial Goods     104000
4        Technology      74200

        sector_employees
0   Consumer Goods132000
1         Financial59000
2  Industrial Goods153027
3  Industrial Goods104000
4        Technology74200

                  0       1
0    Consumer Goods  132000
1         Financial   59000
2  Industrial Goods  153027
3  Industrial Goods  104000
4        Technology   74200

Grouping Operations: Split-Apply-Combine

Grouping operations can be broken down into three steps:

  1. Split the data into groups
  2. Apply a function to each group
  3. Combine the groups into a result

Before we use pandas to group and modify our data, let’s look at how we could count the number of companies in each sector using plain Python.

sector_count = {}
for sector in companies['sector']:
  if sector_count.get(sector):
    sector_count.update({sector: sector_count[sector] + 1})
  else:
    sector_count.update({sector: 1})
print(sector_count)
{'Consumer Goods': 4,
'Financial': 5,
'Industrial Goods': 5,
'Technology': 5,
'Basic Materials': 2,
'Services': 3,
'Healthcare': 4}
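For comparison, the standard library can do the same count without the manual bookkeeping; a sketch using collections.Counter on a made-up sector list:

```python
from collections import Counter

# Hypothetical list of sectors, standing in for companies['sector']
sectors = ['Consumer Goods', 'Financial', 'Financial', 'Technology']

# Counter handles the get/update bookkeeping for us
sector_count = Counter(sectors)
print(sector_count)
```

Either way, we’re doing the split-apply-combine work by hand; pandas gives us a better tool for it, as we’ll see next.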

Grouping

pandas.DataFrame.groupby gives us a better way to group data. groupby returns a DataFrameGroupBy or a SeriesGroupBy object, which can be thought of as a blueprint for the groups. You can see below that sector_group.groups returns a dictionary whose keys are the sectors and whose values are their associated rows. The Financial group, for example, contains rows [1, 6, 8, 14, 23].

sector_group = companies.groupby(by='sector')
print(sector_group)
print(sector_group.groups)
print(sector_group.get_group('Financial'))
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2a7853b828>

{'Basic Materials': Int64Index([5, 27], dtype='int64'),
'Consumer Goods': Int64Index([0, 15, 20, 22], dtype='int64'),
'Financial': Int64Index([1, 6, 8, 14, 23], dtype='int64'),
'Healthcare': Int64Index([13, 18, 21, 24], dtype='int64'),
'Industrial Goods': Int64Index([2, 3, 7, 12, 17], dtype='int64'),
'Services': Int64Index([9, 16, 26], dtype='int64'),
'Technology': Int64Index([4, 10, 11, 19, 25], dtype='int64')}

    id                     name  ...                   industry_group  employees
1    2      American Express Co  ...                  Credit Services      59000
6   25                 Visa Inc  ...                  Credit Services      17000
8    8  Goldman Sachs Group Inc  ...  Investment Brokerage - National      36600
14  13      JPMorgan Chase & Co  ...               Money Center Banks     256105
23  22  Travelers Companies Inc  ...    Property & Casualty Insurance      30400

[5 rows x 7 columns]

To show that the groupby object is just a blueprint, we can replicate get_group by using iloc.

print(companies.iloc[[1,6,8,14,23],[1,5,6]])
                       name                   industry_group  employees
1       American Express Co                  Credit Services      59000
6                  Visa Inc                  Credit Services      17000
8   Goldman Sachs Group Inc  Investment Brokerage - National      36600
14      JPMorgan Chase & Co               Money Center Banks     256105
23  Travelers Companies Inc    Property & Casualty Insurance      30400

In summary, groupby creates a blueprint that enables us to run many useful operations on the group. Using a groupby object is efficient because a single grouping can feed many different calculations over the group values.
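Because the groupby object is lazy, we can also iterate over it directly; a sketch on made-up data (the rows are hypothetical, not the full dataset):

```python
import pandas as pd

# Hypothetical slice of the companies data
companies = pd.DataFrame({
    'name': ['Apple Inc', 'American Express Co', 'Visa Inc'],
    'sector': ['Consumer Goods', 'Financial', 'Financial'],
})

# Iterating a groupby yields (group key, sub-DataFrame) pairs
sizes = {sector: len(group) for sector, group in companies.groupby('sector')}
print(sizes)
```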

Alternatively, we can harness the power of pandas and use boolean indexing with an aggregation method to count the companies in a sector.
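A sketch of that boolean-indexing approach on made-up data (the rows are hypothetical):

```python
import pandas as pd

# Hypothetical slice of the companies data
companies = pd.DataFrame({
    'name': ['Apple Inc', 'American Express Co', 'Visa Inc'],
    'sector': ['Consumer Goods', 'Financial', 'Financial'],
})

# Boolean indexing plus an aggregation counts one sector...
n_financial = (companies['sector'] == 'Financial').sum()
print(n_financial)

# ...and value_counts counts every sector at once
print(companies['sector'].value_counts())
```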

Aggregation

Aggregation takes values and returns a value of a lesser dimension, for example, a function that takes a series and returns a single scalar. Every GroupBy aggregation, either implicitly or explicitly, has a grouping column, an aggregation column, and an aggregating function. Common aggregation functions are shown below:

Function     Description
mean()       Compute mean of groups
sum()        Compute sum of group values
size()       Compute group sizes
count()      Compute count of group
std()        Standard deviation of groups
var()        Compute variance of groups
sem()        Standard error of the mean of groups
describe()   Generates descriptive statistics
first()      Compute first of group values
last()       Compute last of group values
nth()        Take nth value, or a subset if n is a list
min()        Compute min of group values
max()        Compute max of group values

print(sector_group.size())
print(sector_group.size().sum())
sector
Basic Materials     2
Consumer Goods      4
Financial           5
Healthcare          4
Industrial Goods    5
Services            3
Technology          5
dtype: int64
28

We can also group by multiple columns. Notice that the aggregate function was applied to the employees column automatically, as it’s the only numeric column.

companies.groupby(['sector', 'industry_group'])['employees'].sum()
sector            industry_group
Basic Materials   Major Integrated Oil & Gas                   119600
Consumer Goods    Beverages - Soft Drinks                       62600
                  Electronic Equipment                         132000
                  Personal Products                             92000
                  Textile - Apparel Footwear & Accessories      73100
Financial         Credit Services                               76000
                  Investment Brokerage - National               36600
                  Money Center Banks                           256105
                  Property & Casualty Insurance                 30400
Healthcare        Drug Manufacturers - Major                   296500
                  Health Care Plans                            300000
Industrial Goods  Aerospace/Defense Products & Services        393027
                  Diversified Machinery                        376516
                  Farm & Construction Machinery                104000
Services          Discount, Variety Stores                    2200000
                  Home Improvement Stores                      413000
                  Restaurants                                  210000
Technology        Business Software & Services                 131000
                  Information Technology Services              350600
                  Networking & Communication Devices            74200
                  Semiconductor - Broad Line                   107400
                  Telecom Services - Domestic                  144500
Name: employees, dtype: int64

pandas.core.groupby.DataFrameGroupBy.agg allows us to perform multiple aggregations at once, including user-defined aggregations.

def half(column):
  return column.sum() / 2

def total(column):
  return column.sum()

companies.groupby(['sector']).agg({'employees': [np.min, half, total]})
                 employees
                      amin       half    total
sector
Basic Materials      48600    59800.0   119600
Consumer Goods       62600   179850.0   359700
Financial            17000   199552.5   399105
Healthcare           69000   298250.0   596500
Industrial Goods     93516   436771.5   873543
Services            210000  1411500.0  2823000
Technology           74200   403850.0   807700

Transformation

Transformation returns manipulated data in the same size as the passed data. There are many methods and functions at your disposal to transform data, as shown below.

Method / Function      Description
Series.map             Substitute each value for another
Series.apply           Invoke a function elementwise on a series
DataFrame.applymap     Apply a function elementwise to a dataframe
DataFrame.apply        Invoke a function on each column or row
Series.transform       Invoke a function returning series of transformed values
DataFrame.transform    Invoke a function returning dataframe of transformed values
pandas.melt            Reshape a dataframe. Useful for graphing, vectorized operations, and tidying data.
pandas.pivot           Reshape a dataframe. Essentially an unmelt operation.
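A short sketch of the elementwise methods from the table, on made-up numbers (hypothetical employee counts):

```python
import pandas as pd

# Hypothetical employee counts
employees = pd.Series([132000, 59000, 153027])

# Series.map substitutes each value via a function (or a dict)
in_thousands = employees.map(lambda n: n / 1000)
print(in_thousands)

# DataFrame.apply invokes a function per column (axis=0) or per row (axis=1)
df = pd.DataFrame({'employees': employees})
ranges = df.apply(lambda col: col.max() - col.min())
print(ranges)
```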

In the below example, we pass a defined function find_percent and return the transformed series.

def find_percent(column):
  return column / float(column.sum())
companies.groupby('sector').agg({'employees': 'sum'}).transform(find_percent)
                  employees
sector
Basic Materials    0.020003
Consumer Goods     0.060159
Financial          0.066749
Healthcare         0.099763
Industrial Goods   0.146098
Services           0.472141
Technology         0.135086

As with agg, we can pass multiple functions, including lambda functions.

companies.groupby('sector').agg({'employees':'sum'}).transform([lambda x: x / x.sum()])
                  employees
                   <lambda>
sector
Basic Materials    0.020003
Consumer Goods     0.060159
Financial          0.066749
Healthcare         0.099763
Industrial Goods   0.146098
Services           0.472141
Technology         0.135086

While we can’t pass multiple functions to apply as we can with transform, apply lets us access other columns, which transform does not. Again, it’s best to know what’s available to you and use the best tool for the job at that moment.

companies.groupby('sector').apply(lambda x: x['employees'] * 2)
sector
Basic Materials   5       97200
                  27     142000
Consumer Goods    0      264000
                  15     125200
                  20     146200
                  22     184000
Financial         1      118000
                  6       34000
                  8       73200
                  14     512210
                  23      60800
Healthcare        13     270200
                  18     138000
                  21     184800
                  24     600000
Industrial Goods  2      306054
                  3      208000
                  7      566000
                  12     480000
                  17     187032
Services          9      826000
                  16     420000
                  26    4400000
Technology        4      148400
                  10     701200
                  11     214800
                  19     262000
                  25     289000
Name: employees, dtype: int64

Filtration

Filter returns a subset of the original data. It works similarly to boolean indexing, except that instead of operating on individual rows, it operates on whole groups. The filter function must return True or False for each group as a whole. The services sector is the only sector with more than 1,000,000 employees.

companies.groupby('sector').filter(
    lambda x: x['employees'].sum() > 1000000
    )[['name', 'employees']]
              name  employees
id
9   Home Depot Inc     413000
15  McDonalds Corp     210000
27     Walmart Inc    2200000

Pivot Tables

We can produce the same data in a different format by using pandas.pivot_table.

companies.pivot_table(columns='sector', values='employees', aggfunc='sum')
sector     Basic Materials  Consumer Goods  ...
employees           119600          359700
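pivot_table can also group down the rows instead of across the columns; a sketch on made-up data (hypothetical rows, not the full dataset):

```python
import pandas as pd

# Hypothetical slice of the companies data
companies = pd.DataFrame({
    'sector': ['Financial', 'Financial', 'Technology'],
    'employees': [59000, 17000, 74200],
})

# index= groups down the rows, mirroring the groupby/sum result
table = companies.pivot_table(index='sector', values='employees', aggfunc='sum')
print(table)
```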

Joining Data

We can quickly join two dataframes through join, merge, and concat. merge is the underlying function for all join/merge behavior; join has slightly different defaults and is provided as a convenience method.

Method / Function        Description
pandas.DataFrame.join    Join dataframes on index or key column
pandas.DataFrame.merge   Merge dataframes using database-style joins
pandas.concat            Concatenate dataframes along a particular axis

Let’s see an example in action by bringing in the security data and joining on the id column.

import pandas as pd
import numpy as np
import urllib.request
url = ('https://raw.githubusercontent.com/leosmigel'
       '/analyzingalpha/master/data-manipulation-with-python/securities.csv')
with urllib.request.urlopen(url) as f:
  securities = pd.read_csv(f, index_col='id')
print(securities.head())
securities_companies = companies.join(securities)
securities_companies[['name', 'ticker', 'figi']].head(7)
   ticker currency          figi
id
1    AAPL      USD  BBG000B9Y5X2
2     AXP      USD  BBG000BCR153
3      BA      USD  BBG000BCSV38
4     CAT      USD  BBG000BF0LJ6
5    CSCO     USD  BBG000C3JBN9

                   name ticker          figi
id
1             Apple Inc   AAPL  BBG000B9Y5X2
2   American Express Co    AXP  BBG000BCR153
3             Boeing Co     BA  BBG000BCSV38
4       Caterpillar Inc    CAT  BBG000BF0LJ6
5     Cisco Systems Inc   CSCO  BBG000C3JBN9
6          Chevron Corp    CVX  BBG000K4NHJ5
25             Visa Inc      V  BBG000PSL0X0
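The same index-on-index join can also be written with merge or concat; a sketch on two made-up frames (hypothetical minimal versions of the two tables):

```python
import pandas as pd

# Hypothetical minimal versions of the two tables
companies = pd.DataFrame({'name': ['Apple Inc', 'Boeing Co']},
                         index=pd.Index([1, 3], name='id'))
securities = pd.DataFrame({'ticker': ['AAPL', 'BA']},
                          index=pd.Index([1, 3], name='id'))

# join: index-on-index, left join by default
joined = companies.join(securities)

# merge: database-style; here told explicitly to join on both indexes
merged = companies.merge(securities, left_index=True, right_index=True)

# concat: glue along an axis; axis=1 aligns on the index
concatenated = pd.concat([companies, securities], axis=1)
print(joined)
```

With identical indexes all three produce the same result; they differ in their defaults once keys don’t line up exactly.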
