/ DATA SCIENCE

Python for Finance: An Easy Introduction

If you’re interested in finance and don’t mind programming, learning this trio will open you up to a whole new world, including but not limited to, automated trading, backtest analysis, factor investing, portfolio optimization, and more.

I think of the Python, pandas and NumPy combination as a more robust and flexible Microsoft Excel.

Python is extremely popular as it allows developers to create programs quickly without worrying about a lot of the details. It’s a high-level language. The problem with high-level languages is that they are often slow. The reason why Python is an excellent language for finance is that you get both the benefit of development speed with fast execution as both pandas and NumPy are highly optimized.

What is NumPy?

Let’s pretend we are calculating the price to earnings ratio for 800 stocks. We have two columns stored in lists with the first being price and the second being earnings. If we were to loop through each row, our CPU would need 800 cycles to convert the calculation into bytecode and then calculate the answer.

Vectorization

NumPy takes advantage of vectorization processing multiple calculations at a single time using Single Instruction Multiple Data (SIMD). While the loop may have taken 800 cycles, the vectorized computation may only take 200 cycles or even less.

NumPy also makes our life as a developer easier as it simplifies many of the everyday tasks we need to perform on a dataset. Take the below as an abbreviated example of the 800 stock scenario above:

Example Using Python and a List of Lists

price_earnings = [
                ['Company A', 10.0, 2.0],
                ['Company B', 30.0, 4.0],
                ['Company C', 50.0, 6.0],
                ['Company D', 70.0, 8.0],
                ['Company E', 90.0, 10.0]
                ]

pe_ratios = []

for row in price_earnings:
    pe_ratio = row[1] / row[2]
    pe_ratios.append(pe_ratio)

Example Using NumPy

`pe_ratios = price_earnings[:,0] / price_earnings[:,1]`

While you may not understand the exact syntax at the moment, you get the point: NumPy makes both development and execution faster. You can use any of the standard Python numeric operators in your vector math. The one catch, which is evident when you think about it, is that both vectors must be of the same shape and type such as a 5-item integer or 10-element float.

Creating a NumPy Array

import numpy as np
price_earnings = [
                ['Company A', 10.0, 2.0],
                ['Company B', 30.0, 4.0],
                ['Company C', 50.0, 6.0],
                ['Company D', 70.0, 8.0],
                ['Company E', 90.0, 10.0]
                ]

## Creating array from prior values
company_table = np.array(price_earnings)

## Autogenrate a 1D array
array1d = np.arange(10)

## Autogenerate a 2D array into 2 rows by 5 columns
array2d = array1d.reshape(2,5)

NumPy also makes it easy for us to calculate various statistics.

numpy.mean returns the average along the specified axis. numpy.ndarray.max returns the maximum value along a given axis. numpy.ndarray.min returns the minimum value along a given axis.

We’ve been discussing 1d or 1-dimensional arrays. You can think of these as equivalent to Excel columns.

2d or 2-dimensional arrays are like an Excel sheet where values are organized by both rows and columns. max, min, and mean will calculate the respective calculation across all of the cells of the 2d array unless we specify otherwise.

If we want to get the max of a row, we would specify ndarray.max(axis=1). If we want to find the max of each column, we would specify ndarray.max(axis=0) switching out the ndarray with your ndimensional array. When using these type of operations, all of the fields must be of the same type.

Selecting Data

Selecting data in NumPy is similar to highlighting and copying cells in Excel. The colon represents the entire row or column. When retrieving data from a NumPy 2D array, it uses the row, column format where both row and column can take a value or a list:

numpy_2d_array[row,column]

Common selection patterns

    ## Retrieve a single cell
    cell = company_table[1,1]
    
    ## Retrieve the first row and all of the columns
    row = company_table[0,:]
    
    ## Retreive all rows for the third column
    column = company_table[:,2]
    
    ## Retreive two columns
    two_columns = company_table[:,[0,2]]
    
    ## Retreive subsection
    subsection = company_table[[0,2],[0,1]]

Selecting Data using Boolean Arrays

Sometimes called Boolean vectors or Boolean masks, Boolean arrays are an incredibly powerful tool. Booleans result into either True or False based upon a comparison operator such as greater than (>), equal to (==) or less than (<).

If we wanted to check for all stocks from the above example with a pe_ratio of less than 8, we could do the following:

low_pelow_pe_filter = pe_ratios < 8

This would return a 1D ndarray containing True and False values. I’ve added the other columns for readability, but again, the result would only be a single column:

CompanyPriceEarningsP/E RatioPE < 8
Company A1025.0True
Company B3047.5True
Company C5068.3False
Company D7088.75False
Company E90109.0False

Once we have a Boolean mask, we apply it to the original array to filter the values we want. This is known as Boolean indexing.

low_pe_stocks = company_table[low_pe_filter]

CompanyPriceEarningsP/E Ratio
Company A1025.0
Company B3047.5

Working with a 2-dimensional mask is no different. We just have to make sure that the mask dimension is the same as the array dimension being filtered. See the below NumPy 2D array with working and incorrect examples.

2d_array = np.array([
                    [1,2,3,4],
                    [5,6,7,8],
                    [9,10,11,12],
                    [13,14,15,16],
                    [17,18,19,20]
                    ])
1234
5678
9101112
13141516
17181920
array2d = np.array([
                    [1,2,3,4],
                    [5,6,7,8],
                    [9,10,11,12],
                    [13,14,15,16],
                    [17,18,19,20]
                    ])

rows_mask = [True, False, False, True, True]
columns_mask = [True, True, False, True]

# working examples
works1 = array2d[rows_mask]
works2 = array2d[:, columns_mask]
works3 = array2d[rows_mask, columns_mask]

# incorrect examples due to dimension mismatch
does_not_work1 = array2d[columns_mask]
does_not_work2 = array2d[row_mask, row_mask]

Modifying Data

Now that we know how to select data, modifying data becomes easy. To modify the above array, we can simply assign a value to our selections.

# Change single value
array2d[0,0] = 5
array2d[0,0] = array2d[0,0] * 2

# Change row
array2d[0,:] = 0

# Change column
array2d[:,0] = 0

# Change a block
array2d[1:,[2,3]] = 0

# Change any value greater than 5 to 5
array2d[array2d > 5] = 5

# Change any value in column 2 greater than 5 to 5
array2d[array2d[:,1] > 5, 1] = 5

What is Pandas?

Pandas is an extension of NumPy that builds on top of what we already know, supports vectorized operations, and makes data analysis and manipulation even easier.

Pandas has two core data types: Series and DataFrames.

  1. A pandas Series is a one-dimensional object. It’s pandas version of the NumPy one-dimensional array.
  2. A pandas DataFrame is a two-dimensional object. It’s similar to a NumPy 2D array.

Both pandas Series and pandas DataFrames are similar to their NumPy, equivalents but they both have additional features making manipulating data easier and more fun.

Both 2D NumPy arrays and Panda DataFrames have a row (index) and a column axis; however, the pandas versions can have named labels.

There are three main components of a DataFrame:

  1. Index axis (row labels)
  2. Columns axis (column labels)
  3. Data (values)

See the below image sourced from Selecting data from a Pandas DataFrame

Pandas Dataframe Example

Creating a Pandas DataFrame

Generally, you’ll create a pandas dataframe by reading in records from a csv, json, api, or database. We can also construct one from a dictionary as shown below.

dataframe.set_index allows you to change the index if desired.

data = {'company': ['Company A', 'Company B', 'Company C', 'Company D', 'Company E'],
        'price': [10, 30, 50, 70, 90],
        'earnings': [2.0, 4.0, 6.0, 8.0, 10.0]
        }

df = pd.DataFrame(data)

Creating a Pandas Series

Just like a dataframe, you’ll usually read in the data for a series from a csv, json, api or database. For example purposes, we’ll use the above dataframe to construct our series by returning a single column or row from our dataframe.

# Create a series  
column_series = df['company']
row_series = df.loc[0]

The Data Types

Data is generally categorized as continuous or categorical. Continuous data is numeric and represents a measurement such as height, price or volume. Categorical data represents discrete amounts such as eye color, industry sector or share type.

Pandas data can be many different dtypes with the most common shown below:

General TypePandas String NameNumPy / Pandas Object
Booleanboolnp.bool
Integerintnp.int
Floatfloatnp.float
Objectobjectnp.object
Datetimedatetime64np.datetime64, pd.Timestamp
Timedeltatimedelta64np.timedelta64, pd.Timedelta
Categoricalcategorypd.Categorical

Understanding the Data

dataframe.info returns common information such as all of the dtypes, which is shorthand for datatypes, the shape, and other pertinent information regarding the dataframe. Notice that the order of the columns are not preserved.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
company     5 non-null object
earnings    5 non-null float64
price       5 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 192.0+ bytes
print(df.head(2))
print(df.tail(2))

        company  earnings  price
0  Company A       2.0     10
1  Company B       4.0     30
        company  earnings  price
3  Company D       8.0     70
4  Company E      10.0     90

Selecting Data

We can select data in Pandas can be done with one of three indexers [], loc[] and iloc[]:

  • DataFrame[row_selection, column_selection] indexing operator best used for selecting columns.
  • DataFrame.loc[row_selection, column_selection] index rows and columns by labels inclusive of the last column
  • DataFrame.iloc[row_selection, column_selection] index rows and columns by integer exclusive of the last integer postion
  • Series.loc[index_selection] index rows by the index label.
  • Series.iloc[index_selection] index rows by the integer position.

Shortcuts for Selecting Data

The shortcut dataframe[‘label’] can be used when selecting a single row or list of rows. The shortcut dataframe[‘label1’:’label3’] can be used for slicing rows; however, I believe using dataframe.loc and dataframe.iloc when performing row selection as it makes what you’re doing more explicit.

## Retrieve a single cell
cell = df.loc[1,'price']

## Retrieve the first row and all of the columns
row = df.loc[0,:]

## Retreive all rows for the third column
column = df.loc[:,'earnings']
column_short = df['earnings']

## Retreive a range of rows
rows = df.loc[1:3]
rows_short = df[1:3]

## Retreive two columns
two_columns = df.loc[:,['company','earnings']]
two_columns_by_position = df.iloc[:,[0,2]]
two_columns_short = df[['company','earnings']]

## Retrieve a range of columns
range_of_columns = df.loc[:,'earnings':'price']
range_of_columns_by_position = df.iloc[:,1:3]

## Retreive subsection
subsection = df.loc[0:2,['company','price']]

## Selecting from Series
## Selecting a single cell
cell = column_series.loc[1]
cell_short = column_series[1]

## Selecting a range of cells from a column
cell_range = column_series.loc[0:3]
cell_range_short = column_series[0:3]

## Selecting a range of cells from a row
cell_range = row_series.loc['company':'price']
cell_range_short = row_series['company':'price']

Understanding Data

Pandas dataframes and series are both unique objects so they have different methods; however, many methods are implemented for both dataframes and series as shown below. Large numbers are shown in E-notation.

Series.describe and DataFrame.describe returns descriptive statistics. We can use method chaining to return a pandas series and then describe it as shown in the second example below. Describe works differently on numeric and non-numeric values.

print(df.describe())

            price   earnings
count   5.000000   5.000000
mean   50.000000   6.000000
std    31.622777   3.162278
min    10.000000   2.000000
25%    30.000000   4.000000
50%    50.000000   6.000000
75%    70.000000   8.000000
max    90.000000  10.000000

print(df['price'].describe())
count     5.000000
mean     50.000000
std      31.622777
min      10.000000
25%      30.000000
50%      50.000000
75%      70.000000
max      90.000000
Name: price, dtype: float64

print(df['company'].describe())
count             5
unique            5
top       Company B
freq              1
Name: company, dtype: object

For some dataframe methods we’ll need to specify if we want to calculate the result across the index or columns. For example, if we sum across the row values (index), we’ll get a result for each column. If we sum across all of the column values, we’ll get a result for each row.

To remember the axis numbers, I remember how to reference a dataframe[0,1] where 0 is the first position and represents rows and 1 is our second position and represents columns. And while 0 represents calculating down rows, it’s also often the default and doesn’t need to be explicitly stated. If you’re still having trouble remembering the order, you can think that a 0 will add another row and a 1 will add another column.

## Sum across columns
df[['earnings','price']].sum(axis=1)

## Sum across rows
## Sum across rows calculating a value for each column. This is the default
df[['earnings','price']].sum()
df[['earnings','price']].sum(axis=0)

Standard Operations Between Two Series

As with NumPy, we can utilize all of the Python numeric operations on two panda series returning the resultant series.

## Standard Operation examples
print(df[['earnings','price']] + 10)
    earnings  price
0      12.0     20
1      14.0     40
2      16.0     60
3      18.0     80
4      20.0    100
                        
df['pe_ratio'] = df['price'] / df['earnings']       
print(df)
        company  price  earnings  pe_ratio
0  Company A     10       2.0  5.000000
1  Company B     30       4.0  7.500000
2  Company C     50       6.0  8.333333
3  Company D     70       8.0  8.750000
4  Company E     90      10.0  9.000000

Assigning Values

Now that we understand selection, assignment becomes trivial. It’s just like NumPy only we can use the pandas selection shortcuts. This includes both assignment and assignment using Boolean indexing. See the examples below from the dataframe selection section. Note that I set the index to company instead of using a numeric index.

# Assignment                                      
df.set_index('company', inplace=True) 
#df.loc['Company B', 'pe_ratio'] = 0.0
                                                            
## Assignment to a single element
df.loc['Company B','price'] = 0.0
                                                    
## Assignment to an entire row
df.loc['Company B',:] = 0                                                            
                                                                                
## Assignment to an entire column             
df.loc[:,'earnings'] = 0.0    
df['earnings'] = 1.0
                        
## Assignment to a range of rows      
df.loc['Company A':'Company C'] = 0.0
df['Company A':'Company C'] = 1.0    
                                    
## Assignment to two columns                   
df.loc[:,['price','earnings']] = 2.0           
df[['price','earnings']] = 3.0                    
                                                    
## Assignment to a range of columns              
df.loc[:,'earnings':'pe_ratio'] = 4.0                   
                            
## Assignment to a subsection                
df.loc['Company A':'Company C',['price','earnings']] = 5.0

## Assignment using boolean indexing
### Using boolean value
price_bool = df['price'] < 4.0
df.loc[price_bool,'price'] = 6.0

### Skipping boolean value
df.loc[df['price'] >= 4, 'price'] = 7.0
leo

Leo Smigel

Based in Pittsburgh, Analyzing Alpha is a blog by Leo Smigel exploring what works in the markets.