How to Create an Equities Database

When researching what works in the markets, you’ll want to store your data in a database. There are many reasons why, but the two primary purposes are data persistence as flexibility. As your investing and trading strategies get more advanced, you’ll often need to combine multiple sources of data and manipulate that data into the format you need for your backtesting simulations.

In this post, I’m going to show you how to design your database schema, create your very own equity database in PostgreSQL, and create the tables via a GUI, Python or SQL depending upon your needs. And while this database is for equities, it could easily be adapted for different markets such as futures or forex, or different database technology such as MySQL or MSSQL.

Designing an Equities Database

Before designing your database, it’s important to understand the market you’re working with. For instance, backtesting an equities strategy against the S&P500 would require us to consider a number of data-related issues:

  1. What is our desired trading frequency? Do we want need daily or more frequent data?
  2. Is there surviorship-bias in our dataset? What about mergers, spinoffs, and changed symbols?
  3. Do we have total return? Are we adding back dividends, buybacks, and other corporate actions to our price data?

With these considerations in mind, we’re ready to design our database. When developing a complex database, you may want to use a database modeler. Database modelers, such as SqlDBM, allow you to map out the complex relationships between data tables visually. Below is the database schema, which is just another way of saying database blueprint, that I’ll be creating in this article. Once you get comfortable, with the below entity-relationship diagram (ERD), you will likely want to add more to the schema such as fundamental and economic data.

Equities Entity-Relational Map

With the design of the database complete, it’s time to get it installed and configured.

Installing PostgreSQL

PostgreSQL is my database of choice. It’s free and arguably the world’s most advanced open source database. I’m going to be going over how to install and configure PostgreSQL, but there are plenty of hosted versions available such as ElephantSQL if you prefer your data in the cloud. I use the local option for my research databases as it’s faster and gives me more flexibility.

Unlike the Zipline installation, there’s plenty of resources available that demonstrate how to install PostgreSQL. So instead of duplicating work, follow one of the two guides:

  1. How to Install and Use PostgreSQL on Ubuntu
  2. PostgreSQL Wiki Detailed Installation Guides

Once PostgreSQL is installed, you’ll want to install PgAdmin to help you manage it. PgAdmin is an open source administration platform for PostgreSQL. Below is a screenshot of PgAdmin. PgAdmin Screenshot

You can run PgAdmin in a standalone or server mode. If you’re only accessing PgAdmin on the computer it’s running on, use the standalone method. If you’re accessing PostgreSQL and PgAdmin from a different computer, you’ll need to run a few more commands to configure it as a server.

For those of you running macOS or Windows and want a standalone installation, you’ve got it easy. Download PgAdmin, and install it. That’s it. If you’re running Linux, which is most likely the case if you’re doing advanced research, there are plenty of PgAdmin setup articles available with my favorites listed below:

PgAdmin Installation (Standalone)

The Postgresql Wiki provides both directions and a script to add their Apt repository to install PgAdmin4 in Ubuntu.

PgAdmin Installation (Server)

Digital Ocean provides directions on how to install PgAdmin4 in server mode.

After you’ve installed PgAdmin, and if you’re going to configure your tables graphically, you’ll want to show system types so you can use enumerated fields. You can do this with the following click path in PgAdmin:

File → Preferences → Browser → Display → Show system objects = True

Creating the Equities Database Graphically

With a blueprint of the tables, relationships, and columns that our database will hold in hand, we need to feed the schema into PostgreSQL somehow. We can do this one of two ways:

  1. Using a GUI
  2. Programmatically

Creating the Database in PgAdmin

If you’re new to databases and PostgreSQL, this is likely the route you’ll want to take. You can log in to PgAdmin and create your database schema through the graphical user interface (GUI). You can click on the SQL tab to see the SQL statement that will create the database if you’re interested in the SQL details. And if you are, I highly suggest this great PostgreSQL tutorial.

PgAdmin Database Creation

Creating the Database Programmatically

If you’re not interested in using the PgAdmin GUI, you can insert the schema directly into PostgreSQL using SQL or Python. As with most things Python-related, there are multiple ways to do this. I recommend learning SQLAlchemy as it’s in object-relational mapper (ORM) that sits on top of database adapters such as psycopg2 and abstracts much of the database specifics out of the equation. In short, once you learn SQLAlchemy, you can switch out the underlying database quite easily. The only catch is that if you want to alter the database, you will need to import another library or execute the alter command using engine.execute.

Creating the Database Using SQLAlchemy

We first need to connect to the database. You can make this as simple or sophisticated as you would like. Datacamp’s Using Python with PostgreSQL is a great article that explains how to create a database session, and auth0 provides a great tutorial on SQLAlchemy. You’ll want to make sure you’ve selected the appropriate Python environment using conda or pip. If you’re not sure how to do this, reference my post on How to Install Zipline on Ubuntu Linux.

Create setup_psql_environment.yaml to store your credentials. YAML does a good job storing configuration information. You’ll also likely want to add this file to your .gitignore so you don’t accidentally make your credentials public.

PGHOST: localhost
PGDATABASE: your_database_here
PGUSER: your_user_here
PGPASSWORD: your_password_here
PGPORT: 5432

The file setup_psql_enironment.py imports all of the modules we need to connect to Postgres.

from sqlalchemy import create_engine
import yaml
import logging
log = logging.getLogger(__name__)

def get_database():
    try:
        engine = get_connection_from_profile()
        log.info("Connected to PostgreSQL database!")
    except IOError:
        log.exception("Failed to get database connection!")
        return None, 'fail'
    return engine

def get_connection_from_profile(config_file_name="setup_psql_environment.yaml"):
    """
    Sets up database connection from config file.
    Input:
    config_file_name: File containing PGHOST, PGUSER,
                        PGPASSWORD, PGDATABASE, PGPORT, which are the
                        credentials for the PostgreSQL database
    """
    with open(config_file_name, 'r') as f:
        vals = yaml.safe_load(f)
    if not ('PGHOST' in vals.keys() and
            'PGUSER' in vals.keys() and
            'PGPASSWORD' in vals.keys() and
            'PGDATABASE' in vals.keys() and
            'PGPORT' in vals.keys()):
        raise Exception('Bad config file: ' + config_file_name)
    return get_engine(vals['PGDATABASE'], vals['PGUSER'],
                        vals['PGHOST'], vals['PGPORT'],
                        vals['PGPASSWORD'])

def get_engine(db, user, host, port, passwd):
    """
    Get SQLalchemy engine using credentials.
    Input:
    db: database name
    user: Username
    host: Hostname of the database server
    port: Port number
    passwd: Password for the database
    """
    url = 'postgresql://{user}:{passwd}@{host}:{port}/{db}'.format(
        user=user, passwd=passwd, host=host, port=port, db=db)
    engine = create_engine(url, pool_size = 50, echo=True)
    return engine

Create a models.py file that will define our schema and map our Python classes to PostgreSQL tables. You will want to adjust your model based on your own preferences such as using ENUM types for company.industry_category or adding a table for quarterly fundamentals. Remember, this is just to get you started!

from sqlalchemy import Column, ForeignKey, Boolean, String, \
                       Integer, BigInteger, Float, Date    
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship
from sqlalchemy import Enum, UniqueConstraint
import enum

Base = declarative_base()

class PriceFrequency(enum.Enum):
    daily = 'daily'
    weekly = 'weekly'
    monthly = 'monthly'
    quarterly = 'quarterly'
    yearly = 'yearly'

class Security(Base):
    __tablename__ = 'security'
    id = Column(Integer, primary_key=True, autoincrement=True)
    id_intrinio = Column('id_intrinio', String(10), unique=True, nullable=False)
    code = Column('code', String(3), nullable=False)
    currency = Column('currency', String(3), nullable=False)
    ticker = Column('ticker', String(12), nullable=False)
    name = Column('name', String(200), nullable=False)
    figi = Column('figi', String(12))
    composite_figi = Column('composite_figi', String(12))
    share_class_figi = Column('share_class_figi', String(12))
    exchange_id = Column(Integer, ForeignKey('exchange.id',
                                    onupdate="CASCADE",
                                    ondelete="SET NULL"))
    has_invalid_data = Column('has_invalid_data', Boolean)
    has_missing_company = Column('has_missing_company', Boolean)
    exchange = relationship('Exchange')
    company = relationship('Company')

class Exchange(Base):
    __tablename__ = 'exchange'
    id = Column(Integer, primary_key=True, autoincrement=True)
    mic = Column('mic', String(10), unique=True, nullable=False)
    acronym = Column('acronym', String(20))
    name = Column('name', String(200), nullable=False)
    security = relationship('Security')

class SecurityPrice(Base):
    __tablename__ = 'security_price'
    id = Column(Integer, primary_key=True)
    date = Column('date', Date, nullable=False)
    open = Column('open', Float)
    high = Column('high', Float)
    low = Column('low', Float)
    close = Column('close', Float)
    volume = Column('volume', BigInteger)
    adj_open = Column('adj_open', Float)
    adj_high = Column('adj_high', Float)
    adj_low = Column('adj_low', Float)
    adj_close = Column('adj_close', Float)
    adj_volume = Column('adj_volume', BigInteger)
    intraperiod = Column('intraperiod', Boolean, nullable=False)
    frequency = Column('frequency', Enum(PriceFrequency),nullable=False)
    security_id = Column(Integer, ForeignKey('security.id',
                                    onupdate="CASCADE",
                                    ondelete="CASCADE"),
                                    nullable=False)
    UniqueConstraint('date', 'security_id')
    security = relationship('Security')

class StockAdjustment(Base):
    __tablename__ = 'stock_adjustment'
    id = Column(Integer, primary_key=True)
    date = Column('date', Date, nullable=False)
    factor = Column('factor', Float, nullable=False)
    dividend = Column('dividend', Float)
    split_ratio = Column('split_ratio', Float)
    security_id = Column(Integer, ForeignKey('security.id',
                                    onupdate="CASCADE",
                                    ondelete="CASCADE"),
                                    nullable=False)
    security = relationship('Security')

class Company(Base):
    __tablename__ = 'company'
    id = Column(Integer, primary_key=True)
    name = Column('name', String(100), nullable=False)
    cik = Column('cik', String(10))
    description = Column('description', String(2000))
    company_url = Column('company_url', String(100))
    sic = Column('sic', String(4))
    employees = Column('employees', Integer)
    sector = Column('sector', String(200))
    industry_category = Column('industry_category', String(200))
    industry_group = Column('industry_group', String(200))
    security_id = Column(Integer, ForeignKey('security.id',
                                    onupdate="CASCADE",
                                    ondelete="CASCADE"),
                                    nullable=False)
    security = relationship('Security')

With the database tables defined, we can now tell SQLAlchemy to create our database. You can run create_psql_database.py or run the code from the Python REPL, which is just a fancy way of saying the Python command line.

from sqlalchemy import MetaData
from sqlalchemy.orm import sessionmaker
from setup_psql_environment import get_database
from sqlalchemy.ext.declarative import declarative_base

import models

# Setup environment and create a session
db = get_database()
Session = sessionmaker(bind=db)
meta = MetaData(bind=db)
session = Session()

# Create database from SQLAlchemy models
models.Base.metadata.create_all(db)

Creating the Database Using SQL

If you’re not interested in using SQLAlchemy, you can create the tables using SQL and change out the names where applicable.

-- Table: public.exchange
-- DROP TABLE public.exchange;

CREATE TABLE public.exchange
(
    id integer NOT NULL DEFAULT nextval('exchange_id_seq'::regclass),
    mic character varying(10) COLLATE pg_catalog."default" NOT NULL,
    acronym character varying(20) COLLATE pg_catalog."default",
    name character varying(200) COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT exchange_pkey PRIMARY KEY (id),
    CONSTRAINT exchange_mic_key UNIQUE (mic)

)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;

ALTER TABLE public.exchange
    OWNER to username;


-- Table: public.security
-- DROP TABLE public.security;

CREATE TABLE public.security
(
    id integer NOT NULL DEFAULT nextval('security_id_seq'::regclass),
    id_intrinio character varying(10) COLLATE pg_catalog."default" NOT NULL,
    code character varying(3) COLLATE pg_catalog."default" NOT NULL,
    currency character varying(3) COLLATE pg_catalog."default" NOT NULL,
    ticker character varying(12) COLLATE pg_catalog."default" NOT NULL,
    name character varying(200) COLLATE pg_catalog."default" NOT NULL,
    figi character varying(12) COLLATE pg_catalog."default",
    composite_figi character varying(12) COLLATE pg_catalog."default",
    share_class_figi character varying(12) COLLATE pg_catalog."default",
    exchange_id integer,
    has_invalid_data boolean,
    has_missing_company boolean,
    CONSTRAINT security_pkey PRIMARY KEY (id),
    CONSTRAINT security_id_intrinio_key UNIQUE (id_intrinio)
,
    CONSTRAINT security_exchange_id_fkey FOREIGN KEY (exchange_id)
        REFERENCES public.exchange (id) MATCH SIMPLE
        ON UPDATE CASCADE
        ON DELETE SET NULL
        NOT VALID
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;

ALTER TABLE public.security
    OWNER to username;


-- Table: public.security_price
-- DROP TABLE public.security_price;

CREATE TABLE public.security_price
(
    id integer NOT NULL DEFAULT nextval('security_price_id_seq'::regclass),
    date date NOT NULL,
    open double precision,
    high double precision,
    low double precision,
    close double precision,
    volume bigint,
    adj_open double precision,
    adj_high double precision,
    adj_low double precision,
    adj_close double precision,
    adj_volume bigint,
    intraperiod boolean NOT NULL,
    frequency pricefrequency NOT NULL,
    security_id integer NOT NULL,
    CONSTRAINT security_price_pkey PRIMARY KEY (id)
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;

ALTER TABLE public.security_price
    OWNER to username;


-- Table: public.stock_adjustment
-- DROP TABLE public.stock_adjustment;

CREATE TABLE public.stock_adjustment
(
    id integer NOT NULL DEFAULT nextval('stock_adjustment_id_seq'::regclass),
    date date NOT NULL,
    factor double precision NOT NULL,
    dividend double precision,
    split_ratio double precision,
    security_id integer NOT NULL,
    CONSTRAINT stock_adjustment_pkey PRIMARY KEY (id)
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;

ALTER TABLE public.stock_adjustment
    OWNER to username;


-- Table: public.company
-- DROP TABLE public.company;

CREATE TABLE public.company
(
    id integer NOT NULL DEFAULT nextval('company_id_seq'::regclass),
    name character varying(100) COLLATE pg_catalog."default" NOT NULL,
    cik character varying(10) COLLATE pg_catalog."default",
    description character varying(2000) COLLATE pg_catalog."default",
    company_url character varying(100) COLLATE pg_catalog."default",
    sic character varying(4) COLLATE pg_catalog."default",
    employees integer,
    sector character varying(200) COLLATE pg_catalog."default",
    industry_category character varying(200) COLLATE pg_catalog."default",
    industry_group character varying(200) COLLATE pg_catalog."default",
    security_id integer NOT NULL,
    CONSTRAINT company_pkey PRIMARY KEY (id)
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;

ALTER TABLE public.company
    OWNER to username;
leo

Leo Smigel

Based in Pittsburgh, Analyzing Alpha is a blog by Leo Smigel exploring what works in the markets.