This post will introduce a couple of interesting datasets I recently stumbled upon. They contain historical stock return and fundamental data going back to the 1980s. Below I will outline the process by which I have made this data available, and perform an initial exploratory analysis.
Background
If you read this post, you will know I am collecting accounting and fundamental data for US stocks via the SEC EDGAR database. Price and other reference-type data are also collected, and you can read about it here.
“datasets are often highly structured, containing clusters of non-independent observational units that are hierarchical in nature, and Linear Mixed Models allow us to explicitly model the non-independence in such data”
(Harrison et al., 2018) [1]
“They allow modeling of data measured on different levels at the same time - for instance students nested within classes and schools - thus taking complex dependency structures into account.”
(Bürkner, 2018) [2]
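To make the nesting idea concrete, here is a minimal sketch of a random-intercept model using statsmodels. The data frame and its columns (score, hours, school) are invented purely for illustration.

```python
# Minimal sketch: a random-intercept mixed model with statsmodels.
# The columns "score", "hours" and grouping "school" are made up.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "score": [70, 75, 68, 80, 82, 79, 60, 64, 66, 71, 73, 69],
    "hours": [2, 3, 2, 4, 5, 4, 1, 2, 2, 3, 3, 2],
    "school": list("AAABBBCCCDDD"),
})

# Students share a school-level intercept, so observations within a
# school are not independent; the random effect models that clustering.
model = smf.mixedlm("score ~ hours", data=df, groups=df["school"])
result = model.fit()
print(result.summary())
```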
I recently finished reading Statistical Rethinking by Richard McElreath. The reviews are true: it’s a great book. As a bonus, it comes with 20 hours of supporting lectures taking you through the content.
Statistical Rethinking is an introduction to statistical modelling using Bayesian methods. Bayesian methods seem pretty popular at the moment, so what’s the deal? To distill it into a couple of lines, Bayesian methods provide a full distribution for model parameters and allow prior knowledge to be incorporated.
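As a toy illustration of both points, here is a sketch of a conjugate beta-binomial update; the prior and the counts are invented.

```python
# Minimal sketch: a prior on a parameter plus data yields a full
# posterior distribution, not a point estimate. Numbers are made up.
from scipy import stats

# Prior belief about a stock's probability of an up-day: Beta(2, 2),
# weakly centred on 0.5.
prior_a, prior_b = 2, 2

# Observed data: 60 up-days in 100 trading days.
ups, days = 60, 100

# A beta prior with a binomial likelihood has a closed-form beta posterior.
posterior = stats.beta(prior_a + ups, prior_b + (days - ups))

low, high = posterior.interval(0.95)
print(f"posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: ({low:.3f}, {high:.3f})")
```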
This post is going to analyse the momentum effect in US stocks using both publicly available aggregate data and privately collected individual stock-level data.
The momentum effect is the tendency for stocks that have gone up (down) in the past to continue going up (down) in the immediate future. “Gone up or down in the past” is usually defined as the prior 12 months’ returns, measured on a relative basis against other stocks.
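To make that definition concrete, here is a minimal pandas sketch; the prices are simulated, and the tickers are placeholders.

```python
# Minimal sketch of the usual momentum signal, assuming a DataFrame
# "prices" of monthly closes indexed by date, one column per ticker.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-31", periods=24, freq="M")
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0.01, 0.05, (24, 4)), axis=0)),
    index=dates, columns=["AAA", "BBB", "CCC", "DDD"],
)

# Prior 12-month return. (Practitioners often skip the most recent
# month to avoid short-term reversal; that variant is omitted here.)
momentum = prices / prices.shift(12) - 1

# "Relative basis": rank each stock against its peers each month.
momentum_rank = momentum.rank(axis=1, pct=True)
print(momentum_rank.tail())
```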
That title deserves an explanation.
This note will look at the Theil-Sen estimator for robust regression. I’m going to use the UCI Machine Learning Repository’s abalone data set to compare this technique with Ordinary Least Squares.
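As a preview of the comparison, here is a minimal scikit-learn sketch. It assumes abalone.data has been downloaded from the UCI repository, with column names taken from its documentation, and regresses rings on a single feature for simplicity.

```python
# Minimal sketch: Theil-Sen vs OLS on one feature of the abalone data.
import pandas as pd
from sklearn.linear_model import LinearRegression, TheilSenRegressor

cols = ["sex", "length", "diameter", "height", "whole_weight",
        "shucked_weight", "viscera_weight", "shell_weight", "rings"]
abalone = pd.read_csv("abalone.data", names=cols)

X = abalone[["length"]].values
y = abalone["rings"].values

ols = LinearRegression().fit(X, y)
# Theil-Sen is based on medians of slopes over subsets of points,
# so a handful of outliers barely moves it.
theil_sen = TheilSenRegressor(random_state=0).fit(X, y)

print("OLS slope:      ", ols.coef_[0])
print("Theil-Sen slope:", theil_sen.coef_[0])
```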
This one is via a Colab notebook; all is explained here.
What’s a stock master? It’s a database that contains data on stocks. It is also the master, or authoritative (at least for me), source of that data. What kind of data exactly? Prices and fundamentals (and maybe economic time series).
This post is going to document the data sources and tools used in building this database. The repo for the project is here.
Motivation
Firstly, why do I need a database containing this type of information?
This is a quick post about intra-portfolio correlation.
Intra-portfolio correlation (“IPC”) is defined as the weighted average of all unique pairwise correlations within a portfolio. It has typically been used to measure a portfolio’s diversification. That’s not what I’m interested in, however. I’m looking at IPC as a potential technical trading indicator.
The idea is that an increase or decrease in the co-movement of a group of stocks (or the market as a whole, for that matter) may say something about their future returns.
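For concreteness, here is one way to compute IPC, assuming a returns DataFrame (one column per stock) and a weights vector; weighting each pair by the product of its two holdings’ weights is an assumption here, as conventions vary.

```python
# Minimal sketch of intra-portfolio correlation on simulated returns.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
returns = pd.DataFrame(rng.normal(0, 0.01, (250, 4)),
                       columns=["AAA", "BBB", "CCC", "DDD"])
weights = np.array([0.4, 0.3, 0.2, 0.1])

corr = returns.corr().values
n = len(weights)

# Take each unique pair (i < j), weight its correlation by the product
# of the two holdings' weights, then normalise by the total pair weight.
iu = np.triu_indices(n, k=1)
pair_weights = np.outer(weights, weights)[iu]
ipc = np.sum(pair_weights * corr[iu]) / np.sum(pair_weights)
print(f"intra-portfolio correlation: {ipc:.3f}")
```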
Work has initiated a Coursera-led training program, so it is goodbye to Dataquest for now.
The “Applied Data Science Capstone” in the IBM Specialization has participants using Foursquare business venue data, retrieved via an API, to solve a business or other problem.
I have decided to investigate the utilisation of loans under the Small Business Administration Paycheck Protection Program in the United States.
What follows is a brief outline of the work and findings.
We continue on our IFRS9 disclosures quest!
Part 2 had us doing some heavy data munging, followed by modelling to estimate an ECL (expected credit loss) balance.
In this post we will massage the dataset from part 2 and prepare the report we specified in the first post. This report sets out an opening-to-closing balance of the loan and expected credit loss balances, and also details transfers between risk stages.
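As a rough illustration of the stage-transfer piece of that report, here is a sketch on made-up loan-level data; the column names are hypothetical, and the real report is built from the part 2 output.

```python
# Minimal sketch: cross-tab of ECL by opening vs closing risk stage,
# showing transfers between stages over the period. Data is invented.
import pandas as pd

loans = pd.DataFrame({
    "loan_id": [1, 2, 3, 4, 5],
    "opening_stage": [1, 1, 2, 2, 3],
    "closing_stage": [1, 2, 2, 3, 3],
    "ecl_closing": [11.0, 35.0, 42.0, 85.0, 88.0],
})

transfers = loans.pivot_table(index="opening_stage",
                              columns="closing_stage",
                              values="ecl_closing",
                              aggfunc="sum", fill_value=0)
print(transfers)
```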
The next instalment of the Dataquest guided projects. This one covers the datetime module and, like the last, heavily utilises loops and dictionaries.
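In that spirit, here is a tiny sketch of the pattern: strptime in a loop feeding a frequency dictionary. The timestamps are invented.

```python
# Minimal sketch: parse timestamps with datetime.strptime in a loop
# and tally counts per hour in a dictionary.
from datetime import datetime

rows = ["8/16/2016 9:55", "1/20/2016 19:47", "6/23/2016 22:57",
        "9/30/2015 4:00", "8/16/2016 9:12"]

counts_by_hour = {}
for raw in rows:
    hour = datetime.strptime(raw, "%m/%d/%Y %H:%M").hour
    counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1

print(counts_by_hour)
```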
Next up it’s NumPy and pandas!