Time Series Cross-validation in Python

The content below is a Jupyter Notebook converted to markdown following the instructions of this blog. This allows for the publishing of Jupyter Notebooks using blogdown. Everything appears to work except for the rendering of tables; that remains something to resolve. Table formatting notwithstanding, it doesn't look quite as nice as the standard Jupyter Notebook format, so I've linked the original here.

Introduction

This notebook will implement a rolling out-of-sample forecast on time series data. The aim of this notebook is to set up the initial infrastructure while allowing for subsequent iteration: getting the piping to work, so to speak. Subsequent iterations will add an inner loop allowing for nested cross-validation, which will be used to tune hyper-parameters in a time series context.

We will forecast the S&P 500 index using an elastic net logistic regression. This will be implemented using scikit-learn's SGDClassifier, specifying the loss function "log" and the penalty "elasticnet". This specification results in a regularised logistic regression whereby parameters may be excluded from the model via the shrinkage operator.

A key output of the notebook is a time series of the coefficients of the parameters included in the model. This will allow for an assessment of the stability of the model over time.

# Import required libraries
import numpy as np
import pandas as pd
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
import datetime
import math
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
plt.rcParams["figure.figsize"] = (12, 8)
import sklearn.metrics as metrics
from sklearn import linear_model
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

The data comes from this R script. It contains an attribute y1, which is a binary indicator that looks forward over the next six months and returns 1 if the maximum drawdown is more than 20% or the return is less than 2.5%.
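
A hypothetical sketch of that labelling logic is below. The authoritative definition lives in the linked R script; the function name label_y1 is illustrative only and relies on the pandas import above.

# Hypothetical sketch of the y1 labelling logic; the actual label is
# built in the linked R script, this is illustrative only
def label_y1(close, horizon = 6):
    y = pd.Series(index = close.index, dtype = float)
    for i in range(len(close) - horizon):
        window = close.iloc[i:i + horizon + 1]
        fwd_rtn = window.iloc[-1] / window.iloc[0] - 1           # return over the window
        max_dd = (window / window.cummax() - 1).min()            # worst peak-to-trough fall
        y.iloc[i] = float(max_dd < -0.20 or fwd_rtn < 0.025)
    return y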

Below we convert dates stored as strings to the datetime format and then set this date as the index.

# Read CSV file from GitHub
url = 'https://raw.githubusercontent.com/Brent-Morrison/Misc_scripts/master/econ_fin_data.csv'
df_raw = pd.read_csv(url)

# Convert date to datetime and create month end date
df_raw['date'] = pd.to_datetime(df_raw['date'], format='%Y-%m-%d')
df_raw['me_date'] = pd.Index(df_raw['date']).to_period('M').to_timestamp('M')

# Set date as index in new df
df = df_raw.set_index('me_date')

# Inspect csv data - shape
df.shape
(899, 47)

Some simple feature engineering for attributes that may forecast stock returns. We are not too concerned about the data and model form at this stage; remember, we are just getting the piping to work.

# Creation of attributes: credit spread, yield curve spread and loan growth
df = df.assign(
    CRED_SPRD = df.BAA - df.AAA                           # corporate credit spread
    ,YLD_SPRD = df.GS10 - df.FEDFUNDS                     # 10yr Treasury less Fed funds
    ,LOAN_GROWTH = np.log(df.LOANS / df.LOANS.shift(6))   # 6-month log growth in loans
    )

# Inspect csv data - first and last records
df.iloc[np.r_[0:4, len(df) - 4:len(df)],]
(Output: the first and last four rows of the dataframe, 1945-01-31 through 2019-11-30, across all 50 columns: the raw FRED series (AAA, BAA, GS10, FEDFUNDS, LOANS, etc.), S&P 500 price, dividend and earnings data, the y1 label, and the derived attributes CRED_SPRD, YLD_SPRD and LOAN_GROWTH. The table is too wide to reproduce legibly here.)

On to the modelling. Select the attributes to be used as the predictors and label in our model.

# List variables  
# Dependent variable
dep_var = ['y1']
# Predictors/ independent variables
ind_vars = ['CRED_SPRD', 'YLD_SPRD', 'LOAN_GROWTH', 'rtn_6m']
# Other variables to be used in the plot
oth_vars = ['close']
vars = dep_var + ind_vars + oth_vars
df = df[vars]

Our new data frame.

# Drop rows where any of the selected variables are NaN
df = df.dropna(subset = vars)

# Inspect csv data - first and last records
df.iloc[np.r_[0:4, len(df) - 4:len(df)],]
| me_date    | y1  | CRED_SPRD | YLD_SPRD | LOAN_GROWTH | rtn_6m    | close       |
|------------|-----|-----------|----------|-------------|-----------|-------------|
| 1954-07-31 | 0.0 | 0.61      | 1.50     | 0.006512    | 0.168940  | 30.879999   |
| 1954-08-31 | 0.0 | 0.62      | 1.14     | 0.016628    | 0.131665  | 29.830000   |
| 1954-09-30 | 0.0 | 0.58      | 1.32     | 0.019759    | 0.181765  | 32.310001   |
| 1954-10-31 | 0.0 | 0.59      | 1.58     | 0.026823    | 0.114238  | 31.680000   |
| 2019-03-31 | 0.0 | 1.07      | 0.16     | 0.029602    | -0.027690 | 2834.399902 |
| 2019-04-30 | 0.0 | 1.01      | 0.11     | 0.028409    | 0.082800  | 2945.830078 |
| 2019-05-31 | 0.0 | 0.96      | 0.01     | 0.028386    | -0.002943 | 2752.060059 |
| 2019-06-30 | 0.0 | 1.04      | -0.31    | 0.025044    | 0.159981  | 2941.760010 |

Let's get modelling. The variables defined in the next cell set the training and testing ranges.

# Set training and testing ranges
train_length = 300
test_length = 4
loops = math.floor((len(df) - train_length) / test_length)
start = len(df) - (loops * test_length + train_length)
stop = math.floor((len(df) - train_length) / test_length) * test_length

# Empty objects
y_pred_prob = None
model_coef = None
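
To make the mechanics concrete: after the dropna above the dataframe runs from July 1954 to June 2019, i.e. 780 monthly observations. With train_length = 300 and test_length = 4 this gives loops = floor((780 - 300) / 4) = 120, start = 780 - (120 * 4 + 300) = 0 and stop = 480. The loop below therefore runs 120 times, each time training on a 300-month window and predicting the following 4 months, producing 480 out-of-sample predictions in total.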

The training and testing loop. This piece of code selects a training subset, fits a model and then applies that model to new, unseen data. Predicted probabilities and parameter coefficients are returned to a dataframe.

# Training loop
for i in range(start, stop, test_length):

    # Slice training and testing windows; column 0 is the label y1, the
    # last column ('close') is excluded from the predictors
    y_train_raw = np.array(df.iloc[i:i + train_length, 0])
    x_train_raw = np.array(df.iloc[i:i + train_length, 1:len(vars) - 1])
    y_test_raw = np.array(df.iloc[i + train_length:i + train_length + test_length, 0])
    x_test_raw = np.array(df.iloc[i + train_length:i + train_length + test_length, 1:len(vars) - 1])

    # Scale for model ingestion
    sc = StandardScaler()
    x_train = sc.fit_transform(x_train_raw)

    # Apply mean and standard deviation from transform applied to training data to test data
    x_test = sc.transform(x_test_raw)

    # Specify model. Note that scikit-learn renamed this loss to
    # 'log_loss' in version 1.1; 'log' applies to earlier versions
    sgd = linear_model.SGDClassifier(
        loss = 'log'
        ,penalty = 'elasticnet'
        ,max_iter = 2500
        ,n_iter_no_change = 500
        ,tol = 1e-3)

    # Train model
    sgd.fit(x_train, y_train_raw)

    # Predict on test data and write to array
    y_pred = sgd.predict_proba(x_test)
    if y_pred_prob is None:
        y_pred_prob = y_pred
    else:
        y_pred_prob = np.concatenate((y_pred_prob, y_pred))

    # Capture intercept and coefficients, repeating the row once per test
    # observation so it aligns with the predictions
    coef = np.repeat(
        np.append(
            sgd.intercept_[0], 
            sgd.coef_).reshape((1, -1)),
        test_length, 
        axis = 0
        )
    
    if model_coef is None:
        model_coef = coef
    else:
        model_coef = np.concatenate((model_coef, coef))

# Create predictions dataframe with date index
df_preds = pd.DataFrame(
    data = y_pred_prob
    ,index = pd.date_range(
        start = df.index[train_length + start]
        ,periods = stop
        ,freq = 'M')
    ,columns = [0, 1]
    )
# Threshold for hard prediction, to populate the confusion matrix
df_preds = df_preds.assign(pred = np.where(df_preds[1] > 0.25, 1, 0))

# Join predictions to df & rename prediction to pred_prob
df_preds = df_preds.join(df, how = 'inner')
df_preds = df_preds.rename(columns = {1:'pred_prob'}).drop(columns = 0)
df_preds.y1 = df_preds.y1.astype(int)
df_preds.close = np.log(df_preds['close'])

# Create coefficients dataframe with date index
ind_vars.insert(0, 'Int')
ind_vars = [x + '_coef' for x in ind_vars]
df_model_coef = pd.DataFrame(
    data = model_coef
    ,index = pd.date_range(
        start = df.index[train_length + start]
        ,periods = stop
        ,freq = 'M')
    ,columns = ind_vars
    )

# Join predictions & coefficients dataframes
df_preds_coefs = df_preds.join(df_model_coef, how = 'inner')
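
The confusion_matrix function imported at the top, and flagged in the threshold comment above, can now be applied to the hard predictions. An illustrative call, not output from the original notebook:

# Populate the confusion matrix from the hard predictions
# (illustrative call, not run in the original notebook)
print(confusion_matrix(df_preds_coefs['y1'], df_preds_coefs['pred']))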

Let's inspect the dataframe containing the prediction probability, predictors, intercept and coefficients.

# Inspect dataframe of prediction probability
df_preds_coefs.iloc[np.r_[0:4, len(df_preds_coefs) - 4:len(df_preds_coefs)],]
| me_date    | pred_prob | pred | y1 | CRED_SPRD | YLD_SPRD | LOAN_GROWTH | rtn_6m    | close    | Int_coef  | CRED_SPRD_coef | YLD_SPRD_coef | LOAN_GROWTH_coef | rtn_6m_coef |
|------------|-----------|------|----|-----------|----------|-------------|-----------|----------|-----------|----------------|---------------|------------------|-------------|
| 1979-07-31 | 0.564997  | 1    | 0  | 1.09      | -1.52    | 0.084374    | 0.038092  | 4.642562 | -0.850876 | -0.059612      | -0.678869     | 0.095629         | 0.015339    |
| 1979-08-31 | 0.611875  | 1    | 0  | 1.12      | -1.91    | 0.084071    | 0.127019  | 4.694279 | -0.850876 | -0.059612      | -0.678869     | 0.095629         | 0.015339    |
| 1979-09-30 | 0.633431  | 1    | 1  | 1.10      | -2.10    | 0.085207    | 0.073334  | 4.694279 | -0.850876 | -0.059612      | -0.678869     | 0.095629         | 0.015339    |
| 1979-10-31 | 0.754928  | 1    | 0  | 1.27      | -3.47    | 0.075088    | 0.000589  | 4.623207 | -0.850876 | -0.059612      | -0.678869     | 0.095629         | 0.015339    |
| 2019-03-31 | 0.436434  | 1    | 0  | 1.07      | 0.16     | 0.029602    | -0.027690 | 7.949586 | -1.462967 | 0.089407       | -0.667003     | -0.593875        | -0.700222   |
| 2019-04-30 | 0.289830  | 1    | 0  | 1.01      | 0.11     | 0.028409    | 0.082800  | 7.988146 | -1.462967 | 0.089407       | -0.667003     | -0.593875        | -0.700222   |
| 2019-05-31 | 0.421149  | 1    | 0  | 0.96      | 0.01     | 0.028386    | -0.002943 | 7.920105 | -1.462967 | 0.089407       | -0.667003     | -0.593875        | -0.700222   |
| 2019-06-30 | 0.259595  | 1    | 0  | 1.04      | -0.31    | 0.025044    | 0.159981  | 7.986763 | -1.462967 | 0.089407       | -0.667003     | -0.593875        | -0.700222   |

Now to plot the prediction probability, intercept and coefficients.

# Set plot style
sns.set_style('white', {"xtick.major.size": 2, "ytick.major.size": 2})
flatui = ["#c5b4cc", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71","#f4cae4"]
sns.set_palette(sns.color_palette(flatui,7))
# Plot timeseries of SP500, prediction %, and y label shading
fig1, (ax1, ax2) = plt.subplots(nrows = 2)
fig1.suptitle('Prediction probability and S&P 500', size = 16).set_y(1.05)
fig1.subplots_adjust(top = 0.85)

ax1.plot(df_preds_coefs.index, df_preds_coefs['pred_prob'], 'k-', 
    color = sns.xkcd_rgb['grey'])
ax1.fill_between(
    df_preds_coefs.index, 
    df_preds_coefs['pred_prob'], 
    y2 = 0, 
    where = df_preds_coefs['y1']
    )
ax1.set_ylabel('Probability')

ax2.plot(df_preds_coefs.index, df_preds_coefs['close'], 'k-',
    color = sns.xkcd_rgb['grey'])
ax2.fill_between(
    df_preds_coefs.index, 
    df_preds_coefs['close'], 
    y2 = 0, 
    where = df_preds_coefs['y1']
    )
ax2.set_ylim(bottom = 4.5)
ax2.set_ylabel('S&P500 (log scale)')
fig1.tight_layout()

Our model is not very good at forecasting drawdowns in the S&P 500. This is not surprising given how little time has been put into it.
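
One way to quantify this, not run in the original notebook, is to score the predicted probabilities with the metrics module imported at the top:

# Score predicted probabilities against the realised labels
# (illustrative; these calls were not run in the original notebook)
print('ROC AUC :', metrics.roc_auc_score(df_preds_coefs['y1'], df_preds_coefs['pred_prob']))
print('Log loss:', metrics.log_loss(df_preds_coefs['y1'], df_preds_coefs['pred_prob']))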

Let's now look at the regression parameters.

# Plot parameters
fig2, (ax1, ax2, ax3, ax4) = plt.subplots(nrows = 4)
fig2.suptitle('Rolling regression parameters', size = 16).set_y(1.05)
fig2.subplots_adjust(top = 0.85)

ax1.plot(df_preds_coefs.index, df_preds_coefs['Int_coef'], 'k-', 
    color = sns.xkcd_rgb['grey'])
ax1.fill_between(
    df_preds_coefs.index, 
    df_preds_coefs['Int_coef'], 
    y2 = df_preds_coefs['Int_coef'].min(), 
    where = df_preds_coefs['y1']
    )
ax1.set_ylabel('Intercept')

ax2.plot(df_preds_coefs.index, df_preds_coefs['CRED_SPRD_coef'], 'k-', 
    color = sns.xkcd_rgb['grey'])
ax2.fill_between(
    df_preds_coefs.index, 
    df_preds_coefs['CRED_SPRD_coef'], 
    y2 = df_preds_coefs['CRED_SPRD_coef'].min(),
    where = df_preds_coefs['y1']
    )
ax2.set_ylabel('Credit spread')

ax3.plot(df_preds_coefs.index, df_preds_coefs['YLD_SPRD_coef'], 'k-',
    color = sns.xkcd_rgb['grey'])
ax3.fill_between(
    df_preds_coefs.index, 
    df_preds_coefs['YLD_SPRD_coef'], 
    y2 = df_preds_coefs['YLD_SPRD_coef'].min(),
    where = df_preds_coefs['y1']
    )
ax3.set_ylabel('Yield spread')

ax4.plot(df_preds_coefs.index, df_preds_coefs['LOAN_GROWTH_coef'], 'k-',
    color = sns.xkcd_rgb['grey'])
ax4.fill_between(
    df_preds_coefs.index, 
    df_preds_coefs['LOAN_GROWTH_coef'], 
    y2 = df_preds_coefs['LOAN_GROWTH_coef'].min(),
    where = df_preds_coefs['y1']
    )
ax4.set_ylabel('Loan growth')

fig2.tight_layout()

Conclusion

The code embedded in this notebook is working as expected. We have produced a rolling out-of-sample regression, capturing the prediction probabilities and regression parameters.

The next step for this piece of analysis is to embed an inner loop for nested cross-validation. This will enable the tuning of model hyper-parameters in a time series context.
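
As a preview, a minimal sketch of such an inner loop, assuming scikit-learn's TimeSeriesSplit and GridSearchCV with an illustrative parameter grid, might look as follows:

# Minimal sketch of a possible inner loop; the parameter grid is
# illustrative only
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

inner_cv = TimeSeriesSplit(n_splits = 5)    # time-ordered inner folds
grid = GridSearchCV(
    estimator = linear_model.SGDClassifier(loss = 'log', penalty = 'elasticnet'),
    param_grid = {'alpha': [1e-4, 1e-3, 1e-2], 'l1_ratio': [0.15, 0.5, 0.85]},
    scoring = 'neg_log_loss',
    cv = inner_cv)

# Inside the outer training loop this would replace sgd.fit(x_train, y_train_raw):
# grid.fit(x_train, y_train_raw)
# y_pred = grid.predict_proba(x_test)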

References

https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9
https://github.com/sam31415/timeseriescv/blob/master/timeseriescv/cross_validation.py
https://hub.packtpub.com/cross-validation-strategies-for-time-series-forecasting-tutorial/
https://www.mikulskibartosz.name/nested-cross-validation-in-time-series-forecasting-using-scikit-learn-and-statsmodels/
https://arxiv.org/pdf/1905.11744.pdf

 