Nested time series data frames · Brent Morrison's blog

22 Dec 2019, 00:00

Data leakage can be tricky when analysing time series. Ensuring you are not using the future to predict the future is very important if you want to use the past to predict the future! After all, you don’t get to use future data when you are in the present!

These earthquake researchers have been accused of mixing things up.

Let’s say we want to apply a machine learning algorithm that requires hyper-parameter tuning, and hence a validation data set, to a time series. Normally this is implemented using a walk forward training, validation, testing approach. There are a couple of ways to do this.

You can code up everything in loops. A primitive stab at this has been performed here. Of course to perform cross-validation an inner loop is required.
Sticking to R and avoiding loops, any of the following can be used, zoo::rollapply(), tibbletime::rollify() or tsibble::slide(). tibbletime has been retired in lieu of the slide family of functions from tsibble. Subsequently, this functionality has been moved to the slide package.

This post will take a different approach. We will create a nested data frame containing multiple slices of the same time series. The idea behind this is to use this nested data frame to apply functions and train models using the functionality from the purrr package. This will help negate the risk of data leakage. Each time slice of our data will be neatly compartmentalised in each of the underlying nested data frames.

Lets get started building this nested data frame.

The required packages.

library("tidyverse")
library("lubridate")

We set parameters for the length of each time slice, and grab some data.

# Parameters
train_length <- 300
test_length <- 6

# Data
econ_fin_data <- readRDS("C:/Users/brent/Documents/R/Misc_scripts/econ_fin_data.Rda")
data_length <- nrow(econ_fin_data)
loops <- floor((nrow(econ_fin_data) - train_length) / test_length)
start <- nrow(econ_fin_data) - ((loops * test_length) + train_length) + 1

Next, we loop over the data set to create each slice of data, labeling these with the date range to which they relate. Each newly labelled slice is appending to a new data frame. This new data frame is then grouped by the labels just created and nested.

# Empty tibble
nested_df = tibble()

# Loop for time slices
for (i in seq(start, by = test_length, length.out = loops)) {
  df <- econ_fin_data
  df <- slice(df, i:(i + train_length + test_length - 1)) %>% 
    mutate(nest_label = paste(format(strftime(min(date), "%Y-%m")), 
                              format(strftime(max(date), "%Y-%m")),
                              sep = ":"))
  nested_df <- bind_rows(nested_df,df) 
}
nested_df <- nested_df %>% 
  group_by(nest_label) %>% 
  nest() %>% 
  ungroup()

Here is the resulting data frame.

tail(nested_df)

## # A tibble: 6 x 2
##   nest_label                 data
##   <chr>           <list<df[,47]>>
## 1 1991-12:2017-05      [306 x 47]
## 2 1992-06:2017-11      [306 x 47]
## 3 1992-12:2018-05      [306 x 47]
## 4 1993-06:2018-11      [306 x 47]
## 5 1993-12:2019-05      [306 x 47]
## 6 1994-06:2019-11      [306 x 47]

Lets unnest.

tail(unnest(nested_df[99,2], cols = c(data)))

## # A tibble: 6 x 47
##   date         AAA ACDGNO AHETPI AWHMAN   BAA BOGMBASE CFNAIDIFF CPIAUCSL
##   <date>     <dbl>  <dbl>  <dbl>  <dbl> <dbl>    <dbl>     <dbl>    <dbl>
## 1 2019-06-01  3.42  44973   23.4   41.7  4.46  3274825     -0.15     255.
## 2 2019-07-01  3.29  45146   23.5   41.5  4.28  3260316     -0.21     256.
## 3 2019-08-01  2.98  44576   23.6   41.5  3.87  3271378     -0.09     256.
## 4 2019-09-01  3.03  43233   23.7   41.5  3.91  3202682     -0.24     256.
## 5 2019-10-01  3.01  42476   23.8   41.4  3.92  3252830     -0.22     257.
## 6 2019-11-01  3.06  42476   23.8   41.4  3.94  3315603     -0.22      NA 
## # ... with 38 more variables: CPILFESL <dbl>, FEDFUNDS <dbl>, GS10 <dbl>,
## #   GS2 <dbl>, INDPRO <dbl>, ISRATIO <dbl>, KCFSI <dbl>, LOANS <dbl>,
## #   M2SL <dbl>, NEWORDER <dbl>, PERMIT <dbl>, TB3MS <dbl>, TWEXMMTH <dbl>,
## #   UNRATE <dbl>, IC4WSA <int>, NEWORD <dbl>, HMI <dbl>, P <dbl>, D <dbl>,
## #   E <dbl>, CPI <dbl>, Fraction <dbl>, Rate.GS10 <dbl>, Price <dbl>,
## #   Dividend <dbl>, Earnings <dbl>, CAPE <dbl>, low <dbl>, close <dbl>,
## #   volume <dbl>, rtn_m <dbl>, fwd_rtn_m <dbl>, rtn_6m <dbl>,
## #   min_6m <dbl>, dd_6m <dbl>, flag <dbl>, y1 <dbl>, diff_flag <dbl>

This should be identical to the original data.

filter(econ_fin_data, between(date, as.Date("2019-06-01"), as.Date("2019-11-01")))

## # A tibble: 6 x 47
##   date         AAA ACDGNO AHETPI AWHMAN   BAA BOGMBASE CFNAIDIFF CPIAUCSL
##   <date>     <dbl>  <dbl>  <dbl>  <dbl> <dbl>    <dbl>     <dbl>    <dbl>
## 1 2019-06-01  3.42  44973   23.4   41.7  4.46  3274825     -0.15     255.
## 2 2019-07-01  3.29  45146   23.5   41.5  4.28  3260316     -0.21     256.
## 3 2019-08-01  2.98  44576   23.6   41.5  3.87  3271378     -0.09     256.
## 4 2019-09-01  3.03  43233   23.7   41.5  3.91  3202682     -0.24     256.
## 5 2019-10-01  3.01  42476   23.8   41.4  3.92  3252830     -0.22     257.
## 6 2019-11-01  3.06  42476   23.8   41.4  3.94  3315603     -0.22      NA 
## # ... with 38 more variables: CPILFESL <dbl>, FEDFUNDS <dbl>, GS10 <dbl>,
## #   GS2 <dbl>, INDPRO <dbl>, ISRATIO <dbl>, KCFSI <dbl>, LOANS <dbl>,
## #   M2SL <dbl>, NEWORDER <dbl>, PERMIT <dbl>, TB3MS <dbl>, TWEXMMTH <dbl>,
## #   UNRATE <dbl>, IC4WSA <int>, NEWORD <dbl>, HMI <dbl>, P <dbl>, D <dbl>,
## #   E <dbl>, CPI <dbl>, Fraction <dbl>, Rate.GS10 <dbl>, Price <dbl>,
## #   Dividend <dbl>, Earnings <dbl>, CAPE <dbl>, low <dbl>, close <dbl>,
## #   volume <dbl>, rtn_m <dbl>, fwd_rtn_m <dbl>, rtn_6m <dbl>,
## #   min_6m <dbl>, dd_6m <dbl>, flag <dbl>, y1 <dbl>, diff_flag <dbl>

That looks good. As expected, identical results have been returned.

Lets put it in a function and test the output.

#===========================================================================================
#==   NEST TIME SERIES DATA                                                               ==
#==                                                                                       ==
#==   The following arguments are accepted:                                               ==
#==   - 'df' is a dataframe containing one column labelled "date".  The "date" column     ==
#==     must be in a date format                                                          ==
#==   - 'train_length' is the length of the training dataset                              ==
#==   - 'test_length' is the length of the testing dataset and will represent             == 
#==     the step forward in time                                                          ==
#===========================================================================================

ts_nest <- function(df, train_length, test_length) {
  
  # Parameters
  loops <- floor((nrow(df) - train_length) / test_length)
  start <- nrow(df) - ((loops * test_length) + train_length) + 1
  
  # Empty tibble
  nested_df = tibble()
  
  # Loop
  for (i in seq(start, by = test_length, length.out = loops)) {
    df <- econ_fin_data
    df <- slice(df, i:(i + train_length + test_length - 1)) %>% 
      mutate(nest_label = paste(format(strftime(min(date), "%Y-%m")), 
                                format(strftime(max(date), "%Y-%m")),
                                sep = ":"))
    # Join tables
    nested_df <- bind_rows(nested_df,df) 
  }
  
  nested_df <- nested_df %>% 
    group_by(nest_label) %>% 
    nest() %>% 
    ungroup()
}

# Test data frame using the same parameters as above 
nested_test <- ts_nest(econ_fin_data, 300, 6)

Our test and original data frame should be identical.

identical(nested_df, nested_test)

## [1] TRUE

That’s a wrap. This will put us in a good place to look at this type of analysis.