Data leakage can be tricky when analysing time series. Ensuring you are not using the future to predict the future is very important if you want to use the past to predict the future! After all, you don’t get to use future data when you are in the present!
These earthquake researchers have been accused of mixing things up.
Let’s say we want to apply a machine learning algorithm that requires hyper-parameter tuning, and hence a validation data set, to a time series. Normally this is implemented using a walk forward training, validation, testing approach. There are a couple of ways to do this.
- You can code up everything in loops. A primitive stab at this has been performed here. Of course to perform cross-validation an inner loop is required.
- Sticking to
R
and avoiding loops, any of the following can be used,zoo::rollapply()
,tibbletime::rollify()
ortsibble::slide()
.tibbletime
has been retired in lieu of theslide
family of functions fromtsibble
. Subsequently, this functionality has been moved to theslide
package.
This post will take a different approach. We will create a nested data frame containing multiple slices of the same time series. The idea behind this is to use this nested data frame to apply functions and train models using the functionality from the purrr
package. This will help negate the risk of data leakage. Each time slice of our data will be neatly compartmentalised in each of the underlying nested data frames.
Lets get started building this nested data frame.
The required packages.
library("tidyverse")
library("lubridate")
We set parameters for the length of each time slice, and grab some data.
# Parameters
train_length <- 300
test_length <- 6
# Data
econ_fin_data <- readRDS("C:/Users/brent/Documents/R/Misc_scripts/econ_fin_data.Rda")
data_length <- nrow(econ_fin_data)
loops <- floor((nrow(econ_fin_data) - train_length) / test_length)
start <- nrow(econ_fin_data) - ((loops * test_length) + train_length) + 1
Next, we loop over the data set to create each slice of data, labeling these with the date range to which they relate. Each newly labelled slice is appending to a new data frame. This new data frame is then grouped by the labels just created and nested.
# Empty tibble
nested_df = tibble()
# Loop for time slices
for (i in seq(start, by = test_length, length.out = loops)) {
df <- econ_fin_data
df <- slice(df, i:(i + train_length + test_length - 1)) %>%
mutate(nest_label = paste(format(strftime(min(date), "%Y-%m")),
format(strftime(max(date), "%Y-%m")),
sep = ":"))
nested_df <- bind_rows(nested_df,df)
}
nested_df <- nested_df %>%
group_by(nest_label) %>%
nest() %>%
ungroup()
Here is the resulting data frame.
tail(nested_df)
## # A tibble: 6 x 2
## nest_label data
## <chr> <list<df[,47]>>
## 1 1991-12:2017-05 [306 x 47]
## 2 1992-06:2017-11 [306 x 47]
## 3 1992-12:2018-05 [306 x 47]
## 4 1993-06:2018-11 [306 x 47]
## 5 1993-12:2019-05 [306 x 47]
## 6 1994-06:2019-11 [306 x 47]
Lets unnest.
tail(unnest(nested_df[99,2], cols = c(data)))
## # A tibble: 6 x 47
## date AAA ACDGNO AHETPI AWHMAN BAA BOGMBASE CFNAIDIFF CPIAUCSL
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2019-06-01 3.42 44973 23.4 41.7 4.46 3274825 -0.15 255.
## 2 2019-07-01 3.29 45146 23.5 41.5 4.28 3260316 -0.21 256.
## 3 2019-08-01 2.98 44576 23.6 41.5 3.87 3271378 -0.09 256.
## 4 2019-09-01 3.03 43233 23.7 41.5 3.91 3202682 -0.24 256.
## 5 2019-10-01 3.01 42476 23.8 41.4 3.92 3252830 -0.22 257.
## 6 2019-11-01 3.06 42476 23.8 41.4 3.94 3315603 -0.22 NA
## # ... with 38 more variables: CPILFESL <dbl>, FEDFUNDS <dbl>, GS10 <dbl>,
## # GS2 <dbl>, INDPRO <dbl>, ISRATIO <dbl>, KCFSI <dbl>, LOANS <dbl>,
## # M2SL <dbl>, NEWORDER <dbl>, PERMIT <dbl>, TB3MS <dbl>, TWEXMMTH <dbl>,
## # UNRATE <dbl>, IC4WSA <int>, NEWORD <dbl>, HMI <dbl>, P <dbl>, D <dbl>,
## # E <dbl>, CPI <dbl>, Fraction <dbl>, Rate.GS10 <dbl>, Price <dbl>,
## # Dividend <dbl>, Earnings <dbl>, CAPE <dbl>, low <dbl>, close <dbl>,
## # volume <dbl>, rtn_m <dbl>, fwd_rtn_m <dbl>, rtn_6m <dbl>,
## # min_6m <dbl>, dd_6m <dbl>, flag <dbl>, y1 <dbl>, diff_flag <dbl>
This should be identical to the original data.
filter(econ_fin_data, between(date, as.Date("2019-06-01"), as.Date("2019-11-01")))
## # A tibble: 6 x 47
## date AAA ACDGNO AHETPI AWHMAN BAA BOGMBASE CFNAIDIFF CPIAUCSL
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2019-06-01 3.42 44973 23.4 41.7 4.46 3274825 -0.15 255.
## 2 2019-07-01 3.29 45146 23.5 41.5 4.28 3260316 -0.21 256.
## 3 2019-08-01 2.98 44576 23.6 41.5 3.87 3271378 -0.09 256.
## 4 2019-09-01 3.03 43233 23.7 41.5 3.91 3202682 -0.24 256.
## 5 2019-10-01 3.01 42476 23.8 41.4 3.92 3252830 -0.22 257.
## 6 2019-11-01 3.06 42476 23.8 41.4 3.94 3315603 -0.22 NA
## # ... with 38 more variables: CPILFESL <dbl>, FEDFUNDS <dbl>, GS10 <dbl>,
## # GS2 <dbl>, INDPRO <dbl>, ISRATIO <dbl>, KCFSI <dbl>, LOANS <dbl>,
## # M2SL <dbl>, NEWORDER <dbl>, PERMIT <dbl>, TB3MS <dbl>, TWEXMMTH <dbl>,
## # UNRATE <dbl>, IC4WSA <int>, NEWORD <dbl>, HMI <dbl>, P <dbl>, D <dbl>,
## # E <dbl>, CPI <dbl>, Fraction <dbl>, Rate.GS10 <dbl>, Price <dbl>,
## # Dividend <dbl>, Earnings <dbl>, CAPE <dbl>, low <dbl>, close <dbl>,
## # volume <dbl>, rtn_m <dbl>, fwd_rtn_m <dbl>, rtn_6m <dbl>,
## # min_6m <dbl>, dd_6m <dbl>, flag <dbl>, y1 <dbl>, diff_flag <dbl>
That looks good. As expected, identical results have been returned.
Lets put it in a function and test the output.
#===========================================================================================
#== NEST TIME SERIES DATA ==
#== ==
#== The following arguments are accepted: ==
#== - 'df' is a dataframe containing one column labelled "date". The "date" column ==
#== must be in a date format ==
#== - 'train_length' is the length of the training dataset ==
#== - 'test_length' is the length of the testing dataset and will represent ==
#== the step forward in time ==
#===========================================================================================
ts_nest <- function(df, train_length, test_length) {
# Parameters
loops <- floor((nrow(df) - train_length) / test_length)
start <- nrow(df) - ((loops * test_length) + train_length) + 1
# Empty tibble
nested_df = tibble()
# Loop
for (i in seq(start, by = test_length, length.out = loops)) {
df <- econ_fin_data
df <- slice(df, i:(i + train_length + test_length - 1)) %>%
mutate(nest_label = paste(format(strftime(min(date), "%Y-%m")),
format(strftime(max(date), "%Y-%m")),
sep = ":"))
# Join tables
nested_df <- bind_rows(nested_df,df)
}
nested_df <- nested_df %>%
group_by(nest_label) %>%
nest() %>%
ungroup()
}
# Test data frame using the same parameters as above
nested_test <- ts_nest(econ_fin_data, 300, 6)
Our test and original data frame should be identical.
identical(nested_df, nested_test)
## [1] TRUE
That’s a wrap. This will put us in a good place to look at this type of analysis.
comments powered by Disqus