A new tidy data structure to support exploration and modeling of temporal data

Describing the data structures and design principles of the tsibble R package

Earo Wang https://earo.me (Monash University)
2019-04-20

Introduction

Going from raw temporal data to model-ready time series objects is painful. Temporal data in the wild can arrive in many possible formats: irregular or multiple time intervals, point events that need aggregating, multiple observational units or repeated measurements on multiple individuals, and heterogeneous data types. But time series required for modelling are the notably simplified matrices, and it does not provide formal organisation on how to map wild data to the expected objects. This results in a myriad of ad hoc (and possibly inaccurate) solutions and the loss of the data richness while in transition. Mining temporal information from time series objects is also inhibited by limited toolkits that hinge on that specialist format. Figure 1 illustrates the lumpy workflow of current time series analysis.

Illustration of the current time series analysis workflow, adapted from R for Data Science (Wickham and Grolemund 2016). The missing “tidy” temporal data structure leads to a myriad of ad hoc solutions and duplicated efforts from raw temporal data to model-oriented time series objects.

Figure 1: Illustration of the current time series analysis workflow, adapted from R for Data Science (Wickham and Grolemund 2016). The missing “tidy” temporal data structure leads to a myriad of ad hoc solutions and duplicated efforts from raw temporal data to model-oriented time series objects.

The tsibble package aims to make this workflow more robust, accurate, and efficient, as visualised in Figure 2. It provides a new data abstraction to represent temporal data, referred to as “tsibble”, allowing the “tidy data” principles (Wickham 2014) to be brought to the time domain. It also helps to lay the plumbing for temporal data analysis modules of transformation, visualisation, and modeling, to achieve rapid iteration in gaining data insights.

Tsibble defines tidy data in temporal context and lubricates the process of time series analysis.

Figure 2: Tsibble defines tidy data in temporal context and lubricates the process of time series analysis.

Tidy data representation for temporal data

First and foremost, tsibble represents tidy temporal data. Contextual semantics—index and key—are introduced to tidy data in order to support more intuitive time-aware manipulations and enlighten new perspectives for time series model inputs. A variable representing time, referred as “index”, provides a contextual basis for temporal data. The “key” consisted of a single or multiple variables informs what identities or observational units are recorded over time. The remaining columns are considered as measurements. Figure 3 demonstrates the shape of a tsibble, but the tsibble for being valid has to pass the hurdle of distinct rows identified by the index and key. Otherwise duplicates signal a data quality issue, which would likely affect subsequent analyses and hence decision making. This ensures the tsibble to be a valid input for time series analytics and models.

The architecture of the tsibble structure is built on top of the tidy data, with time-series contextual semantics: index and key.

Figure 3: The architecture of the tsibble structure is built on top of the tidy data, with time-series contextual semantics: index and key.

To create a tsibble, one needs to declare the index and key. Hourly pedestrian counts at different locations in downtown Melbourne are used as an example. The “key” includes four sensors, and the “index” contains date-times of one hour interval along with the time zone of Melbourne. Below displays contextually comprehensive details that aid users in understanding their data better.


library(tsibble)
pedestrian %>% 
  as_tsibble(key = Sensor, index = Date_Time)

#> # A tsibble: 66,037 x 5 [1h] <Australia/Melbourne>
#> # Key:       Sensor [4]
#>   Sensor         Date_Time           Date        Time Count
#>   <chr>          <dttm>              <date>     <int> <int>
#> 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01     0  1630
#> 2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01     1   826
#> 3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01     2   567
#> # … with 6.603e+04 more rows

Domain specific language for wrangling temporal data

Besides a data abstraction, tsibble is a domain specific language (DSL) in R for transforming temporal data. It leverages the tidyverse verbs and adds some new verbs/adverbs to the vocabularies for more intuitive time-based manipulations. Since a tsibble permits time gaps in the index, it is good practice to check and inspect any gaps in time following the creation of a tsibble, in order to prevent inviting some avoidable errors into the analysis. A suite of verbs are provided to understand and tackle implicit missing values; and fill_gaps() turns them into explicit ones along with imputing by values or functions.


library(tidyverse)
pedestrian %>% 
  fill_gaps(Count = 0L) %>% 
  group_by(Sensor) %>% 
  mutate(Lagged_Count = lag(Count)) %>% 
  ggplot(aes(x = Lagged_Count, y = Count)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~ Sensor, ncol = 2, scales = "free")

A new adverb index_by() adjacent to a verb allows for index subgrouping operations. In conjunction with summarise(), it performs aggregations over time to different time resolutions.


pedestrian %>% 
  group_by(Sensor) %>% 
  index_by(Year_Month = yearmonth(Date_Time)) %>% 
  summarise(Avg_Count = mean(Count)) %>% 
  ggplot(aes(x = Year_Month, y = Avg_Count, colour = Sensor)) +
  geom_line() +
  geom_point() +
  expand_limits(y = 0)

The tsibble DSL provides a set of functional yet evocative verbs for empowering users to compose expressive statements; that facilitates to frame the data analytic problems and document how to approach them through clean code.

Rolling windows with functional programming

Rolling window calculations are widely used techniques in time series analysis, and often apply to other applications. These operations are dependent on having an ordering, particularly time ordering for temporal data. Three common types of variations for rolling window operations are:

  1. slide()/slide2()/pslide(): sliding window with overlapping observations.
  2. tile()/tile2()/ptile(): tiling window without overlapping observations.
  3. stretch()/stretch2()/pstretch(): fixing an initial window and expanding to include more observations.
Rolling window animation

Rolling window embraces functional programming, which frees users from verbose statements but focuses on succinct expressions instead. These expressions can be as simple as moving averages, or as complex as rolling regressions and forecasting. This family turns out useful for time series cross validation. Their multiprocessing equivalents prefixed by future_*() enable easy switch to rolling in parallel.


library(furrr)
plan(multiprocess)
my_diag <- function(...) {
  data <- tibble(...)
  fit <- lm(Count ~ Time, data = data)
  list(fitted = fitted(fit), resid = residuals(fit))
}
pedestrian %>%
  filter_index(~ "2015-03") %>%
  nest(-Sensor) %>%
  mutate(diag = future_map(data, ~ future_pslide_dfr(., my_diag, .size = 24 * 7)))

#> # A tibble: 4 x 3
#>   Sensor                      data                diag                
#>   <chr>                       <list>              <list>              
#> 1 Birrarung Marr              <tsibble [2,160 × … <tibble [334,825 × …
#> 2 Bourke Street Mall (North)  <tsibble [1,032 × … <tibble [145,321 × …
#> 3 QV Market-Elizabeth St (We… <tsibble [2,160 × … <tibble [334,825 × …
#> 4 Southern Cross Station      <tsibble [2,160 × … <tibble [334,825 × …

Conclusions

The tsibble R package articulates the time series data pipeline, which shepherds raw temporal data through to time series analysis, and plots. It also lays the fundamental computing infrastructure of a new ecosystem for tidy time series analysis, known as “tidyverts”, including time series features (the feasts package) and forecasting (the fable package).

Note

This article is originally written for the Software Corner of the Biometric Bulletin (to appear).

This is a joint work with Dianne Cook and Rob J Hyndman. A longer version of the data structure and design principles about the tsibble R package can be found at arXiv (https://arxiv.org/abs/1901.10257).

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). Foundation for Open Access Statistics: 1–23.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science. O’Reilly Media. http://r4ds.had.co.nz/.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. Source code is available at https://github.com/earowang/less, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Wang (2019, April 20). Less <> More: A new tidy data structure to support exploration and modeling of temporal data. Retrieved from https://less.earo.me/posts/2019-04-tsibble-design/

BibTeX citation

@misc{wang2019tsibble,
  author = {Wang, Earo},
  title = {Less <> More: A new tidy data structure to support exploration and modeling of temporal data},
  url = {https://less.earo.me/posts/2019-04-tsibble-design/},
  year = {2019}
}