14 Time Series (applied)
‘Predict the influence of discrete events’ (and more) on web traffic, signups, etc.
14.0.1 The case at hand (broad terms)
Modeling what drives/affects/predicts …
- ‘traffic and number of conversions over the days/weeks of this year’ (count data)
- quality of conversions

Particularly in response to discrete events, media coverage, etc.1
- “Did any of these events (at different times) seem to increase numbers?” From which they aim to infer (more loosely):
- “What kinds of events will, in the future, most likely increase numbers?” Obviously this also involves assessing what normal variation is, and whether there is a trend, seasonality, etc.
14.0.2 Our proposed overall strategy
Essentially, this is ‘time series data’ (although there may be some panel and cross-sectional elements). We want to model this in a way that allows for overall trends (possibly nonlinear), seasonality, and possibly autocorrelation (AR and MA terms?).2
… and maybe also ‘structural breaks’.
When considering the impact of an event, we/they want to measure its ‘full effect over time’, taking into account its lagged effects, and possibly indirect ‘autoregressive effects’ (as converted people may bring in additional converts, etc.)
DR: I think we need the partner to clarify their goals a bit more. Is it:
1. Predict the evolution of the outcomes in the future to aid their planning?
2. Understand the value of specific types of media coverage and other things they might influence?
3. Or just the value of the specifically mentioned event?
4. Understand how the outcomes reacted to things largely out of their control?
DR: I think a simple model ‘adjusted to trends (and maybe seasonality)’ might be good enough as a first pass. I’m not sure the data is rich enough to justify much more.
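A minimal sketch of what such a first pass could look like, on toy data with hypothetical column names (daily conversion counts, a linear trend, day-of-week seasonality, and a post-event dummy). The Poisson/log-link choice echoes the footnote below and is an assumption here, not the partner's actual specification:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy daily conversion counts; in practice this would be the partner's data.
rng = np.random.default_rng(0)
df = pd.DataFrame({"date": pd.date_range("2023-01-01", periods=365)})
df["t"] = np.arange(len(df))                                 # linear trend
df["dow"] = df["date"].dt.day_name()                         # day-of-week seasonality
df["post_event"] = (df["date"] >= "2023-06-01").astype(int)  # hypothetical event date
df["conversions"] = rng.poisson(5 + 0.01 * df["t"] + 2 * df["post_event"])

# Poisson regression (log link) of counts on trend, seasonality, and the event dummy.
model = smf.poisson("conversions ~ t + C(dow) + post_event", data=df).fit()
print(model.summary())
```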
14.1 Pete’s General notes on time series (in a prediction/ML context)
In machine learning, time series problems are problems that involve forecasting (extrapolating) the future based on information from the past. Typically we have to make a chain of non-independent predictions rather than predict discrete, independent events.
DR: this prediction problem should be distinguished from ‘time series econometrics’, see, e.g., Diebold’s text, which focuses on estimating fundamental structural parameters, and considers forms of ‘causality’.
Most machine learning problems assume that the order of rows doesn’t matter and that the rows are independent of one another. Since time series problems involve time-ordered rows where past events may influence future events, the rows are not independent and this independence assumption is violated.
Another key assumption of ML is that the training data is similar to the data being predicted (in this case, future data). This assumption must be true for time series as well.
14.1.1 Decomposition
You can decompose a time series into four key parts:
- The trend (T), where the mean is changing over time (e.g., sales generally keep increasing over time)
- Seasonality (S) (e.g., sales are higher in the holiday season)
- A non-seasonal cyclical (C) component (e.g., stock market follows “business cycles”). This is distinct from seasonality as seasonality has a fixed period (e.g., every November), whereas a cycle does not.
- A random component (e)
DR: The ‘random component’ could be distinguished or described further; there are various types of ‘random’ terms: shifts, changes in trends, ‘moving average’ one-off terms, ‘autoregressive’ error terms.
14.1.2 Types of decomposition
There are two basic types of decomposition: additive, where y = T + S + C + e, and multiplicative, where y = T * S * C * e.
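A quick sketch of a classical decomposition in Python (statsmodels), on toy monthly data standing in for the real series; note the classical routine returns trend, seasonal, and residual parts but does not separate out a distinct cyclical component:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Toy monthly series with a trend and yearly seasonality (stand-in for real data).
idx = pd.date_range("2018-01-01", periods=60, freq="MS")
rng = np.random.default_rng(0)
y = pd.Series(2.0 * np.arange(60)                        # trend (T)
              + 10 * np.sin(2 * np.pi * idx.month / 12)  # seasonality (S)
              + rng.normal(0, 2, 60),                    # noise (e)
              index=idx)

result = seasonal_decompose(y, model="additive", period=12)
print(result.trend.dropna().head())   # T
print(result.seasonal.head())         # S
print(result.resid.dropna().head())   # e
# Use model="multiplicative" for y = T * S * e instead.
```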
14.2 Stationarity vs. non-stationarity
Stationary time series (a) do not have a trend and (b) do not have variance that changes over time. Non-stationary time series violate (a) and/or (b), i.e., they have a trend and/or time-varying variance. This typically creates problems for modeling, as models have trouble extrapolating trends and changing variance, and these tend to violate the assumption that the training data is similar to the data being predicted.
DR: There can be trend-stationary series, which are stationary after accounting for a deterministic trend. These are pretty easy to deal with. Note also that a ‘random walk’ is a canonical example of a nonstationary series.
But I’m also not convinced that this should necessarily be a problem in a prediction problem. If the series is a random walk/nonstationary, we can still use that knowledge to make a decent prediction of where the outcome will be at time T+t given its value at time T.
14.2.1 Converting to stationary
We can resolve these issues by converting a non-stationary series to a stationary series. This is done by differencing, where we look at the differences in the target over time rather than the actual target (y[t] -> y[t] - y[t-1]).
DR: Most economic time series are in fact stationary after first-differencing. However, this is not guaranteed. You may still want to test for stationarity after first-differencing. (But my memory is that the whole tests for stationarity thing is a huge can of worms).
We can also handle exponential trends using techniques like log transformations.
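A sketch of differencing, a log transform, and a stationarity check in Python; the series here is a toy random walk standing in for the real outcome, and (per DR's caveat) the ADF test is only a rough guide:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Toy non-stationary series (a random walk) standing in for the real outcome.
rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(size=300)))

diff1 = y.diff().dropna()                 # first difference: y[t] - y[t-1]
# For an exponential trend in a positive-valued series, log first, then difference:
# growth = np.log(y).diff().dropna()

adf_stat, p_value, *_ = adfuller(diff1)   # Augmented Dickey-Fuller test
print(f"ADF statistic: {adf_stat:.2f}, p-value: {p_value:.3f}")
# A small p-value rejects the unit-root null, i.e. the differenced series
# looks stationary (with the usual caveats about such tests).
```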
14.3 Handling Features
14.3.1 Lags
Lagging is pretty key to time series. A lag is when you use the value from a previous time period to forecast a later one. You can lag the target variable (e.g., use last month’s sales to predict next month’s sales) and/or you can lag independent variables. Lagged variables often have strong explanatory power because the real world has delays (e.g., it takes a few weeks for marketing to translate into sales, so marketing spend from three weeks ago may be more predictive than marketing spend from the same week) and causal effects that play out over time (e.g., sales from last year show that the store is more popular, so there is more word of mouth and it is even more popular the next year).

Lagging always involves losing some data: if you are using data from the previous month, you won’t be able to use the first month in your training data (there is no month before it). If you are using data from the three previous months, you won’t be able to train on the first three months. This may not be an issue though, because unlike with non-time-series problems, more data in time series isn’t always better (you might prefer to model only the most recent trend).
We can mix lags of different lengths.
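A small illustration of lag features with pandas (toy data and hypothetical column names); note how dropna() discards the early rows that have no lagged value:

```python
import pandas as pd

# Toy monthly data; in practice this would come from the partner's records.
df = pd.DataFrame({
    "sales": [100, 120, 115, 130, 160, 155, 170],
    "marketing_spend": [10, 12, 9, 14, 15, 13, 16],
})
df["sales_lag1"] = df["sales"].shift(1)                # last month's sales
df["marketing_lag3"] = df["marketing_spend"].shift(3)  # marketing spend 3 months ago
df_model = df.dropna()                                 # rows lost to lagging
print(df_model)
```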
14.3.2 Rolling Statistics
Lags aren’t the only thing we can do - we can also calculate rolling statistics, like the mean of a variable over the past 14 days. You can also do rolling stats on differences.
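And a matching sketch for rolling statistics, again on a toy daily series:

```python
import pandas as pd

# Toy daily series standing in for traffic/conversions.
traffic = pd.Series(range(60), index=pd.date_range("2023-01-01", periods=60))

roll_mean_14 = traffic.rolling(window=14).mean()              # mean over the past 14 days
roll_mean_diff_14 = traffic.diff().rolling(window=14).mean()  # rolling mean of differences
print(roll_mean_14.tail(3))
print(roll_mean_diff_14.tail(3))
```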
14.3.3 Which Lags / Rolling Stats to Use?
- Assess with the cross-correlation function, which tests the correlation at many different lags.
- Assess with the partial autocorrelation function (PACF), which gives the partial correlation of a stationary time series with its own lagged values, controlling for the values of the series at all shorter lags. You can also assess with the autocorrelation function (ACF), which does not control for shorter lags. (See the sketch after this list.)
- Vary lag lengths based on data type, domain knowledge, and how quickly you think the series reacts to change (e.g., use shorter lags for stocks).
- Compare candidate lags by backtesting.
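A minimal sketch of ACF/PACF plots with statsmodels, on a toy AR(1) series standing in for the real (stationary or differenced) outcome:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Toy AR(1) series: y[t] = 0.7 * y[t-1] + noise.
rng = np.random.default_rng(0)
e = rng.normal(size=200)
vals = [0.0]
for t in range(1, 200):
    vals.append(0.7 * vals[-1] + e[t])
y = pd.Series(vals)

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(y, lags=30, ax=axes[0])   # does not control for shorter lags
plot_pacf(y, lags=30, ax=axes[1])  # controls for shorter lags
plt.show()
```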
14.3.4 Known in Advance vs. Not
When we are trying to predict in the future, we run into an issue that the features we are using for prediction might also be unknown at prediction time. For example, our historical data might contain information about rainfall and how that connects to sales, but we can’t reliably know the rainfall three months in the future to predict future sales.
However, some features are known in advance. For example, Christmas may have a big impact on sales, and we always know exactly when Christmas will be.
For features that are not known in advance, we can still use them by explicitly forecasting them, by extrapolating them using lags / rolling stats of sufficient size, or by extrapolating from current values (e.g., by forecasting their differences).
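A tiny example of the ‘known in advance’ case: calendar features can be filled in for future rows directly, whereas a weather-type feature would need its own forecast or a long lag/rolling summary (column names here are hypothetical):

```python
import pandas as pd

# Future calendar features are known exactly, arbitrarily far ahead.
future = pd.DataFrame({"date": pd.date_range("2024-11-01", periods=90)})
future["is_december"] = (future["date"].dt.month == 12).astype(int)
future["day_of_week"] = future["date"].dt.dayofweek
print(future.head())
```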
14.4 Validation
Normally for ML problems we use cross validation, where we randomly partition the data and then predict one partition using data from all the other partitions. The problem with this for time series is that this will involve using future data to predict the past, which will make for unrealistically good predictions.
Instead, we can use backtesting where we predict a future time using a window of past times (e.g., predict Year 6 using Years 3-5, predict Year 5 using Years 2-4, predict Year 4 using Years 1-3).
For metrics, it’s good to compare the evaluation metric to a naive baseline model, or an intentionally minimal extrapolation.
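A sketch of backtesting with scikit-learn's TimeSeriesSplit on toy data, comparing a simple model's MAE against a naive ‘carry the last value forward’ baseline; the model and features are placeholders, not a recommendation:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Toy data: a noisy trend. Rows are in time order.
rng = np.random.default_rng(0)
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + rng.normal(0, 5, 100)

tscv = TimeSeriesSplit(n_splits=5)  # always trains on the past, tests on the future
for train_idx, test_idx in tscv.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    naive = np.repeat(y[train_idx][-1], len(test_idx))  # naive baseline
    print(f"model MAE: {mean_absolute_error(y[test_idx], pred):.1f}, "
          f"naive MAE: {mean_absolute_error(y[test_idx], naive):.1f}")
```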
14.5 Types of Models
- Integrated model: step the model forward one period at a time (typically univariate), e.g., ARIMA, exponential smoothing
- Forecast distance model: predict at fixed distances ahead, possibly fitting a different model for each distance, e.g., XGB
- Trend and decomposition model: fit and extrapolate trend and seasonal components, e.g., FB Prophet
14.5.1 ARIMA
- Autoregressive process, AR(p): fit coefficients to p lags of the series
- Moving average process, MA(q): fit coefficients to q previous errors
- ARMA model: combines the AR and MA terms, X = f(AR, MA)
- I(d): difference the series d times
- ARIMA(p, d, q) = AR(p) + I(d) + MA(q) (see the sketch below)
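A minimal fit-and-forecast sketch with statsmodels' ARIMA; the (1, 1, 1) order is illustrative only and the series is a toy random walk with drift, not the partner's data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Toy series: a random walk with drift, standing in for the real outcome.
rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(0.5, 1.0, 200)),
              index=pd.date_range("2023-01-01", periods=200))

fit = ARIMA(y, order=(1, 1, 1)).fit()  # AR(1), first difference, MA(1)
print(fit.summary())
print(fit.forecast(steps=14))          # 14-step-ahead forecast
```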
14.6 Multiseries
Predict a different time series for each unit (e.g., sales by store). Can use features across series (cross series features).
DR: The latter makes this seem more like a causal inference problem (‘what works to boost good applications’) … than a ‘predict the future to plan around it’ problem (as in Pete’s ML notes).
Do we need to test for stationarity (and then difference and test it again)?
DR: Poisson because it’s the ‘arrival of events’. Why a ‘log link’?
DR: Fundamentally, there will be bumps in individual periods that may occur randomly. A shock or permanent shift after the WSJ feature need not have been caused by it. The question is ‘how unusual is such a shock, in the context of a long time series of shocks?’ And how long is this series, anyway?
DR: Can you explain more? And does this allow for flexible autocorrelation? Do we need to ‘test if it’s a random walk’?
DR: I don’t know what ‘kriging’ means; does anyone else understand it?
DR: Not sure what/how to adapt this.