Time series forecasting is a method of predicting future values of time-stamped data using statistical, econometric, signal processing, or machine learning techniques. Time series data sets consist of historical data points collected over time, often at regular intervals. The interval between observations can range from fractions of a second, such as the nanoseconds used to measure how fast computers process information, to very long spans, such as the millions of years used to describe geological processes.
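To make this concrete, the sketch below (an assumption on our part, using the pandas and NumPy libraries rather than anything named in this article) builds a synthetic, regularly sampled daily series indexed by timestamps:

```python
import numpy as np
import pandas as pd

# Hypothetical example: one year of daily observations, indexed by timestamp.
timestamps = pd.date_range(start="2023-01-01", periods=365, freq="D")
values = np.random.default_rng(0).normal(loc=100.0, scale=5.0, size=365)

series = pd.Series(values, index=timestamps, name="daily_demand")
print(series.head())
```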
The process for creating a time series forecasting model moves from problem definition to an operational model through the following steps:
Defining what we want to predict and the metrics used to evaluate the model’s predictive accuracy.
Gathering and storing time series data in raw form for a new project, or identifying existing data sources (database tables, spreadsheets, files).
Identifying entity relationships and data schemas, and identifying and canonicalizing data types.
Cleaning the data: deduplication, removal of outliers and illegal values, imputation of missing data, and canonicalization of strings and categorical values (a cleaning sketch follows this list).
Identifying patterns and characteristics of the data to understand it and to form hypotheses about the types of features that may be important.
Applying filters, aggregation, normalization, and transformations to the raw source data to create new variables that improve predictive performance relative to the target variable (a feature engineering sketch follows this list).
Choosing the appropriate forecasting model based on the characteristics of the time series data and the problem to be solved. Common approaches for time series data include autoregressive integrated moving average (ARIMA), exponential smoothing, seasonal decomposition, long short-term memory (LSTM) neural networks, decision trees, and dynamic system models (a model-fitting sketch follows this list).
Estimating model parameters from a historical training partition of the full time series range (typically around 80% of the range); a partitioning and evaluation sketch follows this list.
Assessing the accuracy of the fitted model on a validation partition of the full time series range (typically the 10% of the range immediately following the training set) using appropriate metrics such as mean squared error, mean absolute error, root mean squared error, or percent error.
Generating predictions with the fitted model on a test partition of the full time series range (the final 10% of the historical range) and evaluating prediction accuracy.
Using the model to generate predictions for the business use case on new data unseen during model development.
Periodically evaluating model performance on new data and updating the model when needed, which is often the case as new patterns or trends emerge in the data.
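As a hypothetical illustration of the cleaning step above, the sketch below uses pandas to deduplicate rows, drop and cap suspect values, impute gaps, and canonicalize a string column; the column names (`timestamp`, `value`, `category`) are assumptions, not taken from this article.

```python
import pandas as pd

def clean_time_series(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, remove illegal values, cap outliers, impute gaps, canonicalize strings."""
    df = df.drop_duplicates(subset="timestamp")               # deduplication
    df = df[df["value"] >= 0].copy()                          # drop illegal (negative) readings
    low, high = df["value"].quantile([0.01, 0.99])
    df["value"] = df["value"].clip(lower=low, upper=high)     # cap extreme outliers
    df = df.set_index("timestamp").asfreq("D")                # enforce a regular daily grid
    df["value"] = df["value"].interpolate(method="time")      # impute missing observations
    df["category"] = df["category"].str.strip().str.lower()   # canonicalize string categories
    return df
```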
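The feature engineering step can be sketched as lag, rolling-window, and calendar features derived from the cleaned series; the specific features below are illustrative assumptions, and the shift applied before each rolling window avoids leaking the current value into its own features.

```python
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive lag, rolling-window, and calendar features from a datetime-indexed frame."""
    out = df.copy()
    out["lag_1"] = out["value"].shift(1)                              # previous observation
    out["lag_7"] = out["value"].shift(7)                              # same weekday last week
    out["rolling_mean_7"] = out["value"].shift(1).rolling(7).mean()   # trailing weekly level
    out["rolling_std_7"] = out["value"].shift(1).rolling(7).std()     # trailing weekly volatility
    out["day_of_week"] = out.index.dayofweek                          # weekly seasonality signal
    out["month"] = out.index.month                                    # yearly seasonality signal
    return out.dropna()                                               # keep rows with full history
```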
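One way to realize the model selection and fitting step is a classical ARIMA model from the statsmodels library, sketched below on a synthetic series; the (1, 1, 1) order is an arbitrary illustrative assumption, and in practice it would be chosen from the data's autocorrelation structure or by automated search.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic daily series used purely for illustration.
idx = pd.date_range("2023-01-01", periods=365, freq="D")
rng = np.random.default_rng(0)
series = pd.Series(100 + np.cumsum(rng.normal(0, 1, 365)), index=idx)

# Fit an ARIMA(p, d, q) model; the (1, 1, 1) order is an assumption, not a recommendation.
fitted = ARIMA(series, order=(1, 1, 1)).fit()

# Forecast the next 14 periods beyond the observed data.
print(fitted.forecast(steps=14))
```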
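The 80/10/10 partitioning and the accuracy metrics described in the training, validation, and test steps can be sketched as a chronological split (never a random shuffle, since temporal order matters) plus a small metrics helper; the function names are assumptions.

```python
import numpy as np
import pandas as pd

def chronological_split(series: pd.Series, train_frac: float = 0.8, val_frac: float = 0.1):
    """Split a time series into train/validation/test partitions in time order."""
    n = len(series)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return series.iloc[:train_end], series.iloc[train_end:val_end], series.iloc[val_end:]

def evaluate(actual, predicted) -> dict:
    """Compute the accuracy metrics mentioned above."""
    actual, predicted = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    err = actual - predicted
    mse = np.mean(err ** 2)
    return {
        "MSE": mse,
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(mse),
        "MAPE_%": np.mean(np.abs(err / actual)) * 100,
    }
```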
Time series forecasting is subject to various challenges, such as seasonality, trend shifts, irregular patterns, and noisy data. Handling these complexities and selecting appropriate features and models is critical for creating a reliable, robust model capable of generating accurate forecasts. As seen in the image above, manual feature engineering can take months, limiting the ability to deliver models in a timely manner. Establishing an automated, data-centric feature discovery process is therefore essential to remain competitive in today’s AI-driven world.