Practical Guide for Feature Engineering of Time Series Data
Time series modeling is one of the most impactful machine learning use cases with broad applications across industries. Traditional time series modeling techniques, such as ARIMA, often automatically incorporate the time component of the data by using lagged values of the target variable as model inputs. While these techniques provide interpretable coefficients that aid in understanding the contribution of each variable to the forecast, they can be sensitive to outliers, missing data, and changes in the underlying data-generating process over time. As a result, their accuracy may be compromised.
On the other hand, machine learning combined with feature engineering offers a more robust approach to time series modeling. It can handle complex, non-linear relationships and is well suited to large relational datasets with intricate interdependencies.
Feature engineering plays a crucial role in time series modeling: selecting and transforming raw data into meaningful features can substantially improve the accuracy of predictive statistical models.
By carefully selecting and transforming relevant features, statistical models become more adept at capturing the underlying patterns and relationships within the data. This ultimately leads to improved forecasting accuracy. Moreover, feature engineering enables the incorporation of domain knowledge and intuition, allowing the model to leverage human expertise and enhance performance.
A typical approach to feature engineering in time series forecasting involves the following types of features:
- Lag features
- Rolling window features
- Time-based (datetime) features
- Combinations of the above
In this tutorial, we will walk through examples of these feature types for a well-known time series dataset and discuss using pandas and SQL to manually create these features. The feature definitions we are discussing are common in many time series problems, and the code we provide can certainly be leveraged for other time series modeling projects.
We are using the Store Sales – Time Series Forecasting competition on Kaggle (https://www.kaggle.com/competitions/store-sales-time-series-forecasting). The goal of the competition is to build a model that predicts sales for the items in the test dataset.
The train dataset contains the following columns:
- id: a unique identifier for each row
- date: the date of the record
- store_nbr: the store at which the products are sold
- family: the type of product sold
- sales: the total sales for a product family at a particular store on a given date (the prediction target)
- onpromotion: the number of items in a product family that were being promoted at a store on a given date
The Kaggle competition data consists of several data source CSV files, including stores.csv, oil.csv, and holidays_events.csv.
Stores: store metadata, including city, state, type, and cluster.
Oil: the daily oil price (Ecuador's economy is highly sensitive to oil prices).
Holidays_events: holidays and events, with metadata such as type, locale, and whether the holiday was transferred.
We aim to build a machine learning model to predict the Sales column for the items in the test dataset. To do that, we must first develop time series features from this complex relational dataset.
In the following sections, we will create four key types of features to enhance forecasting models in this Store Sales – Time Series use case.
These four feature types (lag, window, time-based, and combinations of the three) will significantly enhance our time series forecasting capabilities. By incorporating the influence of past sales, identifying underlying trends, capturing time-based patterns, and leveraging the synergy between these approaches, we are equipped to uncover valuable insights and optimize our forecasting strategies.
Using lagged features is a common technique in time-series analysis and machine-learning applications. By shifting the values of a variable backward or forward in time by a certain number of time periods, lagged features can capture temporal dependencies and trends in the data, providing valuable insights and improving the accuracy of predictive models.
In particular, lagged features are useful in predicting future values of a variable, as they can help identify patterns and relationships between variables over time. They are often used in fields such as finance, economics, and weather forecasting, where accurate predictions of future values are critical.
Several methods exist for generating lagged features, including shifting the values of a variable by a fixed number of time periods or by using a rolling window of previous values. The choice of method will depend on the specific application and the data characteristics.
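As a quick illustration of the shift-based approach, here is a minimal sketch on a hypothetical toy series (the column names here are made up; the full example on the competition data follows below):
import pandas as pd
# Each shift(k) pulls the value from k periods earlier;
# the first k rows have no history and become NaN.
toy = pd.DataFrame({'sales': [5, 7, 6, 9, 8]})
toy['lag_1'] = toy['sales'].shift(1)  # previous period's sales
toy['lag_2'] = toy['sales'].shift(2)  # sales from two periods back
print(toy)
#    sales  lag_1  lag_2
# 0      5    NaN    NaN
# 1      7    5.0    NaN
# 2      6    7.0    5.0
# 3      9    6.0    7.0
# 4      8    9.0    6.0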
In this example, we will create two lagged features that may be helpful in prediction: the sales from 1 day prior and the sales from 2 days prior. The pandas and SQL code that can be used to generate these features is given below.
import pandas as pd
from sqlite3 import connect
import warnings
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# Pandas Version
# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")
# Sort the data
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True)
# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])
# Define a function to create the lagged features
def add_lagged_features(group, lags):
    for lag in lags:
        group[f'lagged_sales_{lag}'] = group['sales'].shift(lag)
    return group
# Create a lagged variable for sales
grouped_data = sales_data.groupby(['store_nbr', 'family']).apply(add_lagged_features, lags=[1, 2])
# Merge the lagged features back into the original DataFrame
sales_data = pd.merge(sales_data, grouped_data[['store_nbr', 'family', 'date', 'lagged_sales_1', 'lagged_sales_2']], on=['store_nbr', 'family', 'date'])
# Print the last 10 rows of the dataset with the lagged variables
print(sales_data[['store_nbr','family', 'date','sales', 'lagged_sales_1', 'lagged_sales_2']].tail(10))
# SQL Version
# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")
# Sort the data
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True)
# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])
conn = connect(':memory:')
sales_data.to_sql('sales_data_sql', conn)
query = """
SELECT
store_nbr, family, date, sales,
LAG(sales, 1) OVER (PARTITION BY store_nbr, family ORDER BY date) AS lagged_sales_1,
LAG(sales, 2) OVER (PARTITION BY store_nbr, family ORDER BY date) AS lagged_sales_2
FROM
sales_data_sql
ORDER BY
store_nbr, family, date
"""
sales_data_lagged = pd.read_sql(query, conn)
sales_data_lagged.tail(10)
By executing this code, users can manually build features for the sales from 1 day prior and the sales from 2 days prior.
| | store_nbr | family | date | sales | lagged_sales_1 | lagged_sales_2 |
|---|---|---|---|---|---|---|
| 3000878 | 54 | SEAFOOD | 2017-08-06 | 0.0 | 3.0 | 0.0 |
| 3000879 | 54 | SEAFOOD | 2017-08-07 | 0.0 | 0.0 | 3.0 |
| 3000880 | 54 | SEAFOOD | 2017-08-08 | 12.0 | 0.0 | 0.0 |
| 3000881 | 54 | SEAFOOD | 2017-08-09 | 4.0 | 12.0 | 0.0 |
| 3000882 | 54 | SEAFOOD | 2017-08-10 | 2.0 | 4.0 | 12.0 |
| 3000883 | 54 | SEAFOOD | 2017-08-11 | 0.0 | 2.0 | 4.0 |
| 3000884 | 54 | SEAFOOD | 2017-08-12 | 1.0 | 0.0 | 2.0 |
| 3000885 | 54 | SEAFOOD | 2017-08-13 | 2.0 | 1.0 | 0.0 |
| 3000886 | 54 | SEAFOOD | 2017-08-14 | 0.0 | 2.0 | 1.0 |
| 3000887 | 54 | SEAFOOD | 2017-08-15 | 3.0 | 0.0 | 2.0 |
Rolling window features are a widely used technique for feature engineering in time-series analysis and machine learning. By calculating summary statistics, such as the mean or standard deviation, over a sliding window of previous values, rolling window features can capture trends and patterns in the data, providing valuable insights and improving the accuracy of predictive models.
In particular, rolling window features are useful in identifying trends and changes in the data over time, as they can highlight important features such as seasonality, cyclical patterns, or sudden shifts in the data. They are commonly used in fields such as finance, economics, and healthcare, where accurate predictions of future values are crucial.
The choice of window size and summary statistics will depend on the specific application and the characteristics of the data. For example, a smaller window size may capture short-term fluctuations in the data, while a larger window size may capture longer-term trends. Similarly, summary statistics such as the mean or standard deviation may capture different aspects of the data and be more appropriate for different applications.
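To make the window-size trade-off concrete, here is a small sketch on toy data (hypothetical values, not the competition dataset):
import pandas as pd
# A short window tracks the spike at index 3 closely,
# while a longer window smooths it out over more periods.
s = pd.Series([10, 12, 9, 30, 11, 10, 12])
print(pd.DataFrame({
    'sales': s,
    'mean_2': s.rolling(2).mean(),  # short-term fluctuations
    'mean_4': s.rolling(4).mean(),  # longer-term trend
}))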
In this example, we will create rolling mean features over the past 14 and 28 days for each store and product family segment.
# Pandas Version
# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")
# Sort the data
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True)
# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])
# Define a function to compute the rolling mean for each group
def add_rolling_mean(group, window_sizes):
    for window_size in window_sizes:
        group[f'rolling_mean_{window_size}_days'] = group['sales'].rolling(window_size).mean()
    return group
# Create rolling mean variables for sales using 14- and 28-day windows
grouped_data = sales_data.groupby(['store_nbr', 'family']).apply(add_rolling_mean, window_sizes=[14, 28])
# Merge the rolling mean features back into the original DataFrame
sales_data = pd.merge(sales_data, grouped_data[['store_nbr', 'family', 'date', 'rolling_mean_14_days', 'rolling_mean_28_days']], on=['store_nbr', 'family', 'date'])
# Print the last 10 rows of the dataset with the rolling mean variables
sales_data[['store_nbr','family', 'date','sales', 'rolling_mean_14_days', 'rolling_mean_28_days']].tail(10)
# SQL Version
# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")
# Sort the data
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True)
# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])
conn = connect(':memory:')
sales_data.to_sql('sales_data_sql', conn)
query = """
SELECT
store_nbr, family, date, sales,
AVG(sales) OVER (PARTITION BY store_nbr, family ORDER BY date ROWS BETWEEN 13 PRECEDING AND CURRENT ROW) AS rolling_mean_14_days,
AVG(sales) OVER (PARTITION BY store_nbr, family ORDER BY date ROWS BETWEEN 27 PRECEDING AND CURRENT ROW) AS rolling_mean_28_days
FROM
sales_data_sql
"""
sales_data_rolling = pd.read_sql(query, conn)
sales_data_rolling.tail(10)
By executing this code, users can manually build rolling mean features over the past 14 and 28 days for each store and product family segment.
| | store_nbr | family | date | sales | rolling_mean_14_days | rolling_mean_28_days |
|---|---|---|---|---|---|---|
| 3000878 | 54 | SEAFOOD | 2017-08-06 | 0.0 | 3.071429 | 3.000000 |
| 3000879 | 54 | SEAFOOD | 2017-08-07 | 0.0 | 3.000000 | 2.642857 |
| 3000880 | 54 | SEAFOOD | 2017-08-08 | 12.0 | 3.357143 | 2.964286 |
| 3000881 | 54 | SEAFOOD | 2017-08-09 | 4.0 | 3.428571 | 3.071429 |
| 3000882 | 54 | SEAFOOD | 2017-08-10 | 2.0 | 3.428571 | 3.142857 |
| 3000883 | 54 | SEAFOOD | 2017-08-11 | 0.0 | 3.142857 | 3.142857 |
| 3000884 | 54 | SEAFOOD | 2017-08-12 | 1.0 | 2.928571 | 3.107143 |
| 3000885 | 54 | SEAFOOD | 2017-08-13 | 2.0 | 2.785714 | 3.000000 |
| 3000886 | 54 | SEAFOOD | 2017-08-14 | 0.0 | 2.500000 | 2.892857 |
| 3000887 | 54 | SEAFOOD | 2017-08-15 | 3.0 | 2.500000 | 2.785714 |
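Note that the two versions are not identical at the start of each store and family segment: pandas' rolling(14).mean() returns NaN until 14 observations are available (min_periods defaults to the window size), while the SQL window AVG averages over however many rows exist so far. If you want the pandas output to match the SQL output, pass min_periods=1, as this small sketch on a toy series shows:
import pandas as pd
s = pd.Series([3.0, 0.0, 12.0])
print(s.rolling(14).mean())                 # NaN, NaN, NaN: fewer than 14 rows
print(s.rolling(14, min_periods=1).mean())  # 3.0, 1.5, 5.0: matches the SQL AVG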
DateTime feature engineering is the process of creating new features from date and time information in order to improve predictive accuracy in machine learning models. This technique is especially useful in time-series analysis, where time-related patterns are important, such as in finance, e-commerce, healthcare, and social media.
One common example of datetime feature engineering is extracting the day of the week, hour of the day, or month of the year. These features can capture the seasonality and temporal patterns in the data. For example, a time-series dataset of e-commerce sales may have a higher sales volume on weekends than weekdays or holidays such as Christmas.
In this example, we will create features such as “day_of_week” and “is_weekend”.
# Pandas Version
# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")
# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])
# Sort the dataset by date in ascending order
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True)
# Extract the day of the week from the date column
sales_data['day_of_week'] = sales_data['date'].dt.day_name()
# Create a new binary column for whether the day is a weekend or not
sales_data['is_weekend'] = sales_data['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)
# Print the last 10 rows of the dataset with the is_weekend variable
sales_data[['store_nbr','family', 'date','sales', 'day_of_week','is_weekend']].tail(10)
# SQL Version
sales_data = pd.read_csv("data/train.csv")
# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])
# Sort the dataset by date in ascending order
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True)
conn = connect(':memory:')
sales_data.to_sql('sales_data_sql', conn)
query = """
SELECT
store_nbr, family, date, sales, strftime('%w', date) as day_of_week
FROM
sales_data_sql
"""
sales_data_day_of_week = pd.read_sql(query, conn)
sales_data_day_of_week.to_sql('sales_data_day_of_week_sql', conn)
query = """
SELECT
store_nbr, family, date, sales,
day_of_week,
CASE
WHEN day_of_week IN ('0', '6') THEN 1
ELSE 0
END AS is_weekend
FROM
sales_data_day_of_week_sql
"""
sales_data_weekend = pd.read_sql(query, conn)
sales_data_weekend.tail(10)
By executing this code, users can manually build the “day_of_week” and “is_weekend” features.
| | store_nbr | family | date | sales | day_of_week | is_weekend |
|---|---|---|---|---|---|---|
| 1285214 | 54 | SEAFOOD | 2017-08-06 | 0.0 | Sunday | 1 |
| 1286996 | 54 | SEAFOOD | 2017-08-07 | 0.0 | Monday | 0 |
| 1288778 | 54 | SEAFOOD | 2017-08-08 | 12.0 | Tuesday | 0 |
| 1290560 | 54 | SEAFOOD | 2017-08-09 | 4.0 | Wednesday | 0 |
| 1292342 | 54 | SEAFOOD | 2017-08-10 | 2.0 | Thursday | 0 |
| 1294124 | 54 | SEAFOOD | 2017-08-11 | 0.0 | Friday | 0 |
| 1295906 | 54 | SEAFOOD | 2017-08-12 | 1.0 | Saturday | 1 |
| 1297688 | 54 | SEAFOOD | 2017-08-13 | 2.0 | Sunday | 1 |
| 1299470 | 54 | SEAFOOD | 2017-08-14 | 0.0 | Monday | 0 |
| 1301252 | 54 | SEAFOOD | 2017-08-15 | 3.0 | Tuesday | 0 |
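One caveat: the table above shows weekday names, which the pandas version produces via day_name(); the SQL version's day_of_week column instead holds SQLite's strftime('%w') codes, '0' (Sunday) through '6' (Saturday). If you need the two outputs to line up, you can map the codes to names. A small sketch, assuming the sales_data_weekend frame built in the SQL version above:
# Map SQLite weekday codes to the names produced by pandas' day_name().
w_to_name = {'0': 'Sunday', '1': 'Monday', '2': 'Tuesday', '3': 'Wednesday',
             '4': 'Thursday', '5': 'Friday', '6': 'Saturday'}
sales_data_weekend['day_of_week'] = sales_data_weekend['day_of_week'].map(w_to_name)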
For a real-world use case, the most important features are often complex ones that combine the previous three types. Consider the following feature:
Average weekly (rolling window) sales in the same week number (datetime) last year (lag)
Building it involves three steps: compute the average sales for each store, product family, year, and week number; shift the year forward by one so each week lines up with the same week a year earlier; and join the result back onto the original rows.
# Pandas Version
# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")
# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])
# Extract week number and year from date column
sales_data['week_number'] = sales_data['date'].dt.isocalendar().week  # dt.week was removed in pandas 2.0
sales_data['year'] = sales_data['date'].dt.year
sales_data['last_year'] = sales_data['year'] - 1
# Compute average sales for each week number and year combination
weekly_sales = sales_data.groupby(['store_nbr', 'family', 'week_number', 'year'])['sales'].mean().reset_index().rename(columns={'sales': 'avg_sales_last_year_week'})
# Compute week number and year for same week last year
weekly_sales['join_year'] = weekly_sales['year']
weekly_sales['last_week_number'] = weekly_sales['week_number']
# Merge average sales for same week last year to original data
sales_data = pd.merge(sales_data, weekly_sales[['store_nbr', 'family', 'join_year', 'last_week_number', 'avg_sales_last_year_week']], how='left',
left_on=['store_nbr', 'family', 'week_number', 'last_year'], right_on=['store_nbr', 'family', 'last_week_number', 'join_year'])
sales_data.tail(10)
# SQL Version
# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")
# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])
# Extract week number and year from date column
sales_data['week_number'] = sales_data['date'].dt.isocalendar().week  # dt.week was removed in pandas 2.0
sales_data['year'] = sales_data['date'].dt.year
sales_data['last_year'] = sales_data['year'] - 1
conn = connect(':memory:')
sales_data.to_sql('sales_data_sql', conn)
query = """
SELECT
s.*,
t.avg_sales_last_year_week
FROM
sales_data_sql s
INNER JOIN (
SELECT
sales_data_sql.store_nbr,
sales_data_sql.family,
strftime('%Y', sales_data_sql.date) as year1,
strftime('%W', sales_data_sql.date) as week,
AVG(s_prev.sales) AS avg_sales_last_year_week
FROM
sales_data_sql
INNER JOIN sales_data_sql s_prev ON sales_data_sql.store_nbr = s_prev.store_nbr AND sales_data_sql.family = s_prev.family
AND
strftime('%Y',date( sales_data_sql.date, '-1 year')) = strftime('%Y',s_prev.date)
AND
strftime('%W',sales_data_sql.date) = strftime('%W',s_prev.date)
GROUP BY
sales_data_sql.store_nbr,
sales_data_sql.family,
strftime('%Y', sales_data_sql.date),
strftime('%W', sales_data_sql.date)
) t ON s.store_nbr = t.store_nbr AND s.family = t.family AND strftime('%Y', s.date) = t.year1 AND strftime('%W', s.date) = t.week
ORDER BY
s.store_nbr,
s.family,
s.date
"""
avg_sales_last_year_week = pd.read_sql(query, conn)
avg_sales_last_year_week.tail(10)
By executing this code, users can manually build features for “average weekly (rolling window) sales in the same week number (datetime) last year (lag)”.
| | id | date | store_nbr | family | sales | onpromotion | week_number | year | last_year | join_year | last_week_number | avg_sales_last_year_week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1703633 | 1703633 | 2015-08-17 | 10 | DAIRY | 218.0 | 1 | 34 | 2015 | 2014 | 2014.0 | 34.0 | 244.571429 |
| 1705415 | 1705415 | 2015-08-18 | 10 | DAIRY | 200.0 | 0 | 34 | 2015 | 2014 | 2014.0 | 34.0 | 244.571429 |
| 1707197 | 1707197 | 2015-08-19 | 10 | DAIRY | 190.0 | 50 | 34 | 2015 | 2014 | 2014.0 | 34.0 | 244.571429 |
| 1708979 | 1708979 | 2015-08-20 | 10 | DAIRY | 183.0 | 0 | 34 | 2015 | 2014 | 2014.0 | 34.0 | 244.571429 |
| 1710761 | 1710761 | 2015-08-21 | 10 | DAIRY | 178.0 | 0 | 34 | 2015 | 2014 | 2014.0 | 34.0 | 244.571429 |
| 1712543 | 1712543 | 2015-08-22 | 10 | DAIRY | 227.0 | 0 | 34 | 2015 | 2014 | 2014.0 | 34.0 | 244.571429 |
| 1514325 | 1514325 | 2015-08-23 | 10 | DAIRY | 268.0 | 0 | 34 | 2015 | 2014 | 2014.0 | 34.0 | 244.571429 |
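One subtlety worth flagging: the pandas version numbers weeks with isocalendar() (ISO 8601), while the SQL version uses SQLite's strftime('%W'), which counts weeks from the first Monday of the year. Near year boundaries the two can disagree, as this small check illustrates:
import pandas as pd
from sqlite3 import connect
# 2016-01-01 falls in ISO week 53 of 2015, but SQLite's '%W' labels it week '00'.
iso_week = pd.Timestamp('2016-01-01').isocalendar()[1]
sqlite_week = connect(':memory:').execute("SELECT strftime('%W', '2016-01-01')").fetchone()[0]
print(iso_week, sqlite_week)  # 53 00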
As you can see, developing even this single feature requires fairly complex pandas and SQL programming. When building complex features like this, we need to test the pandas and SQL code properly before implementing it in machine learning models.
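One lightweight way to do that is to align the two outputs on their keys and assert that they match. Below is a minimal sketch for the lag features, assuming the sales_data (pandas version) and sales_data_lagged (SQL version) frames built earlier in this post:
import pandas as pd
# Align both versions on the same keys, then compare the feature columns.
cols = ['store_nbr', 'family', 'date', 'lagged_sales_1', 'lagged_sales_2']
pandas_out = sales_data[cols].sort_values(cols[:3]).reset_index(drop=True)
sql_out = sales_data_lagged.copy()
sql_out['date'] = pd.to_datetime(sql_out['date'])  # SQLite returns dates as text
sql_out = sql_out[cols].sort_values(cols[:3]).reset_index(drop=True)
# Raises an AssertionError if any values differ.
pd.testing.assert_frame_equal(pandas_out, sql_out, check_dtype=False)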
This tutorial demonstrates the creation of lag features, rolling window statistics, and time-based features on the Kaggle Store Sales time series dataset. Lagged versions of the target variable can be utilized to predict future values, such as using yesterday’s sales to forecast today’s sales. Rolling window statistics, such as moving averages, enable us to capture trends and minimize noise in the data. Additionally, time-based features, such as the hour, day, month, or season, offer valuable information related to timing and can be leveraged to uncover patterns, like higher average sales of specific products on weekends. By employing these techniques, data scientists can enhance their predictive models and gain a deeper understanding of their datasets.
Data science teams often find themselves caught in the time-consuming trap of manual feature engineering. The process of feature discovery is notorious for being slow and repetitive, demanding extensive domain knowledge. However, there is a revolutionary solution to this predicament in the form of dotData’s Feature Factory, an innovative feature development workbench. Feature Factory significantly expedites this process by automatically suggesting feature spaces derived from relational data. It jump-starts the exploration phase, saving valuable time for data scientists. Request a product walkthrough today.