Types of Temporal Data - Feature Engineering

Practical Guide for Feature Engineering of Time Series Data

June 20, 2023

Introduction

Time series modeling is one of the most impactful machine learning use cases with broad applications across industries. Traditional time series modeling techniques, such as ARIMA, often automatically incorporate the time component of the data by using lagged values of the target variable as model inputs. While these techniques provide interpretable coefficients that aid in understanding the contribution of each variable to the forecast, they can be sensitive to outliers, missing data, and changes in the underlying data-generating process over time. As a result, their accuracy may be compromised. 
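
For reference, a classical baseline of this kind takes only a few lines to fit. The sketch below uses statsmodels on a toy daily series; the (7, 1, 1) order is an arbitrary illustrative choice rather than a tuned one.

Python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Toy daily series with weekly seasonality, standing in for real sales data
idx = pd.date_range("2023-01-01", periods=200, freq="D")
rng = np.random.default_rng(0)
y = pd.Series(100 + 10 * np.sin(2 * np.pi * idx.dayofweek / 7) + rng.normal(0, 2, len(idx)), index=idx)

# ARIMA(7, 1, 1): 7 autoregressive lags of the target, first differencing,
# and 1 moving-average term -- the lag structure is built into the model itself
fit = ARIMA(y, order=(7, 1, 1)).fit()
print(fit.forecast(steps=14))  # two-week-ahead forecast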

On the other hand, machine learning combined with feature engineering offers a more robust approach to time series modeling. It can handle complex, non-linear relationships and is well suited to large relational datasets with intricate interdependencies.

Feature engineering plays a crucial role in time series modeling, as it involves selecting and transforming raw data into meaningful features that enhance the accuracy of predictive statistical models.

By carefully selecting and transforming relevant features, statistical models become more adept at capturing the underlying patterns and relationships within the data. This ultimately leads to improved forecasting accuracy. Moreover, feature engineering enables the incorporation of domain knowledge and intuition, allowing the model to leverage human expertise and enhance performance.

A typical approach to feature engineering in time series forecasting involves the following types of features:

  • Lagged variables. Incorporating previous values of the time series as features captures patterns such as seasonality and trends. For example, if we want to predict today’s sales, a lagged variable like yesterday’s sales provides valuable information about the ongoing trend.
  • Moving window statistics. Aggregating the time series values over a rolling window smooths out noise and shifts the focus to the underlying trends. Moving windows can help identify patterns that may not be immediately apparent in the raw data.
  • Time-based features. Attributes such as the day of the week, the month of the year, holiday indicators, seasonality, and other time-related patterns can be valuable for predictions. For instance, if certain products tend to have higher average sales on weekends, incorporating the day of the week as a feature can improve the accuracy of the forecasting model.

In this tutorial, we will walk through examples of these feature types for a well-known time series dataset and discuss using pandas and SQL to manually create these features. The feature definitions we are discussing are common in many time series problems, and the code we provide can certainly be leveraged for other time series modeling projects. 

Dataset 

We are using the Store Sales – Time Series Forecasting competition on Kaggle (https://www.kaggle.com/competitions/store-sales-time-series-forecasting). The competition aims to build a model to predict the sales for the items in the test dataset. 

The train dataset contains the following columns (the names below match those used in the code):

  • date: The date of the sale.
  • store_nbr: The store where the sale took place.
  • family: The product family that was sold.
  • onpromotion: The number of items in the product family that were on promotion at that store on that date.
  • sales: The total sales for a product family at a particular store on a given date.

The Kaggle competition data consists of several data source CSV files, including stores.csv, oil.csv, and holidays_events.csv. 

Stores:

  • Contains metadata about different stores.
  • Includes city, state, type, and cluster information.
  • The cluster represents a grouping of similar stores.

Oil:

  • Provides daily oil prices.
  • Includes values for both the train and test data timeframes.
  • The dataset description notes that Ecuador, the country in focus, is highly dependent on oil, and its economic health is vulnerable to shocks in oil prices (a join sketch follows below).
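
Since the tutorial code below works only with train.csv, here is a minimal sketch of how a daily covariate like the oil price could be attached. It assumes oil.csv carries the price in a dcoilwtico column, as described on the competition page, and forward-fills days with no quote.

Python
import pandas as pd

# Load sales and daily oil prices
sales_data = pd.read_csv("data/train.csv", parse_dates=["date"])
oil = pd.read_csv("data/oil.csv", parse_dates=["date"])

# Left join keeps every sales row, even on days without an oil quote
sales_data = sales_data.merge(oil, on="date", how="left")

# Oil is not quoted every calendar day (e.g., weekends);
# carry the last known price forward
sales_data = sales_data.sort_values("date")
sales_data["dcoilwtico"] = sales_data["dcoilwtico"].ffill()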

Holidays_events:

  • Contains information about holidays and events and associated metadata.
  • Pay attention to the “transferred” column.
  • If a holiday is marked as transferred, it officially falls on that calendar day but was moved to another date by the government; the corresponding row of type “Transfer” gives the day it was actually celebrated (see the sketch below).
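
A minimal sketch of one way to handle this, assuming the date, type, and transferred columns described on the competition page (finer points, such as matching holiday locale to store location, are omitted for brevity):

Python
import pandas as pd

holidays = pd.read_csv("data/holidays_events.csv", parse_dates=["date"])

# Rows with transferred=True are official dates that were not actually
# celebrated; the matching rows of type 'Transfer' carry the real day off
actual = holidays[~holidays["transferred"]]

# 'Work Day' rows are make-up working days, not days off, so exclude them
holiday_dates = set(actual.loc[actual["type"] != "Work Day", "date"])

# Flag holidays in the sales data as a simple time-based feature
sales_data = pd.read_csv("data/train.csv", parse_dates=["date"])
sales_data["is_holiday"] = sales_data["date"].isin(holiday_dates).astype(int)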

We aim to build a machine learning model to predict the sales column for the items in the test dataset. To do that, we must first develop time series features from this complex relational dataset.

Four Important Types of Time-Series Features 

In the following sections, we will create four key features to enhance forecasting models in this Store Sales – Time Series use case.  

  1. First, we will incorporate lag features, including the sales figures from the prior day and two days prior. These features capture the influence of past sales on current values and help identify seasonality and trends.
  2. Next, we will utilize window features by calculating rolling means for the past 14 and 28 days. This approach, performed separately for each product and store segment, enables us to smooth out noise and highlight underlying trends, providing valuable insights into long-term patterns and potential shifts.
  3. Time-based features are crucial for this type of sales forecasting model, given the impact of weekends and holidays on store sales. In this case, we will explore the presence of higher sales volumes on weekends by creating a feature that identifies the day of the week and flags weekends. This time-based pattern recognition allows us to capture temporal dynamics influencing sales behavior.
  4. Lastly, we will combine all three approaches (lag, window, and time-based features) to create a comprehensive set of inputs for our forecasting model. By integrating these features, we aim to capture a wide range of patterns and relationships within the data, maximizing the accuracy of our predictions and facilitating more informed decision-making.

These four feature types (lag, window, time-based, and their combination) will significantly enhance our time series forecasting capabilities. By incorporating the influence of past sales, identifying underlying trends, capturing time-based patterns, and leveraging the synergy between these approaches, we are equipped to uncover valuable insights and optimize our forecasting strategies.

1. Lag Features

Using lagged features is a common technique in time-series analysis and machine-learning applications. By shifting the values of a variable backward or forward in time by a certain number of time periods, lagged features can capture temporal dependencies and trends in the data, providing valuable insights and improving the accuracy of predictive models.

In particular, lagged features are useful in predicting future values of a variable, as they can help identify patterns and relationships between variables over time. They are often used in fields such as finance, economics, and weather forecasting, where accurate predictions of future values are critical.

Several methods exist for generating lagged features, including shifting the values of a variable by a fixed number of time periods or using a rolling window of previous values. The choice of method depends on the specific application and the characteristics of the data.

In this example, we will create two lagged features that may be helpful in prediction: the sales from one day prior and the sales from two days prior. The pandas and SQL code used to generate these features is given below.

Python
import pandas as pd
from sqlite3 import connect
import warnings
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

#Pandas Version

# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")

# Sort the data
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True) 

# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])

# Define a function to create the lagged feature
def add_lagged_features(group, lags):
    for lag in lags:
        group[f'lagged_sales_{lag}'] = group['sales'].shift(lag)
    return group

# Create a lagged variable for sales
grouped_data = sales_data.groupby(['store_nbr', 'family']).apply(add_lagged_features, lags=[1, 2])

# Merge the lagged features back into the original DataFrame
sales_data = pd.merge(sales_data, grouped_data[['store_nbr', 'family', 'date', 'lagged_sales_1', 'lagged_sales_2']], on=['store_nbr', 'family', 'date'])

# Print the last 10 rows of the dataset with the lagged variables
print(sales_data[['store_nbr', 'family', 'date', 'sales', 'lagged_sales_1', 'lagged_sales_2']].tail(10))


#SQL Version

# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")

# Sort the data
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True) 

# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])

conn = connect(':memory:')
sales_data.to_sql('sales_data_sql', conn)

query = """
SELECT 
  store_nbr, family, date, sales, 
  LAG(sales, 1) OVER (PARTITION BY store_nbr, family ORDER BY date) AS lagged_sales_1,
  LAG(sales, 2) OVER (PARTITION BY store_nbr, family ORDER BY date) AS lagged_sales_2
FROM 
  sales_data_sql
ORDER BY 
  store_nbr, family, date
"""


sales_data_lagged = pd.read_sql(query, conn)
sales_data_lagged.tail(10)

By executing this code, users can manually build features for the sales from one day prior and from two days prior.

         store_nbr   family       date  sales  lagged_sales_1  lagged_sales_2
3000878         54  SEAFOOD 2017-08-06    3.0             3.0             0.0
3000879         54  SEAFOOD 2017-08-07    0.0             0.0             3.0
3000880         54  SEAFOOD 2017-08-08    0.0             0.0             0.0
3000881         54  SEAFOOD 2017-08-09   12.0            12.0             0.0
3000882         54  SEAFOOD 2017-08-10    4.0             4.0            12.0
3000883         54  SEAFOOD 2017-08-11    2.0             2.0             4.0
3000884         54  SEAFOOD 2017-08-12    0.0             0.0             2.0
3000885         54  SEAFOOD 2017-08-13    1.0             1.0             0.0
3000886         54  SEAFOOD 2017-08-14    2.0             2.0             1.0
3000887         54  SEAFOOD 2017-08-15    0.0             0.0             2.0

2. Rolling Window Statistics

Rolling window features are a widely used technique for feature engineering in time-series analysis and machine learning. By calculating summary statistics, such as the mean or standard deviation, over a sliding window of previous values, rolling window features can capture trends and patterns in the data, providing valuable insights and improving the accuracy of predictive models.

In particular, rolling window features are useful in identifying trends and changes in the data over time, as they can highlight important features such as seasonality, cyclical patterns, or sudden shifts in the data. They are commonly used in fields such as finance, economics, and healthcare, where accurate predictions of future values are crucial.

The choice of window size and summary statistic depends on the specific application and the characteristics of the data. A smaller window size may capture short-term fluctuations in the data, while a larger window size may capture longer-term trends. Similarly, different summary statistics, such as the mean or standard deviation, capture different aspects of the data and suit different applications.

In this example, we will create rolling mean features over the past 14 and 28 days for each product and store segment.

Python
# Pandas Version
# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")

# Sort the data
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True) 

# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])

# Define a function to compute the rolling mean for each group
def add_rolling_mean(group, window_sizes):
    for window_size in window_sizes:
        group[f'rolling_mean_{window_size}_days'] = group['sales'].rolling(window_size).mean()
    return group

# Create rolling mean variables for sales
grouped_data = sales_data.groupby(['store_nbr', 'family']).apply(add_rolling_mean, window_sizes=[14, 28])

# Merge the rolling mean features back into the original DataFrame
sales_data = pd.merge(sales_data, grouped_data[['store_nbr', 'family', 'date', 'rolling_mean_14_days', 'rolling_mean_28_days']], on=['store_nbr', 'family', 'date'])

# Print the last 10 rows of the dataset with the rolling mean variables
sales_data[['store_nbr','family', 'date','sales', 'rolling_mean_14_days', 'rolling_mean_28_days']].tail(10)

#SQL Version
# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")

# Sort the data
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True) 

# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])

conn = connect(':memory:')
sales_data.to_sql('sales_data_sql', conn)

query = """
SELECT 
  store_nbr, family, date, sales, 
  AVG(sales) OVER (PARTITION BY store_nbr, family ORDER BY date ROWS BETWEEN 13 PRECEDING AND CURRENT ROW) AS rolling_mean_14_days,
  AVG(sales) OVER (PARTITION BY store_nbr, family ORDER BY date ROWS BETWEEN 27 PRECEDING AND CURRENT ROW) AS rolling_mean_28_days
FROM 
  sales_data_sql
"""

sales_data_rolling = pd.read_sql(query, conn)
sales_data_rolling.tail(10)

By executing this code, users can manually build rolling mean features over the past 14 and 28 days for each product and store segment.

         store_nbr   family       date  sales  rolling_mean_14_days  rolling_mean_28_days
3000878         54  SEAFOOD 2017-08-06    0.0              3.071429              3.000000
3000879         54  SEAFOOD 2017-08-07    0.0              3.000000              2.642857
3000880         54  SEAFOOD 2017-08-08   12.0              3.357143              2.964286
3000881         54  SEAFOOD 2017-08-09    4.0              3.428571              3.071429
3000882         54  SEAFOOD 2017-08-10    2.0              3.428571              3.142857
3000883         54  SEAFOOD 2017-08-11    0.0              3.142857              3.142857
3000884         54  SEAFOOD 2017-08-12    1.0              2.928571              3.107143
3000885         54  SEAFOOD 2017-08-13    2.0              2.785714              3.000000
3000886         54  SEAFOOD 2017-08-14    0.0              2.500000              2.892857
3000887         54  SEAFOOD 2017-08-15    3.0              2.500000              2.785714

3. Time-based features

DateTime feature engineering is the process of creating new features from date and time information in order to improve predictive accuracy in machine learning models. This technique is especially useful in time-series analysis, where time-related patterns are important, such as in finance, e-commerce, healthcare, and social media.

One common example of datetime feature engineering is extracting the day of the week, hour of the day, or month of the year. These features can capture the seasonality and temporal patterns in the data. For example, a time-series dataset of e-commerce sales may show higher sales volumes on weekends than on weekdays, or around holidays such as Christmas.

In this example, we will create features such as “day_of_the_week” and “is_weekend”.

Python
# Pandas Version

# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")

# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])

# Sort the dataset by date in ascending order
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True) 

# Extract the day of the week from the date column
sales_data['day_of_week'] = sales_data['date'].dt.day_name()

# Create a new binary column for whether the day is a weekend or not
sales_data['is_weekend'] = sales_data['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)

# Print the last 10 rows of the dataset with the is_weekend variable
sales_data[['store_nbr','family', 'date','sales', 'day_of_week','is_weekend']].tail(10)

#SQL Version
sales_data = pd.read_csv("data/train.csv")

# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])

# Sort the dataset by date in ascending order
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True) 

conn = connect(':memory:')
sales_data.to_sql('sales_data_sql', conn)

query = """
SELECT 
  store_nbr, family, date, sales, strftime('%w', date) as day_of_week
FROM 
  sales_data_sql
"""

sales_data_day_of_week = pd.read_sql(query, conn)
sales_data_day_of_week.to_sql('sales_data_day_of_week_sql', conn)

query = """
SELECT 
  store_nbr, family, date, sales, 
  day_of_week,
  CASE 
    WHEN day_of_week IN ('0', '6') THEN 1 
    ELSE 0 
  END AS is_weekend
FROM 
  sales_data_day_of_week_sql
"""

sales_data_weekend = pd.read_sql(query, conn)
sales_data_weekend.tail(10)

By executing this code, users can manually build the “day_of_week” and “is_weekend” features.

         store_nbr   family       date  sales day_of_week  is_weekend
1285214         54  SEAFOOD 2017-08-06    0.0      Sunday           1
1286996         54  SEAFOOD 2017-08-07    0.0      Monday           0
1288778         54  SEAFOOD 2017-08-08   12.0     Tuesday           0
1290560         54  SEAFOOD 2017-08-09    4.0   Wednesday           0
1292342         54  SEAFOOD 2017-08-10    2.0    Thursday           0
1294124         54  SEAFOOD 2017-08-11    0.0      Friday           0
1295906         54  SEAFOOD 2017-08-12    1.0    Saturday           1
1297688         54  SEAFOOD 2017-08-13    2.0      Sunday           1
1299470         54  SEAFOOD 2017-08-14    0.0      Monday           0
1301252         54  SEAFOOD 2017-08-15    3.0     Tuesday           0

4. Complex Feature

In a real-world use case, the most important features are often complex ones that combine the previous three feature types. Consider the following feature:

Average weekly (rolling window) sales in the same week number (date time) last year (lag)

Python
# Pandas Version

# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")

# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])

# Extract week number and year from date column
sales_data['week_number'] = sales_data['date'].dt.isocalendar().week  # .dt.week was removed in pandas 2.x
sales_data['year'] = sales_data['date'].dt.year
sales_data['last_year'] = sales_data['year'] - 1

# Compute average sales for each week number and year combination
weekly_sales = sales_data.groupby(['store_nbr', 'family', 'week_number', 'year'])['sales'].mean().reset_index().rename(columns={'sales': 'avg_sales_last_year_week'})

# Create explicit join keys: each weekly average will be matched to rows one year later
weekly_sales['join_year'] = weekly_sales['year']
weekly_sales['last_week_number'] = weekly_sales['week_number']

# Merge average sales for same week last year to original data
sales_data = pd.merge(sales_data, weekly_sales[['store_nbr', 'family', 'join_year', 'last_week_number', 'avg_sales_last_year_week']], how='left', 
left_on=['store_nbr', 'family', 'week_number', 'last_year'], right_on=['store_nbr', 'family', 'last_week_number', 'join_year'])

sales_data.tail(10)

#SQL Version

# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")
# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])

# Extract week number and year from date column
sales_data['week_number'] = sales_data['date'].dt.isocalendar().week  # .dt.week was removed in pandas 2.x
sales_data['year'] = sales_data['date'].dt.year
sales_data['last_year'] = sales_data['year'] - 1

conn = connect(':memory:')
sales_data.to_sql('sales_data_sql', conn)


query = """
SELECT 
    s.*, 
    t.avg_sales_last_year_week
FROM 
    sales_data_sql s
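    -- Note: unlike the pandas version above (how='left'), this INNER JOIN
    -- drops rows that have no matching week in the previous year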
    INNER JOIN (
        SELECT 
            sales_data_sql.store_nbr, 
            sales_data_sql.family, 
            strftime('%Y', sales_data_sql.date) as year1, 
            strftime('%W', sales_data_sql.date) as week, 
            AVG(s_prev.sales) AS avg_sales_last_year_week
        FROM 
            sales_data_sql
            INNER JOIN sales_data_sql s_prev ON sales_data_sql.store_nbr = s_prev.store_nbr AND sales_data_sql.family = s_prev.family
             AND 
                strftime('%Y',date( sales_data_sql.date, '-1 year')) = strftime('%Y',s_prev.date) 
             AND
                strftime('%W',sales_data_sql.date) = strftime('%W',s_prev.date)
        GROUP BY 
            sales_data_sql.store_nbr, 
            sales_data_sql.family, 
            strftime('%Y', sales_data_sql.date), 
            strftime('%W', sales_data_sql.date)
    ) t ON s.store_nbr = t.store_nbr AND s.family = t.family AND strftime('%Y', s.date) = t.year1 AND strftime('%W', s.date) = t.week
ORDER BY 
    s.store_nbr, 
    s.family, 
    s.date
"""

avg_sales_last_year_week = pd.read_sql(query, conn)
avg_sales_last_year_week.tail(10)

By executing this code, users can manually build the feature “Average weekly (rolling window) sales in the same week number (date time) last year (lag)”.

              id       date  store_nbr family  sales  onpromotion  week_number  year  last_year  join_year  last_week_number  avg_sales_last_year_week
1703633  1703633 2015-08-17         10  DAIRY  218.0            1           34  2015       2014     2014.0              34.0                244.571429
1705415  1705415 2015-08-18         10  DAIRY  200.0            0           34  2015       2014     2014.0              34.0                244.571429
1707197  1707197 2015-08-19         10  DAIRY  190.0           50           34  2015       2014     2014.0              34.0                244.571429
1708979  1708979 2015-08-20         10  DAIRY  183.0            0           34  2015       2014     2014.0              34.0                244.571429
1710761  1710761 2015-08-21         10  DAIRY  178.0            0           34  2015       2014     2014.0              34.0                244.571429
1712543  1712543 2015-08-22         10  DAIRY  227.0            0           34  2015       2014     2014.0              34.0                244.571429
1514325  1514325 2015-08-23         10  DAIRY  268.0            0           34  2015       2014     2014.0              34.0                244.571429

As you can see, developing this single feature requires fairly complex pandas and SQL programming. When developing complex features like this, we need to test the pandas and SQL code properly before implementing it in machine learning models.
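
One lightweight check is to compute the same feature both ways and assert that the results agree. The sketch below assumes the pandas result (sales_data) and the SQL result (sales_data_lagged) from the lag feature section are still in memory.

Python
import pandas as pd

keys = ['store_nbr', 'family', 'date']
cols = ['lagged_sales_1', 'lagged_sales_2']

# Align both results on the same row order before comparing
pdf = sales_data.sort_values(keys).reset_index(drop=True)
sdf = sales_data_lagged.sort_values(keys).reset_index(drop=True)

# SQLite returns dates as strings; normalize them before the comparison
sdf['date'] = pd.to_datetime(sdf['date'])

# Raises an AssertionError with a diff if any feature value disagrees;
# NaN values in the same positions are treated as equal
pd.testing.assert_frame_equal(pdf[keys + cols], sdf[keys + cols], check_dtype=False)
print("pandas and SQL lag features match")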

Conclusion

This tutorial demonstrates the creation of lag features, rolling window statistics, and time-based features on the Kaggle Time Series Dataset. Lagged versions of the target variable can be utilized to predict future values, such as using yesterday’s sales to forecast today’s sales. Rolling window statistics, such as moving averages, enable us to capture trends and minimize noise in the data. Additionally, time-based features, such as the hour, day, month, or season, offer valuable information related to timing and can be leveraged to uncover patterns, like higher average sales of specific products on weekends. By employing these techniques, data scientists can enhance their predictive models and gain a deeper understanding of their datasets.
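
As a final illustration of how these engineered columns might feed a model, here is a minimal sketch. It assumes a recent scikit-learn and a single DataFrame that already contains the lag, rolling mean, and weekend columns built above; a simple date-based split stands in for proper backtesting.

Python
from sklearn.ensemble import HistGradientBoostingRegressor

features = ['lagged_sales_1', 'lagged_sales_2',
            'rolling_mean_14_days', 'rolling_mean_28_days', 'is_weekend']

# Drop the warm-up rows where lags and rolling means are still undefined
df = sales_data.dropna(subset=features)

# Split by time, never randomly, to avoid leaking the future into training
train = df[df['date'] < '2017-07-01']
valid = df[df['date'] >= '2017-07-01']

model = HistGradientBoostingRegressor(random_state=0)
model.fit(train[features], train['sales'])
print('validation R^2:', model.score(valid[features], valid['sales']))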

Data science teams often find themselves caught in the time-consuming trap of manual feature engineering. The process of feature discovery is notorious for being slow and repetitive, demanding extensive domain knowledge. However, there is a revolutionary solution to this predicament in the form of dotData’s Feature Factory, an innovative feature development workbench. Feature Factory significantly expedites this process by automatically suggesting feature spaces derived from relational data. It jump-starts the exploration phase, saving valuable time for data scientists. Request a product walkthrough today. 

Joshua Gordon

Dr. Joshua Gordon is a data scientist specializing in statistical modeling, machine learning, and deep learning technologies. While earning his Ph.D. in Statistics from the University of California, Los Angeles, he published six articles on geospatial time series algorithms for earthquake prediction and on residual analysis for spatio-temporal models. Joshua has over 10 years of experience as a data scientist in the insurance and manufacturing industries prior to joining dotData as a Senior Data Scientist. At dotData, he leads customer-facing data science projects and is responsible for customers’ success with dotData technologies.