Practical Guide for Feature Engineering of Time Series Data

Introduction

Time series modeling is one of the most impactful machine learning use cases with broad applications across industries. Traditional time series modeling techniques, such as ARIMA, often automatically incorporate the time component of the data by using lagged values of the target variable as model inputs. While these techniques provide interpretable coefficients that aid in understanding the contribution of each variable to the forecast, they can be sensitive to outliers, missing data, and changes in the underlying data-generating process over time. As a result, their accuracy may be compromised. 

Machine learning combined with feature engineering, on the other hand, offers a more robust approach to time series modeling. It can handle complex, non-linear relationships and is well suited to large relational datasets with intricate interdependencies.

Feature engineering plays a crucial role in time series modeling, as it involves selecting and transforming raw data into meaningful features that can enhance the accuracy of predictive statistical models. The importance of feature engineering in time series modeling cannot be overstated.

By carefully selecting and transforming relevant features, statistical models become more adept at capturing the underlying patterns and relationships within the data. This ultimately leads to improved forecasting accuracy. Moreover, feature engineering enables the incorporation of domain knowledge and intuition, allowing the model to leverage human expertise and enhance performance.

A typical approach to feature engineering in time series forecasting involves the following types of features:

  • Lagged variables. By incorporating previous time series values as features, patterns such as seasonality and trends can be captured. For example, if we want to predict today’s sales, using lagged variables like yesterday’s sales can provide valuable information about the ongoing trend. 
  • Moving window statistics. This involves aggregating the time series values over a rolling window. By doing so, noise is smoothed out, shifting the focus to the underlying trends. Moving windows can help identify patterns that may not be immediately apparent in the raw data. 
  • Time-based features. Features such as the day of the week, the month of the year, holiday indicators, seasonality, and other time-related patterns can be valuable for predictions. For instance, if certain products tend to have higher average sales on weekends, incorporating the day of the week as a feature can improve the accuracy of the forecasting model.
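
As a quick illustration before the full walkthrough, the sketch below shows all three feature types on a toy daily series (the data and column names here are illustrative, not from the Kaggle dataset used later):

Python
import pandas as pd

# Toy daily sales series (illustrative data only)
toy = pd.DataFrame({
    'date': pd.date_range('2017-01-01', periods=10, freq='D'),
    'sales': [5, 7, 6, 9, 12, 11, 4, 6, 8, 10],
})

# Lagged variable: yesterday's sales
toy['sales_lag_1'] = toy['sales'].shift(1)

# Moving window statistic: 3-day rolling mean
toy['sales_rolling_3'] = toy['sales'].rolling(3).mean()

# Time-based features: day of week and a weekend flag
toy['day_of_week'] = toy['date'].dt.day_name()
toy['is_weekend'] = toy['date'].dt.dayofweek.isin([5, 6]).astype(int)

print(toy)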

In this tutorial, we will walk through examples of these feature types for a well-known time series dataset and discuss using pandas and SQL to manually create these features. The feature definitions we are discussing are common in many time series problems, and the code we provide can certainly be leveraged for other time series modeling projects. 

Dataset 

We are using the Store Sales – Time Series Forecasting competition on Kaggle (https://www.kaggle.com/competitions/store-sales-time-series-forecasting). The competition aims to build a model to predict the sales for the items in the test dataset. 

The train dataset (train.csv) contains the following columns:

  • date: The date of the sale.
  • store_nbr: The store where the sale took place.
  • family: The product family that was sold.
  • onpromotion: The number of items in the product family that were on promotion at that store on that date.
  • sales: The total sales for the product family at that store on that date.

The Kaggle competition data consists of several data source CSV files, including stores.csv, oil.csv, and holidays_events.csv. 

Stores:

  • Contains metadata about different stores.
  • Includes city, state, type, and cluster information.
  • The cluster represents a grouping of similar stores.

Oil:

  • Provides daily oil prices.
  • Includes values for both the train and test data timeframes.
  • The dataset mentions that Ecuador, the country in focus, is highly dependent on oil, and its economic health is vulnerable to shocks in oil prices.

Holidays_events:

  • Contains information about holidays and events and associated metadata.
  • Pay attention to the “transferred” column.
  • If a holiday is marked as transferred, it means it officially falls on a specific calendar day but was moved to another date by the government.
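
To get oriented, the files can be loaded and inspected with pandas. The sketch below assumes the competition CSVs have been extracted to a local data/ folder, matching the paths used in the rest of this tutorial:

Python
import pandas as pd

# Load the competition files
train = pd.read_csv('data/train.csv', parse_dates=['date'])
stores = pd.read_csv('data/stores.csv')
oil = pd.read_csv('data/oil.csv', parse_dates=['date'])
holidays = pd.read_csv('data/holidays_events.csv', parse_dates=['date'])

# Quick sanity check: shape and columns of each source
for name, df in [('train', train), ('stores', stores),
                 ('oil', oil), ('holidays_events', holidays)]:
    print(name, df.shape, list(df.columns))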

We aim to build a machine learning model to predict the sales column for the items in the test dataset. To do that, we must first develop time series features from this complex relational dataset.

Four Important Types of Time-Series Features 

In the following sections, we will create four key features to enhance forecasting models in this Store Sales – Time Series use case.  

  1. First, we will incorporate lag features, including the sales figures from the prior day and two days prior. These features capture the influence of past sales on current values and help identify seasonality and trends.
  2. Next, we will utilize window features by calculating rolling means for the past 14 and 28 days. This approach, performed separately for each product and store segment, enables us to smooth out noise and highlight underlying trends, providing valuable insights into long-term patterns and potential shifts.
  3. Time-based features are crucial for this type of sales forecasting model, given the impact of weekends and holidays on store sales. In this case, we will explore the presence of higher sales volumes on weekends by creating a feature that identifies the day of the week and flags weekends. This time-based pattern recognition allows us to capture temporal dynamics influencing sales behavior.
  4. Lastly, we will combine all three approaches: lag, window, and time-based features, to create a comprehensive set of inputs for our forecasting model. By integrating these features, we aim to capture a wide range of patterns and relationships within the data, maximizing the accuracy of our predictions and facilitating more informed decision-making.

These four feature types (lag, window, time-based, and their combination) will significantly enhance our time series forecasting capabilities. By incorporating the influence of past sales, identifying underlying trends, capturing time-based patterns, and leveraging the synergy between these approaches, we are equipped to uncover valuable insights and optimize our forecasting strategies.

1. Lag Features

Using lagged features is a common technique in time-series analysis and machine-learning applications. By shifting the values of a variable backward or forward in time by a certain number of time periods, lagged features can capture temporal dependencies and trends in the data, providing valuable insights and improving the accuracy of predictive models.

In particular, lagged features are useful in predicting future values of a variable, as they can help identify patterns and relationships between variables over time. They are often used in fields such as finance, economics, and weather forecasting, where accurate predictions of future values are critical.

Several methods exist for generating lagged features, including shifting the values of a variable by a fixed number of time periods or using a rolling window of previous values. The choice of method will depend on the specific application and the data characteristics.

In this example, we will create two lagged features that may be helpful for prediction: the sales one day prior and the sales two days prior. The pandas and SQL code used to generate these features is given below.

Python
import pandas as pd
from sqlite3 import connect
import warnings
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Pandas Version

# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")

# Sort the data
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True) 

# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])

# Define a function to create the lagged feature
def add_lagged_features(group, lags):
    for lag in lags:
        group[f'lagged_sales_{lag}'] = group['sales'].shift(lag)
    return group

# Create a lagged variable for sales
grouped_data = sales_data.groupby(['store_nbr', 'family'], group_keys=False).apply(add_lagged_features, lags=[1, 2])

# Merge the lagged features back into the original DataFrame
sales_data = pd.merge(sales_data, grouped_data[['store_nbr', 'family', 'date', 'lagged_sales_1', 'lagged_sales_2']], on=['store_nbr', 'family', 'date'])

# Print the last 10 rows of the dataset with the lagged variables
print(sales_data[['store_nbr','family', 'date','sales',  'lagged_sales_1', 'lagged_sales_2']].tail(10))


# SQL Version

# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")

# Sort the data
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True) 

# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])

conn = connect(':memory:')
sales_data.to_sql('sales_data_sql', conn)

query = """
SELECT 
  store_nbr, family, date, sales, 
  LAG(sales, 1) OVER (PARTITION BY store_nbr, family ORDER BY date) AS lagged_sales_1,
  LAG(sales, 2) OVER (PARTITION BY store_nbr, family ORDER BY date) AS lagged_sales_2
FROM 
  sales_data_sql
ORDER BY 
  store_nbr, family, date
"""


sales_data_lagged = pd.read_sql(query, conn)
sales_data_lagged.tail(10)


By executing this code, users can manually build the two lagged features: the sales one day prior and the sales two days prior.

         store_nbr   family                 date  sales  lagged_sales_1  lagged_sales_2
3000878         54  SEAFOOD  2017-08-06 00:00:00    0.0             3.0             0.0
3000879         54  SEAFOOD  2017-08-07 00:00:00    0.0             0.0             3.0
3000880         54  SEAFOOD  2017-08-08 00:00:00   12.0             0.0             0.0
3000881         54  SEAFOOD  2017-08-09 00:00:00    4.0            12.0             0.0
3000882         54  SEAFOOD  2017-08-10 00:00:00    2.0             4.0            12.0
3000883         54  SEAFOOD  2017-08-11 00:00:00    0.0             2.0             4.0
3000884         54  SEAFOOD  2017-08-12 00:00:00    1.0             0.0             2.0
3000885         54  SEAFOOD  2017-08-13 00:00:00    2.0             1.0             0.0
3000886         54  SEAFOOD  2017-08-14 00:00:00    0.0             2.0             1.0
3000887         54  SEAFOOD  2017-08-15 00:00:00    3.0             0.0             2.0
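
Note that the earliest rows of each store/family group have no prior values, so their lag columns are NaN (the SQL version returns NULL for the same rows). Before training a model, these rows are typically dropped or imputed; a minimal sketch:

Python
# Drop rows whose lags are undefined (the first two dates of each group)
model_data = sales_data.dropna(subset=['lagged_sales_1', 'lagged_sales_2'])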

2. Rolling Window Statistics

Rolling window features are a widely used technique for feature engineering in time-series analysis and machine learning. By calculating summary statistics, such as the mean or standard deviation, over a sliding window of previous values, rolling window features can capture trends and patterns in the data, providing valuable insights and improving the accuracy of predictive models.

In particular, rolling window features are useful in identifying trends and changes in the data over time, as they can highlight important features such as seasonality, cyclical patterns, or sudden shifts in the data. They are commonly used in fields such as finance, economics, and healthcare, where accurate predictions of future values are crucial.

The choice of window size and summary statistics will depend on the specific application and the characteristics of the data. For example, a smaller window size may capture short-term fluctuations in the data, while a larger window size may capture longer-term trends. Similarly, summary statistics such as the mean or standard deviation capture different aspects of the data and may be more appropriate for different applications.

In this example, we will create rolling means of sales over the past 14 and 28 days for each product and store segment.

Python
# Pandas Version
# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")

# Sort the data
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True) 

# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])

# Define a function to compute the rolling mean for each group
def add_rolling_mean(group, window_sizes):
    for window_size in window_sizes:
        group[f'rolling_mean_{window_size}_days'] = group['sales'].rolling(window_size).mean()
    return group

# Create rolling mean variables for sales over 14- and 28-day windows
grouped_data = sales_data.groupby(['store_nbr', 'family'], group_keys=False).apply(add_rolling_mean, window_sizes=[14, 28])

# Merge the rolling mean features back into the original DataFrame
sales_data = pd.merge(sales_data, grouped_data[['store_nbr', 'family', 'date', 'rolling_mean_14_days', 'rolling_mean_28_days']], on=['store_nbr', 'family', 'date'])

# Print the last 10 rows of the dataset with the rolling mean variables
sales_data[['store_nbr','family', 'date','sales', 'rolling_mean_14_days', 'rolling_mean_28_days']].tail(10)

# SQL Version
# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")

# Sort the data
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True) 

# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])

conn = connect(':memory:')
sales_data.to_sql('sales_data_sql', conn)

query = """
SELECT 
  store_nbr, family, date, sales, 
  AVG(sales) OVER (PARTITION BY store_nbr, family ORDER BY date ROWS BETWEEN 13 PRECEDING AND CURRENT ROW) AS rolling_mean_14_days,
  AVG(sales) OVER (PARTITION BY store_nbr, family ORDER BY date ROWS BETWEEN 27 PRECEDING AND CURRENT ROW) AS rolling_mean_28_days
FROM 
  sales_data_sql
"""

sales_data_rolling = pd.read_sql(query, conn)
sales_data_rolling.tail(10)

By executing this code, users can manually build rolling mean features over the past 14 and 28 days for each product and store segment.

         store_nbr   family        date  sales  rolling_mean_14_days  rolling_mean_28_days
3000878         54  SEAFOOD  2017-08-06    0.0              3.071429              3.000000
3000879         54  SEAFOOD  2017-08-07    0.0              3.000000              2.642857
3000880         54  SEAFOOD  2017-08-08   12.0              3.357143              2.964286
3000881         54  SEAFOOD  2017-08-09    4.0              3.428571              3.071429
3000882         54  SEAFOOD  2017-08-10    2.0              3.428571              3.142857
3000883         54  SEAFOOD  2017-08-11    0.0              3.142857              3.142857
3000884         54  SEAFOOD  2017-08-12    1.0              2.928571              3.107143
3000885         54  SEAFOOD  2017-08-13    2.0              2.785714              3.000000
3000886         54  SEAFOOD  2017-08-14    0.0              2.500000              2.892857
3000887         54  SEAFOOD  2017-08-15    3.0              2.500000              2.785714
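
The matplotlib imports at the top of the script can be used to see how the rolling means smooth the raw series. Below is a sketch for a single segment (store 54, SEAFOOD, matching the sample output above), assuming the DataFrame produced by the pandas version:

Python
# Plot daily sales against the 14- and 28-day rolling means for one segment
segment = sales_data[(sales_data['store_nbr'] == 54) & (sales_data['family'] == 'SEAFOOD')]

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(segment['date'], segment['sales'], alpha=0.4, label='daily sales')
ax.plot(segment['date'], segment['rolling_mean_14_days'], label='14-day rolling mean')
ax.plot(segment['date'], segment['rolling_mean_28_days'], label='28-day rolling mean')
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=3))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
ax.legend()
plt.show()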

3. Time-based features

DateTime feature engineering is the process of creating new features from date and time information in order to improve predictive accuracy in machine learning models. This technique is especially useful in time-series analysis, where time-related patterns are important, such as in finance, e-commerce, healthcare, and social media.

One common example of datetime feature engineering is extracting the day of the week, hour of the day, or month of the year. These features can capture the seasonality and temporal patterns in the data. For example, a time-series dataset of e-commerce sales may show higher sales volume on weekends than on weekdays, or spikes around holidays such as Christmas.

In this example, we will create the features “day_of_week” and “is_weekend”.

Python
# Pandas Version

# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")

# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])

# Sort the dataset by date in ascending order
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True) 

# Extract the day of the week from the date column
sales_data['day_of_week'] = sales_data['date'].dt.day_name()

# Create a new binary column for whether the day is a weekend or not
sales_data['is_weekend'] = sales_data['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)

# Print the last 10 rows of the dataset with the is_weekend variable
sales_data[['store_nbr','family', 'date','sales', 'day_of_week','is_weekend']].tail(10)

# SQL Version

# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")

# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])

# Sort the dataset by date in ascending order
sales_data.sort_values(by=['store_nbr','family', 'date'], inplace=True) 

conn = connect(':memory:')
sales_data.to_sql('sales_data_sql', conn)

query = """
SELECT 
  store_nbr, family, date, sales, strftime('%w', date) as day_of_week
FROM 
  sales_data_sql
"""

sales_data_day_of_week = pd.read_sql(query, conn)
sales_data_day_of_week.to_sql('sales_data_day_of_week_sql', conn)

query = """
SELECT 
  store_nbr, family, date, sales, 
  day_of_week,
  CASE 
    WHEN day_of_week IN ('0', '6') THEN 1 
    ELSE 0 
  END AS is_weekend
FROM 
  sales_data_day_of_week_sql
"""

sales_data_weekend = pd.read_sql(query, conn)
sales_data_weekend.tail(10)

By executing this code, users can manually build the “day_of_week” and “is_weekend” features. (The sample output below comes from the pandas version; the SQL version encodes the day of the week as '0'–'6', with 0 for Sunday, rather than as day names.)

         store_nbr   family        date  sales  day_of_week  is_weekend
1285214         54  SEAFOOD  2017-08-06    0.0       Sunday           1
1286996         54  SEAFOOD  2017-08-07    0.0       Monday           0
1288778         54  SEAFOOD  2017-08-08   12.0      Tuesday           0
1290560         54  SEAFOOD  2017-08-09    4.0    Wednesday           0
1292342         54  SEAFOOD  2017-08-10    2.0     Thursday           0
1294124         54  SEAFOOD  2017-08-11    0.0       Friday           0
1295906         54  SEAFOOD  2017-08-12    1.0     Saturday           1
1297688         54  SEAFOOD  2017-08-13    2.0       Sunday           1
1299470         54  SEAFOOD  2017-08-14    0.0       Monday           0
1301252         54  SEAFOOD  2017-08-15    3.0      Tuesday           0
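
Holiday indicators, mentioned earlier, are another natural time-based feature for this dataset. The sketch below builds a simple national-holiday flag from holidays_events.csv; it assumes the column names from the Kaggle file (date, type, locale, transferred) and deliberately ignores regional and local holidays for brevity:

Python
# Load the holiday metadata
holidays = pd.read_csv('data/holidays_events.csv', parse_dates=['date'])

# Keep national holidays that were actually observed on their calendar date
national = holidays[(holidays['locale'] == 'National')
                    & (holidays['type'] == 'Holiday')
                    & (~holidays['transferred'])]

# Flag sales dates that fall on an observed national holiday
sales_data['is_holiday'] = sales_data['date'].isin(national['date']).astype(int)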

4. Complex Feature

For real-world use cases, the most important features are often complex ones that combine the previous three feature types. Consider the following feature:

Average weekly (rolling window) sales in the same week number (date time) last year (lag)

Python
# Pandas Version

# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")

# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])

# Extract week number and year from date column
sales_data['week_number'] = sales_data['date'].dt.isocalendar().week
sales_data['year'] = sales_data['date'].dt.year
sales_data['last_year'] = sales_data['year'] - 1

# Compute average sales for each week number and year combination
weekly_sales = sales_data.groupby(['store_nbr', 'family', 'week_number', 'year'])['sales'].mean().reset_index().rename(columns={'sales': 'avg_sales_last_year_week'})

# Compute week number and year for same week last year
weekly_sales['join_year'] = weekly_sales['year']
weekly_sales['last_week_number'] = weekly_sales['week_number']

# Merge average sales for same week last year to original data
sales_data = pd.merge(sales_data, weekly_sales[['store_nbr', 'family', 'join_year', 'last_week_number', 'avg_sales_last_year_week']], how='left', 
left_on=['store_nbr', 'family', 'week_number', 'last_year'], right_on=['store_nbr', 'family', 'last_week_number', 'join_year'])

sales_data.tail(10)

# SQL Version

# Load the Sales dataset
sales_data = pd.read_csv("data/train.csv")
# Convert the date column to a datetime type
sales_data['date'] = pd.to_datetime(sales_data['date'])

# Extract week number and year from date column
sales_data['week_number'] = sales_data['date'].dt.isocalendar().week
sales_data['year'] = sales_data['date'].dt.year
sales_data['last_year'] = sales_data['year'] - 1

conn = connect(':memory:')
sales_data.to_sql('sales_data_sql', conn)


query = """
SELECT 
    s.*, 
    t.avg_sales_last_year_week
FROM 
    sales_data_sql s
    INNER JOIN (
        SELECT 
            sales_data_sql.store_nbr, 
            sales_data_sql.family, 
            strftime('%Y', sales_data_sql.date) as year1, 
            strftime('%W', sales_data_sql.date) as week, 
            AVG(s_prev.sales) AS avg_sales_last_year_week
        FROM 
            sales_data_sql
            INNER JOIN sales_data_sql s_prev ON sales_data_sql.store_nbr = s_prev.store_nbr AND sales_data_sql.family = s_prev.family
             AND 
                strftime('%Y',date( sales_data_sql.date, '-1 year')) = strftime('%Y',s_prev.date) 
             AND
                strftime('%W',sales_data_sql.date) = strftime('%W',s_prev.date)
        GROUP BY 
            sales_data_sql.store_nbr, 
            sales_data_sql.family, 
            strftime('%Y', sales_data_sql.date), 
            strftime('%W', sales_data_sql.date)
    ) t ON s.store_nbr = t.store_nbr AND s.family = t.family AND strftime('%Y', s.date) = t.year1 AND strftime('%W', s.date) = t.week
ORDER BY 
    s.store_nbr, 
    s.family, 
    s.date
"""

avg_sales_last_year_week = pd.read_sql(query, conn)
avg_sales_last_year_week.tail(10)

By executing this code, users can manually build the feature “Average weekly (rolling window) sales in the same week number (datetime) last year (lag)”. Note that the SQL version uses an inner join, so rows with no matching week in the prior year are dropped, whereas the pandas version’s left join keeps them with missing values.

              id        date  store_nbr  family  sales  onpromotion  week_number  year  last_year  join_year  last_week_number  avg_sales_last_year_week
1703633  1703633  2015-08-17         10   DAIRY  218.0            1           34  2015       2014     2014.0              34.0                244.571429
1705415  1705415  2015-08-18         10   DAIRY  200.0            0           34  2015       2014     2014.0              34.0                244.571429
1707197  1707197  2015-08-19         10   DAIRY  190.0           50           34  2015       2014     2014.0              34.0                244.571429
1708979  1708979  2015-08-20         10   DAIRY  183.0            0           34  2015       2014     2014.0              34.0                244.571429
1710761  1710761  2015-08-21         10   DAIRY  178.0            0           34  2015       2014     2014.0              34.0                244.571429
1712543  1712543  2015-08-22         10   DAIRY  227.0            0           34  2015       2014     2014.0              34.0                244.571429
1714325  1714325  2015-08-23         10   DAIRY  268.0            0           34  2015       2014     2014.0              34.0                244.571429

As you can see, developing this single feature requires fairly complex pandas and SQL programming. When developing complex features like this, we need to test the pandas and SQL code properly before implementing it in machine learning models.
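
One simple form of such testing is to check that the pandas and SQL implementations of a feature agree row by row. A minimal sketch for the lag features from section 1 (assuming sales_data from the pandas version and sales_data_lagged from the SQL version are both in memory):

Python
# Align both versions on the key columns, then compare the feature values
pandas_version = sales_data.sort_values(['store_nbr', 'family', 'date']).reset_index(drop=True)
sql_version = sales_data_lagged.sort_values(['store_nbr', 'family', 'date']).reset_index(drop=True)

for col in ['lagged_sales_1', 'lagged_sales_2']:
    mismatches = (pandas_version[col].fillna(-1) != sql_version[col].fillna(-1)).sum()
    print(f'{col}: {mismatches} mismatching rows')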

Conclusion

This tutorial demonstrates the creation of lag features, rolling window statistics, and time-based features on the Kaggle Time Series Dataset. Lagged versions of the target variable can be utilized to predict future values, such as using yesterday’s sales to forecast today’s sales. Rolling window statistics, such as moving averages, enable us to capture trends and minimize noise in the data. Additionally, time-based features, such as the hour, day, month, or season, offer valuable information related to timing and can be leveraged to uncover patterns, like higher average sales of specific products on weekends. By employing these techniques, data scientists can enhance their predictive models and gain a deeper understanding of their datasets.

Data science teams often find themselves caught in the time-consuming trap of manual feature engineering. The process of feature discovery is notorious for being slow and repetitive, demanding extensive domain knowledge. However, there is a revolutionary solution to this predicament in the form of dotData’s Feature Factory, an innovative feature development workbench. Feature Factory significantly expedites this process by automatically suggesting feature spaces derived from relational data. It jump-starts the exploration phase, saving valuable time for data scientists. Request a product walkthrough today. 

Joshua Gordon

Dr. Joshua Gordon is a data scientist specializing in statistical modeling, machine learning, and deep learning technologies. While earning his Ph.D. in Statistics from the University of California, Los Angeles, he published six articles focused on geospatial time series algorithms for earthquake prediction as well as residual analysis for spatial-temporal models. Joshua has over 10 years of experience working as a data scientist in the insurance and manufacturing industries prior to joining dotData as a Senior Data Scientist. At dotData, he leads customer-facing data science projects and is responsible for customers’ success with dotData technologies.
