Solving the Challenge of Time-Series Feature Engineering with Automation
Predictive analytics using time-series data is a widespread Machine Learning (ML) problem for real-world applications like churn prediction, demand forecasting, and preventative maintenance. This problem is challenging and often requires many data scientists and data engineers and countless hours spent on handling data. The data challenge is more pronounced when dealing with time-series transactional data combined with additional dimensional/master data sets.
KKBox’s Churn Prediction Challenge is one of the most popular time-series machine learning challenges on Kaggle developed from a real-world use case. KKBox is a leading music streaming service in Asia, similar to Spotify. Because it’s easier to retain clients than to acquire new ones, KKBox is interested in predicting which customers are likely to cancel their service or “churn.” Customers likely to churn may be offered incentives like coupons, a free month of service, upgraded functionality, etc., to minimize the overall churn volume for KKBox.
One challenge for this use case is the extremely large volume of time series data which is inherently difficult to analyze. Additionally, although incorporating dimensional information with time-series data may provide greater predictability for the predictions, this process is very time-consuming and requires a lot of domain expertise to properly leverage.
The typical approach to solve this data science problem is manual and requires weeks, often months of intensive work. We have applied dotData automation technology to accelerate the solution timeline and improve results to accelerate the process. In this case, dotData spent just one day using modern automated feature engineering (autoFE) and automated machine learning (autoML) technology. We automatically generated an accurate prediction pipeline using dotData’s autoFE/ML Python module from multiple, complex, unprepared input tables. dotData explored thousands of difficult, intelligent, and interpretable features, including automatically discovered aggregations and time windows as part of the feature engineering process.
dotData then explored and optimized 23 different ML algorithms without any user inputs. Ultimately, dotData generated a solution that ranks 12th out of 575 competitors, with only a few hours of manual work – all performed in just one day. Below we describe the problem and the process involved. This competition lasted for three months, with participants, by comparison, often spending weeks or months developing their predictions.
Why is this use case particularly challenging?
- The data set includes multiple tables with large volumes of transactional data, making the resultant feature engineering process highly complex. There were four tables for this dataset:
- Target table (Historical churn table)
The core table contains historical data for prediction targets – user IDs and a churn flag indicating whether a particular customer has churned. The number of samples is about 1 million records for training and test data, respectively, representing one month of churn history.
- Transaction data
Each record represents a financial transaction for a given user, containing nine columns such as payment date, amount, etc. The number of samples is 22 million records, covering two years of history.
- User log
Each record represents user activity on the streaming service, including eight columns such as the visit timestamp, length of stay, % of songs completed, etc. The number of samples is 410 million records to cover two years of history.
- Membership data
The membership table includes demographic information with six columns, including age, city, gender, and signup date. The number of samples is about 7 million records to cover for 14 years of history.
Before applying machine learning algorithms, we must join the large tables into a single flat table, a process commonly known as feature engineering. However, because these joins are not one-to-one join, creating the joins can be difficult and often requires specific domain knowledge to understand how to properly combine the tables for the most appropriate context and value.
- Data leakage due to time series data: It is important to avoid including data that would not be available at the prediction time. While excluding some data is crucial, we want to use as much relevant information as possible for training.
How data scientists manually solve this problem
Data scientists address these issues in three steps:
- Perform extensive exploratory data analysis to understand the data and generate feature hypotheses
- Build complex ETL pipelines to join multiple tables and aggregate time-series
- Validate the time-related feature hypotheses by building ML models
The manual process described above may seem simple but is, in fact, iterative, time-consuming, and error-prone – especially when combining multiple tables with different time ranges or time resolutions. Another challenge is that while completing the sequence of steps, data scientists often run into a small subset of features that are too predictive to be practically useful for the problem of interest – this could be due to data leakage. Identifying and avoiding leaky features, especially for time series data is often hard, which makes the data science process more complex. Obviously, this type of manual approach in solving data science problems is difficult to scale and standardize.
How dotData solves this problem using automation
dotData addresses these challenges through an intelligent automation process:
- Transform the target table by adding a DateTime column for time series feature engineering:
dotData uses a time-entity-relationship concept to deal with complex time series problems. This time-entity-relationship ensures two things: (1) that we can use all available data with no leakage (2) features are generated and aggregated in intelligent ways relative to the time of the prediction (e.g., The mean/max/median/standard deviation of a user’s length of logins over the last week, month, 10 weeks or year). To leverage dotData’s DateTime capabilities, it requires a DateTime column added to the original target table to represent the time when we evaluate whether a particular user had churned (i.e., did user 1231 churn in Jan 2017).
- Establish relationships between source and target tables
For dotData to automatically perform data combination across different tables and data aggregation along time series attributes, dotData needs the relationship between different tables, especially between target tables and individual source tables. As illustrated below, establishing connections between the target table and other tables will define the essential logic for table joins.
For tables related to each other through categorical attributes, we must define a shared key used to connect the tables for tables related to each other through categorical attributes. For time-entity-relationships, we also have to define a time window that we want to use when searching for features. For example, we might want to search for data in the last three years before the prediction date for each record in weekly increments to find the optimal time-based features.
dotData automatically generates features and builds a model without human intervention with shared keys and time windows defined. During this process, dotData will automatically perform five critical tasks:
- Join, clean, normalize, null-input, and encode the data
- Evaluate 20k+ interpretable features
- Chose the 500 best-performing features
- Explore and tune 574 models and 15 ML algorithms, choosing the best 28 for review
- Compute metrics and statistics for all features and models
Results and conclusions
With just six hours of combined learning, configuration, and software operation, dotData placed 12th out of 575 participants:
In addition to providing highly accurate modeling results, dotData automatically built highly complex time-series features without requiring domain knowledge or inputs from domain experts. Below is a small sample of dotData derived features:
Users that use payment method “39” for most of their transactions in the past 60 days are more likely to churn.
If users sign up for the service through campaign “7” during the past 90 days, they will not churn this month.
To manually build these features, a data scientist would need to write SQL queries similar to the ones shown below:
SQL Query for Feature 1
SQL Query for Feature 2
It’s important to note that, while the features shown above seem intuitive and straightforward, dotData has automatically explored thousands of features similar to these. A significant benefit of the automated discovery and evaluation of features is that it minimizes the chances that potentially important features will go undiscovered simply because the data scientist overlooked the possible impact of the sign-up channel on churn or the 60-day time range and its association to payment type.
Equally important is the impact of the manual coding process on feature output. Suppose a data scientist has to generate features manually. In that case, not only do they need to write the SQL queries, they must do so while hypothesizing the value of appropriate time ranges (one week, one month, half a year, etc.), the type of aggregation, and determining the optimal combination for a model. Of course, validating these hypotheses requires building the model and looking at feature importance metrics. This lengthy and repetitive process is equally likely to result in valuable results as it is to provide a dead end.
Automation technology can significantly boost performance, productivity, and efficiency when solving highly complex time series problems and achieving expert-level accuracy. The process required only minimal data preparation, and no subject matter expertise was sought or provided. The automation of the entire pipeline from processing multi-table to generating features to building machine learning models occurred in just six hours.
The end goal of automation is not to displace or replace the data scientist. However, when properly applied, automation provides a decisive breakthrough in helping data scientists productively explore a vast array of possibilities available in any given data set in just a few hours. Through automation, the manual and iterative experimentation process inherent in data science can be minimized while maintaining a high level of accuracy performance. As a result data scientists can focus more efforts on identifying worthy business problems and delivering impacts.
Learn more about how dotData can help automate your Feature Engineering process with a personalized demo