The Hard Truth about Manual Feature Engineering
The past decade has seen rapid adoption of Artificial Intelligence (AI) and Machine Learning (ML) across industries, with many successful use cases. Beyond the “cute” factor of AI that can automatically tell a dog from a cat in photos, these technologies are already deployed in real-world applications to generate impactful business outcomes.
AI and Machine Learning predict lending risk, provide product recommendations, analyze customer churn behavior, and manage inventories. While technology giants like Google and Amazon continue to reap the benefits of AI, many traditional businesses still struggle to adopt it. One key challenge is that their data is often not ready for AI/ML, and preparing it would take more time and effort than they can afford.
ML input data must be a single flat table, but real-world data is often spread across different tables and multiple databases. Combining data from these disparate tables, generating new attributes in the process, and selecting the attributes essential for downstream AI/ML modeling is known as feature engineering.
To understand what feature engineering is and why it is complex, we will walk through the concept using a real-world example.
While the specific data tables relate to a banking churn case, the challenges of manual feature engineering we explore here are universal. The datasets relevant for the analysis are spread across the ten tables shown below.
A realistic set of tables typically used in Feature Engineering
The first task is to identify our target table. Among these ten tables, the “credit card churn” table contains historical churn information, which can serve as the prediction target. The other nine tables are source information that can be used to build features for the prediction model.
If we were to pass these ten tables directly to a machine learning algorithm, whether a traditional statistical learning model or a state-of-the-art deep learning model, it would not work. Machine learning requires the model’s input to be a single flat table, in which the prediction target and the predictors (features) are assembled together.
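To make this requirement concrete, here is a minimal sketch, with purely hypothetical column names, of what a model-ready flat table looks like: one row per customer, with the churn target and the feature columns side by side.

```python
import pandas as pd

# A model-ready "flat table": one row per customer, with the prediction
# target and all predictors (features) in the same table.
# Column names are hypothetical and for illustration only.
flat_table = pd.DataFrame({
    "customer_id":         [13515, 13516, 13517],
    "age":                 [34, 51, 42],
    "payments_last_3m":    [3, 1, 6],
    "avg_balance_last_1m": [1200.0, 350.5, 8800.0],
    "churned":             [1, 0, 0],   # prediction target
})

X = flat_table.drop(columns=["customer_id", "churned"])  # features
y = flat_table["churned"]                                # target
# X and y can now be handed to any standard ML library, e.g. model.fit(X, y)
```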
Combining these ten tables into the single flat table that machine learning can use directly is feature engineering. Now let’s see how we can generate features from this dataset.
Given the complexity of this data, we do not combine all ten tables at once. Instead, we start with two tables – the target table and one source table. In this example, we first analyze the customer’s demographic information, which is why the ‘Customer Information’ table serves as the first source table.
By examining the relationship between the target table (credit card churn) and the ‘Customer Information’ table, we see that for every customer in the target table, there is only one record containing demographic information in the ‘Customer Information’ table. The “one-to-one” relationship between these two tables makes this two-table feature engineering process fairly straightforward. We can join these two tables together via the “Customer ID” key. In doing so, we copy customer demographic information onto the target table according to each “Customer ID,” and the feature table is easily constructed.
Two tables joined into a feature table
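As a minimal sketch of this one-to-one join, assuming the two tables have already been loaded as pandas DataFrames (the column values below are made up for illustration):

```python
import pandas as pd

# Target table: historical churn information, one row per customer.
churn = pd.DataFrame({
    "customer_id": [13515, 13516, 13517],
    "churned":     [1, 0, 0],
})

# Source table: customer demographics, also one row per customer.
customer_info = pd.DataFrame({
    "customer_id": [13515, 13516, 13517],
    "age":         [34, 51, 42],
    "state":       ["NY", "CA", "TX"],
})

# One-to-one relationship: a left join on the key simply copies each
# customer's demographics onto the target table.
feature_table = churn.merge(customer_info, on="customer_id", how="left")
print(feature_table)
```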
Building features from only two tables is manageable. However, as you can easily imagine, a machine learning model built on these features alone would not perform well: at this point, we have entirely ignored customers’ credit card usage, bank account information, and other helpful signals.
Next, we will add information related to credit card usage, the “Credit Card Payment” table, to our features. Let us combine the “Credit Card Payment” table with the previous two tables and move on to three-table feature engineering.
A more complex three-table join
The “Credit Card Payment” table is distinctly different from the “Customer Info” table because it contains time-series data. The presence of time-series data creates a “one-to-one-to-many” relationship among the three tables, making three-table feature engineering more complex. For the same “Customer ID,” a new record is inserted into this table every time a credit card payment occurs. Because of this time-series nature, we cannot simply copy the “Credit Card Payment” table onto the “credit card churn” table. Instead, we must reduce the “Credit Card Payment” records to a single record per customer before they can be joined to the “credit card churn” table.
The most common way to reduce time-series data is to apply aggregation functions over a specific time range. Aggregation functions include traditional statistics such as sum, standard deviation, min, max, and average. The time range can be the previous day, the previous week, three weeks, one month, and so on. However, selecting the right aggregation function and time range is not trivial.
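A minimal pandas sketch of this reduction step is shown below. The schema, the 30-day window, and the aggregation choices are illustrative assumptions, not the actual tables from the example:

```python
import pandas as pd

# Hypothetical time-series source table: one row per credit card payment.
payments = pd.DataFrame({
    "customer_id":  [13515, 13515, 13515, 13516, 13516],
    "payment_date": pd.to_datetime(
        ["2022-02-10", "2022-03-05", "2022-03-18", "2022-02-20", "2022-03-15"]),
    "amount":       [120.0, 80.0, 45.0, 300.0, 150.0],
})

# Keep only the chosen time range (payments in the 30 days before a fixed
# reference date), then aggregate down to a single row per customer.
reference_date = pd.Timestamp("2022-03-22")
in_window = payments[
    (payments["payment_date"] < reference_date)
    & (payments["payment_date"] >= reference_date - pd.Timedelta(days=30))
]

features = in_window.groupby("customer_id")["amount"].agg(
    payment_sum_30d="sum",
    payment_avg_30d="mean",
    payment_max_30d="max",
    payment_count_30d="count",
).reset_index()
print(features)
```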
Another unique challenge of time-series data is time leakage. As seen in the data below, if we want to predict whether the customer with “Customer ID” 13515 will churn on 3/22, we can only use information from the “Credit Card Payment” table up to 3/22 for that prediction. However, the “Credit Card Payment” table contains records for customer 13515 with due dates as late as 4/31/2022. If we were to include such records in our features, we would be using information unavailable at prediction time, creating time leakage. To avoid it, we have to carefully select the correct time range for each customer before applying time aggregation, adding a layer of complexity beyond the aggregations themselves.
Combining three tables requires a fair amount of manual work and domain knowledge
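To make the leakage guard described above concrete, here is a hedged sketch (hypothetical dates and columns) in which each customer gets their own prediction cutoff, and only payment records dated strictly before that cutoff are allowed into the aggregation:

```python
import pandas as pd

# Hypothetical target table: each customer has a prediction (churn) date.
churn = pd.DataFrame({
    "customer_id": [13515, 13516],
    "churn_date":  pd.to_datetime(["2022-03-22", "2022-04-10"]),
    "churned":     [1, 0],
})

# Hypothetical payment records; some fall AFTER a customer's churn date.
payments = pd.DataFrame({
    "customer_id": [13515, 13515, 13516, 13516],
    "due_date":    pd.to_datetime(["2022-03-10", "2022-04-30",
                                   "2022-03-01", "2022-04-05"]),
    "amount":      [80.0, 45.0, 300.0, 150.0],
})

# Attach each customer's cutoff date, then drop any record that would not
# have been available at prediction time (this is the leakage guard).
merged = payments.merge(churn[["customer_id", "churn_date"]], on="customer_id")
safe = merged[merged["due_date"] < merged["churn_date"]]

# Aggregate the leakage-free records and join them back onto the target table.
agg = safe.groupby("customer_id")["amount"].agg(
    payment_sum="sum", payment_count="count").reset_index()
feature_table = churn.merge(agg, on="customer_id", how="left")
print(feature_table)
```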
If a user has substantial domain knowledge of this dataset, the first step in three-table feature engineering is to think through the business context of the data and ask relevant business questions about, for instance, each customer’s payment behavior over recent months.
We can then carefully convert each business question into a set of SQL statements to generate features like the ones shown in the table below. Even though this is a very complex and manual process, three-table feature engineering dealing with “one-to-one-to-many” relationships is still possible.
A possible feature table based on three tables
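The article describes doing this with SQL; the sketch below is a rough pandas equivalent, with all names hypothetical. Each business question becomes one aggregation, and the results are joined back onto the target table, reusing the churn, customer_info, and leakage-safe payments DataFrames from the earlier sketches:

```python
import pandas as pd

def build_three_table_features(churn, customer_info, payments):
    """Assemble a three-table feature table: target + demographics + payment aggregates."""
    # Each business question ("How much did the customer pay in total?",
    # "How many payments did they make?") maps to one aggregation.
    payment_feats = payments.groupby("customer_id").agg(
        total_paid=("amount", "sum"),
        num_payments=("amount", "count"),
        max_payment=("amount", "max"),
    ).reset_index()

    # One-to-one join for demographics, reduced one-to-many join for payments.
    return (churn
            .merge(customer_info, on="customer_id", how="left")
            .merge(payment_feats, on="customer_id", how="left"))
```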
To further improve the model performance, we should consider additional customer-specific information, like each customer’s overall financial situation, which is captured in the “Account Balance” table.
The “Account Balance” table is another time-series dataset. Three-table feature engineering involving one time-series table is complex enough; adding a second time-series table makes four-table feature engineering even harder. When dealing with two time-series datasets simultaneously, the possible ways of aggregating over time across different tables, while still avoiding time leakage, quickly become unmanageable.
Adding tables makes the model more useful but increases complexity exponentially
We can still attempt complete four-table feature engineering manually by asking further business questions, for example, how a customer’s account balance has trended relative to their credit card payments.
However, given the complexity of these four tables, systematically exploring different data combinations and developing statistically correct (non-leaky) features for a predictive model is practically impossible for any data scientist.
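To see why, a back-of-the-envelope count helps. The numbers below are illustrative assumptions rather than the actual schema, but even a handful of choices per dimension multiplies into hundreds of candidate features that would each need to be written, validated, and checked for leakage by hand:

```python
# Illustrative counts only: a few choices per dimension already multiply
# into hundreds of candidate features from just two of the nine source tables.
time_series_tables = 2          # Credit Card Payment, Account Balance
numeric_columns_per_table = 5   # e.g. amount, balance, fees, ...
aggregation_functions = 6       # sum, mean, min, max, std, count
time_windows = 5                # 1 day, 1 week, 3 weeks, 1 month, 3 months

candidates = (time_series_tables * numeric_columns_per_table
              * aggregation_functions * time_windows)
print(candidates)  # 300 candidate features, before any cross-table combinations
```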
The real challenge is that we are provided with ten tables capturing a much broader spectrum of information. Each piece of information is conceivably useful for predicting customer behavior, but exactly how it should be combined with the previous four tables is unclear. Exploiting the richness of all ten tables, at this complexity and scale, is what makes real-world feature engineering so difficult.
Building feature tables manually from a real-world data set like the one above is practically impossible
We often deal with complex relational data in real-world data science problems. Manually developing features from this kind of data is usually far more complex than one might expect. Automated feature engineering is a newer technology designed to solve this problem, and we encourage data scientists to explore programmatic feature engineering when facing complex feature engineering tasks.
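As a rough illustration of what programmatic feature engineering automates, the sketch below (hypothetical schema, plain pandas) loops over aggregation functions and time windows to generate leakage-safe candidate features in bulk; this is essentially the loop a data scientist would otherwise hand-write one feature at a time:

```python
import pandas as pd

def generate_candidate_features(target, source, key, time_col, value_col,
                                cutoff_col, windows_days, agg_funcs):
    """Generate leakage-safe aggregate features from one time-series source table."""
    # Attach each row's prediction cutoff and drop records at or after it.
    merged = source.merge(target[[key, cutoff_col]], on=key)
    merged = merged[merged[time_col] < merged[cutoff_col]]

    features = target[[key]].copy()
    for days in windows_days:
        in_window = merged[
            merged[time_col] >= merged[cutoff_col] - pd.Timedelta(days=days)]
        for func in agg_funcs:
            name = f"{value_col}_{func}_{days}d"
            agg = (in_window.groupby(key)[value_col]
                   .agg(func).rename(name).reset_index())
            features = features.merge(agg, on=key, how="left")
    return features

# Example usage with the hypothetical tables from the earlier sketches:
# candidate_features = generate_candidate_features(
#     churn, payments, key="customer_id", time_col="due_date",
#     value_col="amount", cutoff_col="churn_date",
#     windows_days=[7, 30, 90], agg_funcs=["sum", "mean", "max", "count"])
```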
Learn more about how your organization could benefit from the powerful features of dotData by signing up for a demo.