
The Hard Truth about Manual Feature Engineering


The past decade has seen rapid adoption of Artificial Intelligence (AI) and Machine Learning (ML) across industries and for many successful use cases. Beyond AI’s “cute” applications, such as automatically telling a dog from a cat in photos, these technologies are already deployed in real-world applications to generate impactful business outcomes.

AI and Machine Learning predict lending risk, provide product recommendations, analyze customer churn behavior, and manage inventories. While technology giants like Google and Amazon continue to harvest the benefits of AI, many traditional businesses are still struggling to adopt it. One key challenge is that their data is often not ready for AI/ML, and preparing it would take more time and effort than they can afford.

ML input data must be a single flat table, but real-world data is often distributed across different tables and multiple databases. Combining data from disparate tables, generating new attributes from this combination process, and selecting the essential attributes for downstream AI/ML modeling is known as feature engineering.

To understand what feature engineering is and why it is complex, we will walk through the concept using a real-world example.

While the specific tables relate to a banking churn case, the challenges of manual feature engineering that we are exploring are universal. In this example, the datasets relevant for the analysis are spread across ten different tables:

  • Credit card churn
  • Banking account information
  • Bank account balance information
  • Credit account information
  • Credit card payment information
  • Web browsing log
  • Credit score
  • Customer information
  • Branch information
  • Marketing promotion information

A realistic ten-table data set typically used in feature engineering

The first task is to identify our target table. In these ten tables, the “credit card churn” table has historical churn information, which can be used as a prediction target. The other nine tables are considered source information that can be used to build features to construct such a prediction model. 

If we were to pass these ten tables directly to a machine learning algorithm, whether a traditional statistical learning model or a state-of-the-art deep learning model, it would not work, simply because machine learning requires the input to be a single flat table in which the prediction target and the predictors (features) are assembled together.
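As a minimal illustration of this constraint (the table, columns, and values below are hypothetical placeholders, not the actual banking schema), a typical library such as scikit-learn only ever consumes one flat table, split into a feature matrix and a target column:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# A single flat table: one row per customer, with the prediction target and
# the predictors (features) assembled side by side. All names and values
# here are hypothetical placeholders.
flat_table = pd.DataFrame({
    "customer_id":       [1, 2, 3, 4],
    "age":               [34, 52, 28, 41],
    "avg_monthly_spend": [420.0, 110.5, 0.0, 980.2],
    "churned":           [0, 1, 1, 0],   # prediction target
})

X = flat_table.drop(columns=["customer_id", "churned"])  # features
y = flat_table["churned"]                                # target

# The model only ever sees this one flat table, split into X and y.
model = LogisticRegression().fit(X, y)
```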

Manual Feature Engineering Process by Number of Tables

Combining these ten tables into a single flat table that can be used directly by machine learning is feature engineering. Now let’s see how we can generate features from this dataset.

Start with two tables 

Given the complexity of this data, we do not combine all ten tables at once. Instead, we start with two tables: one target table and one source table. In this example, we will first analyze the customers’ demographic information, which is why we use the ‘Customer Information’ table as the first source table.

By examining the relationship between the target table (credit card churn) and the ‘Customer Information’ table, we see that for every customer in the target table, there is only one record containing demographic information in the ‘Customer Information’ table. The “one-to-one” relationship between these two tables makes this two-table feature engineering process fairly straightforward. We can join these two tables together via the “Customer ID” key. In doing so, we copy customer demographic information onto the target table according to each “Customer ID,” and the feature table is easily constructed.
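As a rough sketch of this step, assuming hypothetical column names and using pandas rather than SQL, the one-to-one join might look like this:

```python
import pandas as pd

# Target table: one row per customer with the historical churn label.
# (Hypothetical columns; the IDs echo the example used later in this post.)
churn = pd.DataFrame({
    "customer_id": [13515, 13516, 13517],
    "churn_date":  pd.to_datetime(["2022-03-22", "2022-04-05", "2022-02-10"]),
    "churned":     [1, 0, 1],
})

# Source table: exactly one row of demographic information per customer.
customer_info = pd.DataFrame({
    "customer_id": [13515, 13516, 13517],
    "age":         [42, 35, 58],
    "occupation":  ["engineer", "teacher", "entrepreneur"],
})

# One-to-one relationship: a plain left join on the "Customer ID" key copies
# the demographic attributes onto the target table.
feature_table = churn.merge(customer_info, on="customer_id", how="left")
print(feature_table)
```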

Two tables joined into a single feature table

Building features from only two tables is manageable. However, as you can easily imagine, a machine learning model built on these features alone would not perform well: at this point, we have entirely ignored customers’ credit card usage, bank account information, and other helpful information.

Move on to three tables

Next, we will add some additional information related to credit card usage, which is the “Credit Card Payment” information, into our features. Now let us combine the “Credit Card Payment” table with the previous two tables and move on to three-table feature engineering.  

A more complex three-table join that adds time-series data

The “Credit Card Payment” table is distinctly different from the “Customer Info” table because it contains time series data. The presence of time series data creates a “one-to-one-to-many” relationship among these three tables, making the three-table feature engineering more complex. For the same “Customer ID,” every time a credit card payment happens, a new record is inserted into this table. Because of this data’s time series nature, it is impossible to directly copy the “Credit Card Payment” table onto the “credit card churn” table. Instead, we have to reduce the records in the “Credit Card Payment” table to a single record per customer before it can be joined to the “credit card churn” table.

The most common way to reduce time series data is to apply aggregation functions over a specific time range. The aggregation functions include traditional statistical functions like sum, std, min, max, and avg. The time range can be the previous day, the previous week, three weeks, one month, and so on. However, selecting the right aggregation functions and time ranges is not trivial.
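As a hedged sketch of what such time-windowed aggregation might look like in pandas (the table layout, column names, and 90-day window below are assumptions for illustration, not the actual schema):

```python
import pandas as pd

# Time series source table: many payment records per customer (hypothetical).
payments = pd.DataFrame({
    "customer_id":  [13515, 13515, 13515, 13516, 13516],
    "payment_date": pd.to_datetime(["2022-01-15", "2022-02-15", "2022-03-15",
                                    "2022-02-20", "2022-03-20"]),
    "amount":       [250.0, 180.0, 0.0, 90.0, 120.0],
})

# Reduce the time series to one record per customer by aggregating over a
# chosen time range, here the 90 days before a reference date.
reference_date = pd.Timestamp("2022-03-22")
window = payments[payments["payment_date"] >= reference_date - pd.Timedelta(days=90)]

features = window.groupby("customer_id")["amount"].agg(
    payment_sum_90d="sum",
    payment_avg_90d="mean",
    payment_max_90d="max",
    payment_count_90d="count",
)
print(features)
```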

Another unique challenge of time series data is time leakage. As seen from the data below, if we want to predict whether the customer with “Customer ID” 13515 will churn on 3/22, we can only use information from the “Credit Card Payment” table up to 3/22 for that prediction. However, by examining the “Credit Card Payment” table, we see that for customer 13515, the available records even include entries with a due date of 4/31/2022. If we included such a record in our features, we would be using information unavailable at the time of prediction, creating time leakage. To avoid leaky time information, we have to carefully select the correct time range for each customer before applying time aggregation, adding a layer of complexity beyond the aggregations themselves.
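One way to guard against this, sketched below under the assumption that each customer’s prediction date is available in the target table (all names and values are hypothetical), is to cut off every customer’s payment history at their own prediction date before aggregating:

```python
import pandas as pd

# Target table carrying each customer's prediction (churn observation) date.
churn = pd.DataFrame({
    "customer_id": [13515, 13516],
    "churn_date":  pd.to_datetime(["2022-03-22", "2022-04-05"]),
})

# Payment records, some of which fall after a customer's prediction date.
payments = pd.DataFrame({
    "customer_id": [13515, 13515, 13516],
    "due_date":    pd.to_datetime(["2022-03-01", "2022-04-30", "2022-03-15"]),
    "amount":      [180.0, 210.0, 95.0],
})

# Attach each customer's cutoff date, then drop any record dated on or after
# it, so information unavailable at prediction time never leaks into features.
merged = payments.merge(churn[["customer_id", "churn_date"]], on="customer_id")
non_leaky = merged[merged["due_date"] < merged["churn_date"]]

features = non_leaky.groupby("customer_id")["amount"].agg(
    total_due="sum", n_payments="count")
print(features)
```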

Combining three tables with time series data requires mathematical transformations, a fair amount of manual work, and domain knowledge

If a user has deep domain knowledge of this dataset, the first step in three-table feature engineering is to think through the business context of the data and ask relevant questions like the examples below:

  • Are they active credit card users?
  • How much do they spend per month? 
  • Are they missing payments? 
  • What’s their most recent behavior compared to a couple of months ago?

We can then carefully convert each business question into a set of SQL statements to generate features like the ones shown in the table below. Even though this is a very complex and manual process, three-table feature engineering dealing with “one-to-one-to-many” relationships is still possible.
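For instance, a hedged sketch of how the “Are they missing payments?” question might be turned into a concrete feature, written here in pandas rather than SQL and using assumed column names:

```python
import pandas as pd

# Hypothetical payment history with an on-time flag per due date.
payments = pd.DataFrame({
    "customer_id":  [13515, 13515, 13515, 13516],
    "due_date":     pd.to_datetime(["2022-01-10", "2022-02-10",
                                    "2022-03-10", "2022-03-15"]),
    "paid_on_time": [True, False, False, True],
})

# "Are they missing payments?" -> count of missed payments in the last
# three months before the prediction date.
prediction_date = pd.Timestamp("2022-03-22")
recent = payments[payments["due_date"] >= prediction_date - pd.DateOffset(months=3)]

missed_payments_3m = (
    recent.assign(missed=~recent["paid_on_time"])
          .groupby("customer_id")["missed"]
          .sum()
          .rename("missed_payments_3m")
)
print(missed_payments_3m)
```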

A possible feature table based on three tables

Add a 4th table

To further improve the model performance, we should consider additional customer-specific information, like each customer’s overall financial situation, which is captured in the “Account Balance” table. 

The “Account Balance” table is another time series dataset. Three-table feature engineering involving one time-series table is complex enough; adding a second time-series table makes four-table feature engineering even harder. When dealing with two time-series datasets simultaneously, the number of possible ways to aggregate over time across tables, while still avoiding time leakage, quickly becomes unmanageable.

Adding tables makes the model more useful but increases complexity exponentially

We can still attempt complete four-table feature engineering manually by asking business questions such as the examples below (a code sketch of the second question follows the list):

  • A customer who has a lot of money in the bank but has not used our credit card in recent months, compared to a year ago
  • A customer who has no money in the bank and has carried a large credit card debt for the past six months
  • A customer who has made monthly deposits of over $5,000 over the past two months and is an entrepreneur
  • A customer who has withdrawn more money than they deposited and has missed three consecutive credit card payments in the last five months
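A rough sketch of the second question above, simplified to a shorter lookback window and using hypothetical layouts for the “Account Balance” and “Credit Card Payment” data:

```python
import pandas as pd

# Hypothetical "Account Balance" snapshots (time series).
balances = pd.DataFrame({
    "customer_id": [13515, 13515, 13516, 13516],
    "date":        pd.to_datetime(["2022-02-28", "2022-03-21",
                                   "2022-02-28", "2022-03-21"]),
    "balance":     [120.0, 80.0, 15000.0, 16000.0],
})

# Hypothetical "Credit Card Payment" statements (a second time series).
statements = pd.DataFrame({
    "customer_id":      [13515, 13515, 13516, 13516],
    "statement_month":  pd.to_datetime(["2022-02-01", "2022-03-01",
                                        "2022-02-01", "2022-03-01"]),
    "outstanding_debt": [5200.0, 5400.0, 0.0, 0.0],
})

cutoff = pd.Timestamp("2022-03-22")  # prediction date; drop records after it

# Aggregate each time series separately, then combine both feature sets.
avg_balance = (balances[balances["date"] < cutoff]
               .groupby("customer_id")["balance"].mean().rename("avg_balance"))
avg_debt = (statements[statements["statement_month"] < cutoff]
            .groupby("customer_id")["outstanding_debt"].mean().rename("avg_debt"))

combined = pd.concat([avg_balance, avg_debt], axis=1)
combined["low_balance_high_debt"] = (
    (combined["avg_balance"] < 500) & (combined["avg_debt"] > 1000)
)
print(combined)
```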

However, given the complexity of these four tables, systematically exploring different data combinations and developing statistically correct (non-leaky) features for a predictive model is practically impossible for any data scientist. 

How about ten tables?

The real challenge is that we are provided with ten tables that capture a much broader spectrum of information, including:

  • Credit score information
  • Customer’s access to their online banking accounts
  • How a customer reacted to a promotion campaign in the past
  • Etc…

Each set of information could conceivably help predict customer behavior, but exactly how it should be combined with the previous four tables is unclear. The sheer complexity and scale of exploiting the richness of as many as ten tables is the real challenge of feature engineering.

Building feature tables from a real-world data set like the one above is practically impossible to do manually

Simplify the Process with Automated Feature Engineering

We often deal with complex relational data in real-world data science problems. Manually developing features from this type of data is usually far more complex than one might expect. Automated feature engineering is a new technology invented to solve this problem, and we encourage data scientists to explore programmatic feature engineering when dealing with this kind of complexity.

Learn more about how your organization could benefit from the powerful features of dotData by signing up for a demo.

Aaron Cheng

Aaron is currently the Vice President of Data Science and Solutions at dotData. As a data science practitioner with 14 years of research and industrial experience, he has held various leadership positions spearheading new product development in the fields of data science and business intelligence. At dotData, Aaron leads the data science team in working directly with clients and solving their most challenging problems. Prior to joining dotData, he was a Data Science Principal Manager with Accenture Digital, responsible for architecting data science solutions and delivering business value for the tech industry on the West Coast. He was instrumental in the strategic expansion of Accenture Digital’s footprint in the data science market in North America. Aaron received his Ph.D. in Applied Physics from Northwestern University.

dotData's AI Platform

dotData Feature Factory Boosting ML Accuracy through Feature Discovery

dotData Feature Factory enables data scientists to develop curated features by turning data-processing know-how into reusable assets. It supports the discovery of hidden patterns in data through algorithms within a feature space built around the data, improving the speed and efficiency of feature discovery while enhancing reusability, reproducibility, collaboration among experts, and the quality and transparency of the process. dotData Feature Factory strengthens all data applications, including machine learning model predictions, data visualization through business intelligence (BI), and marketing automation.

dotData Insight Unlocking Hidden Patterns

dotData Insight is an innovative data analysis platform designed for business teams to identify high-value, hyper-targeted data segments with ease. It surfaces dotData’s hidden-pattern discovery through an intuitive, approachable interface. Through the powerful combination of AI-driven data analysis and GenAI, Insight discovers actionable business drivers that impact your most critical key performance indicators (KPIs). This convergence allows business teams to intuitively understand data insights, develop new business ideas, and more effectively plan and execute strategies.

dotData Ops Self-Service Deployment of Data and Prediction Pipelines

dotData Ops offers analytics teams a self-service platform to deploy data, features, and prediction pipelines directly into real business operations. By testing and quickly validating the business value of data analytics within your workflows, you build trust with decision-makers and accelerate investment decisions for production deployment. dotData’s automated feature engineering transforms MLOps by validating business value, diagnosing feature drift, and enhancing prediction accuracy.

dotData Cloud Eliminate Infrastructure Hassles with Fully Managed SaaS

dotData Cloud delivers each of dotData’s AI platforms as a fully managed SaaS solution, eliminating the need for businesses to build and maintain a large-scale data analysis infrastructure. This minimizes Total Cost of Ownership (TCO) and allows organizations to focus on critical issues while quickly experimenting with AI development. dotData Cloud’s architecture, certified as an AWS "Competency Partner," ensures top-tier technology standards and uses a single-tenant model for enhanced data security.