
Automated Feature Selection Methods to Assess Client Risk

June 9, 2022

In industries like Financial Services, Housing, and Insurance, the automated scoring of client risk can provide critical benefits. By quantifying potential hazards, risk profiling helps organizations build sustainable long-term growth and minimize losses in economic downturns. However, traditional client risk scoring methodologies are often not sufficient on their own and can be augmented with machine learning-based models.

Client Risk Profile Examples

Popular use cases across different domains include:

  • Which of my customers might default on loan payments? (Finance)
  • Can a supplier deliver goods on time? (Manufacturing)
  • Will my new tenant pay the rent on time? (Housing)
  • What are the risks associated with a specific property or driver? (Insurance)

Challenges with Traditional Risk Profiling Methods

Traditionally, businesses solved these use cases with rule-based approaches. A banker might have used a credit score and household income to determine loan eligibility.

Simple rule-based logic is easy to implement but does not scale well or remain robust as the volume of available data increases. It also tends to become tedious and difficult to manage as the number of variables grows.

In 2021, IDC estimated that global data creation and replication will grow at a 23% compound annual rate through 2025, reaching a staggering 181 zettabytes. That volume is roughly equivalent to 38 trillion single-layer DVDs. Simple rule-based models have become unwieldy and antiquated as both structured and unstructured data grow. Given good data quality, a well-developed ML algorithm can build better risk profiles than a rules-based approach.

Developing ML Models that Generate Business Value:

While ML-based solutions for risk profiling outperform rules-based approaches, building them comes with its own challenges. Problem formulation, model development, resource availability, domain expertise, validation of model metrics, and customer adoption can all impede successful ML projects.

To understand these challenges, let’s consider a use case based on a widely available data set from a Czech bank. The task is to predict whether a customer is likely to default on a loan before the loan is approved.

The Dataset: Loans issued by a Czech bank

This dataset from the Czech bank contains historical information about transactions before and after loans are provided. It also includes details about orders placed, the number of accounts held by the customer, the number of clients tied to the main customer account, and more.

Relational Data Set for Feature Engineering Problem

Out of 682 loans in the historical data, 76 defaulted and 606 were paid on time.

People who defaulted on the loan

The data in our sample data set looks as follows:

Table         Record Count
Loan          682
Order         6,471
Account       4,500
Disposition   5,369
District      77
Transaction   1,050,000
Client        5,369
Card          892

The Challenge: Avoiding defaulters

Despite all the due diligence performed by the bank before providing loans, more than 11% of approved loans (76 of 682) defaulted. The challenge for the bank is to identify the likelihood that a borrower will default before the loan is approved, reducing that 11% to a more acceptable number.

How a data scientist would solve this problem

The objective of our ML model is to predict whether a customer account will end up defaulting on the issued loan. For the 682 unique loans provided, there are 1.05 million transactions available. Below is a snapshot of the available transaction data:

Account  Date       Type    Operation    Trans_amt  Balance
1695     1/3/1993   credit  Cash Credit  200        200
1695     1/8/1993   credit  Cash Credit  11832      12032
1695     1/31/1993  credit               38.8       12070.8
1695     2/2/1993   credit  Cash Credit  7500       19570.8
1695     2/28/1993  credit               50.1       19620.9
1695     3/4/1993   credit  Cash Credit  5200       24820.9

A data scientist will need to extract features to build a model, for example:

  • Overall withdrawals made before the loan opening date
  • Number of times the account had a negative balance
  • Duration of the account before the loan start date
  • The last 60 day moving average of balance for each operation type
  • The previous 60-day moving standard deviation of balances
  • Number of transactions made before holidays
  • The proportion of credit vs. withdrawals made

This list is not exhaustive; the derived features vary with the data scientist's knowledge and experience with risk profiling or similar use cases. In a typical manual ML development cycle, the data scientist, often assisted by other developers such as data engineers, would have to perform a series of steps, as shown in the figure below:

A Typical Manual ML Development Cycle

Limitations of a manual ML development process

One of the most significant limitations of a manual approach is that, as data is aggregated from multiple tables into a flat file, valuable data can be ignored, since time and human-resource constraints limit how many table relationships and iterations are explored. It is also not always easy to determine the time window over which feature extraction should be done.

In a traditional ML development process, data scientists might rely on business domain experts to address some of these challenges and increase the likelihood that relevant data is used in the models. There are inherent limitations to this approach: domain experts can only provide guidance based on personal experience and may miss patterns that, while not obvious, could offer valuable insights.

In addition, there are always more uncertainties: Is the correct model type being used? Are the hyperparameters properly tuned?

Depending on the time constraints, resource availability, and impact of the problem, an experienced data practitioner could spend weeks to months solving these risk profiling use cases.

Accelerated Model Building with Feature Discovery & Evaluation

dotData is a platform that accelerates problem formulation, model building, and deployment. dotData’s patented technology does the heavy lifting so that data practitioners can focus more on problem-solving.

At the heart of dotData’s technology is an AI-based system that connects directly to relational enterprise data – whether in a data warehouse or data lake – to automatically discover and analyze insights from millions of records across hundreds of tables and thousands of columns of data.

With dotData’s technology, feature engineering goes beyond simple permutations of existing features, giving users a way to automatically discover relevant patterns that might be of value to a model.

Automated Feature Engineering: Three Levels of Feature Engineering

dotData’s Automated Feature Engineering assists the data scientist in the feature discovery and evaluation process in three critical ways. First, it allows the user to import multiple data tables and define the entity relationships. dotData then automatically discovers and evaluates possible features from the available tables without forcing the user to manually build a flat table.

Second, dotData also provides a more traditional means of feature engineering by creating new features by combining existing features.

Finally, dotData’s platform is ideal for time-series-based problems because it’s capable of intelligent aggregation, automatically creating aggregations by day, week, month, etc., based on simple user inputs.
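dotData’s aggregation engine is proprietary, but as a rough open-source analogy, the sketch below shows how aggregates at daily, weekly, and monthly granularity could be produced with pandas resampling, reusing the hypothetical column names from the earlier sketch.

```python
import pandas as pd

# Same hypothetical transaction frame as in the earlier sketch (assumed column names).
trans = pd.read_csv("transactions.csv", parse_dates=["trans_date"])
trans = trans.set_index("trans_date").sort_index()

# Aggregate transaction amounts per account by day, week, and month.
aggregates = {}
for freq, name in [("D", "daily"), ("W", "weekly"), ("M", "monthly")]:
    aggregates[name] = (
        trans.groupby("account_id")["trans_amt"]
        .resample(freq)
        .agg(["sum", "mean", "std", "count"])
    )
```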

This powerful trifecta of feature engineering allows data scientists to focus their efforts on algorithm selection and optimization rather than feature development. It also accelerates time to value by giving subject matter experts insight into data patterns they may not have known about, making the development of high-value features faster and more accurate.

Auto Machine Learning

The auto machine learning functionality in the platform combines the features defined by the user with the features generated through Automated Feature Engineering. It applies those features to model building, runs several different model families such as linear models (Linear Regression, Logistic Regression, SVM), tree-based models (LightGBM, XGBoost, Decision Tree), and neural network-based models (TensorFlow, PyTorch), and reports the summary metrics. The user can then choose and deploy the best model based on their business needs.
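The platform’s internals are not public, but the general idea of fitting several model families and comparing a summary metric can be sketched with scikit-learn. This is only an analogy, using synthetic placeholder data in place of the engineered feature matrix and default labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for the engineered feature matrix and default labels.
X, y = make_classification(n_samples=682, n_features=20, weights=[0.89], random_state=0)

# Candidate model families similar to those the platform evaluates.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": SVC(kernel="linear"),
    "decision_tree": DecisionTreeClassifier(max_depth=5),
    "gradient_boosting": GradientBoostingClassifier(),  # open-source stand-in for LightGBM/XGBoost
}

# Report a summary metric (AUC) for each candidate so the best model can be chosen.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```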

By leveraging automation, the workflow becomes simplified as shown below:

The ML Workflow as Automated With dotData

Predicting loan defaults without manually creating a flat file

Once the user imports the loan, transaction, order, and other tables, they can define joins between them. For example, the Loan and Account tables can be joined on Account Id, the Card and Disposition tables on DispId, and so on.

The use of joined tables eliminates the need to create a single flat file, allowing the user to build and rebuild table relationships according to the nature of the problem.
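For intuition, the same relationships could be expressed as ordinary key-based joins; in dotData they are declared once as entity relationships rather than materialized into one wide table. The file and column names below (account_id, disp_id) are hypothetical.

```python
import pandas as pd

# Hypothetical file and key names based on the schema described above.
loan = pd.read_csv("loan.csv")
account = pd.read_csv("account.csv")
disposition = pd.read_csv("disp.csv")
card = pd.read_csv("card.csv")

# Loan joins Account on the account id; Card joins Disposition on the disposition id.
loan_account = loan.merge(account, on="account_id", how="left")
card_disp = card.merge(disposition, on="disp_id", how="left")
```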

A Complex Table Relationship in dotData

Extracting time-based features:

Just as we defined key-based relationships (or ‘joins’) between tables, dotData can also relate tables based on a time range. Let’s assume that for this use case we need to predict loan default 30 days before approval. Leveraging time ranges is carried out in two steps:

Define the lead time for prediction:

The user sets the lead time for prediction to 30 days. With time-based “joins” between tables, dotData automatically extracts only the transactions made more than 30 days before loan approval. This lead time helps prevent data leakage, since no transactions from after the loan approval are included.

Define the maximum time window for aggregation:

Let’s assume that transactions from the six months before the loan opening date contain information helpful for default prediction. Even if we are unsure of the exact time window to use for aggregation, we can let dotData extract multiple time-based features within a six-month search range. Alternatively, dotData’s automated option can define the search range itself and identify the most useful features within it.
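As a rough sketch of what these two settings imply for the raw data (again with hypothetical column names, not dotData’s actual implementation), their combined effect is to restrict each account’s transactions to a window that ends 30 days before loan approval and begins at most six months before that cutoff:

```python
import pandas as pd

LEAD_TIME = pd.Timedelta(days=30)    # predictions are made 30 days before loan approval
MAX_WINDOW = pd.Timedelta(days=180)  # search range for time-based aggregations (six months)

trans = pd.read_csv("transactions.csv", parse_dates=["trans_date"])
loans = pd.read_csv("loans.csv", parse_dates=["loan_date"])
df = trans.merge(loans[["account_id", "loan_date"]], on="account_id")

# Only data available 30 days before approval may be used (avoiding leakage),
# and only the six months before that cutoff are searched for aggregation windows.
cutoff = df["loan_date"] - LEAD_TIME
df = df[(df["trans_date"] < cutoff) & (df["trans_date"] >= cutoff - MAX_WINDOW)]
```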

Time Series Aggregation With dotData

Execution:

After defining the use case, database schema, and relationships, running the model is just a click away.

Results from the dotData ML model

In 90 minutes, dotData explored thousands of possible features and hundreds of machine learning models to predict individuals at risk of defaulting on their loans.

dotData identified the XGBoost algorithm as the best fit for this challenge, using categorical and time-based features that the platform created automatically.

Insights from the Features

The chart below shows the top features with the strongest association with model performance. Many of these features were built automatically by dotData:

Features Generated by dotData

The discovered features identified some critical patterns for clients that have a higher risk of defaulting on their loans:

  • Customers who have frequent withdrawals during the weekday
  • Customers who maintain a lower balance before loan approval
  • Customers who have higher variation in withdrawal during the last six months
  • Customers who have made more orders (transactions) for household purposes

Model metrics:

The confusion matrix helps monitor the performance of the classification model. It shows how many predictions were correctly identified, missed, or incorrect:

Confusion Matrix
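For reference, a confusion matrix for a binary default classifier can be computed with scikit-learn as follows. This is a generic illustration with placeholder labels, not dotData’s reporting or the actual validation results:

```python
from sklearn.metrics import confusion_matrix

# Placeholder labels (1 = default, 0 = paid on time); not the actual validation results.
y_true = [0, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_true, y_pred))
```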

Deriving value from this prediction:

The confusion matrix showed that, of the fourteen actual defaults in the validation dataset, dotData successfully identified twelve before the loan was provided. Out of 175 records in the validation dataset, 53 have completed the loan tenure; the rest are still active. Only loans that have completed their term were considered when determining the cost avoidance from loss prevention.

Number of Paid and Defaulted Accounts

Assessing the potential savings:

Of the defaults among loans that completed their term, the dotData ML model successfully identified 8 out of 10 before loan approval. By declining the loans on the eight account IDs predicted to default, the bank could have saved over 388,000 korunas (the Czech currency). The two defaulting accounts that the model missed lost around 20,000 korunas.


Default Accounts Identified by dotData

Applying ML to this use case clearly showed a significant financial benefit, identifying 80% of the likely defaults.

Conclusions

In this article, we discussed the need for applying ML to solve risk-profiling use cases. We also discussed the challenges faced by data scientists that delay the delivery of value, and the limitations involved in building ML models manually. We highlighted how helpful information can get lost while aggregating multiple tables, and the risks of limiting features based on domain knowledge alone.

We demonstrated how dotData helps accelerate the process of model development. In essence, the product lets data professionals focus on delivering value by taking care of the heavy lifting associated with feature engineering and model development.

Finally, we showed how our award-winning features, such as key-based and time-based relationships across tables, automatic time-based feature extraction, and an automated model development process, could have saved the Czech bank from losing a significant amount of money, in just a few clicks!

Learn more about how your organization could benefit from the powerful features of dotData by signing up for a demo.


Hari Naryanan

Hari is currently a Senior Data Scientist at dotData. As a customer-facing data scientist, he works on use cases in the automotive, healthcare, retail, and insurance industries. Before joining dotData, he worked for top-tier OEMs including General Motors and Ford. He holds patents, has published research, and is a recipient of the Technical Excellence Award at Ford. He brings eight years of experience in the automotive and software industries, having developed solutions for use cases involving Predictive Maintenance, Content Optimization, Inventory Optimization, Machine Life Optimization, and Customer Risk Prediction. Hari received his Master’s degree in Industrial Engineering from Clemson University, specializing in Operations Research and Advanced Statistics.