
Fast and More Robust Permutation Importance for Black-box Model Transparency on Imbalanced and High-Dimensional Data

In the first part of this blog, Basic Concepts and Techniques of AI Model Transparency, we reviewed a few common techniques for AI model transparency, such as linear coefficients, local linear approximation, and permutation importance. In particular, permutation importance is applicable to any black-box model and any accuracy or error function, and it is more robust on high-dimensional data because it handles features one at a time rather than all at once.

One drawback of permutation importance is its high computation cost. The number of model evaluations is (the number of features) * (the number of random shuffles) * (the number of models). A common way to reduce the computation time is downsampling, which works well when the positive and negative classes are balanced. On imbalanced data, however, such naive downsampling makes permutation importance extremely unreliable.
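
To make the cost concrete, here is a minimal sketch of permutation importance in plain Python. It is not the eli5 implementation; the function name, the toy model `toy_predict`, and the toy data are assumptions made for illustration only.

```python
import random


def permutation_importance(predict, score, X, y, n_repeats=5, seed=0):
    """Mean score drop per feature after shuffling that feature's column.

    predict: black-box function mapping a list of rows to a list of predictions.
    score:   function (y_true, y_pred) -> float, where higher is better.
    """
    rng = random.Random(seed)
    baseline = score(y, predict(X))
    importances = []
    for j in range(len(X[0])):          # one pass per feature
        drops = []
        for _ in range(n_repeats):      # repeat to average out shuffle noise
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - score(y, predict(X_perm)))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy black box: it predicts the label from feature 0 only.
calls = {"n": 0}


def toy_predict(rows):
    calls["n"] += 1                     # count model evaluations
    return [row[0] for row in rows]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


X = [[i % 2, (i * 7) % 5] for i in range(20)]
y = [row[0] for row in X]
importances = permutation_importance(toy_predict, accuracy, X, y)
```

In this sketch, each model costs one baseline evaluation plus (features * shuffles) evaluations on the full dataset, so the total grows linearly with the number of models as well. The unused feature 1 gets an importance of exactly zero because shuffling it never changes the predictions.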

Permutation Importance Under Class Imbalance

Let us first see how unstable permutation importance becomes under naive downsampling, using Kaggle’s Credit Card Fraud Detection dataset. The dataset, which has been upvoted almost 6,000 times on the platform, is a famous example of a classification use case with imbalanced classes. The objective is to detect fraudulent transactions by building a model on data that contains only 492 frauds out of 284,807 transactions that occurred over two days (0.17% positive samples). There are 30 features, namely V1, V2, …, V28, Time, and Amount. The experimental settings are summarized as follows:

  • The experiments below were performed on a standard laptop computer with an Intel Core i7-9750H CPU (6 cores);
  • We used eli5 and XGBoost;
  • We used AUC (the area under the receiver operating characteristic curve) as the accuracy metric;
  • The number of shuffles was set to 5 (the default value of eli5).
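
As a reminder of what the AUC metric measures, the sketch below computes it from its pairwise definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The function name `roc_auc` is an assumption for illustration, and this quadratic-time version is only meant for small examples, not for the experiments in this post.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Each pair contributes 1 for a correct ordering and 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Three of the four (positive, negative) pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

Because the number of pairs is (positives * negatives), an evaluation set with only a handful of positives yields an AUC built from very few distinct positive scores, which is worth keeping in mind for the downsampling results below.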

First, Fig.4 shows the computation time of permutation importance. The “baseline” method used eli5 without downsampling. As we see, it took about 75 seconds to compute permutation importance. Although 75 seconds may seem acceptable, imagine that the dataset contains 300 features rather than 30 and that you want to compare permutation importance across 10 models. The computation time then becomes roughly 7,500 seconds (75 seconds * 10 times as many features * 10 models), which is more than 2 hours. A reasonable way to reduce computation time is to apply downsampling. The “random” method randomly extracted 5,000 samples (~1.7% of the original population). As can be seen in Fig.4, the computation time decreased proportionally.
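
The back-of-the-envelope estimate above can be written out as follows. The helper `estimated_seconds` is hypothetical, and it assumes the runtime scales linearly with both the number of features and the number of models, which ignores any extra prediction cost a wider model may incur.

```python
def estimated_seconds(base_seconds, base_features, n_features, n_models):
    """Scale a measured runtime linearly in feature count and model count."""
    return base_seconds * (n_features / base_features) * n_models


# 75 s measured on 30 features, extrapolated to 300 features and 10 models.
total_seconds = estimated_seconds(75, 30, 300, 10)  # 7500.0
total_hours = total_seconds / 3600                   # a little over 2 hours
```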
 

Fig.4 : Computation time of permutation importance on the Credit Card Fraud Detection dataset. The “baseline” method used eli5 without downsampling, while the “random” method applied 1.7% downsampling (5,000 samples).



Next, Fig.5 shows the estimated permutation importance for the “baseline” (blue) and “random” (green) methods. As shown, the permutation importance values of the “random” method are very different from those of the “baseline” method. Moreover, the estimation variance (standard deviation across 5 random shuffles) is extremely large, so the permutation importance estimated using the “random” method is unreliable.

Fig.5 : Top-10 permutation importance values estimated using the “baseline” and “random” methods.

Fast and Stable Permutation Importance

The main reason for this instability is the lack of positive samples after downsampling. In Kaggle’s Credit Card Fraud Detection dataset, only 8-9 positive samples (1.7% of the 492 positive samples) remain on average after downsampling. Every random shuffle is therefore evaluated on only 8-9 positive samples, and you can imagine how unstable the permutation importance becomes. You may consider applying balanced (or stratified) sampling to retain more positive samples. However, such sampling changes the class distribution. In other words, the ratio of positive to negative samples differs before and after sampling, which makes the accuracy calculation inappropriate.
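
The 8-9 figure, and how much it fluctuates from one random sample to the next, can be checked with a short calculation. The sketch below assumes simple random sampling without replacement, so the positive count follows a hypergeometric distribution; the function name is hypothetical.

```python
import math


def positives_after_downsampling(n_total, n_pos, n_sample):
    """Mean and standard deviation of the positive count in a simple random
    sample drawn without replacement (hypergeometric distribution)."""
    p = n_pos / n_total
    mean = n_sample * p
    variance = n_sample * p * (1 - p) * (n_total - n_sample) / (n_total - 1)
    return mean, math.sqrt(variance)


# Dataset sizes quoted in this post: 284,807 rows, 492 frauds, 5,000 sampled.
mean, std = positives_after_downsampling(284_807, 492, 5_000)
```

Under this assumption, the expected count is about 8.6 positives with a standard deviation of about 2.9, so a single sample can easily contain only five or six frauds.
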

At dotData, we implemented our own fast and stable algorithm to compute permutation importance under class imbalance. Although this blog does not dive into the technical details of the algorithm, the idea is summarized as follows:

  • Apply stratified sampling to extract more positive samples while reducing the size of the entire population. 
  • Apply theoretical adjustment of the distribution change induced by the stratified sampling to calculate accuracy.
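
The two steps above can be illustrated with a generic technique, inverse-probability weighting. This is not dotData's proprietary algorithm, whose details are not described in this post; it is only a sketch of one common way to correct a metric after stratified sampling. The function names are hypothetical, and precision is used as the example metric because it depends directly on the class ratio.

```python
import random


def stratified_sample(y, n_neg, seed=0):
    """Keep every positive and n_neg random negatives.

    Returns the kept row indices and a weight per kept row. Each negative is
    weighted by the inverse of the negative sampling rate, so the weighted
    sample has the same class ratio as the full population.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, t in enumerate(y) if t == 1]
    neg_idx = [i for i, t in enumerate(y) if t == 0]
    kept_neg = rng.sample(neg_idx, n_neg)
    neg_weight = len(neg_idx) / n_neg
    indices = pos_idx + kept_neg
    weights = [1.0] * len(pos_idx) + [neg_weight] * n_neg
    return indices, weights


def weighted_precision(y_true, y_pred, weights):
    """Precision where each row counts as `weight` rows of the population."""
    tp = sum(w for t, p, w in zip(y_true, y_pred, weights) if p == 1 and t == 1)
    fp = sum(w for t, p, w in zip(y_true, y_pred, weights) if p == 1 and t == 0)
    return tp / (tp + fp) if tp + fp else 0.0


# Toy population: 10 positives, 1,000 negatives; keep 100 negatives.
y = [1] * 10 + [0] * 1000
indices, weights = stratified_sample(y, n_neg=100)
```

Without the weights, a false positive in the sample would count as one row even though it stands in for ten rows of the original population, so an unweighted precision would be overly optimistic.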

Fig.6 compares the permutation importance values calculated using the “baseline” method and the “new” method. As can be seen, the “new” method returned estimates very similar to those of the “baseline” method, and its estimation variance is much smaller than that of the “random” method (see Fig.5). Fig.7 shows the computation time of the “new” method. It is only 3% of that of the “baseline” method, a speed-up of more than 30x, with stable and reliable estimates.

Fig.6 : Top-10 permutation importance values estimated using the “baseline” and “new” methods.


Fig.7 : Computation time of permutation importance using the “baseline” and “new” methods.

Key Takeaways

In this two-part blog series, we have discussed how to improve transparency of black-box models on unbalanced data. The key takeaways are: 

  • Model transparency is one of the most fundamental and important aspects of any AI/ML model in the enterprise, for reasons such as explainability for model consumers and accountability in regulated industries (see our white paper for more discussion).
  • For many practical classification applications, class labels are highly unbalanced. On such imbalanced data, many common model transparency methods become highly unreliable (or computationally too demanding to be practical).
  • dotData developed a proprietary algorithm that significantly improves both stability and computation time to provide better model transparency on imbalanced and high-dimensional data (the algorithm was implemented in dotData Enterprise 1.6 / dotData Py 1.2 or later.)
Michal Zak
