
Feature Engineering from Multi-Valued Categorical Data


Introduction

Multi-valued categorical data is ubiquitous and critical to businesses across many industries. Most real-world data that businesses store is not numerical but categorical in nature. Examples of categorical data include education level (elementary, middle school, bachelor’s degree, master’s degree, Ph.D.), subscription type (basic, premium, business), country, and countless other attributes that are prevalent in business and necessary for daily operations and for analyzing business performance. Although businesses must rely on categorical data to capture real-world trends, working with such data can be challenging; fortunately, there are ways to address those challenges.

Overview

Categorical columns represent data that can be divided into distinct groups. There are two types of categorical columns, ordinal and nominal, and each can be either uni-valued (regular) or multi-valued. A multi-valued categorical column packs several categorical values into a single column. In Fig. 1, the payment_type column is a regular categorical column, whereas the category_name column is a multi-valued categorical column, since each record (row) contains multiple entries within the same column.

Fig. 1

Table A (regular categorical column)

id | payment_type
0 | Cash
1 | Credit Card
2 | Debit Card
3 | Loan

Table B (multi-valued categorical column)

id | category_name
0 | Men / Tops / T-Shirts
1 | Electronics / Computers & Tablets / Components & Parts
2 | Women / Tops & Blouses / Blouse
3 | Children / Toys / Accessories

Multi-valued categorical data is commonplace across diverse industries: businesses deal with large amounts of it daily in product pricing, job applications, feedback surveys, and competitive analysis results. Its importance lies in the fact that it captures real-world structure, and capturing that structure is what enables us to build good ML models. Multi-valued categorical variables can hide useful information within them, so it is necessary to know the correct methods for working with them; otherwise, we miss relevant information that could help improve the business. From an analytical point of view, handling them properly enables kinds of statistical analysis we otherwise could not perform. From a technical point of view, however, they are challenging to work with.

Approach to multi-valued categorical data

A multi-valued categorical column cannot be used as a model feature without preprocessing; instead, features must be extracted from it. This feature extraction can be performed with several different techniques.

One-hot encoding

The simplest technique for handling multi-valued categorical columns is one-hot encoding, which treats each multi-valued categorical value as a normal categorical value and generates binary-valued features, as illustrated in Fig. 2. The limitation of this approach is that it cannot capture the structure shared among categorical values. For example, “Men / Tops / T-Shirts” and “Women / Tops / Accessories” are both under “Tops,” yet this encoding treats them as completely unrelated products.

Fig. 2

Table A

id | category_name
0 | Men / Tops / T-Shirts
1 | Women / Tops / Accessories
2 | Children / Tops / Accessories

->

Table B

id | Men / Tops / T-Shirts | Women / Tops / Accessories | Children / Tops / Accessories
0 | 1 | 0 | 0
1 | 0 | 1 | 0
2 | 0 | 0 | 1
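As a minimal sketch of this step (assuming pandas; the small DataFrame mirroring Table A is purely illustrative), one-hot encoding can be produced as follows:

```python
import pandas as pd

# Toy data mirroring Fig. 2, Table A (illustrative only)
df = pd.DataFrame({
    "id": [0, 1, 2],
    "category_name": [
        "Men / Tops / T-Shirts",
        "Women / Tops / Accessories",
        "Children / Tops / Accessories",
    ],
})

# One-hot encoding treats each full string as a single, unrelated category value
one_hot = pd.get_dummies(df["category_name"])
print(one_hot.astype(int))
```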

Binary encoding

Another technique for handling multi-valued categorical columns is binary encoding. It splits each multi-valued categorical value into its individual categorical values and expands the original column into one binary-valued column per value, as illustrated in Fig. 3. One limitation of binary encoding is that it cannot capture complex dependencies among categorical values. In Fig. 3, for example, “T-Shirts” always appears under “Tops,” while “Accessories” may appear under both “Tops” and “Bottoms.” Such structures often exist in multi-valued categorical columns, and incorporating them gives the model more domain context, improving both accuracy and transparency.

Fig. 3

Table A

id | category_name
0 | Men / Tops / T-Shirts
1 | Women / Tops / Accessories
2 | Children / Tops / Accessories

->

Table B

id | Men | Tops | T-Shirts | Women | Accessories | Children
0 | 1 | 1 | 1 | 0 | 0 | 0
1 | 0 | 1 | 0 | 1 | 1 | 0
2 | 0 | 1 | 0 | 0 | 1 | 1
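A minimal sketch of this splitting step, assuming the column uses “ / ” as its separator as in Table A (the DataFrame below is illustrative):

```python
import pandas as pd

# Toy data mirroring Fig. 3, Table A (illustrative only)
df = pd.DataFrame({
    "category_name": [
        "Men / Tops / T-Shirts",
        "Women / Tops / Accessories",
        "Children / Tops / Accessories",
    ],
})

# Split each multi-valued entry on the " / " separator and create one
# binary column per individual categorical value
binary = df["category_name"].str.get_dummies(sep=" / ")
print(binary)
```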

N-gram Encoding

A more advanced approach to extracting features from multi-valued categorical columns is to use n-grams. An n-gram is a sequence of n items; for multi-valued categorical columns, an n-gram represents a sequence of categorical values, and n-gram encoding transforms the original multi-valued categorical column into binary n-gram features, as illustrated in Fig. 4. N-gram encoding naturally extends binary encoding (1-gram features are equivalent to the binary encoding features) and can capture dependencies among categorical values, thereby addressing the drawbacks of the binary encoding approach. N-grams are a well-established technique in text and speech processing, so we can leverage state-of-the-art n-gram research from the research community as well as open-source libraries such as categorical encoding and category encoders.

Fig. 4

Table A

id | category_name
0 | Men / Tops / T-Shirts
1 | Women / Tops / Accessories
2 | Children / Tops / Accessories

->

Table B

id | Men | Men-Tops | Men-Tops-TShirts | Tops-Accessories | Women-Tops
0 | 1 | 1 | 1 | 0 | 0
1 | 0 | 0 | 0 | 1 | 1
2 | 0 | 0 | 0 | 1 | 0
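There is no single standard pandas helper for this, so the sketch below builds the n-gram features by hand; the function name, separator, and maximum n are assumptions for illustration (the generated column names hyphenate tokens, so they may differ cosmetically from Table B):

```python
import pandas as pd

# Illustrative helper: build 1- to max_n-gram binary features from a
# hierarchical category path; the name, separator, and max_n are assumptions.
def ngram_encode(series: pd.Series, max_n: int = 3) -> pd.DataFrame:
    rows = []
    for value in series:
        tokens = [t.strip() for t in value.split("/")]
        grams = set()
        for n in range(1, max_n + 1):                    # 1-grams, 2-grams, ...
            for i in range(len(tokens) - n + 1):
                grams.add("-".join(tokens[i:i + n]))     # e.g. "Men-Tops"
        rows.append({g: 1 for g in grams})
    return pd.DataFrame(rows).fillna(0).astype(int)

df = pd.DataFrame({
    "category_name": [
        "Men / Tops / T-Shirts",
        "Women / Tops / Accessories",
        "Children / Tops / Accessories",
    ],
})
print(ngram_encode(df["category_name"]))
```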

Experiment

In this experiment, we compared one-hot encoding (the most naive approach) and n-gram encoding (the most advanced approach) for multi-valued categorical columns, using Kaggle price prediction data from an e-commerce site (Mercari Price Prediction). The data contains 1.3M rows with a numeric target column (“price”), three categorical columns (“brand,” “condition,” and “shipment”), two text columns (“name” and “description”), and a multi-valued categorical column (“product_category”), as shown in Fig. 5.

Fig. 5 Mercari Price Prediction data

price | product_category | brand | name | condition | description | shipment
Integer | Multi-valued category | Category | Text | Category | Text | Category
$20 | Women / Tops / TShirts | Nike | Nike men’s dri-fit tee shirt | 1 | Basically brand new. No wear or tear, great condition, great and soft shirt. | True
$25 | Men / Tops / TShirts | unfeatured | Unfeatured T-shirts | 2 | Cleaning out closet Brand new, never worn | False
$130 | Men / Tops / Button-Front | Burberry | AEROPASTALE L WHITE BUTTON DOWN | | No description yet | True
  • For the three categorical columns (“brand,” “condition,” and “shipment”), we applied one-hot encoding and target encoding.
  • For the two text columns (“name” and “description”), we applied a standard topic modeling technique and used the topic vectors as features.
  • For the multi-valued categorical column (“product_category”), we compared one-hot encoding (Approach-1) and n-gram encoding (Approach-2).
  • For both Approach-1 and Approach-2, we applied LightGBM and tuned its hyperparameters using grid search.
  • We evaluated accuracy using RMSLE (a minimal, illustrative sketch of the modeling and evaluation step follows this list).
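The snippet below is only a rough sketch of the training and evaluation step, using random toy data in place of the encoded Mercari features; the hyperparameters shown are assumptions, not the tuned values from the experiment.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Random toy data standing in for the encoded features (illustrative only)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 20)).astype(float)   # binary n-gram features
y = np.exp(rng.normal(3.0, 0.5, size=1000))             # positive "prices"

def rmsle(y_true, y_pred):
    # Root mean squared logarithmic error, the metric used in the experiment
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Assumed (untuned) hyperparameters; the experiment used grid search instead
model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
model.fit(X_train, np.log1p(y_train))           # train on log1p(price)
pred = np.expm1(model.predict(X_val))           # invert the log transform
print("Validation RMSLE:", rmsle(y_val, pred))
```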

Fig. 6 shows the RMSLEs on the training and validation sets. As can be seen, Approach-2 (n-gram features) reduced the validation RMSLE by 3.4% (0.531 vs. 0.513). We then sorted the features by their permutation importance scores and discovered several n-gram features, such as “Electronics” (a 1-gram feature, equivalent to binary encoding) and “Women-Jewelry” (a 2-gram feature), that are highly effective in predicting product prices: electronics and women’s jewelry products are often more expensive, and the n-gram features extracted that trend from the multi-valued categorical column.

Fig. 6

Approach | Training RMSLE | Validation RMSLE
Approach-1 (one-hot encoding) | 0.517 | 0.531
Approach-2 (N-gram encoding) | 0.499 | 0.513

dotData’s Approach

dotData’s feature discovery platform handles multi-valued categorical columns in the following steps:

  • Auto-detection: dotData automatically analyzes all categorical columns and detects the multi-valued ones.
  • N-gram feature extraction: dotData extracts n-gram features from the detected multi-valued categorical columns.
  • Feature selection: dotData assesses the extracted n-gram features and selects the important ones (n-gram features tend to be very high-dimensional, so this selection step is practically important); a generic illustration of the selection idea follows this list.
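As a rough illustration of the selection idea only (not dotData’s implementation), a model-based selector can prune a high-dimensional binary n-gram matrix; the estimator, threshold, and toy data below are assumptions.

```python
import numpy as np
import lightgbm as lgb
from sklearn.feature_selection import SelectFromModel

# Toy high-dimensional binary n-gram matrix; only a few columns drive the target
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 300)).astype(float)
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(0, 0.1, size=500)

# Keep only features whose model-based importance exceeds the median importance
selector = SelectFromModel(lgb.LGBMRegressor(n_estimators=100), threshold="median")
selector.fit(X, y)
print("Number of selected features:", int(selector.get_support().sum()))
```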

Summary

Multi-valued categorical data frequently appears in business data. N-gram encoding is a promising approach for extracting multi-valued categorical relationships as features, improving model performance and providing deeper insights. dotData’s feature discovery platform handles multi-valued categorical columns automatically.

dotData

dotData Automated Feature Engineering powers our full-cycle data science automation platform, helping enterprise organizations accelerate ML and AI projects and deliver more business value by automating the hardest parts of the data science and AI process: feature engineering and operationalization. Learn more at dotdata.com, and join us on Twitter and LinkedIn.

dotData's AI Platform

dotData Feature Factory Boosting ML Accuracy through Feature Discovery

dotData Feature Factory provides data scientists with the ability to develop curated features by turning data-processing know-how into reusable assets. It enables the discovery of hidden patterns in data through algorithms within a feature space built around the data, improving the speed and efficiency of feature discovery while enhancing reusability, reproducibility, collaboration among experts, and the quality and transparency of the process. dotData Feature Factory strengthens all data applications, including machine learning model predictions, data visualization through business intelligence (BI), and marketing automation.

dotData Insight Unlocking Hidden Patterns

dotData Insight is an innovative data analysis platform designed for business teams to identify high-value, hyper-targeted data segments with ease. It surfaces dotData’s hidden patterns through an intuitive, approachable interface. Through the powerful combination of AI-driven data analysis and GenAI, Insight discovers actionable business drivers that impact your most critical key performance indicators (KPIs). This convergence allows business teams to intuitively understand data insights, develop new business ideas, and more effectively plan and execute strategies.

dotData Ops Self-Service Deployment of Data and Prediction Pipelines

dotData Ops offers analytics teams a self-service platform to deploy data, features, and prediction pipelines directly into real business operations. By testing and quickly validating the business value of data analytics within your workflows, you build trust with decision-makers and accelerate investment decisions for production deployment. dotData’s automated feature engineering transforms MLOps by validating business value, diagnosing feature drift, and enhancing prediction accuracy.

dotData Cloud Eliminate Infrastructure Hassles with Fully Managed SaaS

dotData Cloud delivers each of dotData’s AI platforms as a fully managed SaaS solution, eliminating the need for businesses to build and maintain a large-scale data analysis infrastructure. This minimizes Total Cost of Ownership (TCO) and allows organizations to focus on critical issues while quickly experimenting with AI development. dotData Cloud’s architecture, certified as an AWS "Competency Partner," ensures top-tier technology standards and uses a single-tenant model for enhanced data security.