Feature Engineering from Multi-Valued Categorical Data
Introduction
Multi-valued categorical data is ubiquitous and critical to businesses across many industries. Much of the real-world data that businesses store is categorical rather than numerical. Examples include level of education (elementary, middle school, bachelor’s degree, master’s degree, Ph.D.), subscription types (basic, premium, business), and countries; such data is prevalent in business and necessary for daily operations and for analyzing business performance. Although businesses rely on categorical data to capture real-world trends, working with it can be challenging, and there are ways to address those challenges.
Overview
Categorical columns represent data that can be divided into several different groups. A categorical column is either ordinal or nominal, and either type can be uni-valued (regular) or multi-valued. A multi-valued categorical column holds several pieces of information in a single column. In Fig. 1, the payment_type column is a regular categorical column, whereas the category_name column is a multi-valued categorical column, since each record (row) contains multiple entries within the same column.
Fig. 1
id | payment_type |
---|---|
0 | Cash |
1 | Credit Card |
2 | Debit Card |
3 | Loan |
id | category_name |
---|---|
0 | Men / Tops / T-Shirts |
1 | Electronics / Computers & Tablets / Components & Parts |
2 | Women / Tops & Blouses / Blouse |
3 | Children / Toys / Accessories |
Multi-valued categorical data is commonplace across industries, and businesses deal with large amounts of it daily. Examples include product pricing, job applications, feedback surveys, and competitive analysis results. Multi-valued categorical columns matter because they represent real-world data; capturing such data is not simple, but it enables us to build good ML models. These columns can also mask useful information, so it is necessary to know the correct methods for working with them; otherwise, we miss relevant information that could help improve the business. From an analytical point of view, handling them properly enables kinds of statistical analysis that would otherwise not be possible. From a technical point of view, however, they are challenging to work with.
Approach to multi-valued categorical data
A multi-valued categorical column cannot be used as a feature without preprocessing. Several feature extraction techniques can turn such a column into features that are usable for building models.
One-hot encoding
The simplest technique for handling multi-valued categorical columns is one-hot encoding, which treats each multi-valued categorical value as a normal categorical value and generates binary-valued features, as illustrated in Fig. 2. The limitation of this approach is that it cannot capture structure among categorical values. For example, “Men / Tops / T-Shirts” and “Women / Tops / T-Shirts” are both “Tops,” but this encoding treats them as unrelated categories.
Fig 2.
Table A
id | category_name |
---|---|
0 | Men / Tops / T-Shirts |
1 | Women / Tops / Accessories |
2 | Children / Tops / Accessories |
Table B
id | Men / Tops / T-Shirts | Women / Tops / Accessories | Children / Tops / Accessories |
---|---|---|---|
0 | 1 | 0 | 0 |
1 | 0 | 1 | 0 |
2 | 0 | 0 | 1 |
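The transformation in Fig. 2 can be sketched in a few lines of Python. This is a minimal illustration that assumes pandas is available; the DataFrame `df` and the variable `one_hot` are names introduced here for the example and are not part of the original experiment.

```python
import pandas as pd

# Table A from Fig. 2 (illustrative data only).
df = pd.DataFrame({
    "category_name": [
        "Men / Tops / T-Shirts",
        "Women / Tops / Accessories",
        "Children / Tops / Accessories",
    ]
})

# One-hot encoding: every distinct full string becomes its own indicator column.
one_hot = pd.get_dummies(df["category_name"], dtype=int)
print(one_hot)
```

Each row has exactly one active column, so the shared “Tops” level is invisible to a model trained on these features.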
Binary encoding
Another technique for handling multi-valued categorical columns is binary encoding. It splits each multi-valued categorical value into its individual categorical values and then expands the original column into one binary-valued column per individual value, as illustrated in Fig 3. One limitation of the binary encoding approach is that it cannot capture complex dependencies among categorical values. In Fig. 2, for example, “T-Shirts” is always under “Tops,” while “Accessories” may exist under both “Tops” and “Bottoms.” Such structures, which often exist in multi-valued categorical columns, are important for incorporating more domain context into the model, which in turn improves accuracy and transparency.
Fig 3.
Table A
id | category_name |
---|---|
0 | Men / Tops / T-Shirts |
1 | Women / Tops / Accessories |
2 | Children / Tops / Accessories |
->
Table B
id | Men | Tops | T-Shirts | Women | Accessories | Children |
---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 1 | 1 | 0 |
2 | 0 | 1 | 0 | 0 | 1 | 1 |
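A rough sketch of this split-then-expand step, assuming pandas and assuming the values are separated by exactly " / " as in the figures. The names `df` and `multi_hot` are hypothetical and introduced only for this example.

```python
import pandas as pd

# Table A from Fig. 3 (illustrative data only).
df = pd.DataFrame({
    "category_name": [
        "Men / Tops / T-Shirts",
        "Women / Tops / Accessories",
        "Children / Tops / Accessories",
    ]
})

# Split each value on the delimiter and create one 0/1 column per distinct token.
multi_hot = df["category_name"].str.get_dummies(sep=" / ")
print(multi_hot)
```

Note that the position of a token in the path is lost: “Accessories” under “Tops” and “Accessories” under any other parent would map to the same column.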
N-gram Encoding
A more advanced approach to extracting features from multi-valued categorical columns is to use n-grams. An n-gram is a sequence of “n” items. For multi-valued categorical columns, an n-gram represents a sequence of categorical values, and n-gram encoding transforms the original multi-valued categorical column into binary n-gram features, as illustrated in Fig. 3. N-gram encoding naturally extends binary encoding (1-gram features are equivalent to the binary encoding features) and can capture dependencies among categorical values, which addresses the drawback of the binary encoding approach. N-grams are a well-established technique in text and speech recognition, so we can leverage existing n-gram research as well as open-source libraries such as categorical encoding and category encoders.
Fig.3
Table A
id | category_name |
---|---|
0 | Men / Tops / T-Shirts |
1 | Women / Tops / Accessories |
2 | Children / Tops / Accessories |
->
Table B
id | Men | Men-Tops | Men-Tops-TShirts | Tops-Accessories | Women-Tops |
---|---|---|---|---|---|
0 | 1 | 1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 1 |
2 | 0 | 0 | 0 | 1 | 0 |
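The following is a minimal sketch of n-gram encoding, not the implementation used in the experiment. It assumes that n-grams are contiguous runs of path levels, that levels are separated by "/", and that n-gram names are joined with "-" as in the figure. The helper `path_ngrams` and the variable names are hypothetical.

```python
import pandas as pd

def path_ngrams(value, sep="/", max_n=3):
    """Return all contiguous n-grams (n = 1..max_n) of a delimited category path."""
    tokens = [t.strip() for t in value.split(sep)]
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append("-".join(tokens[i:i + n]))
    return grams

df = pd.DataFrame({
    "category_name": [
        "Men / Tops / T-Shirts",
        "Women / Tops / Accessories",
        "Children / Tops / Accessories",
    ]
})

# One dict per row: n-gram name -> 1. N-grams absent from a row become 0.
rows = [dict.fromkeys(path_ngrams(v), 1) for v in df["category_name"]]
ngram_features = pd.DataFrame(rows, index=df.index).fillna(0).astype(int)
print(ngram_features)
```

The figure shows only a subset of the resulting columns; the full output also contains the 1-gram columns, which match the binary encoding features. Because the number of n-gram columns grows quickly with the number of distinct paths, a feature selection step usually follows.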
Experiment
In this experiment, we compared one-hot encoding (the most naive approach) and n-gram encoding (the most advanced approach) for multi-valued categorical columns, using Kaggle price prediction data for an e-commerce site (Mercari Price Prediction). The data contains 1.3M rows with a numeric column to predict (“price”), three categorical columns (“brand,” “condition,” and “shipment”), two text columns (“name” and “description”), and a multi-valued categorical column (“product_category”), as shown in Fig. 4.
Fig.4 Mercari Price Prediction data
price | product_category | brand | name | condition | description | shipment |
---|---|---|---|---|---|---|
Integer | Multi-Valued Category | Category | Text | Category | Text | Category |
$20 | Women / Tops / TShirts | Nike | Nike men’s dri-fit tee shirt | 1 | Basically brand new. No wear or tear, great condition, great and soft shirt. | True |
$25 | Men / Tops / TShirts | unfeatured | Unfeatured T-shirts | 2 | Cleaning out closet Brand new, never worn | False |
$130 | Men / Tops / Button-Front | Burberry | AEROPASTALE L WHITE BUTTON DOWN | | No description yet | True |
- For the three categorical columns (“brand,” “condition,” and “shipment”), we applied one-hot encoding and target encoding.
- We applied a standard topic modeling technique for the two text columns and used topic vectors as features.
- For the multi-valued categorical column (“product_category”), we compared one-hot encoding (Approach-1) and N-gram encoding (Approach-2).
- For both Approach-1 and Approach-2, we applied LightGBM and tuned its hyperparameters using grid search.
- We evaluated the accuracy based on RMSLE.
Fig. 5 shows the RMSLEs on the training and validation sets. As can be seen, Approach-2 (N-gram features) reduced the validation RMSLE by 3.4% (0.531 vs. 0.513). We then sorted features by permutation importance score and found several N-gram features, such as “Electronics” (a 1-gram feature, equivalent to a binary encoding feature) and “Women-Jewelry” (a 2-gram feature), that are highly effective in predicting product prices. Electronics and women’s jewelry products are often more expensive, and the N-gram features captured this trend from the multi-valued categorical column.
Fig.5.
Approach | Training RMSLE | Validation RMSLE |
---|---|---|
Approach-1 (one-hot encoding) | 0.517 | 0.531 |
Approach-2 (N-gram encoding) | 0.499 | 0.513 |
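For reference, the metric and the reported improvement can be reproduced as follows. The article does not define RMSLE, so this sketch assumes the standard definition (root mean squared error between log(1 + predicted) and log(1 + actual)); the `rmsle` function and the sample prices are illustrative and not taken from the experiment.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (standard definition, assumed here)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Hypothetical prices, only to show how the function is called.
example_score = rmsle([20, 25, 130], [22, 20, 100])

# Relative reduction in validation RMSLE, using the values reported in Fig. 5.
relative_reduction = (0.531 - 0.513) / 0.531
print(f"{relative_reduction:.1%}")
```

Because the error is computed on a log scale, RMSLE penalizes relative rather than absolute price errors, which suits data where prices span a wide range.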
dotData’s Approach
dotData’s feature discovery platform handles multi-valued categorical columns in the following steps:
- Auto-detection: dotData analyzes all categorical columns and automatically detects the multi-valued ones.
- N-gram feature extraction: dotData extracts N-gram features from the detected multi-valued categorical columns.
- Feature selection: dotData assesses the extracted N-gram features and selects the important ones. N-gram features tend to be very high-dimensional, so this selection step is important in practice.
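To give a feel for what the auto-detection step involves, here is a simple heuristic sketch. It is not dotData’s algorithm, which the article does not describe; it only assumes that a column is multi-valued when most of its non-null values contain a known delimiter. The function `looks_multi_valued`, its `sep` and `min_fraction` parameters, and the sample data are all hypothetical.

```python
import pandas as pd

def looks_multi_valued(series, sep="/", min_fraction=0.9):
    """Heuristic: flag a column if at least min_fraction of its values contain sep."""
    values = series.dropna().astype(str)
    if values.empty:
        return False
    # regex=False treats sep as a literal string rather than a pattern.
    return bool(values.str.contains(sep, regex=False).mean() >= min_fraction)

# The two columns from Fig. 1 (illustrative data only).
df = pd.DataFrame({
    "payment_type": ["Cash", "Credit Card", "Debit Card", "Loan"],
    "category_name": [
        "Men / Tops / T-Shirts",
        "Electronics / Computers & Tablets / Components & Parts",
        "Women / Tops & Blouses / Blouse",
        "Children / Toys / Accessories",
    ],
})

detected = [col for col in df.columns if looks_multi_valued(df[col])]
print(detected)
```

A production system would need to handle unknown delimiters and free-text columns that happen to contain the delimiter, which this sketch does not attempt.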
Summary
Multi-valued categorical data is common in business data. N-gram encoding is a promising approach for extracting the relationships within multi-valued categorical values as features, which can improve model performance and yield deeper insights. You can handle multi-valued categorical columns automatically using dotData’s feature discovery platform.