Feature Engineering from Multi-Valued Categorical Data
Multi-valued categorical data is ubiquitous and critical to businesses across many industries. Most real-world data that businesses store is categorical rather than numerical. Examples include level of education (elementary, middle school, bachelor’s degree, master’s degree, Ph.D.), subscription types (basic, premium, business), and countries. Such data is needed both for daily operations and for analyzing business performance. Businesses rely on categorical data to capture real-world trends, and although working with it can be challenging, there are ways to address those challenges.
Categorical columns represent data that can be divided into several different groups. A categorical column is either ordinal or nominal, and either kind can be uni-valued (regular) or multi-valued. A multi-valued categorical column holds several categorical values in a single cell. In Fig. 1, the payment_type column is a regular categorical column, whereas the category_name column is a multi-valued categorical column because each record (row) contains multiple entries within the same column.
Fig. 1
id | payment_type |
---|---|
0 | Cash |
1 | Credit Card |
2 | Debit Card |
3 | Loan |

id | category_name |
---|---|
0 | Men / Tops / T-Shirts |
1 | Electronics / Computers & Tablets / Components & Parts |
2 | Women / Tops & Blouses / Blouse |
3 | Children / Toys / Accessories |
Businesses across diverse industries handle large amounts of multi-valued categorical data every day. Examples include product pricing, job applications, feedback surveys, and competitive analysis results. Because this data is so prevalent, knowing how to work with it is critical. Multi-valued categorical columns matter because they represent real-world data; capturing such data is not simple, but it enables us to build good ML models. These columns can also hide useful information, so without the right methods we miss relevant signals that could help improve the business. From an analytical point of view, handling them properly enables statistical analyses that we otherwise could not perform. From a technical point of view, they are challenging to work with.
A multi-valued categorical column cannot be used as a model feature without preprocessing. Several feature extraction techniques can turn such a column into features that are suitable for building models.
The simplest technique to handle multi-valued categorical columns is one-hot encoding, which treats each multi-valued categorical value as a single ordinary categorical value and generates binary-valued features, as illustrated in Fig. 2. The limitation of this approach is that it cannot capture structure among categorical values. For example, “Men/Tops/T-Shirts” and “Women/Tops/T-Shirts” are both “Tops.” However, this encoding treats them as unrelated products.
Fig 2.
Table A
id | category_name |
---|---|
0 | Men / Tops / T-Shirts |
1 | Women / Tops / Accessories |
2 | Children / Tops / Accessories |
Table B
id | Men / Tops / T-Shirts | Women / Tops / Accessories | Children / Tops / Accessories |
---|---|---|---|
0 | 1 | 0 | 0 |
1 | 0 | 1 | 0 |
2 | 0 | 0 | 1 |
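
The one-hot transformation in Fig. 2 can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the experiment below; the function name `one_hot_encode` and the variable names are hypothetical, and the rows are copied from Table A.

```python
def one_hot_encode(values):
    """Treat each full string as one category; emit one binary column per category."""
    # dict.fromkeys keeps the order of first appearance and drops duplicates.
    categories = list(dict.fromkeys(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


# Rows of Table A in Fig. 2.
category_name = [
    "Men / Tops / T-Shirts",
    "Women / Tops / Accessories",
    "Children / Tops / Accessories",
]

columns, encoded = one_hot_encode(category_name)
for row_id, row in enumerate(encoded):
    print(row_id, row)
```

Because the whole string is compared as a single value, the shared “Tops” level never shows up as a feature, which is the limitation described above.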
Another technique to handle multi-valued categorical columns is binary encoding. It splits each multi-valued categorical value into its individual categorical values and then expands the original column into one binary-valued column per individual value, as illustrated in Fig. 3. One limitation of the binary encoding approach is that it cannot capture complex dependencies among categorical values. In Fig. 3, for example, “T-Shirts” is always under “Tops,” while “Accessories” may exist under both “Tops” and “Bottoms.” Such structures, which often exist in multi-valued categorical columns, are important for incorporating more domain context into the model for better accuracy and transparency.
Fig 3.
Table A
id | category_name |
---|---|
0 | Men / Tops / T-Shirts |
1 | Women / Tops / Accessories |
2 | Children / Tops / Accessories |
->
Table B
id | Men | Tops | T-Shirts | Women | Accessories | Children |
---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 1 | 1 | 0 |
2 | 0 | 1 | 0 | 0 | 1 | 1 |
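
As a rough sketch of the binary encoding shown above, the following plain-Python snippet splits each value on the separator and marks which individual values are present. The helper names (`split_levels`, `binary_encode`) and the choice of "/" as the separator are assumptions made for illustration.

```python
def split_levels(value, sep="/"):
    """Split a multi-valued entry into its individual categorical values."""
    return [part.strip() for part in value.split(sep)]


def binary_encode(values, sep="/"):
    """Emit one binary column per individual value found in the column."""
    tokenized = [split_levels(value, sep) for value in values]
    # Vocabulary in order of first appearance across all rows.
    vocabulary = list(dict.fromkeys(t for tokens in tokenized for t in tokens))
    rows = []
    for tokens in tokenized:
        present = set(tokens)
        rows.append([1 if token in present else 0 for token in vocabulary])
    return vocabulary, rows


# Rows of Table A.
category_name = [
    "Men / Tops / T-Shirts",
    "Women / Tops / Accessories",
    "Children / Tops / Accessories",
]

vocabulary, encoded = binary_encode(category_name)
print(vocabulary)
for row_id, row in enumerate(encoded):
    print(row_id, row)
```

Each column records only whether a value appears somewhere in the entry. The information that “Accessories” sits under “Tops” in a given row is lost.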
A more advanced approach to extracting features from multi-valued categorical columns is to use n-grams. An n-gram is a sequence of “n” items. For multi-valued categorical columns, an n-gram represents a sequence of categorical values, and n-gram encoding transforms the original multi-valued categorical column into binary n-gram features, as illustrated in the figure below. N-gram encoding naturally extends binary encoding (1-gram features are equivalent to the binary encoding features) and can capture dependencies among categorical values, which addresses the drawbacks of the binary encoding approach. N-grams are a well-established technique in text and speech recognition, so we can draw on the extensive n-gram research in that community as well as open-source libraries such as categorical encoding and category encoders.
Fig.3
Table A
id | category_name |
---|---|
0 | Men / Tops / T-Shirts |
1 | Women / Tops / Accessories |
2 | Children / Tops / Accessories |
->
Table B
id | Men | Men-Tops | Men-Tops-TShirts | Tops-Accessories | Women-Tops |
---|---|---|---|---|---|
0 | 1 | 1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 1 |
2 | 0 | 0 | 0 | 1 | 0 |
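
The n-gram features above can be approximated with the sketch below. It assumes that n-grams are contiguous runs of levels joined with "-", which is consistent with the columns shown in Table B but is not confirmed as the exact procedure used. The function names and the `max_n` parameter are hypothetical.

```python
def contiguous_ngrams(tokens, max_n):
    """Return all contiguous n-grams of the tokens for n = 1 .. max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            # Note: "-" also occurs inside "T-Shirts"; a real pipeline may
            # prefer a joiner that cannot appear inside a category value.
            grams.append("-".join(tokens[start:start + n]))
    return grams


def ngram_encode(values, max_n=3, sep="/"):
    """Emit one binary column per n-gram observed in the column."""
    per_row = [
        contiguous_ngrams([part.strip() for part in value.split(sep)], max_n)
        for value in values
    ]
    vocabulary = list(dict.fromkeys(g for grams in per_row for g in grams))
    rows = []
    for grams in per_row:
        present = set(grams)
        rows.append([1 if gram in present else 0 for gram in vocabulary])
    return vocabulary, rows


# Rows of Table A.
category_name = [
    "Men / Tops / T-Shirts",
    "Women / Tops / Accessories",
    "Children / Tops / Accessories",
]

vocabulary, encoded = ngram_encode(category_name, max_n=3)
# Column-oriented view, keyed by feature name, for easy inspection.
column = {name: [row[i] for row in encoded] for i, name in enumerate(vocabulary)}
for name in ("Men", "Men-Tops", "Tops-Accessories", "Women-Tops"):
    print(name, column[name])
```

With `max_n=1` the output reduces to the binary encoding, which matches the statement that 1-gram features are equivalent to binary encoding features.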
In this experiment, we compared one-hot encoding (the most naive approach) and n-gram encoding (the most advanced approach) for multi-valued categorical columns, using Kaggle price prediction data for an e-commerce site (Mercari Price Prediction). The data contains 1.3M rows with a numeric target column (“price”), three categorical columns (“brand,” “condition,” and “shipment”), two text columns (“name” and “description”), and a multi-valued categorical column (“product_category”), as shown in Fig. 4.
Fig.4 Mercari Price Prediction data
price | product_category | brand | name | condition | description | shipment |
---|---|---|---|---|---|---|
Integer | Multi-Valued Category | Category | Text | Category | Text | Category |
$20 | Women / Tops / TShirts | Nike | Nike men’s dri-fit tee shirt | 1 | Basically brand new. No wear or tear, great condition, great and soft shirt. | True |
$25 | Men / Tops / TShirts | unfeatured | Unfeatured T-shirts | 2 | Cleaning out closet Brand new, never worn | False |
$130 | Men / Tops / Button-Front | Burberry | AEROPASTALE L WHITE BUTTON DOWN | | No description yet | True |
Fig. 5 shows the RMSLE on the training and validation sets. Approach-2 (n-gram encoding) reduced the validation RMSLE by 3.4% relative to Approach-1 (one-hot encoding), from 0.531 to 0.513. We then sorted features by permutation importance score and found several n-gram features, such as “Electronics” (a 1-gram feature, equivalent to a binary encoding feature) and “Women-Jewelry” (a 2-gram feature), that are highly effective in predicting product prices. Electronics and women’s jewelry products are often more expensive, and the n-gram features captured this trend from the multi-valued categorical column.
Fig.5.
Approach | Training RMSLE | Validation RMSLE |
---|---|---|
Approach-1 (one-hot encoding) | 0.517 | 0.531 |
Approach-2 (N-gram encoding) | 0.499 | 0.513 |
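
For readers unfamiliar with the metric, the sketch below computes RMSLE under its usual definition (root mean squared difference of log(1 + value)) and reproduces the 3.4% relative reduction from the validation scores in the table. The article does not spell out the formula, so the definition is an assumption; the function names are hypothetical, and the predictions are made-up numbers used only to exercise the function.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x) on both sides."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


def relative_reduction(baseline, improved):
    """Fractional improvement of `improved` over `baseline` (lower is better)."""
    return (baseline - improved) / baseline


# Prices taken from the sample rows of Fig. 4; predictions are illustrative only.
prices = [20.0, 25.0, 130.0]
predictions = [18.0, 30.0, 110.0]
print("toy RMSLE:", rmsle(prices, predictions))

# Validation RMSLE values from the table above.
reduction = relative_reduction(0.531, 0.513)
print(f"relative reduction: {reduction:.1%}")
```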
dotData’s feature discovery platform treats multi-valued categorical columns in the following steps:
Multi-valued categorical data is common in business data. N-gram encoding is a promising approach to extracting relationships among multi-valued categorical values as features, which can improve model performance and yield deeper insights. You can automatically handle multi-valued categorical columns using dotData’s feature discovery platform.