Video: dotData AutoFE Presentation at ODSC
See our CEO, Ryohei Fujimaki, PhD, as he discusses Automated Feature Engineering at the virtual ODSC East Conference.
Video Transcript: dotData AutoFE Presentation at ODSC
Hello, everyone, this is Ryohei here, CEO of dotData. Today I’m going to talk about automated feature engineering for enterprise machine learning. Before my presentation, let me briefly introduce myself and dotData.
I have a Ph.D. in machine learning. In my previous career at NEC Corporation, I had mainly two roles. One was to invent transparent and explainable machine learning as an AI researcher. The other was to develop AI solutions as a customer-facing data scientist. Over those years at NEC, I led more than 100 machine learning projects across global clients.
In our customer-facing projects, feature engineering was always a bottleneck. It was difficult and time-consuming. Very importantly, as a data science team, we had to develop features that made sense for business units. The dotData project started back in 2015 as a seed research project at NEC to automate feature engineering and support our data science team.
In 2018, we spun out from NEC and became an independent startup. Today, our software is recognized as one of the most powerful feature engineering tools for data science teams and is widely used by global customers.
So let me start my presentation with this statement: great machine learning algorithms are not equal to great machine learning models. There are many great machine learning algorithms, such as XGBoost, LightGBM, deep learning, neural networks, support vector machines, and so forth.
However, a great machine learning algorithm alone does not guarantee a great machine learning model. A machine learning model can only be as good as its input data, which is called a feature table. To develop a great machine learning model, you need great features.
So let's first overview the feature engineering process. It starts with hypothesizing features based on your domain knowledge: what data patterns are relevant to the problem you are trying to solve? Then you have to transform your complex data into a single flat table. This requires knowledge of the complex data structure and the skill to handle very large-scale source data before aggregation.
You don't know in advance which features are most important for building machine learning models, so you must validate the statistical significance of these features. It is common sense among data science practitioners that feature engineering is the most manual, error-prone, and time-consuming part of machine learning model development.
This slide illustrates what feature engineering is in practice. It is not just a simple mathematical transformation, like column A times column B, or a one-hot encoding of a categorical variable. You have many tables with complex relationships. Different tables have different natures, such as transactional, temporal or time series, geolocation, text, and so forth.
Based on your domain and data knowledge, you must come up with important features for the machine learning problem you are trying to solve. So far, the story is conceptual, so let me use an example of actual feature engineering.
This is a home mortgage cross-sell prediction problem at a bank. There are customers who have standard checking and savings accounts, and we want to identify customers who are likely to purchase a home mortgage loan so we can run a promotion campaign. This is a standard binary classification problem, and you can use any machine learning algorithm of your preference, but you need features for your machine learning algorithm. So let's take a look at the data set.
First, the mortgage application table contains an application column, which is your target variable. This is a flag indicating whether the customer bought a home mortgage or not. Then you have five tables. The customer attribute table contains basic customer information like gender, age, or city. Each customer has multiple accounts in this bank, and account information is stored in the account master table.
Activities on each account, like withdrawals, deposits, and balances, are stored in the balance history table. Online banking access logs are stored in the web transaction table. And there is one more table, the store master table. As you can see, the feature table does not yet exist in your data set, and you have to build it by combining and aggregating these five source tables.
So, you first have to come up with a feature hypothesis that makes sense for your business unit. In this example, let us consider this hypothesis: a customer who withdrew more than $50,000 over the past six months and whose occupation is engineer is probably a high-income engineer with an expensive lifestyle, so such a customer is more likely to be interested in purchasing a new home and a mortgage loan. As this example shows, it is very important to tell a business story to hypothesize features.
Now, to generate this feature, you have to understand the data structure. Take "withdrew more than $50,000 over the past six months": this withdrawal information is contained in the balance history table, and we have to perform a temporal aggregation over the past six months. So we have to map the date column in the mortgage application table to the balance timestamp column in the balance history table. But these two tables, the balance history table and the mortgage application table, are not connected with each other directly; the balance history table does not contain a customer ID. So you have to map the customer ID to the account ID to connect these two tables.
The second condition, occupation is engineer, is stored in the occupation column of the customer attribute table. So, to generate this feature, you have to combine the customer attribute table, the home mortgage application table, and the balance history table, and because they are not directly connected with each other, the account master table bridges the information and connects these three tables.
So now you understand the data structure. In the next step, you have to implement a query to transform the data and generate this feature. Here's an example query to produce this feature. As you can see, the query is fairly complex, requiring data engineering knowledge. In particular, the balance history table is transaction data and contains billions of records, so you are often required to handle large-scale data using a distributed computation platform.
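The query from the slide is not reproduced in this transcript, but a minimal pandas sketch of the same logic could look like the following. The table and column names (customer_id, account_id, balance_timestamp, withdrawal_amount, occupation) are assumptions for illustration, not the actual bank schema from the talk.

```python
import pandas as pd

# Hypothetical source tables and column names (for illustration only)
applications = pd.read_csv("mortgage_application.csv", parse_dates=["date"])      # customer_id, date, application
accounts = pd.read_csv("account_master.csv")                                      # account_id, customer_id
customers = pd.read_csv("customer_attribute.csv")                                 # customer_id, occupation, ...
balances = pd.read_csv("balance_history.csv", parse_dates=["balance_timestamp"])  # account_id, withdrawal_amount, ...

# Bridge the balance history to customers via the account master table
bal = balances.merge(accounts[["account_id", "customer_id"]], on="account_id")

# Join to the target table so each record can be compared against the application date
bal = bal.merge(applications[["customer_id", "date"]], on="customer_id")

# Temporal aggregation: total withdrawals in the six months before the application date
in_window = (bal["balance_timestamp"] >= bal["date"] - pd.DateOffset(months=6)) & (
    bal["balance_timestamp"] <= bal["date"]
)
withdrawals = (
    bal.loc[in_window]
    .groupby("customer_id")["withdrawal_amount"]
    .sum()
    .rename("withdrawal_6m")
    .reset_index()
)

# Combine with the occupation condition to form the hypothesized feature
features = (
    applications.merge(withdrawals, on="customer_id", how="left")
    .merge(customers[["customer_id", "occupation"]], on="customer_id", how="left")
)
features["high_withdrawal_engineer"] = (
    (features["withdrawal_6m"].fillna(0) > 50_000) & (features["occupation"] == "engineer")
).astype(int)
```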
As you can see, generating even one feature is complex, and this is just feature hypothesis number one. Typically, you must develop hundreds of these features to build a great machine learning model for the problem you must solve. That is why feature engineering is so complex, time-consuming, and very difficult.
Until now, I have explained what feature engineering is and why it is difficult. Let me tell a personal story. In 2014, we had a project with a telecom client in New Zealand. As far as I remember, more than 60% of their mobile users were prepaid, and the churn rate was very high for such prepaid users.
This was a churn prediction problem, and one of the unique challenges was that customer attribute information was not available because these were prepaid users. So we had to detect churn signals based on the users' behavior patterns, like prepayment charges, call history, call relationships, and so forth. In this New Zealand churn project, we manually developed about 800 features in total. It took more than five months to develop these features and get acceptance from the client. As you can imagine, manually writing 800 feature queries was very error-prone, and a lot of rework happened.
This experience made us believe that the data science team needs intelligent support to quickly explore more features from complex enterprise data sets, present feature ideas to the business team to get their feedback earlier, and make the process more agile. That's why we started this automated feature engineering project the next year, in 2015.
So dotData's key innovation and key offering is to automate the feature engineering process. Given transaction data, temporal data, geolocation data, text data, and so forth, our AI engine automatically hypothesizes, transforms, and validates features, producing AI-generated features. So what are the benefits of automated feature engineering for the data science team? First, it allows you to quickly test new data sets without writing a lot of complex feature queries upfront. This makes the trial-and-error process of building a machine learning model much faster. Second, you can blend AI-generated features with the features that you develop. AI features expand your feature space and improve your model accuracy. As you know, the most principled way to improve machine learning model accuracy is to leverage more data and more features.
With automated feature engineering, you can quickly explore 10 times more data and 100 times more features. Just to be fair, automated feature engineering is not going to replace the core features that you develop. Your features are deeper, because AI can never match your domain knowledge and domain expertise. AI-generated features are wider, because they can explore many more patterns than you can manually. By combining your deeper features and wider AI features, you can develop a greater machine learning model.
Let me explain what kinds of features dotData can explore. For simple flat data, like the customer attribute table in the home mortgage example, we explore different types of categorical encoding features, such as one-hot encoding and target encoding. Target encoding in particular is often very powerful for high-dimensional categorical columns. Also, numerical and time columns are featurized using techniques like histogram binning, piecewise linear transformation, holiday flagging, and so forth.
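As a quick illustration of one of these encodings, here is a minimal sketch of out-of-fold target encoding in pandas. The fold-based scheme and the column names are assumptions for illustration, not dotData's implementation.

```python
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(df: pd.DataFrame, cat_col: str, target_col: str, n_splits: int = 5) -> pd.Series:
    """Out-of-fold target encoding: replace each category with the mean target
    computed on the other folds, which limits target leakage."""
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target_col].mean()
    for train_idx, valid_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(df):
        fold_means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[valid_idx] = (
            df.iloc[valid_idx][cat_col].map(fold_means).fillna(global_mean).values
        )
    return encoded

# Usage with hypothetical columns from the mortgage example:
# customers["occupation_te"] = target_encode(customers, "occupation", "application")
```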
When it comes to transactional data, the cardinality between the target and source tables is different. For the mortgage application example, you want to produce features keyed by customer ID, but there are many records under the same customer ID in the web transaction table. So to generate features from the web transaction table, we need different types of multi-record aggregation. For numeric columns, it can be as simple as a max or min type of operation; for categorical and timestamp columns, we support more advanced aggregations. Also, suppose you are building a product demand forecasting model for retail stores. Then your target has a store ID and a product ID, and you want to forecast the demand for a particular product in a particular store. So aggregation is possible at the store level, the product level, and their composite level. Automated feature engineering supports such multi-record aggregation and multi-level aggregation, as sketched below.
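A minimal pandas sketch of multi-record and multi-level aggregation, assuming hypothetical column names (store_id, product_id, units_sold) rather than the actual schemas from the talk:

```python
import pandas as pd

# Hypothetical transactional table: one row per sale event
sales = pd.read_csv("sales_transactions.csv")   # store_id, product_id, units_sold, ...

# Multi-record aggregation at the composite (store, product) level
store_product = (
    sales.groupby(["store_id", "product_id"])["units_sold"]
    .agg(["sum", "mean", "max"])
    .add_prefix("units_")
    .reset_index()
)

# The same source table aggregated at coarser, single-key levels
store_level = sales.groupby("store_id")["units_sold"].sum().rename("units_store_total").reset_index()
product_level = sales.groupby("product_id")["units_sold"].sum().rename("units_product_total").reset_index()

# Join all aggregation levels back onto the composite-keyed feature table
features = store_product.merge(store_level, on="store_id").merge(product_level, on="product_id")
```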
Because real-world data are essentially all temporal data, dotData supports various temporal features, such as standard lag-type features and features based on temporal recency, periodic patterns, or temporal changes of certain attributes. For temporal data, it is important to determine the appropriate temporal range over which to aggregate information. For example, if we want to detect a monthly trend, we should aggregate data at a monthly level, while we should aggregate at a daily level if short-term behaviors are important. Our auto feature engineering not only explores different types of temporal features but also optimizes the temporal aggregation range for each feature. Geolocation data and text data are also becoming increasingly important in many industries.
Our feature engineering explores features based on geolocational recency, grid encoding, distance encoding, topic modeling, text tokenization, and so forth. Very importantly, auto feature engineering analyzes and combines relational tables, and we often discover very interesting features behind such table relationships. Obviously, a brute-force search does not work because the feature space is too big. That is our key innovation: our auto feature engineering has very intelligent ways to efficiently explore this very broad feature space, validate relevance scores, and output the most promising feature candidates in just minutes to hours of computation, depending on the size of your data set.
Our auto feature engineering includes more powerful functionalities. For example, in the auto feature engineering pipeline, dotData performs automated data cleansing to prevent data leakage, canonicalize categorical values, remove record duplications and record outliers, handle missing values, and so forth. In particular, dotData automatically identifies data records that cause data leakage and removes such records from feature generation. This is very important and useful when we handle temporal data and generate temporal features. Automatic categorical canonicalization is also very useful when we handle data sets with dirty categorical values.
For AI-generated features, it is very important to make sure the features are explainable both quantitatively and qualitatively. For quantitative transparency, dotData computes many feature importance metrics, such as permutation importance and feature-wise AUC. For qualitative transparency, it automatically produces a feature explanation and blueprint so that we can understand the meaning of each feature. It is a core dotData commitment to produce features that business users can understand.
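For readers unfamiliar with these two metrics, here is a small, self-contained sketch of how they can be computed with scikit-learn on a synthetic feature table. This is standard tooling for illustration, not dotData's internal code.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a feature table with a binary target
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)

# Permutation importance: how much the validation AUC drops when a feature column is shuffled
perm = permutation_importance(model, X_valid, y_valid, scoring="roc_auc",
                              n_repeats=10, random_state=0)
for name, score in sorted(zip(X_valid.columns, perm.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: permutation importance = {score:.4f}")

# Feature-wise AUC: how well each single feature separates the classes on its own
for col in X_valid.columns:
    auc = roc_auc_score(y_valid, X_valid[col])
    print(f"{col}: feature-wise AUC = {max(auc, 1 - auc):.3f}")
```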
The last point I want to highlight is data and feature scalability. When we handle temporal and transactional data, the original data have many more records than the feature table itself. Therefore, auto feature engineering is designed to process billions of records using distributed algorithms. In addition, for such transactional or temporal data, data skewness typically becomes a headache to process. Therefore, dotData uses an internal algorithm to estimate such data skew and process skewed data appropriately.
With these additional functionalities, our auto feature engineering is simple to use, transparent, and scalable. Now let me overview a few case studies to show the impact of auto feature engineering in real projects.
Our payment SaaS client had been struggling with $279 million in declined payments – over 100 million transactions annually. They applied our auto feature engineering to improve their dunning prediction model by analyzing end users' payment patterns. This table compares their previous model without auto feature engineering and the new model using auto feature engineering. With dotData auto feature engineering, they could explore seven transactional tables and 122,000 features in just two to three hours, which is 100 times more features than they had manually developed; previously they had only 112 features. With auto-discovered features, the prediction accuracy improved from 75.6% to 90.9% in AUC, which is a very significant improvement. As a result, they could recover over 35% of declined payments using their smart dunning solution.
This is a case study applying auto feature engineering to marketing automation. The customer is a mobile telecommunications carrier with more than 50 million users. Their challenge was to improve a variety of marketing automation models, like upsell, cross-sell, new device promotion, and loyalty programs; they operate more than 500 marketing automation models. In digital marketing, it is very important to offer the optimal campaign within 24 hours of certain events. Therefore, they expected auto feature engineering to discover critical user behaviors with daily or even hourly level temporal aggregations. This table compares the prediction accuracy for cross-sell for the online store, cross-sell for the offline store, new mobile device promotion, and the loyalty program. The "without auto feature engineering" column is their previous model, based on features they had manually developed.
They applied auto feature engineering to discover weekly and daily behavior patterns from 12 additional tables of mobile and payment transactions with more than 10 billion records. As you can see, the results were extremely impressive: model accuracy significantly improved for almost all models. For example, for cross-sell for the offline store, the previous model without auto feature engineering, based on a rather static feature table, achieved 82% accuracy. But with auto feature engineering, particularly daily user behavior patterns, accuracy improved to 89%. So we demonstrated that this type of short-term user behavior is critical to their marketing automation use case. They have developed 50 marketing automation models, refreshed their previous models with new models using our auto feature engineering, and eventually increased annual sales by more than $10 million.
The last case study is Walmart's retail goods demand forecasting at Kaggle. The objective of this competition was to forecast daily unit sales for over 3,000 unique products across 10 Walmart locations in the United States. The task was to forecast SKU-level product demand, which is critical for retailers to optimize inventory levels. The available data include item-level details, department and product categories, calendar, selling prices, store details, and so forth. The competition involved over 5,500 teams from all over the world. As you can see in this ranking and score distribution, the distribution has a fairly long tail, which indicates the task was challenging, and there is a big peak in this area, which means the majority of participants achieved around this score; this is roughly the average participant level. Our process was very straightforward: we applied our auto feature engineering to generate many temporal features, then simply fed our AI-generated features into an XGBoost model. The parameters of XGBoost were optimized using cross-validation. These are the types of features that our auto feature engineering discovered. The first one is quite simple: sales in the last 14 days. This is just a simple temporal aggregation feature, and in this type of time series forecasting problem, such features typically have very strong predictive power.
The second feature looks slightly different and more complex: sales of household items in the last 28 days. This requires aggregation based not only on the sales column but also on the product category column. Household items are a very important category for Walmart, and this feature helps the machine learning model make better predictions for this product category. It is interesting to notice that the temporal aggregation ranges are different between the two features: dotData automatically discovered that for the simple temporal aggregation feature a two-week aggregation is good, while for household items it should be longer, so 28 days is preferred. The third feature is sales in the same week of the previous year. This is an annual periodic pattern, and in this type of retail forecasting problem, such seasonal periodic patterns are very important. The last one is sales in the last 90 days with no SNAP event. This is a very interesting feature: a temporal aggregation conditioned on a specific event flag. It is important to highlight that with traditional or even recent time-series modeling, like an autoregressive model or a recurrent neural network, it is not very easy to incorporate this type of event information into the model. But our auto feature engineering is flexible enough to incorporate this type of event information into temporal features. In total, our feature engineering generated hundreds of these types of features, and as a result, our model performance ranked in the top 1.8% among all participants. Obviously, our features could not compete against the world's best data scientists who won this competition. However, you can see how powerful AI-generated features are for improving your model performance very quickly.
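As an illustration of these four feature types, here is a minimal pandas sketch over a hypothetical daily sales table with one row per item per day. The column names (item_id, category, date, units_sold, snap_flag) and the exact windows are assumptions matching the descriptions above, not the features dotData actually generated.

```python
import pandas as pd

# Hypothetical daily sales table: one row per item per day
daily = pd.read_csv("daily_sales.csv", parse_dates=["date"])  # item_id, category, date, units_sold, snap_flag
daily = daily.sort_values(["item_id", "date"]).set_index("date")

# Pre-compute sales counted only on non-SNAP days, for the event-conditioned feature below
daily["units_no_snap"] = daily["units_sold"].where(daily["snap_flag"] == 0, 0)

grouped = daily.groupby("item_id")

# 1) Simple temporal aggregation: sales in the last 14 days
daily["sales_14d"] = grouped["units_sold"].transform(lambda s: s.rolling("14D").sum())

# 2) Longer aggregation range (28 days); in the talk this is further restricted to household items
daily["sales_28d"] = grouped["units_sold"].transform(lambda s: s.rolling("28D").sum())

# 3) Annual periodic pattern: sales in the same week of the previous year (7-day sum lagged 52 weeks)
daily["sales_same_week_last_year"] = grouped["units_sold"].transform(
    lambda s: s.rolling("7D").sum().shift(364, freq="D").reindex(s.index)
)

# 4) Event-conditioned aggregation: sales over the last 90 days, counting only non-SNAP days
daily["sales_90d_no_snap"] = grouped["units_no_snap"].transform(lambda s: s.rolling("90D").sum())
```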
Let me take a break here. I have explained our automated feature engineering and its impact on real-world applications. The last part of this presentation will explain how you can use dotData's auto feature engineering in your Python workflow.
So you have your Python workbench; probably many of you are using Jupyter as your front end. Then you have machine learning tools and libraries: scikit-learn, XGBoost, PyTorch, TensorFlow, or whatever library you want. Machine learning needs features, so you should have some sort of data and feature engine, like Spark DataFrames, pandas DataFrames, or Dask, with data stored in a data lake like S3, Azure Data Lake Gen2, or Hadoop HDFS. At dotData, we have a product called dotData Py, which is automated feature engineering for Python data scientists, and dotData Py fits into your tool ecosystem like this. Spark and pandas DataFrames give you a way to manually develop features, dotData Py automatically generates features, and the features dotData Py generates are interfaced through Spark or pandas DataFrames. So you can easily combine your features and our features and supply them to your machine learning algorithm. And this is the interface of dotData Py.
The second line, an import from the dotData core module, loads dotData Py as a library in your Jupyter environment. Let me show how you can use automated feature engineering with dotData Py in your Python workflow. This is a simple example of training a LightGBM model. The first step is to load your features and the target variable. This is just two lines of Python code, but in practice, to prepare this features.csv feature table, you have to write a lot of the types of queries we saw earlier behind this notebook.
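As a reference, a minimal sketch of this baseline is shown below; the file name, target column, and default model parameters are placeholders rather than the exact code on the slide.

```python
import pandas as pd
import lightgbm as lgb

# Load the pre-built feature table and the target variable (placeholder file/column names)
features = pd.read_csv("features.csv")
X = features.drop(columns=["target"])
y = features["target"]

# Train a LightGBM classifier on the manually engineered features
model = lgb.LGBMClassifier()
model.fit(X, y)
```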
Then, once the features and the target are ready as X and y, you supply X and y to the LightGBM model and train it. In practice, you may want to optimize hyperparameters, so the code in the second part may be a little more complex. Now, how can you utilize automated feature engineering and dotData Py in this example? This is Python code integrating our automated feature engineering into this LightGBM model training, so let me follow it step by step. The first step is the same: we load the features and target variable as X and y. The second step is to load the source tables, such as mortgage application, account master, web transaction, and so forth, and read this data from your data source. In this case, it is stored in CSV, so we just load the CSVs and store them in a dictionary of data frames. One more step is to specify the schema for these tables, and the schema inference is automated by dotData with an infer-analytic-schema-from-data-frame function. Once we have loaded the source tables, we set a target for automated feature engineering. You have already loaded the target as y, so you just specify this y as the target for automated feature engineering and also infer its schema.
This is the basic configuration we need. We provide this configuration to the design-features function: the prediction type is classification, the target is the target I created, and the source is the source data I created in the second step. One more thing I have to specify is the entity relationships, prepared in JSON, and I pass the path to that file as input.
By running this design-features step, hundreds of thousands or even millions of features are explored and validated to determine which are most promising for the given problem, and a feature engineering model is constructed. Once you have the feature engineering model, you can simply get the features: the generate-features function produces the features in data frame format from the source data.
The last step is to use our features for training the XGBoost model. Here, X is your features, and the X.merge call integrates your features and the dotData features, so you can easily use both sets of features to train your XGBoost model.
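Piecing the walkthrough together, a rough sketch of the flow might look like the following. The dotData Py module, function, and parameter names here (dotdata_core, infer_analytic_schema, design_features, generate_features, entity_relationships) are approximated from the talk, not taken from the product documentation, so treat them as placeholders for the real API.

```python
import pandas as pd
import xgboost as xgb

# Placeholder import: the talk shows an import from dotData's core module (names approximated)
from dotdata_core import infer_analytic_schema, design_features, generate_features

# Step 1: your own features and target, as before (placeholder file/column names)
X = pd.read_csv("features.csv")
y = pd.read_csv("target.csv")["application"]

# Step 2: load the source tables into a dictionary of data frames and infer their schemas
table_names = ["mortgage_application", "account_master", "web_transaction",
               "customer_attribute", "balance_history"]
sources = {name: pd.read_csv(f"{name}.csv") for name in table_names}
schemas = {name: infer_analytic_schema(df) for name, df in sources.items()}  # schema inference is automated

# Step 3: set the target and run feature design; entity relationships come from a JSON file
target = infer_analytic_schema(y.to_frame())
fe_model = design_features(
    prediction_type="classification",
    target=target,
    source=schemas,
    entity_relationships="relationships.json",  # path to the relationship definitions
)

# Step 4: generate AI features as a data frame and merge them with your own features
ai_features = generate_features(fe_model, sources)
X_all = X.merge(ai_features, left_index=True, right_index=True)

# Step 5: train XGBoost on the combined feature table
model = xgb.XGBClassifier().fit(X_all, y)
```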
Overall, this is the entire workflow again. It is very easy to integrate automated feature engineering with your existing machine learning workflow with very minimal change in code.
This is the last slide of my presentation – the deployment model of dotData Py. There are two different deployment models. The first one is a desktop deployment for your local experiments. You can launch dotData Py in your local environment with just a single Docker command. Both the dotData Py client and dotData Py Compute are launched inside the Docker container, and you can just get started. Once you find promising features in your local experiments, you can publish those feature queries to the server side to generate the features at scale in production. This deployment is called the dotData Py cluster deployment.
For the cluster deployment, dotData Py Compute is deployed across distributed computation nodes, so to handle larger, production-scale data you can just add more computation nodes with no code change. In this way, you can quickly explore features in your local environment with a subset of data and then scale out as necessary on the cluster side for production.
One more important point I want to highlight is that dotData Py can be deployed as your own library. We have a dotData Py hosting solution in case you don't want to manage your environment. However, if your data is sensitive and you cannot use a SaaS solution, you can deploy dotData Py on your AWS, Azure, or even on-premise environment as a library in your Python ecosystem.
That's all for my presentation. Today I have presented automated feature engineering for machine learning and its key benefits. It accelerates your trial-and-error process of exploring new data sets and new features. With auto feature engineering, you can explore 10X more data and 100X more features in a matter of hours, and by expanding your feature space with AI-generated features, you can improve your model performance quickly. dotData Py is an auto feature engineering tool integrated into your Python workflow. It is very easy to deploy and get started with.