Video: Ai4 Finance – AutoML & Beyond, Part 1


From the 2019 Ai4 Finance New York conference, Ryohei Fujimaki (PhD, CEO of dotData) discusses AutoML in the financial industry. Part 1 of 2.

Video Transcript: AI4 Finance: AutoML & Beyond – Part 1

[Question] Which best describes your organization’s use of Machine Learning?

Only 3% answered that they are using machine learning across the business.

This 3% corresponds to level 4. 20% are using machine learning in some areas of the business, which corresponds to level 3. More than 60% – almost two-thirds – of CIOs answered that they are still in level 1 or level 2.

Basically, people are saying that it will take three to five years to shift from levels 1-2 to levels 3-4. So these are the current stats for how much the industry is using machine learning. People are starting to use machine learning, but adoption is not yet good enough.

We are here today to help accelerate this movement to adopt AI, machine learning, and data science in your business.

[Question] Upfront project preparation and domain expertise

In machine learning and data science projects, there is a lot of work to be done before modeling, and that upfront effort is very challenging.

Also, domain expertise is typically one of the keys to the success of a data science project. Business people are very busy; we may ask them a question and the answer may come back after a week, so the project cannot move as fast as we expect.

[Question] How is AutoML progressing in the market?

Many people are interested in it but have not yet implemented it. Some are doing an evaluation or pilot. And some people – about 20% – don't know what AutoML is, so let me talk a little bit about what AutoML is for this audience today.

What is AutoML? This is my understanding of the basic concept. Machine learning basically requires a flat table as input, but a flat table does not necessarily mean the data is ready for machine learning, or that it will guarantee good accuracy. So there is feature cleansing, which means missing-value imputation, outlier filtering, and data normalization or standardization. This kind of cleansing requires a certain amount of effort.
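The cleansing steps named above can be sketched with scikit-learn; this is a minimal illustration on made-up columns, not dotData's implementation.

```python
# A minimal sketch of feature cleansing: missing-value imputation,
# outlier filtering, and standardization. Column names and values
# are invented for illustration.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age":    [25, 40, np.nan, 33, 120],           # 120 looks suspicious
    "income": [50_000, 80_000, 60_000, np.nan, 75_000],
})

# 1. Missing-value imputation (median is robust to outliers)
imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df),
    columns=df.columns,
)

# 2. Outlier filtering: drop rows outside 3 standard deviations
z = (imputed - imputed.mean()) / imputed.std()
filtered = imputed[(z.abs() <= 3).all(axis=1)]

# 3. Standardization to zero mean, unit variance
cleaned = pd.DataFrame(
    StandardScaler().fit_transform(filtered),
    columns=filtered.columns,
)
```

An AutoML tool would try several such strategies (mean vs. median imputation, different outlier rules) automatically and keep whichever validates best.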

Feature selection, dimension reduction – when your original table has hundreds or even thousands of columns, you have to reduce the dimensionality. Or sometimes you want to boost the number of columns. One-hot encoding is a very simple method to transform a categorical variable into a binary vector. Binomial transformation generates non-linear features of the A+B, A*B, or A/B type. Temporal aggregation is a standard way of handling time-series data, such as moving averages.
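The three transformations mentioned – one-hot encoding, binomial combinations, and temporal aggregation – can each be shown in a few lines of pandas. Column names and values here are assumptions for the sake of the example.

```python
# Illustrative sketch of one-hot encoding, "binomial" arithmetic
# combinations (A+B, A*B, A/B), and a moving-average temporal aggregation.
import pandas as pd

df = pd.DataFrame({
    "segment": ["retail", "private", "retail", "corporate"],
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 2.0, 4.0, 8.0],
})

# One-hot encoding: categorical column -> binary vector per category
one_hot = pd.get_dummies(df["segment"], prefix="segment")

# Binomial transformations to generate non-linear features
df["A_plus_B"]  = df["A"] + df["B"]
df["A_times_B"] = df["A"] * df["B"]
df["A_div_B"]   = df["A"] / df["B"]

# Temporal aggregation: 3-step moving average over a time-ordered series
series = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0])
moving_avg = series.rolling(window=3).mean()
```

An AutoML system generates many such candidate columns mechanically and lets feature selection prune them afterwards.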

And then once you've cleaned up the feature table, machine learning comes next. You have to select an algorithm. There are different types of algorithms, like boosting or neural networks; these are typically the most accurate in most applications. On the other hand, they are highly non-linear and complex. Financial applications in particular often prefer transparent models, like a linear model or a decision tree.

For hyperparameter search: a machine learning algorithm is not just a box; you have to carefully tune its parameters with training data. [unintelligible] There are a lot of ways to automatically search hyperparameters. And of course, you have to validate your accuracy. Accuracy is not just a single measure; there are a lot of different aspects to it. You have to validate the model from many different aspects and choose the model that satisfies your business requirements.
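One common way to automate the hyperparameter search and validation described above is cross-validated grid search; this is a generic scikit-learn sketch on synthetic data, with an illustrative parameter grid rather than a recommendation.

```python
# Automated hyperparameter search with 5-fold cross-validation.
# The dataset is synthetic and the parameter grid is illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # regularization strength
    cv=5,                  # 5-fold cross-validation
    scoring="roc_auc",     # accuracy is not the only aspect worth checking
)
search.fit(X, y)

best_C = search.best_params_["C"]
best_score = search.best_score_
```

In practice you would validate the winning model on more than one metric (ROC-AUC, precision/recall, calibration) before accepting it for a business requirement.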

This is not everything, but it is the major part of what AutoML can automate for you. You can try different types of cleansing methods and different types of machine learning methods automatically, using distributed computing, then assess model accuracy and produce the best one to use.

So this is the basic concept of AutoML. This is the picture: you have a business case, you have the problem, then machine learning, visualization, and then production.

[Question] AutoML can shorten the time to producing an acceptable model – is this true?

In our experience, this is not necessarily true in many projects. There is still a very big gap between your business use case and machine learning. So what is this gap? Basically, it is the work around the data. Machine learning is a statistical and mathematical method, but even so – you are in financial services, and I believe you are storing data in a very good manner, because you are regulated.

But still, even if that's the case, there is a lot of customization work required to make your data available for machine learning.

Collecting data, or what we call "Last-Mile ETL" – we distinguish this from Master Data Management – every analytics use case needs customized ETL work. And also with feature engineering, you have to generate a lot of hypotheses about how to combine different types of tables to come up with the input for machine learning.

So, AutoML is great, but in our experience there is still a lot of work you have to do before machine learning. Here is a slightly more conceptual figure of the gap between AutoML and real data.

On the left-hand side is what we usually observe in the enterprise. There are different kinds of data: maybe customer online banking data, transactional data, basic customer demographic information – there is a lot of data around the customer. This data is never a flat, single aggregated table, and in every machine learning use case you need a lot of aggregations and combinations to come up with the feature table.

In the poll, I think a third of the people in this group said that the high cost of domain expertise is a key pain. This is typically the stage where you have to put domain expertise in first: you have to understand what these tables are – sometimes you have hundreds or thousands of columns – and, for any given use case, what patterns you have to generate. So this is the part where you have to invoke your domain expertise to come up with a feature table, and it is one of the biggest gaps we are seeing beyond AutoML.

In addition to complex relational data, there are different types of data. It is not the case that everything is numeric or categorical, which would be fairly easy to handle. Data can have a relational structure: normalized data, star schema, or OLAP. And what we see is that data architecture is not necessarily very clean in the enterprise. It is called structured data, but it is typically not well structured – it is structured, just not in a good way.

And with all the transactional data, of course, all enterprise data is eventually temporal – there is a timestamp.

Recently, people have become interested in geolocation data, or even geo-temporal data, to understand the spatial behavior of customers or categories.

Text data is also very popular in the enterprise. For example, you have a call center with voice data from customers that is transcribed into text. How can you extract meaning from it?

So this is mostly the process before machine learning, and this is the very big gap between your data, your problem, and AutoML.

Let me get a bit more concrete with a case. This is a very simple case study. In a bank, there are customers who have a checking account and a savings account, and you want to identify which customers are going to sign up for a home mortgage loan. This is a very typical upsell/cross-sell case.

Then, this is your data. This is a very simplified model: an actual project has tens of tables; in this case there are only six.

There is one table called the target table – mortgage application – and this table holds what you want to predict: customer ID and application flag, which is one if the customer signed up and zero if not.

Then there are five other tables: a customer attributes table with basic customer information; web transaction tables, since there is online banking; and, because one customer can have multiple accounts with the bank, an account master, where each account is associated with a balance history. This is the type of data we typically handle in the enterprise.

And maybe some data is just exported from CRM systems in CSV format. The transactional data is typically stored in Hadoop or other big data storage – maybe in Hive. Some traditional data is stored in Oracle or another relational database system. So data can be stored in different places.

So in this case, first of all, to start AutoML you have to first process this data, transform it, and come up with a flat single table.
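As a sketch of that first processing step, here is how the example's tables (the mortgage application target, customer attributes, account master, and balance history) might be joined and aggregated into one flat table per customer. All table contents and column names are invented for illustration.

```python
# Hypothetical sketch: collapse a multi-table schema into the single
# flat table that AutoML expects, one row per customer.
import pandas as pd

target = pd.DataFrame({"customer_id": [1, 2, 3],
                       "application_flag": [1, 0, 1]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "age": [34, 52, 29],
                          "income": [70_000, 120_000, 55_000]})
accounts = pd.DataFrame({"account_id": [10, 11, 12, 13],
                         "customer_id": [1, 1, 2, 3]})
balances = pd.DataFrame({"account_id": [10, 10, 11, 12, 13],
                         "balance": [500.0, 700.0, 300.0, 9_000.0, 150.0]})

# Aggregate the transactional tables up to the customer grain
acct_balance = balances.merge(accounts, on="account_id")
per_customer = acct_balance.groupby("customer_id").agg(
    n_accounts=("account_id", "nunique"),
    avg_balance=("balance", "mean"),
).reset_index()

# Join everything into the single flat table
flat = (target
        .merge(customers, on="customer_id", how="left")
        .merge(per_customer, on="customer_id", how="left"))
```

A real project repeats this kind of join-and-aggregate step for every candidate feature hypothesis, which is exactly where the manual effort accumulates.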

[Question] What are features, actually?

There are features we can create after aggregation and features we have to create before aggregation. Features after aggregation usually look like this: the customer attributes table is usually a flat table, with ID, name, gender, age, income, number of family members, and so on and so forth.

So we can apply a certain mathematical transformation – say, age times the exponential of income, divided by the number of family members. This kind of non-linear transformation is sometimes very useful for improving accuracy.

Why does the data engineering process take time? It is mainly the feature generation before aggregation – actually aggregating the data. For example, the feature on the right-hand side is: a customer who has withdrawn a lot of money over the past half year and whose occupation is engineer. This is an engineer who spent a lot of money recently – maybe a young engineer who is not buying a home – so this is the type of customer who will not sign up for a home mortgage loan.
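That "before aggregation" feature can be sketched in pandas; the table names, the six-month window cutoff, and the 10,000 withdrawal threshold are all assumptions made for this illustration.

```python
# Hedged sketch of the feature described above: flag customers who
# withdrew a lot in the past half year AND whose occupation is engineer.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "occupation": ["engineer", "teacher", "engineer"],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "type":   ["withdrawal", "withdrawal", "withdrawal", "deposit"],
    "amount": [8_000.0, 5_000.0, 2_000.0, 9_000.0],
    "date":   pd.to_datetime(["2019-06-01", "2019-07-15",
                              "2019-07-20", "2019-07-01"]),
})

# Sum withdrawals in the 6 months before an assumed reference date
as_of = pd.Timestamp("2019-08-01")
recent = transactions[
    (transactions["type"] == "withdrawal")
    & (transactions["date"] >= as_of - pd.DateOffset(months=6))
]
withdrawn = (recent.groupby("customer_id")["amount"].sum()
             .rename("withdrawn_6m").reset_index())

# Combine with the attribute table and apply both conditions
feature = (customers.merge(withdrawn, on="customer_id", how="left")
           .fillna({"withdrawn_6m": 0.0}))
feature["big_withdrawal_engineer"] = (
    (feature["withdrawn_6m"] > 10_000)
    & (feature["occupation"] == "engineer")
).astype(int)
```

Even this toy version needs a time-window filter, a group-by aggregation, and a join across tables – which is why these features are expensive to hand-write at scale.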

So these are the types of features we have to generate before machine learning, but these features are actually not very easy to create.

For example, this is how we would generate the feature from these four tables. You don't need to read the whole query, or validate whether it is correct, but I want you to get a sense of how complex it is. We are talking about a simple feature – you could compute it with one line of Python code, as long as you already have the data in the right shape. But even if you have the data, generating this type of feature before aggregation, to come up with an input for machine learning, requires a certain amount of work – and this is just feature one. In many projects you have to generate features 2, 3, 4, all the way up to 200.

Actually, this is feature two – you have to write a very different type of query. This is actually why we started this project. I had a project in New Zealand back in 2000, a churn prediction project with a very large telecom customer. In that project, my data scientist and I – the two of us – wrote 3,400 queries to generate features from more than 10 tables. That work itself took months, and at that time we believed this is something we have to help with, because a lot of people have the same pain before the machine learning process.

So overall, beyond machine learning, we strongly believe there is a strong need to accelerate this feature engineering and data engineering process. Here are some examples – for instance, the data formatting issues. By the way, I'm not saying we address all of these issues, but we are solving quite a good amount of them.

I'm listing all the challenges here. The data formatting issue is a raw data cleansing issue, which is different from machine learning data cleansing. Raw data cleansing is not about missing values or outliers; it is, for example, where the value itself is not well formatted, or values need to be normalized – uppercase versus lowercase inconsistencies, these kinds of issues. The value itself is dirty. Then there is data integration and manipulation. As I showed on a previous slide, data can come from different sources, and in that case the metadata is different. Oracle, PostgreSQL, CSV – their metadata is all very different, so to handle them in a unified way we need certain data integration work.
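The case-inconsistency example of raw data cleansing can be shown in two lines; the values here are invented, and this is only the simplest instance of the problem.

```python
# Raw-data cleansing, as distinct from ML cleansing: the values
# themselves are dirty, so the same category appears under several
# spellings (case and whitespace differences).
import pandas as pd

raw = pd.Series(["Engineer", "engineer ", " ENGINEER", "Teacher", "teacher"])

# Normalize case and strip whitespace so equal values compare equal
clean = raw.str.strip().str.lower()
```

Without this step, a later one-hot encoding or group-by would treat "Engineer" and " ENGINEER" as different categories and silently fragment the feature.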

Coming to the feature engineering part: we have to do data re-architecting in many cases, because, as I said, the data is not well structured and it is not easy to even join the tables. Hypothesis search is the most difficult part, because it basically relies on domain expertise.

So how can we automate this part without a lot of intervention by domain experts? Combination, aggregation, segmentation – there are a lot of ways to do that. And query generation is basically just generating this type of query, but it's not that simple. If you know data engineering, you know about data sparseness issues, and the data scale is also very different. By the time it comes to machine learning, the data is already aggregated, so it is fairly small; but if we are talking about raw tables, maybe we are talking about billions of records, or even tens of billions of records, in one transactional table.

So the data scale issue – or data skewness – is a very well-known issue in the data engineering world. Just writing the query itself is not an easy job here, because the data is so complex and so large. And of course, which hypothesis is good? Because there are a huge number of hypotheses, we have to validate and select among them.

Automated feature engineering, in our view, is to help you automate this process and accelerate the entire data science flow.

So, basically, this is our product offering on one slide. We have the AutoML component, but our unique strength – the unique component that I want you to try – is AI-Powered Feature Engineering. We tackled this fundamental problem of understanding the data and the feature engineering part, and achieved quite significant success in automating a good amount of the work before machine learning. I don't say we automate everything, but we automate a lot of the work before machine learning.

Continued in part 2