Video: How a BI Analyst Beat Data Scientists in 6 Hours

How a BI guy beat a bunch of data scientists at their own game

Join our on-demand webinar and see how, using dotData’s AI Automation platform, we built a predictive churn model in less than 6 hours with just one person and still finished in the top 2% of a global competition of over 500 teams.

Join our webinar to learn:

  • What the data science process looks like, and why it takes so long;
  • The power of automation, including a live demo of our latest AutoML 2.0 platform;
  • How to build a predictive churn model in less than 6 hours.

Video Transcript: How a BI Analyst Beat Data Scientists in 6 Hours

Hi, everyone. Thank you for joining our webinar. I’m Hiroshi, a customer data scientist at dotData. Today’s focus is on how a BI professional can beat most data scientists by leveraging dotData’s software.

To best illustrate this, let me introduce one of the most popular use cases, churn prediction, and explain why it matters to enterprise organizations. Second, I will show how dotData helps you overcome the problems you might face, and explain why dotData does what it does. There are a couple of tough, typical challenges when you work on this problem. We also found a best-fit example in a world-class data science competition, so we can show the exact result you can get when you apply dotData, and you will see how strong our performance is.

Without manual work or coding, our result ranked in the top 2% and beat most of the data scientists. Then, at the end of this session, I will open our product and walk you through the steps needed to achieve that performance.

So let’s look at the use case itself. Customer churn prediction typically means predicting and identifying which customers are more likely to churn, that is, stop using your services for a while, such as 30 days. If you can predict such behavior in advance, you can take proactive actions, such as sending discounts, launching new features, or inviting customers to a closed user-group session to re-engage with them. Retaining existing customers is easier and more cost-effective than acquiring new ones.

That is why customer churn prediction is one of the most popular use cases. For example, think of a recurring subscription service like Spotify or Netflix. On the left-hand side, Customer A renews on a monthly basis. On the other hand, Customer B renews twice but does not like the service and then stops using it for 30 days or longer.
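
To make that 30-day rule concrete, here is a minimal pandas sketch of how a churn label could be derived from activity dates; the column names and dates are illustrative assumptions, not dotData output.

    import pandas as pd

    # Illustrative data: last recorded activity per customer (assumed columns).
    usage = pd.DataFrame({
        "customer_id": ["A", "B"],
        "last_activity_date": pd.to_datetime(["2017-02-20", "2017-01-05"]),
    })

    snapshot_date = pd.Timestamp("2017-02-28")  # the date on which we evaluate churn

    # Label a customer as churned after more than 30 days without any activity.
    usage["days_inactive"] = (snapshot_date - usage["last_activity_date"]).dt.days
    usage["is_churn"] = usage["days_inactive"] > 30
    print(usage)  # Customer A stays active; Customer B is flagged as churned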

That is the kind of customer we mark as a churned customer. Now, if we are the data scientists who need to predict this behavior, what kind of patterns can we come up with? What is the common behavior within each of these two groups? One of the simplest ideas is to look at the attributes or demographics of these customers: any differences in gender, age, location, or occupation. The second element, if you have information about financial transactions, is to look at how they pay, annually or monthly, how much they pay, and whether they pay by credit card or cash, and so on.

The last element is usage or operational logs about these customers from the product: for example, frequency of usage in a month, how many hours they are actively using it, what their favorite service or product is, and so on. By looking at these, you might be able to capture the patterns common to churned and non-churned customers. However, along with a lot of hypotheses, there are a lot of challenges. I just listed a few ideas, but even for a given hypothesis you can look at the past one month, two months, or years, or even down to minutes and hours. What is the right time resolution? What is the right time range? We don’t know.

So we need to test through a lot of trial and error. The second challenge is that when you are working on this kind of prediction, there is a risk you might introduce data that is not available at the moment you actually make a prediction; this is called data leakage in the machine learning world. If you feed in that kind of wrong information, the machine learning model will be faulty.

The third element is that usage or operational data tends to be high-frequency and high-volume. Just handling that volume requires a lot of processing and simply takes more time. You also need to combine data of different frequencies: monthly financial transactions on one side and, on the other, a high-frequency log that can produce several records per second.
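
As a rough illustration of that combination step (not dotData’s internals), rolling a high-frequency usage log up to one row per customer before joining it to a lower-frequency transaction table might look like the following pandas sketch; all table and column names here are assumptions.

    import pandas as pd

    # Assumed high-frequency usage log: one row per customer per listening session.
    user_log = pd.DataFrame({
        "customer_id": ["A", "A", "B"],
        "date": pd.to_datetime(["2017-02-01", "2017-02-15", "2017-01-03"]),
        "total_secs": [3600, 1800, 600],
    })

    # Assumed lower-frequency transaction table: one row per customer per payment.
    transactions = pd.DataFrame({
        "customer_id": ["A", "B"],
        "payment_plan_days": [30, 30],
        "amount_paid": [9.99, 9.99],
    })

    # Aggregate the log to customer level over the 30 days before a cutoff date.
    cutoff = pd.Timestamp("2017-02-28")
    recent = user_log[user_log["date"] >= cutoff - pd.Timedelta(days=30)]
    usage_features = recent.groupby("customer_id", as_index=False).agg(
        sessions=("date", "count"),
        listening_secs=("total_secs", "sum"),
    )

    # Join the aggregated usage features onto the transaction attributes.
    features = transactions.merge(usage_features, on="customer_id", how="left").fillna(0)
    print(features)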

These are challenges most of our customers have also faced when working on customer churn problems, and this is where dotData helps you overcome them. To illustrate how, let me start with the traditional data science process. We have defined the problem and identified the sub-challenges; after that, this is the kind of process you need to go through.

First, you need to identify the data: what data can we use? Then you profile each data source and start to combine the data so that you can apply some hypotheses. By hypotheses I mean ideas like the ones I listed, for example the average usage for a given customer. Based on such an idea, you implement it as code and compute it. Once you have a large set of features, you also need to select among them; otherwise you face another challenge, which is called overfitting.

Once we have a good set of features, it is time to feed them into machine learning. Once you have trained the models and tuned the hyperparameters for each one, you might also visualize and report the outcome. Then, once you are comfortable with your best model, you put it into production. So this is the entire journey, from data collection to production. With dotData, we automate that entire journey. We work directly with the source data, and we automate the profiling and preparation of the data, the joining of tables, and the feature engineering pieces. Of course, we also automate machine learning and hyperparameter tuning through to production.

Specifically for this problem, customer churn prediction, I want to highlight two key capabilities. As I said, we have automatic feature engineering, but we also have a special feature engineering piece that considers time: we explore different time resolutions, from seconds and minutes up to years, and we also try different time ranges, from when to when, slicing and dicing the time element and applying filters. If you have that many combinations, you need to explore a lot more of them, and for that you need the right resources and infrastructure. dotData comes with a distributed system, so we already have a scalable and efficient computation mechanism, and you do not have to take care of the infrastructure yourself.

As I mentioned, another risk is data leakage, and we have mechanisms to prevent it. We only look at the data available for a given sample on a given date. So if you look at one specific sample, one customer on a given date, they are labeled as churned or not churned, and we want to build the model looking only at the past data. Each customer has different timing for whether they churned or not, so you need to fetch the right data according to each prediction date. In some settings you also have a prediction lead time, which means you cannot access the most recent data when you are making a prediction.

For example, one of our customers only has the data prepared and ready one week after it is actually collected, simply because they gather it from across divisions and branches and have a process to centralize it. In that case, you cannot access the data immediately. We have a mechanism to take that kind of blank or gap time into account, so all of these settings are designed for making predictions about the future.
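
In hand-written form, honoring such a gap might look like the minimal pandas sketch below; the one-week lead time, one-year lookback, and column names are assumptions for illustration, not dotData’s actual mechanism.

    import pandas as pd

    def leakage_safe_slice(source, prediction_date, lead_time_days=7, lookback_days=365):
        """Keep only the rows that would actually be available at prediction time.

        Rows newer than (prediction_date - lead time) are dropped because the data
        is only centralized with a delay; rows older than the lookback window are
        dropped to bound the search range.
        """
        latest_usable = prediction_date - pd.Timedelta(days=lead_time_days)
        earliest_usable = prediction_date - pd.Timedelta(days=lookback_days)
        mask = (source["date"] > earliest_usable) & (source["date"] <= latest_usable)
        return source[mask]

    # A prediction made on 2017-03-01 may only use data up to 2017-02-22.
    log = pd.DataFrame({
        "customer_id": ["A", "A", "A"],
        "date": pd.to_datetime(["2016-01-15", "2017-02-20", "2017-02-27"]),
        "total_secs": [500, 1200, 900],
    })
    print(leakage_safe_slice(log, pd.Timestamp("2017-03-01")))  # keeps only the middle row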

Let me also explain exactly what you have to do, because the process you go through on your side of dotData is very simple. First, you prepare the data and tell dotData the location of the database. We have a lot of data connectors, so you can simply connect, point to the locations, and fetch the data from your databases. The second step is to define the problem: you tell dotData what problem you want to solve, which column stores the historical behavior of the churned customers, and which tables you want to use for feature generation. The last element you need to provide is the relationship between the target behavior and the other tables. Once that is provided, you just wait and then review the result.
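
As a purely hypothetical sketch, for illustration only and not dotData’s actual API or file format, those three inputs amount to something like the following problem definition.

    # Hypothetical problem definition, for illustration only; the real product
    # collects the same information through its UI rather than a dictionary.
    problem_definition = {
        "data_source": {
            "connector": "postgresql",       # where to fetch the tables from
            "schema": "kkbox",
        },
        "target": {
            "table": "churn_history",        # table holding the historical behavior
            "label_column": "is_churn",      # the column to predict
            "id_column": "msno",             # unique customer identifier
            "prediction_date_column": "date",
        },
        "source_tables": ["members", "transactions", "user_logs"],
        "relationships": [
            # how each source table links back to the target table
            {"source": "members", "join_key": "msno"},
            {"source": "transactions", "join_key": "msno"},
            {"source": "user_logs", "join_key": "msno"},
        ],
    }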

In the middle of this process, what dotData does behind the scenes is as follows. For data engineering, we normalize and join the tables, profile the columns, and run quality checks. For feature engineering, we generate time-based aggregations, trying different time resolutions and time ranges as I mentioned, and we perform outlier filtering, normalization, and missing-value imputation. Last but not least, we run the machine learning modeling process along with hyperparameter optimization.
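
Written by hand, the modeling half of that work (imputation, normalization, model training, and hyperparameter search) might look roughly like the scikit-learn sketch below; it is a simplified stand-in for what the platform automates, with assumed feature names, not its actual implementation.

    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    numeric_features = ["sessions", "listening_secs", "amount_paid"]  # assumed names

    preprocess = ColumnTransformer([
        ("numeric", Pipeline([
            ("impute", SimpleImputer(strategy="median")),  # missing-value imputation
            ("scale", StandardScaler()),                   # normalization
        ]), numeric_features),
    ])

    model = Pipeline([
        ("preprocess", preprocess),
        ("classifier", GradientBoostingClassifier()),
    ])

    # A small hyperparameter search; the automated platform explores far more combinations.
    search = GridSearchCV(
        model,
        param_grid={"classifier__n_estimators": [100, 300], "classifier__max_depth": [3, 5]},
        scoring="accuracy",
        cv=3,
    )
    # search.fit(X_train, y_train) would run the full tuning loop on the prepared features.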

Once that is done, you can review the models. By the way, in one of the most popular data science competitions, we found a best-fit example of this customer churn prediction problem, so let me show it briefly. The competition is the KKbox churn prediction challenge, and it provides the following dataset. There is a churn history table with about 1 million customer records. We are also given some relevant information: a membership table containing demographic information; a transaction history, which is the financial data, how they pay and what amounts they paid; and the last piece, the user log information. The competition was run by a music streaming company, so the log records which songs customers listened to, how many hours they listened, and so on. Note the volume of this data: the usage log alone has more than 400 million records, more than 30 gigabytes in total. Just processing this data and joining the tables takes a lot of time, and if it involves trial and error, you can imagine how complex the process becomes.

With this setting, and a standard computational configuration, we generated 500 models, each tuned in terms of hyperparameters, and we found more than 10,000 interpretable features combining these time series. Note that once you kick off the process, you can see the modeling performance and the features themselves within six hours, without doing a lot of manual data exploration. This is a sample dashboard that we generate, and on the right-hand side are the key features. With this model, if we submit the predictions, let’s see how well we can score.

Among 575 teams, we actually ranked 12th. That is the top two percent, without a lot of manual feature engineering, manual data exploration, or complex supervision. In the product demo, I will also walk you through how easy it is to configure our product.

So let me go to the product. By the way, this is actually a newly released version; we launched it this July. What I’m going to show today is basically three steps.

  1. Step one is data preparation: you can simply drag and drop your local dataset or connect to your databases.
  2. Step two is defining the problem: what kind of problem you want to solve, and which tables and columns hold what you are predicting.
  3. The last step, once you kick off the modeling process, is to wait a few hours and then come back to review how well your model is performing.

These are the simple steps I’m going to show today, and now I’d like to show you how we can actually execute this KKbox use case. This is the login screen for dotData Enterprise, which is hosted in our environment, so let’s log in. This is the landing page, where you can see a lot of projects, each associated with a set of datasets.

Again, I will walk through basically three steps. The first step is data preparation: we tell dotData where the data is and what data we want to use to solve the problem. The second step is defining the problem: we tell dotData what kind of problem we want to solve and which column in which table stores the values we are predicting. After you define the problem, dotData kicks off the modeling process behind the scenes: it automatically processes and prepares your data, comes up with a lot of hypotheses for features, computes and selects features, builds the models, and tunes the hyperparameters. There is a lot happening behind the scenes, and you can just sit and wait for a few hours and then come back. The third step is reviewing the models that dotData explores and develops.

So let’s go through these steps. The first step is data preparation, so let’s see how we can prepare the data. There are two ways: one is to ingest the data from your database, and the other is to simply drag and drop a CSV from your local laptop.

You can bring in your CSV or register your databases, and you can see what data is available in them. You just click to start the import, and the same goes for CSVs. Here I have already ingested the data: we have three tables under the same schema that store the customer churn behavior of the users, and then there are three other tables, membership attributes, transactions for the financial information, and the user log. These are the tables we will use for this analysis.

The second step is defining the problem, so let’s see how we configure it. This problem falls into the classification category, since we are predicting whether a customer churns or not, and we need to provide a few pieces of information to define it. These are the steps we will go through. The first piece of information is the target schema: we have four different schemas, and we pick the one containing the churn history columns. The second is the target column, churn, which stores the yes or no value. The third is the auxiliary columns, which are basically unique identifiers: msno is the unique customer identifier, and then there is a date.

Given this date and the customer ID, we can identify each sample. Since we are predicting as of some date, we also need to provide the date on which the prediction will be executed. This is the last piece of information for this step.

Now, this is an interesting concept in dotData called the prediction execution time. When you predict, for example, weekly sales for next week, there is a one-week gap, and you do not have access to the data from today to next week. So we can set up one week here, and dotData automatically considers that gap when training the models; this is one mechanism for avoiding the data leakage I mentioned. Now that we have defined the problem, we also need to configure which tables we use under this problem definition. The first piece of information is the target tables. Under the same target schema we can have multiple tables, for example one table from February 2017, so we will train the model based on this February data. Then we select the source tables used to generate features: we choose member attributes, financial transactions, and the user log as the source data. Now that we have selected the source tables, we want to configure the entity relationships. This is the interesting part again: we can tell dotData what the relationship is between the target and each source.

We can tell it which column connects these tables, and we can also tell it how far back in time to search. For example, if you want to search one year of information in the member attributes table, we can set the range from minus one year up to the date we are predicting, and we can configure the same thing for the other tables. Once that is done, you just hit the run model design task, and that’s it; that is everything you need to configure here. In the interest of time, we have a pre-executed result, so let’s look at it.

Once the execution is done, you can see what kinds of models dotData developed and what their scores are. We provide different evaluation metrics, so you can select a model based on your problem. Depending on the problem, you have different interests: you might want a higher-precision model or a higher-recall model.

For this churn prediction problem, we want to select based on the classification error. We can also look at the details: the confusion matrix and the performance summary of the model. Once you are confident in the model, you can make a prediction with it and select which table of data to use for that prediction. In the case of the competition, you select the test data; based on it, we make a prediction, the churn probability for each customer, and then we can submit that prediction.
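
Those review metrics are the same quantities you would compute by hand; for instance, with scikit-learn on a toy set of labels and predictions:

    from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

    # Toy labels and predictions, just to show how the metrics compare.
    y_true = [1, 0, 0, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

    print(confusion_matrix(y_true, y_pred))              # rows: actual, columns: predicted
    print("classification error:", 1 - accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall:", recall_score(y_true, y_pred))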

So this is the whole process you need to configure in dotData, and with it we achieved the top two percent in one of the top data science competitions. To summarize, today I introduced dotData and how we can help solve customer churn prediction. There are challenges customers typically face, such as manual feature engineering and the risk of data leakage, and with our automation we accelerate and simplify that workflow without a lot of hassle. As you can see, with simple operations you can achieve very high accuracy and beat most data scientists. To achieve that, all you need is a few clicks, and you don’t need any coding, so BI analysts can get the same results as experienced data scientists. That is today’s content. Thank you for listening, and I hope you enjoyed it.