Video: Scale Your Data Science Process With Automation

Data scientist giving lecture on automation

Aaron Cheng, PhD, VP of Data Science Solutions, provides insight on scaling data science processes.

Video Transcript: Scale Your Data Science Process With Automation

So first of all, good morning, everybody. I hope you are enjoying this conference so far. My name is Aaron Cheng. I'm the Vice President of Data Science at dotData. Before I begin, I want to thank you for coming to my presentation and joining me in discussing this very important topic. I have prepared quite a bit of material today, so let's see how much I can actually cover.

During the past few months, I've had the opportunity to talk with many customers about their data science practice across different industries. From that perspective, I want to take today's opportunity to incorporate their feedback and share it with you. I would like to start this presentation by making a statement: data science has now become the core of many businesses. I don't think anyone in today's audience would object to that statement. The fact that we are gathered here today from different industries across the United States to discuss data science practice really speaks to the truth of it. Speaking for myself, before I became the Vice President of Data Science at dotData,

I was managing Accenture's data science practice, specifically on the West Coast. From that job, I really learned firsthand how important data science has become to many industries, really, to all industries. Based on what I have seen: I remember being invited to the office of the Chief Marketing Officer of one of the largest clothing retailers in the United States, in San Francisco, to discuss how we should be looking at his data and how we should be using machine learning technologies to analyze it, to help him drive up online sales performance. It was a very real use case for him. I was also personally leading a very large infrastructure data science project at the largest social media company in the world, in Menlo Park, California, to help them analyze their infrastructure data, improve data center operational efficiency, and reduce downtime.

And at my current job with dotData, I'm very privileged to be working with one of the top 15 financial institutions in the world, using our tool to analyze their customer data and understand customer behavior. That has enabled them to come up with more ideas to develop better products and better programs to target the customers they are interested in. All in all, if there is one thing I have no doubt about after 13 years in this data science business, it is that the impact of data science on today's business is real and very significant. It is something that no business leader can afford to miss.

On the other hand, we must acknowledge that data science is very difficult. Data science is not an easy subject, and there are many reports citing the numbers. Last year's Gartner report says 85% of data science projects will fail. Another figure comes from Dimensional Research: 96% of organizations run into issues with data science and machine learning initiatives. What this means is that only 4% of companies have successfully incorporated data science and machine learning into the core of their business. And to nobody's surprise, that 4% are the Googles and Amazons of the world. I also don't believe there is any doubt that without machine learning and without data science, these companies could not have achieved what they have achieved, or become who they are today, over their short history of the past 10 years. That is really the power of machine learning and data science.

Meanwhile, if we look at that number, 96% of companies, that is the vast majority of companies we are talking about. A lot of them probably have not even started; they have just started talking about data science, but they have not really started anything real in terms of collecting the data, discussing the use case, the technology stack, how they should visualize the data, or how they should run production on that data. They have no idea. Then there is a significant number of companies, which I have personally interacted with as well, that have started: they have brought on a few data scientists, tried their best to hire, probably planned for some data science projects, maybe even data science products, but they are nowhere close to harvesting the business impact that data science has promised them. They are constantly struggling with multiple issues. Is this the right data? Do we have enough data? What is the right technology infrastructure to invest in? Should we be doing Python? Should we be doing R? Should we be doing SAS? All of those different kinds of platforms.

So there are many real issues these companies are trying to figure out, but the key point is that data science is not easy. Let me explain, from my experience, why data science is not easy. There is a variety of reasons. The first one, in my view, is that the technology and knowledge base required to execute data science is very complex and very broad, and most importantly, it is fast evolving. Think about it: 15 years ago, nobody had conceived that we would be accumulating as much data as we are today. Another great example is the deep learning neural network. A few decades ago, when this idea was proposed, it was perceived as a paper exercise, because the amount of computational resources required to support that algorithm was beyond physical limitations just a few short decades ago.

But today, deep learning neural networks have become the business norm in many industries. That is the world we live in; everything is moving very fast. The issue here is that the fast-evolving nature of the data science field has made it close to impossible for any single organization, any single team, or any single data scientist to manage data science projects from an end-to-end perspective. And we have all worked at large companies at some point in our past careers; we all understand that working across teams is never as easy as it sounds. So that is the first part. The second part is that data scientists today are still a very scarce resource. Last October, LinkedIn published a report: at that point in time, there were 141,000 data science job vacancies in the United States. 141,000. So data scientists are very, very difficult to hire. There is a lot of demand, but not enough supply in the job market.

Those are simple truths. And the last reason is that the data science process is a complex process. We understand that data science is more than just building machine learning models. If data science were as simple as applying machine learning models, all the problems would be solved. Data science actually starts with defining the business problem; then you have to collect data and go through that data journey, cleansing the data and preparing the data, and all of that work is very manual. After that, you need to go through a process called feature engineering to generate feature hypotheses from the source data, to make the source data ready for machine learning. This feature engineering process, as we understand, takes a lot of effort talking with the domain experts. It is a very iterative, interactive process.

Building machine learning models, we understand, requires statisticians, mathematicians, and data scientists with enough technical knowledge. And then there is visualization. Nowadays a model is much more than linear regression. Linear regression we understand, but when you are talking about a deep learning neural network, or about XGBoost, what is the easier way to help you understand the model? That is why you need a visualization tool to help you understand the model. And the last part, which is also very important, is production deployment. A model you build in a Jupyter Notebook doesn't mean anything, because nobody can use it. That model needs to go into some part of your business process so that someone can run it on a daily or weekly basis to generate results that benefit the business team. That production part of data science is not easy. So this entire process, as we just talked about, is very complex. There are many different components involved, and people with very different skill sets: people with database skills, people with data engineering skills, statisticians, mathematicians, and people with computer science skills. We all know that software engineering and software architecture knowledge is also required in order to deliver one single data science project.

That is how complicated it is, if we are really talking about enterprise data science. Now I want to tell you: automation will change data science. And at the end of the presentation, I will show you specifically what automation can do for real data science use cases. First, let me explain why I think automation will change data science. There are four pillars to data science automation: acceleration, democratization, augmentation, and operationalization. Let me explain them one by one.

The first one is acceleration. What does this mean? As I explained earlier, a data science project typically takes a long time because the process is complex. The reason it takes a long time is, first of all, that a lot of manual work is required: someone needs to manually code up all the data engineering pipelines, someone needs to manually code up all the data quality check rules, and someone needs to manually call the Python libraries and the open-source models. Those are all manual processes. The other part is that it is very iterative: you need to talk to domain experts, you need to talk to people who have done this before to get their opinion. And we understand that people with different experience may have very different opinions, so the process becomes very iterative.

But if we apply data science automation, it becomes a little different, because the automation tool will run through all the different hypotheses, all the possible combinations, to come up with the best algorithm and the best data engineering process to help you build the best model. From that perspective, with automation technology we can iterate more, try many more things, and turn around and deliver results a lot faster. The next aspect is democratization. As I also explained earlier, data science projects are very complex and involve people with different types of skill sets: business analysts, data engineers, people with statistics, mathematics, computer science, and software architecture knowledge. You need people with a very different set of skills to participate in and contribute to a data science project.

But with automation, because automation does a lot of the heavy lifting behind the scenes, it will do a lot of the manual work for you; instead of you having to do it yourself, automation takes care of it. With this, we can really enable more people to participate in the process. Maybe a business analyst can run a machine learning model even though that analyst does not know exactly what the mathematical mechanism behind those exercises is; the automation solution can do that for them, so you can get them started. That is the point here, which is what we call democratization: enabling more people to participate and contribute. The next component is augmentation.


As I said earlier, the typical data science process involves a lot of conversation with domain experts. I have tremendous respect for domain experts; I work a lot with them. But I also understand that a lot of times, domain experts are only as good as their last mistake. By relying on their input to get started with a data science project, we are somewhat limited by what we already know. Later on, I will show you a case study: oftentimes, when we are presented with an enterprise data set, talking with a domain expert simply means ignoring a big part of it, because we don't know enough, and even the right domain expert doesn't have enough knowledge about that data.

So what you do is ignore that data. That is the process. But with automation it is different: automation will look at the data in a very unbiased, very impartial way. Because it is a machine, it does not differentiate column A from column B; it will just go through all of them one by one, running through all the combinations, all possible scenarios, to come up with the best feature representation to build the model. From that perspective, automation can actually extract a lot more information that was not known to us before. Automation can come up with a lot of insights. I am not saying they are universal truths, but they are data patterns, very deeply embedded data patterns that can be discovered, which were not known to the subject matter expert. That is another dimension of why automation is a good solution when we are trying to augment the manual approach: we can take these new insights, these new data patterns extracted by the automation solution, and combine them with whatever we are doing manually right now in order to build the best machine learning model.

And the last part is operationalization. This one is actually critically important. A lot of times, we all understand, especially having worked in very large companies before: a data science project is a big success on the data scientist's Jupyter Notebook, but when it goes to production, it becomes a failure. The code cannot be easily merged because it is not thoroughly tested, or maybe the code is not software-quality code at all. So when we come to operationalizing a data science project, there is a lot more to it than just building models.

There is a lot more than just building models, and this is where automation can really help. When we use an automation solution, then behind the scenes, for all the models the solution has explored, it automatically generates software-grade code. It is not manually written data scientist's code. And that code is oftentimes wrapped by an API, which can be directly deployed, instead of a data scientist manually writing code that has to go through all kinds of testing and all kinds of hurdles, with all kinds of issues. Automatically generated machine learning code wrapped by an API is actually very, very easy to deploy. That is where automation becomes very important in helping with data science operationalization.
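To make the "model wrapped by an API" idea concrete, here is a minimal sketch, not dotData's actual generated code, of a scoring function exposed as an HTTP prediction endpoint using only the Python standard library. The feature names and scoring weights are invented for illustration; a real deployment would load a trained model artifact instead.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for exported model code: a hand-written scoring function.
# (Feature names and weights are hypothetical.)
def predict_mortgage_interest(features):
    score = 0.8 * features["withdrawals_6m"] / 10000
    score += 0.5 if features["is_engineer"] else 0.0
    return min(score, 1.0)

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON feature payload, score it, and return JSON.
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"score": predict_mortgage_interest(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

# Serve on an ephemeral port and call the endpoint once.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

payload = json.dumps({"withdrawals_6m": 5000, "is_engineer": True}).encode()
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=payload, headers={"Content-Type": "application/json"},
)
resp = json.loads(urllib.request.urlopen(req).read())
print(resp["score"])
server.shutdown()
```

The point is that once the scoring logic is behind an API, the business process only needs to send features and read back a score; it never touches the notebook code.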

So these are the four pillars, the four aspects where automation can really help: acceleration, democratization, augmentation, and operationalization. All right, so in the last part, I'm going to talk a little bit about the company, dotData, and show you a very quick use case of how we can use the automation solution to solve a very specific problem. The company I represent, dotData, is based in the San Francisco Bay Area, in a town called San Mateo, about five miles south of San Francisco International Airport. We were recently named a leader in the Forrester New Wave. Forrester, one of the largest market research companies in the world, named us a leader in the automated machine learning space because they believe the feature engineering technology we have developed is really powerful: it can help customers analyze their data from a very different perspective and extract a lot of information that was not even known to them before. And this slide best explains what automation does to the traditional data science process. In the traditional process, which I just briefly explained, there is business problem identification; once you have collected data, you have to wrangle and prepare it; then you go through the feature engineering process, which is the process of combining data from different sources, bringing them together, and generating the feature representations; and then building machine learning models and going through the visualization step to help you understand the models and the features.

Sometimes there is a feedback loop in this process as well, and then you put the whole thing into production. This entire process takes many steps and many different resources, with different types of expertise, to complete. But with an automation solution, as you can see, we can shrink this entire process into something much, much smaller. Oftentimes a full project that would take a few months to complete takes, in our case, a matter of a few days, because, as I explained earlier, the heavy lifting, the manual work, is being automated. Specifically, inside our solution there are a few core components. The first one is called AI-powered feature engineering, which I will explain very briefly later; this feature engineering technology takes care of that entire process. Then we have automated machine learning, which automatically goes through a list of machine learning algorithms to identify the best algorithm and the best hyperparameters for that algorithm. We also have the dotData GUI, which supports all kinds of visualizations that customers are interested in, and the API: our machine learning models are wrapped by an API, and by calling that API you can easily deploy them.

Then lastly, let me go through this case study very quickly. Okay. What is the problem we want to solve? Going back to this process, first of all, we want to identify a problem. The problem is for a large bank: I have a lot of retail customers, savings, checking, whatever it is, and I want to analyze all the customer behavior to identify the customers to whom I can promote my new product, which is a mortgage product. If this is the case, this is the data I can possibly find: there is mortgage application information, historical information about who applied and who did not.

Then you have CRM data, the customer attributes that the customer's bank holds. There is an account master table with all the account information related to the customer, and web transaction information. Then there is balance history information, store master information, maybe some geolocation information. So this is the data we have. As you can see, the data is not in one table; it is distributed across different databases. Then the challenge is: how can we run machine learning models? We all know that this data, as it stands right now, cannot be directly consumed by any machine learning model. So we need to go through this process we call feature engineering, which is the process of taking the source data and generating what we call the feature representation. This can be fairly complex. In this case, a feature can be something like: someone who has withdrawn a lot of money over the past half year, and whose occupation is engineer. That is the type of feature representation we are talking about. And if you look at this feature carefully, it is not coming from any single source table; if you go to any single source table, the feature is not an attribute in any one of them. The feature is actually combined, aggregated information coming out of a few source tables; in this case, it is information coming from these four tables. So in order to generate this single feature, there is a lot of SQL work required. But that is not the only challenge.
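The kind of SQL work this takes can be sketched in a few lines. Here is a toy, self-contained example using SQLite; the table names, column names, and cutoff date are all invented for illustration, but the shape of the query, a join across source tables plus a time-windowed aggregation, is exactly what generating such a feature requires.

```python
import sqlite3

# Two hypothetical source tables: CRM attributes and transaction history.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE crm (customer_id INTEGER, occupation TEXT);
CREATE TABLE transactions (customer_id INTEGER, txn_date TEXT, kind TEXT, amount REAL);
INSERT INTO crm VALUES (1, 'engineer'), (2, 'teacher');
INSERT INTO transactions VALUES
  (1, '2019-04-10', 'withdrawal', 9000.0),
  (1, '2019-05-02', 'withdrawal', 8000.0),
  (2, '2019-05-15', 'withdrawal', 300.0);
""")

# One feature hypothesis per customer: total withdrawals over the past
# half year, combined with occupation from the CRM table.
rows = conn.execute("""
SELECT c.customer_id,
       COALESCE(SUM(t.amount), 0) AS withdrawals_6m,
       (c.occupation = 'engineer') AS is_engineer
FROM crm c
LEFT JOIN transactions t
  ON t.customer_id = c.customer_id
 AND t.kind = 'withdrawal'
 AND t.txn_date >= date('2019-06-30', '-6 months')
GROUP BY c.customer_id, c.occupation
ORDER BY c.customer_id
""").fetchall()
print(rows)  # [(1, 17000.0, 1), (2, 300.0, 0)]
```

And that is one feature, from two tables. Every additional hypothesis means another query like this.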

The bigger challenge is the scale: when you have data like this, with this much information, you have six tables and 500 columns. For this type of data, how many feature hypotheses are we talking about? How many do you think the domain expert can provide? And typically we have even more tables; as the data gets crazier, you need a lot more feature hypotheses. So without this type of automation solution, we are really just doing whatever the domain experts tell us, because it is so complex. That was feature one: someone who withdraws a lot of money over the past year, whose occupation is engineer. The next feature could be: someone who spends a lot of time on the web over the past three weeks and has received a direct email campaign. You can keep on going; typically you need hundreds to thousands of feature hypotheses. Then where is the brain that can provide all the feature hypotheses to be looked at?
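The combinatorics above are easy to see by enumerating hypotheses mechanically. In this hypothetical sketch, every hypothesis is a choice of column, aggregation, and time window; the names are invented, and real automated feature engineering explores a far richer space, but even this tiny grid shows how fast the count grows.

```python
import itertools

# Hypothetical building blocks of a feature hypothesis:
# which column to aggregate, how, and over what time window.
columns = ["withdrawal_amount", "web_visits", "balance"]
aggregations = ["sum", "mean", "max", "count"]
windows = ["30d", "90d", "180d", "365d"]

hypotheses = [
    f"{agg}({col}) over last {win}"
    for col, agg, win in itertools.product(columns, aggregations, windows)
]
print(len(hypotheses))  # 3 * 4 * 4 = 48, from just three columns
```

Scale the same grid to 500 columns and you are at 8,000 hypotheses before you even consider cross-table combinations; no domain expert can enumerate that by hand.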

So this is the challenge, and this is where automated feature engineering becomes very important: it will automatically explore the data, extract the information, and come up with the relevant feature hypotheses, which can be used to directly generate the model. The last slide is fairly straightforward: why do we need automated machine learning? Because for a single problem, for example a single classification problem, there are many different types of algorithms one would want to try.

And for each algorithm, there are many different hyperparameters and different processes that are required. And one more thing is very important: parallelization. Nowadays the data we are looking at is typically very big, and some algorithms take a very long time to execute. If you run them sequentially, it will take forever to get results. That is why you need to parallelize these machine learning algorithms, and given that some of them come from one library and others from a different library, there are a lot of hassles you need to go through in order to parallelize. So those are the great challenges we face when we are talking about automating machine learning.
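The core loop of automated machine learning, try every algorithm and hyperparameter combination in parallel, then keep the best, can be sketched with the standard library alone. This is not dotData's engine; the algorithm names and the scoring function below are stand-ins (real code would run cross-validated training in place of `evaluate`).

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Stand-in for "train a model and return a validation score".
# The fixed scores make the example deterministic; in practice this
# would be cross-validated accuracy, AUC, etc.
def evaluate(algorithm, params):
    base = {"logreg": 0.80, "tree": 0.84, "boosting": 0.88}[algorithm]
    return base - 0.01 * abs(params["depth"] - 5)

# The search space: every algorithm/hyperparameter combination.
search_space = [
    (algo, {"depth": depth})
    for algo, depth in itertools.product(["logreg", "tree", "boosting"], [3, 5, 8])
]

# Evaluate all candidates in parallel and keep the best scorer.
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(lambda job: (evaluate(*job), job), search_space))

best_score, (best_algo, best_params) = max(scores, key=lambda s: s[0])
print(best_algo, best_params, best_score)  # boosting {'depth': 5} 0.88
```

Swapping `ThreadPoolExecutor` for a process pool or a cluster scheduler changes where the work runs, but not the shape of the search.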

But with an automation solution, all of this can be taken care of. So I think I'm going to stop here; I am four minutes over time now. There is a lot more I could talk about, but again, the point of this discussion with you today is to introduce this data science automation concept to you, so that when you go back to your daily life, you can think about whether this is something that is relevant to your needs. If that is the case, feel free to go to our website, get the information, and shoot us an email if there is any interest.

That's all. Thank you.