Video: Why Do 85% of Data Science Projects Fail
Most data science projects fail. This is due to a mix of process and technology challenges. Want to get more of your company’s data science projects successfully into production? Watch this video to learn how.
Video Transcript: Why do 85% of Data Science Projects Fail?
Thank you very much. Welcome to dotData’s webinar. My name is Aaron Chang, and I’m the vice president of data science at dotData. I’m very excited to be here today because this is our first dotData webinar. First of all, I want to thank all of you who have come and joined us, either in the room or remotely. Before I get started, I want to tell you a little bit about what I’m going to talk about today. As a very passionate data science practitioner myself, I want to take today’s opportunity to share with you some of the experiences we have had working with 10-plus enterprises on 80-plus data science projects, and to share some of the learnings we have acquired throughout that experience. I also want to give you my personal perspective on why so many enterprise data science projects fail.
I want to start today’s presentation with a statement: data science is core to business today. In my previous job, before dotData, I was a principal manager of data science, overseeing Accenture’s data science practice in the west region. In that job, I saw very clearly how important data science has become to all businesses, really all businesses. I remember very clearly being invited to the chief marketing officer’s office in San Francisco at a very large US retail company to discuss how to use their data to drive their online sales. I was also personally leading an infrastructure data science project at one of the world’s largest social media companies, looking at their infrastructure performance and figuring out ways to improve efficiency and reduce downtime.
In my current job, I have been very fortunate to work with one of the largest banks in the world on their customer data, to help them understand their customers’ behavior so they can design better products and better marketing campaigns. If there’s one thing I’m very certain about after the past decade of experience in this domain, it is that the impact of data science on business is real. It’s very significant, and it is something that no business leader can afford to miss today. But like all the great things in our lives, the future is very bright, but the reality is very harsh. Look at this number. Only 4% of companies have successfully implemented data science and artificial intelligence in their business. And to nobody’s surprise, this 4% are the Googles and Amazons of the world. They have not only successfully implemented data science in their business; as a matter of fact, data science has become a core part of their business. I don’t believe there’s any doubt that without data science, they could not have achieved what they’ve achieved today or become who they are today.
But not everybody is Google or Amazon. There are 46% of companies that have started. They have started hiring data scientists. They have started planning some data science projects. But they are nowhere close to harvesting the business impact that data science has promised them. They are still trying to figure out how to harmonize data science with their existing business processes. They’re still trying to avoid a situation where data scientists only do after-the-fact analysis or siloed exploratory experiments that nobody else cares about. They are really struggling. Then there is the other 50% of the world. They look around and see everybody doing data science.
So they tell themselves, “We must get started.” But how to get started? That’s the problem. They want to hire data scientists, but they don’t know how. So they’re also thinking, shall we go to Accenture or the Deloittes of the world and hire out that type of work? That solution is not optimal for them either. And they constantly struggle with questions like: Do we have enough data? Is my data ready? How about the data quality issue that everybody else is talking about? For them, data science is really, really difficult. But the most disappointing number is actually this one: of the companies that have started, 85% of their projects actually fail. Think about it. You have worked for a very long time to convince your leadership, to secure the budget, to commit the resources.
You have worked many, many months with the business units to discuss what to work on and how to address it. Many planning sessions, many workshops. And in the end, they fail. What does that feel like? Sound familiar? That is the situation. But at this point, I want to tell you that if you have failed, it is not only your problem, or at least you have not failed alone. Data science is a very, very difficult subject. It’s not easy for anybody, by any means. Data science is complex because, first of all, the foundations and the knowledge required to execute data science projects are complex, broad, and ever-changing. Remember, 15 years ago, nobody had ever conceived that we would be accumulating the amount of data that we are processing today.
Take deep learning and neural networks. When that idea was proposed a few decades ago, it was perceived as merely a paper exercise, simply because the amount of computational resources required to execute that algorithm was beyond people’s imagination. And yet it is a business norm today. That is the ever-changing, fast-evolving reality of data science, and we have to deal with it. Because of that, there is no single person and no single organization that is capable of managing the data science process end to end. You have to work with others. You have to work across organizations. And you know as well as I do that working across organizations is never as easy as it appears. Data science is hard, so you need expert resources to jumpstart data science in your organization. But hiring data scientists, especially good, experienced data scientists, in today’s job market typically means one of two things: impossible to find, or stunningly expensive to hire.
That is a challenge a lot of organizations are facing. Also, data science is by itself a very complex process that involves many people. It’s iterative by nature. It’s very much manual. It takes a lot of time to complete. All of this makes data science very challenging and makes the data science process very prone to failure. That is the problem we are trying to work on. At this point, I want to share with you some of the secrets to successful data science. These are the learnings we have acquired through many years of working with our enterprise customers. These considerations must be taken into account before we embark on any kind of data science project. The first one is transparency. As we all know, the transparency of the process dictates the credibility of the outcome.
So in order to make data science successful, in order to have the business users accept your data science project, invite them to the discussion from the beginning. Talk to them. Ask them what they want to predict. Ask them what the evaluation criteria are. Make the process fully transparent to them. Also, whatever goes into your data science models, whatever goes into your machine learning, that part needs to be transparent as well. That means that if you expect the business users to accept your results, you must ensure that they really understand what you are trying to do. The models have to be transparent; they have to be white-box models. The features going into the models have to be understandable, explainable, and auditable. That is simply something we need to be absolutely certain about when we embark on this data science process. And the last point is to be very open-minded about feedback from the business.
They understand the business. They may not understand the mathematics, but they know exactly whether you are doing the right thing from a business perspective. So we have to be very sure that we talk with them, get their feedback, and improve our process. Remember that an impactful piece of data science work always withstands challenges, and challenges can only make it better. So we have to do that. That’s the transparency part. The second part is democratization. As an organization, we have many, many data science projects that we need to work on, and we do not have enough data scientists. What do we do? Wait for that one data scientist to get to more projects, or wait to bring in more data scientists? There are easier solutions. Elevate the people who are currently on your staff. Train them, give them the opportunity, and provide them with tools that empower them to think the way data scientists think and do what data scientists do.
I remember a very good conversation I had with a very experienced manager, actually one of the most successful factory managers in Malaysia. He told me that the reason he was so successful in managing his factory was that he tried to avoid what he called a super technician. He does not want a super technician. He wants everybody in his factory to be a good technician, because having the entire factory depend on a single super technician is a very risky business. What if that person takes a vacation? What if that person leaves? Same thing here. Do you want your entire organization to depend on one or two data scientists? Why not empower the rest of the team, give them the opportunity, and make them so-called citizen data scientists, to truly democratize the data science practice? That is the right approach in today’s business, especially if you have a lot of use cases to work on. Operationalization is another very important subject.
A lot of times, data science projects work wonderfully in someone’s personal Jupyter notebook or on someone’s desktop. But when they become part of the business process, they fail. There are many, many reasons for that. Data scientists are not software engineers. They cannot write software-quality code. Data scientists also do not have that architectural mindset when they start on a problem. That’s why, when you try to integrate their work into a business pipeline, there are all kinds of problems: integration issues, compatibility issues, scalability issues, and many more. So for a large enterprise, if you want to start your data science practice, make sure you have that architectural mindset baked into your solution from day one. That is absolutely critical. You do not want to commit your resources to data science activities and experimentation for a few months and then later realize that the work cannot be productized.
The last one is automation. If a piece of work takes a human three months to finish versus a machine one day, which approach would you take? Of course you go with the one-day approach. Why? Time to market. Also, a lot of projects will fail. They are going to fail no matter what you do. So why not just try them and learn quickly that they are going to fail? That is a good concept that we call fail fast. The simple fact is that in today’s market, failing fast is actually a merit. A lot of organizations want that so they can move on to the next thing, and the next thing, because there are just so many things they have to look at. Use automation, speed up your productivity, and fail fast. That is the right approach.
So those are the four key considerations that an enterprise must have in mind before embarking on its data science journey. Now I want to tell you a little bit about dotData, because this company is really built around those four cornerstones of an enterprise data science practice that we just talked about. The company is a strategic carve-out from NEC Corporation, based in the San Francisco Bay Area. It is the first and only company that focuses on delivering end-to-end data science automation for large enterprise customers. We have worked with many, many large enterprise customers, including one of the world’s largest banks and one of the world’s largest insurance companies, to really empower them to deliver their data science programs end to end.
This is the traditional data science process. As we all know, 85% of the traditional data science process is really about the data. It’s a data journey. It starts with what we call data collection and last-mile ETL. That means working with your source data stored in individual databases: processing, transforming, and cleaning it, and then performing a lot of data architecture work and a lot of profiling work before you can start the process called feature engineering. As we all know in the community, this is the most challenging and most difficult part of data science, because it is a process that requires you to apply domain knowledge to transform the source data. That can be data from 20 different databases, and you have to transform, combine, and aggregate that data.
In the end, you produce something called a single aggregated table, which is a machine-learning-ready format. After feature engineering is finished, the data is machine-learning ready. So you can fetch off-the-shelf machine learning models, run all the predictions you can run, tune the parameters, and do the evaluation. Visualization also helps you understand what you have built and gives you transparency into what you do. The last part is productization. As we talked about, everything needs to become part of the business process before it delivers business impact. If you do not do that, everything you are doing only works on your own laptop, and that is not what you want in your organization. If you look at this traditional process, the data part, the part that does the ETL jobs and the feature engineering, takes months to complete, and that is the bottleneck of the traditional data science process.
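To make the "single aggregated table" idea concrete, here is a minimal Python sketch of hand-written feature engineering. It is only an illustration under assumed inputs: the two source tables, their column names, and their values are hypothetical and do not come from the webinar, and the aggregations shown (count, total, mean) are just common examples, not a description of any particular product.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical source tables (illustrative data, not from the webinar).
customers = [
    {"customer_id": 1, "region": "west"},
    {"customer_id": 2, "region": "east"},
    {"customer_id": 3, "region": "west"},
]
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 5.0},
]


def build_feature_table(customers, transactions):
    """Return one row per customer with aggregated transaction features."""
    # Group transaction amounts by customer.
    amounts = defaultdict(list)
    for tx in transactions:
        amounts[tx["customer_id"]].append(tx["amount"])

    # Combine the customer attributes with the aggregates into a single table.
    table = []
    for customer in customers:
        spent = amounts.get(customer["customer_id"], [])
        table.append({
            "customer_id": customer["customer_id"],
            "region": customer["region"],
            "tx_count": len(spent),
            "tx_total": sum(spent),
            "tx_mean": mean(spent) if spent else 0.0,
        })
    return table


feature_table = build_feature_table(customers, transactions)
for row in feature_table:
    print(row)
```

Each output row describes one customer, which is the shape most off-the-shelf models expect. With 20 source databases instead of two small lists, writing and validating these joins and aggregations by hand is the part that takes months.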
What we are doing at dotData is very simple. We have built a feature engineering engine that takes care of the entire data pipeline, from data collection and last-mile ETL all the way to feature engineering. It’s an automated component that does that part of the dirty work for you. We have an automated machine learning engine that automatically goes through all the off-the-shelf models, searching for the right model and the right parameters to give you the best prediction you can get from the data you provide. We also have a production engine, which is an API for production. It’s a simple API. You can basically call that API to directly fetch the results coming from the model you have chosen.
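The idea of automatically searching for the right model and parameters can be sketched in a few lines of Python. This is a toy stand-in, not dotData's engine: the candidate "models" are a one-dimensional k-nearest-neighbors classifier with different values of k, the training and validation data are made up, and the function names are hypothetical.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, valid, k):
    """Fraction of validation points that are classified correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in valid)
    return hits / len(valid)


def search_best_k(train, valid, candidates):
    """Score every candidate k on held-out data and keep the best one."""
    scores = {k: accuracy(train, valid, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Made-up data: (feature value, label). The point at 3.5 is deliberate noise.
train = [(1.0, "low"), (2.0, "low"), (3.0, "low"), (3.5, "high"),
         (6.0, "high"), (7.0, "high"), (8.0, "high")]
valid = [(2.5, "low"), (3.4, "low"), (6.5, "high"), (7.5, "high")]

best_k, scores = search_best_k(train, valid, candidates=[1, 3, 5])
print("validation accuracy per k:", scores)
print("selected k:", best_k)
```

The design point is that selection is driven by a score on data the model did not train on, so the loop can run unattended over many model families and parameter settings. A real system would search far more candidates and use cross-validation rather than a single small validation set.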
With all this automation, we are talking about just a few days to deliver a data science product. Remember what we saw on the previous slide: with the traditional approach, it takes months. Just the data part by itself takes months. But now, with dotData automation, we are talking about days. Why is this important? First of all, you can do this much, much faster. You can fail much faster. You can have much better productivity. Second, we have significantly lowered the skill barrier for someone to execute data science, because the dirty part, the difficult part, is taken care of by the dotData platform. So it doesn’t have to be a seasoned data scientist or a very experienced data analyst doing the work. It can be a junior business analyst. It can be your data engineer.
It can be someone who has a good sense of the business and good knowledge of the data, but who does not know how to code or may not have enough background in mathematics or statistics. With the dotData platform, they can do that. The dotData platform really empowers them to do what data scientists do and think the way data scientists think. With that type of arrangement, your current data scientists, and there are not that many of them, can focus on the more critical tasks, and the rest of the team can focus on the general data science tasks. In our experience, this type of arrangement is the best combination for a large organization that has many, many data science tasks to work on but does not have enough data scientists to execute all those projects.
So the dotData value is very obvious. First of all, time: we can move your data science process from raw data all the way to delivering business impact in just a few days instead of months. Then skill: we have significantly lowered the barrier for data science projects so that everybody on your team can contribute. Everybody can be elevated and have an impact. Because of our operationalization capability, we can not only execute your data science projects, we can operationalize them by 10x. And our outcomes are very transparent, because transparency is core to the dotData platform. So with that, I want to thank you for your interest today. If you have any questions, feel free to send us an email, and we are very happy to discuss. Thank you.