5 Ultimate Data Science Principles that can be used in Any Industry Project
By Kamal Jacob
In the previous article, we discussed success stories of companies that utilized data science to shape their products and brought revolutionary improvements to their business. Such stories are testimonies of how an effective data science strategy can drive business value and why hiring data scientists can help increase revenue and profit. Following such success stories, most organizations are now making attempts to be more data-driven.
However, data science must be used wisely to deliver value. In this article, we cover the top 5 industry-agnostic data science principles that can help drive your data science strategy towards success.
Top 5 Effective Data Science Principles:
1) Begin With the End in Mind
“Data scientists often run into the issue of trying to add artificial intelligence or machine learning capabilities without concrete objectives.”
In the 2014 FIFA World Cup, the end goal of the German team was to defeat Brazil. In 2012, Germany hired about 50 students at the German Sport University Cologne, who pored over countless hours of footage of the Brazilian players. These students spent days noting the players’ individual running patterns, their reactions to fouls, and every other quantifiable aspect. Never-before-noticed patterns started to emerge, and these patterns became a key part of how the German players prepared for the game.
Instead of setting the broad goal of winning the World Cup and spending time studying every team, they were clear that their main target was Brazil, and hence could prepare better.
Many ‘new data scientists’ turn to data mining and use it as a fishing expedition to explore possibilities with the available data. While some unsupervised data mining is important to ensure we do not miss any possible pattern, trend, or correlation in the data set, it is important to be explicit about what we are trying to achieve in the first place. Knowing the end goal goes a long way in ensuring that we choose the correct data sets, recruit the right talent, and choose the right algorithms to process the data.
The lack of clarity about the problem at hand is one of the major challenges that most data scientists have to deal with. Therefore, whenever taking on a data science task or project, one must ask:
1) What is the end goal and what are we trying to achieve?
2) Do we have a plan?
3) How will we decide whether the results achieved are the desired results?
Quantifying end goals with measurable metrics makes it easier to build the right data visualizations and to see how far the project has progressed towards the ‘success threshold’.
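As a rough illustration of tracking progress against a ‘success threshold’, the sketch below assumes a single numeric metric with an agreed baseline and target. The function names and the example numbers are hypothetical and only meant to show the idea.

```python
def progress_toward_goal(baseline, current, target):
    """Fraction of the gap between baseline and target closed so far."""
    if target == baseline:
        raise ValueError("target must differ from baseline")
    return (current - baseline) / (target - baseline)


def meets_threshold(current, target, higher_is_better=True):
    """True once the metric has reached the agreed success threshold."""
    return current >= target if higher_is_better else current <= target


# Hypothetical example: a retention metric that started at 60%
# with an agreed goal of 80%, currently measured at 72%.
progress = progress_toward_goal(baseline=0.60, current=0.72, target=0.80)
done = meets_threshold(current=0.72, target=0.80)
```

Here `progress` is the share of the gap already closed, and `done` tells you whether the project can be called a success under the definition agreed at the start.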
2) Ensure that Your Organization is ready for AI
With the increasing popularity of data science, more and more companies are now hiring data scientists. However, more often than not, these companies are not ready for data science implementation. Most of these companies lack the basic infrastructure needed to implement the data science algorithms and operations.
Companies need to be aware that machine learning comes at a much later stage in the data stack. Before anything can be done with the data, it first has to be reliably collected, transformed, stored, secured, and then explored. Only when the data stack follows a correct, well-coordinated upstream process can the data be easily explored and subsequently used for analytics, AI, and machine learning.
Image source - https://www.aberdeen.com/cmo-essentials/marketers-ready-artificial-intelligence/
This means that if your company is not yet ready to adopt machine learning, it should focus on building the basic infrastructure first. Hiring an experienced data scientist without a proper tracking and database system in place can prove disastrous, both for the company and for the employee.
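One way to make this readiness check concrete is a simple ordered checklist. The sketch below assumes the stage order described in this section (collect, transform, store, secure, explore, then analytics and machine learning); the stage names and helper functions are illustrative, not a standard.

```python
# Stage order assumed from the description above; names are illustrative.
DATA_STACK_STAGES = [
    "collect", "transform", "store", "secure",
    "explore", "analytics", "machine_learning",
]


def first_missing_stage(completed):
    """Return the earliest stage not yet in place, or None if all are."""
    for stage in DATA_STACK_STAGES:
        if stage not in completed:
            return stage
    return None


def ready_for(stage, completed):
    """A stage is ready only when every upstream stage is in place."""
    upstream = DATA_STACK_STAGES[:DATA_STACK_STAGES.index(stage)]
    return all(s in completed for s in upstream)


# Hypothetical company that collects and transforms data but has
# no reliable storage yet.
in_place = {"collect", "transform"}
next_step = first_missing_stage(in_place)
can_do_ml = ready_for("machine_learning", in_place)
```

In this example the next thing to build is storage, and machine learning is not yet on the table.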
3) Data is Not Even, and that Makes it Special
Most people assume that good data is perfectly clean and evenly distributed. In reality, however, real data is neither clean nor evenly distributed. And that is, in fact, the most awesome thing about data. Many times, these asymmetries, anomalies, and other ‘warts’ reveal interesting facts about the domain we are studying.
For example, while performing segmentation, outliers are often dismissed, ignored, or clipped from the data on the assumption that they are noise or anomalies. But what if those outliers represent a totally new type of behaviour?
The ability to discover the unexpected features of the data is what makes data science novel and interesting. If you want to make the most of the data at hand, spend some time with the long tails, the outliers, the tails of the Q-Q plot, and so on. You might make a ‘Surprise Discovery’. It is this very characteristic of data that enables us to make unusual and unexpected observations in the domain of our study. Applying non-parametric statistical tests, which do not assume a particular distribution, can open up a whole new world of data-driven discovery.
Following the chosen path and simply recommending the obvious is not going to make a difference. Explore the unexplored and you will fall in love with your data, because of its diversity.
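A minimal sketch of ‘spending time with the outliers’ instead of clipping them: the function below separates unusual values from typical ones so that they can be inspected as a possible segment of their own. It uses Tukey’s interquartile-range fences, which is one common rule of thumb and an assumption on our part; the order values are made up.

```python
from statistics import quantiles


def split_by_tukey_fences(values, k=1.5):
    """Split values into (typical, unusual) using IQR-based fences.

    Nothing is discarded: the unusual values are returned so they can
    be studied rather than silently dropped.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    typical = [v for v in values if low <= v <= high]
    unusual = [v for v in values if v < low or v > high]
    return typical, unusual


# Hypothetical order values; one customer behaves very differently.
orders = [10, 12, 11, 13, 12, 11, 10, 95]
typical, unusual = split_by_tukey_fences(orders)
```

The point is what happens next: rather than deleting `unusual`, look at who those customers are and whether they represent a new type of behaviour.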
4) Work on Projects that Add Value to the Business
It is very important to ask the right questions when choosing a machine learning or data science project to work on, regardless of the industry. This matters mainly because of two factors:
1. The extended timeline of machine learning projects, which implies that the cost of undertaking the wrong project can be huge (and may exceed the benefits).
2. The reliance on data points that may not be available in time, which means results are not guaranteed.
Calculating the opportunity size and its potential impact on the business will help you determine whether the project is worth undertaking. Always, without fail, embark only on projects whose output directly impacts business levers. In other words, the results of the projects should always be directly actionable.
A generic formula to calculate the size of a business opportunity:
“Number of customers affected × target size of effect = Estimated project impact”
For example, a real estate company may ask its data science team to make a rough estimate of the number of people who are looking to buy flats in a certain area. Merely having the number of potential buyers does not impact any business levers, as you do not know their preferences, choices, or budget. Just identifying the potential customers is not enough. Finding their specific needs is what will make all the difference.
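The opportunity-size formula given earlier in this section can be sketched in a few lines. The comparison against project cost is our own assumption about how the estimate might be used, and all numbers are hypothetical.

```python
def estimated_project_impact(customers_affected, effect_per_customer):
    """Number of customers affected * target size of effect."""
    return customers_affected * effect_per_customer


def worth_undertaking(impact, project_cost):
    """Assumed decision rule: proceed only if impact exceeds cost."""
    return impact > project_cost


# Hypothetical numbers: 20,000 customers, a target uplift of 15
# currency units per customer, and a project cost of 120,000.
impact = estimated_project_impact(customers_affected=20_000,
                                  effect_per_customer=15)
go_ahead = worth_undertaking(impact, project_cost=120_000)
```

Even a back-of-the-envelope estimate like this makes it easier to compare candidate projects before committing months of work.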
5) Adopt an Iterative Approach
Winners of most machine learning competitions follow an iterative approach. This means you should start with a simple working model and then iterate. The iterative approach in data science focuses on reaching the ‘first working model’ quickly rather than starting with a model with tons of variables and features. Once the first basic model is built, features are added as the focus shifts to continual improvement.
Iterative Nature of Data Science Model
Image source- https://blog.datarobot.com/measure-once-cut-twice-moving-towards-iteration-in-data-science
In order to take advantage of the empirical nature of machine learning, we need to reduce the cost per attempt. This means running a higher number of trials (say, N) and allocating 1/N of the available time to each trial, which minimizes the probability of missing anything and thus maximizes the profit. The baseline models need not be tested on the full data before implementation. For example, you can A/B test a given model on customers from a single geography at a time, and repeat this for a few geographies before testing it on global customers.
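The geography-by-geography rollout could look roughly like the sketch below. It assumes conversion counts are available for a control group (A) and a model-driven group (B) in each geography, and it uses a two-proportion z-test as the significance check; both the test choice and the sample numbers are our assumptions, not part of the original example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference B minus A.

    Assumes the pooled conversion rate is strictly between 0 and 1.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def staged_rollout(results, alpha=0.05):
    """Walk through geographies in order; stop at the first one where
    the model (B) does not significantly beat the control (A)."""
    passed = []
    for geo, conv_a, n_a, conv_b, n_b in results:
        z, p_value = two_proportion_z_test(conv_a, n_a, conv_b, n_b)
        if z > 0 and p_value < alpha:
            passed.append(geo)
        else:
            break
    return passed


# Hypothetical results: (geography, conv_A, n_A, conv_B, n_B)
results = [
    ("north", 100, 1000, 150, 1000),
    ("south", 100, 1000, 105, 1000),
    ("east", 100, 1000, 160, 1000),
]
passed = staged_rollout(results)
```

Because each geography is a small, cheap trial, a weak result in one region is caught before the model is exposed to global customers.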
Data scientists use model error analysis to find weak areas of the model and often seek feedback from domain experts on areas that need improvement. Typically, an iterative approach is much shorter, ending as soon as the model has improved enough to meet the business requirements.
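The loop described in this section (start from a simple baseline, add features, stop once the business requirement is met) can be sketched as follows. The `evaluate` callback stands in for whatever training and validation routine a project actually uses; it, the feature names, and the scores are hypothetical.

```python
def iterate_features(candidates, evaluate, target=None):
    """Greedy iteration: begin with no extra features (the baseline),
    keep a candidate feature only if it improves the score, and stop
    early once the score reaches the business target."""
    selected = []
    best = evaluate(selected)  # score of the first working model
    for feature in candidates:
        if target is not None and best >= target:
            break  # good enough for the business requirement
        score = evaluate(selected + [feature])
        if score > best:
            selected, best = selected + [feature], score
    return selected, best


# Toy stand-in for model training: the baseline scores 50 and each
# feature shifts the score by a fixed (made-up) number of points.
TOY_GAINS = {"price": 10, "region": 4, "noise": -2}


def toy_evaluate(features):
    return 50 + sum(TOY_GAINS[f] for f in features)


features, score = iterate_features(["price", "region", "noise"], toy_evaluate)
early_features, early_score = iterate_features(
    ["price", "region", "noise"], toy_evaluate, target=60
)
```

With no target, the loop keeps the two helpful features and rejects the harmful one; with a target of 60, it stops after the first feature because the requirement is already met.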
The above-mentioned data science principles are simple to understand and easy to follow. Knowing the ‘end result’ will help you quantify your success. Ensuring that your company is ready to adopt AI and is following an AI hierarchy is essential for scalable machine learning.
Similarly, you can break boundaries and make surprising discoveries by delving deep into the uneven, unexplored data. Also, asking the right questions will ensure that your projects deliver real value to the business.
And finally, following an iterative approach helps reduce the cost per iteration and minimizes the probability of inconsequential results.
To understand these principles and devise some of your own, it is important to have 360-degree knowledge of data science. The best approach is to pursue a certificate program or course that covers the A-Z of data science. For example, Manipal ProLearn’s Data Science course covers these topics through its in-depth curriculum and practical learning methodology, and helps you build the solid portfolio required for a career in data science.