Importance of statistics in data science
In this series of articles, we will explore the importance and impact of different topics in data science. In this article, we will look at the importance of statistics. So, how important is statistics? Well, if you know the basic definition of statistics, you know it is the science of data! Statistics is the mother of data science, and one cannot be a data scientist without strong statistical skills. Let’s elaborate. Analytics built on statistics is broadly divided into five categories:
- Descriptive – What happened?
- Diagnostic – Why did it happen?
- Predictive – What could happen?
- Prescriptive – What should we do?
- Cognitive – Cause something to happen!
The first two, descriptive and diagnostic analytics, are analyses of historical data. They are used to identify current patterns, business movements, business health, business performance over the last few years, and so on. A person responsible for descriptive and diagnostic analytics is called a data analyst. This analysis can be done with traditional tools like Excel, Tableau, SQL, PowerPoint, etc. The skillset needed to become a data analyst or business intelligence engineer is comparatively easy to acquire, as few or no programming skills are sufficient for most of these jobs. This path is recommended for freshers who want to enter IT or the world of data without much coding experience or a long wait. The downsides, though, are limited growth, lower salaries, and high competition, since the entry barrier is low: anyone can become a data analyst after 2 to 3 months of training. The recent trend shows a lot of MBAs and non-CS engineers choosing data analytics as their way into IT.
When it comes to predicting future business outcomes, the first two types of analytics are of little help. This is where companies seek data scientists, as advanced statistical skills are part of their toolbox. Predictive, prescriptive, and cognitive analytics are what set a data scientist apart from a data analyst. Though these three look fascinating, a data scientist needs to go through intense training before they can truly master and apply them to real-world use cases. Programming is no longer optional, as much of data science needs to be implemented from scratch. Concepts like data pipeline design, machine learning, model training, deep learning, reinforcement learning, and model deployment are used extensively on a daily basis. But the hard work pays off: compensation is comparatively high, and the job is more secure due to the high entry barrier.
The importance of statistics in data science cannot be ignored, but where should you start learning? Which concepts are critical for your data science career? Well, no single concept is universally the most important, since which one you use depends heavily on the data and the problem statement. Still, a few concepts are used far more often than others.
Don’t get me wrong, all the distributions are important! But if we had to pick the one distribution that is used all the time, it would be the normal distribution. Even the name says it is normal! Also known as the Gaussian distribution, it underlies much of machine learning: many formulas are derived by assuming the data is normally distributed. A good understanding of the normal distribution in multi-dimensional data is crucial for concepts like regression analysis, classification, neural networks, confidence intervals, hypothesis testing, etc. So don’t skip it. This video by UBC is a great resource for understanding the multidimensional Gaussian process.
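As a quick illustration, here is a minimal sketch (the sample and its parameters are made up for illustration) of the well-known property that roughly 68% of normally distributed data falls within one standard deviation of the mean:

```python
import numpy as np

# Hypothetical data: 10,000 points drawn from a normal distribution
# with mean 50 and standard deviation 5 (illustrative values only).
rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=50.0, scale=5.0, size=10_000)

mu = sample.mean()            # estimated mean
sigma = sample.std(ddof=1)    # estimated standard deviation

# For a normal distribution, ~68.3% of values lie within 1 sigma of the mean.
within_one_sigma = np.mean(np.abs(sample - mu) < sigma)
print(f"mean~{mu:.2f}, std~{sigma:.2f}, within 1 sigma: {within_one_sigma:.3f}")
```

The same check on real data is a quick, rough way to see whether a normality assumption is even plausible before relying on it.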
Out of 10 companies, 5 will ask about Bayes' theorem if they want to test whether you are good at statistics and probability. Bayes' theorem, along with Bayesian networks, is so important that an entire branch of statistics is named after it. We use this theorem to compute conditional probabilities, and conditional probabilities are ubiquitous in predictive analytics; much of machine learning is built on them. We also use Bayesian concepts extensively in NLP. So, the answer is easy: learn it.
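To make this concrete, here is a minimal sketch of Bayes' theorem in a toy spam-filter setting. All probabilities below are made-up numbers purely for illustration:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
# Hypothetical spam-filter numbers (made up for illustration):
p_spam = 0.20                  # prior: P(spam)
p_word_given_spam = 0.60       # likelihood: P("free" appears | spam)
p_word_given_ham = 0.05        # P("free" appears | not spam)

# Law of total probability: overall chance of seeing the word "free".
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior: probability the email is spam given it contains "free".
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | 'free') = {p_spam_given_word:.3f}")
```

This flip from P(word | spam), which is easy to estimate from data, to P(spam | word), which is what we actually want, is exactly why conditional probability shows up everywhere in predictive analytics.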
A good understanding of confidence intervals is necessary for statistical modeling. In regression analysis, which belongs to inferential statistics, data scientists use confidence intervals when estimating the population mean. Time series analysis and risk management also use confidence intervals extensively. Z-statistics and t-statistics are covered under this topic, so you will build a good understanding of both.
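For instance, here is a minimal sketch of a 95% confidence interval for a population mean. The sample is simulated for illustration; in practice you would plug in real measurements:

```python
import numpy as np

# Hypothetical sample: 40 daily revenue figures (simulated for illustration).
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=100.0, scale=12.0, size=40)

n = sample.size
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean

# 95% confidence interval using the z critical value 1.96; with n = 40,
# the t critical value (about 2.02) would give a slightly wider interval.
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem
print(f"95% CI for the population mean: ({lower:.2f}, {upper:.2f})")
```

The choice between the z and t critical values is exactly where Z-statistics versus t-statistics comes in: use t when the sample is small and the population variance is unknown.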
Every data scientist will agree that hypothesis testing is used frequently, and a good understanding of it is crucial for success in data science. But to understand or conduct hypothesis testing correctly, one needs to be strong in the other statistical concepts above. A/B testing is a great example of hypothesis testing. If you want us to write an article on hypothesis testing, tell us in a comment.
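To show what an A/B test looks like in practice, here is a sketch of a two-proportion z-test on hypothetical conversion data (all counts are made up for illustration):

```python
import math

# Hypothetical A/B test: conversion counts for two page variants (made-up data).
conv_a, n_a = 120, 2400   # variant A: 5.0% conversion
conv_b, n_b = 156, 2400   # variant B: 6.5% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled proportion under H0

# Two-proportion z-test: H0 says both variants convert at the same rate.
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

# Two-sided p-value via the standard normal CDF, written with math.erf.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

If the p-value falls below the chosen significance level (commonly 0.05), we reject the null hypothesis that the two variants convert at the same rate.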
We hope this article on the importance of statistics in data science has helped you shortlist a few common statistical concepts that you will use regularly as a data scientist. In the next article, we will look at the importance of Linear Algebra in data science.