Understanding the Different Types of Probability Distribution Curves in Statistics
In this article, written by our student Revathi, we will look at the different types of probability distribution curves in statistics. If you are an aspiring data scientist, you will come across statistics again and again. Read our article on the importance of statistics and machine learning here.
1. Common Types of Data:
Explaining the various distributions becomes more manageable once the type of data involved is familiar. Day-to-day experiments usually produce one of two kinds of outcomes: finite or infinite. When rolling a die or picking a card from a deck, there is a limited number of possible outcomes. This type of data is called Discrete Data, which can only take a specified set of values. For example, in rolling a die, the possible values are 1, 2, 3, 4, 5, and 6. In contrast, recording time or measuring a person's height can take infinitely many values within a given interval. This type of data is called Continuous Data, which can have any value within a given range, and that range can itself be finite or infinite. For example, a watermelon's weight can be 10.2 kg, 10.24 kg, or 10.243 kg, making it measurable but not countable, hence continuous. On the other hand, the number of boys in a class is countable, hence discrete.
2. Types of statistical distributions:
Depending on the type of data used, distributions fall into two categories: discrete distributions for discrete data (finite outcomes) and continuous distributions for continuous data (infinite outcomes). The comparison table below shows the difference between discrete and continuous distributions:
| Discrete Distribution | Continuous Distribution |
| --- | --- |
| Finite number of different possible outcomes. | Infinitely many consecutive possible values. |
| Add up individual values to find the probability of an interval. | Cannot add up individual values to find the probability of an interval, because there are infinitely many. |
| The graph consists of bars lined up one after the other. | The graph is a smooth curve. |
The chart below summarizes the different statistical distributions used for discrete and continuous data:
2.1 Discrete distributions:
Finite number of different possible outcomes.
2.1.1 Discrete uniform distribution:
All outcomes are equally likely. In statistics, a uniform distribution is one in which every outcome has the same probability.
Consider rolling a six-sided die. There is an equal probability of obtaining each of the six numbers on the next roll, i.e., obtaining precisely one of 1, 2, 3, 4, 5, or 6, each with a probability of 1/6; hence this is an example of a discrete uniform distribution. As a result, the uniform distribution graph contains bars of equal height representing each outcome. In this example, the height corresponds to a probability of 1/6 (0.166667).
Below is the code for the uniform distribution. It uses the numpy.random module to simulate rolling a six-sided die n_rolls times, with randint() generating random integers between 1 and 6 (inclusive) for each roll. It then plots a histogram of the outcomes using plt.hist(), with 6 bins representing each possible outcome and a range of (0.5, 6.5) to centre the bars over each integer value. The align and rwidth parameters are set to 'mid' and 0.8, respectively, to adjust the bar placement and width for better visualization. Finally, it sets the axis labels and title using plt.xlabel(), plt.ylabel(), and plt.title(), and shows the plot using plt.show().
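A minimal sketch along those lines is shown here; the original code is not reproduced on this page, so the sample size n_rolls = 10_000 and the exact plotting details are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate rolling a fair six-sided die n_rolls times.
# np.random.randint's upper bound is exclusive, so 7 gives values 1..6.
n_rolls = 10_000
rolls = np.random.randint(1, 7, size=n_rolls)

# Histogram with one bin per face, centred over each integer value
plt.hist(rolls, bins=6, range=(0.5, 6.5), align='mid', rwidth=0.8)
plt.xlabel('Die face')
plt.ylabel('Frequency')
plt.title('Discrete uniform distribution: rolling a six-sided die')
plt.show()
```

With enough rolls, all six bars end up at roughly the same height, around n_rolls/6 each, which is the visual signature of a discrete uniform distribution.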
2.1.2 Bernoulli Distribution:
Single-trial with two possible outcomes
The Bernoulli distribution is one of the easiest distributions to understand and can be used as a starting point to derive more complex distributions. Any event with a single trial and only two outcomes follows a Bernoulli distribution; flipping a coin or choosing between True and False in a quiz are examples.
Let's assume flipping a coin once; this is a single trial. The only two outcomes are heads or tails. This is an example of a Bernoulli distribution.
Usually, when working with a Bernoulli distribution, we know the probability of one of the outcomes, p. From p, we can deduce the probability of the other outcome by subtracting it from the total probability of 1, giving 1 - p. The distribution is written as Bern(p), where p is the probability of success. The expected value of a Bernoulli variable x is E(x) = p, and its variance is Var(x) = p(1 - p).
Below is the code for the Bernoulli distribution. It uses the numpy.random module to simulate flipping a loaded coin n_flips times. The binomial function is called with n=1 so that each flip is an independent Bernoulli trial with probability of heads p_heads. It then plots a histogram of the coin-flip outcomes using plt.hist(), with 2 bins representing heads (1) and tails (0) and a range of (-0.5, 1.5) to centre the bars over each integer value. The align and rwidth parameters are set to 'mid' and 0.8, respectively, to adjust the bar placement and width for better visualization. Finally, it sets the axis labels and title using plt.xlabel(), plt.ylabel(), and plt.title(), and shows the plot using plt.show().
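A minimal sketch of that approach, assuming n_flips = 1_000 and a loaded-coin probability p_heads = 0.7 (the original values are not shown):

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate flipping a loaded coin n_flips times.
# n=1 makes each draw a single Bernoulli trial: 1 = heads, 0 = tails.
n_flips = 1_000
p_heads = 0.7
flips = np.random.binomial(n=1, p=p_heads, size=n_flips)

# Two bins centred over 0 (tails) and 1 (heads)
plt.hist(flips, bins=2, range=(-0.5, 1.5), align='mid', rwidth=0.8)
plt.xticks([0, 1], ['Tails (0)', 'Heads (1)'])
plt.xlabel('Outcome')
plt.ylabel('Frequency')
plt.title('Bernoulli distribution: flipping a loaded coin (p = 0.7)')
plt.show()
```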
The resulting graph consists of only two bars, one rising to the associated probability p and the other to 1 - p.
2.1.3 Binomial Distribution:
A sequence of Bernoulli events
The Binomial Distribution can be thought of as the sum of outcomes of an event following a Bernoulli distribution. Therefore, Binomial Distribution is used in binary outcome events, and the probability of success and failure is the same in all successive trials. An example of a binomial event would be flipping a coin multiple times to count the number of heads and tails.
Binomial vs Bernoulli distribution.
The difference between these distributions can be explained through an example. Suppose you are attempting a quiz that contains 10 True/False questions. Attempting a single T/F question would be a Bernoulli trial, whereas attempting the entire quiz of 10 T/F questions would be a Binomial experiment. The main characteristics of the Binomial Distribution are:
- Given multiple trials, each of them is independent of the other. That is, the outcome of one trial doesn’t affect another one.
- Each trial can lead to just two possible results (winning or losing), with probabilities p and (1 – p).
- A binomial distribution is represented by B(n, p), where n is the number of trials and p is the probability of success in a single trial. A Bernoulli distribution can be written as B(1, p), since it has only one trial. The expected value of a binomial random variable x is the expected number of successes, E(x) = np. Similarly, the variance is Var(x) = np(1-p).
Given the probability of success (p) and the number of trials (n), the probability of exactly x successes in those n trials is given by the formula below:
P(X = x) = C(n, x) p^x (1 - p)^(n - x), where C(n, x) = n! / (x!(n - x)!) is the number of ways to choose x successes from n trials.
For example, suppose that a candy company produces both milk chocolate and dark chocolate candy bars, with half of the bars being milk chocolate and half dark chocolate. Say a customer chooses ten candy bars at random, and choosing a milk chocolate bar is defined as a success. The probability distribution of the number of successes in these ten trials with p = 0.5 is shown in the binomial distribution graph.
Below is the code for the binomial example. It uses the numpy.random module to simulate a candy company producing n_bars candy bars. The binomial function is called with n=1 so that each candy bar is an independent Bernoulli trial with probability p_milk of being milk chocolate. The number of milk chocolate bars is the sum of the milk chocolate outcomes (1s), and the number of dark chocolate bars is the total number of bars minus the number of milk chocolate bars. It then plots a histogram of the outcomes using plt.hist(), with 2 bins representing milk chocolate (1) and dark chocolate (0) and a range of (-0.5, 1.5) to centre the bars over each integer value. The align and rwidth parameters are set to 'mid' and 0.8, respectively, and a legend distinguishes the two candy bar types. Finally, it sets the axis labels and title using plt.xlabel(), plt.ylabel(), and plt.title(), and shows the plot using plt.show().
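A minimal sketch of that simulation, assuming n_bars = 1_000 and p_milk = 0.5; the legend handling is one possible interpretation of the description:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate producing n_bars candy bars; each bar is milk chocolate with probability p_milk
n_bars = 1_000
p_milk = 0.5
outcomes = np.random.binomial(n=1, p=p_milk, size=n_bars)  # 1 = milk chocolate, 0 = dark chocolate

n_milk = outcomes.sum()      # number of milk chocolate bars produced
n_dark = n_bars - n_milk     # number of dark chocolate bars produced

# Two bins centred over 0 (dark) and 1 (milk), plotted separately so the legend
# can distinguish the two candy bar types
plt.hist(outcomes[outcomes == 0], bins=1, range=(-0.5, 0.5), align='mid', rwidth=0.8,
         label=f'Dark chocolate ({n_dark})')
plt.hist(outcomes[outcomes == 1], bins=1, range=(0.5, 1.5), align='mid', rwidth=0.8,
         label=f'Milk chocolate ({n_milk})')
plt.xticks([0, 1], ['Dark (0)', 'Milk (1)'])
plt.xlabel('Candy bar type')
plt.ylabel('Frequency')
plt.title('Candy bar production as Bernoulli trials (p = 0.5)')
plt.legend()
plt.show()
```

To reproduce the binomial graph of the ten-bar example directly, the same idea extends to np.random.binomial(n=10, p=0.5, size=...) and a histogram of the number of milk chocolate bars per batch of ten.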
2.1.4 Poisson Distribution:
The probability of a given number of events occurring in a fixed interval.
The Poisson distribution deals with the frequency with which an event occurs within a specific interval. Rather than the probability of a single event, the Poisson distribution requires knowing how often the event occurs, on average, over a particular period or distance.
For example, a cricket chirps two times in 7 seconds on average. Use the Poisson distribution to determine the likelihood of it chirping five times in 15 seconds.
A Poisson process is represented with the notation Po(λ), where λ is the expected number of events in the given period. Both the expected value and the variance of a Poisson process are λ, and X represents the discrete random variable (the number of events). A Poisson distribution can be modelled using the following formula:
P(X = x) = (λ^x e^(-λ)) / x!, for x = 0, 1, 2, …
The main characteristics which describe the Poisson Processes are:
- The events are independent of each other.
- An event can occur any number of times (within the defined period).
- Two events can’t take place simultaneously.
Below is the code for the Poisson distribution. It uses the scipy.stats module to create a Poisson distribution with the given rate parameter rate and calculates the probability mass function (PMF) over a range of possible outcomes x. The PMF is plotted with plt.plot() and plt.vlines(), showing the probability of each outcome as a blue dot with a blue vertical line. Vertical lines are then added at the expected number of chirps and the observed number of chirps using plt.axvline(), drawn as red dashed and green dashed lines via the color, linestyle, and label parameters. Finally, the axis labels and title are set using plt.xlabel(), plt.ylabel(), and plt.title(), a legend is added using plt.legend(), and the plot is shown using plt.show().
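A minimal sketch of that plot; the rate is taken from the cricket example (2 chirps per 7 seconds scaled to a 15-second window), while the plotting range is an assumption:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

# Cricket example: 2 chirps per 7 seconds on average, observed over a 15-second window
rate = 2 / 7 * 15          # expected number of chirps in 15 seconds (λ ≈ 4.29)
observed_chirps = 5

# Probability mass function over a range of possible chirp counts
x = np.arange(0, 16)
pmf = poisson.pmf(x, mu=rate)

plt.plot(x, pmf, 'bo', label='Poisson PMF')
plt.vlines(x, 0, pmf, colors='b', alpha=0.5)
plt.axvline(rate, color='red', linestyle='--', label=f'Expected chirps (λ ≈ {rate:.2f})')
plt.axvline(observed_chirps, color='green', linestyle='--', label='Observed chirps (5)')
plt.xlabel('Number of chirps in 15 seconds')
plt.ylabel('Probability')
plt.title('Poisson distribution of cricket chirps')
plt.legend()
plt.show()

# Probability of exactly 5 chirps in 15 seconds
print(f'P(X = 5) = {poisson.pmf(observed_chirps, mu=rate):.4f}')
```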
2.2 Continuous distributions:
Infinitely many consecutive possible values.
2.2.1 Normal Distribution:
Symmetric distribution of values around the mean
The normal distribution is the most commonly used distribution in data science. Also known as the Gaussian distribution, it is a continuous probability distribution often used to describe real-world phenomena such as height, weight, and IQ scores. It has a bell-shaped curve, with the majority of data points clustered around the mean value.
In a normal distribution graph, data is symmetrically distributed with no skew. When plotted, the data follows a bell shape, with most values clustering around a central region and tapering off as they go further away from the centre.
The normal distribution frequently appears in nature and life in various forms. For example, the scores of a quiz follow a normal distribution: many of the students score between 60 and 80, as illustrated in the graph below, while scores outside this range deviate further from the centre.
Below is the code for the normal distribution. First, set the mean and standard deviation to mu = 70 and sigma = 5, and the minimum and maximum quiz scores of interest to min_score = 60 and max_score = 80. Next, create an array of possible scores using np.linspace() and calculate the probability density function (PDF) for each score using norm.pdf(). Then plot the normal distribution using plt.plot() and fill the area under the curve for scores between 60 and 80. Finally, set the title and axis labels using plt.title(), plt.xlabel(), and plt.ylabel(), and display the plot using plt.show().
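A minimal sketch using the stated values mu = 70, sigma = 5, min_score = 60, and max_score = 80; the shaded-probability annotation in the legend is an addition:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Quiz-score parameters from the example
mu, sigma = 70, 5
min_score, max_score = 60, 80

# Probability density function over a range of possible scores
scores = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 500)
pdf = norm.pdf(scores, loc=mu, scale=sigma)

plt.plot(scores, pdf, 'b-')

# Shade the area under the curve for scores between 60 and 80
mask = (scores >= min_score) & (scores <= max_score)
prob = norm.cdf(max_score, mu, sigma) - norm.cdf(min_score, mu, sigma)
plt.fill_between(scores[mask], pdf[mask], alpha=0.4,
                 label=f'P(60 <= score <= 80) ≈ {prob:.2f}')

plt.title('Normal distribution of quiz scores')
plt.xlabel('Score')
plt.ylabel('Probability density')
plt.legend()
plt.show()
```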
You might be interested in more advanced concepts such as Gaussian processes, which use the Gaussian distribution as a foundation. Refer to this link to learn more about Gaussian processes.
2.2.2 Student t-Test Distribution:
Small sample size approximation of a normal distribution
The Student's t-distribution, also known simply as the t distribution, is a statistical distribution similar to the normal distribution: it has a bell shape but heavier tails. The t distribution is used instead of the normal distribution when you have small sample sizes.
For example, if you are dealing with the total number of apples sold by a shopkeeper in a month, you would use the normal distribution. Whereas, if you are dealing with the total number of apples sold in a day, i.e., a smaller sample, you would use the t distribution.
Another critical difference between the Student's t distribution and the normal one is that, apart from the mean and variance, you must also define the degrees of freedom for the distribution. In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. A Student's t distribution is represented as t(k), where k is the number of degrees of freedom. For k > 1, the expected value is 0, and for k > 2 the variance is k/(k - 2).
Overall, the student t distribution is frequently used when conducting statistical analysis and plays a significant role in performing hypothesis testing with limited data.
Below is the code for the Student's t distribution. First, set the mean and standard deviation of the underlying distribution to mu = 200 and sigma = 20. Then generate a sample of apple-sales data using np.random.normal() and calculate the sample mean and standard deviation using np.mean() and np.std() with ddof=1, to account for the fact that the standard deviation is estimated from a sample. Next, calculate the degrees of freedom as df = sample_size - 1, set the range of possible t-values using np.linspace(), and calculate the probability density function (PDF) for each t-value using t.pdf(). Plot the Student's t distribution using plt.plot() and shade the area under the curve for t-values outside the 95% confidence interval using plt.fill_between() and the alpha parameter. Finally, set the title and axis labels using plt.title(), plt.xlabel(), and plt.ylabel(), and display the plot using plt.show().
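A minimal sketch using the stated values mu = 200 and sigma = 20; the sample size of 10 and the marker for the sample's t statistic are assumptions added for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t

# Assumed parameters for daily apple sales and a small sample
mu, sigma = 200, 20
sample_size = 10
sample = np.random.normal(mu, sigma, size=sample_size)

sample_mean = np.mean(sample)
sample_std = np.std(sample, ddof=1)   # ddof=1: standard deviation estimated from a sample
df = sample_size - 1                  # degrees of freedom

# t statistic of the sample mean relative to mu (added for illustration)
t_stat = (sample_mean - mu) / (sample_std / np.sqrt(sample_size))

# Student's t probability density function
t_values = np.linspace(-4, 4, 500)
pdf = t.pdf(t_values, df)

plt.plot(t_values, pdf, 'b-', label=f"Student's t, df = {df}")

# Shade the tails outside the central 95% of the distribution
t_crit = t.ppf(0.975, df)             # two-sided 95% critical value
tails = (t_values <= -t_crit) | (t_values >= t_crit)
plt.fill_between(t_values, pdf, where=tails, alpha=0.4, label='Outside 95% interval')

plt.axvline(t_stat, color='green', linestyle='--', label=f'Sample t statistic ({t_stat:.2f})')
plt.title("Student's t distribution for a small sample of daily apple sales")
plt.xlabel('t value')
plt.ylabel('Probability density')
plt.legend()
plt.show()
```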
2.2.3 Exponential distribution:
Model elapsed time between two events
The exponential distribution is one of the most widely used continuous distributions. It is used to model the time taken between events. For example, in physics, it is often used to measure radioactive decay; in engineering, to measure the time until a defective part arrives on an assembly line; and in finance, to measure the likelihood of the next default for a portfolio of financial assets. Another common application of exponential distributions is in survival analysis (e.g., the expected life of a device or machine).
The exponential distribution is commonly represented as Exp(λ), where λ is the distribution parameter, often called the rate parameter. The value of λ is given by λ = 1/μ, where μ is the mean. For the exponential distribution, the standard deviation is the same as the mean, and the variance is Var(x) = 1/λ^2.
The exponential graph is a curve showing how the probability density decays exponentially: it is highest near zero and tapers off as the value increases. Exponential distributions are commonly used in calculations of product reliability or the length of time a product lasts, as sketched below.
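As an added illustration (this code is not from the original article), a minimal sketch of the exponential PDF, assuming a mean waiting time of mu = 2 so that λ = 1/μ = 0.5:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon

# Assume the mean waiting time between events is mu = 2, so the rate is lambda = 1/mu = 0.5
mu = 2
lam = 1 / mu

x = np.linspace(0, 10, 500)
pdf = expon.pdf(x, scale=1 / lam)   # f(x) = λ e^(-λx); scipy parameterizes with scale = 1/λ

plt.plot(x, pdf, 'b-', label=f'Exp(λ = {lam})')
plt.xlabel('Time between events')
plt.ylabel('Probability density')
plt.title('Exponential distribution (mean = standard deviation = 1/λ)')
plt.legend()
plt.show()
```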
Below is the code for exponential population growth, a related idea that uses the exponential function rather than the exponential probability distribution. The growth of populations can often be modelled by an exponential function: in the absence of limiting factors, populations tend to grow at an exponential rate. The equation for population growth is P(t) = P0 * e^(rt), where P(t) is the population size at time t, P0 is the initial population size, r is the growth rate, and e is the mathematical constant.
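A minimal sketch of that growth curve, with illustrative values P0 = 1,000 and r = 0.05 (the original code and parameter values are not shown):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative values (not from the article): initial population and growth rate
P0 = 1_000        # initial population size
r = 0.05          # growth rate per unit time
t = np.linspace(0, 50, 200)

# Exponential growth: P(t) = P0 * e^(r*t)
population = P0 * np.exp(r * t)

plt.plot(t, population)
plt.xlabel('Time t')
plt.ylabel('Population size P(t)')
plt.title('Exponential population growth: P(t) = P0 * e^(rt)')
plt.show()
```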
Conclusion:
Probability distributions are a common way to describe, and possibly predict, the probability of an event. The main point is to characterize the variable whose behaviour is being described through probability (discrete or continuous). Identifying the right category allows the proper application of a model (for instance, the standardized normal distribution) that can easily predict the probability of a given event.
Data is an essential component of the data exploration and model development process. The first thing that springs to mind when working with continuous variables is looking at the data distribution. Identifying the pattern in the data distribution lets you adjust machine learning models to best match the problem, which reduces the time needed to reach an accurate outcome.
Indeed, specific machine learning models are built to perform best when certain distribution assumptions are met. Knowing which distributions you are dealing with may thus help in determining which models to apply.
You can read more about the difference between discrete and continuous probability distributions here.
Do not forget to check out the syllabus of our Advanced Data Science course in Rajajinagar.