Statistics is one of the essential subjects in data science: it provides the tools and methods needed to draw deeper insights from data.
Data Scientists need a solid understanding of Statistics to perform quantitative analysis of data, and statistics plays a particularly significant role in building Machine Learning algorithms. Statistics is mainly of two types.
- Descriptive Statistics
- Inferential Statistics.
All the analysis we did in the previous article, “Exploratory Data Analysis,” was Descriptive Statistics. With Descriptive Statistics, we merely describe what is present or shown in the data. We saw how to discover patterns in given data using various approaches and visualization techniques.
With Inferential Statistics, we try to reach conclusions that extend beyond the data. Sometimes, we have to work on a large amount of data for our analysis, which may take too much time and resources. In those situations, we use Inferential Statistics.
Inferential Statistics:
Inferential Statistics makes inferences and predictions about a large dataset by examining a sample drawn from it. It uses probability to reach conclusions.
The process of “inferring” insights from sample data is called “Inferential Statistics.”
A familiar real-world example of “Inferential Statistics” is a weather forecast predicting the amount of rainfall we will get in the next month.
To understand Inferential Statistics, we need basic knowledge of the following fundamental topics in Probability.
- The Basic Definition of Probability
- The multiplication rule of Probability
- The addition rule of Probability
- nCr (Combination)
We can practice the basics here,
Random Variable:
Let’s take a real-world example of Slot-Machines in Casinos. How do Casinos make sure they don’t lose money in the long run, on their slot-machines?
It’s pretty simple; they use Probability.
To understand this, let’s play a game,
- Take a bag which contains 3 Red balls and 2 Blue balls.
- The game is: we pick one ball from the bag, note its color, and place the ball back inside the bag; then we pick a ball again, note the color, and put it back. We repeat this process a total of 4 times. This whole process is one set.
- Conduct this experiment 75 times, i.e., 75 sets.
- The condition is: if we draw a Red ball all four times, i.e., the set contains only Red Balls, we receive $150. If not, we have to pay $10 to the dealer.
Let’s see whether the dealer wins or loses money in the long run if we play this game. To understand all this, we’ll approach the problem in 3 steps.
- Find all the Possible outcomes.
- Find the Probability of each outcome.
- Using the Probabilities, estimate the profit/loss.
1. Possible Outcomes
Let’s list all the possible outcomes we can get if we draw a ball from the bag four times.
- RRRR
- BRRR, RBRR, RRBR, RRRB
- BBRR, BRBR, BRRB, RBBR, RBRB, RRBB
- BBBR, BBRB, BRBB, RBBB
- BBBB
There are a total of 16 outcomes possible.
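The list above can also be generated programmatically. The following is a minimal sketch, assuming each draw is encoded as the letter "R" or "B"; the variable names are illustrative and not part of the original game description.

```python
from collections import Counter
from itertools import product

# Each draw is Red ("R") or Blue ("B"); one set is 4 draws with replacement,
# so every length-4 sequence of R/B is a possible outcome.
outcomes = ["".join(draws) for draws in product("RB", repeat=4)]

# Group the outcomes by how many red balls they contain.
reds_per_outcome = Counter(outcome.count("R") for outcome in outcomes)

print(len(outcomes))                     # number of distinct sequences
print(dict(sorted(reds_per_outcome.items())))  # sequences with 0..4 red balls
```

Grouping by the number of red balls reproduces the five rows of the list: one sequence with four reds, four with three, six with two, four with one, and one with none.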
2. Probability of each outcome
It is advisable to quantify the outcome, to calculate the probability. By using “Random Variable,” we’ll quantify the result.
Random Variable:
- “X” denotes Random Variable.
- The definition of “X” depends on our problem statement. Here, we are interested in the number of red balls drawn from the bag.
X = Number of Red Balls drawn.
Now the possible outcomes in terms of ‘X’ are,
- RRRR: X = 4
- BRRR, RBRR, RRBR, RRRB: X = 3
- BBRR, BRBR, BRRB, RBBR, RBRB, RRBB: X = 2
- BBBR, BBRB, BRBB, RBBB: X = 1
- BBBB: X = 0
So, based on the value of X, we can say that if X=4, the player wins the game, and for all the remaining values of X, the player loses the game.
After conducting the experiment 75 times and recording the value of X for each set in a spreadsheet, we get the following counts:
- X = 0: 2 sets
- X = 1: 12 sets
- X = 2: 26 sets
- X = 3: 25 sets
- X = 4: 10 sets
Plotting these counts as a histogram shows how often each value of X occurred.
The definition of Probability is
Probability(P) = (Favorable Outcomes)/(Total Number of Outcomes)
Now, let’s find out the Probability for different values of X based on the above graph.
P(X=0) = 2/75 = 0.027
P(X=1) = 12/75 = 0.160
P(X=2) = 26/75 = 0.347
P(X=3) = 25/75 = 0.333
P(X=4) = 10/75 = 0.133
If we represent them in a table, it looks like this
This table is known as Probability Distribution.
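The experiment itself can be simulated instead of being carried out by hand. This is only a sketch: the helper name `play_set` and the seed are assumptions made for illustration, and because the draws are random the simulated frequencies will generally differ from the counts recorded in this article.

```python
import random
from collections import Counter

def play_set(rng, n_red=3, n_blue=2, draws=4):
    """Draw `draws` balls with replacement and return the number of red balls."""
    bag = ["R"] * n_red + ["B"] * n_blue
    return sum(rng.choice(bag) == "R" for _ in range(draws))

rng = random.Random(42)  # arbitrary seed, used only to make the run repeatable
n_sets = 75
counts = Counter(play_set(rng) for _ in range(n_sets))

# Empirical probability distribution: relative frequency of each value of X.
empirical = {x: counts.get(x, 0) / n_sets for x in range(5)}
for x, p in empirical.items():
    print(f"P(X={x}) = {p:.3f}")
```

Increasing `n_sets` makes the relative frequencies more stable from run to run.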
3. Using the Probabilities, estimate the profit/loss
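Since we now know the probabilities for X=0 to 4, let’s calculate the average number of red balls drawn by a player in one game. We start by converting the rounded probabilities back into the number of players, out of 75, with each value of X, which is why some of the counts below are not whole numbers.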
Number of players with 0 red balls = P(X=0)*75 = 2.025
Number of players with 1 red balls = P(X=1)*75 = 12
Number of players with 2 red balls = P(X=2)*75 = 26.025
Number of players with 3 red balls = P(X=3)*75 = 24.975
Number of players with 4 red balls = P(X=4)*75 = 9.975
Total number of red balls drawn = 0*2.025 + 1*12 + 2*26.025 + 3*24.975 + 4*9.975 = 178.875
So, the 75 players drew approximately 178.875 Red Balls in total.
Average number of red balls = 178.875/75 = 2.385.
In other words, we can expect a player to draw 2.385 Red Balls per game. That average value is known as the Expected Value.
Mathematically speaking, for a random variable X that can take the values x1,x2,x3,………..,xn, the expected value (EV) is given by:
EV(X)=x1∗P(X=x1)+x2∗P(X=x2)+x3∗P(X=x3)+………..+xn∗P(X=xn)
For our game, X takes the values 0 to 4, and the X=0 term contributes nothing to the sum, so
EV = 1*0.16 + 2*0.347 + 3*0.333 + 4*0.133 = 2.385.
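The same weighted sum can be written as a small function. This is a minimal sketch using the rounded experimental probabilities from this article; the function name `expected_value` is just an illustrative choice.

```python
def expected_value(distribution):
    """Sum of value * probability over a {value: probability} mapping."""
    return sum(x * p for x, p in distribution.items())

# Rounded experimental probabilities for X = number of red balls in one set.
experimental = {0: 0.027, 1: 0.160, 2: 0.347, 3: 0.333, 4: 0.133}

ev_red_balls = expected_value(experimental)
print(round(ev_red_balls, 3))
```

Keeping the distribution as a mapping from value to probability means the same function works for any discrete random variable, including the payout variable used later.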
Recall that in our game, players get $150 if X=4 and have to pay $10 for all other values of X.
So, the probability of a player getting $150 is equal to the probability of drawing four red balls.
In all the remaining cases, the player has to pay $10.
So, X can take values +150 and -10
P(X=+150) = P(4 red balls) = 0.133
P(X=-10) = P(0,1,2, or 3 red balls) = 0.027+0.160+0.347+0.333 = 0.867
Now, the Expected Value of X (where X is now the money won after playing the game once) is
EV(X) = 150*0.133 + (-10)*0.867 = +11.28
This means that the player can expect to win $11.28 per game on average in the long run, which is very good for the player, but this model will not work for the game organizers, as they are losing money.
If the Casinos want to make money, they need to ensure that the player’s Expected Value is negative. For this, the organizers have to change the payouts, for example to $100 for a win and $25 for a loss.
Now, EV(X) = 100*0.133 + (-25)*0.867 = -8.375
Now, this model is profitable for Casinos in the long run.
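To compare payout schemes quickly, the two calculations above can be wrapped in one helper. This is a sketch under the assumption that a game has exactly two money outcomes (a prize or a fixed penalty); `expected_payout` is a hypothetical name.

```python
def expected_payout(p_win, prize, penalty):
    """Player's expected winnings per game: receive `prize` with probability
    p_win, otherwise pay `penalty`."""
    return prize * p_win - penalty * (1 - p_win)

p_win = 0.133  # experimental probability of drawing four red balls

original_game = expected_payout(p_win, prize=150, penalty=10)
revised_game = expected_payout(p_win, prize=100, penalty=25)

print(f"original payouts: {original_game:+.3f}")
print(f"revised payouts:  {revised_game:+.3f}")
```

A positive result favors the player and a negative one favors the organizers, so the sign of the result is what the casino needs to check before fixing the payouts.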
So far, we’ve seen how to calculate probabilities by experimenting. In the next section, we’ll see how to calculate them without experiments.
Calculating Probabilities Theoretically
If we recall the problem statement, the bag contains 3 Red Balls and 2 Blue Balls.
The Probability of getting 1 Red Ball = (Total number of red balls)/(Total number of balls) = 3/5 = 0.6
Similarly,
The Probability of getting 1 blue ball = 2/5 = 0.4
Now we can calculate the probability of each number of red balls in one game, i.e.,
For X=0, P(4 Blue) = 0.4*0.4*0.4*0.4 = 0.0256
For X=1, P(1 Red and 3 Blue) = 0.6*0.4*0.4*0.4, but there are 4 possible arrangements of 1 Red and 3 Blue,
Finally, for X=1, P(X) = 4(0.6*0.4*0.4*0.4) = 0.1536
For X=2, P(X) = 6(0.6*0.6*0.4*0.4) = 0.3456
For X=3, P(X) = 4(0.6*0.6*0.6*0.4) = 0.3456
For X=4, P(X) = 0.6*0.6*0.6*0.6 = 0.1296
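These theoretical values can be checked by brute force: multiply the per-draw probabilities for each of the 16 sequences and add up the sequences that share the same number of red balls. The sketch below assumes independent draws with replacement, as in the game; the variable names are illustrative.

```python
from itertools import product
from math import prod

P_RED, P_BLUE = 0.6, 0.4  # 3 red and 2 blue balls, drawn with replacement
prob_of_colour = {"R": P_RED, "B": P_BLUE}

theoretical = {x: 0.0 for x in range(5)}
for sequence in product("RB", repeat=4):
    # Draws are independent, so a sequence's probability is the product
    # of the probabilities of its individual draws.
    sequence_prob = prod(prob_of_colour[c] for c in sequence)
    theoretical[sequence.count("R")] += sequence_prob

for x, p in theoretical.items():
    print(f"P(X={x}) = {p:.4f}")
```

This approach does not need the number of arrangements to be worked out in advance, since every arrangement is visited explicitly.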
Let’s see what the histograms of the Experimental and Theoretical Probability Distributions look like,
As we can see, the theoretical (calculated) values of probability are relatively close to the experimental values. The small differences exist because only a small number of experiments (75 sets) were conducted.
Binomial Probability Distribution: P(X=r)
Now, let’s try to generalize the above problem. Let’s say the probability of drawing a red ball from the bag is P. Then the probability of drawing a blue ball from the bag is 1-P.
Now, the Probability distribution will be,
For X=0, P(4 Blue) = (1-P)⁴
For X=1, P(1 Red 3 Blue) = 4*P*(1-P)³
For X=2, P(2 Red 2 Blue) = 6*P²*(1-P)²
For X=3, P(3 Red 1 Blue) = 4*P³*(1-P)
For X=4, P(4 Red) = P⁴
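If we look at the above probability values carefully, we can see a pattern: each one is the number of ways to arrange r red balls among the n draws, multiplied by the probability of any one such arrangement.
The formula looks like this,
P(X=r) = nCr * p^r * (1-p)^(n-r)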
Here,
n = Total number of trials
p = Probability of success
r = Number of successes in n trials
But, we should use the Binomial Distribution only if it follows these three conditions.
- The problem should have a fixed number of trials.
- Each trial should have only two outcomes — either a success or a failure.
- The probability of success should be the same in all the trials.
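When the three conditions hold, the binomial probability can be computed directly. The following is a minimal sketch using `math.comb` from the Python standard library for nCr; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(n, p, r):
    """Probability of exactly r successes in n independent trials, each with
    success probability p: nCr * p^r * (1-p)^(n-r)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# The ball game: n = 4 draws, success = drawing a red ball, p = 3/5.
distribution = [binomial_pmf(4, 0.6, r) for r in range(5)]
for r, prob in enumerate(distribution):
    print(f"P(X={r}) = {prob:.4f}")
```

For the ball game this reproduces the theoretical values calculated earlier, and the same function can be reused for any other fixed number of trials and success probability.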
This Binomial Probability Distribution is a very commonly observed type of probability distribution among discrete random variables.
Cumulative Probability: F(x)
The Cumulative Probability of X is denoted by F(x). It is defined as the probability that the variable is less than or equal to x.
F(x) = P(X≤x)
So, for the theoretical probability distribution we have of our game, if we calculate F(3), it will be,
F(3) = P(X≤3) = P(X=0) + P(X=1) + P(X=2) + P(X=3) = 0.8704
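Cumulative probabilities are running totals of the individual probabilities, so they can be sketched with `itertools.accumulate`. The list below holds the theoretical probabilities of the ball game for X = 0 to 4.

```python
from itertools import accumulate

# Theoretical P(X = 0), ..., P(X = 4) for the ball game.
pmf = [0.0256, 0.1536, 0.3456, 0.3456, 0.1296]

# cdf[x] holds F(x) = P(X <= x), the running total of pmf up to index x.
cdf = list(accumulate(pmf))

for x, value in enumerate(cdf):
    print(f"F({x}) = {value:.4f}")
```

The last entry, F(4), comes out as 1 (up to floating-point rounding) because X cannot exceed 4.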
The Cumulative Probability is more helpful in Continuous Probability Distributions.
Continuous Probability Distributions
So far, we have seen how probability works for discrete random variables. Now, let’s see how probability works for continuous random variables.
If a random variable can take any of the infinitely many values within a range, it is known as a Continuous Random Variable.
For example, a random variable measuring the time taken for an employee’s commute to the office is continuous because there is an infinite number of possible values it can take.
Since time is a continuous variable, the probability of the random variable taking any one exact value is 0. So, to calculate the probabilities, instead of taking particular values, we’ll take the values in terms of ranges or intervals.
Let’s say we want the probability distribution of the time employees take to commute to the office every day, and we have conducted a survey that gives the probability values shown below,
With the above Probability values, we can now find the cumulative probability value as follows,
Cumulative Probability for x = 30 is,
F(30) = P(X≤30) = P(0<X<20) + P(20<X<25) + P(25<X<30)
F(30) = 0+0.15+0.20 =0.35
So, the Cumulative Probability for the given probability values looks like,
If we plot these cumulative values in a chart, the result is known as the Cumulative Distribution Function (CDF) chart.
There are two points to remember in CDF Charts.
- These are monotonically non-decreasing functions.
- The highest value should always be one at the Y-Axis.
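The cumulative values behind such a chart, and the two properties above, can be sketched as follows. Only the first three interval probabilities (0, 0.15 and 0.20) come from the worked F(30) calculation; the remaining intervals and their probabilities are hypothetical placeholders, chosen only so that the total is 1. The result could be passed to any plotting library to draw the chart.

```python
from itertools import accumulate

# (upper bound of the interval in minutes, probability of that interval)
interval_probs = [
    (20, 0.00),  # P(0 < X < 20), from the F(30) calculation
    (25, 0.15),  # P(20 < X < 25), from the F(30) calculation
    (30, 0.20),  # P(25 < X < 30), from the F(30) calculation
    (35, 0.30),  # hypothetical value, for illustration only
    (40, 0.20),  # hypothetical value, for illustration only
    (45, 0.15),  # hypothetical value, for illustration only
]

upper_bounds = [bound for bound, _ in interval_probs]
cdf_values = list(accumulate(prob for _, prob in interval_probs))
cdf = dict(zip(upper_bounds, cdf_values))

# Property 1: the CDF never decreases as x grows.
is_non_decreasing = all(a <= b for a, b in zip(cdf_values, cdf_values[1:]))
# Property 2: the highest CDF value is 1 (allowing for float rounding).
reaches_one = abs(cdf_values[-1] - 1.0) < 1e-9

print(cdf, is_non_decreasing, reaches_one)
```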
We can also plot a chart for Probability in terms of intervals.
This is what a Probability Density chart looks like,
We can observe from the above graph that the area under the density curve over an interval is equal to the probability of that interval, i.e., the difference between the Cumulative Probabilities at its two ends.
If the Probability Density is the same for all possible values of a continuous random variable, the distribution is known as a Uniform Distribution.
Also, in real-life scenarios, PDFs (Probability Density Functions) are more commonly used than CDFs because patterns are much easier to see in a PDF.
Normal Distributions
One of the most commonly used PDFs is the Normal Distribution, also called the Bell Curve or Gaussian Distribution.
The Normal Distribution graph looks like this,
As the shape suggests, most of the values lie around the center of this distribution. The distribution is also symmetrical around the middle. The Normal Distribution typically appears in naturally occurring phenomena.
Normal Distribution is also useful in understanding some advanced concepts of Data Analysis, such as the Central Limit Theorem(CLT). We’ll learn about CLT in the next article.
Let’s look into Normal Distribution in detail.
- The distribution is symmetrical about its center, which is the Mean(μ).
- In a Normal Distribution, the values of the Mean, Median, and Mode are equal. That means the distribution is also symmetrical about the Median and Mode.
The Normal Distribution follows a 1–2–3 Rule, made up of the following three conditions:
- The probability of values between “μ-σ” and “μ+σ” is around 68%.
- The probability of values between “μ-2σ” and “μ+2σ” is around 95%.
- The probability of values between “μ-3σ” and “μ+3σ” is about 99.7%.
Here “μ” is Mean Value, and “σ” is Standard Deviation.
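The three percentages can be checked numerically. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (available from Python 3.8); the helper name `prob_within` is an illustrative choice. Because only the distance from the mean in units of σ matters, the standard normal (μ = 0, σ = 1) is used.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def prob_within(k):
    """Probability that a normal variable lies within k standard deviations
    of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
```

The computed values are slightly more precise than the rounded 68%, 95% and 99.7% figures quoted in the rule.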
What this means is that the tails of the distribution carry very little probability. Almost all of the values lie between “μ-3σ” and “μ+3σ” in a Normal Distribution curve.
Let’s do one example. If μ = 30 and σ = 5, the probability for values between 20 and 40 will be,
P(20<X<40) = P(μ-2σ < X < μ+2σ) = 95% or 0.95
Now, the probability for values between 20 and 45 will be,
P(20<X<45) = P(μ-2σ < X < μ+3σ)
We know that the Normal Distribution is symmetrical about the Mean(μ), i.e., 50% of the values are ≤ μ and 50% of the values are > μ.
That means, if P(μ-3σ < X < μ+3σ) = 99.7%, then P(μ < X < μ+3σ) = 49.85%, and if P(μ-2σ < X < μ+2σ) = 95%, then P(μ-2σ < X < μ) = 47.5%.
Finally,
P(20<X<45) = P(μ-2σ < X < μ+3σ) = 47.5+49.85 = 97.35%
This is how we calculate the probability values if the distribution follows the Normal Curve.
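The probability of the interval from μ-2σ to μ+3σ can also be computed from the normal CDF and compared with the rule-of-thumb result. This is a sketch using `statistics.NormalDist` with μ = 30 and σ = 5 as in the example; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 30, 5
normal = NormalDist(mu=mu, sigma=sigma)

lower = mu - 2 * sigma  # two standard deviations below the mean
upper = mu + 3 * sigma  # three standard deviations above the mean

# Probability of an interval = CDF at the upper end minus CDF at the lower end.
exact = normal.cdf(upper) - normal.cdf(lower)

# Rule-of-thumb estimate: half of the 95% band plus half of the 99.7% band.
rule_of_thumb = 0.95 / 2 + 0.997 / 2

print(f"from the CDF:      {exact:.4f}")
print(f"from the 1-2-3 rule: {rule_of_thumb:.4f}")
```

The two numbers differ slightly in the third decimal place because the 95% figure in the rule is itself a rounded value.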
Standard Normal Distribution
As we have seen, it does not matter what the values of μ and σ are; all we are interested in is how far X is from the Mean(μ) in terms of Standard Deviations(σ).
Let’s say, μ = 30, and σ = 5, and X=38.25.
We can say that X is 8.25 units away from μ, i.e., X-μ = 38.25-30 = 8.25.
In Standard Deviation terms, 8.25/5 = 1.65, so we can say that X is 1.65σ away from Mean(μ).
This value of 1.65 is called the Z-Score of our Random Variable. The Z score can be calculated by:
Z = (X-μ)/σ
This variable Z is called “Standardized Normal Variable”
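The Z formula translates directly into code. The sketch below uses a value that lies 8.25 units above a mean of 30 with a standard deviation of 5; the function name `z_score` is an illustrative choice.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations by which x lies above (+) or below (-) mu."""
    return (x - mu) / sigma

mu, sigma = 30, 5
x = mu + 8.25  # a value 8.25 units above the mean

z = z_score(x, mu, sigma)
print(z)
```

A negative result would mean the value lies below the mean by that many standard deviations.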
So, now the 1–2–3 Rule in terms of Z is,
- P(-1<Z<1) = P(μ-σ < X < μ+σ) = 68%
- P(-2<Z<2) = P(μ-2σ < X < μ+2σ) = 95%
- P(-3<Z<3) = P(μ-3σ < X < μ+3σ) = 99.7%
The Standard Normal Distribution(Z) graph looks like this,
As we can see, the Standardized Normal Variable(Z) is a much more informative variable than the Normal Distribution Variable(X).
We can find all the probability values for Z from this table here.
The sample of Z Score Table looks like this,
Let’s take an example: what is the probability that a random variable in a normal distribution lies within 1.65 standard deviations of the mean?
That means, we have to find the value of P(μ-1.65σ < X < μ+1.65σ).
In terms of Z, we need to find the value,
P(-1.65<Z<1.65) = P(Z≤1.65) - P(Z≤-1.65)
If we look into the table, we can find the values of,
P(Z≤1.65) = 0.9505
P(Z≤-1.65) = 0.0495
Now, the Probability = 0.9505 - 0.0495 = 0.901 ≈ 90%
So, about 90% of the values lie within 1.65 standard deviations of the mean.
That is how we calculate Z values from the Table and find out the probabilities.
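The table lookup can be cross-checked in code, since a Z table lists values of the standard normal CDF. This is a minimal sketch using `statistics.NormalDist` from the Python standard library in place of the printed table.

```python
from statistics import NormalDist

z = 1.65
standard_normal = NormalDist()  # defaults to mu = 0, sigma = 1

upper = standard_normal.cdf(z)   # cumulative probability up to +1.65
lower = standard_normal.cdf(-z)  # cumulative probability up to -1.65
prob_within = upper - lower      # probability of lying between the two

print(f"{upper:.4f} - {lower:.4f} = {prob_within:.4f}")
```

Computing the CDF directly is handy for Z values that fall between the rows and columns of a printed table.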
These are the topics in Inferential Statistics, which every Data Scientist should have a basic knowledge of. Using this normal distribution and standard normal distribution concepts, we’ll learn more about Central Limit Theorem and Hypothesis Testing, which are extensively used in Data Science.
Thank you for reading and Happy Coding!!!
- Exploratory Data Analysis(EDA): Python
- Indexing in Pandas Dataframe using Python
- Seaborn: Python
- Pandas: Python
- Matplotlib: Python
- NumPy: Python
- Data Visualization and its Importance: Python
- Time Complexity and Its Importance in Python
- Inferential Statistics — An Overview: https://www.mygreatlearning.com/blog/inferential-statistics-an-overview/
- Role Of Statistics in Data Science: https://www.topcoder.com/role-of-statistics-in-data-science/
- Inferential Statistics: http://onlinestatbook.com/2/introduction/inferential.html
- What is Inferential Statistics?: https://www.statisticshowto.com/inferential-statistics/
- Descriptive Statistics: https://conjointly.com/kb/descriptive-statistics/
- Z Score Table: http://www.z-table.com/