Inferential Statistics: Data Analysis (2024)

Statistics is an essential subject in data science; it provides tools and methods for gaining deeper insights into data.

Data Scientists need a solid understanding of Statistics to perform quantitative analysis of data, and Statistics plays a significant role in building Machine Learning Algorithms. Statistics is mainly of two types:

  • Descriptive Statistics
  • Inferential Statistics

In the previous article, “Exploratory Data Analysis,” all the analysis we did was Descriptive Statistics. With Descriptive Statistics, we merely describe what is present or shown in the data. We saw how to discover patterns in given data using various approaches and visualization techniques.

With Inferential Statistics, we try to reach conclusions that extend beyond the data. Sometimes, we have to work on a large amount of data for our analysis, which may take too much time and resources. In those situations, we use Inferential Statistics.

Inferential Statistics:

Inferential Statistics makes inferences and predictions about a large data set (the population) by examining a sample drawn from it. It uses probability to reach conclusions.

The process of “inferring” insights from sample data is called “Inferential Statistics.”

A familiar real-world example of Inferential Statistics is a weather forecast predicting the amount of rainfall we will get next month.

To understand Inferential Statistics, we need basic knowledge of the following fundamental topics in Probability.

  • The Basic Definition of Probability
  • The multiplication rule of Probability
  • The addition rule of Probability
  • nCr (Combination)

We can practice the basics here,

  1. Math is Fun
  2. Mathopolis

Random Variable:

Let’s take a real-world example of Slot-Machines in Casinos. How do Casinos make sure they don’t lose money in the long run, on their slot-machines?

It’s pretty simple; they use Probability.

To understand this, let’s play a game:

  1. Take a bag that contains 3 Red balls and 2 Blue balls.
  2. Pick one ball from the bag, note its color, and put it back. Repeat this until four balls have been drawn in total. These four draws make up one set.
  3. Conduct this experiment 75 times, i.e., play 75 sets.
  4. If all four balls drawn in a set are Red, we receive $150. If not, we have to pay $10 to the dealer.

Let’s see whether the dealer wins or loses money in the long run if we play this game. We’ll approach the problem in 3 steps.

  • Find all the Possible outcomes.
  • Find the Probability of each outcome.
  • Using the Probabilities, estimate the profit/loss.
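The game itself is easy to simulate. The following is a minimal sketch, assuming Python's standard `random` module; the function names `play_set` and `run_experiment` and the seed value are illustrative choices, not part of the original experiment, so the counts it produces will differ from the ones recorded in this article.

```python
import random
from collections import Counter

def play_set(rng, reds=3, blues=2, draws=4):
    """Draw `draws` balls with replacement and return the number of Red balls."""
    bag = ["R"] * reds + ["B"] * blues
    return sum(1 for _ in range(draws) if rng.choice(bag) == "R")

def run_experiment(sets=75, seed=0):
    """Play `sets` independent sets and count how often each result occurs."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    return Counter(play_set(rng) for _ in range(sets))

counts = run_experiment()
for reds_drawn in range(5):
    print(f"{reds_drawn} Red balls: {counts[reds_drawn]} sets")
```

Changing the seed or the number of sets gives a different sample, which is exactly why experimental probabilities vary from run to run.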

1. Possible Outcomes

Let’s list all the possible outcomes we can get if we draw a ball from the bag four times.

  • RRRR
  • BRRR, RBRR, RRBR, RRRB
  • BBRR, BRBR, BRRB, RBBR, RBRB, RRBB
  • BBBR, BBRB, BRBB, RBBB
  • BBBB

There are a total of 16 outcomes possible.
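As a quick check, the outcomes can also be enumerated programmatically. This sketch uses `itertools.product` from the standard library; the variable names are illustrative.

```python
from itertools import product
from collections import Counter

# Every sequence of four draws, where each draw is Red (R) or Blue (B).
outcomes = ["".join(seq) for seq in product("RB", repeat=4)]

# Group the outcomes by how many Red balls they contain.
by_reds = Counter(outcome.count("R") for outcome in outcomes)

print(len(outcomes))  # 16
for reds in sorted(by_reds, reverse=True):
    print(f"{reds} Red: {by_reds[reds]} outcomes")
```

The group sizes 1, 4, 6, 4, 1 match the five rows of the list above.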

2. Probability of each outcome

To calculate probabilities, it helps to express each outcome as a number. We’ll use a “Random Variable” to quantify the result.

Random Variable:

  • “X” denotes Random Variable.
  • The definition of “X” depends on our problem statement. Here, we are interested in the number of red balls drawn from the bag.

X = Number of Red Balls drawn.

Now the possible outcomes in terms of X are:

  • RRRR → X = 4
  • BRRR, RBRR, RRBR, RRRB → X = 3
  • BBRR, BRBR, BRRB, RBBR, RBRB, RRBB → X = 2
  • BBBR, BBRB, BRBB, RBBB → X = 1
  • BBBB → X = 0

So, based on the value of X, we can say that the player wins the game if X = 4 and loses it for all other values of X.

After conducting the experiment 75 times and recording the value of X for each set in a spreadsheet, we get the following counts:

X (Red balls drawn) | Number of sets
0                   | 2
1                   | 12
2                   | 26
3                   | 25
4                   | 10

Plotted as a histogram, these counts show that X = 2 and X = 3 are the most frequent results.

The definition of Probability is

Probability(P) = (Favorable Outcomes)/(Total Number of Outcomes)

Now, let’s find the Probability for the different values of X, using the number of sets observed for each value out of the 75 sets played.

P(X=0) = 2/75 = 0.027
P(X=1) = 12/75 = 0.160
P(X=2) = 26/75 = 0.347
P(X=3) = 25/75 = 0.333
P(X=4) = 10/75 = 0.133
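The same calculation can be sketched in a few lines of Python. The counts are the numerators used above; the variable names are illustrative.

```python
# Number of sets (out of 75) in which each value of X was observed.
counts = {0: 2, 1: 12, 2: 26, 3: 25, 4: 10}
total = sum(counts.values())

# Experimental probability = favorable outcomes / total number of outcomes.
probabilities = {x: n / total for x, n in counts.items()}

for x, p in probabilities.items():
    print(f"P(X={x}) = {counts[x]}/{total} = {p:.3f}")
```

Because every set falls into exactly one of the five groups, the probabilities add up to 1.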

If we represent them in a table, it looks like this:

X    | 0     | 1     | 2     | 3     | 4
P(X) | 0.027 | 0.160 | 0.347 | 0.333 | 0.133

This table is known as a Probability Distribution.

3. Using the Probabilities, estimate the profit/loss

Since we now know the probabilities for X = 0 to 4, let’s calculate the total number of red balls drawn by the 75 players, treating each set as one player’s game. The fractional counts below come from using the rounded probabilities.

Number of players with 0 red balls = P(X=0)*75 = 2.025
Number of players with 1 red ball = P(X=1)*75 = 12
Number of players with 2 red balls = P(X=2)*75 = 26.025
Number of players with 3 red balls = P(X=3)*75 = 24.975
Number of players with 4 red balls = P(X=4)*75 = 9.975
Total number of red balls drawn = 0*2.025 + 1*12 + 2*26.025 + 3*24.975 + 4*9.975 = 178.875

So, the 75 players drew approximately 178.875 Red Balls in total.

Average number of red balls = 178.875/75 = 2.385.

In other words, we can expect a player to draw 2.385 Red Balls per game on average. That average value is known as the Expected Value.

Mathematically speaking, for a random variable X that can take the values x1, x2, x3, ..., xn, the expected value (EV) is given by:

EV(X) = x1*P(X=x1) + x2*P(X=x2) + x3*P(X=x3) + ... + xn*P(X=xn)

For our game, X takes the values 0 to 4, so

EV(X) = 0*0.027 + 1*0.160 + 2*0.347 + 3*0.333 + 4*0.133 = 2.385.
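The expected value formula translates directly into code. This is a minimal sketch; `expected_value` is a hypothetical helper name, and the distribution is the rounded experimental one from this article.

```python
def expected_value(distribution):
    """Sum of value * probability over every value the variable can take."""
    return sum(x * p for x, p in distribution.items())

# Experimental probability distribution of X (number of Red balls drawn).
experimental = {0: 0.027, 1: 0.160, 2: 0.347, 3: 0.333, 4: 0.133}

print(f"EV(X) = {expected_value(experimental):.3f}")
```

Note that the X = 0 term contributes nothing to the sum, which is why it is often left out when the calculation is written by hand.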

Recall that in our game, players get $150 if X=4 and have to pay $10 for all other values of X.

So, the probability of a player winning $150 is equal to the probability of drawing four red balls, and the probability of paying $10 is the probability of every other case.

Let’s now redefine X as the money a player wins by playing the game once. X can take the values +150 and -10.

P(X=+150) = P(4 red balls) = 0.133
P(X=-10) = P(0, 1, 2, or 3 red balls) = 0.027+0.160+0.347+0.333 = 0.867

Now, the Expected Value of X is

EV(X) = 150*0.133 + (-10)*0.867 = +11.28

This means that the player can expect to win $11.28 per game on average in the long run, which is very good for the player, but this model will not work for the game organizers, as they would lose money.

If the Casino wants to make money, it needs to ensure that the player’s Expected Value is negative. For this, the organizers have to change the payouts, for example to $100 for a win and a $25 payment for a loss.

Now, EV(X) = 100*0.133 + (-25)*0.867 = -8.375

Now, this model is profitable for the Casino in the long run.
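To compare payout schemes quickly, the two calculations can be wrapped in a small function. This is only a sketch: `expected_payoff` and its parameter names are hypothetical, and the winning probability is the experimental value 0.133.

```python
def expected_payoff(p_win, prize, penalty):
    """Player's expected winnings per game: win `prize` or pay `penalty`."""
    return prize * p_win - penalty * (1 - p_win)

p_win = 0.133  # experimental probability of drawing four Red balls

original = expected_payoff(p_win, prize=150, penalty=10)
revised = expected_payoff(p_win, prize=100, penalty=25)

print(f"$150 / $10 scheme: {original:+.3f} per game")
print(f"$100 / $25 scheme: {revised:+.3f} per game")
```

A positive result favors the player and a negative result favors the organizers, so a helper like this makes it easy to try other prize and penalty combinations.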

Till now, we’ve seen how to calculate the probability by experimenting. In the next section, we’ll see how to calculate the probability without experiments.

Calculating Probabilities Theoretically

If we recall the problem statement, the bag contains 3 Red Balls and 2 Blue Balls.

The Probability of drawing a Red Ball in a single draw = (Total number of red balls)/(Total number of balls) = 3/5 = 0.6

Similarly,

The Probability of drawing a Blue Ball in a single draw = 2/5 = 0.4

Now let’s calculate the probability of each value of X in one game.

For X=0, P(4 Blue) = 0.4*0.4*0.4*0.4 = 0.0256

For X=1, P(1 Red and 3 Blue) = 0.6*0.4*0.4*0.4 for one particular order, but there are 4 possible orders of 1 Red and 3 Blue (RBBB, BRBB, BBRB, BBBR).
So, for X=1, P(X) = 4(0.6*0.4*0.4*0.4) = 0.1536

For X=2, P(X) = 6(0.6*0.6*0.4*0.4) = 0.3456
For X=3, P(X) = 4(0.6*0.6*0.6*0.4) = 0.3456
For X=4, P(X) = 0.6*0.6*0.6*0.6 = 0.1296
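These theoretical values can be double-checked by brute force: multiply the per-draw probabilities for each of the 16 sequences and add up the sequences that share the same number of Red balls. The sketch below assumes independent draws, which holds here because each ball is put back before the next draw; the names are illustrative.

```python
from itertools import product
from collections import defaultdict

P_DRAW = {"R": 0.6, "B": 0.4}  # probability of each color in a single draw

distribution = defaultdict(float)
for seq in product("RB", repeat=4):
    prob = 1.0
    for ball in seq:
        prob *= P_DRAW[ball]  # multiplication rule for independent draws
    distribution[seq.count("R")] += prob  # addition rule across sequences

for x in range(5):
    print(f"P(X={x}) = {distribution[x]:.4f}")
```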

Let’s compare the Experimental and Theoretical Probability Distributions:

X | Experimental P(X) | Theoretical P(X)
0 | 0.027             | 0.0256
1 | 0.160             | 0.1536
2 | 0.347             | 0.3456
3 | 0.333             | 0.3456
4 | 0.133             | 0.1296

As we can see, the theoretical (calculated) values of probability are relatively close to the experimental values. The small differences exist because of the low number of experiments conducted (75 sets).

Binomial Probability Distribution: P(X=r)

Now, let’s try to generalize the above problem some more. Let’s say the probability of drawing a red ball from the bag = p. Then the probability of drawing a blue ball from the bag = 1-p.

Now, the Probability distribution will be,

For X=0, P(4 Blue) = (1-p)⁴
For X=1, P(1 Red 3 Blue) = 4*p*(1-p)³
For X=2, P(2 Red 2 Blue) = 6*p²*(1-p)²
For X=3, P(3 Red 1 Blue) = 4*p³*(1-p)
For X=4, P(4 Red) = p⁴

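If we observe the above probability values carefully, we can see a pattern: each value is the number of ways to choose which draws are Red, multiplied by the probability of Red once for every Red draw and by the probability of Blue once for every Blue draw.

The general formula is,

P(X = r) = nCr * p^r * (1-p)^(n-r)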

Here,

n = Total number of trials
p = Probability of success
r = Number of successes after n trials

However, we should use the Binomial Distribution only if the problem satisfies these three conditions.

  1. The problem should have a fixed number of trials.
  2. Each trial should have only two outcomes: either a success or a failure.
  3. The probability of success should be the same in all the trials, which also means the trials must not influence one another.

This Binomial Probability Distribution is a very commonly observed type of probability distribution among discrete random variables.
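As a sketch, the binomial probability can be computed with `math.comb` from the Python standard library (available from Python 3.8), assuming the standard form nCr * p^r * (1-p)^(n-r). The function name `binomial_pmf` is a hypothetical choice.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# The ball game: n = 4 draws, success = drawing a Red ball, p = 0.6.
for r in range(5):
    print(f"P(X={r}) = {binomial_pmf(r, n=4, p=0.6):.4f}")
```

For n = 4 and p = 0.6 this gives the same theoretical values that were calculated by hand for the game.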

Cumulative Probability: F(x)

The Cumulative Probability of X is denoted by F(x). It is defined as the probability that the variable takes a value less than or equal to x.

F(x) = P(X≤x)

So, for the theoretical probability distribution we have of our game, if we calculate F(3), it will be,

F(3) = P(X≤3) = P(X=0) + P(X=1) + P(X=2) + P(X=3) = 0.8704
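A running total of the individual probabilities gives every cumulative value at once. This sketch uses `itertools.accumulate` on the theoretical distribution of the game; the variable names are illustrative.

```python
from itertools import accumulate

# Theoretical P(X=0), P(X=1), ..., P(X=4) for the ball game.
pmf = [0.0256, 0.1536, 0.3456, 0.3456, 0.1296]

# cdf[x] is F(x) = P(X <= x), the running total up to and including x.
cdf = list(accumulate(pmf))

for x, value in enumerate(cdf):
    print(f"F({x}) = {value:.4f}")
```

The last value, F(4), comes out as 1 (up to floating-point rounding) because X can never exceed 4.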

The Cumulative Probability is more helpful in Continuous Probability Distributions.

Continuous Probability Distributions

So far, we have seen how probability works for discrete random variables. Now, let’s see how probability works for continuous random variables.

If a random variable can take any value within a range, so that there are infinitely many possible values, it is known as a Continuous Random Variable.

For example, a random variable measuring the time an employee takes to commute to the office is continuous, because the commute time can be any value within a range.

Since time is a continuous variable, the probability that the random variable takes one exact value is 0. So, to calculate probabilities, instead of taking particular values, we take ranges or intervals of values.

Let’s say we want to describe the time employees take to commute to the office every day, and a survey gives us the probability of the commute time falling in each interval, for example P(0<X<20) = 0, P(20<X<25) = 0.15, and P(25<X<30) = 0.20.

With these probability values, we can find the cumulative probability as follows.

Cumulative Probability for x = 30 is,
F(30) = P(X≤30) = P(0<X<20) + P(20<X<25) + P(25<X<30)
F(30) = 0 + 0.15 + 0.20 = 0.35
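The same sum can be sketched in code. Only the three interval probabilities used in the calculation above are included; the helper name `cumulative_probability` is hypothetical, and it is only meaningful when x is one of the interval boundaries.

```python
# (lower, upper) commute-time interval -> probability from the survey.
intervals = [((0, 20), 0.0), ((20, 25), 0.15), ((25, 30), 0.20)]

def cumulative_probability(x, intervals):
    """F(x): add up the probabilities of all intervals ending at or before x."""
    return sum(p for (_, upper), p in intervals if upper <= x)

print(f"F(30) = {cumulative_probability(30, intervals):.2f}")
```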

Repeating this calculation at every interval boundary gives the Cumulative Probability for each value of x.

If we plot these Cumulative Values against x in a chart (for example, using Python), the result is known as the Cumulative Distribution Function (CDF) chart.

There are two points to remember in CDF Charts.

  1. These are monotonically non-decreasing functions.
  2. The highest value on the Y-Axis is always 1, since the cumulative probability over all possible values is 1.

We can also plot the probability itself over each interval. Such a chart is called a Probability Density Function (PDF) chart.

In a PDF chart, the area under the curve over an interval is equal to the probability that the variable falls in that interval, and the area to the left of x is equal to the Cumulative Probability F(x).

If the Probability Density is the same for all the possible values of a continuous random variable, it is known as a Uniform Distribution.

Also, in real-life scenarios, PDFs are more commonly used because patterns are much easier to see in a PDF than in a CDF.

Normal Distributions

One of the most commonly used PDFs is the Normal Distribution, also called the Bell Curve or Gaussian Distribution.

The Normal Distribution graph is a symmetric, bell-shaped curve with a single peak at the center.

As the shape suggests, most of the values lie around the center in this distribution, and the distribution is symmetrical around the middle. Many naturally occurring phenomena approximately follow a Normal Distribution.

Normal Distribution is also useful in understanding some advanced concepts of Data Analysis, such as the Central Limit Theorem(CLT). We’ll learn about CLT in the next article.

Let’s look into Normal Distribution in detail.

  1. The distribution is symmetrical around its center, which is the Mean(μ).
  2. In a Normal Distribution, the values of Mean, Median, and Mode are equal, so the Median and Mode also lie at the center of symmetry.

There is a 1–2–3 Rule of Normal Distribution (also known as the 68–95–99.7 rule), which states the following:

  1. The probability of values between “μ-σ” and “μ+σ” is around 68%.
  2. The probability of values between “μ-2σ” and “μ+2σ” is around 95%.
  3. The probability of values between “μ-3σ” and “μ+3σ” is about 99.7%.

Here “μ” is Mean Value, and “σ” is Standard Deviation.
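These percentages can be verified numerically. For a Normal Distribution, the probability of lying within k standard deviations of the mean equals erf(k/√2), where erf is the error function available as `math.erf` in Python. The helper name `prob_within` below is a hypothetical choice.

```python
from math import erf, sqrt

def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma) for any Normal Distribution."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k) * 100:.2f}%")
```

The exact values are slightly different from the rounded figures in the rule (for example, about 95.45% rather than 95% for two standard deviations), which is why the rule says "around".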

What this means is that the tails of the distribution carry very little probability: about 99.7% of the values lie between “μ-3σ” and “μ+3σ” in a Normal Distribution curve.

Let’s do one example. If μ = 30 and σ = 5, the probability for values between 20 and 40 will be,

P(20<X<40) = P(μ-2σ < X < μ+2σ) = 95% or 0.95

Now, the probability for values between 20 and 45 will be,

P(20<X<45) = P(μ-2σ < X < μ+3σ)

Because the Normal Distribution is symmetrical around the Mean(μ), 50% of the values are ≤ μ and 50% of the values are > μ, and each interval centered on μ splits into two equal halves.

That means, if P(μ-3σ < X < μ+3σ) = 99.7%, then P(μ < X < μ+3σ) = 49.85%, and if P(μ-2σ < X < μ+2σ) = 95%, then P(μ-2σ < X < μ) = 47.5%.

Finally,

P(20<X<45) = P(μ-2σ < X < μ) + P(μ < X < μ+3σ) = 47.5 + 49.85 = 97.35%

This is how we calculate approximate probability values if the distribution follows the Normal Curve.
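For comparison, the probability between μ-2σ and μ+3σ can also be computed directly from the normal CDF. This sketch assumes `statistics.NormalDist` from the Python standard library (Python 3.8+) with μ = 30 and σ = 5; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 30, 5
dist = NormalDist(mu=mu, sigma=sigma)

lower, upper = mu - 2 * sigma, mu + 3 * sigma

# Exact probability: difference of the CDF at the two ends of the interval.
exact = dist.cdf(upper) - dist.cdf(lower)

# Rule-of-thumb estimate: half of 95% plus half of 99.7%.
rule_of_thumb = 0.95 / 2 + 0.997 / 2

print(f"exact:         {exact:.4f}")
print(f"rule of thumb: {rule_of_thumb:.4f}")
```

The two results differ slightly in the third decimal place because the 95% figure in the rule is itself a rounded value.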

Standard Normal Distribution

As we have seen, the actual values of μ and σ do not matter; all we need to know is how far X is from the Mean(μ) in terms of the Standard Deviation(σ).

Let’s say μ = 30, σ = 5, and X = 38.25.

We can say that X is 8.25 units away from μ, i.e., 38.25 - 30 = 8.25.

In Standard Deviation terms, X is 8.25/5 = 1.65σ away from the Mean(μ).

This value of 1.65 is called the Z-score of our Random Variable. The Z-score can be calculated by:

Z = (X-μ)/σ

This variable Z is called the “Standardized Normal Variable.”
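The Z-score formula is a one-line function. In this sketch, `z_score` is a hypothetical helper name, and the example value sits 8.25 units above a mean of 30 with a standard deviation of 5.

```python
def z_score(x, mu, sigma):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mu) / sigma

mu, sigma = 30, 5
x = mu + 8.25  # a value 8.25 units above the mean

print(f"Z = {z_score(x, mu, sigma):.2f}")
```

A negative Z-score simply means the value lies below the mean.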

So, now the 1–2–3 Rule in terms of Z is,

  1. P(-1<Z<1) = P(μ-σ < X < μ+σ) = 68%
  2. P(-2<Z<2) = P(μ-2σ < X < μ+2σ) = 95%
  3. P(-3<Z<3) = P(μ-3σ < X < μ+3σ) = 99.7%

The Standard Normal Distribution(Z) graph is the same bell curve, centered at Z = 0 with a Standard Deviation of 1.

Because Z measures the distance from the Mean in units of Standard Deviation, it lets us compare values from Normal Distributions with different μ and σ on a common scale.

We can look up the cumulative probability P(Z≤z) for any value of z in a standard Z-score table.

Let’s take an example: what is the probability that a random variable in a normal distribution lies within 1.65 standard deviations of the mean?

That means we have to find the value of P(μ-1.65σ < X < μ+1.65σ).

In terms of Z, we need to find the value,

P(-1.65<Z<1.65) = P(Z≤1.65) - P(Z≤-1.65)

If we look into the table, we can find the values of,

P(Z≤1.65) = 0.9505
P(Z≤-1.65) = 0.0495

Now, the Probability = 0.9505 - 0.0495 = 0.901 ≈ 90%

So, about 90% of the values lie within 1.65 standard deviations of the mean.

That is how we look up Z values in the table and find the probabilities.
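Instead of a printed table, the same lookup can be sketched with `statistics.NormalDist` from the Python standard library (Python 3.8+), whose default is the Standard Normal Distribution. The variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # mean 0 and standard deviation 1 by default

upper = z.cdf(1.65)   # cumulative probability up to Z = 1.65
lower = z.cdf(-1.65)  # cumulative probability up to Z = -1.65

print(f"upper tail lookup: {upper:.4f}")
print(f"lower tail lookup: {lower:.4f}")
print(f"probability within 1.65 sigma: {upper - lower:.3f}")
```

By symmetry, the two lookups always add up to 1, so a table that lists only positive Z values is enough in practice.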

These are the topics in Inferential Statistics, which every Data Scientist should have a basic knowledge of. Using this normal distribution and standard normal distribution concepts, we’ll learn more about Central Limit Theorem and Hypothesis Testing, which are extensively used in Data Science.

Thank you for reading and Happy Coding!!!

Author: Greg O'Connell