Season 03 - Introduction to Statistics
This is all great for numerical data, but how do we compare groups when the data is categorical?
Our earlier hypothesis questions were like: “Is the average height different between groups?”
Those questions dealt with continuous or discrete numbers.
With categorical data, we work with frequencies and proportions instead.
You can’t take the mean of “Red, Blue, Green.” But you can count them.
Test 1 — Goodness of Fit
“Does the distribution of my data match an expected pattern?”
Example: Is a die fair? Do customers prefer all flavours equally?
You have one categorical variable. You compare observed counts to expected counts.
Test 2 — Test of Independence
“Are two categorical variables related to each other?”
Example: Is gender related to product preference? Is smoking related to disease status?
You have two categorical variables. You test whether knowing one tells you anything about the other.
The chi-square distribution is right-skewed and takes only non-negative values. As df increases, it shifts right and becomes more symmetric. Our test statistic must exceed the critical value to reject H₀.
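To see where a critical value comes from, the right-tail probability of the chi-square distribution can be computed directly. The sketch below uses only the standard library. `chi2_sf` is a hypothetical helper name, and it assumes a positive integer df so that the standard recurrence for the regularized upper incomplete gamma function applies. In practice a library routine such as SciPy's `scipy.stats.chi2.sf` would be the usual choice.

```python
import math

def chi2_sf(x, df):
    """Right-tail probability P(X >= x) for a chi-square variable with integer df.

    Hypothetical helper. With z = x / 2 it applies the recurrence
    Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1),
    starting from Q(1, z) = exp(-z) for even df
    or Q(1/2, z) = erfc(sqrt(z)) for odd df.
    """
    if df < 1 or x < 0:
        raise ValueError("df must be a positive integer and x must be non-negative")
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Step a up by 1 until it reaches df / 2.
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

# The tail area beyond a 5% critical value should be close to 0.05.
for df, critical in [(5, 11.07), (2, 5.99)]:
    print(f"df={df}: P(X >= {critical}) = {chi2_sf(critical, df):.4f}")
```

The two (df, critical value) pairs are the standard α = 0.05 table entries; a statistic that lands beyond them has a tail probability below 0.05.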
If the null hypothesis were true, what counts would we expect to see?
Then we compare those expected counts to what we actually observed.
If the gap is small → data is consistent with H₀
If the gap is large → something is going on
\[\chi^2 = \sum \frac{(O - E)^2}{E}\]
Each cell contributes to the total.
Large \(\chi^2\) → observed and expected are far apart → evidence against H₀
Small \(\chi^2\) → data fits the expected pattern → no evidence against H₀
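The formula above can be sketched in a few lines of Python. The function name `chi_square_statistic` and the coin-flip counts are illustrative assumptions, not part of the original material. It returns the per-cell contributions as well as the total, so you can see which cells drive the statistic.

```python
def chi_square_statistic(observed, expected):
    """Return (total, per-cell contributions) for chi^2 = sum((O - E)^2 / E)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("every expected count must be positive")
    contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    return sum(contributions), contributions

# Illustrative counts only (not from the text): 100 coin flips, 50/50 expected.
total, parts = chi_square_statistic([55, 45], [50, 50])
print(parts, total)
```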
Setup: You roll a die 60 times. If the die is fair, you expect each face 10 times.
| Face | Observed (O) | Expected (E) | (O−E)²/E |
|---|---|---|---|
| 1 | 8 | 10 | 0.40 |
| 2 | 12 | 10 | 0.40 |
| 3 | 7 | 10 | 0.90 |
| 4 | 14 | 10 | 1.60 |
| 5 | 11 | 10 | 0.10 |
| 6 | 8 | 10 | 0.40 |
\[\chi^2 = 0.40+0.40+0.90+1.60+0.10+0.40 = 3.80\]
df = k − 1 = 5 | Critical value at α=0.05 → 11.07
3.80 < 11.07 → Fail to Reject H₀ ✅
No evidence the die is unfair.
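The die example can be reproduced with a short script. This is a minimal sketch using only built-in Python. The variable names are assumptions, and the critical value is simply the 11.07 quoted above rather than something the code derives.

```python
observed = [8, 12, 7, 14, 11, 8]   # counts for faces 1..6 from the table
n_rolls = sum(observed)            # 60 rolls in total
k = len(observed)                  # 6 categories
expected = [n_rolls / k] * k       # fair die: 10 per face

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = k - 1
critical_value = 11.07             # alpha = 0.05, df = 5 (value quoted above)

# Reject H0 only if the statistic exceeds the critical value.
reject_h0 = chi_sq > critical_value
print(f"chi-square = {chi_sq:.2f}, df = {df}, reject H0: {reject_h0}")
```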
The Question: Is there a relationship between two categorical variables?
Setup: A survey of 200 customers asks: “Do you prefer Product A, B, or C?” — split by gender.
| | Product A | Product B | Product C | Row Total |
|---|---|---|---|---|
| Male | 40 | 35 | 25 | 100 |
| Female | 30 | 45 | 25 | 100 |
| Col Total | 70 | 80 | 50 | 200 |
H₀: Gender and product preference are independent
H₁: Gender and product preference are associated
For each cell, the expected count under independence is:
\[E_{ij} = \frac{\text{Row Total}_i \times \text{Column Total}_j}{\text{Grand Total}}\]
| | Product A | Product B | Product C |
|---|---|---|---|
| Male | (100×70)/200 = 35 | (100×80)/200 = 40 | (100×50)/200 = 25 |
| Female | (100×70)/200 = 35 | (100×80)/200 = 40 | (100×50)/200 = 25 |
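The expected counts in the table above can be reproduced from the observed table alone. The sketch below assumes the table is stored as a list of rows. `expected_counts` is a hypothetical helper name.

```python
def expected_counts(table):
    """Expected cell counts under independence for a list-of-rows table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # E_ij = (row total i * column total j) / grand total
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

observed = [
    [40, 35, 25],  # Male:   Product A, B, C
    [30, 45, 25],  # Female: Product A, B, C
]
for row in expected_counts(observed):
    print(row)
```

Both rows get the same expected counts here because the two row totals are equal (100 each).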
\[\chi^2 = \frac{(40-35)^2}{35}+\frac{(35-40)^2}{40}+\frac{(25-25)^2}{25}+\frac{(30-35)^2}{35}+\frac{(45-40)^2}{40}+\frac{(25-25)^2}{25}\]
\[\approx 0.714 + 0.625 + 0 + 0.714 + 0.625 + 0 = \mathbf{2.678}\]
df = (rows−1)(cols−1) = 1×2 = 2 | Critical value at α=0.05 → 5.99
2.678 < 5.99 → Fail to Reject H₀ ✅ No significant association.
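The whole independence test can be sketched end to end in plain Python. The variable names are assumptions. The p-value line relies on a special case: for df = 2 the chi-square right-tail probability is exactly exp(−χ²/2), so this shortcut does not carry over to other table sizes. Without rounding each term, the statistic comes out at about 2.679, which matches the hand calculation above up to rounding.

```python
import math

observed = [
    [40, 35, 25],  # Male:   Product A, B, C
    [30, 45, 25],  # Female: Product A, B, C
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (O - E)^2 / E over every cell, with E from the independence formula.
chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        chi_sq += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed form valid for df = 2 only: P(X >= x) = exp(-x / 2).
if df != 2:
    raise ValueError("the exp(-x / 2) shortcut applies only when df = 2")
p_value = math.exp(-chi_sq / 2)

print(f"chi-square = {chi_sq:.3f}, df = {df}, p = {p_value:.3f}")
```

A p-value well above 0.05 leads to the same decision as comparing the statistic with the critical value 5.99: fail to reject H₀.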
As the association strengthens from weak to moderate to strong, the gap between the groups widens: χ² grows and the p-value shrinks.
Marketing: Do customers buy all products equally? Or is one product dominating?
Genetics: Do offspring ratios match Mendel’s predicted 3:1 ratio?
Operations: Are defects distributed equally across shifts? Or is one shift producing more errors?
Healthcare: Is smoking status associated with lung disease?
HR analytics: Is employee attrition related to department?
Retail: Is purchase behaviour related to customer age group?
A/B Testing: Is click-through rate independent of the ad version shown?