Part 4: Probability and Statistics

Part of the Mathematics for Programming 101 Series

The A/B Test That Almost Cost Us

We ran an A/B test on a new checkout flow. The test group converted about 8.6% better than control, in relative terms.

Control: 152/1000 = 15.2% conversion
Test: 165/1000 = 16.5% conversion
Improvement: 8.6%

Product wanted to ship immediately. "The numbers don't lie," they said.

I ran a statistical significance test. P-value: 0.23

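Translation: if the new flow made no real difference, we would still see a gap at least this large about 23% of the time. That is far above the usual 0.05 cutoff, so the result is not significant.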

We didn't ship. Ran the test for two more weeks. Final result: 0.3% difference, not statistically significant.

That's when I learned: Statistics isn't about collecting numbers—it's about making correct decisions under uncertainty.

Probability: Quantifying Uncertainty

The Basics

Probability measures how likely something is to happen, on a scale from 0 (impossible) to 1 (certain).

import numpy as np
from collections import Counter

# Simulate rolling a fair die
def roll_die(n_rolls=1000):
    """Simulate rolling a six-sided die"""
    return np.random.randint(1, 7, size=n_rolls)

rolls = roll_die(10000)
counts = Counter(rolls.tolist())  # outcome -> number of times it came up

print("Empirical probability of each outcome:")
for outcome in sorted(counts):
    prob = counts[outcome] / len(rolls)  # relative frequency
    print(f"{outcome}: {prob:.3f} (expected: 0.167)")

# Law of large numbers: more rolls → closer to theoretical probability
for n in [100, 1000, 10000, 100000]:
    rolls = roll_die(n)
    prob_six = np.sum(rolls == 6) / n
    print(f"{n:6d} rolls: P(6) = {prob_six:.4f}")

Probability Distributions

Distributions describe how probabilities are spread across possible values.

Bernoulli Distribution (yes/no events)

def bernoulli_trial(p, n_trials=1000):
    """
    Bernoulli: Single trial with success probability p
    Examples: coin flip, user clicks ad, system failure
    """
    return np.random.random(n_trials) < p

# Simulate user clicking an ad (5% CTR)
clicks = bernoulli_trial(p=0.05, n_trials=10000)
ctr = np.mean(clicks)

print(f"Simulated CTR: {ctr:.3f} (expected: 0.05)")

Binomial Distribution (multiple yes/no trials)

def binomial_probability(n, k, p):
    """
    Probability of exactly k successes in n trials
    Example: probability of exactly 3 heads in 10 coin flips
    """
    from math import comb
    return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

# What's the probability of exactly 7 out of 10 users clicking?
p_7_of_10 = binomial_probability(n=10, k=7, p=0.05)
# Scientific notation: the value is far too small to show with six decimal places
print(f"P(exactly 7 clicks out of 10): {p_7_of_10:.2e}")

# Simulate: Out of 1000 emails, how many clicks do we expect?
emails_sent = 1000
ctr = 0.05
clicks_per_thousand = np.random.binomial(n=emails_sent, p=ctr, size=10000)

print(f"\nExpected clicks: {emails_sent * ctr:.0f}")
print(f"Simulated mean: {np.mean(clicks_per_thousand):.1f}")
print(f"Standard deviation: {np.std(clicks_per_thousand):.1f}")
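The simulated mean and standard deviation above can be checked against the binomial closed forms: mean = n·p and standard deviation = sqrt(n·p·(1−p)). A minimal sketch, assuming the same 1000 emails and 5% CTR; the seed and variable names are illustrative choices, not part of the original example.

```python
import math
import numpy as np

n, p = 1000, 0.05  # emails sent and click-through rate, as in the example above

# Closed-form moments of a Binomial(n, p) variable
theoretical_mean = n * p
theoretical_std = math.sqrt(n * p * (1 - p))

# Simulate 10,000 campaigns (seeded so the run is repeatable)
rng = np.random.default_rng(42)
simulated = rng.binomial(n=n, p=p, size=10_000)

print(f"Mean: theory {theoretical_mean:.1f}, simulated {simulated.mean():.1f}")
print(f"Std:  theory {theoretical_std:.2f}, simulated {simulated.std():.2f}")
```

If the two columns disagree by more than a small sampling wobble, the simulation or the formula is wrong, which makes this a cheap sanity check.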

Normal Distribution (bell curve)

# Most important distribution in statistics
def normal_distribution_demo():
    """
    Normal distribution: N(μ, σ²)
    μ = mean (center)
    σ = standard deviation (spread)
    
    Examples: heights, measurement errors, aggregated data
    """
    # Generate normally distributed data
    mean = 100
    std_dev = 15
    data = np.random.normal(mean, std_dev, size=10000)
    
    print(f"Mean: {np.mean(data):.2f} (expected: {mean})")
    print(f"Std Dev: {np.std(data):.2f} (expected: {std_dev})")
    
    # 68-95-99.7 rule (empirical rule)
    within_1_std = np.sum(np.abs(data - mean) <= std_dev) / len(data)
    within_2_std = np.sum(np.abs(data - mean) <= 2*std_dev) / len(data)
    within_3_std = np.sum(np.abs(data - mean) <= 3*std_dev) / len(data)
    
    print(f"\nWithin 1σ: {within_1_std:.1%} (expected: 68%)")
    print(f"Within 2σ: {within_2_std:.1%} (expected: 95%)")
    print(f"Within 3σ: {within_3_std:.1%} (expected: 99.7%)")
    
    return data

normal_data = normal_distribution_demo()
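The 68-95-99.7 figures are rounded. The exact values come straight from the normal CDF, and a short sketch using only the standard library can show them. The helper name `mass_within` is made up for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def mass_within(k):
    """Probability that a normal value falls within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"Within {k}σ: {mass_within(k):.2%}")
```

Because the result depends only on k, the same percentages hold for any mean and standard deviation.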

Real Application: Anomaly Detection

class AnomalyDetector:
    """Detect anomalies using statistical methods"""
    
    def __init__(self, threshold=3):
        """threshold: number of standard deviations to consider anomalous"""
        self.threshold = threshold
        self.mean = None
        self.std = None
    
    def fit(self, data):
        """Learn normal behavior from training data"""
        self.mean = np.mean(data)
        self.std = np.std(data)
        return self
    
    def predict(self, data):
        """Detect anomalies in new data"""
        # Calculate z-scores
        z_scores = np.abs((data - self.mean) / self.std)
        
        # Flag points beyond threshold
        is_anomaly = z_scores > self.threshold
        
        return is_anomaly, z_scores
    
    def explain(self, value):
        """Explain why a value is anomalous"""
        z_score = abs((value - self.mean) / self.std)
        
        if z_score > self.threshold:
            print(f"⚠ ANOMALY DETECTED")
            print(f"Value: {value:.2f}")
            print(f"Expected range: [{self.mean - self.threshold*self.std:.2f}, "
                  f"{self.mean + self.threshold*self.std:.2f}]")
            print(f"Z-score: {z_score:.2f} (threshold: {self.threshold})")
            print(f"Probability of this extreme: {2 * (1 - stats.norm.cdf(z_score)):.4%}")
        else:
            print(f"✓ Normal value: {value:.2f} (z-score: {z_score:.2f})")

# Example: Monitor API response times
from scipy import stats

normal_response_times = np.random.normal(100, 10, size=1000)  # 100ms mean, 10ms std
anomalous_response_times = np.concatenate([
    normal_response_times,
    [250, 300, 280]  # Anomalously slow
])

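detector = AnomalyDetector(threshold=3)
detector.fit(normal_response_times)

# Batch check: the three injected slow requests should be among the flagged points
is_anomaly, _ = detector.predict(anomalous_response_times)
print(f"Flagged {is_anomaly.sum()} of {len(is_anomaly)} measurements\n")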

# Check new measurements
for response_time in [105, 120, 250, 95, 300]:
    detector.explain(response_time)
    print()
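A threshold is also a false-alarm budget. Even when nothing is wrong, a z-score cutoff flags a predictable share of healthy points. The sketch below estimates that share, assuming the healthy data is exactly normal (real response times often are not). The traffic figure of 1,000,000 requests per day is hypothetical.

```python
from statistics import NormalDist

def false_alarm_rate(threshold):
    """Two-sided tail probability beyond `threshold` standard deviations,
    assuming the healthy data is exactly normal."""
    return 2 * (1 - NormalDist().cdf(threshold))

requests_per_day = 1_000_000  # hypothetical traffic volume

for threshold in (2, 3, 4):
    rate = false_alarm_rate(threshold)
    print(f"threshold={threshold}: {rate:.4%} of normal points flagged, "
          f"about {rate * requests_per_day:,.0f} per day")
```

At high volume, a threshold that looks strict on paper can still page someone thousands of times a day, so it is worth picking the threshold with the volume in mind.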

Bayesian Thinking

Update beliefs based on evidence.

Bayes' Theorem

P(A|B) = P(B|A) * P(A) / P(B)

Translation: Probability of A given B = How well A explains B × Prior belief in A / Probability of seeing B
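A worked example helps make the formula concrete. The numbers below are invented for illustration: 20% of mail is spam, the word "free" shows up in 60% of spam and in 5% of legitimate mail. The function name `posterior` is also just a label for this sketch.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """P(A|B) from P(A), P(B|A) and P(B|not A), via Bayes' theorem."""
    # P(B), by the law of total probability
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence

p_spam = 0.20             # hypothetical prior: P(spam)
p_free_given_spam = 0.60  # hypothetical: P("free" | spam)
p_free_given_ham = 0.05   # hypothetical: P("free" | ham)

p_spam_given_free = posterior(p_spam, p_free_given_spam, p_free_given_ham)
print(f"P(spam | 'free') = {p_spam_given_free:.2f}")  # 0.75
```

One word moved the belief from 20% to 75%. The denominator matters: "free" is rare in legitimate mail, so seeing it is strong evidence.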

Real Example: Spam Classification

class NaiveBayesSpamFilter:
    """Simple spam filter using Bayes' theorem"""
    
    def __init__(self):
        self.word_probs = {}
        self.spam_prob = 0.5  # Prior: 50% of emails are spam
    
    def train(self, emails, labels):
        """
        Learn P(word|spam) and P(word|ham)
        
        emails: list of emails (each email is a list of words)
        labels: list of 0 (ham) or 1 (spam)
        """
        spam_words = []
        ham_words = []
        
        for email, label in zip(emails, labels):
            if label == 1:  # spam
                spam_words.extend(email)
            else:  # ham
                ham_words.extend(email)
        
        # Count word frequencies
        spam_word_counts = Counter(spam_words)
        ham_word_counts = Counter(ham_words)
        
        # Calculate probabilities (with smoothing)
        all_words = set(spam_words + ham_words)
        for word in all_words:
            spam_count = spam_word_counts.get(word, 0)
            ham_count = ham_word_counts.get(word, 0)
            
            # Laplace smoothing
            p_word_given_spam = (spam_count + 1) / (len(spam_words) + len(all_words))
            p_word_given_ham = (ham_count + 1) / (len(ham_words) + len(all_words))
            
            self.word_probs[word] = {
                'spam': p_word_given_spam,
                'ham': p_word_given_ham
            }
        
        self.spam_prob = sum(labels) / len(labels)
    
    def predict(self, email):
        """
        Calculate P(spam|email) using Bayes' theorem
        
        P(spam|email) ∝ P(email|spam) * P(spam)
        """
        # Start with prior probabilities (log scale to avoid underflow)
        log_prob_spam = np.log(self.spam_prob)
        log_prob_ham = np.log(1 - self.spam_prob)
        
        # Multiply by likelihood of each word
        for word in email:
            if word in self.word_probs:
                log_prob_spam += np.log(self.word_probs[word]['spam'])
                log_prob_ham += np.log(self.word_probs[word]['ham'])
        
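        # Normalize in log space: subtract the larger log-probability before
        # exponentiating so long emails don't underflow to 0/0
        max_log = max(log_prob_spam, log_prob_ham)
        spam_weight = np.exp(log_prob_spam - max_log)
        ham_weight = np.exp(log_prob_ham - max_log)
        p_spam = spam_weight / (spam_weight + ham_weight)
        
        # Return the predicted label and the posterior probability of that label
        if p_spam > 0.5:
            return 1, p_spam
        return 0, 1 - p_spam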

# Example usage
training_emails = [
    ["free", "money", "win", "prize"],  # spam
    ["meeting", "tomorrow", "discuss", "project"],  # ham
    ["click", "here", "free", "offer"],  # spam
    ["report", "quarterly", "results"],  # ham
    ["congratulations", "winner", "claim", "prize"],  # spam
]

labels = [1, 0, 1, 0, 1]  # 1=spam, 0=ham

classifier = NaiveBayesSpamFilter()
classifier.train(training_emails, labels)

# Test
test_email = ["free", "prize", "winner"]
prediction, confidence = classifier.predict(test_email)

print(f"Prediction: {'SPAM' if prediction == 1 else 'HAM'}")
print(f"Confidence: {confidence:.2%}")
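The filter adds log-probabilities instead of multiplying raw probabilities. Here is a small sketch of why, using a hypothetical email of 1000 words that each have likelihood 0.01.

```python
import numpy as np

# Hypothetical: 1000 words, each with P(word | class) = 0.01
word_likelihoods = np.full(1000, 0.01)

# The true product is 1e-2000, far below what a 64-bit float can represent
naive_product = np.prod(word_likelihoods)

# The same quantity in log space is an ordinary number
log_sum = np.sum(np.log(word_likelihoods))

print(f"Product of probabilities: {naive_product}")
print(f"Sum of log-probabilities: {log_sum:.2f}")
```

The product collapses to 0.0, so both classes would tie at zero and the comparison would be meaningless. The log-sum keeps the ordering intact.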

Hypothesis Testing and A/B Tests

The Right Way to Do A/B Testing

class ABTest:
    """Properly implemented A/B test with statistical significance"""
    
    def __init__(self, alpha=0.05):
        """
        alpha: significance level (typically 0.05 for 95% confidence)
        Lower alpha = more conservative = fewer false positives
        """
        self.alpha = alpha
    
    def analyze(self, control_conversions, control_total, 
                test_conversions, test_total):
        """
        Analyze A/B test results
        
        Returns:
            - conversion rates
            - p-value (probability of a difference at least this large if both variants truly convert at the same rate)
            - confidence interval
            - recommendation
        """
        # Conversion rates
        control_rate = control_conversions / control_total
        test_rate = test_conversions / test_total
        
        # Pooled proportion (null hypothesis assumption)
        pooled_p = (control_conversions + test_conversions) / (control_total + test_total)
        
        # Standard error
        se = np.sqrt(pooled_p * (1 - pooled_p) * (1/control_total + 1/test_total))
        
        # Z-score
        z = (test_rate - control_rate) / se
        
        # P-value (two-tailed test)
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))
        
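        # 95% confidence interval for the difference (unpooled standard error;
        # 1.96 is the two-sided z critical value and assumes alpha = 0.05)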
        se_diff = np.sqrt(control_rate*(1-control_rate)/control_total + 
                         test_rate*(1-test_rate)/test_total)
        ci_lower = (test_rate - control_rate) - 1.96 * se_diff
        ci_upper = (test_rate - control_rate) + 1.96 * se_diff
        
        # Relative improvement
        relative_improvement = (test_rate - control_rate) / control_rate
        
        # Decision
        is_significant = p_value < self.alpha
        
        # Print results
        print("="*60)
        print("A/B TEST RESULTS")
        print("="*60)
        print(f"Control group:  {control_conversions:4d}/{control_total:4d} = {control_rate:.2%}")
        print(f"Test group:     {test_conversions:4d}/{test_total:4d} = {test_rate:.2%}")
        print(f"\nAbsolute difference: {test_rate - control_rate:+.2%}")
        print(f"Relative improvement: {relative_improvement:+.2%}")
        print(f"\n95% Confidence Interval: [{ci_lower:.2%}, {ci_upper:.2%}]")
        print(f"P-value: {p_value:.4f}")
        print(f"Z-score: {z:.2f}")
        print(f"\nStatistically significant? {is_significant} (α={self.alpha})")
        
        if is_significant:
            if test_rate > control_rate:
                print("\n✓ RECOMMENDATION: Ship the test variant!")
            else:
                print("\n✗ RECOMMENDATION: Test variant is worse. Don't ship.")
        else:
            print("\n⚠ RECOMMENDATION: Results not significant. Need more data or test is neutral.")
        
        print("="*60)
        
        return {
            'control_rate': control_rate,
            'test_rate': test_rate,
            'absolute_diff': test_rate - control_rate,
            'relative_improvement': relative_improvement,
            'p_value': p_value,
            'z_score': z,
            'ci_lower': ci_lower,
            'ci_upper': ci_upper,
            'is_significant': is_significant
        }

# Example 1: Strong signal
print("\nExample 1: Strong conversion improvement")
ab_test = ABTest(alpha=0.05)
result1 = ab_test.analyze(
    control_conversions=150, control_total=1000,
    test_conversions=195, test_total=1000
)

# Example 2: Weak signal (false positive risk)
print("\n\nExample 2: Weak signal, not significant")
result2 = ab_test.analyze(
    control_conversions=152, control_total=1000,
    test_conversions=165, test_total=1000
)

# Example 3: Need more data
print("\n\nExample 3: Small sample size")
result3 = ab_test.analyze(
    control_conversions=15, control_total=100,
    test_conversions=22, test_total=100
)
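A z-test leans on a normal approximation, so it is reasonable to sanity-check it with a simulation that makes no such assumption. The sketch below reuses the Example 1 counts and compares the z-test p-value with a permutation-style estimate: hold the total number of conversions fixed, shuffle the group labels, and see how often a gap this large appears by chance. The seed and number of simulations are arbitrary choices.

```python
from statistics import NormalDist
import numpy as np

control_conv, test_conv, n_per_group = 150, 195, 1000  # Example 1 counts

# Pooled two-proportion z-test (same formula as ABTest.analyze)
pooled = (control_conv + test_conv) / (2 * n_per_group)
se = (pooled * (1 - pooled) * (2 / n_per_group)) ** 0.5
z = (test_conv - control_conv) / n_per_group / se
p_z_test = 2 * (1 - NormalDist().cdf(abs(z)))

# Null hypothesis by shuffling labels: the 345 conversions are fixed, and the
# number that land in the control group is hypergeometric.
rng = np.random.default_rng(0)
total_conv = control_conv + test_conv
null_control = rng.hypergeometric(
    ngood=total_conv,
    nbad=2 * n_per_group - total_conv,
    nsample=n_per_group,
    size=50_000,
)
null_diff = (total_conv - null_control) - null_control  # test minus control
observed_diff = test_conv - control_conv
p_simulated = np.mean(np.abs(null_diff) >= observed_diff)

print(f"z-test p-value:    {p_z_test:.4f}")
print(f"simulated p-value: {p_simulated:.4f}")
```

The two numbers should land close to each other at this sample size. A large disagreement would suggest the normal approximation is a poor fit, which is more likely with small groups or very low conversion rates.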

Sample Size Calculation

def calculate_sample_size(baseline_rate, minimum_detectable_effect, 
                         alpha=0.05, power=0.80):
    """
    Calculate required sample size for A/B test
    
    Args:
        baseline_rate: Current conversion rate
        minimum_detectable_effect: Smallest relative improvement you care about (e.g., 0.10 for a 10% relative lift)
        alpha: Significance level (false positive rate)
        power: Statistical power (1 - false negative rate)
    
    Returns:
        Required sample size per group
    """
    # Effect size
    test_rate = baseline_rate * (1 + minimum_detectable_effect)
    
    # Z-scores
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)
    
    # Sample size formula
    pooled_p = (baseline_rate + test_rate) / 2
    
    n = ((z_alpha * np.sqrt(2 * pooled_p * (1 - pooled_p)) + 
          z_beta * np.sqrt(baseline_rate * (1 - baseline_rate) + test_rate * (1 - test_rate)))**2 /
         (test_rate - baseline_rate)**2)
    
    return int(np.ceil(n))

# Example: How many users do I need?
baseline = 0.10  # 10% current conversion rate
mde = 0.15  # Want to detect 15% relative improvement

sample_size = calculate_sample_size(baseline, mde)

print(f"Baseline conversion rate: {baseline:.1%}")
print(f"Minimum detectable effect: {mde:.1%}")
print(f"Required sample size per group: {sample_size:,}")
print(f"Total users needed: {2*sample_size:,}")
print(f"\nIf you get 1000 users/day: {2*sample_size/1000:.1f} days")
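One way to build trust in a sample size formula is to simulate many experiments at that size and count how often the test detects the effect. The sketch below restates the formula with the standard library so it runs on its own, then checks the empirical power. The function name `sample_size`, the seed, and the 5000 simulated experiments are illustrative assumptions.

```python
import math
from statistics import NormalDist
import numpy as np

def sample_size(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

p1, p2 = 0.10, 0.115  # 10% baseline with a 15% relative lift
n = sample_size(p1, p2)

# Simulate experiments where the lift is real and run the pooled z-test on each
rng = np.random.default_rng(1)
n_sims = 5000
control = rng.binomial(n, p1, size=n_sims)
test = rng.binomial(n, p2, size=n_sims)
pooled = (control + test) / (2 * n)
se = np.sqrt(pooled * (1 - pooled) * 2 / n)
z = (test - control) / n / se

z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)
empirical_power = np.mean(np.abs(z) > z_crit)

print(f"Sample size per group: {n:,}")
print(f"Empirical power: {empirical_power:.1%} (target: 80%)")
```

The share of simulated experiments that reach significance should sit near the 80% target. Rerunning with a smaller `n` shows how quickly real effects start getting missed.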

Correlation vs Causation

def correlation_analysis(x, y, x_name="X", y_name="Y"):
    """
    Analyze correlation between two variables
    
    Correlation != Causation!
    """
    # Pearson correlation coefficient
    correlation = np.corrcoef(x, y)[0, 1]
    
    # Statistical significance
    n = len(x)
    t_stat = correlation * np.sqrt(n - 2) / np.sqrt(1 - correlation**2)
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), n - 2))
    
    print(f"Correlation between {x_name} and {y_name}:")
    print(f"Pearson r: {correlation:.3f}")
    print(f"P-value: {p_value:.4f}")
    
    # Interpretation
    if abs(correlation) > 0.7:
        strength = "strong"
    elif abs(correlation) > 0.3:
        strength = "moderate"
    else:
        strength = "weak"
    
    direction = "positive" if correlation > 0 else "negative"
    
    print(f"Interpretation: {strength} {direction} correlation")
    
    if p_value < 0.05:
        print("Statistically significant (p < 0.05)")
    else:
        print("Not statistically significant")
    
    print("\n⚠ Remember: Correlation does NOT imply causation!")
    
    return correlation, p_value

# Example 1: Real correlation
hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
exam_scores = hours_studied * 8 + np.random.normal(0, 5, 10) + 20

correlation_analysis(hours_studied, exam_scores, "Hours Studied", "Exam Score")

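# Example 2: Spurious correlation driven by a confounder (temperature)
temperature = np.random.normal(25, 5, 100)  # daily temperature in °C
ice_cream_sales = 4 * temperature + np.random.normal(0, 5, 100)
shark_attacks = 0.5 * temperature + np.random.normal(0, 1, 100)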

print("\n" + "="*60)
correlation_analysis(ice_cream_sales, shark_attacks, "Ice Cream Sales", "Shark Attacks")
print("\nThis is spurious! Both are caused by summer weather (confounding variable)")
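If a confounder is suspected and measured, one simple check is to remove its linear effect from both variables and correlate what is left (a partial correlation). This sketch uses synthetic data where temperature drives both series. The coefficients, seed, and the helper name `residuals` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
temperature = rng.normal(25, 5, 1000)                    # the confounder
ice_cream = 4.0 * temperature + rng.normal(0, 5, 1000)   # depends on temperature only
sharks = 0.5 * temperature + rng.normal(0, 1, 1000)      # depends on temperature only

def residuals(y, x):
    """What is left of y after removing its best linear fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

raw_r = np.corrcoef(ice_cream, sharks)[0, 1]
partial_r = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(sharks, temperature),
)[0, 1]

print(f"Raw correlation:                 {raw_r:.3f}")
print(f"After controlling for temperature: {partial_r:.3f}")
```

The raw correlation is strong, but it largely disappears once temperature is accounted for. This only works for confounders you have actually measured, and it assumes the relationships are roughly linear.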

Confidence Intervals

def confidence_interval(data, confidence=0.95):
    """
    Calculate confidence interval for the mean
    
    Interpretation: if we repeated the sampling many times, about 95% of the
    intervals built this way would contain the true mean
    """
    n = len(data)
    mean = np.mean(data)
    std_err = stats.sem(data)  # Standard error of the mean
    
    # Critical value from t-distribution
    degrees_of_freedom = n - 1
    critical_value = stats.t.ppf((1 + confidence) / 2, degrees_of_freedom)
    
    # Margin of error
    margin_of_error = critical_value * std_err
    
    # Confidence interval
    ci_lower = mean - margin_of_error
    ci_upper = mean + margin_of_error
    
    print(f"Sample size: {n}")
    print(f"Sample mean: {mean:.2f}")
    print(f"Standard error: {std_err:.2f}")
    print(f"{confidence:.0%} Confidence Interval: [{ci_lower:.2f}, {ci_upper:.2f}]")
    print(f"Margin of error: ±{margin_of_error:.2f}")
    
    return ci_lower, ci_upper

# Example: Average API response time
response_times = np.random.normal(100, 15, size=50)
confidence_interval(response_times, confidence=0.95)
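What "95%" means is easier to see by repetition: draw many samples from a population whose mean is known, build an interval from each, and count how many contain the true value. A sketch under the same assumptions as the example above (normal data, mean 100, standard deviation 15, 50 points per sample). The seed and the 2000 repetitions are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_mean, true_std, n, n_experiments = 100, 15, 50, 2000

# Each row is one independent sample of n response times
samples = rng.normal(true_mean, true_std, size=(n_experiments, n))
means = samples.mean(axis=1)
std_errs = samples.std(axis=1, ddof=1) / np.sqrt(n)  # standard error of each mean

# Two-sided 95% critical value from the t-distribution
critical_value = stats.t.ppf(0.975, df=n - 1)

# An interval covers the truth when the true mean is within the margin of error
covered = np.abs(means - true_mean) <= critical_value * std_errs
coverage = covered.mean()

print(f"Intervals containing the true mean: {coverage:.1%}")
```

Roughly 1 in 20 intervals misses, and nothing about an individual interval tells you whether it is one of the misses.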

Key Takeaways

Probability quantifies uncertainty in systems
Distributions model different types of random events
Bayesian thinking updates beliefs based on evidence
Hypothesis testing makes statistically sound decisions
A/B testing requires proper sample sizes and significance tests
Correlation ≠ Causation - always check for confounding variables
Confidence intervals quantify estimation uncertainty

What's Next

In the next article, we'll explore discrete mathematics—the foundation of algorithms, data structures, and computational thinking.

You'll learn:

Set theory and data structures
Logic and Boolean algebra
Combinatorics and counting
Recurrence relations and recursion

Continue to Part 5: Discrete Mathematics →
