Statistics for Data Science and Machine Learning

Population vs. Sample

Population

A population consists of the entire set of individuals or items that are the subject of a statistical study. It encompasses every member that fits the criteria of the research question.

Characteristics:

Comprehensive: Includes all individuals or items of interest.

Parameters: Measurements that describe the entire population. Examples of parameters include:

Population Mean (μ): The average of all values in the population.

Population Standard Deviation (σ): A measure of the dispersion of values in the population.

Example: All students enrolled in a university.

Sample

A sample is a subset of the population selected for the purpose of analysis. It allows researchers to draw conclusions about the population without examining every individual.

Characteristics:

Subset: A smaller, manageable group chosen from the population.

Statistics: Measurements that describe the sample. Examples of statistics include:

Sample Mean (x̄): The average of all values in the sample.

Sample Standard Deviation (s): A measure of the dispersion of values in the sample.

Example: A group of 200 students chosen randomly from a university’s total enrollment.

Mean, Median, and Mode

Mean

The mean, or average, is a measure of central tendency that is calculated by summing all the values in a dataset and then dividing by the number of values.

Formula:

Mean (x̄) = (Σx) / n

where:

Σx is the sum of all values in the dataset.
n is the number of values in the dataset.

Example:
For the dataset: 2, 4, 6, 8, 10

Mean (x̄) = (2 + 4 + 6 + 8 + 10) / 5 = 30 / 5 = 6

Median

The median is the middle value of a dataset when it is ordered from least to greatest. If the dataset has an odd number of observations, the median is the middle value. If it has an even number of observations, the median is the average of the two middle values.

Formula:

For an odd number of observations: Median = middle value
For an even number of observations: Median = (middle value 1 + middle value 2) / 2

Example:
For the dataset (odd number): 1, 3, 3, 6, 7, 8, 9

Median = 6

For the dataset (even number): 1, 2, 3, 4, 5, 6, 8, 9

Median = (4 + 5) / 2 = 9 / 2 = 4.5

Mode

The mode is the value that appears most frequently in a dataset. A dataset can have more than one mode if multiple values have the same highest frequency, or no mode if all values are unique.

Formula:

Mode = value with the highest frequency

Example:
For the dataset: 1, 2, 2, 3, 4, 4, 4, 5, 5

Mode = 4
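All three measures of central tendency are available in Python's standard `statistics` module; a minimal sketch using the example datasets from this section:

```python
import statistics

# Datasets from the worked examples above
data_mean = [2, 4, 6, 8, 10]
data_median_odd = [1, 3, 3, 6, 7, 8, 9]
data_median_even = [1, 2, 3, 4, 5, 6, 8, 9]
data_mode = [1, 2, 2, 3, 4, 4, 4, 5, 5]

print(statistics.mean(data_mean))           # 6
print(statistics.median(data_median_odd))   # 6
print(statistics.median(data_median_even))  # 4.5
print(statistics.mode(data_mode))           # 4
```

`statistics.median` handles the odd/even distinction automatically, averaging the two middle values when the count is even.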

Variance and Standard Deviation

Variance

Variance measures the spread of a set of numbers. It represents the average of the squared differences from the mean, providing a sense of how much the values in a dataset deviate from the mean.

Formula:
For a population:

Variance (σ²) = Σ (x – μ)² / N

For a sample:

Variance (s²) = Σ (x – x̄)² / (n – 1)

where:

Σ is the sum of all values.
x is each individual value.
μ is the population mean.
x̄ is the sample mean.
N is the total number of values in the population.
n is the total number of values in the sample.

Example:
For the sample dataset: 2, 4, 4, 4, 5, 5, 7, 9

1. Calculate the sample mean (x̄):
x̄ = (2 + 4 + 4 + 4 + 5 + 5 + 7 + 9) / 8 = 40 / 8 = 5

2. Calculate each (x – x̄)²:
(2 – 5)² = 9
(4 – 5)² = 1
(4 – 5)² = 1
(4 – 5)² = 1
(5 – 5)² = 0
(5 – 5)² = 0
(7 – 5)² = 4
(9 – 5)² = 16

3. Sum of squared differences:
Σ (x – x̄)² = 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32

4. Calculate the variance:
s² = 32 / (8 – 1) = 32 / 7 ≈ 4.57

Standard Deviation

Standard deviation is the square root of the variance. It provides a measure of the spread of the values in a dataset in the same units as the data, making it easier to interpret.

Formula:
For a population:

Standard Deviation (σ) = √(Σ (x – μ)² / N)

For a sample:

Standard Deviation (s) = √(Σ (x – x̄)² / (n – 1))

Example:
Using the variance calculated above (s² ≈ 4.57):

Standard Deviation (s) = √4.57 ≈ 2.14
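Both the sample (n – 1 denominator) and population (N denominator) versions are available in the `statistics` module; a sketch using the sample dataset from the worked example:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Sample versions: denominator n - 1 (Bessel's correction)
print(round(statistics.variance(data), 2))  # 4.57
print(round(statistics.stdev(data), 2))     # 2.14

# Population versions: denominator N
print(statistics.pvariance(data))           # 4
print(statistics.pstdev(data))              # 2.0
```

Note how the population variance (32 / 8 = 4) is smaller than the sample variance (32 / 7 ≈ 4.57), since dividing by n – 1 inflates the estimate slightly.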

Correlation Coefficient

The correlation coefficient measures the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, where:

r = 1: Perfect positive correlation
r = -1: Perfect negative correlation
r = 0: No linear correlation

Formula

The Pearson correlation coefficient (often denoted as r) is calculated using the following formula:

r = Σ((x – x̄)(y – ȳ)) / √(Σ(x – x̄)² * Σ(y – ȳ)²)

where:

x and y are the individual values of the two variables.
x̄ and ȳ are the means of the two variables.
Σ denotes the summation over all data points.

Interpretation

r > 0: Positive correlation (as one variable increases, the other tends to increase).
r < 0: Negative correlation (as one variable increases, the other tends to decrease).
r = 0: No linear correlation.
The closer r is to 1 or -1, the stronger the correlation.

Example

Consider two variables, X (hours of study) and Y (exam scores), for a group of students:

Hours of Study (X) | Exam Scores (Y)
3                  | 65
4                  | 75
6                  | 85
7                  | 90
9                  | 95

Calculations:

1. Calculate the means (x̄ and ȳ).
2. Calculate the deviations from the means (x – x̄ and y – ȳ).
3. Square the deviations for each variable and sum them.
4. Multiply the paired deviations of X and Y, sum the products, and divide by the square root of the product of the two sums of squared deviations.

Result:

r ≈ 0.97

Interpretation

The correlation coefficient r is approximately 0.97, indicating a strong positive linear relationship between hours of study and exam scores. As hours of study increase, exam scores tend to increase as well.
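The result can be reproduced by coding the Pearson formula directly; a minimal sketch in plain Python using the study-hours data above:

```python
import math

x = [3, 4, 6, 7, 9]       # hours of study
y = [65, 75, 85, 90, 95]  # exam scores

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Pearson correlation coefficient from its definition
num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
den = math.sqrt(sum((xi - x_bar) ** 2 for xi in x)
                * sum((yi - y_bar) ** 2 for yi in y))
r = num / den
print(round(r, 2))  # 0.97
```

Python 3.10+ also provides `statistics.correlation(x, y)`, which computes the same quantity.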

Point Estimation

Point estimation is a statistical method used to estimate an unknown parameter of a population based on sample data. It involves using a single value, called a point estimate, to approximate the true value of the parameter.

Key Concepts

Population: The entire group of individuals, items, or events of interest in a statistical study.

Parameter: A numerical characteristic of a population that is unknown and typically of interest in statistical analysis. Examples include the population mean, population proportion, population variance, etc.

Sample: A subset of the population from which data is collected.

Point Estimate: A single value, calculated from sample data, that serves as the best guess for the true value of the population parameter. It is denoted by a specific symbol, such as “x̄” for a point estimate of parameter “μ”.

Properties of Point Estimates

Unbiasedness: A point estimate is unbiased if its expected value is equal to the true value of the parameter being estimated.

Efficiency: An efficient point estimate has the smallest possible variance among all unbiased estimators of the parameter.

Consistency: A consistent point estimate converges to the true value of the parameter as the sample size increases.

Point Estimate Symbols

Population Mean: “μ”
Sample Mean: “x̄”
Population Variance: “σ²”
Sample Variance: “s²”
Population Standard Deviation: “σ”
Sample Standard Deviation: “s”

Example

Suppose we want to estimate the mean income of all households in a city. We collect a random sample of 100 households and calculate the mean income of the sample (“x̄”). We use “x̄” as our point estimate of the population mean income (“μ”).
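A sketch of this idea with synthetic data (the income population below is made up for illustration; in practice μ would be unknown):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 household incomes (synthetic)
population = [random.gauss(50_000, 12_000) for _ in range(10_000)]
mu = sum(population) / len(population)  # true mean, unknown in practice

# Collect a random sample of 100 households
sample = random.sample(population, 100)

# Point estimate x_bar of the population mean mu
x_bar = sum(sample) / len(sample)
print(round(x_bar), round(mu))  # the estimate lands near the true mean
```

Rerunning with a different seed gives a different x̄; the point estimate varies from sample to sample while μ stays fixed.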

Estimator

An estimator is a statistical function or rule used to estimate an unknown parameter of a population based on sample data. It calculates a point estimate, which serves as the best guess for the true value of the parameter.

Types of Estimators

Unbiased Estimator: An estimator whose expected value is equal to the true value of the parameter being estimated.

Consistent Estimator: An estimator that converges to the true value of the parameter as the sample size increases.

Efficient Estimator: An estimator with the smallest possible variance among all unbiased estimators of the parameter.

Biased and Unbiased Estimators

Unbiased Estimator

An estimator is unbiased if its expected value is equal to the true value of the population parameter it is estimating. In other words, an unbiased estimator does not systematically overestimate or underestimate the parameter.

Example: Sample Mean as an Unbiased Estimator of Population Mean

The sample mean (“x̄”) is an unbiased estimator of the population mean (“μ”). This means that, on average, the sample mean will equal the population mean when taken over many samples.

Formula for Sample Mean:

x̄ = Σx / n

where:

Σx is the sum of all sample values.
n is the number of sample values.

Biased Estimator

An estimator is biased if its expected value is not equal to the true value of the population parameter it is estimating. A biased estimator systematically overestimates or underestimates the parameter.

Example: Sample Variance as a Biased Estimator of Population Variance

The sample variance calculated using the formula with “n” in the denominator (instead of “n-1”) is a biased estimator of the population variance (“σ²”). This formula tends to underestimate the true population variance, especially for small sample sizes.

Biased Formula for Sample Variance:

s²_biased = Σ(x – x̄)² / n

where:

Σ(x – x̄)² is the sum of squared deviations from the sample mean.
n is the number of sample values.

To correct this bias, we use Bessel’s correction, replacing “n” with “n-1” in the denominator, which provides an unbiased estimator of the population variance.

Unbiased Formula for Sample Variance:

s²_unbiased = Σ(x – x̄)² / (n – 1)

where:

Σ(x – x̄)² is the sum of squared deviations from the sample mean.
n is the number of sample values.
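Bessel's correction can be checked empirically: averaging each estimator over many small samples drawn from a distribution with known variance shows the n-denominator version undershooting. A sketch with synthetic normal data:

```python
import random
import statistics

random.seed(42)
# Samples are drawn from Normal(0, 2), so the true variance is 4.0

def biased_var(xs):
    """Sample variance with n in the denominator (no Bessel's correction)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

trials = 10_000
biased_avg = 0.0
unbiased_avg = 0.0
for _ in range(trials):
    xs = [random.gauss(0, 2) for _ in range(5)]  # small sample: n = 5
    biased_avg += biased_var(xs) / trials
    unbiased_avg += statistics.variance(xs) / trials  # n - 1 denominator

# Expect biased_avg near (n-1)/n * 4 = 3.2 and unbiased_avg near 4.0
print(round(biased_avg, 2), round(unbiased_avg, 2))
```

With n = 5, the biased estimator undershoots the true variance by a factor of (n – 1)/n = 0.8 on average, exactly the bias that Bessel's correction removes.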

Hypothesis Testing

Hypothesis testing is a statistical method used to make inferences or draw conclusions about a population based on sample data. It involves making an initial assumption (the null hypothesis) and determining whether the sample data provides sufficient evidence to reject this assumption in favor of an alternative hypothesis.

Key Concepts

Null Hypothesis (H₀): The statement being tested, typically representing no effect or no difference. It is assumed to be true unless the data provides strong evidence against it.

Alternative Hypothesis (H₁ or Ha): The statement we want to find evidence for, representing an effect or a difference. It is supported when the null hypothesis is rejected.

Significance Level (α): The threshold for determining whether the evidence is strong enough to reject the null hypothesis. Common significance levels are 0.05, 0.01, and 0.10.

Test Statistic: A standardized value calculated from sample data, used to determine whether to reject the null hypothesis. Different tests have different test statistics, such as the t-statistic or z-statistic.

p-value: The probability of obtaining a test statistic as extreme as the one observed, assuming the null hypothesis is true. If the p-value is less than or equal to the significance level, we reject the null hypothesis.

Type I Error: Wrongly rejecting a true null hypothesis (false positive). Its probability is the significance level α.

Type II Error: Failing to reject a false null hypothesis (false negative). Its probability is denoted β.

Steps in Hypothesis Testing

State the Hypotheses:

Null Hypothesis (H₀): Example – The population mean is equal to a specified value (μ = μ₀).
Alternative Hypothesis (H₁): Example – The population mean is not equal to the specified value (μ ≠ μ₀).

Choose the Significance Level (α):

Common choices are 0.05, 0.01, or 0.10.

Select the Appropriate Test and Calculate the Test Statistic:

Depending on the sample size and whether the population standard deviation is known, choose a test (e.g., z-test, t-test).
Calculate the test statistic using the sample data.

Determine the p-value or Critical Value:

Compare the test statistic to a critical value from statistical tables or calculate the p-value.

Make a Decision:

If the p-value ≤ α, reject the null hypothesis (H₀).
If the p-value > α, do not reject the null hypothesis (H₀).

Interpret the Results:

Draw conclusions based on the decision made in the previous step.
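The steps above can be sketched end to end for a z-test (appropriate when σ is known). All numbers below are hypothetical; the standard library's `statistics.NormalDist` supplies the normal CDF:

```python
from statistics import NormalDist

# Hypothetical setup: H0: mu = 100 vs H1: mu != 100,
# known population sigma = 15, sample of n = 36 with mean 105
mu0, sigma, n, x_bar, alpha = 100, 15, 36, 105, 0.05

# Step 3: test statistic
z = (x_bar - mu0) / (sigma / n ** 0.5)

# Step 4: two-tailed p-value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 5: decision
print(round(z, 2), round(p_value, 4))  # 2.0 0.0455
print("Reject H0" if p_value <= alpha else "Fail to reject H0")  # Reject H0
```

Here p ≈ 0.0455 ≤ 0.05, so at the 5% significance level the sample mean of 105 is significantly different from 100.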

Example: t-Test

The t-test is a statistical test used to determine whether there is a significant difference between the means of two groups or between a sample mean and a known population mean. It is particularly useful when the sample size is small and the population standard deviation is unknown.

Types of t-Tests

One-Sample t-Test: Tests whether the mean of a single sample is significantly different from a known population mean.

Independent Two-Sample t-Test: Tests whether the means of two independent samples are significantly different.

Paired Sample t-Test: Tests whether the means of two related groups (e.g., measurements before and after treatment) are significantly different.

Key Concepts

Null Hypothesis (H₀): The hypothesis that there is no effect or no difference. It assumes that any observed difference is due to sampling variability.

Alternative Hypothesis (H₁ or Ha): The hypothesis that there is an effect or a difference. It suggests that the observed difference is real and not due to chance.

Degrees of Freedom (df): The number of independent values or quantities that can vary in the analysis. It is used to determine the critical value from the t-distribution table.

Significance Level (α): The threshold for rejecting the null hypothesis. Common significance levels are 0.05, 0.01, and 0.10.

Test Statistic: A value calculated from the sample data that is used to make a decision about the null hypothesis.

One-Sample t-Test

Purpose: To determine if the sample mean is significantly different from a known population mean.

Formula:

t = (x̄ – μ) / (s / √n)

where:

x̄ is the sample mean.
μ is the population mean.
s is the sample standard deviation.
n is the sample size.

Steps:

State the hypotheses.

H₀: μ = μ₀
H₁: μ ≠ μ₀

Choose the significance level (α).
Calculate the test statistic (t).
Determine the critical value or p-value.
Make a decision and interpret the results.
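A sketch of the one-sample t statistic in plain Python, reusing the sample dataset from the variance section with a hypothetical null value μ₀ = 4:

```python
import statistics

# Sample from the variance section; hypothetical H0: mu = 4
sample = [2, 4, 4, 4, 5, 5, 7, 9]
mu0 = 4

n = len(sample)
x_bar = statistics.mean(sample)  # 5
s = statistics.stdev(sample)     # ~2.14 (n - 1 denominator)

# One-sample t statistic
t = (x_bar - mu0) / (s / n ** 0.5)
print(round(t, 2))  # 1.32
```

With df = n – 1 = 7 and α = 0.05 (two-tailed), the critical value is about 2.365; since |1.32| < 2.365, we would fail to reject H₀ here.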

Independent Two-Sample t-Test

Purpose: To determine if the means of two independent samples are significantly different.

Formula:

t = (x̄₁ – x̄₂) / √[(s₁² / n₁) + (s₂² / n₂)]

where:

x̄₁ and x̄₂ are the sample means.
s₁² and s₂² are the sample variances.
n₁ and n₂ are the sample sizes.

Steps:

State the hypotheses.

H₀: μ₁ = μ₂
H₁: μ₁ ≠ μ₂

Choose the significance level (α).
Calculate the test statistic (t).
Determine the degrees of freedom (df).
Determine the critical value or p-value.
Make a decision and interpret the results.
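The two-sample statistic can be computed the same way. The formula above, with unpooled variances, is Welch's form of the test; the scores below are hypothetical:

```python
import statistics

# Hypothetical exam scores for two independent groups
group1 = [78, 85, 90, 72, 88, 80]
group2 = [70, 75, 68, 74, 72, 71]

n1, n2 = len(group1), len(group2)
m1, m2 = statistics.mean(group1), statistics.mean(group2)
v1, v2 = statistics.variance(group1), statistics.variance(group2)

# Welch's t statistic (does not assume equal variances)
t = (m1 - m2) / (v1 / n1 + v2 / n2) ** 0.5
print(round(t, 2))  # 3.55
```

A t of about 3.55 is well beyond typical critical values for these sample sizes, so the two group means would be judged significantly different at α = 0.05.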