STATISTICS

Population: This refers to the entire group that you’re interested in studying. Think of it as the big picture. For example, if you want to know about the reading habits of all adults in Bengaluru, the population would be all the adults in the city.

Sample: This is a smaller group selected from the population. It’s like taking a slice of the whole cake. Instead of studying every adult in Bengaluru, you might survey 500 randomly chosen adults. This smaller group is your sample.

So, in short, the population is the whole, and the sample is a manageable portion that represents the population.

Mean

Definition: The arithmetic average of a data set: the sum of the values divided by the number of values.
Formula: $\text{Mean} = \frac{\sum x_{i}}{n}$

Median

Definition: The middle value of a data set when it’s ordered.
Usage: Useful for skewed distributions, because it is not pulled toward extreme values the way the mean is.

Mode

Definition: The value that appears most frequently in a data set.
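
To see how the three measures can disagree, here is a minimal sketch using Python's built-in statistics module. The data set is made up for illustration (books read last year by seven surveyed adults), with one extreme value included on purpose.

```python
from statistics import mean, median, mode

# Hypothetical data: books read last year by seven surveyed adults.
books_read = [2, 3, 3, 5, 7, 8, 40]

avg = mean(books_read)     # sum of the values divided by their count
mid = median(books_read)   # middle value once the data is sorted
most = mode(books_read)    # value that appears most often

print("mean:", round(avg, 2))
print("median:", mid)
print("mode:", most)
```

The single value of 40 pulls the mean well above the median, which is why the median is often preferred for skewed data.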

Min and Max

Min: The smallest value in a data set.
Max: The largest value in a data set.

5-Point Summary

Components: Minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
Usage: Provides a quick overview of the distribution of data.
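
A small sketch of the five-point summary, assuming Python's statistics.quantiles with the "inclusive" method. Different tools interpolate quartiles slightly differently, so other software may give slightly different Q1 and Q3 on the same data. The data set is illustrative.

```python
from statistics import quantiles

# Illustrative data set, already sorted for readability.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into quarters and returns the three cut points.
# method="inclusive" treats the data as the whole population and
# interpolates linearly between neighbouring values.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

five_point = (min(data), q1, q2, q3, max(data))
print(five_point)
```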

Variance

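Definition: Measures how far the data values spread around the mean, as the average squared deviation from it.
Formula: $\text{Variance} = \frac{\sum (x_{i} - μ)^{2}}{n}$ . This is the population variance; for a sample, use the sample mean in place of $μ$ and divide by $n - 1$ instead of $n$.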

Covariance

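Definition: Measures how two variables vary together. It is positive when they tend to move in the same direction and negative when they tend to move in opposite directions.
Formula: $\text{Cov}(X, Y) = \frac{\sum (x_{i} - μ_{X}) (y_{i} - μ_{Y})}{n}$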

Standard Deviation

Definition: The square root of the variance, indicating how spread out the data is.
Formula: $\text{SD} = \sqrt{\text{Variance}}$

Correlation

Definition: Measures the strength and direction of a linear relationship between two variables.
Formula: $\text{Correlation} = \frac{\text{Cov}(X, Y)}{σ_{X} σ_{Y}}$
Range: -1 to 1, where 1 means perfect positive correlation, -1 means perfect negative correlation, and 0 means no linear correlation (a non-linear relationship may still exist).
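
The four formulas above (variance, covariance, standard deviation, correlation) fit together, and a short sketch makes that visible. This uses the population versions (dividing by n), to match the formulas as written. The function names and the hours/scores data are made up for illustration.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def variance(xs):
    """Population variance is the covariance of a variable with itself."""
    return covariance(xs, xs)

def std_dev(xs):
    return sqrt(variance(xs))

def correlation(xs, ys):
    """Covariance rescaled by both standard deviations, so it lies in [-1, 1]."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))

# Hypothetical data: hours studied and quiz scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

print("variance of hours:", variance(hours))
print("covariance:", covariance(hours, scores))
print("correlation:", round(correlation(hours, scores), 4))
```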

Other Basic Concepts

Range: The difference between the maximum and minimum values.
Interquartile Range (IQR): The difference between the third quartile and the first quartile (Q3 - Q1), which covers the middle 50% of the data.
Skewness: Measures the asymmetry of the distribution.
Kurtosis: Measures the "tailedness" of the distribution.
Outliers: Data points that are significantly different from the rest of the data.

Percentiles and Quartiles

Percentiles: Values below which a certain percent of observations fall. For example, the 25th percentile is the value below which 25% of observations fall.
Quartiles: Specific percentiles that divide the data into quarters. Q1 (25th percentile), Q2 (median, 50th percentile), Q3 (75th percentile).

Probability Distributions

Uniform Distribution: All outcomes are equally likely. Example: Rolling a fair die.
Geometric Distribution: Number of trials until the first success. Example: First heads in coin flips.
Hypergeometric Distribution: Like the binomial distribution, but without replacement. Example: Selecting cards from a deck without putting them back.

Basic Data Visualization

Histograms: Used to represent the distribution of numerical data by bins.
Box Plots: Visualize the five-number summary (min, Q1, median, Q3, max).
Scatter Plots: Show the relationship between two continuous variables.

Sampling Methods

Simple Random Sampling: Each member of the population has an equal chance of being selected.
Stratified Sampling: Population divided into subgroups (strata) and a random sample taken from each stratum.
Cluster Sampling: Divides the population into clusters and randomly selects clusters, then samples all members of chosen clusters.

Probability theory is the branch of mathematics that deals with uncertainty. It's like a toolkit for quantifying the chances of various outcomes.

Here are the basics:

Probability: It measures the likelihood of an event occurring, between 0 and 1. A probability of 0 means the event won't happen; a probability of 1 means it definitely will.
Random Variables: These represent outcomes of random phenomena. For example, a dice roll can be modeled by a random variable that takes values 1 through 6.
Distributions: These describe how probabilities are distributed over the values of the random variable. Common distributions include the normal distribution (bell curve) and the binomial distribution (like flipping coins).
Expected Value: This is the long-term average value of repetitions of the experiment it represents.
Variance and Standard Deviation: These measure how spread out the values are.

Sample Space: This is the set of all possible outcomes of a random experiment. For instance, when rolling a standard six-sided die, the sample space is {1, 2, 3, 4, 5, 6}.

Sample Value: This refers to a specific outcome from the sample space (more commonly called an outcome or a sample point). So, if you roll the die and it lands on 4, then 4 is your sample value.

In short:

Sample space = all possible outcomes.
Sample value = a particular outcome from that set.

An event is a set of outcomes of an experiment or a particular situation that you're observing. Here are the main types of events:

Simple (or Elementary) Event: An event with only one outcome. For example, rolling a die and getting a 4.
Compound Event: An event with two or more outcomes. For example, rolling a die and getting an even number (2, 4, or 6).
Independent Event: Events where the outcome of one event doesn't affect the outcome of another. For example, flipping a coin and rolling a die.
Dependent Event: Events where the outcome of one event affects the outcome of another. For example, drawing a card from a deck, not replacing it, and drawing another card.
Mutually Exclusive Event: Events that cannot happen at the same time. For example, "getting a 2" and "getting a 3" on a single roll of a die.
Non-Mutually Exclusive Event: Events that can happen at the same time. For example, being a student and being employed.

So, events can vary widely based on how they relate to other events and what outcomes they include.

Mutually Exclusive Events

Definition: These are events that cannot happen at the same time. The occurrence of one event means the other cannot happen.
Example: Rolling a single die. The event "getting a 2" and "getting a 3" are mutually exclusive because you can't roll a 2 and a 3 simultaneously.
Probability Rule: $P(A \text{ or } B) = P(A) + P(B)$. If two events are mutually exclusive, the probability that either event occurs is the sum of their individual probabilities.

Non-Mutually Exclusive Events

Definition: These are events that can happen at the same time. The occurrence of one event does not prevent the other.
Example: Being a student and being employed. These two events are non-mutually exclusive because one can be both a student and employed at the same time.
Probability Rule: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$. Since the events can overlap, you subtract the probability of both events occurring together to avoid double-counting.

In essence, mutually exclusive events are all about exclusivity and separation, while non-mutually exclusive events allow for overlap and simultaneity.

Some fundamental probability rules:

1. Addition Rule

Mutually Exclusive Events: For events that cannot occur together, $P(A \text{ or } B) = P(A) + P(B)$.
Non-Mutually Exclusive Events: For events that can occur together, $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$.
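
Both forms of the addition rule can be checked by counting outcomes on a fair six-sided die. This is a minimal sketch; the prob helper and the event names are made up for illustration, and fractions are used so the arithmetic stays exact.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a set of outcomes) on a fair die."""
    return Fraction(len(event & sample_space), len(sample_space))

# Mutually exclusive: a single roll cannot be both a 2 and a 3.
two, three = {2}, {3}
p_two_or_three = prob(two | three)
print(p_two_or_three, "=", prob(two) + prob(three))

# Not mutually exclusive: 4 and 6 are both even and greater than 3.
even, greater_than_3 = {2, 4, 6}, {4, 5, 6}
p_union = prob(even | greater_than_3)
p_by_rule = prob(even) + prob(greater_than_3) - prob(even & greater_than_3)
print(p_union, "=", p_by_rule)
```

Without subtracting the overlap, the second calculation would give 1/2 + 1/2 = 1, which counts the outcomes 4 and 6 twice.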

2. Multiplication Rule

Independent Events: Events where the outcome of one does not affect the other, $P(A \text{ and } B) = P(A) \times P(B)$.
Dependent Events: Events where the outcome of one event affects the other, $P(A \text{ and } B) = P(A) \times P(B ∣ A)$, where $P(B ∣ A)$ is the probability of B given that A has occurred.

3. Complementary Rule

The probability of an event not occurring is $1 - P (A)$ .

4. Total Probability Rule

If events $B_{1}, B_{2}, \dots, B_{n}$ are mutually exclusive and collectively exhaustive (i.e., one of them must happen), then for any event A: $P (A) = P (A ∣ B_{1}) P (B_{1}) + P (A ∣ B_{2}) P (B_{2}) + \dots + P (A ∣ B_{n}) P (B_{n})$ .

5. Bayes' Theorem

This helps us find the probability of an event given that another event has occurred: $P (A ∣ B) = \frac{P (B ∣ A) P (A)}{P (B)}$ .
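
The total probability rule and Bayes' theorem are usually used together. Here is a worked sketch with made-up numbers for a screening test: 1% of people have a condition, the test is positive for 95% of those who have it, and it is also positive for 5% of those who do not. All three numbers are assumptions chosen for illustration.

```python
from fractions import Fraction

# Hypothetical inputs (not real medical data).
p_condition = Fraction(1, 100)             # P(A): has the condition
p_pos_given_condition = Fraction(95, 100)  # P(B | A): positive if condition
p_pos_given_healthy = Fraction(5, 100)     # P(B | not A): false positive rate

# Total probability rule: "condition" and "no condition" are mutually
# exclusive and collectively exhaustive, so they partition P(positive).
p_healthy = 1 - p_condition
p_positive = (p_pos_given_condition * p_condition
              + p_pos_given_healthy * p_healthy)

# Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
p_condition_given_pos = p_pos_given_condition * p_condition / p_positive

print("P(positive) =", p_positive)
print("P(condition | positive) =", p_condition_given_pos,
      "which is about", round(float(p_condition_given_pos), 3))
```

Under these assumed numbers, a positive result still leaves the chance of having the condition at roughly 16%, because the condition is rare to begin with.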

These rules form the backbone of probability theory and are pivotal in analyzing and making sense of uncertain situations.

Conditional Probability: This is the probability of an event happening given that another event has already occurred.

Formula: $P(A ∣ B) = \frac{P(A \text{ and } B)}{P(B)}$, defined when $P(B) > 0$.
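
The conditional probability formula can be checked by listing every outcome of two fair dice. In this sketch the events are chosen for illustration: A is "the sum is 8" and B is "the first die is even".

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(condition):
    """Probability that a randomly chosen outcome satisfies the condition."""
    favourable = [o for o in outcomes if condition(o)]
    return Fraction(len(favourable), len(outcomes))

def sum_is_8(o):
    return o[0] + o[1] == 8

def first_is_even(o):
    return o[0] % 2 == 0

p_a = prob(sum_is_8)
p_b = prob(first_is_even)
p_a_and_b = prob(lambda o: sum_is_8(o) and first_is_even(o))

# P(A | B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b

print("P(A) =", p_a)
print("P(A | B) =", p_a_given_b)
```

Knowing that the first die is even changes the probability of a sum of 8 from 5/36 to 1/6, so these two events are dependent.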

Random Variables are a fundamental concept in statistics and probability theory. They represent numerical outcomes of random phenomena.

Types of Random Variables:

Discrete Random Variables: These take on countable values. Example: The number of heads in 10 coin flips.
- Probability Mass Function (PMF): Gives the probability that a discrete random variable is exactly equal to some value. For example, the probability of getting exactly 4 heads in 10 coin flips.
Continuous Random Variables: These can take any value within a range, so the possible values cannot be counted. Example: The height of students in a class.
- Probability Density Function (PDF): Describes the relative likelihood of the values of a continuous random variable. The probability of any single exact value is 0; probabilities come from the area under the PDF over an interval. For example, the probability that a student's height is between 160 cm and 170 cm.

Cumulative Distribution Function (CDF): This function gives the probability that a random variable takes on a value less than or equal to a certain value.

Key Points:

Definition: $F (x) = P (X \leq x)$
Range: The CDF ranges from 0 to 1.
Discrete Case: Sum of the probabilities of all outcomes up to x.
Continuous Case: Integral of the probability density function (PDF) up to x.

Example:

For a discrete random variable like the number of heads in three coin flips:

$P (X \leq 2)$ is the sum of the probabilities of getting 0, 1, or 2 heads.

In essence, the CDF provides a cumulative probability for values up to a specific point.
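
The three-coin example can be worked out exactly. This sketch builds the PMF for the number of heads in three fair flips and sums it to get the CDF; the names pmf and cdf are just illustrative.

```python
from fractions import Fraction
from math import comb

n_flips = 3

# PMF: P(X = k) for k heads in 3 fair flips, i.e. C(3, k) / 2^3.
pmf = {k: Fraction(comb(n_flips, k), 2 ** n_flips) for k in range(n_flips + 1)}

def cdf(x):
    """F(x) = P(X <= x): sum of the PMF over all outcomes up to x."""
    return sum(p for k, p in pmf.items() if k <= x)

for k in range(n_flips + 1):
    print(f"P(X = {k}) = {pmf[k]},  P(X <= {k}) = {cdf(k)}")
```

Here P(X ≤ 2) comes out as 1/8 + 3/8 + 3/8 = 7/8, and the CDF reaches 1 at the largest possible value.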

Probability Distributions describe how the probabilities of a random variable are distributed over possible values. Here are the key types:

Discrete Probability Distributions:

Binomial Distribution: Deals with the number of successes in a fixed number of independent trials. Example: Flipping a coin 10 times.
- Formula: $P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}$, where $n$ is the number of trials and $p$ is the probability of success on each trial.
Poisson Distribution: Counts the number of events that happen in a fixed interval of time or space. Example: Number of emails received in an hour.
- Formula: $P (X = k) = \frac{λ^{k} e^{- λ}}{k!}$

Continuous Probability Distributions:

Normal Distribution: Bell-shaped curve; most values cluster around the mean. Example: Heights of people.
- Formula: $f (x) = \frac{1}{σ \sqrt{2 π}} e^{- \frac{1}{2} {(\frac{x - μ}{σ})}^{2}}$
Exponential Distribution: Deals with the time between events in a Poisson process. Example: Time between arrivals of buses.
- Formula: $f(x) = λ e^{- λ x}$ for $x \geq 0$ (and 0 otherwise), where $λ$ is the rate of events.
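
The four formulas above translate almost directly into code. This is a minimal sketch using only Python's math module; the function names are made up for illustration, and the example arguments are arbitrary.

```python
from math import comb, exp, factorial, pi, sqrt

def binomial_pmf(k, n, p):
    """P(X = k): k successes in n independent trials with success chance p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k): k events in an interval with average rate lam."""
    return lam ** k * exp(-lam) / factorial(k)

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and sd sigma."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def exponential_pdf(x, lam):
    """Density of the waiting time between events with rate lam."""
    return lam * exp(-lam * x) if x >= 0 else 0.0

print("5 heads in 10 fair flips:", binomial_pmf(5, 10, 0.5))
print("0 emails when the average is 2 per hour:", round(poisson_pmf(0, 2), 4))
print("standard normal density at 0:", round(normal_pdf(0, 0, 1), 4))
print("exponential density at 0 with rate 2:", exponential_pdf(0, 2))
```

The PMF values are probabilities, while the PDF values are densities, which is why the exponential density can be larger than 1.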

Key Points:

Mean: Average value.
Variance: Spread of the data.
Skewness: Asymmetry of the distribution.

The normal distribution is one of the most important probability distributions in statistics, often called the "bell curve" due to its shape.

Key Features:

Symmetrical: The left and right sides of the curve are mirror images.
Mean, Median, Mode: All are equal and located at the center.
Standard Deviation: Measures the spread of the data. A larger standard deviation means a wider spread.

Skewness

Definition: Skewness measures the asymmetry of a distribution.
Types:
- Positive Skew: Tail on the right side is longer. The mean is typically greater than the median.
- Negative Skew: Tail on the left side is longer. The mean is typically less than the median.
Normal Distribution: A normal distribution has a skewness of 0, meaning it’s perfectly symmetrical.

Kurtosis

Definition: Kurtosis measures the "tailedness" of the distribution—how heavy or light the tails are.
Types:
- Leptokurtic: Tails are heavier than the normal distribution. Has a kurtosis greater than 3.
- Platykurtic: Tails are lighter than the normal distribution. Has a kurtosis less than 3.
- Mesokurtic: This is the normal distribution. Has a kurtosis of 3.

Distribution Levels in Normal Distribution

Standard Deviation: In a normal distribution, about 68% of values fall within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.
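
The 68%, 95%, and 99.7% figures can be reproduced from the standard normal CDF. This is a small sketch assuming Python 3.8 or later, where statistics.NormalDist is available.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.4f}")
```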

In simple terms:

Skewness tells you if your data leans to one side.
Kurtosis tells you how heavy or light the tails of your distribution are.
Normal distribution is your benchmark for symmetry and tail weight.

A z-table (or standard normal table) is used to find the probability that a statistic is observed below, above, or between values on the standard normal distribution. It shows the cumulative probability of a z-score up to a given point.

Key Points:

Z-Score: Represents the number of standard deviations a data point is from the mean.
Use: Helps in finding the area under the curve to the left of a given z-score.
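Format: Most tables give the cumulative probability from the far left tail up to the z-score of interest, which is the convention used in the example below. Some tables instead give the area from the mean (z = 0) up to the z-score, so check which kind you have.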

Example:

If you have a z-score of 1.96, the z-table shows a cumulative probability of about 0.975. This means there’s a 97.5% chance a value falls below a z-score of 1.96 in a standard normal distribution.
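
A z-table lookup can also be done in code. This sketch, again assuming statistics.NormalDist from Python 3.8 or later, converts a raw value to a z-score and reads off the cumulative probability. The exam-score numbers are made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# Table lookup: area to the left of z = 1.96.
left_area = standard_normal.cdf(1.96)

# Reverse lookup: which z-score has 97.5% of the area to its left?
z_for_975 = standard_normal.inv_cdf(0.975)

# Hypothetical exam: mean 70, standard deviation 10, a student scores 85.
z_exam = z_score(85, mu=70, sigma=10)

print("area left of 1.96:", round(left_area, 4))
print("z for 0.975:", round(z_for_975, 2))
print("exam z-score:", z_exam, "percentile:", round(standard_normal.cdf(z_exam), 4))
```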

Z-tables are super handy for calculating probabilities in statistics and help in various hypothesis testing and confidence interval calculations.

Hypothesis Testing is a method used to make statistical decisions using experimental data.

Key Steps:

State the Hypotheses:
- Null Hypothesis ( $H_{0}$ ): A statement that there is no effect or no difference. It is the default assumption.
- Alternative Hypothesis ( $H_{a}$ ): A statement that there is an effect or a difference. It is what you are looking for evidence to support.
Select the Significance Level ( $α$ ): Commonly 0.05, representing a 5% risk of concluding that an effect exists when there is none.
Choose the Test Statistic: Depending on your data and hypotheses (e.g., z-test, t-test).
Compute the Test Statistic and P-value:
- Test Statistic: A standardized value that is calculated from sample data during a hypothesis test.
- P-value: The probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.
Decision:
- Reject $H_{0}$ if the P-value is less than the significance level ( $α$ ).
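- Fail to Reject $H_{0}$ if the P-value is greater than or equal to the significance level. This does not prove that $H_{0}$ is true; it only means the data does not give enough evidence against it.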

Example:

Suppose you want to test if a new drug is effective.

$H_{0}$ : The drug has no effect.
$H_{a}$ : The drug has an effect.
$α$ : 0.05

If your test statistic yields a P-value of 0.03, you reject $H_{0}$ and conclude that the drug has an effect.

Hypothesis testing helps in making data-driven decisions and drawing conclusions about populations based on sample data.

Selecting the Significance Level ( $α$ )

This step involves deciding how stringent you want to be in determining whether to reject the null hypothesis ( $H_{0}$ ). The significance level, denoted by $α$ , is the threshold for this decision.

Common Levels:

0.05 (5%): This means there’s a 5% risk of concluding that a difference exists when there is no actual difference. It's a standard choice in many fields.
0.01 (1%): More stringent, used when you want to be extra cautious. Here, there's only a 1% risk of a false positive.
0.10 (10%): Less stringent, used when you can tolerate a higher risk of a false positive.

Interpretation:

P-value < $α$ : Reject $H_{0}$ . This suggests that the observed data is unlikely under the null hypothesis, indicating a statistically significant result.
P-value ≥ $α$ : Fail to reject $H_{0}$ . This suggests that the observed data is not sufficiently unusual under the null hypothesis, indicating that the result is not statistically significant.

The significance level $α$ is chosen before conducting the test and represents the probability of making a Type I error, which is rejecting the null hypothesis when it is actually true.

Choosing the Test Statistic

This step involves selecting the appropriate method to analyze your data based on your hypothesis and the type of data you have. Here are some common test statistics:

Z-Test:
- Use: When the sample size is large (n > 30) or the population variance is known.
- Formula: $Z = \frac{\bar{X} - μ}{σ / \sqrt{n}}$
- Example: Testing if the mean height of students is different from a known population mean.
T-Test:
- Use: When the population variance is unknown and has to be estimated from the sample. This matters most when the sample size is small (n < 30).
- Formula: $t = \frac{\bar{X} - μ}{s / \sqrt{n}}$ , where $s$ is the sample standard deviation.
- Example: Comparing the mean blood pressure levels before and after a treatment in a small group.
Chi-Square Test:
- Use: For categorical data to test relationships between variables.
- Formula: $χ^{2} = \sum \frac{(O_{i} - E_{i})^{2}}{E_{i}}$
- Example: Testing if there is an association between gender and voting preference.
ANOVA (Analysis of Variance):
- Use: To compare means of three or more samples.
- Formula: $F = \frac{\text{Variance between groups}}{\text{Variance within groups}}$
- Example: Comparing test scores across different teaching methods.

Selecting the correct test statistic depends on your hypothesis and data type, ensuring accurate and meaningful results.
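
Two of the test statistics above are easy to compute by hand. The sketch below runs a two-sided one-sample z-test and computes a chi-square statistic for a 2x2 table. All numbers are hypothetical, and the function names are made up for illustration. Getting a P-value for the t-test or the chi-square test needs the t or chi-square distribution, which is not in Python's standard library, so only the z-test P-value is computed here.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_sided(sample_mean, pop_mean, pop_sd, n):
    """Return (z, p_value) for a two-sided one-sample z-test."""
    z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def chi_square_statistic(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical z-test: known population mean 100 and sd 15,
# a sample of 36 with a sample mean of 105.
z, p = z_test_two_sided(sample_mean=105, pop_mean=100, pop_sd=15, n=36)
alpha = 0.05
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {p < alpha}")

# Hypothetical 2x2 table: rows are two groups, columns are two preferences.
chi2 = chi_square_statistic([[30, 20], [20, 30]])
print("chi-square statistic:", chi2)
```

For a 2x2 table there is 1 degree of freedom, and the critical value at α = 0.05 is about 3.841, so a statistic of 4.0 would be significant at that level.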

P-Value

The P-value is a measure in statistics that helps you determine the significance of your results in hypothesis testing.

Key Points:

Definition: The P-value represents the probability of obtaining test results at least as extreme as the observed results, assuming that the null hypothesis ( $H_{0}$ ) is true.
Interpretation:

If the P-value is low (typically less than 0.05), you reject the null hypothesis. This suggests that the observed effect is statistically significant.
If the P-value is high, you fail to reject the null hypothesis. This indicates that the observed effect could be due to random chance.

Pearson vs. Spearman Correlation

Pearson Correlation:

Definition: Measures the linear relationship between two continuous variables.
Range: -1 to 1
Formula: $r = \frac{\sum (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{\sum (x_{i} - \bar{x})^{2} \sum (y_{i} - \bar{y})^{2}}}$
Assumptions: The relationship is linear. Significance tests on r additionally assume the data is approximately normally distributed.
Use: Best for linear relationships. Example: Height and weight.

Spearman Correlation:

Definition: Measures the rank-order (monotonic) relationship between two variables.
Range: -1 to 1
Formula: $ρ = 1 - \frac{6 \sum d_{i}^{2}}{n (n^{2} - 1)}$ where $d_{i}$ is the difference in ranks and $n$ is the number of observations. This shortcut formula is exact only when there are no tied ranks.
Assumptions: None about the distribution; uses ranks instead of raw data.
Use: Best for monotonic relationships (doesn't have to be linear). Example: Rank in class and hours studied.

In summary:

Pearson: Linear relationships and continuous data.
Spearman: Monotonic relationships and ordinal or non-normally distributed data.
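
The difference shows up clearly on data that is monotonic but not linear. This sketch implements both formulas from this section and applies them to y = x³. The helper names are made up for illustration, and the rank helper assumes there are no ties, which is also what the Spearman shortcut formula requires.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson r from deviations around the two means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman rho via the rank-difference formula (no ties)."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # always increasing, but not a straight line

r = pearson(x, y)
rho = spearman(x, y)
print("Pearson r:", round(r, 4))
print("Spearman rho:", rho)
```

Spearman is exactly 1 because the ordering of y matches the ordering of x perfectly, while Pearson is below 1 because the points do not lie on a straight line.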

Parametric Testing

Definition: Parametric tests are statistical tests that make assumptions about the parameters (mean, variance) of the population distribution.
Assumptions:
- The data follows a normal distribution.
- The sample size is sufficiently large.
- Homogeneity of variance (equal variances).
Examples:
- T-test: Compares the means of two groups.
- ANOVA (Analysis of Variance): Compares the means among three or more groups.
- Z-test: Used for large samples to test the mean of a population.

Non-Parametric Testing

Definition: Non-parametric tests are statistical tests that do not assume a specific distribution for the population.
Assumptions:
- Few to no assumptions about the population parameters.
- Can be used for small sample sizes.
Examples:
- Mann-Whitney U test: Compares differences between two independent groups.
- Kruskal-Wallis test: Compares differences among three or more independent groups.
- Wilcoxon signed-rank test: Compares differences between two related samples.
- Chi-square test: Tests for an association between categorical variables.

Key Differences

Assumptions: Parametric tests assume underlying statistical distributions, non-parametric do not.
Data Type: Parametric tests are used for continuous data, non-parametric can be used for ordinal or non-normally distributed data.
Power: Parametric tests generally have more statistical power if their assumptions are met.

Use parametric tests when assumptions about the population are satisfied, and non-parametric tests when data doesn’t meet these assumptions or is on an ordinal scale.

One-Tail vs. Two-Tail Tests

One-Tail Test:

Definition: Tests if a parameter is either greater than or less than a certain value.
Hypotheses:
- $H_{0}$ : The parameter is equal to the value.
- $H_{a}$ : The parameter is either greater than or less than the value (but not both).
Use: When you're only interested in deviations in one direction.
Example: Testing if a new drug is more effective than the standard one (greater than only).

Two-Tail Test:

Definition: Tests if a parameter is different from a certain value (it can be either greater or less).
Hypotheses:
- $H_{0}$ : The parameter is equal to the value.
- $H_{a}$ : The parameter is different from the value (can be either greater or less).
Use: When you want to detect any deviation from the null hypothesis.
Example: Testing if a new teaching method changes scores compared with the traditional method (scores could be either higher or lower).

Key Differences:

Direction: One-tail focuses on one side of the distribution; two-tail covers both sides.
Significance Level: For a given $α$ , the critical region is concentrated in one tail for a one-tail test, and split between both tails for a two-tail test.

In short, choose a one-tail test when you have a specific direction in mind and a two-tail test when any direction of deviation is of interest.
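
The choice of tails can change the decision for the same data. In this sketch the test statistic z = 1.8 is a made-up value, and the P-values come from the standard normal distribution via statistics.NormalDist.

```python
from statistics import NormalDist

z = 1.8        # hypothetical test statistic
alpha = 0.05

# One-tail (upper): probability of a value at least this large.
p_one_tail = 1 - NormalDist().cdf(z)

# Two-tail: probability of a value at least this far from 0 in either direction.
p_two_tail = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"one-tail p = {p_one_tail:.4f}, reject H0: {p_one_tail < alpha}")
print(f"two-tail p = {p_two_tail:.4f}, reject H0: {p_two_tail < alpha}")
```

With these numbers the one-tail test rejects the null hypothesis and the two-tail test does not, which is why the direction should be decided before looking at the data.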

Errors in Hypothesis Testing

Type I Error:

Definition: Rejecting the null hypothesis ( $H_{0}$ ) when it is actually true.
Probability: Denoted by $α$ , which is the significance level of the test.
Consequence: False positive; concluding there is an effect when there isn’t one.

Type II Error:

Definition: Failing to reject the null hypothesis ( $H_{0}$ ) when it is actually false.
Probability: Denoted by $β$ .
Consequence: False negative; concluding there is no effect when there actually is one.

Example:

Testing a new drug:

Type I Error: Concluding the drug works when it doesn’t.
Type II Error: Concluding the drug doesn’t work when it does.

Central Limit Theorem (CLT)

Definition: This theorem states that the distribution of the sample mean approaches a normal distribution as the sample size grows, regardless of the shape of the population's distribution, provided the observations are independent and the population has a finite variance.

Law of Large Numbers

Definition: As a sample size increases, the sample mean will get closer to the population mean.
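
Both the Central Limit Theorem and the Law of Large Numbers can be seen in a quick simulation. This sketch draws from an exponential distribution with rate 1, which is strongly skewed and has mean 1 and standard deviation 1. The sample sizes, the number of repetitions, and the seed are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential distribution with rate 1."""
    return mean([random.expovariate(1.0) for _ in range(n)])

# Law of Large Numbers: one very large sample has a mean close to 1.
big_sample_mean = sample_mean(100_000)

# Central Limit Theorem: means of many samples of size 50 cluster around 1
# with a spread close to sigma / sqrt(n) = 1 / sqrt(50).
means = [sample_mean(50) for _ in range(2000)]
spread = stdev(means)
expected_spread = 1 / sqrt(50)

print("mean of one large sample:", round(big_sample_mean, 3))
print("average of the sample means:", round(mean(means), 3))
print("spread of the sample means:", round(spread, 3),
      "expected about", round(expected_spread, 3))
```

Plotting a histogram of means would show a roughly bell-shaped curve, even though the individual values come from a skewed distribution.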

Confidence Level

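Definition: The long-run proportion of confidence intervals, built the same way from repeated samples, that contain the true population parameter. Common levels are 90%, 95%, and 99%. A single computed 95% interval either contains the parameter or it does not; the 95% describes how reliable the procedure is.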

Power of a Test

Definition: The probability that the test correctly rejects a false null hypothesis (1 - $β$ ). It's related to the sample size and effect size.

Effect Size

Definition: A measure of the strength of the relationship between two variables or the size of an effect. Common measures include Cohen's d and Pearson's r.

Bayesian Statistics

Definition: An approach to statistics in which probabilities express a degree of belief in an event, rather than a frequency.

Resampling Methods

Bootstrap: A method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample.
Jackknife: A method for estimating the bias and variance of a statistical estimator by systematically leaving out each observation from the sample set.
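
Here is a minimal sketch of the bootstrap for the sample mean. The data, the number of resamples, and the seed are made up for illustration, and the percentile interval at the end is one simple way to turn the resampled means into an approximate 95% interval, not the only one.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical sample of 10 observations.
sample = [12, 15, 9, 20, 14, 11, 18, 16, 13, 17]

def bootstrap_means(data, n_resamples=5000):
    """Means of resamples drawn with replacement, each the same size as data."""
    return [mean(random.choices(data, k=len(data))) for _ in range(n_resamples)]

boot = sorted(bootstrap_means(sample))

# The spread of the resampled means estimates the standard error of the mean.
boot_se = stdev(boot)

# Percentile interval: cut off 2.5% of the resampled means on each side.
lower = boot[int(0.025 * len(boot))]
upper = boot[int(0.975 * len(boot))]

print("sample mean:", mean(sample))
print("bootstrap standard error:", round(boot_se, 3))
print("approximate 95% interval:", (lower, upper))
```

The jackknife follows the same idea but recomputes the estimate n times, leaving out one observation each time, instead of drawing random resamples.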
