- #ST2001 - Statistics in Data Science I
- Previous Topic: The Normal Distribution
- Next Topic: Hypothesis Testing
- Relevant Slides:
-
Probability & Statistics
- In Probability theory, we consider some known process which has some randomness or uncertainty.
- We model the outcomes by random variables, and we figure out the probabilities of what will happen.
- There is one correct answer to any probability question.
- In Statistical Inference, we observe something that has happened, and try to figure out what underlying process would explain those observations.
- The basic aim behind all statistical methods is to make inferences about a population by studying a relatively small sample from it.
- Probability is the engine that drives all statistical modelling, data analysis, & inference.
-
Sampling Distributions
- The probability distribution of a statistic is called a sampling distribution.
- Sampling distributions arise because samples vary.
- Each random sample will have a different value of the statistic.
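- A quick sketch in Python (the population and sample size are invented for illustration) shows the same statistic taking a different value in each random sample, which is what gives rise to a sampling distribution:
```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=2.0, size=100_000)  # hypothetical population

# Draw five random samples and compute the same statistic (the mean) on each:
# every sample yields a different value of the statistic.
for i in range(5):
    sample = rng.choice(population, size=30, replace=False)
    print(f"sample {i + 1}: mean = {sample.mean():.3f}")
```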
-
The Central Limit Theorem #card
card-last-interval:: -1
card-repeats:: 1
card-ease-factor:: 2.5
card-next-schedule:: 2022-11-15T00:00:00.000Z
card-last-reviewed:: 2022-11-14T20:14:02.517Z
card-last-score:: 1
id:: 6356abee-cb6a-48c5-8f8b-72122b6099eb
- The sampling distribution of any mean becomes more nearly Normal as the sample size grows.
- Observations must be independent.
- The shape of the population distribution doesn't matter.
- What is the Central Limit Theorem? #card
card-last-interval:: -1
card-repeats:: 1
card-ease-factor:: 2.5
card-next-schedule:: 2022-11-22T00:00:00.000Z
card-last-reviewed:: 2022-11-21T13:06:03.899Z
card-last-score:: 1
- The Central Limit Theorem states that ==sample means follow a Normal distribution centred on the population mean== with a standard deviation equal to the population standard deviation divided by the square root of the sample size.
-
\bar X \sim N (\mu, \frac{\sigma^2}{n})
-
- The CLT depends crucially on the assumption of independence.
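- A simulation sketch (skewed exponential population and sample sizes chosen arbitrarily) illustrates both claims: the population's shape doesn't matter, and the distribution of sample means loses its skew as n grows:
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# A strongly skewed population: the CLT says its shape doesn't matter.
for n in (2, 10, 50):
    # 10,000 independent samples of size n; take the mean of each sample (axis=1)
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:3d}: skewness of sample means = {stats.skew(means):.3f}")
    # skewness approaches 0 (the Normal's value) as n grows
```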
-
The Standard Error
- The Standard Error is a measure of the variability in the sampling distribution (i.e., how sample statistics vary about the unknown population parameter they are trying to estimate).
- It describes the typical "error" or "uncertainty" associated with the estimate.
-
SE = \frac{\sigma}{\sqrt{n}}
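- For example, with a hypothetical population standard deviation of 15 and a sample of size 36:
SE = \frac{15}{\sqrt{36}} = \frac{15}{6} = 2.5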
-
-
Interval Estimation for \mu
- Use the CLT to provide a range of values that will capture 95% of sample means.
- In repeated sampling, 95% of intervals calculated in this manner will contain the true mean \mu.
-
\bar{x} \pm 1.96 \times \frac{\sigma}{\sqrt{n}}
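- A minimal sketch in Python of computing this interval from one sample (the data and \sigma are invented for illustration):
```python
import numpy as np

rng = np.random.default_rng(7)
sigma = 4.0                                      # assumed known population SD
x = rng.normal(loc=20.0, scale=sigma, size=25)   # one observed sample

se = sigma / np.sqrt(len(x))                     # standard error of the mean
lo, hi = x.mean() - 1.96 * se, x.mean() + 1.96 * se
print(f"95% CI for mu: ({lo:.2f}, {hi:.2f})")
```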
-
-
Confidence Intervals
- The population mean \mu is fixed.
- The intervals from different samples are random.
- From our single sample, we only observe one of the intervals.
- Our interval may or may not contain the true mean.
- If we had taken many samples, and calculated the 95% CI for each, 95% of them would include the true mean.
- We say that we are "95% confident" that the interval contains the true mean.
- A point estimate (i.e., a statistic) is a single plausible value for a parameter.
- A point estimate is rarely perfect, usually there is some error in the estimate.
- Instead of supplying just a point estimate of a parameter, the next logical step is to provide a plausible range of values for the parameter.
- To do this, an estimate of the precision of the sample statistic (i.e., the estimate) is needed.
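- The "95% confident" interpretation can be checked by simulation (a sketch; the true parameters are arbitrary): over many repeated samples, roughly 95% of intervals constructed this way cover the true mean.
```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 50.0, 10.0, 30        # arbitrary "true" population parameters

covered = 0
for _ in range(10_000):
    x = rng.normal(mu, sigma, size=n)
    half_width = 1.96 * sigma / np.sqrt(n)
    if x.mean() - half_width <= mu <= x.mean() + half_width:
        covered += 1

print(f"coverage: {covered / 10_000:.1%}")   # close to 95%
```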
-
The t-distribution
- In practice, we cannot directly calculate the standard error for \bar{x} since we do not know the population standard deviation, \sigma.
- We can use the sample standard deviation s in place of \sigma for computing the standard error of \bar{x}:
-
SE = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}}
-
- This strategy tends to work well when we have a lot of data and can estimate \sigma using s accurately. However, this estimate is less precise with smaller samples, and this leads to problems when using the normal distribution to model \bar{x}.
- Enter a new distribution for inference calculations: the t-distribution.
- A t-distribution has a bell shape, but its tails are thicker than the Normal distribution's, meaning that observations are more likely to fall beyond two standard deviations from the mean than under the Normal distribution.
- The extra-thick tails of the t-distribution are exactly the correction needed to resolve the problem of using s in place of \sigma in the SE calculation.
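- A sketch of a t-based interval using SciPy (data simulated for illustration): the t critical value with n - 1 degrees of freedom replaces 1.96, and s replaces \sigma.
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(loc=100.0, scale=12.0, size=15)   # small sample, sigma unknown

n = len(x)
se = x.std(ddof=1) / np.sqrt(n)                  # s / sqrt(n), the estimated SE
t_crit = stats.t.ppf(0.975, df=n - 1)            # wider than 1.96 for small n
lo, hi = x.mean() - t_crit * se, x.mean() + t_crit * se
print(f"95% t-interval: ({lo:.2f}, {hi:.2f})")
```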
- The t-distribution is always centred at zero and has a single parameter: degrees of freedom.
- What are degrees of freedom? #card
card-last-interval:: -1
card-repeats:: 1
card-ease-factor:: 2.5
card-next-schedule:: 2022-11-15T00:00:00.000Z
card-last-reviewed:: 2022-11-14T16:21:05.697Z
card-last-score:: 1
- The degrees of freedom (df) describe the precise form of the bell-shaped t-distribution.
- In general, df = n - 1 where n is the sample size.
- That is, when we have more observations, the degrees of freedom will be larger and the t-distribution will look more like the standard normal distribution.
- When df \geq 30, the t-distribution is nearly indistinguishable from the normal distribution.
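- A quick check with SciPy of how the t critical value shrinks toward the Normal's 1.96 as the degrees of freedom grow:
```python
from scipy import stats

print(f"Normal: {stats.norm.ppf(0.975):.3f}")                 # 1.960
for df in (2, 5, 10, 30, 100):
    print(f"t, df={df:3d}: {stats.t.ppf(0.975, df=df):.3f}")  # approaches 1.960
```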
-
The Bootstrap
- We can quantify the variability of sample statistics using theory, e.g. ((6356abee-cb6a-48c5-8f8b-72122b6099eb)), or by simulation via bootstrapping.
-
Bootstrapping Scheme #card
card-last-interval:: -1
card-repeats:: 1
card-ease-factor:: 2.5
card-next-schedule:: 2022-11-22T00:00:00.000Z
card-last-reviewed:: 2022-11-21T13:10:41.508Z
card-last-score:: 1
1. Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample.
2. Calculate the bootstrap statistic - a statistic such as mean, median, proportion, etc., computed on the bootstrap sample.
3. Repeat steps 1. & 2. many times to create a bootstrap distribution - a distribution of bootstrap statistics.
4. Calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution.
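- A minimal sketch of this scheme in Python (the original sample is invented, and 95% stands in for XX%):
```python
import numpy as np

rng = np.random.default_rng(11)
sample = rng.exponential(scale=3.0, size=40)     # the original sample (illustrative)

# Steps 1 & 2, repeated many times: resample with replacement, compute the statistic.
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(5_000)
])

# Step 4: the middle 95% of the bootstrap distribution gives the 95% CI.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lo:.2f}, {hi:.2f})")
```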
-
-
Theorem 9.2
- If \bar{x} is used as an estimate of \mu, we can be 100(1 - \alpha)\% confident that the error will not exceed a specified amount e when the sample size is
-
n = \left(\frac{z_{\alpha/2} \, \sigma}{e}\right)^2
-
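- A worked sketch of the theorem (\sigma and e are assumed values): at the 95% level, z_{\alpha/2} \approx 1.96, and we round n up to the next whole number.
```python
import math
from scipy import stats

sigma = 5.0     # assumed population standard deviation
e = 1.0         # maximum tolerated error in estimating mu
alpha = 0.05    # 95% confidence level

z = stats.norm.ppf(1 - alpha / 2)   # z_{alpha/2} ~ 1.96
n = (z * sigma / e) ** 2
print(math.ceil(n))                 # round up: n = 97
```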