- #[[ST2001 - Statistics in Data Science I]]
- **Previous Topic:** null
- **Next Topic:** [[Sampling]]
- **Relevant Slides:** ![Lecture01.pdf](../assets/Lecture01_1662914505882_0.pdf)
-
- ## What is / are Statistics?
collapsed:: true
- What is a **statistic**? #card
card-last-interval:: 4
card-repeats:: 2
card-ease-factor:: 2.66
card-next-schedule:: 2022-11-26T13:40:07.042Z
card-last-reviewed:: 2022-11-22T13:40:07.042Z
card-last-score:: 5
- A **statistic** is any quantity computed from sample data.
- What is the **Science of Statistics**?
card-last-interval:: 10.24
card-repeats:: 3
card-ease-factor:: 2.56
card-next-schedule:: 2022-10-13T16:37:15.281Z
card-last-reviewed:: 2022-10-03T11:37:15.284Z
card-last-score:: 5
- The collecting, classifying, summarising, organising, analysing, estimation, and interpretation of information.
- What is the **role of statistics**?
card-last-interval:: 9.28
card-repeats:: 3
card-ease-factor:: 2.32
card-next-schedule:: 2022-10-09T18:10:56.000Z
card-last-reviewed:: 2022-09-30T12:10:56.000Z
card-last-score:: 5
- The field of statistics deals with the collection, presentation, analysis, and use of data to:
- make decisions
- solve problems
- design products & processes
- Statistics is the ^^science of uncertainty.^^
- What is the **role of probability** in statistics?
- **Probability** provides the framework for the study & application of statistics.
-
- ## Types of Statistics
collapsed:: true
- What is **Descriptive Statistics**? #card
card-last-interval:: 41.44
card-repeats:: 5
card-ease-factor:: 2.18
card-next-schedule:: 2022-12-30T04:36:51.695Z
card-last-reviewed:: 2022-11-18T18:36:51.696Z
card-last-score:: 3
- **Descriptive Statistics** is the science of summarising data, both numerically & graphically.
- The applicable analysis methods depend on the variable being measured and the research questions that you are trying to answer.
- What is **Inferential Statistics**? #card
card-last-interval:: 4
card-repeats:: 2
card-ease-factor:: 2.66
card-next-schedule:: 2022-11-26T13:37:26.721Z
card-last-reviewed:: 2022-11-22T13:37:26.722Z
card-last-score:: 3
- **Inferential Statistics** is the science of using the ^^information in your sample^^ to ^^infer^^ something about the population of interest.
-
- ## Important Terms
collapsed:: true
- What is an **experimental unit**? #card
collapsed:: true
card-last-interval:: 33.64
card-repeats:: 4
card-ease-factor:: 2.9
card-next-schedule:: 2022-12-18T07:44:53.701Z
card-last-reviewed:: 2022-11-14T16:44:53.701Z
card-last-score:: 5
- An **experimental unit** / individual is a single object upon which we collect data, e.g., a person, thing, transaction, or event.
- What is a **population**? #card
collapsed:: true
card-last-interval:: 33.64
card-repeats:: 4
card-ease-factor:: 2.9
card-next-schedule:: 2022-12-18T07:49:52.921Z
card-last-reviewed:: 2022-11-14T16:49:52.922Z
card-last-score:: 5
- A **population** is a ^^collection of experimental units^^ / individuals that we are interested in studying.
- What is a **sample**? #card
card-last-interval:: 33.64
card-repeats:: 4
card-ease-factor:: 2.9
card-next-schedule:: 2022-12-18T07:51:36.002Z
card-last-reviewed:: 2022-11-14T16:51:36.002Z
card-last-score:: 5
- A **sample** is a subset of experimental units from the population.
- What is a **variable**? #card
collapsed:: true
card-last-interval:: 10.64
card-repeats:: 3
card-ease-factor:: 2.66
card-next-schedule:: 2022-11-25T07:35:49.776Z
card-last-reviewed:: 2022-11-14T16:35:49.776Z
card-last-score:: 5
- A **variable** is a ^^characteristic or property of an individual experimental unit^^.
- A variable may be measured, or more generally "observed" on each individual.
- What is **Qualitative Data**? #card
card-last-interval:: 8.72
card-repeats:: 3
card-ease-factor:: 2.18
card-next-schedule:: 2022-11-23T09:37:05.082Z
card-last-reviewed:: 2022-11-14T16:37:05.082Z
card-last-score:: 3
- **Qualitative Data** is data which can be classified into categories.
- Two types of Qualitative Data:
- **Ordinal:** ordered qualitative data - e.g., a grade
- **Nominal:** unordered qualitative data - e.g., gender, method of payment
- What is **Quantitative Data**? #card
card-last-interval:: 30.47
card-repeats:: 4
card-ease-factor:: 2.76
card-next-schedule:: 2022-12-15T07:21:06.252Z
card-last-reviewed:: 2022-11-14T20:21:06.252Z
card-last-score:: 5
- **Quantitative Data** is data in the form of counts or numbers - it cannot be classified into categories.
- Two types of Quantitative Data:
- **Discrete:** non-divisible, single points of data, **counts** - e.g., number of texts sent
- **Continuous:** measurements that can take any value within an interval, so infinitely many values are possible between any two points on the number scale - e.g., age, rent, temperature
-
- Pie charts make data very difficult to interpret & read - **don't use them**.
-
- ## Numerical Summaries
- ### Central Tendency
- What is a **numerical summary**? #card
card-last-score:: 5
card-repeats:: 5
card-next-schedule:: 2023-02-06T22:21:22.599Z
card-last-interval:: 84.1
card-ease-factor:: 2.76
card-last-reviewed:: 2022-11-14T20:21:22.599Z
- A **numerical summary** is a way of summarising categorical data using a frequency count or percentage.
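- A minimal sketch of a frequency count and percentage summary for a categorical variable. The `payments` sample is hypothetical and only there for illustration:
```python
from collections import Counter

# Hypothetical sample of a categorical variable (method of payment).
payments = ["card", "cash", "card", "card", "cash", "voucher", "card", "cash"]

counts = Counter(payments)  # frequency count of each category
n = len(payments)           # sample size
percentages = {category: 100 * count / n for category, count in counts.items()}

# Print the categories from most to least frequent.
for category, count in counts.most_common():
    print(f"{category}: {count} ({percentages[category]:.1f}%)")
```
- The percentages always add up to 100, which is a quick sanity check on the table.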
- How do you calculate the **sample mean**? #card
card-last-interval:: 29.26
card-repeats:: 4
card-ease-factor:: 2.66
card-next-schedule:: 2022-12-13T22:47:38.996Z
card-last-reviewed:: 2022-11-14T16:47:38.997Z
card-last-score:: 5
- Suppose that the observations in a sample are $x_1,\ x_2,\ ...\ ,\ x_n$. The **sample mean**, denoted by $\bar{x}$, is:
- $$\bar{x} = \sum_{i=1}^{n}\frac{x_i}{n}=\frac{x_1+x_2+...+x_n}{n}$$
:LOGBOOK:
CLOCK: [2022-09-12 Mon 18:35:39]
:END:
- ^^The sample mean is **sensitive** to extreme values^^
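- The formula can be sketched in a few lines of Python. The function name `sample_mean` and the `rents` data are hypothetical, chosen to show how one extreme value drags the mean:
```python
def sample_mean(xs):
    """Return the sum of the observations divided by the sample size n."""
    if not xs:
        raise ValueError("sample_mean requires at least one observation")
    return sum(xs) / len(xs)


rents = [500, 520, 540, 560, 580]   # hypothetical monthly rents
with_outlier = rents[:-1] + [5000]  # replace the largest value with an extreme one

print(sample_mean(rents))         # 540.0
print(sample_mean(with_outlier))  # 1424.0
```
- Changing a single observation moves the mean far away from the typical rent, which is what "sensitive to extreme values" means in practice.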
- How do you calculate the **sample median**? #card
card-last-interval:: 8.72
card-repeats:: 3
card-ease-factor:: 2.18
card-next-schedule:: 2022-11-23T09:49:27.636Z
card-last-reviewed:: 2022-11-14T16:49:27.637Z
card-last-score:: 5
- Given that the observations in a sample are $x_1,\ x_2,\ ...\ ,\ x_n$, arranged in **increasing order** of magnitude, the **sample median** is:
- $$\tilde{x} = \begin{cases}x_{(n+1)/2}, & \text{if $n$ is odd},\\ \frac{1}{2}(x_{n/2} + x_{n/2+1}), &\text{if $n$ is even.} \\ \end{cases}$$
- ^^The sample median is **not** sensitive to extreme values.^^
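- A minimal sketch of the two cases of the median formula. The function name `sample_median` and the data are hypothetical; note that Python lists are 0-indexed, while the formula uses 1-based positions:
```python
def sample_median(xs):
    """Return the middle value of the ordered sample (mean of the two middle values if n is even)."""
    ordered = sorted(xs)  # the formula assumes increasing order
    n = len(ordered)
    if n == 0:
        raise ValueError("sample_median requires at least one observation")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # position (n + 1) / 2 in 1-based indexing
    # positions n / 2 and n / 2 + 1 in 1-based indexing
    return (ordered[mid - 1] + ordered[mid]) / 2


texts_sent = [3, 7, 1, 9, 4]      # hypothetical counts of texts sent
with_outlier = [3, 7, 1, 900, 4]  # same sample with the largest value made extreme

print(sample_median(texts_sent))    # 4
print(sample_median(with_outlier))  # 4
print(sample_median([3, 7, 1, 9]))  # 5.0
```
- The extreme value changes nothing here because only the ordering of the middle observations matters.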
- What is the **mode**? #card
card-last-interval:: 33.64
card-repeats:: 4
card-ease-factor:: 2.9
card-next-schedule:: 2022-12-18T07:51:42.814Z
card-last-reviewed:: 2022-11-14T16:51:42.815Z
card-last-score:: 5
- The **mode** is the most frequent observation in a dataset.
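- A short sketch of finding the mode. Returning a list of all tied values is an assumption made here, since the definition above does not say what to do when several observations are equally frequent; the name `modes` and the data are hypothetical:
```python
from collections import Counter


def modes(xs):
    """Return every value that occurs with the highest frequency."""
    if not xs:
        raise ValueError("modes requires at least one observation")
    counts = Counter(xs)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


grades = ["B", "A", "B", "C", "B", "A"]  # hypothetical ordinal data

print(modes(grades))           # ['B']
print(modes([1, 1, 2, 2, 3]))  # [1, 2]
```
- Unlike the mean and median, the mode also works for qualitative data such as grades.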
- ### Variation
collapsed:: true
- What is the **range** of a sample? #card
collapsed:: true
card-last-interval:: 33.64
card-repeats:: 4
card-ease-factor:: 2.9
card-next-schedule:: 2022-12-18T07:51:53.283Z
card-last-reviewed:: 2022-11-14T16:51:53.283Z
card-last-score:: 5
- The **range** of a sample is the **maximum** - **minimum**.
- The range is a ^^poor measure of spread^^: it depends only on the two most extreme values, so it is ^^badly affected by outliers.^^
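- A small sketch showing how one outlier inflates the range. The function name `sample_range` and the temperature data are hypothetical:
```python
def sample_range(xs):
    """Return the maximum minus the minimum of the sample."""
    if not xs:
        raise ValueError("sample_range requires at least one observation")
    return max(xs) - min(xs)


temperatures = [14.0, 15.5, 16.0, 17.5, 18.0]  # hypothetical readings
with_outlier = [14.0, 15.5, 16.0, 17.5, 41.0]  # one faulty reading

print(sample_range(temperatures))  # 4.0
print(sample_range(with_outlier))  # 27.0
```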
- #### Interquartile Range
- What is the **interquartile range** of a sample? #card
card-last-interval:: 41.44
card-repeats:: 5
card-ease-factor:: 2.18
card-next-schedule:: 2022-12-26T06:18:35.901Z
card-last-reviewed:: 2022-11-14T20:18:35.901Z
card-last-score:: 5
- The **interquartile range** is the range spanned by the middle 50% of the data.
- Therefore, it is ^^robust to outliers.^^
- To calculate the **IQR**, first split the ordered data into four quarters, then subtract the value at $Q_1$ from the value at $Q_3$.
- $$IQR=Q_3-Q_1$$
- ![image.png](../assets/image_1663005545935_0.png)
- #### Tukey's Method for IQR
- There are also many other methods for calculating IQR.
- 1. Put data in **ascending** order.
2. The **lower quartile** ($Q_1$) is the **median** of the **lower** 50% of the data, including the median.
3. The **upper quartile** ($Q_3$) is the **median** of the **upper** 50% of the data, including the median.
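- The steps above can be sketched as follows. The name `tukey_quartiles` and the data are hypothetical, and the handling of an even sample size (where no single observation is the median, so the data simply split into two equal halves) is an assumption:
```python
from statistics import median


def tukey_quartiles(xs):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the ordered data."""
    ordered = sorted(xs)  # step 1: ascending order
    n = len(ordered)
    if n < 2:
        raise ValueError("tukey_quartiles requires at least two observations")
    half = (n + 1) // 2               # each half includes the median when n is odd
    q1 = median(ordered[:half])       # step 2: median of the lower half
    q3 = median(ordered[n - half:])   # step 3: median of the upper half
    return q1, q3


data = [1, 3, 4, 7, 9, 12, 15]  # hypothetical sample, n = 7
q1, q3 = tukey_quartiles(data)
iqr = q3 - q1

print(q1, q3, iqr)  # 3.5 10.5 7.0
```
- Other quartile conventions can give slightly different values for the same data, so results may not match a given software package exactly.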
- #### Standard Deviation #card
card-last-interval:: 4.14
card-repeats:: 2
card-ease-factor:: 2.56
card-next-schedule:: 2022-11-27T15:16:26.346Z
card-last-reviewed:: 2022-11-23T12:16:26.350Z
card-last-score:: 5
- A common measure of spread is the **standard deviation**, which takes into account how far *each* data value is from the mean.
- A **deviation** is the distance of a datapoint from the mean.
- Since the deviations always sum to zero, we square each deviation and take an average of the squared deviations, called the **variance**.
- We then take the positive square root of the **sample variance** to get the **sample standard deviation**, which is preferable to the sample variance, as the sample variance is in squared units.
- The **standard deviation** is ^^sensitive to outliers.^^
- How do you calculate the **sample variance**, and hence, the **sample standard deviation**? #card
card-last-interval:: 10.97
card-repeats:: 3
card-ease-factor:: 2.56
card-next-schedule:: 2022-11-25T15:37:19.443Z
card-last-reviewed:: 2022-11-14T16:37:19.444Z
card-last-score:: 3
- The **sample variance**, denoted by $s^2$, is given by:
- $$s^2=\sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n-1}$$
- The **sample standard deviation**, denoted by $s$, is the **positive square root** of $s^2$, that is:
- $$s=\sqrt{s^2}$$
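- A minimal sketch of both formulas, dividing by $n-1$ as above. The function names and the data are hypothetical:
```python
from math import sqrt


def sample_variance(xs):
    """Return the sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("sample_variance requires at least two observations")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_sd(xs):
    """Return the positive square root of the sample variance."""
    return sqrt(sample_variance(xs))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample with mean 5
mean = sum(data) / len(data)

print(sum(x - mean for x in data))  # 0.0, the raw deviations cancel out
print(sample_variance(data))        # 32 / 7, in squared units
print(sample_sd(data))              # back in the original units
```
- The first printed value shows why the deviations are squared before averaging: without squaring they cancel to zero.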
- ### Shape
- #### Graphical Summaries of Data
- Depends on the variable of interest.
- **Categorical** response variable -> bar chart or pie chart.
- **Categorical** response variable ^^with an explanatory variable^^ -> grouped bar chart.
- **Continuous** response variable -> histogram, boxplot, density plot.
- **Continuous** response variable ^^with an explanatory variable^^ -> grouped boxplot.
-
- What is a **boxplot**? #card
card-last-interval:: 5.52
card-repeats:: 3
card-ease-factor:: 2.46
card-next-schedule:: 2022-11-20T04:34:53.491Z
card-last-reviewed:: 2022-11-14T16:34:53.492Z
card-last-score:: 3
- A **boxplot** is a graphical display showing centre, spread, shape, & outliers.
- It displays the **5-number summary**:
- *min, Q1, median, Q3, max*
- ![image.png](../assets/image_1663236210540_0.png)
- What is a **histogram**? #card
card-last-interval:: -1
card-repeats:: 1
card-ease-factor:: 2.7
card-next-schedule:: 2022-11-15T00:00:00.000Z
card-last-reviewed:: 2022-11-14T20:00:35.072Z
card-last-score:: 1
- **Histograms** are useful to show the general shape, location, and spread of data values.
- Representation by *area*.
- **Construction**
- Determine the range of the data (*minimum, maximum*).
- Split into convenient intervals or *bins*.
- Usually use 5 to 15 intervals.
- Count number of observations in each interval - *frequency*.
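- The construction steps can be sketched as below. Equal-width bins, left-closed intervals, and placing the maximum in the last bin are assumptions made for this sketch; the name `histogram_counts` and the `ages` data are hypothetical:
```python
def histogram_counts(xs, n_bins):
    """Return (bin edges, frequencies) for n_bins equal-width intervals."""
    lo, hi = min(xs), max(xs)  # determine the range of the data
    if lo == hi:
        raise ValueError("all observations are equal, so no bins can be formed")
    width = (hi - lo) / n_bins  # split the range into equal intervals
    counts = [0] * n_bins
    for x in xs:
        # Each bin is [left edge, right edge); the maximum goes in the last bin.
        index = min(int((x - lo) / width), n_bins - 1)
        counts[index] += 1      # count the observations in each interval
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


ages = [18, 19, 21, 22, 22, 25, 27, 30, 34, 38]  # hypothetical sample
edges, counts = histogram_counts(ages, 5)

print(edges)   # [18.0, 22.0, 26.0, 30.0, 34.0, 38.0]
print(counts)  # [3, 3, 1, 1, 2]
```
- With equal-width bins, the bar heights are proportional to the frequencies, so area and height tell the same story.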
- When talking about the shape of the data, make sure to address the following 3 questions:
- 1. Does the histogram have a single, central hump or several well-separated bumps?
2. Is the histogram or boxplot **symmetric**, or more spread out in one direction (skewed)?
3. Any unusual features? e.g., outliers, spikes.
- ![image.png](../assets/image_1663237164731_0.png)
- ![image.png](../assets/image_1663237245117_0.png)
-
- #### Explanatory & Response Variables
collapsed:: true
- To identify the **explanatory** variable in a pair of variables, identify which of the two is suspected of affecting the other and plan an appropriate analysis.
- explanatory variable -might affect-> response variable
- continent -might affect-> life expectancy.
-
- ## R Markdown
- What is **R Markdown**? #card
card-last-interval:: 28.93
card-repeats:: 4
card-ease-factor:: 2.56
card-next-schedule:: 2022-12-13T14:49:15.636Z
card-last-reviewed:: 2022-11-14T16:49:15.637Z
card-last-score:: 5
- **R Markdown** is a file format for making ^^dynamic documents in R.^^
- R Markdown is written in Markdown and contains chunks of embedded R code (data management, summaries, graphics, analysis & interpretation) all in one document.
- Documents can be **knitted** to HTML, PDF, Word, and many other formats.
- ### Key Benefits of R Markdown
- Makes it easy to produce statistical reports with code, analysis, outputs, and write-up all in one place.
- Perfect for reproducible research.
- Easy to convert to different document types.
- ### Structure
- R Markdown contains **three** types of content:
- A **YAML Header**.
- Text, formatted with Markdown.
- Code chunks.