uni/year2/semester1/logseq-stuff/pages/Exploratory Data Analysis.md

- #[[ST2001 - Statistics in Data Science I]]
- **Previous Topic:** null
- **Next Topic:** [[Sampling]]
- **Relevant Slides:** ![Lecture01.pdf](../assets/Lecture01_1662914505882_0.pdf)
-
- ## What is / are Statistics?
  collapsed:: true
	- What is a **statistic**? #card
	  card-last-interval:: 4
	  card-repeats:: 2
	  card-ease-factor:: 2.66
	  card-next-schedule:: 2022-11-26T13:40:07.042Z
	  card-last-reviewed:: 2022-11-22T13:40:07.042Z
	  card-last-score:: 5
		- A **statistic** is any quantity computed from sample data.
	- What is the **Science of Statistics**?
	  card-last-interval:: 10.24
	  card-repeats:: 3
	  card-ease-factor:: 2.56
	  card-next-schedule:: 2022-10-13T16:37:15.281Z
	  card-last-reviewed:: 2022-10-03T11:37:15.284Z
	  card-last-score:: 5
		- The collecting, classifying, summarising, organising, analysing, estimating, and interpreting of information.
	- What is the **role of statistics**?
	  card-last-interval:: 9.28
	  card-repeats:: 3
	  card-ease-factor:: 2.32
	  card-next-schedule:: 2022-10-09T18:10:56.000Z
	  card-last-reviewed:: 2022-09-30T12:10:56.000Z
	  card-last-score:: 5
		- The field of statistics deals with the collection, presentation, analysis, and use of data to:
			- make decisions
			- solve problems
			- design products & processes
		- Statistics is the ^^science of uncertainty.^^
	- What is the **role of probability** in statistics?
		- **Probability** provides the framework for the study & application of statistics.
-
- ## Types of Statistics
  collapsed:: true
	- What is **Descriptive Statistics**? #card
	  card-last-interval:: 41.44
	  card-repeats:: 5
	  card-ease-factor:: 2.18
	  card-next-schedule:: 2022-12-30T04:36:51.695Z
	  card-last-reviewed:: 2022-11-18T18:36:51.696Z
	  card-last-score:: 3
		- **Descriptive Statistics** is the science of summarising data, both numerically & graphically.
		- The applicable analysis methods depend on the variable being measured and the research questions that you are trying to answer.
	- What is **Inferential Statistics**? #card
	  card-last-interval:: 4
	  card-repeats:: 2
	  card-ease-factor:: 2.66
	  card-next-schedule:: 2022-11-26T13:37:26.721Z
	  card-last-reviewed:: 2022-11-22T13:37:26.722Z
	  card-last-score:: 3
		- **Inferential Statistics** is the science of using the ^^information in your sample^^ to ^^infer^^ something about the population of interest.
-
- ## Important Terms
  collapsed:: true
	- What is an **experimental unit**? #card
	  collapsed:: true
	  card-last-interval:: 33.64
	  card-repeats:: 4
	  card-ease-factor:: 2.9
	  card-next-schedule:: 2022-12-18T07:44:53.701Z
	  card-last-reviewed:: 2022-11-14T16:44:53.701Z
	  card-last-score:: 5
		- An **experimental unit** / individual is a single object upon which we collect data, e.g., a person, thing, transaction, or event.
	- What is a **population**? #card
	  collapsed:: true
	  card-last-interval:: 33.64
	  card-repeats:: 4
	  card-ease-factor:: 2.9
	  card-next-schedule:: 2022-12-18T07:49:52.921Z
	  card-last-reviewed:: 2022-11-14T16:49:52.922Z
	  card-last-score:: 5
		- A **population** is a ^^collection of experimental units^^ / individuals that we are interested in studying.
	- What is a **sample**? #card
	  card-last-interval:: 33.64
	  card-repeats:: 4
	  card-ease-factor:: 2.9
	  card-next-schedule:: 2022-12-18T07:51:36.002Z
	  card-last-reviewed:: 2022-11-14T16:51:36.002Z
	  card-last-score:: 5
		- A **sample** is a subset of experimental units from the population.
	- What is a **variable**? #card
	  collapsed:: true
	  card-last-interval:: 10.64
	  card-repeats:: 3
	  card-ease-factor:: 2.66
	  card-next-schedule:: 2022-11-25T07:35:49.776Z
	  card-last-reviewed:: 2022-11-14T16:35:49.776Z
	  card-last-score:: 5
		- A **variable** is a ^^characteristic or property of an individual experimental unit^^.
		- A variable may be measured or, more generally, "observed" on each individual.
	- What is **Qualitative Data**? #card
	  card-last-interval:: 8.72
	  card-repeats:: 3
	  card-ease-factor:: 2.18
	  card-next-schedule:: 2022-11-23T09:37:05.082Z
	  card-last-reviewed:: 2022-11-14T16:37:05.082Z
	  card-last-score:: 3
		- **Qualitative Data** is data which can be classified into categories.
		- Two types of Qualitative Data:
			- **Ordinal:** ordered qualitative data - e.g., a grade
			- **Nominal:** unordered qualitative data - e.g., gender, method of payment
	- What is **Quantitative Data**? #card
	  card-last-interval:: 30.47
	  card-repeats:: 4
	  card-ease-factor:: 2.76
	  card-next-schedule:: 2022-12-15T07:21:06.252Z
	  card-last-reviewed:: 2022-11-14T20:21:06.252Z
	  card-last-score:: 5
		- **Quantitative Data** is data in the form of counts or numbers - it cannot be classified into categories.
		- Two types of Quantitative Data:
			- **Discrete:** non-divisible, single points of data, **counts** - e.g., number of texts sent
			- **Continuous:** measurements that can take any of the infinitely many values between two points on a number scale - e.g., age, rent, temperature
-
- Pie charts make data very difficult to interpret & read - **don't use them**.
-
- ## Numerical Summaries
	- ### Central Tendency
		- What is a **numerical summary**? #card
		  card-last-score:: 5
		  card-repeats:: 5
		  card-next-schedule:: 2023-02-06T22:21:22.599Z
		  card-last-interval:: 84.1
		  card-ease-factor:: 2.76
		  card-last-reviewed:: 2022-11-14T20:21:22.599Z
			- A **numerical summary** is a number computed from the data that summarises it. For categorical data, this is a frequency count or percentage.
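			- A minimal Python sketch of a frequency count and percentage for a categorical variable. The `payments` data is made up for illustration and is not from the lecture.
			  ```python
			  from collections import Counter

			  # Hypothetical sample of a categorical variable (method of payment).
			  payments = ["card", "cash", "card", "card", "cash", "voucher"]

			  counts = Counter(payments)  # frequency count per category
			  n = len(payments)
			  # Percentage of the sample falling in each category.
			  percentages = {category: 100 * count / n for category, count in counts.items()}

			  for category, count in counts.items():
			      print(f"{category}: {count} ({percentages[category]:.1f}%)")
			  ```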
		- How do you calculate the **sample mean**? #card
		  card-last-interval:: 29.26
		  card-repeats:: 4
		  card-ease-factor:: 2.66
		  card-next-schedule:: 2022-12-13T22:47:38.996Z
		  card-last-reviewed:: 2022-11-14T16:47:38.997Z
		  card-last-score:: 5
			- Suppose that the observations in a sample are $x_1,\ x_2,\ ...\ ,\ x_n$. The **sample mean**, denoted by $\bar{x}$, is:
				- $$\bar{x} = \sum_{i=1}^{n}\frac{x_i}{n}=\frac{x_1+x_2+...+x_n}{n}$$
				  :LOGBOOK:
				  CLOCK: [2022-09-12 Mon 18:35:39]
				  :END:
			- ^^The sample mean is **sensitive** to extreme values^^
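			- A small Python sketch of the sample mean, showing how one extreme value drags it upwards. The function name `sample_mean` and the data are assumptions made for illustration.
			  ```python
			  def sample_mean(xs):
			      """Sum of the observations divided by the number of observations."""
			      return sum(xs) / len(xs)

			  # Made-up observations; the second list adds one extreme value.
			  data = [2, 4, 4, 5, 5]
			  with_outlier = data + [100]

			  print(sample_mean(data))          # 4.0
			  print(sample_mean(with_outlier))  # 20.0
			  ```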
		- How do you calculate the **sample median**? #card
		  card-last-interval:: 8.72
		  card-repeats:: 3
		  card-ease-factor:: 2.18
		  card-next-schedule:: 2022-11-23T09:49:27.636Z
		  card-last-reviewed:: 2022-11-14T16:49:27.637Z
		  card-last-score:: 5
			- Given that the observations in a sample are $x_1,\ x_2,\ ...\ ,\ x_n$, arranged in **increasing order** of magnitude, the **sample median** is:
				- $$\tilde{x} = \begin{cases}x_{(n+1)/2}, & \text{if $n$ is odd},\\ \frac{1}{2}(x_{n/2} + x_{n/2+1}), &\text{if $n$ is even.}\end{cases}$$
				- ^^The sample median is **not** sensitive to extreme values.^^
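			- A small Python sketch of the two cases in the median formula (odd and even $n$). The name `sample_median` and the data are illustrative assumptions. Python lists are 0-indexed, so the indices are shifted by one relative to the formula.
			  ```python
			  def sample_median(xs):
			      """Middle value of the sorted data, or the mean of the two middle values."""
			      ordered = sorted(xs)
			      n = len(ordered)
			      mid = n // 2
			      if n % 2 == 1:
			          return ordered[mid]  # x_((n+1)/2) in 1-based indexing
			      return (ordered[mid - 1] + ordered[mid]) / 2  # mean of x_(n/2) and x_(n/2+1)

			  data = [2, 4, 4, 5, 5]
			  print(sample_median(data))          # 4
			  print(sample_median(data + [100]))  # 4.5
			  ```
			  Adding the extreme value 100 moves the median only from 4 to 4.5.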
		- What is the **mode**? #card
		  card-last-interval:: 33.64
		  card-repeats:: 4
		  card-ease-factor:: 2.9
		  card-next-schedule:: 2022-12-18T07:51:42.814Z
		  card-last-reviewed:: 2022-11-14T16:51:42.815Z
		  card-last-score:: 5
			- The **mode** is the most frequent observation in a dataset.
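			- A short Python sketch of the mode using the standard library's `statistics.multimode` (Python 3.8+), which returns every value tied for most frequent. The data is made up for illustration.
			  ```python
			  from statistics import multimode

			  # One clear mode.
			  single = multimode([1, 2, 2, 3, 3, 3])
			  # Two values tied for most frequent, so both are returned.
			  tied = multimode(["a", "b", "a", "b"])

			  print(single)  # [3]
			  print(tied)    # ['a', 'b']
			  ```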
	- ### Variation
	  collapsed:: true
		- What is the **range** of a sample? #card
		  collapsed:: true
		  card-last-interval:: 33.64
		  card-repeats:: 4
		  card-ease-factor:: 2.9
		  card-next-schedule:: 2022-12-18T07:51:53.283Z
		  card-last-reviewed:: 2022-11-14T16:51:53.283Z
		  card-last-score:: 5
			- The **range** of a sample is the **maximum** minus the **minimum**.
			- The range is a ^^poor measure of spread because it is badly affected by outliers.^^
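			- A small Python sketch of the range, showing how a single outlier inflates it. The name `sample_range` and the data are illustrative assumptions.
			  ```python
			  def sample_range(xs):
			      """Largest observation minus smallest observation."""
			      return max(xs) - min(xs)

			  data = [2, 4, 4, 5, 5]
			  print(sample_range(data))          # 3
			  print(sample_range(data + [100]))  # 98
			  ```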
		- #### Interquartile Range
			- What is the **interquartile range** of a sample? #card
			  card-last-interval:: 41.44
			  card-repeats:: 5
			  card-ease-factor:: 2.18
			  card-next-schedule:: 2022-12-26T06:18:35.901Z
			  card-last-reviewed:: 2022-11-14T20:18:35.901Z
			  card-last-score:: 5
				- The **interquartile range** is the width of the middle 50% of the data.
					- Therefore, it is ^^robust to outliers.^^
				- To calculate the **IQR**, first split the ordered data into 4 quarters, then subtract the value at $Q_1$ from the value at $Q_3$.
					- $$IQR=Q_3-Q_1$$
				- ![image.png](../assets/image_1663005545935_0.png)
			- #### Tukey's Method for IQR
				- There are many methods for calculating the quartiles, and hence the IQR. Tukey's method is one of them.
				- 1. Put data in **ascending** order.
				  2. The **lower quartile** ($Q_1$) is the **median** of the **lower** 50% of the data, including the median.
				  3. The **upper quartile** ($Q_3$) is the **median** of the **upper** 50% of the data, including the median.
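				- The steps above can be sketched in Python as follows. The name `tukey_quartiles` is hypothetical, and the sketch assumes that "including the median" only matters when $n$ is odd (for even $n$ the data simply splits into two equal halves).
				  ```python
				  from statistics import median

				  def tukey_quartiles(xs):
				      """Q1 and Q3 as the medians of the lower and upper halves of the data."""
				      ordered = sorted(xs)
				      n = len(ordered)
				      half = (n + 1) // 2  # when n is odd, the median belongs to both halves
				      q1 = median(ordered[:half])
				      q3 = median(ordered[n - half:])
				      return q1, q3

				  q1, q3 = tukey_quartiles([1, 3, 5, 7, 9, 11, 13])
				  print(q1, q3, q3 - q1)  # 4.0 10.0 6.0
				  ```
				  Other quartile conventions can give slightly different values for the same data.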
		- #### Standard Deviation #card
		  card-last-interval:: 4.14
		  card-repeats:: 2
		  card-ease-factor:: 2.56
		  card-next-schedule:: 2022-11-27T15:16:26.346Z
		  card-last-reviewed:: 2022-11-23T12:16:26.350Z
		  card-last-score:: 5
			- A common measure of spread is the **standard deviation**, which takes into account how far *each* data value is from the mean.
				- A **deviation** is the distance of a datapoint from the mean.
			- Since the sum of all the deviations is always zero, we square each deviation and find an average of the squared deviations, called the **variance**.
				- We then take the positive square root of the **sample variance** to get the **sample standard deviation**, which is preferable to the sample variance because it is in the same units as the data, whereas the sample variance is in squared units.
			- The **standard deviation** is ^^sensitive to outliers.^^
			- How do you calculate the **sample variance**, and hence, the **sample standard deviation**? #card
			  card-last-interval:: 10.97
			  card-repeats:: 3
			  card-ease-factor:: 2.56
			  card-next-schedule:: 2022-11-25T15:37:19.443Z
			  card-last-reviewed:: 2022-11-14T16:37:19.444Z
			  card-last-score:: 3
				- The **sample variance**, denoted by $s^2$, is given by:
					- $$s^2=\sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n-1}$$
					- The **sample standard deviation**, denoted by $s$, is the **positive square root** of $s^2$, that is:
						- $$s=\sqrt{s^2}$$
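				- A minimal Python sketch of the two formulas, dividing by $n-1$ as in the definition of $s^2$. The function names and the data are assumptions made for illustration.
				  ```python
				  from math import sqrt

				  def sample_variance(xs):
				      """Sum of squared deviations from the mean, divided by n - 1."""
				      n = len(xs)
				      mean = sum(xs) / n
				      return sum((x - mean) ** 2 for x in xs) / (n - 1)

				  def sample_sd(xs):
				      """Positive square root of the sample variance."""
				      return sqrt(sample_variance(xs))

				  data = [2, 4, 4, 5, 5]
				  print(sample_variance(data))  # 1.5
				  print(sample_sd(data))
				  ```
				  Here the mean is 4, the squared deviations are 4, 0, 0, 1, 1, and their sum (6) divided by $n-1=4$ gives 1.5.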
	- ### Shape
		- #### Graphical Summaries of Data
			- Depends on the variable of interest.
				- **Categorical** response variable -> bar chart or pie chart.
					- **Categorical** response variable ^^with an explanatory variable^^ -> grouped bar chart.
				- **Continuous** response variable -> histogram, boxplot, density plot.
					- **Continuous** response variable ^^with an explanatory variable^^ -> grouped boxplot.
		-
			- What is a **boxplot**? #card
			  card-last-interval:: 5.52
			  card-repeats:: 3
			  card-ease-factor:: 2.46
			  card-next-schedule:: 2022-11-20T04:34:53.491Z
			  card-last-reviewed:: 2022-11-14T16:34:53.492Z
			  card-last-score:: 3
				- A **boxplot** is a graphical display showing centre, spread, shape, & outliers.
				- It displays the **5-number summary**:
					- *min, Q1, median, Q3, max*
				- ![image.png](../assets/image_1663236210540_0.png)
		- What is a **histogram**? #card
		  card-last-interval:: -1
		  card-repeats:: 1
		  card-ease-factor:: 2.7
		  card-next-schedule:: 2022-11-15T00:00:00.000Z
		  card-last-reviewed:: 2022-11-14T20:00:35.072Z
		  card-last-score:: 1
			- **Histograms** are useful to show the general shape, location, and spread of data values.
			- Representation by *area*.
			- **Construction**
				- Determine the range of the data (*minimum* to *maximum*).
				- Split into convenient intervals or *bins*.
				- Usually use 5 to 15 intervals.
				- Count number of observations in each interval - *frequency*.
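			- The construction steps can be sketched in Python as follows. The name `histogram_counts` is hypothetical, and the sketch assumes equal-width bins and data that are not all equal. Only 3 bins are used here because the made-up dataset is tiny.
			  ```python
			  def histogram_counts(xs, n_bins):
			      """Count observations in n_bins equal-width intervals from min to max."""
			      lo, hi = min(xs), max(xs)
			      width = (hi - lo) / n_bins  # assumes hi > lo
			      counts = [0] * n_bins
			      for x in xs:
			          # The maximum falls on the last edge, so place it in the last bin.
			          index = min(int((x - lo) / width), n_bins - 1)
			          counts[index] += 1
			      return counts

			  data = [1, 2, 2, 3, 5, 6, 8, 9, 9, 10]
			  print(histogram_counts(data, 3))  # [4, 2, 4]
			  ```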
		- When talking about the shape of the data, make sure to address the following 3 questions:
			- 1. Does the histogram have a single, central hump or several well-separated bumps?
			  2. Is the histogram or boxplot **symmetric**, or more spread out in one direction (skewed)?
			  3. Are there any unusual features, e.g., outliers or spikes?
				- ![image.png](../assets/image_1663237164731_0.png)
				- ![image.png](../assets/image_1663237245117_0.png)
				-
		- #### Explanatory & Response Variables
		  collapsed:: true
			- To identify the **explanatory** variable in a pair of variables, identify which of the two is suspected of affecting the other, and plan an appropriate analysis.
				- explanatory variable -might affect-> response variable
					- continent -might affect-> life expectancy.
-
- ## R Markdown
	- What is **R Markdown**? #card
	  card-last-interval:: 28.93
	  card-repeats:: 4
	  card-ease-factor:: 2.56
	  card-next-schedule:: 2022-12-13T14:49:15.636Z
	  card-last-reviewed:: 2022-11-14T16:49:15.637Z
	  card-last-score:: 5
		- **R Markdown** is a file format for making ^^dynamic documents in R.^^
		- R Markdown is written in Markdown and contains chunks of embedded R code (data management, summaries, graphics, analysis & interpretation) all in one document.
		- Documents can be **knitted** to HTML, PDF, Word, and many other formats.
	- ### Key Benefits of R Markdown
		- Makes it easy to produce statistical reports with code, analysis, outputs, and write-up all in one place.
		- Perfect for reproducible research.
		- Easy to convert to different document types.
	- ### Structure
		- R Markdown contains **three** types of content:
			- A **YAML Header**.
			- Text, formatted with Markdown.
			- Code chunks.