- #[[ST2001 - Statistics in Data Science I]] - **Previous Topic:** null - **Next Topic:** [[Sampling]] - **Relevant Slides:** ![Lecture01.pdf](../assets/Lecture01_1662914505882_0.pdf) - - ## What is / are Statistics? collapsed:: true - What is a **statistic**? #card card-last-interval:: 4 card-repeats:: 2 card-ease-factor:: 2.66 card-next-schedule:: 2022-11-26T13:40:07.042Z card-last-reviewed:: 2022-11-22T13:40:07.042Z card-last-score:: 5 - A **statistic** is any quantity computed from sample data. - What is the **Science of Statistics**? card-last-interval:: 10.24 card-repeats:: 3 card-ease-factor:: 2.56 card-next-schedule:: 2022-10-13T16:37:15.281Z card-last-reviewed:: 2022-10-03T11:37:15.284Z card-last-score:: 5 - The collection, classification, summarisation, organisation, analysis, estimation, and interpretation of information. - What is the **role of statistics**? card-last-interval:: 9.28 card-repeats:: 3 card-ease-factor:: 2.32 card-next-schedule:: 2022-10-09T18:10:56.000Z card-last-reviewed:: 2022-09-30T12:10:56.000Z card-last-score:: 5 - The field of statistics deals with the collection, presentation, analysis, and use of data to: - make decisions - solve problems - design products & processes - Statistics is the ^^science of uncertainty.^^ - What is the **role of probability** in statistics? - **Probability** provides the framework for the study & application of statistics. - - ## Types of Statistics collapsed:: true - What is **Descriptive Statistics**? #card card-last-interval:: 41.44 card-repeats:: 5 card-ease-factor:: 2.18 card-next-schedule:: 2022-12-30T04:36:51.695Z card-last-reviewed:: 2022-11-18T18:36:51.696Z card-last-score:: 3 - **Descriptive Statistics** is the science of summarising data, both numerically & graphically. - The applicable analysis methods depend on the variable being measured and the research questions that you are trying to answer. - What is **Inferential Statistics**? 
#card card-last-interval:: 4 card-repeats:: 2 card-ease-factor:: 2.66 card-next-schedule:: 2022-11-26T13:37:26.721Z card-last-reviewed:: 2022-11-22T13:37:26.722Z card-last-score:: 3 - **Inferential Statistics** is the science of using the ^^information in your sample^^ to ^^infer^^ something about the population of interest. - - ## Important Terms collapsed:: true - What is an **experimental unit**? #card collapsed:: true card-last-interval:: 33.64 card-repeats:: 4 card-ease-factor:: 2.9 card-next-schedule:: 2022-12-18T07:44:53.701Z card-last-reviewed:: 2022-11-14T16:44:53.701Z card-last-score:: 5 - An **experimental unit** / individual is a single object upon which we collect data, e.g., a person, thing, transaction, or event. - What is a **population**? #card collapsed:: true card-last-interval:: 33.64 card-repeats:: 4 card-ease-factor:: 2.9 card-next-schedule:: 2022-12-18T07:49:52.921Z card-last-reviewed:: 2022-11-14T16:49:52.922Z card-last-score:: 5 - A **population** is a ^^collection of experimental units^^ / individuals that we are interested in studying. - What is a **sample**? #card card-last-interval:: 33.64 card-repeats:: 4 card-ease-factor:: 2.9 card-next-schedule:: 2022-12-18T07:51:36.002Z card-last-reviewed:: 2022-11-14T16:51:36.002Z card-last-score:: 5 - A **sample** is a subset of experimental units from the population. - What is a **variable**? #card collapsed:: true card-last-interval:: 10.64 card-repeats:: 3 card-ease-factor:: 2.66 card-next-schedule:: 2022-11-25T07:35:49.776Z card-last-reviewed:: 2022-11-14T16:35:49.776Z card-last-score:: 5 - A **variable** is a ^^characteristic or property of an individual experimental unit^^. - A variable may be measured, or more generally "observed", on each individual. - What is **Qualitative Data**? 
#card card-last-interval:: 8.72 card-repeats:: 3 card-ease-factor:: 2.18 card-next-schedule:: 2022-11-23T09:37:05.082Z card-last-reviewed:: 2022-11-14T16:37:05.082Z card-last-score:: 3 - **Qualitative Data** is data which can be classified into categories. - Two types of Qualitative Data: - **Ordinal:** ordered qualitative data - e.g., a grade - **Nominal:** unordered qualitative data - e.g., gender, method of payment - What is **Quantitative Data**? #card card-last-interval:: 30.47 card-repeats:: 4 card-ease-factor:: 2.76 card-next-schedule:: 2022-12-15T07:21:06.252Z card-last-reviewed:: 2022-11-14T20:21:06.252Z card-last-score:: 5 - **Quantitative Data** is data in the form of counts or numbers - it cannot be classified into categories. - Two types of Quantitative Data: - **Discrete:** non-divisible, single points of data, **counts** - e.g., number of texts sent - **Continuous:** measurements that can take any of infinitely many values between two whole numbers on a number scale - e.g., age, rent, temperature - - Pie charts make data very difficult to interpret & read - **don't use them**. - - ## Numerical Summaries - ### Central Tendency - What is a **numerical summary**? #card card-last-score:: 5 card-repeats:: 5 card-next-schedule:: 2023-02-06T22:21:22.599Z card-last-interval:: 84.1 card-ease-factor:: 2.76 card-last-reviewed:: 2022-11-14T20:21:22.599Z - A **numerical summary** is a way of summarising categorical data using a frequency count or percentage. - How do you calculate the **sample mean**? #card card-last-interval:: 29.26 card-repeats:: 4 card-ease-factor:: 2.66 card-next-schedule:: 2022-12-13T22:47:38.996Z card-last-reviewed:: 2022-11-14T16:47:38.997Z card-last-score:: 5 - Suppose that the observations in a sample are $x_1,\ x_2,\ ...\ ,\ x_n$. 
The **sample mean**, denoted by $\bar{x}$, is: - $$\bar{x} = \sum_{i=1}^{n}\frac{x_i}{n}=\frac{x_1+x_2+...+x_n}{n}$$ - ^^The sample mean is **sensitive** to extreme values.^^ - How do you calculate the **sample median**? #card card-last-interval:: 8.72 card-repeats:: 3 card-ease-factor:: 2.18 card-next-schedule:: 2022-11-23T09:49:27.636Z card-last-reviewed:: 2022-11-14T16:49:27.637Z card-last-score:: 5 - Given that the observations in a sample are $x_1,\ x_2,\ ...\ ,\ x_n$, arranged in **increasing order** of magnitude, the **sample median**, denoted by $\tilde{x}$, is: - $$\tilde{x} = \begin{cases}x_{(n+1)/2}, & \text{if $n$ is odd},\\ \frac{1}{2}(x_{n/2} + x_{n/2+1}), &\text{if $n$ is even.} \end{cases}$$ - ^^The sample median is **not** sensitive to extreme values.^^ - What is the **mode**? #card card-last-interval:: 33.64 card-repeats:: 4 card-ease-factor:: 2.9 card-next-schedule:: 2022-12-18T07:51:42.814Z card-last-reviewed:: 2022-11-14T16:51:42.815Z card-last-score:: 5 - The **mode** is the most frequent observation in a dataset. - ### Variation collapsed:: true - What is the **range** of a sample? #card collapsed:: true card-last-interval:: 33.64 card-repeats:: 4 card-ease-factor:: 2.9 card-next-schedule:: 2022-12-18T07:51:53.283Z card-last-reviewed:: 2022-11-14T16:51:53.283Z card-last-score:: 5 - The **range** of a sample is the **maximum** minus the **minimum**. - The range is a ^^poor measure of spread and is badly affected by outliers.^^ - #### Interquartile Range - What is the **interquartile range** of a sample? #card card-last-interval:: 41.44 card-repeats:: 5 card-ease-factor:: 2.18 card-next-schedule:: 2022-12-26T06:18:35.901Z card-last-reviewed:: 2022-11-14T20:18:35.901Z card-last-score:: 5 - The **interquartile range** is the range of the middle 50% of the data. 
- Therefore, it is ^^robust to outliers.^^ - To calculate the **IQR**, first split the ordered data into four quarters, then subtract the value at $Q_1$ from the value at $Q_3$. - $$IQR=Q_3-Q_1$$ - ![image.png](../assets/image_1663005545935_0.png) - #### Tukey's Method for IQR - This is only one of many methods for calculating the quartiles. - 1. Put data in **ascending** order. 2. The **lower quartile** ($Q_1$) is the **median** of the **lower** 50% of the data, including the median. 3. The **upper quartile** ($Q_3$) is the **median** of the **upper** 50% of the data, including the median. - #### Standard Deviation #card card-last-interval:: 4.14 card-repeats:: 2 card-ease-factor:: 2.56 card-next-schedule:: 2022-11-27T15:16:26.346Z card-last-reviewed:: 2022-11-23T12:16:26.350Z card-last-score:: 5 - A common measure of spread is the **standard deviation**, which takes into account how far *each* data value is from the mean. - A **deviation** is the signed difference between a datapoint and the mean. - Since the deviations always sum to zero, we square each deviation and take an average of the squared deviations (dividing by $n-1$), called the **variance**. - We then take the positive square root of the **sample variance** to get the **sample standard deviation**, which is preferable to the sample variance because the sample variance is in squared units. - The **standard deviation** is ^^sensitive to outliers.^^ - How do you calculate the **sample variance**, and hence, the **sample standard deviation**? #card card-last-interval:: 10.97 card-repeats:: 3 card-ease-factor:: 2.56 card-next-schedule:: 2022-11-25T15:37:19.443Z card-last-reviewed:: 2022-11-14T16:37:19.444Z card-last-score:: 3 - The **sample variance**, denoted by $s^2$, is given by: - $$s^2=\sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n-1}$$ - The **sample standard deviation**, denoted by $s$, is the **positive square root** of $s^2$, that is: - $$s=\sqrt{s^2}$$ - ### Shape - #### Graphical Summaries of Data - The choice of graph depends on the variable of interest. 
- **Categorical** response variable -> bar chart or pie chart. - **Categorical** response variable ^^with an explanatory variable^^ -> grouped bar chart. - **Continuous** response variable -> histogram, boxplot, density plot. - **Continuous** response variable ^^with an explanatory variable^^ -> grouped boxplot. - - What is a **boxplot**? #card card-last-interval:: 5.52 card-repeats:: 3 card-ease-factor:: 2.46 card-next-schedule:: 2022-11-20T04:34:53.491Z card-last-reviewed:: 2022-11-14T16:34:53.492Z card-last-score:: 3 - A **boxplot** is a graphical display showing centre, spread, shape, & outliers. - It displays the **5-number summary**: - *min, Q1, median, Q3, max* - ![image.png](../assets/image_1663236210540_0.png) - What is a **histogram**? #card card-last-interval:: -1 card-repeats:: 1 card-ease-factor:: 2.7 card-next-schedule:: 2022-11-15T00:00:00.000Z card-last-reviewed:: 2022-11-14T20:00:35.072Z card-last-score:: 1 - **Histograms** are useful to show the general shape, location, and spread of data values. - Representation by *area*. - **Construction** - Determine the range of the data - *minimum, maximum*. - Split into convenient intervals or *bins*. - Usually use 5 to 15 intervals. - Count the number of observations in each interval - *frequency*. - When talking about the shape of the data, make sure to address the following 3 questions: - 1. Does the histogram have a single, central hump or several well-separated bumps? 2. Is the histogram or boxplot **symmetric**, or more spread out in one direction (skewed)? 3. Any unusual features? e.g., outliers, spikes. 
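- The histogram construction steps above (range, bins, frequency counts) can be sketched in Python. This is only an illustrative sketch: the function name `histogram_counts`, the sample values, and the choice of equal-width intervals with the maximum placed in the last interval are all assumptions made for the example, not something taken from the lecture.

```python
def histogram_counts(values, n_bins):
    """Return (edges, counts) for n_bins equal-width intervals over the data range."""
    lo, hi = min(values), max(values)  # step 1: range of the data
    if lo == hi:
        raise ValueError("all values are identical; cannot form intervals")
    width = (hi - lo) / n_bins  # step 2: split the range into equal-width bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:  # step 3: frequency of each interval
        # Intervals are [left, right); the maximum is placed in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return edges, counts


# Hypothetical sample, e.g. number of texts sent by 10 students.
data = [2, 3, 3, 5, 7, 8, 8, 9, 10, 12]
edges, counts = histogram_counts(data, n_bins=5)

# Crude text histogram: one '#' per observation in each interval.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'#' * count}")
```

- Because the intervals here all have the same width, each bar's area is proportional to its frequency, so plotting the counts directly is consistent with representation by *area*.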
- ![image.png](../assets/image_1663237164731_0.png) - ![image.png](../assets/image_1663237245117_0.png) - - #### Explanatory & Response Variables collapsed:: true - To identify the **explanatory** variable in a pair of variables, identify which of the two is suspected of affecting the other and plan an appropriate analysis. - explanatory variable -might affect-> response variable - continent -might affect-> life expectancy. - - ## R Markdown - What is **R Markdown**? #card card-last-interval:: 28.93 card-repeats:: 4 card-ease-factor:: 2.56 card-next-schedule:: 2022-12-13T14:49:15.636Z card-last-reviewed:: 2022-11-14T16:49:15.637Z card-last-score:: 5 - **R Markdown** is a file format for making ^^dynamic documents in R.^^ - R Markdown is written in Markdown and contains chunks of embedded R code (data management, summaries, graphics, analysis & interpretation) all in one document. - Documents can be **knitted** to HTML, PDF, Word, and many other formats. - ### Key Benefits of R Markdown - Makes it easy to produce statistical reports with code, analysis, outputs, and write-up all in one place. - Perfect for reproducible research. - Easy to convert to different document types. - ### Structure - R Markdown contains **three** types of content: - A **YAML Header**. - Text, formatted with Markdown. - Code chunks.