  • #ST2001 - Statistics in Data Science I
  • Previous Topic: null
  • Next Topic: Sampling
  • Relevant Slides: Lecture01.pdf
  • What is / are Statistics?

    collapsed:: true
    • What is a statistic? #card card-last-interval:: 4 card-repeats:: 2 card-ease-factor:: 2.66 card-next-schedule:: 2022-11-26T13:40:07.042Z card-last-reviewed:: 2022-11-22T13:40:07.042Z card-last-score:: 5
      • A statistic is any quantity computed from sample data.
    • What is the Science of Statistics? card-last-interval:: 10.24 card-repeats:: 3 card-ease-factor:: 2.56 card-next-schedule:: 2022-10-13T16:37:15.281Z card-last-reviewed:: 2022-10-03T11:37:15.284Z card-last-score:: 5
      • The collecting, classifying, summarising, organising, analysing, estimation, and interpretation of information.
    • What is the role of statistics? card-last-interval:: 9.28 card-repeats:: 3 card-ease-factor:: 2.32 card-next-schedule:: 2022-10-09T18:10:56.000Z card-last-reviewed:: 2022-09-30T12:10:56.000Z card-last-score:: 5
      • The field of statistics deals with the collection, presentation, analysis, and use of data to:
        • make decisions
        • solve problems
        • design products & processes
      • Statistics is the ^^science of uncertainty.^^
    • What is the role of probability in statistics?
      • Probability provides the framework for the study & application of statistics.
  • Types of Statistics

    collapsed:: true
    • What is Descriptive Statistics? #card card-last-interval:: 41.44 card-repeats:: 5 card-ease-factor:: 2.18 card-next-schedule:: 2022-12-30T04:36:51.695Z card-last-reviewed:: 2022-11-18T18:36:51.696Z card-last-score:: 3
      • Descriptive Statistics is the science of summarising data, both numerically & graphically.
      • The applicable analysis methods depend on the variable being measured and the research questions that you are trying to answer.
    • What is Inferential Statistics? #card card-last-interval:: 4 card-repeats:: 2 card-ease-factor:: 2.66 card-next-schedule:: 2022-11-26T13:37:26.721Z card-last-reviewed:: 2022-11-22T13:37:26.722Z card-last-score:: 3
      • Inferential Statistics is the science of using the ^^information in your sample^^ to ^^infer^^ something about the population from which the sample was drawn.
  • Important Terms

    collapsed:: true
    • What is an experimental unit? #card collapsed:: true card-last-interval:: 33.64 card-repeats:: 4 card-ease-factor:: 2.9 card-next-schedule:: 2022-12-18T07:44:53.701Z card-last-reviewed:: 2022-11-14T16:44:53.701Z card-last-score:: 5
      • An experimental unit / individual is a single object upon which we collect data. e.g., a person, thing, transaction, or event.
    • What is a population? #card collapsed:: true card-last-interval:: 33.64 card-repeats:: 4 card-ease-factor:: 2.9 card-next-schedule:: 2022-12-18T07:49:52.921Z card-last-reviewed:: 2022-11-14T16:49:52.922Z card-last-score:: 5
      • A population is a ^^collection of experimental units^^ / individuals that we are interested in studying.
    • What is a sample? #card card-last-interval:: 33.64 card-repeats:: 4 card-ease-factor:: 2.9 card-next-schedule:: 2022-12-18T07:51:36.002Z card-last-reviewed:: 2022-11-14T16:51:36.002Z card-last-score:: 5
      • A sample is a subset of experimental units from the population.
    • What is a variable? #card collapsed:: true card-last-interval:: 10.64 card-repeats:: 3 card-ease-factor:: 2.66 card-next-schedule:: 2022-11-25T07:35:49.776Z card-last-reviewed:: 2022-11-14T16:35:49.776Z card-last-score:: 5
      • A variable is a ^^characteristic or property of an individual experimental unit^^.
      • A variable may be measured, or more generally "observed" on each individual.
    • What is Qualitative Data? #card card-last-interval:: 8.72 card-repeats:: 3 card-ease-factor:: 2.18 card-next-schedule:: 2022-11-23T09:37:05.082Z card-last-reviewed:: 2022-11-14T16:37:05.082Z card-last-score:: 3
      • Qualitative Data is data which can be classified into categories.
      • Two types of Qualitative Data:
        • Ordinal: ordered qualitative data - e.g., a grade
        • Nominal: unordered qualitative data - e.g., a gender, a method of payment
    • What is Quantitative Data? #card card-last-interval:: 30.47 card-repeats:: 4 card-ease-factor:: 2.76 card-next-schedule:: 2022-12-15T07:21:06.252Z card-last-reviewed:: 2022-11-14T20:21:06.252Z card-last-score:: 5
      • Quantitative Data is data in the form of counts or numbers - it cannot be classified into categories.
      • Two types of Quantitative Data:
        • Discrete: non-divisible, single points of data, counts - e.g., number of texts sent
        • Continuous: measurements that, if placed on a number scale, can be placed in an infinite number of spaces between two whole numbers - e.g., age, rent, temperature
  • Pie charts make data very difficult to interpret & read - don't use them.
  • Numerical Summaries

    • Central Tendency

      • What is a numerical summary? #card card-last-score:: 5 card-repeats:: 5 card-next-schedule:: 2023-02-06T22:21:22.599Z card-last-interval:: 84.1 card-ease-factor:: 2.76 card-last-reviewed:: 2022-11-14T20:21:22.599Z
        • A numerical summary is a way of summarising categorical data using a frequency count or percentage.
      • How do you calculate the sample mean? #card card-last-interval:: 29.26 card-repeats:: 4 card-ease-factor:: 2.66 card-next-schedule:: 2022-12-13T22:47:38.996Z card-last-reviewed:: 2022-11-14T16:47:38.997Z card-last-score:: 5
        • Suppose that the observations in a sample are x_1,\ x_2,\ ...\ ,\ x_n. The sample mean, denoted by \bar{x}, is:
          • \bar{x} = \sum_{i=1}^{n}\frac{x_i}{n}=\frac{x_1+x_2+...+x_n}{n}
        • ^^The sample mean is sensitive to extreme values^^
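        • A minimal Python sketch of the sample mean formula above. The data values and the name `sample_mean` are made up for illustration; the second sample adds one extreme value to show how far it pulls the mean.

```python
from statistics import mean


def sample_mean(xs):
    """Sum the observations and divide by the sample size n."""
    return sum(xs) / len(xs)


data = [2, 4, 4, 5, 10]                # hypothetical sample, n = 5
x_bar = sample_mean(data)              # (2 + 4 + 4 + 5 + 10) / 5 = 5.0

# Adding a single extreme observation drags the mean upwards.
with_outlier = data + [95]             # sum = 120, n = 6
x_bar_outlier = sample_mean(with_outlier)  # 120 / 6 = 20.0

# The standard library gives the same answer.
library_mean = mean(data)
```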
      • How do you calculate the sample median? #card card-last-interval:: 8.72 card-repeats:: 3 card-ease-factor:: 2.18 card-next-schedule:: 2022-11-23T09:49:27.636Z card-last-reviewed:: 2022-11-14T16:49:27.637Z card-last-score:: 5
        • Given that the observations in a sample are x_1,\ x_2,\ ...\ ,\ x_n, arranged in increasing order of magnitude, the sample median is:
          • \tilde{x} = \begin{cases}x_{(n+1)/2}, & \text{if $n$ is odd},\\ \frac{1}{2}(x_{n/2} + x_{n/2+1}), &\text{if $n$ is even.} \\ \end{cases}
          • ^^The sample median is not sensitive to extreme values.^^
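        • A minimal Python sketch of the two median cases above (odd and even n). The data and the name `sample_median` are illustrative assumptions; note that the formula uses 1-based positions while Python lists are 0-based.

```python
def sample_median(xs):
    """Median of a sample: middle value (odd n) or mean of the two middle values (even n)."""
    s = sorted(xs)                     # arrange in increasing order of magnitude
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                  # 1-based position (n + 1) / 2
    return (s[mid - 1] + s[mid]) / 2   # 1-based positions n / 2 and n / 2 + 1


data = [2, 4, 4, 5, 10]                # hypothetical sample, n = 5 (odd)
median_odd = sample_median(data)       # middle value: 4

# Adding an extreme value barely moves the median.
with_outlier = data + [95]             # n = 6 (even)
median_even = sample_median(with_outlier)  # (4 + 5) / 2 = 4.5
```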
      • What is the mode? #card card-last-interval:: 33.64 card-repeats:: 4 card-ease-factor:: 2.9 card-next-schedule:: 2022-12-18T07:51:42.814Z card-last-reviewed:: 2022-11-14T16:51:42.815Z card-last-score:: 5
        • The mode is the most frequent observation in a dataset.
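        • A small Python sketch of finding the mode by counting. The function name `modes` and the sample values are hypothetical; it returns a list because a dataset can have more than one most frequent value.

```python
from collections import Counter


def modes(xs):
    """Return every value that occurs with the highest frequency, sorted."""
    counts = Counter(xs)               # value -> number of occurrences
    top = max(counts.values())         # highest frequency in the data
    return sorted(v for v, c in counts.items() if c == top)


payment_mode = modes(["card", "cash", "card", "cheque"])  # "card" appears twice
tied_modes = modes([1, 2, 2, 3, 3])                       # 2 and 3 both appear twice
```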
    • Variation

      collapsed:: true
      • What is the range of a sample? #card collapsed:: true card-last-interval:: 33.64 card-repeats:: 4 card-ease-factor:: 2.9 card-next-schedule:: 2022-12-18T07:51:53.283Z card-last-reviewed:: 2022-11-14T16:51:53.283Z card-last-score:: 5
        • The range of a sample is the maximum - minimum.
        • The range is a ^^poor measure of spread and is badly affected by outliers.^^
      • Interquartile Range

        • What is the interquartile range of a sample? #card card-last-interval:: 41.44 card-repeats:: 5 card-ease-factor:: 2.18 card-next-schedule:: 2022-12-26T06:18:35.901Z card-last-reviewed:: 2022-11-14T20:18:35.901Z card-last-score:: 5
          • The interquartile range is the width of the interval containing the middle 50% of the data.
            • Therefore, it is ^^robust to outliers.^^
          • To calculate the IQR, first split the ordered data into 4 quarters, then subtract the value at Q_1 from the value at Q_3.
            • IQR=Q_3-Q_1
          • image.png
        • Tukey's Method for IQR

          • Tukey's method is one of many methods for calculating the quartiles, and hence the IQR.
            1. Put data in ascending order.
            2. The lower quartile (Q_1) is the median of the lower 50% of the data, including the median.
            3. The upper quartile (Q_3) is the median of the upper 50% of the data, including the median.
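          • The three steps above can be sketched in Python as follows. This assumes the usual reading of "including the median": when n is odd, the middle value belongs to both halves. The function name and data are hypothetical, and other quartile conventions can give slightly different values.

```python
from statistics import median


def tukey_quartiles(xs):
    """Return (Q1, Q3) as medians of the lower and upper halves of the sorted data."""
    s = sorted(xs)                     # step 1: ascending order
    n = len(s)
    half = (n + 1) // 2                # for odd n, each half includes the median
    q1 = median(s[:half])              # step 2: median of the lower half
    q3 = median(s[n - half:])          # step 3: median of the upper half
    return q1, q3


data = [1, 3, 5, 7, 9, 11, 13]         # hypothetical sample
q1, q3 = tukey_quartiles(data)         # halves: [1, 3, 5, 7] and [7, 9, 11, 13]
iqr = q3 - q1                          # 10.0 - 4.0 = 6.0
data_range = max(data) - min(data)     # 13 - 1 = 12

# Replace the largest value with an outlier: the range explodes, the IQR does not move.
with_outlier = [1, 3, 5, 7, 9, 11, 130]
q1_out, q3_out = tukey_quartiles(with_outlier)
iqr_out = q3_out - q1_out
range_out = max(with_outlier) - min(with_outlier)
```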
      • Standard Deviation #card

        card-last-interval:: 4.14 card-repeats:: 2 card-ease-factor:: 2.56 card-next-schedule:: 2022-11-27T15:16:26.346Z card-last-reviewed:: 2022-11-23T12:16:26.350Z card-last-score:: 5
        • A common measure of spread is the standard deviation, which takes into account how far each data value is from the mean.
          • A deviation is the distance of a datapoint from the mean.
        • Since the sum of all the deviations is always zero, we square each deviation and take an average of the squared deviations, called the variance.
          • We then take the positive square root of the sample variance to get the sample standard deviation, which is preferable to the sample variance because it is in the same units as the data, whereas the variance is in squared units.
        • The standard deviation is ^^sensitive to outliers.^^
        • How do you calculate the sample variance, and hence, the sample standard deviation? #card card-last-interval:: 10.97 card-repeats:: 3 card-ease-factor:: 2.56 card-next-schedule:: 2022-11-25T15:37:19.443Z card-last-reviewed:: 2022-11-14T16:37:19.444Z card-last-score:: 3
          • The sample variance, denoted by s^2, is given by:
            • s^2=\sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n-1}
            • The sample standard deviation, denoted by s, is the positive square root of s^2, that is:
              • s=\sqrt{s^2}
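          • A worked Python sketch of the two formulas above, using made-up data. It also checks the earlier claim that the raw deviations sum to zero. The function names are illustrative; `statistics.variance` and `statistics.stdev` in the standard library use the same n - 1 divisor.

```python
import math
from statistics import stdev, variance


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def sample_sd(xs):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


data = [2, 4, 4, 5, 10]                        # hypothetical sample, mean = 5
deviations = [x - 5 for x in data]             # [-3, -1, -1, 0, 5]
deviation_sum = sum(deviations)                # deviations cancel out to 0

s_squared = sample_variance(data)              # (9 + 1 + 1 + 0 + 25) / 4 = 9.0
s = sample_sd(data)                            # sqrt(9.0) = 3.0
```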
    • Shape

      • Graphical Summaries of Data

        • Depends on the variable of interest.
          • Categorical response variable -> bar chart or pie chart.
            • Categorical response variable ^^with an explanatory variable^^ -> grouped bar chart.
          • Continuous response variable -> histogram, boxplot, density plot.
            • Continuous response variable ^^with an explanatory variable^^ -> grouped boxplot.
        • What is a boxplot? #card card-last-interval:: 5.52 card-repeats:: 3 card-ease-factor:: 2.46 card-next-schedule:: 2022-11-20T04:34:53.491Z card-last-reviewed:: 2022-11-14T16:34:53.492Z card-last-score:: 3
          • A boxplot is a graphical display showing centre, spread, shape, & outliers.
          • It displays the 5-number summary:
            • min, Q1, median, Q3, max
          • image.png
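          • A minimal Python sketch of the 5-number summary that a boxplot displays. The quartiles here are computed as medians of the lower and upper halves (each including the median when n is odd), which is an assumption; plotting software may use a different quartile convention. The function name and data are hypothetical.

```python
from statistics import median


def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max) for a sample."""
    s = sorted(xs)
    n = len(s)
    half = (n + 1) // 2                # for odd n, each half includes the median
    return (
        s[0],                          # minimum
        median(s[:half]),              # Q1: median of the lower half
        median(s),                     # median
        median(s[n - half:]),          # Q3: median of the upper half
        s[-1],                         # maximum
    )


summary = five_number_summary([1, 3, 5, 7, 9, 11, 13])
```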
      • What is a histogram? #card card-last-interval:: -1 card-repeats:: 1 card-ease-factor:: 2.7 card-next-schedule:: 2022-11-15T00:00:00.000Z card-last-reviewed:: 2022-11-14T20:00:35.072Z card-last-score:: 1
        • Histograms are useful to show the general shape, location, and spread of data values.
        • Representation by area.
        • Construction
          • Determine the range of the data (minimum, maximum).
          • Split into convenient intervals or bins.
          • Usually use 5 to 15 intervals.
          • Count number of observations in each interval - frequency.
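        • The construction steps above can be sketched in Python as follows. Equal-width bins are an assumption (the notes only say "convenient intervals"), the function name and data are hypothetical, and the sketch assumes the data are not all identical so that the bin width is non-zero.

```python
def histogram_counts(xs, bins):
    """Split [min, max] into equal-width bins and count the observations in each."""
    lo, hi = min(xs), max(xs)          # step 1: range of the data
    width = (hi - lo) / bins           # step 2: equal-width intervals (assumes hi > lo)
    edges = [lo + k * width for k in range(bins + 1)]
    counts = [0] * bins
    for x in xs:
        # Each bin is closed on the left; the maximum value goes in the last bin.
        i = min(int((x - lo) / width), bins - 1)
        counts[i] += 1                 # step 4: frequency per interval
    return edges, counts


data = [1, 2, 2, 3, 5, 6, 6, 7, 9, 10]          # hypothetical sample
edges, counts = histogram_counts(data, bins=5)  # step 3: 5 intervals of width 1.8
```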
      • When talking about the shape of the data, make sure to address the following 3 questions:
          1. Does the histogram have a single, central hump or several well-separated bumps?
          2. Is the histogram or boxplot symmetric, or more spread out in one direction (skewed)?
          3. Any unusual features? e.g., outliers, spikes.
          • image.png
          • image.png
      • Explanatory & Response Variables

        collapsed:: true
        • To identify the explanatory variable in a pair of variables, identify which of the two is suspected of affecting the other and plan an appropriate analysis.
          • explanatory variable -might affect-> response variable
            • continent -might affect-> life expectancy.
  • R Markdown

    • What is R Markdown? #card card-last-interval:: 28.93 card-repeats:: 4 card-ease-factor:: 2.56 card-next-schedule:: 2022-12-13T14:49:15.636Z card-last-reviewed:: 2022-11-14T16:49:15.637Z card-last-score:: 5
      • R Markdown is a file format for making ^^dynamic documents in R.^^
      • R Markdown is written in Markdown and contains chunks of embedded R code (data management, summaries, graphics, analysis & interpretation) all in one document.
      • Documents can be knitted to HTML, PDF, Word, and many other formats.
    • Key Benefits of R Markdown

      • Makes it easy to produce statistical reports with code, analysis, outputs, and write-up all in one place.
      • Perfect for reproducible research.
      • Easy to convert to different document types.
    • Structure

      • R Markdown contains three types of content:
        • A YAML Header.
        • Text, formatted with Markdown.
        • Code chunks.