[CT4101]: Week 8 lecture materials & slides

This commit is contained in:
2024-10-31 10:46:52 +00:00
parent af26190d70
commit 6eea1b4d87
7 changed files with 199 additions and 0 deletions

View File

@ -0,0 +1,3 @@
equal frequency and equal width binning may very well come up on the exam
need to know entropy formulae as they are often asked on the exam

View File

@ -1180,9 +1180,205 @@ The area under the ROC curve is often used to determine the ``strength'' of a cl
scikit-learn provides various classes to generate ROC curves: \url{https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html}.
The link also gives some details on how ROC curves can be computed for multi-class classification problems.
\section{Data Processing \& Normalisation}
\subsection{Data Normalisation}
One major problem addressed by data normalisation is \textbf{scaling}.
For example, if Attribute 1 has a range of 0--10 and Attribute 2 has a range of 0--1000, then Attribute 2 will dominate calculations.
The solution to this problem is to rescale all dimensions independently:
\begin{itemize}
\item Z-normalisation: calculated by subtracting the population mean from an individual raw score and then dividing it by the population standard deviation.
\[
z = \frac{x - \mu}{\sigma}
\]
where $z$ is the z-score, $x$ is the raw score, $\mu$ is the population mean, \& $\sigma$ is the standard deviation of the population.
\\\\
This can be achieved in scikit-learn using the \mintinline{python}{StandardScaler} utility class.
\item Min-Max data scaling: also called 0-1 normalisation or range normalisation.
\[
X' = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
\]
This can be achieved in scikit-learn using the \mintinline{python}{MinMaxScaler} utility class (both scalers are demonstrated in the short sketch after this list).
\end{itemize}
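A minimal sketch of both scalers on a small made-up array (the values are purely illustrative); both classes follow the usual scikit-learn \mintinline{python}{fit}/\mintinline{python}{transform} pattern.
\begin{minted}{python}
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: one attribute with a small range, one with a much larger range
X = np.array([[1.0, 200.0],
              [5.0, 800.0],
              [9.0, 400.0]])

# Z-normalisation: each column rescaled to zero mean and unit standard deviation
z_scaled = StandardScaler().fit_transform(X)

# Min-max (range) normalisation: each column rescaled to the range [0, 1]
range_scaled = MinMaxScaler().fit_transform(X)

print(z_scaled)
print(range_scaled)
\end{minted}
Note that, to avoid leaking information from the test data, a scaler should be fitted on the training set only and then applied to the test set via \mintinline{python}{transform}.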
It is generally good practice to normalise continuous variables before developing an ML model.
Some algorithms (e.g., $k$-NN) are much more susceptible to the effects of the relative scale of attributes than others (e.g., decision trees are more robust to the effects of scale).
\subsection{Binning}
\textbf{Binning} involves converting a continuous feature into a categorical feature.
To perform binning, we define a series of ranges, called \textbf{bins}, for the continuous feature that correspond to the levels of the new categorical feature we are creating.
Two of the more popular ways of defining bins are equal-width binning \& equal-frequency binning.
\\\\
Deciding on the number of bins can be complex:
in general, if we set the number of bins to a very low number we may lose a lot of information, but if we set the number of bins to a very high number then we might have very few instances in each bin or even end up with empty bins.
\begin{figure}[H]
\centering
\includegraphics[width=0.7\textwidth]{images/numbins.png}
\caption{The effect of different numbers of bins}
\end{figure}
\subsubsection{Equal-Width Binning}
The \textbf{equal-width binning} approach splits the range of the feature values into $b$ bins, each of size $\frac{\text{range}}{b}$.
\begin{figure}[H]
\centering
\includegraphics[width=0.7\textwidth]{images/equalbins.png}
\caption{Equal-width binning}
\end{figure}
\subsubsection{Equal-Frequency Binning}
\textbf{Equal-frequency binning} first sorts the continuous feature values into ascending order, and then places an equal number of instances into each bin, starting with bin 1.
The number of instances placed in each bin is simply the total number of instances divided by the number of bins $b$.
\begin{figure}[H]
\centering
\includegraphics[width=0.7\textwidth]{images/freqbins.png}
\caption{Equal-frequency binning}
\end{figure}
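Both strategies are available in scikit-learn via the \mintinline{python}{KBinsDiscretizer} utility class; below is a minimal sketch on a single made-up feature, where \mintinline{python}{strategy="uniform"} gives equal-width bins and \mintinline{python}{strategy="quantile"} gives equal-frequency bins.
\begin{minted}{python}
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# A single continuous feature (values are illustrative only)
x = np.array([[1.0], [2.0], [2.5], [3.0], [7.0], [8.0], [9.0], [20.0]])

# Equal-width binning: the range of x is split into 4 bins of equal size
equal_width = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
print(equal_width.fit_transform(x).ravel())  # bin index for each instance

# Equal-frequency binning: each bin receives (roughly) the same number of instances
equal_freq = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
print(equal_freq.fit_transform(x).ravel())
\end{minted}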
\subsection{Sampling}
Sometimes the dataset that we have is so large that we do not use all of the data available to us, and instead take a smaller percentage from the larger dataset.
For example, we may wish to use only part of the data because, for some algorithms, training takes a very long time when there are very many examples.
In the case of $k$-NN, a very large training set may lead to long prediction times.
However, we need to be careful when sampling to ensure that the resulting datasets are still representative of the original data and that no unintended bias is introduced during this process.
Common forms of sampling include: top sampling, random sampling, stratified sampling, under-sampling, \& over-sampling.
\subsubsection{Top Sampling}
\textbf{Top sampling} simply selects the top $s\%$ of instances from a dataset to create a sample.
It runs a serious risk of introducing bias as the sample will be affected by any ordering of the original dataset; therefore, top sampling should be avoided.
\subsubsection{Random Sampling}
\textbf{Random sampling} is a good default sampling strategy as it randomly selects a proportion ($s\%$) of the instances from a large dataset to create a smaller set.
It is a good choice in most cases as the random nature of the selection of instances should avoid introducing bias.
\subsubsection{Stratified Sampling}
\textbf{Stratified sampling} is a sampling method that ensures that the relative frequencies of the levels of a specific stratification feature are maintained in the sampled dataset.
To perform stratified sampling, the instances in a dataset are divided into groups (or \textit{strata}), where each group contains only instances that have a particular level for the stratification feature.
$s\%$ of the instances in each stratum are randomly selected and these selections are combined to give an overall sample of $s\%$ of the original dataset.
\\\\
In contrast to stratified sampling, sometimes we would like a sample to contain relative frequencies of the levels of a particular discrete feature that differ from the distribution in the original dataset.
To do this, we can use under-sampling or over-sampling.
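Before turning to those, here is a minimal sketch of random \& stratified sampling with pandas; the DataFrame, the stratification column \mintinline{python}{"class"}, and the sampling proportion are all made up for illustration.
\begin{minted}{python}
import pandas as pd

# Hypothetical dataset with a discrete stratification feature named "class"
df = pd.DataFrame({
    "feature": range(100),
    "class": ["a"] * 70 + ["b"] * 30,
})

s = 0.20  # sample 20% of the instances

# Random sampling: select s% of the instances uniformly at random
random_sample = df.sample(frac=s, random_state=42)

# Stratified sampling: sample s% from each stratum separately and combine,
# so the relative frequencies of the "class" levels are preserved
stratified_sample = df.groupby("class").sample(frac=s, random_state=42)

print(random_sample["class"].value_counts(normalize=True))
print(stratified_sample["class"].value_counts(normalize=True))
\end{minted}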
\subsubsection{Under-Sampling}
\textbf{Under-sampling} begins by dividing a dataset into groups, where each group contains only instances that have a particular level for the feature to be under-sampled.
The number of instances in the smallest group is the under-sampling target size.
Each group containing more instances than the smallest one is then randomly sampled by the appropriate percentage to create a subset that is the under-sampling target size.
These under-sampled groups are then combined to create the overall under-sampled dataset.
\subsubsection{Over-Sampling}
\textbf{Over-sampling} addresses the same issue as under-sampling but in the opposite way:
after dividing the dataset into groups, the number of instances in the largest group becomes the over-sampling target size.
From each smaller group, we then create a sample containing that number of instances using random sampling with replacement (replacement is necessary, since each of these samples is larger than the group it is drawn from).
These larger samples are combined to form the overall over-sampled dataset.
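A minimal sketch of under-sampling \& over-sampling with pandas follows; the hypothetical dataset, its column names, and the class imbalance are made up for illustration.
\begin{minted}{python}
import pandas as pd

# Hypothetical imbalanced dataset: 90 instances of class "a", 10 of class "b"
df = pd.DataFrame({
    "feature": range(100),
    "class": ["a"] * 90 + ["b"] * 10,
})

group_sizes = df.groupby("class").size()

# Under-sampling: shrink every group to the size of the smallest group
under_sampled = df.groupby("class").sample(n=group_sizes.min(), random_state=42)

# Over-sampling: grow every group to the size of the largest group;
# sampling with replacement is needed because small groups must be expanded
over_sampled = df.groupby("class").sample(n=group_sizes.max(), replace=True,
                                          random_state=42)

print(under_sampled["class"].value_counts())
print(over_sampled["class"].value_counts())
\end{minted}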
\subsection{Feature Selection}
\subsubsection{The Curse of Dimensionality}
Some attributes are much more significant than others, but all attributes are considered equally in the distance metric, possibly leading to bad predictions.
$k$-NN uses all attributes when making a prediction, whereas other algorithms (e.g., decision trees) use only the most useful features and so are not as badly affected by the curse of dimensionality.
Any algorithm that considers all attributes in a high-dimensional space equally has this problem, not just $k$-NN + Euclidean distance.
Two solutions to the curse of dimensionality are:
\begin{itemize}
\item Assign weighting to each dimension (not the same as distance-weighted $k$-NN).
Optimise weighting to minimise error.
\item Give some dimensions 0 weight: feature subset selection.
\end{itemize}
Consider $N$ cases with $d$ dimensions, in a hypercube of \textit{unit} volume.
Assume that neighbourhoods are hypercubes with length $b$; volume is $b^d$.
To contain $k$ points, the average neighbourhood must occupy $\frac{k}{N}$ of the entire volume.
\begin{align*}
\Rightarrow& b^d = \frac{k}{N} \\
\Rightarrow& b = \left( \frac{k}{N}\right)^\frac{1}{d}
\end{align*}
In high dimensions, e.g., $k=10$, $N=1,000,000$, $d=100$, then $b \approx 0.89$, i.e., a neighbourhood must span roughly 90\% of each dimension of the space.
In low dimensions, with $k=10$, $N=1,000,000$, $d=2$, then $b \approx 0.003$, which is acceptable.
High dimensional spaces are generally very sparse, and each neighbour is very far away.
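A quick numeric check of the figures above; this is plain Python that simply evaluates $b = \left(\frac{k}{N}\right)^{\frac{1}{d}}$.
\begin{minted}{python}
# Neighbourhood edge length b = (k / N)^(1/d) in a unit hypercube
def edge_length(k: int, n: int, d: int) -> float:
    return (k / n) ** (1.0 / d)

# Reproduces the figures quoted above
print(edge_length(k=10, n=1_000_000, d=100))  # ~0.891: spans ~90% of each dimension
print(edge_length(k=10, n=1_000_000, d=2))    # ~0.003: tiny neighbourhoods in low dimensions
\end{minted}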
\subsubsection{Feature Selection}
Fortunately, some algorithms partially mitigate the effects of the curse of dimensionality (e.g., decision tree learning).
However, this is not true for all algorithms, and heuristics for search can sometimes be misleading.
$k$-NN and many other algorithms use all attributes when making a prediction.
Acquiring more data is not always a realistic option; the best way to avoid the curse of dimensionality is to use only the most useful features during learning: \textbf{feature selection}.
\\\\
We may wish to distinguish between different types of descriptive features:
\begin{itemize}
\item \textbf{Predictive:} provides information that is useful when estimating the correct target value.
\item \textbf{Interacting:} provides useful information only when considered in conjunction with other features.
\item \textbf{Redundant:} features that have a strong correlation with another feature.
\item \textbf{Irrelevant:} doesn't provide any useful information for estimating the target value.
\end{itemize}
Ideally, a good feature selection approach should identify the smallest subset of features that maintain prediction performance.
\subsubsection{Feature Selection Approaches}
\begin{itemize}
\item \textbf{Rank \& Prune:} rank features according to their predictive power and keep only the top $X\%$.
A \textbf{filter} is a measure of the predictive power used during ranking, e.g., information gain.
A drawback of rank \& prune is that features are evaluated in isolation, so we will miss useful \textit{interacting features}.
\item \textbf{Search for useful feature subsets:}
we can pick out useful interacting features by evaluating feature subsets.
We could generate, evaluate, \& rank all possible feature subsets then pick the best (essentially a brute force approach, computationally expensive).
A better approach is a \textbf{greedy local search}, which builds the feature subset iteratively by starting out with an empty selection, then trying to add additional features incrementally.
This requires evaluation experiments along the way.
We stop trying to add more features to the selection once termination conditions are met (both approaches are sketched with scikit-learn after this list).
\end{itemize}
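A minimal sketch of both approaches using scikit-learn on a synthetic dataset is shown below; the filter used is mutual information (closely related to information gain), and the parameter choices (keeping 5 features, a 3-NN model as the evaluator in the greedy search) are arbitrary.
\begin{minted}{python}
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, SequentialFeatureSelector,
                                        mutual_info_classif)
from sklearn.neighbors import KNeighborsClassifier

# Synthetic dataset: 20 features, only 5 of which are informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=42)

# Rank & prune: score each feature in isolation with a filter
# (mutual information here) and keep only the top 5
rank_and_prune = SelectKBest(score_func=mutual_info_classif, k=5)
X_filtered = rank_and_prune.fit_transform(X, y)

# Greedy local search: start from an empty subset and add one feature at a time,
# evaluating candidate subsets by cross-validated accuracy of a k-NN classifier
greedy = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=3),
                                   n_features_to_select=5, direction="forward")
X_greedy = greedy.fit_transform(X, y)

print(rank_and_prune.get_support(indices=True))  # features chosen by the filter
print(greedy.get_support(indices=True))          # features chosen by greedy search
\end{minted}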
\subsection{Covariance \& Correlation}
As well as visually inspecting scatter plots, we can calculate formal measures of the relationship between two continuous features using \textbf{covariance} \& \textbf{correlation}.
\subsubsection{Measuring Covariance}
For two features $a$ \& $b$ in a dataset of $n$ instances, the \textbf{sample covariance} between $a$ \& $b$ is:
\[
\text{cov}(a,b) = \frac{1}{n-1} \sum^n_{i=1} \left( \left( a_i - \overline{a} \right) \times \left( b_i - \overline{b} \right) \right)
\]
where $a_i$ \& $b_i$ are the $i^\text{th}$ instances of features $a$ \& $b$ in a dataset, and $\overline{a}$ \& $\overline{b}$ are the sample means of features $a$ \& $b$.
\\\\
Covariance values fall into the range $[- \infty, \infty]$, where negative values indicate a negative relationship, positive values indicate a positive relationship, \& values near to zero indicate that there is little to no relationship between the features.
\\\\
Covariance is measured in terms of the units of the features involved, so comparing covariances across feature pairs (e.g., the covariance between a basketball player's height \& weight versus the covariance between their height \& age) is not meaningful when the features are in different units.
To solve this problem, we use the \textbf{correlation coefficient}, also known as the Pearson product-moment correlation coefficient or Pearson's $r$.
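As a quick check of the formula, the sketch below computes the sample covariance directly and compares it to \mintinline{python}{numpy.cov}; the two small feature vectors are made up.
\begin{minted}{python}
import numpy as np

# Two made-up continuous features measured on the same five instances
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 1.5, 3.5, 3.0, 5.0])

n = len(a)
# Direct implementation of the sample covariance formula (note the n - 1 divisor)
cov_ab = np.sum((a - a.mean()) * (b - b.mean())) / (n - 1)

# numpy.cov returns the full covariance matrix; the off-diagonal entry is cov(a, b)
print(cov_ab, np.cov(a, b)[0, 1])
\end{minted}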
\subsubsection{Measuring Correlation}
\textbf{Correlation} is a normalised form of covariance with range $[-1,1]$.
The correlation between two features $a$ \& $b$ can be calculated as
\[
\text{corr}(a,b) = \frac{\text{cov}(a,b)}{\text{sd}(a) \times \text{sd}(b)}
\]
where $\text{cov}(a,b)$ is the covariance between features $a$ \& $b$, and $\text{sd}(a)$ \& $\text{sd}(b)$ are the standard deviations of $a$ \& $b$ respectively.
\\\\
Correlation values fall into the range $[-1,1]$ where values close to $-1$ indicate a very strong negative correlation (or covariance), values close to $1$ indicate a very strong positive correlation, \& values around $0$ indicate no correlation.
Features that have no correlation are said to be \textbf{independent}.
\\\\
The \textbf{covariance matrix}, usually denoted as $\Sigma$, between a set of continuous features $\{a, b, \dots, z\}$ is given as
\[
\Sigma_{\{ a, b, \dots, z \}} =
\begin{bmatrix}
\text{var}(a) & \text{cov}(a,b) & \cdots & \text{cov}(a,z) \\
\text{cov}(b,a) & \text{var}(b) & \cdots & \text{cov}(b,z) \\
\vdots & \vdots & \ddots & \vdots \\
\text{cov}(z,a) & \text{cov}(z,b) & \cdots & \text{var}(z)
\end{bmatrix}
\]
Similarly, the \textbf{correlation matrix} is just a normalised version of the covariance matrix and shows the correlation between each pair of features:
\[
\underset{\{ a,b,\dots,z \}}{\text{correlation matrix}} =
\begin{bmatrix}
\text{corr}(a,a) & \text{corr}(a,b) & \cdots & \text{corr}(a,z) \\
\text{corr}(b,a) & \text{corr}(b,b) & \cdots & \text{corr}(b,z) \\
\vdots & \vdots & \ddots & \vdots \\
\text{corr}(z,a) & \text{corr}(z,b) & \cdots & \text{corr}(z,z)
\end{bmatrix}
\]
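In practice, the covariance \& correlation matrices can be computed directly from a pandas DataFrame; a minimal sketch on made-up height \& weight data:
\begin{minted}{python}
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
height = rng.normal(180, 10, size=50)               # made-up heights in cm
weight = 0.9 * height + rng.normal(0, 5, size=50)   # correlated made-up weights in kg
df = pd.DataFrame({"height": height, "weight": weight})

print(df.cov())   # covariance matrix (uses the n - 1 divisor)
print(df.corr())  # Pearson correlation matrix, values in [-1, 1]
\end{minted}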
Correlation is a good measure of the relationship between two continuous features, but is not perfect.
Firstly, the correlation measure given earlier responds only to linear relationships between features.
In a linear relationship between two features, as one feature increases or decreases, the other feature increases or decreases by a corresponding amount.
Frequently, features will have a very strong non-linear relationship that correlation does not respond to.
Some limitations of measuring correlation are illustrated very clearly in the famous example of Anscombe's Quartet, published by the statistician Francis Anscombe in 1973.
