[CT4101]: Week 4 lecture notes

This commit is contained in:
2024-10-03 14:25:26 +01:00
parent aee5a1587d
commit 2975ff70bd
3 changed files with 103 additions and 0 deletions


@@ -17,6 +17,8 @@
\usepackage[a4paper,left=2cm,right=2cm,top=\dimexpr15mm+1.5\baselineskip,bottom=2cm]{geometry}
\setlength{\parindent}{0pt}
\usepackage{tcolorbox}
\usepackage{amsmath}
\usepackage{fancyhdr} % Headers and footers
\fancyhead[R]{\normalfont \leftmark}
\fancyhead[L]{}
@@ -25,6 +27,7 @@
\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{xcolor}
\setlength{\fboxsep}{0pt}
\definecolor{linkblue}{RGB}{0, 64, 128}
\usepackage[final, colorlinks = false, urlcolor = linkblue]{hyperref}
% \newcommand{\secref}[1]{\textbf{§~\nameref{#1}}}
@@ -47,6 +50,16 @@
\usepackage[yyyymmdd]{datetime}
\renewcommand{\dateseparator}{--}
\usepackage[bottom]{footmisc}
\renewcommand{\footnoterule}{%
\hrule
\vspace{5pt}
}
% Remove superscript from footnote numbering
\renewcommand{\thefootnote}{\arabic{footnote}} % Use Arabic numbers
\renewcommand{\footnotelabel}{\thefootnote. } % Footnote label formatting
\usepackage{enumitem}
\usepackage{titlesec}
@@ -651,6 +664,96 @@ Use of separate training \& test datasets is very important when developing an M
If you use all of your data for training, your model could potentially have good performance on the training data
but poor performance on new independent test data.
\subsection{$k$-NN Hyperparameters}
The $k$-NN algorithm also introduces a new concept to us that is very important for ML algorithms in general: hyperparameters.
In ML algorithms, a \textbf{hyperparameter} is a parameter set by the user that is used to control the behaviour of the learning process.
Many ML algorithms also have other parameters that are set by the algorithm during its learning process (e.g., the weights
assigned to connections between neurons in an artificial neural network).
Examples of hyperparameters include:
\begin{itemize}
\item Learning rate (typically denoted using the Greek letter $\alpha$).
\item Topology of a neural network (the number \& layout of neurons).
\item The choice of optimiser when updating the weights of a neural network.
\end{itemize}
Many ML algorithms are very sensitive to the choice of hyperparameters: poor choice of values yields poor performance.
Therefore, hyperparameter tuning (i.e., determining the values that yield the best performance) is an important topic in ML.
However, some simple ML algorithms do not have any hyperparameters.
\\\\
$k$-NN has several key hyperparameters that we must choose before applying it to a dataset (see the short scikit-learn sketch after this list):
\begin{itemize}
\item The number of neighbours $k$ to take into account when making a prediction: \mintinline{python}{n_neighbors} in the scikit-learn implementation of \mintinline{python}{KNeighborsClassifier}.
\item The method used to measure how similar instances are to one another: \mintinline{python}{metric} in scikit-learn.
\end{itemize}
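A minimal sketch of how these two hyperparameters are set in scikit-learn; the toy training data and class labels here are made up purely for illustration:
\begin{minted}{python}
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: each row holds two feature values for one instance
X_train = [[5.00, 2.50], [2.75, 7.50], [3.00, 4.00], [6.00, 1.00]]
y_train = ["sprinter", "marathoner", "marathoner", "sprinter"]

# Both hyperparameters are chosen by the user before any learning happens
clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
clf.fit(X_train, y_train)

# Predict using the majority class among the 3 nearest neighbours
print(clf.predict([[4.00, 3.00]]))
\end{minted}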
\subsection{Measuring Similarity}
\subsubsection{Measuring Similarity Using Distance}
Consider the college athletes dataset from earlier.
How should we measure the similarity between instances in this case?
\textbf{Distance} is one option: plot the points in 2D space and draw a straight line between them.
We can think of each feature of interest as a dimension of the space in which the instances live, so an instance described by $m$ features becomes a point in $m$-dimensional space.
\\\\
A \textbf{metric} or distance function may be used to define the distance between any pair of elements in a set.
$\text{metric}(a,b)$ is a function that returns the distance between two instances $a$ \& $b$ in a set.
$a$ \& $b$ are vectors containing the values of the attributes of interest for the two data points we wish to measure the distance between.
\subsubsection{Euclidean Distance}
\textbf{Euclidean distance} is one of the best-known distance metrics.
It computes the length of a straight line between two points.
$$
\text{Euclidean}(a,b) = \sqrt{\sum^m_{i=1}(a[i] - b[i])^2}
$$
Here $m$ is the number of features / attributes used to calculate the distance (i.e., the dimension of the vectors $a$ \& $b$).
Euclidean distance calculates the square root of the sum of squared differences for each feature.
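As a quick illustration, a direct Python translation of this formula (assuming $a$ and $b$ are plain sequences of numbers of equal length):
\begin{minted}{python}
import math

def euclidean(a, b):
    """Length of the straight line between the points a and b."""
    return math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))
\end{minted}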
\subsubsection{Manhattan Distance}
\textbf{Manhattan distance} (also known as ``taxicab distance'') is the distance between two points measured along axes at
right angles.
$$
\text{Manhattan}(a,b) = \sum^m_{i=1}\text{abs}(a[i] - b[i])
$$
As before, $m$ is the number of features / attributes to be used to calculate the distance (i.e., the dimension of the vectors $a$ \& $b$) and $\text{abs}()$ is a function which returns the absolute value of a number.
Manhattan distance calculates the sum of the absolute differences for each feature.
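The same formula translated directly into Python (again assuming equal-length numeric sequences):
\begin{minted}{python}
def manhattan(a, b):
    """Sum of the absolute differences along each axis."""
    return sum(abs(a_i - b_i) for a_i, b_i in zip(a, b))
\end{minted}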
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Example: Calculating Distance}]
Calculate the distance between $d_{12} = [5.00, 2.50]$ \& $d_5 = [2.75, 7.50]$.
$$
\text{Euclidean}(d_{12}, d_5) = \sqrt{(5.00 - 2.75)^2 + (2.50 - 7.50)^2} = 5.483
$$
$$
\text{Manhattan}(d_{12}, d_5) = \text{abs}(5.00 - 2.75) + \text{abs}(2.50 - 7.50) = 7.25
$$
\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth]{./images/calc_distance_example.png}
\caption{Euclidean vs Manhattan Distance}
\end{figure}
\end{tcolorbox}
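The hand calculations in the example above can be verified with the two helper functions sketched earlier (\mintinline{python}{scipy.spatial.distance.euclidean} \& \mintinline{python}{scipy.spatial.distance.cityblock} give the same answers):
\begin{minted}{python}
d12 = [5.00, 2.50]
d5 = [2.75, 7.50]

print(round(euclidean(d12, d5), 3))  # 5.483
print(manhattan(d12, d5))            # 7.25
\end{minted}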
\subsubsection{Minkowski Distance}
The \textbf{Minkowski distance} metric generalises both the Manhattan distance and the Euclidean distance metrics.
$$
\text{Minkowski}(a,b) = \left( \sum^m_{i=1} \text{abs}(a[i] - b[i])^p \right)^{\frac{1}{p}}
$$
As before, $m$ is the number of features / attributes to be used to calculate the distance (i.e., the dimension of the vectors $a$ \& $b$).
Minkowski distance raises the absolute difference for each feature to the power $p$, sums these values, and takes the $p$-th root of the sum; setting $p = 1$ recovers Manhattan distance, while $p = 2$ recovers Euclidean distance.
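A short sketch of the general form, confirming on the example points from above that it reduces to the two earlier metrics:
\begin{minted}{python}
def minkowski(a, b, p):
    """p-th root of the summed absolute differences, each raised to the power p."""
    return sum(abs(a_i - b_i) ** p for a_i, b_i in zip(a, b)) ** (1 / p)

print(minkowski([5.00, 2.50], [2.75, 7.50], p=1))            # 7.25 (Manhattan)
print(round(minkowski([5.00, 2.50], [2.75, 7.50], p=2), 3))  # 5.483 (Euclidean)
\end{minted}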
\subsubsection{Similarity for Discrete Attributes}
Thus far we have considered similarity measures that apply only to continuous attributes\footnote{Note that the discrete/continuous distinction between attribute types is not to be confused with the classification/regression distinction between task types.}.
Many datasets have attributes that have a finite number of discrete values (e.g., Yes/No or True/False, survey responses, ratings).
One approach to handling discrete attributes is \textbf{Hamming distance}: count 0 for each attribute on which the two cases have the same value and 1 for each attribute on which they differ, then sum these contributions.
E.g., the Hamming distance between the strings ``Ste\colorbox{yellow}{phe}n'' and ``Ste\colorbox{yellow}{fan}n'' is 3.
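A direct Python version of this idea, treating any two equal-length sequences (such as strings) as the cases to compare:
\begin{minted}{python}
def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(1 for a_i, b_i in zip(a, b) if a_i != b_i)

print(hamming("Stephen", "Stefann"))  # 3
\end{minted}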
\subsubsection{Comparison of Distance Metrics}
Euclidean \& Manhattan distance are the most commonly used distance metrics, although it is possible to define infinitely many distance metrics using the Minkowski distance.
Manhattan distance is cheaper to compute than Euclidean distance, as it requires neither squaring the differences nor taking a square root, so it may be a better choice for very large datasets if computational resources are limited.
It is worthwhile to try out several different distance metrics to see which is the most suitable for the dataset at hand, as in the sketch below.
Many other methods of measuring similarity also exist, including cosine similarity, Russell-Rao, \& Sokal-Michener.
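One simple way to compare metrics in practice is to fit one classifier per metric and score each on independent test data. A sketch assuming scikit-learn, with generated stand-in data (substitute the dataset at hand):
\begin{minted}{python}
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Generated stand-in data; replace with the real feature matrix and labels
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Note: "minkowski" with its default p=2 coincides with "euclidean"
for metric in ["euclidean", "manhattan", "minkowski"]:
    clf = KNeighborsClassifier(n_neighbors=5, metric=metric).fit(X_train, y_train)
    print(metric, clf.score(X_test, y_test))  # accuracy on the held-out test data
\end{minted}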

Binary file not shown (42 KiB)