diff --git a/year4/semester1/CT4101: Machine Learning/notes/CT4101-Notes.pdf b/year4/semester1/CT4101: Machine Learning/notes/CT4101-Notes.pdf
index a5f6a199..60ad7f26 100644
Binary files a/year4/semester1/CT4101: Machine Learning/notes/CT4101-Notes.pdf and b/year4/semester1/CT4101: Machine Learning/notes/CT4101-Notes.pdf differ
diff --git a/year4/semester1/CT4101: Machine Learning/notes/CT4101-Notes.tex b/year4/semester1/CT4101: Machine Learning/notes/CT4101-Notes.tex
index 863ec5f2..db7f151f 100644
--- a/year4/semester1/CT4101: Machine Learning/notes/CT4101-Notes.tex
+++ b/year4/semester1/CT4101: Machine Learning/notes/CT4101-Notes.tex
@@ -17,6 +17,8 @@
 \usepackage[a4paper,left=2cm,right=2cm,top=\dimexpr15mm+1.5\baselineskip,bottom=2cm]{geometry}
 \setlength{\parindent}{0pt}
+\usepackage{tcolorbox}
+\usepackage{amsmath}
 \usepackage{fancyhdr} % Headers and footers
 \fancyhead[R]{\normalfont \leftmark}
 \fancyhead[L]{}
@@ -25,6 +27,7 @@
 \usepackage{microtype} % Slightly tweak font spacing for aesthetics
 \usepackage[english]{babel} % Language hyphenation and typographical rules
 \usepackage{xcolor}
+\setlength{\fboxsep}{0pt}
 \definecolor{linkblue}{RGB}{0, 64, 128}
 \usepackage[final, colorlinks = false, urlcolor = linkblue]{hyperref}
 % \newcommand{\secref}[1]{\textbf{§~\nameref{#1}}}
@@ -47,6 +50,16 @@
 \usepackage[yyyymmdd]{datetime}
 \renewcommand{\dateseparator}{--}
+\usepackage[bottom]{footmisc}
+\renewcommand{\footnoterule}{%
+    \hrule
+    \vspace{5pt}
+}
+
+% Remove superscript from footnote numbering
+\renewcommand{\thefootnote}{\arabic{footnote}} % Use Arabic numbers
+\renewcommand{\footnotelabel}{\thefootnote. } % Footnote label formatting
+
 \usepackage{enumitem}
 \usepackage{titlesec}
@@ -651,6 +664,96 @@ Use of separate training \& test datasets is very important when developing an M
 If you use all of your data for training, your model could potentially have good performance on the training data but poor performance on new independent test data.
+\subsection{$k$-NN Hyperparameters}
+The $k$-NN algorithm also introduces a new concept that is very important for ML algorithms in general: hyperparameters.
+In ML algorithms, a \textbf{hyperparameter} is a parameter whose value is set by the user to control the behaviour of the learning process.
+Many ML algorithms also have other parameters that are set by the algorithm itself during the learning process (e.g., the weights assigned to connections between neurons in an artificial neural network).
+Examples of hyperparameters include:
+\begin{itemize}
+    \item Learning rate (typically denoted using the Greek letter $\alpha$).
+    \item Topology of a neural network (the number \& layout of neurons).
+    \item The choice of optimiser when updating the weights of a neural network.
+\end{itemize}
+
+Many ML algorithms are very sensitive to the choice of hyperparameters: a poor choice of values yields poor performance.
+Therefore, hyperparameter tuning (i.e., determining the values that yield the best performance) is an important topic in ML.
+However, some simple ML algorithms do not have any hyperparameters.
+\\\\
+$k$-NN has two key hyperparameters that we must choose before applying it to a dataset:
+\begin{itemize}
+    \item The number of neighbours $k$ to take into account when making a prediction: \mintinline{python}{n_neighbors} in the scikit-learn implementation, \mintinline{python}{KNeighborsClassifier}.
+    \item The method used to measure how similar instances are to one another: \mintinline{python}{metric} in scikit-learn.
+\end{itemize}
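+
+As a rough illustration of how these two hyperparameters are set in practice, the sketch below builds a scikit-learn classifier on a tiny made-up dataset (the feature values, class labels, \& query point are invented purely for illustration):
+\begin{minted}{python}
+from sklearn.neighbors import KNeighborsClassifier
+
+# Toy training data: each row is [feature 1, feature 2]; y_train holds the class labels
+X_train = [[5.00, 2.50], [2.75, 7.50], [6.10, 3.20], [3.00, 6.80]]
+y_train = ["sprinter", "distance runner", "sprinter", "distance runner"]
+
+# n_neighbors (k) and metric are hyperparameters chosen by the user before training
+clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
+clf.fit(X_train, y_train)
+
+print(clf.predict([[4.50, 4.00]]))  # predicted class of a new, unseen instance
+\end{minted}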
+
+\subsection{Measuring Similarity}
+\subsubsection{Measuring Similarity Using Distance}
+Consider the college athletes dataset from earlier.
+How should we measure the similarity between instances in this case?
+\textbf{Distance} is one option: plot the points in 2D space and draw a straight line between them.
+We can think of each feature of interest as a dimension in hyperspace.
+\\\\
+A \textbf{metric} or distance function may be used to define the distance between any pair of elements in a set.
+$\text{metric}(a,b)$ is a function that returns the distance between two instances $a$ \& $b$ in a set.
+$a$ \& $b$ are vectors containing the values of the attributes we are interested in for the data points we wish to measure between.
+
+\subsubsection{Euclidean Distance}
+\textbf{Euclidean distance} is one of the best-known distance metrics.
+It computes the length of a straight line between two points.
+$$
+\text{Euclidean}(a,b) = \sqrt{\sum^m_{i=1}(a[i] - b[i])^2}
+$$
+
+Here $m$ is the number of features / attributes used to calculate the distance (i.e., the dimension of the vectors $a$ \& $b$).
+Euclidean distance calculates the square root of the sum of the squared differences for each feature.
+
+\subsubsection{Manhattan Distance}
+\textbf{Manhattan distance} (also known as ``taxicab distance'') is the distance between two points measured along axes at right angles.
+$$
+\text{Manhattan}(a,b) = \sum^m_{i=1}\text{abs}(a[i] - b[i])
+$$
+
+As before, $m$ is the number of features / attributes used to calculate the distance (i.e., the dimension of the vectors $a$ \& $b$) and $\text{abs}()$ is a function which returns the absolute value of a number.
+Manhattan distance calculates the sum of the absolute differences for each feature.
+
+\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Example: Calculating Distance}]
+    Calculate the distance between $d_{12} = [5.00, 2.50]$ \& $d_5 = [2.75, 7.50]$.
+    $$
+    \text{Euclidean}(d_{12}, d_5) = \sqrt{(5.00 - 2.75)^2 + (2.50 - 7.50)^2} = 5.483
+    $$
+    $$
+    \text{Manhattan}(d_{12}, d_5) = \text{abs}(5.00 - 2.75) + \text{abs}(2.50 - 7.50) = 7.25
+    $$
+
+    \begin{figure}[H]
+        \centering
+        \includegraphics[width=0.5\textwidth]{./images/calc_distance_example.png}
+        \caption{Euclidean vs Manhattan Distance}
+    \end{figure}
+\end{tcolorbox}
+
+\subsubsection{Minkowski Distance}
+The \textbf{Minkowski distance} metric generalises both the Manhattan \& Euclidean distance metrics.
+$$
+\text{Minkowski}(a,b) = \left( \sum^m_{i=1} \text{abs}(a[i] - b[i])^p \right)^{\frac{1}{p}}
+$$
+As before, $m$ is the number of features / attributes used to calculate the distance (i.e., the dimension of the vectors $a$ \& $b$), and $p$ is a parameter chosen by the user.
+Minkowski distance raises the absolute difference for each feature to the power $p$, sums these values, and takes the $p^{\text{th}}$ root of the result: setting $p = 1$ gives the Manhattan distance, while $p = 2$ gives the Euclidean distance.
+
+\subsubsection{Similarity for Discrete Attributes}
+Thus far we have considered similarity measures that only apply to continuous attributes\footnote{Note that the distinction between discrete \& continuous attributes is not to be confused with the distinction between classification \& regression, which concerns the target being predicted.}.
+Many datasets have attributes that take on a finite number of discrete values (e.g., Yes/No or True/False, survey responses, ratings).
+One approach to handling discrete attributes is the \textbf{Hamming distance}: a distance of 0 is counted for each attribute where both cases have the same value and 1 for each attribute where they differ, and the total distance is the sum of these values.
+E.g., the Hamming distance between the strings ``Ste\colorbox{yellow}{phe}n'' and ``Ste\colorbox{yellow}{fan}n'' is 3, as the two strings differ in 3 character positions.
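+
+To make these definitions concrete, the short sketch below (plain Python; the helper functions are our own, not taken from any library) implements the four distance measures above and reproduces the values from the worked examples:
+\begin{minted}{python}
+def minkowski(a, b, p):
+    # Sum the absolute differences raised to the power p, then take the p-th root
+    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)
+
+def euclidean(a, b):
+    return minkowski(a, b, 2)  # p = 2: straight-line distance
+
+def manhattan(a, b):
+    return minkowski(a, b, 1)  # p = 1: sum of absolute differences
+
+def hamming(a, b):
+    # Count the attributes (or string positions) where the two cases differ
+    return sum(1 for x, y in zip(a, b) if x != y)
+
+d_12, d_5 = [5.00, 2.50], [2.75, 7.50]
+print(round(euclidean(d_12, d_5), 3))  # 5.483
+print(manhattan(d_12, d_5))            # 7.25
+print(hamming("Stephen", "Stefann"))   # 3
+\end{minted}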
+
+\subsubsection{Comparison of Distance Metrics}
+Euclidean \& Manhattan distance are the most commonly used distance metrics, although it is possible to define infinitely many others using the Minkowski distance.
+Manhattan distance is cheaper to compute than Euclidean distance, as it does not require squaring the differences or taking a square root, so it may be a better choice for very large datasets if computational resources are limited.
+It's worthwhile to try out several different distance metrics to see which is the most suitable for the dataset at hand.
+Many other methods to measure similarity also exist, including cosine similarity, Russell-Rao, \& Sokal-Michener.
+
+
diff --git a/year4/semester1/CT4101: Machine Learning/notes/images/calc_distance_example.png b/year4/semester1/CT4101: Machine Learning/notes/images/calc_distance_example.png
new file mode 100644
index 00000000..2f58df5a
Binary files /dev/null and b/year4/semester1/CT4101: Machine Learning/notes/images/calc_distance_example.png differ