[CT4101]: Week 5 lecture notes
@@ -57,8 +57,8 @@
}

% Remove superscript from footnote numbering
\renewcommand{\thefootnote}{\arabic{footnote}} % Use Arabic numbers
\renewcommand{\footnotelabel}{\thefootnote. } % Footnote label formatting
% \renewcommand{\thefootnote}{\arabic{footnote}} % Use Arabic numbers
% \renewcommand{\footnotelabel}{\thefootnote. } % Footnote label formatting

\usepackage{enumitem}
@@ -664,6 +664,61 @@ Use of separate training \& test datasets is very important when developing an M
If you use all of your data for training, your model could potentially have good performance on the training data
but poor performance on new independent test data.
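
As a minimal sketch of this workflow (using scikit-learn's \mintinline{python}{train_test_split} and the built-in Iris dataset purely for illustration; the 70/30 split ratio is an assumption, not a requirement), the data can be partitioned as follows:
\begin{minted}{python}
# Minimal sketch: hold out a test set so that performance is measured
# on data the model has never seen during training.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 70% of the data is used for training, 30% is kept aside for testing.
# random_state fixes the shuffle so that the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
\end{minted}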

\section{$k$-Nearest Neighbours Algorithm}
\textbf{$k$-nearest neighbours} (or $k$-NN) is one of the simplest machine learning algorithms.
It generates a hypothesis using a very simple principle: predictions for the label or value assigned to a \textit{query instance} should be made based on the most \textit{similar} instances in the training dataset.
Hence, this is also known as \textbf{similarity-based learning}.
\\\\
$k$-NN can be used for both classification \& regression tasks, although for now we will focus only on its application to classification tasks using the scikit-learn implementation \mintinline{python}{KNeighborsClassifier}.
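
A minimal usage sketch of \mintinline{python}{KNeighborsClassifier} is shown below (the Iris dataset, split ratio, and $k = 5$ are illustrative assumptions only):
\begin{minted}{python}
# Minimal sketch: fitting and evaluating a k-NN classifier in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# n_neighbors is the k hyperparameter; 5 is just an example value.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)           # "training" = storing the instances

print(knn.predict(X_test[:3]))      # predicted classes for three query cases
print(knn.score(X_test, y_test))    # classification accuracy on the test set
\end{minted}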

\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth]{images/knn.png}
\caption{$k$-Nearest Neighbour Example}
\end{figure}

The operation of the $k$-NN algorithm is relatively easy to appreciate.
The key insight is that each example is a point in the feature space.
If samples are close to each other in the feature space, they should be close in their target values.
This is related to \textit{case-based reasoning}.
When you want to classify a new \textbf{query case}, you compare it to the stored set and retrieve the $k$ most similar instances.
The query case is then given a label based on the most similar instances.
\\\\
The prediction for a query case is based on several ($k$) nearest neighbours.
We compute the similarity of the query case to all stored cases and pick the nearest $k$ neighbours;
the simplest way to do this is to sort the instances by distance and pick the lowest $k$ instances.
A more efficient way would be to identify the $k$ nearest instances in a single pass through the list of distances.
The $k$ nearest neighbours then vote on the classification of the test case: the prediction is the \textbf{majority} class voted for.
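
To make this concrete, here is a short from-scratch sketch of the distance-then-majority-vote procedure (a toy illustration with made-up data, not the scikit-learn implementation):
\begin{minted}{python}
# Toy k-NN classifier: compute all distances, take the k nearest,
# and predict the majority class among those neighbours.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Euclidean distance from the query case to every stored case
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances
    # (equivalent to sorting by distance and taking the first k)
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["red", "red", "blue", "blue"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # "red"
\end{minted}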

\subsection{The Nearest Neighbour Algorithm}
The \textbf{1-nearest neighbour algorithm} is the simplest similarity-based / instance-based method.
There is no real training phase; we just store the training cases.
Given a query case with a value to be predicted, we compute the distance of the query case from all stored instances and select the nearest neighbour case.
We then assign the test case the same label (class or regression value) as its nearest neighbour.
The main problem with this approach is susceptibility to noise; to reduce this susceptibility, we use more than one neighbour, i.e., the $k$-nearest neighbours algorithm.
\\\\
1NN with Euclidean distance as the distance metric is equivalent to partitioning the feature space into a \textbf{Voronoi Tessellation}:
finding the predicted target class is equivalent to finding which Voronoi region the query case occupies.

\begin{figure}[H]
\centering
\includegraphics[width=0.9\textwidth]{images/voronoi.png}
\caption{Feature Space Plot (left) \& Corresponding Voronoi Tessellation (right)}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=0.9\textwidth]{images/1nnboundary.png}
\caption{1NN Decision Boundary from Voronoi Tessellation}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=0.9\textwidth]{images/morevoronoi.png}
\caption{Effect of Adding More Training Data to Voronoi Tessellation}
\end{figure}

\subsection{$k$-NN Hyperparameters}
The $k$-NN algorithm also introduces a concept that is very important for ML algorithms in general: hyperparameters.
In ML algorithms, a \textbf{hyperparameter} is a parameter set by the user that is used to control the behaviour of the learning process.
@@ -753,9 +808,125 @@ Manhattan distance is cheaper to compute than Euclidean distance as it is not ne
It's worthwhile to try out several different distance metrics to see which is the most suitable for the dataset at hand.
Many other methods to measure similarity also exist, including cosine similarity, Russell-Rao, \& Sokal-Michener.
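
As a small illustration of how two common metrics can differ for the same pair of instances (the feature values below are arbitrary), consider the following sketch; in scikit-learn, the metric used by \mintinline{python}{KNeighborsClassifier} can be changed via its \mintinline{python}{metric} parameter (e.g. \mintinline{python}{metric="manhattan"}):
\begin{minted}{python}
# Comparing Euclidean and Manhattan distance for the same pair of instances.
import numpy as np

a = np.array([2.0, 3.0, 5.0])
b = np.array([5.0, 1.0, 4.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(9 + 4 + 1) ~ 3.742
manhattan = np.sum(np.abs(a - b))          # 3 + 2 + 1 = 6.0

print(euclidean, manhattan)
\end{minted}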

\subsection{Choosing a Value for $k$}
The appropriate value for $k$ is application-dependent, and experimentation is needed to find the optimal value.
Typically, it is $> 3$ and often in the range $5$--$21$.
Increasing $k$ has a \textbf{smoothing effect}:
\begin{itemize}
\item If $k$ is too low, the model tends to overfit if the data is noisy.
\item If $k$ is too high, the model tends to underfit.
\end{itemize}

In imbalanced datasets, the majority target class tends to dominate for large $k$ values.
It's important to note that $k$ does not affect computational cost much: most of the computation is in calculating the distances from the query to all stored instances.
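
One simple way to experiment (sketched below with the Iris dataset and a held-out test set, both illustrative assumptions) is to fit a classifier for each candidate $k$ and compare test accuracy:
\begin{minted}{python}
# Sketch: trying several values of k and comparing accuracy on a held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

for k in range(3, 22, 2):  # candidate k values in the commonly used range
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))
\end{minted}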

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{images/increasingk.png}
\caption{Effect of Increasing $k$ (1)}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{images/increasingk2.png}
\caption{Effect of Increasing $k$ (2)}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{images/smoothingeffectk.png}
\caption{Smoothing Effect of $k$}
\end{figure}

\subsubsection{Distance-Weighted $k$-NN}
In \textbf{distance-weighted $k$-NN}, we give each neighbour a weight equal to the inverse of its distance from the query case.
We then take the weighted vote or weighted average to classify the query case.
With distance weighting, it is even reasonable to set $k$ equal to the total number of training cases.
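
In scikit-learn, distance weighting is enabled through the \mintinline{python}{weights} parameter of \mintinline{python}{KNeighborsClassifier}; the comparison below is a sketch (dataset, split, and $k = 15$ are illustrative assumptions):
\begin{minted}{python}
# Sketch: uniform voting vs. inverse-distance weighting in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

uniform = KNeighborsClassifier(n_neighbors=15, weights="uniform")
weighted = KNeighborsClassifier(n_neighbors=15, weights="distance")

print(uniform.fit(X_train, y_train).score(X_test, y_test))
print(weighted.fit(X_train, y_train).score(X_test, y_test))
\end{minted}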

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{images/distanceweightingknn.png}
\caption{Effect of Distance Weighting}
\end{figure}

\section{Decision Trees}
\textbf{Decision trees} are a fundamental structure used in information-based machine learning.
The idea is to use a decision tree as a predictive model to decide what category/label/class an item belongs to based on the values of its features.
Decision trees consist of \textbf{nodes} (points where branches meet), which act as decision points that partition the data.
Observations about an item (values of features) are represented using branches.
The terminal nodes are called \textbf{leaves} and specify the target label for an item.
The inductive learning of a decision tree is as follows:
\begin{enumerate}
\item For all attributes that have not yet been used in the tree, calculate their impurity (\textbf{entropy} or \textbf{Gini index}) and \textbf{information/Gini gain} values for the training samples.
\item Select the attribute that has the \textbf{highest} information gain.
\item Make a tree node containing that attribute.
\item This node \textbf{partitions} the data: apply the algorithm recursively to each partition.
\end{enumerate}

The main class used in scikit-learn to implement decision tree learning for classification tasks is \mintinline{python}{DecisionTreeClassifier}.
The default measure of impurity is the Gini index, but entropy is also an option.
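
A minimal usage sketch of \mintinline{python}{DecisionTreeClassifier} follows (the Iris dataset and the choice of \mintinline{python}{criterion="entropy"} are illustrative assumptions; omitting \mintinline{python}{criterion} keeps the Gini default):
\begin{minted}{python}
# Sketch: fitting a decision tree classifier with entropy as the impurity measure.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# criterion defaults to "gini"; "entropy" selects information-gain-style splits.
tree = DecisionTreeClassifier(criterion="entropy", random_state=42)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))  # accuracy on the held-out test set
print(tree.get_depth())            # depth of the learned tree
\end{minted}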

\subsection{Computing Entropy}
We already saw how some descriptive features can more effectively discriminate between (or predict) the classes which are present in the dataset.
Decision trees partition the data at each node, so it makes sense to use features which have higher discriminatory power ``higher up'' in a decision tree.
Therefore, we need to develop a formal measure of the discriminatory power of a given attribute.
\\\\
Claude Shannon (often referred to as ``the father of information theory'') proposed a measure of the impurity of the elements in a set called \textbf{entropy}.
Entropy may be used to measure the uncertainty of a random variable.
The term ``entropy'' generally refers to disorder or uncertainty, so the use of this term in the context of information theory is analogous to other well-known uses of the term, such as in statistical thermodynamics.
The acquisition of information (\textbf{information gain}) corresponds to a \textbf{reduction in entropy}.
\\\\
The \textbf{entropy} of a dataset $S$ with $n$ different classes may be calculated as:
\[
\text{Ent}(S) = \sum^n_{i=1} -p_i \log_2 p_i
\]
where $p_i$ is the proportion of the class $i$ in the dataset.
This is an example of a probability mass function.
Entropy is typically measured in \textbf{bits} (note the $\log_2$ in the equation above):
the lowest possible entropy output from this function is 0 ($\log_2 1 = 0$), while the highest possible entropy is $\log_2 n$ (which is equal to 1 when there are only two classes).
\\\\
We use the binary logarithm because a useful measure of uncertainty should assign high uncertainty to outcomes with a low probability and low uncertainty to outcomes with a high probability.
$\log_2 p$ returns large negative values when $p$ is close to 0 and small negative values when $p$ is close to 1.
We use $-\log_2 p$ for convenience, as it returns positive entropy values with 0 as the lowest entropy.

\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Worked Entropy Example}]
\begin{figure}[H]
\centering
\includegraphics[width=0.6\textwidth]{images/anyonefortennis.png}
\caption{Example Data}
\end{figure}

Workings:
\begin{align*}
\text{Ent}(S) =& \text{Ent}([9+, 5-]) \\
=& -\frac{9}{14} \log_2 \left( \frac{9}{14} \right) - \frac{5}{14} \log_2 \left( \frac{5}{14} \right) \\
=& 0.9403
\end{align*}

Note that if you are calculating entropy using a spreadsheet application such as Excel, make sure that you are using $\log_2$, e.g. \verb|LOG(9/14,2)|.
\end{tcolorbox}
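
The same calculation is easy to reproduce in Python; the short sketch below is written directly from the entropy formula given earlier (it is not part of scikit-learn):
\begin{minted}{python}
# Entropy of a set of class labels: Ent(S) = sum_i -p_i * log2(p_i).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

# The worked example above: 9 positive and 5 negative instances.
s = ["+"] * 9 + ["-"] * 5
print(entropy(s))  # ~0.9403
\end{minted}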

\subsection{Computing Information Gain}
The \textbf{information gain} of an attribute is the reduction in entropy from partitioning the data according to that attribute:
\[
\text{Gain}(S,A) = \text{Ent}(S) - \sum_{v \in \text{Values}(A)} \frac{\left| S_v \right|}{\left| S \right|} \text{Ent}(S_v)
\]

Here $S$ is the entire set of data being considered and $S_v$ refers to the partition of the data corresponding to each possible value $v$ of the attribute $A$.
$\left| S \right|$ \& $\left| S_v \right|$ refer to the cardinality or size of the overall dataset and of a partition respectively.
When selecting an attribute for a node in a decision tree, we use whichever attribute $A$ gives the greatest information gain.

\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Worked Information Gain Example}]
Given $\left| S \right| = 14$, $\left| S_{\text{windy} = \text{true}} \right| = 6$, \& $\left| S_{\text{windy} = \text{false}} \right| = 8$, calculate the information gain of the attribute ``windy''.

\begin{align*}
\text{Gain}(S, \text{windy}) =& \text{Ent}(S) - \frac{\left| S_{\text{windy} = \text{true}} \right|}{\left| S \right|} \text{Ent}(S_{\text{windy} = \text{true}})
- \frac{\left| S_{\text{windy} = \text{false}} \right|}{\left| S \right|} \text{Ent}(S_{\text{windy} = \text{false}}) \\
=& \text{Ent}(S) - \left( \frac{6}{14} \right) \text{Ent}(\left[3+,3-\right]) - \left( \frac{8}{14} \right) \text{Ent}(\left[ 6+,2- \right]) \\
=& 0.940 - \left( \frac{6}{14} \right) 1.00 - \left( \frac{8}{14} \right) 0.811\\
=& 0.048
\end{align*}
\end{tcolorbox}
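
The figures above can be checked with a short script; the sketch below repeats the small entropy helper from the previous example so that it is self-contained, and reproduces the gain of $\approx 0.048$:
\begin{minted}{python}
# Information gain of the attribute "windy" for the 14-instance example.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

s           = ["+"] * 9 + ["-"] * 5  # Ent(S) ~ 0.940
windy_true  = ["+"] * 3 + ["-"] * 3  # Ent    ~ 1.000
windy_false = ["+"] * 6 + ["-"] * 2  # Ent    ~ 0.811

gain = (entropy(s)
        - len(windy_true) / len(s) * entropy(windy_true)
        - len(windy_false) / len(s) * entropy(windy_false))
print(gain)  # ~0.048
\end{minted}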

\end{document}