diff --git a/year4/semester1/CT4101: Machine Learning/notes/CT4101-Notes.pdf b/year4/semester1/CT4101: Machine Learning/notes/CT4101-Notes.pdf
index a8652f5a..263bec75 100644
Binary files a/year4/semester1/CT4101: Machine Learning/notes/CT4101-Notes.pdf and b/year4/semester1/CT4101: Machine Learning/notes/CT4101-Notes.pdf differ
diff --git a/year4/semester1/CT4101: Machine Learning/notes/CT4101-Notes.tex b/year4/semester1/CT4101: Machine Learning/notes/CT4101-Notes.tex
index 59840f0f..e952116b 100644
--- a/year4/semester1/CT4101: Machine Learning/notes/CT4101-Notes.tex
+++ b/year4/semester1/CT4101: Machine Learning/notes/CT4101-Notes.tex
@@ -1575,10 +1575,49 @@ To make such judgements without deep domain knowledge, a normalise \textbf{domai
 \\\\
 The \textbf{$R^2$ coefficient} is a domain-independent measure that compares the performance of a model on a test set with the performance of an imaginary model that always predicts the average values from the test set.
 $R^2$ values may be interpreted as the amount of variation in the target feature that is explained by the descriptive features in the model.
+\begin{align*}
+    \text{sum of squared errors} =& \frac{1}{2} \sum^n_{i=1} \left( t_i - \mathbb{M} \left( d_i \right) \right)^2 \\
+    \text{total sum of squares} =& \frac{1}{2} \sum^n_{i=1} \left( t_i - \overline{t} \right)^2 \\
+    R^2 =& 1 - \frac{\text{sum of squared errors}}{\text{total sum of squares}}
+\end{align*}
+where $\overline{t}$ is the average value of the target variable.
+\\\\
+$R^2$ values are usually in the range $[0,1]$, with larger values indicating better performance.
+However, $R^2$ values can be $< 0$ in certain rare cases (although 1 is always the maximum $R^2$ value).
+Negative $R^2$ values indicate very poor model performance, i.e. that the model performs worse than the horizontal straight-line hypothesis that always predicts the average value of the target feature.
+For example, a negative $R^2$ value on the test set with a positive $R^2$ value on the training set likely indicates that the model is overfit to the training data.
+
+\subsection{Applying $k$-NN to Regression Tasks}
+Previously, we have seen that the $k$-nearest neighbours algorithm bases its prediction on several ($k$) nearest neighbours by computing the distance from the query case to all stored cases and picking the $k$ nearest ones.
+When $k$-NN is used for classification tasks, the neighbours vote on the classification of the query case.
+In \textbf{regression} tasks, the average target value of the neighbours is taken as the label for the query case.
+
+\subsubsection{Uniform Weighting}
+Assuming that each neighbour is given an equal weighting:
+\begin{align*}
+    \text{prediction}(q) = \frac{1}{k} \sum^k_{i=1} t_i
+\end{align*}
+where $q$ is a vector containing the attribute values for the query instance, $k$ is the number of neighbours, and $t_i$ is the target value of neighbour $i$.
+
+\subsubsection{Distance Weighting}
+Assuming that each neighbour is given a weight based on the inverse square of its distance from the query instance:
+\begin{align*}
+    \text{prediction}(q) = \frac{ \sum^k_{i=1} \left( \frac{1}{ \text{dist}(q, d_i)^2 } \times t_i \right) }{ \sum^k_{i=1} \left( \frac{1}{ \text{dist}(q, d_i)^2 } \right) }
+\end{align*}
+where $q$ is a vector containing the attribute values for the query instance and $\text{dist}(q, d_i)$ returns the distance between the query and neighbour $i$.
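+\\\\
+The following Python sketch illustrates the two prediction rules above, together with an $R^2$ score computed as $1 - \text{SSE}/\text{TSS}$.
+It is a minimal illustration using NumPy; the function names and toy data are assumptions for demonstration, not code from the module.
+\begin{verbatim}
+import numpy as np
+
+def knn_regression_predict(X_train, t_train, query, k=3, weighted=False):
+    """Predict a continuous target for `query` from its k nearest neighbours."""
+    # Euclidean distance from the query case to every stored case.
+    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
+    nearest = np.argsort(dists)[:k]   # indices of the k nearest neighbours
+    if not weighted:
+        # Uniform weighting: simple mean of the neighbours' target values.
+        return t_train[nearest].mean()
+    # Distance weighting: weight each neighbour by 1 / dist^2
+    # (a small epsilon guards against division by zero for exact matches).
+    w = 1.0 / (dists[nearest] ** 2 + 1e-12)
+    return (w * t_train[nearest]).sum() / w.sum()
+
+def r_squared(t_true, t_pred):
+    """R^2 = 1 - SSE / TSS (the 1/2 factors in the definitions cancel)."""
+    sse = ((t_true - t_pred) ** 2).sum()
+    tss = ((t_true - t_true.mean()) ** 2).sum()
+    return 1.0 - sse / tss
+
+# Toy usage with illustrative data:
+X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
+t_train = np.array([1.1, 1.9, 3.2, 3.9])
+print(knn_regression_predict(X_train, t_train, np.array([2.5]), k=2, weighted=True))
+\end{verbatim}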
+
+\subsection{Applying Decision Trees to Regression}
+\textbf{Regression trees} are constructed similarly to classification trees;
+the main change is that the function used to measure the quality of a split is replaced with a measure relevant to regression, e.g. variance, MSE, or MAE.
+This adaptation is easily made to the ID3/C4.5 algorithm.
+\\\\
+The aim in regression trees is to group similar target values together at a leaf node.
+Typically, a regression tree returns the mean target value at a leaf node as its prediction.
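+\\\\
+As a brief illustrative sketch (the function name and the use of plain NumPy are assumptions, not module code), the following shows how a variance-based split on a single numeric feature might be chosen; a leaf node then simply predicts the mean target value of the training cases that reach it.
+\begin{verbatim}
+import numpy as np
+
+def best_split(x, t):
+    """Choose the threshold on one numeric feature that minimises the
+    within-child sum of squared deviations (i.e. weighted variance)."""
+    order = np.argsort(x)
+    x, t = x[order], t[order]
+    best_thresh, best_score = None, np.inf
+    # Candidate thresholds: midpoints between consecutive distinct values.
+    for i in range(1, len(x)):
+        if x[i] == x[i - 1]:
+            continue
+        thresh = (x[i] + x[i - 1]) / 2.0
+        left, right = t[:i], t[i:]
+        # n * var(t) is the sum of squared deviations from the child mean.
+        score = len(left) * left.var() + len(right) * right.var()
+        if score < best_score:
+            best_thresh, best_score = thresh, score
+    return best_thresh
+
+# A leaf node's prediction is just the mean of the targets at that leaf:
+# prediction = t_leaf.mean()
+\end{verbatim}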