[CT4100]: Week 9 lecture notes

2024-11-07 12:57:02 +00:00
parent 9397827eb0
commit 5c78e2342a
2 changed files with 180 additions and 0 deletions


@ -32,6 +32,8 @@
\newcommand{\secref}[1]{\textbf{§\ref{#1}~\nameref{#1}}}
\usepackage{changepage} % adjust margins on the fly
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{minted}
\usemintedstyle{algol_nu}
@ -1335,6 +1337,184 @@ Sequential processing has been used in query understanding, retrieval, expansion
\\\\
In summary, neural approaches are powerful, typically more computationally expensive than traditional approaches, have good performance, but have issues with explainability.
\section{Clustering}
\subsection{Introduction}
\textbf{Document clustering} is the process of grouping a set of documents into clusters of similar documents.
Documents within a cluster should be similar, while documents from different clusters should be dissimilar.
Clustering is the most common form of \textbf{unsupervised learning}, i.e., there is no labelled or annotated data.
\subsubsection{Classification vs Clustering}
Classification is a supervised learning algorithm in which classes are human-defined and part of the input to the algorithm.
Clustering is an unsupervised learning algorithm in which clusters are inferred from the data without human input.
However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, etc.
\subsection{Clustering in IR}
The \textbf{cluster hypothesis} states that documents in the same cluster behave similarly with respect to relevance to information needs.
All applications of clustering in IR are based (directly or indirectly) on the cluster hypothesis.
Van Rijsbergen's original wording of the cluster hypothesis was ``closely related documents tend to be relevant to the same requests''.
\\\\
Applications of clustering include:
\begin{itemize}
\item Search result clustering: search results are clustered to provide more effective information presentation to the user.
\item Scatter-gather clustering: (subsets of) the collection are clustered to provide an alternative user interface wherein the user can search without typing.
\item Collection clustering: the collection is clustered for effective information presentation for exploratory browsing.
\item Cluster-based retrieval: the collection is clustered to provide higher efficiency \& faster search.
\end{itemize}
\subsubsection{Clustering for Improving Recall}
To improve search recall:
\begin{enumerate}
\item Cluster documents in collection \textit{a priori}.
\item When a query matches a document $d$, also return other documents in the cluster that contains $d$.
\end{enumerate}
The hope is that, if we do this, a query for ``car'' will also return documents containing ``automobile'', as the clustering algorithm groups documents containing ``car'' together with those containing ``automobile''.
Both types of documents will contain words like ``parts'', ``dealer'', ``Mercedes'', ``road trip''.
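A minimal sketch of this idea, assuming the collection has already been clustered and we hold two precomputed mappings; the names \texttt{doc\_to\_cluster} and \texttt{cluster\_to\_docs} are illustrative, not part of the lecture material:
\begin{minted}{python}
def expand_with_cluster(matched_docs, doc_to_cluster, cluster_to_docs):
    """Return the documents matched by the query plus all of their cluster mates."""
    expanded = set(matched_docs)
    for d in matched_docs:
        expanded |= set(cluster_to_docs[doc_to_cluster[d]])
    return expanded

# e.g. a query for "car" matches d1; d1's cluster also contains d7 ("automobile"),
# so d7 is returned even though it never mentions "car".
\end{minted}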
\subsubsection{Desiderata for Clustering}
\begin{itemize}
\item The general goal is to put related documents in the same cluster and to put unrelated documents in different clusters.
\item The number of clusters should be appropriate for the data set being clustered.
\item Secondary goals in clustering include:
\begin{itemize}
\item Avoid very small \& very large clusters.
\item Define clusters that are easy to explain to the user.
\end{itemize}
\end{itemize}
\subsubsection{Flat vs Hierarchical Clustering}
\textbf{Flat algorithms} usually start with a random (partial) partitioning of documents into groups, which is then refined iteratively.
The main example of this is $k$-means.
Flat algorithms compute a partition of $n$ documents into a set of $k$ clusters.
Given a set of documents and the number $k$, the goal is to find a partition into $k$ clusters that optimises the chosen partitioning criterion.
Global optimisation could be achieved by exhaustively enumerating all partitions and picking the optimal one; however, this is not tractable, as the number of possible partitions grows exponentially with the number of documents.
The $k$-means algorithm is an effective heuristic method.
\\\\
\textbf{Hierarchical algorithms} create a hierarchy of clusters, built either bottom-up (agglomerative) or top-down (divisive).
\subsubsection{Hard vs Soft Clustering}
In \textbf{hard clustering}, each document belongs to exactly one cluster.
This is more common, and easier to do.
\\\\
\textbf{Soft clustering:} a document can belong to more than one cluster;
this makes sense for applications like creating browsable hierarchies.
For example, you may want to put the word ``sneakers'' in two clusters: sports apparel \& shoes.
\subsection{$k$-Means}
\textbf{$k$-means} is perhaps the best-known clustering algorithm, as it is simple and works well in many cases.
It is used as a default / baseline for clustering documents.
Document representation in clustering is typically done using the vector space model, with the relatedness between vectors measured by Euclidean distance.
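Here, the Euclidean distance between two document vectors $\vec{x}$ and $\vec{y}$ is $\| \vec{x} - \vec{y} \| = \sqrt{\sum_i (x_i - y_i)^2}$.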
\\\\
Each cluster in $k$-means is defined by a \textbf{centroid}, which is defined as:
\[
\overrightarrow{\mu}(\omega) = \frac{1}{| \omega | }\sum_{\overrightarrow{x} \in \omega} \overrightarrow{x}
\]
where $\omega$ defines a cluster.
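For example, the centroid of a cluster containing the vectors $(1,2)$, $(3,4)$, and $(5,0)$ is $\frac{1}{3}\left( (1,2) + (3,4) + (5,0) \right) = (3,2)$.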
\\\\
The partitioning criterion is to minimise the average squared difference from the centroid.
We try to find the minimum average squared difference by iterating two steps:
\begin{itemize}
\item \textbf{Reassignment:} assign each vector to its closest centroid.
\item \textbf{Recomputation:} recompute each centroid as the average of the vectors that were assigned to it in reassignment.
\end{itemize}
\begin{algorithm}[H]
\caption{$k$-means$(\{ \vec{x}_1, \ldots, \vec{x}_N \}, k)$}
\begin{algorithmic}[1]
\State $(\vec{s}_1, \vec{s}_2, \ldots, \vec{s}_k) \gets \text{SelectRandomSeeds}(\{ \vec{x}_1, \ldots, \vec{x}_N \}, k)$
\For{$i \gets 1$ to $k$}
\State $\vec{\mu}_i \gets \vec{s}_i$
\EndFor
\While{stopping criterion has not been met}
\For{$i \gets 1$ to $k$}
\State $\omega_i \gets \{\}$
\EndFor
\For{$n \gets 1$ to $N$}
\State $j \gets \arg \min_{j'} \| \vec{\mu}_{j'} - \vec{x}_n \|$
\State $\omega_j \gets \omega_j \cup \{ \vec{x}_n \}$ \Comment{reassignment of vectors}
\EndFor
\For{$i \gets 1$ to $k$}
\State $\vec{\mu}_i \gets \frac{1}{|\omega_i|} \sum_{\vec{x} \in \omega_i} \vec{x}$ \Comment{recomputation of centroids}
\EndFor
\EndWhile
\State \Return $\{\vec{\mu}_1, \ldots, \vec{\mu}_k\}$
\end{algorithmic}
\end{algorithm}
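The following is a minimal Python/NumPy sketch of this algorithm; the function name \texttt{kmeans}, the iteration cap, and the convergence tolerance are illustrative choices rather than part of the lecture material:
\begin{minted}{python}
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Cluster the rows of X (one document vector per row) into k clusters."""
    rng = np.random.default_rng(seed)
    # SelectRandomSeeds: pick k distinct document vectors as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assign = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Reassignment: assign each vector to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recomputation: each centroid becomes the mean of the vectors assigned to it
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        # Stopping criterion: centroids (almost) unchanged.
        if np.allclose(new_centroids, centroids, atol=tol):
            break
        centroids = new_centroids
    return centroids, assign
\end{minted}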
\subsubsection{Proof that $k$-Means is Guaranteed to Converge}
\begin{enumerate}
\item RSS is the sum, over all document vectors, of the squared distance between the vector and its closest centroid.
\item RSS decreases during each reassignment step, because each vector is moved to a closer centroid.
\item RSS decreases during each recomputation step, because the new centroid is the point that minimises the sum of squared distances to the vectors in its cluster (see the derivation after this list).
\item There is only a finite number of clusterings, thus we must reach a fixed point.
\end{enumerate}
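For the recomputation step, note that for a fixed cluster $\omega$, the point $\vec{v}$ minimising $\sum_{\vec{x} \in \omega} \| \vec{v} - \vec{x} \|^2$ is found by setting the gradient with respect to $\vec{v}$ to zero:
\[
\sum_{\vec{x} \in \omega} 2 (\vec{v} - \vec{x}) = \vec{0}
\quad \Rightarrow \quad
\vec{v} = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x} = \overrightarrow{\mu}(\omega)
\]
so replacing the old centroid with the cluster mean cannot increase RSS.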
However, we don't know how long convergence will take.
If we don't care about a few documents switching back and forth, then convergence is usually fast (around 10-20 iterations).
However, complete convergence can take many more iterations.
\\\\
The great weakness of $k$-means is that \textbf{convergence does not mean that we converge to the optimal clustering}.
If we start with a bad set of seeds, the resulting clustering can be poor.
\subsubsection{Initialisation of $k$-means}
Random seed selection is just one of many ways $k$-means can be initialised.
It is not very robust: it is easy to end up with a sub-optimal clustering.
Better ways of computing initial centroids include:
\begin{itemize}
\item Select seeds not randomly, but using some heuristic, e.g., filter out outliers or find a set of seeds that has ``good coverage'' of the document space.
\item Use hierarchical clustering to find good seeds.
\item Select $i$ (e.g., $i = 10$) different random sets of seeds, do a $k$-means clustering for each, and select the clustering with the lowest RSS (a sketch is given after this list).
\end{itemize}
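A minimal sketch of the last strategy, assuming the \texttt{kmeans} function from the earlier sketch; the helpers \texttt{rss} and \texttt{best\_of\_i\_runs} are illustrative names:
\begin{minted}{python}
import numpy as np

def rss(X, centroids, assign):
    """Residual sum of squares: squared distances from each vector to its assigned centroid."""
    return float(((X - centroids[assign]) ** 2).sum())

def best_of_i_runs(X, k, i=10):
    """Run k-means with i different random seed sets and keep the clustering with the lowest RSS."""
    best = None
    for seed in range(i):
        centroids, assign = kmeans(X, k, seed=seed)
        score = rss(X, centroids, assign)
        if best is None or score < best[0]:
            best = (score, centroids, assign)
    return best[1], best[2]
\end{minted}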
\subsection{Evaluation}
Internal criteria for a clustering (e.g., RSS in $k$-means) often do not evaluate the actual utility of a clustering in the application.
An alternative to internal criteria is external criteria, i.e., to evaluate the clustering with respect to a human-defined classification.
\subsubsection{External Criteria for Clustering Quality}
External criteria for clustering quality are based on the ideal ``gold standard'' dataset, where the goal is that clustering should reproduce the classes in the gold standard.
\\\\
\textbf{Purity} measures how well we were able to reproduce the classes:
\[
\text{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} | \omega_k \cap c_j |
\]
where $\Omega = \{ \omega_1, \omega_2, \dots, \omega_K \}$ is the set of clusters and $C = \{ c_1, c_2, \dots, c_J \}$ is the set of classes.
For each cluster $\omega_k$, find the class $c_j$ with the most members $n_{kj}$
in $\omega_k$.
Sum all $n_{kj}$ and divide by the total number of points.
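A short Python sketch of this computation, where \texttt{clusters[i]} and \texttt{classes[i]} are the cluster and gold-standard class of document $i$ (the function name is illustrative):
\begin{minted}{python}
from collections import Counter

def purity(clusters, classes):
    """Fraction of documents that fall in the majority class of their cluster."""
    n = len(clusters)
    total = 0
    for k in set(clusters):
        # Class labels of the documents in cluster k; take the size of the majority class.
        members = [classes[i] for i in range(n) if clusters[i] == k]
        total += Counter(members).most_common(1)[0][1]
    return total / n

# e.g. purity([0, 0, 1, 1, 1], ["A", "A", "A", "B", "B"]) == 4 / 5
\end{minted}
The helper simply finds, for each cluster, the size of its largest intersection with any class and sums these counts, mirroring the formula above.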
\\\\
The \textbf{Rand Index} is defined as:
\[
\text{RI} = \frac{ \text{TP} + \text{TN} }{ \text{TP} + \text{FP} + \text{FN} + \text{TN} }
\]
It is based on a $2 \times 2$ contingency table of all pairs of documents:
$\text{TP} + \text{FP} + \text{FN} + \text{TN}$ is the total number of pairs.
There are $\binom{N}{2}$ pairs for $N$ documents.
Each pair is either positive or negative: the clustering either puts the two documents in the same cluster or in different clusters.
Each pair is also either ``true'' (correct) or ``false'' (incorrect), i.e., the clustering decision is either correct or incorrect.
\begin{table}[H]
\centering
\begin{tabular}{|c|c|c|}
\hline
& \textbf{same cluster} & \textbf{different clusters} \\ \hline
\textbf{same class} & TP & FN \\ \hline
\textbf{different classes} & FP & TN \\ \hline
\end{tabular}
\caption{$2 \times 2$ contingency table of all pairs of documents}
\end{table}
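A corresponding sketch for the Rand Index that counts TP, FP, FN, and TN over all $\binom{N}{2}$ pairs with a naive $O(N^2)$ loop (for illustration only; the function name is not from the lecture):
\begin{minted}{python}
from itertools import combinations

def rand_index(clusters, classes):
    """Rand Index: fraction of document pairs on which the clustering agrees with the classes."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1  # correctly placed in the same cluster
        elif same_cluster:
            fp += 1  # wrongly placed in the same cluster
        elif same_class:
            fn += 1  # wrongly placed in different clusters
        else:
            tn += 1  # correctly placed in different clusters
    return (tp + tn) / (tp + tn + fp + fn)

# e.g. rand_index([0, 0, 1, 1], ["A", "A", "A", "B"]) == 0.5  (TP=1, FP=1, FN=2, TN=2)
\end{minted}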
\subsubsection{How Many Clusters?}
The number of clusters $k$ is given in many applications.
\end{document}