[CT4100]: Add Week 4 lecture notes

This commit is contained in:
2024-10-03 17:13:16 +01:00
parent 9bdce0aca5
commit cf84ca0366
3 changed files with 154 additions and 1 deletions


@ -473,7 +473,7 @@ plotted against recall.
In an ideal system, we would have a precision value of 1 for a recall value of 1, i.e., all relevant documents
have been returned and no irrelevant documents have been returned.
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Example}]
Given $|D| = 20$ \& $|R| = 10$ and a ranked list of length 10, let the returned ranked list be:
$$
\mathbf{d_1}, \mathbf{d_2}, d_3, \mathbf{d_4}, d_5, d_6, \mathbf{d_7}, d_8, d_9, d_{10}
@ -542,9 +542,162 @@ experience.
Another closely related area is that of information visualisation: how best to represent the retrieved data for a
user etc.
\section{Weighting Schemes}
\subsection{Re-cap}
The \textbf{vector space model} attempts to improve upon the Boolean model by removing the limitation of binary weights for index terms.
Terms can have a non-binary value both in queries \& documents.
Hence, we can represent documents \& queries as $n$-dimensional vectors:
$$
\vec{d_j} = \left( w_{1,j} , w_{2,j} , \dots , w_{n,j} \right)
$$
$$
\vec{q} = \left( w_{1,q} , w_{2,q} , \dots , w_{n,q} \right)
$$
We can calculate the similarity between a document and a query by calculating the similarity between the vector representations.
We can measure this similarity by measuring the cosine of the angle between the two vectors.
We can derive a formula for this by starting with the formula for the inner product (dot product) of two vectors:
\begin{align}
a \cdot b = |a| |b| \cos(a,b) \\
\Rightarrow
\cos(a,b) = \frac{a \cdot b}{|a| |b|}
\end{align}
We can therefore calculate the similarity between a document and a query as:
\begin{align*}
\text{sim}(\vec{d_j}, \vec{q}) = &\frac{d_j \cdot q}{|d_j| |q|} \\
\Rightarrow
\text{sim}(\vec{d_j}, \vec{q}) = &\frac{\sum^n_{i=1} w_{i,j} \times w_{i,q}}{\sqrt{\sum^n_{i=1} w_{i,j}^2} \times \sqrt{\sum^n_{i=1} w_{i,q}^2}}
\end{align*}
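To make the computation concrete, the following is a minimal Python sketch of the cosine similarity between a document vector and a query vector; the four-term vectors and their weights are invented purely for illustration.
\begin{verbatim}
import math

def cosine_similarity(d, q):
    # Inner product of the two weight vectors.
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    # Euclidean norms |d| and |q|.
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

d_j = [0.5, 0.0, 1.2, 0.3]  # illustrative term weights for document d_j
q   = [1.0, 0.0, 0.8, 0.0]  # illustrative term weights for query q
print(cosine_similarity(d_j, q))  # ~0.85
\end{verbatim}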
We need a means to calculate the term weights in the document \& query vector representations.
A term's frequency within a document quantifies how well a term describes a document.
The more frequently a term occurs in a document, the better it describes that document, and vice versa.
This frequency is known as the \textbf{term frequency} or \textbf{tf factor}.
\\\\
However, if a term occurs frequently across all the documents, then that term does little to distinguish one document from another.
This factor is known as the \textbf{inverse document frequency} or \textbf{idf factor}.
The most commonly used weighting schemes are known as \textbf{tf-idf} weighting schemes.
For all terms in a document, the weight assigned can be calculated by:
\begin{align*}
w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}
\end{align*}
where $f_{i,j}$ is the normalised frequency of term $t_i$ in document $d_j$, $N$ is the number of documents in the collection, and $n_i$ is the number of documents that contain the term $t_i$.
\\\\
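As a quick worked example of this weighting (the numbers are purely illustrative):
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Example}]
Suppose $N = 10\,000$, term $t_i$ occurs in $n_i = 100$ documents, and its normalised frequency in document $d_j$ is $f_{i,j} = 0.5$.
Taking logarithms to base 10,
$$
w_{i,j} = 0.5 \times \log \frac{10\,000}{100} = 0.5 \times 2 = 1
$$
A rarer term occurring in only $n_i = 10$ documents with the same frequency would receive the higher weight $0.5 \times 3 = 1.5$.
\end{tcolorbox}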
A similar weighting scheme can be used for queries.
The main difference is that the tf \& idf factors are given less credence: all terms have an initial value of 0.5, which is increased or decreased according to the term's tf-idf across the document collection (Salton 1983).
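One commonly cited concrete form of this query weighting, due to Salton \& Buckley, is
\begin{align*}
w_{i,q} = \left( 0.5 + \frac{0.5 \, f_{i,q}}{\max_l f_{l,q}} \right) \times \log \frac{N}{n_i}
\end{align*}
where $f_{i,q}$ is the raw frequency of term $t_i$ in the query $q$.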
\subsection{Text Properties}
When considering the properties of a text document, it is important to note that not all words are equally important for capturing the meaning of a document and that text documents are comprised of symbols from a finite alphabet.
\\\\
Factors that affect the performance of information retrieval include:
\begin{itemize}
\item What is the distribution of the frequency of different words?
\item How fast does vocabulary size grow with the size of a document collection?
\end{itemize}
These factors can be used to select appropriate term weights and other aspects of an IR system.
\subsubsection{Word Frequencies}
A few words are very common, e.g. the two most frequent words ``the'' \& ``of'' can together account for about 10\% of word occurrences.
Most words are very rare: around half the words in a corpus appear only once, which is known as a ``heavy tailed'' or Zipfian distribution.
\\\\
\textbf{Zipf's law} gives an approximate model for the distribution of different words in a document.
It states that when a list of measured values is sorted in decreasing order, the value of the $n^{\text{th}}$ entry is approximately inversely proportional to $n$.
For a word with rank $r$ (the numerical position of the word in a list sorted by decreasing frequency) and frequency $f$, Zipf's law states that $f \times r$ will be approximately constant.
It represents a power law, i.e. a straight line on a log-log plot.
\begin{align*}
\text{word frequency} \propto \frac{1}{\text{word rank}}
\end{align*}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{./images/zipfs_law_brown_corpus.png}
\caption{Zipf's Law Modelled on the Brown Corpus}
\end{figure}
As can be seen above, Zipf's law is an accurate model except at the extremes.
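A quick worked example with purely illustrative figures:
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Example}]
If the most frequent word in a corpus occurs $70\,000$ times, Zipf's law predicts that the word of rank 2 occurs roughly $\frac{70\,000}{2} = 35\,000$ times and the word of rank 10 roughly $\frac{70\,000}{10} = 7\,000$ times, since $f \times r$ remains approximately constant.
\end{tcolorbox}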
\subsection{Vocabulary Growth}
The manner in which the size of the vocabulary increases with the size of the document collection has an impact on our choice of indexing strategy \& algorithms.
However, it is important to note that the size of a vocabulary is not really bounded in the real world due to the existence of misspellings, proper names, document identifiers, etc.
\\\\
If $V$ is the size of the vocabulary and $n$ is the length of the document collection in word occurrences, then
\begin{align*}
V = K \cdot n^\beta, \quad 0 < \beta < 1
\end{align*}
where $K$ is a constant scaling factor that determines the initial vocabulary size of a small collection, usually in the range 10 to 100, and $\beta$ is a constant controlling the rate at which the vocabulary size increases, usually in the range 0.4 to 0.6.
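A rough worked example with illustrative values of the constants:
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Example}]
Taking $K = 50$ and $\beta = 0.5$, a collection of $n = 1\,000\,000$ word occurrences would be expected to have a vocabulary of roughly
$$
V = 50 \times 1\,000\,000^{0.5} = 50 \times 1\,000 = 50\,000 \text{ distinct words.}
$$
Quadrupling the collection to $4\,000\,000$ occurrences only doubles the expected vocabulary to about $100\,000$ words.
\end{tcolorbox}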
\subsection{Weighting Schemes}
The quality of performance of an IR system depends on the quality of the weighting scheme; we want to assign high weights to those terms with a high resolving power.
tf-idf is one such approach wherein weight is increased for frequently occurring terms but decreased again for those that are frequent across the collection.
The ``bag of words'' model is usually adopted, i.e., that a document can be treated as an unordered collection of words.
The term independence assumption is also usually adopted, i.e., that the occurrence of each word in a document is independent of the occurrence of other words.
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{``Bag of Words'' / Term Independence Example}]
If Document 1 contains the text ``Mary is quicker than John'' and Document 2 contains the text ``John is quicker than Mary'', then Document 1 \& Document 2 are viewed as equivalent.
\end{tcolorbox}
However, it is unlikely that 30 occurrences of a term in a document truly carry thirty times the significance of a single occurrence of that term.
A common modification is to use the logarithm of the term frequency:
\begin{align*}
\text{If } \textit{tf}_{i,d} > 0:& \quad w_{i,d} = 1 + \log(\textit{tf}_{i,d})\\
\text{Otherwise:}& \quad w_{i,d} = 0
\end{align*}
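For example (taking logarithms to base 10 purely for illustration), a term occurring once receives a weight of $1$, a term occurring 10 times receives $1 + \log 10 = 2$, and a term occurring 30 times receives only $1 + \log 30 \approx 2.48$, rather than thirty times the weight of a single occurrence.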
\subsubsection{Maximum Term Normalisation}
We often want to normalise term frequencies because we observe higher frequencies in longer documents merely because longer documents tend to repeat the same words more frequently.
Consider a document $d^\prime$ created by concatenating a document $d$ to itself:
$d^\prime$ is no more relevant to any query than document $d$, yet according to the vector space type similarity $\text{sim}(d^\prime, q) \geq \text{sim}(d,q) \, \forall \, q$.
\\\\
The formula for the \textbf{maximum term normalisation} of a term $i$ in a document $d$ is usually of the form
\begin{align*}
\textit{ntf} = a + \left( 1 - a \right) \frac{\textit{tf}_{i,d}}{\textit{tf}_{\max}(d)}
\end{align*}
where $a$ is a smoothing factor which can be used to dampen the impact of the second term.
\\\\
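A small worked example with an illustrative smoothing factor:
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Example}]
Take $a = 0.5$. If term $t_i$ occurs $\textit{tf}_{i,d} = 3$ times in document $d$ and the most frequent term in $d$ occurs $\textit{tf}_{\max}(d) = 6$ times, then
$$
\textit{ntf} = 0.5 + 0.5 \times \frac{3}{6} = 0.75
$$
so every term appearing in $d$ receives a normalised frequency between $0.5$ and $1$, regardless of the document's length.
\end{tcolorbox}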
Problems with maximum term normalisation include:
\begin{itemize}
\item Stopword removal may have effects on the distribution of terms: this normalisation is unstable and may require tuning per collection.
\item There is a possibility of outliers with unusually high frequency.
\item Those documents with a more even distribution of term frequencies should be treated differently to those with a skewed distribution.
\end{itemize}
More sophisticated forms of normalisation also exist, which we will explore in the future.
\subsubsection{Modern Weighting Schemes}
Many, if not all, of the developed or learned weighting schemes can be represented in the following format:
\begin{align*}
\text{sim}(q,d) = \sum_{t \in q \cap d} \left( \textit{ntf}(D) \times \textit{gw}_t(C) \times \textit{qw}_t(Q) \right)
\end{align*}
where
\begin{itemize}
\item $\textit{ntf}(D)$ is the normalised term frequency in a document.
\item $\textit{gw}_t(C)$ is the global weight of a term across a collection.
\item $\textit{qw}_t(Q)$ is the query weight of a term in a query $Q$.
\end{itemize}
The \textbf{Okapi BM25} weighting scheme is a standard benchmark weighting scheme with relatively good performance, although it needs to be tuned per collection:
\begin{align*}
\text{BM25}(Q,D) = \sum_{t \in Q \cap D} \left( \frac{\textit{tf}_{t,D} \cdot \log \left( \frac{N - \textit{df}_t + 0.5}{\textit{df}_t + 0.5} \right) \cdot \textit{tf}_{t, Q}}{\textit{tf}_{t,D} + k_1 \cdot \left( (1-b) + b \cdot \frac{\textit{dl}}{\textit{dl}_\text{avg}} \right)} \right)
\end{align*}
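The following is a minimal Python sketch of this scoring function, following the formula as written above; the parameter values $k_1 = 1.2$ and $b = 0.75$ are common defaults assumed here for illustration, and the natural logarithm is used since the base is not specified.
\begin{verbatim}
import math

def bm25_score(query_tf, doc_tf, df, N, dl, dl_avg, k1=1.2, b=0.75):
    # query_tf: term -> frequency in the query
    # doc_tf:   term -> frequency in the document
    # df:       term -> number of documents containing the term
    # N: collection size, dl: document length, dl_avg: average doc length
    score = 0.0
    for term, tf_q in query_tf.items():
        tf_d = doc_tf.get(term, 0)
        if tf_d == 0:
            continue  # only terms in both query and document contribute
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5))
        norm = tf_d + k1 * ((1 - b) + b * dl / dl_avg)
        score += (tf_d * idf * tf_q) / norm
    return score
\end{verbatim}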
The \textbf{Pivoted Normalisation} weighting scheme is also a standard benchmark which needs to be tuned per collection, although it has issues with normalisation:
\begin{align*}
\text{piv}(Q,D) = \sum_{t \in Q \cap D} \left( \frac{1 + \log \left( 1 + \log \left( \textit{tf}_{t, D} \right) \right)}{(1 - s) + s \cdot \frac{\textit{dl}}{\textit{dl}_\text{avg}}} \right) \times \log \left( \frac{N+1}{\textit{df}_t} \right) \times \textit{tf}_{t, Q}
\end{align*}
The \textbf{Axiomatic Approach} to weighting consists of the following constraints:
\begin{itemize}
\item \textbf{Constraint 1:} adding a query term to a document must always increase the score of that document.
\item \textbf{Constraint 2:} adding a non-query term to a document must always decrease the score of that document.
\item \textbf{Constraint 3:} adding successive occurrences of a term to a document must increase the score of that document less with each successive occurrence.
Essentially, any term-frequency factor should be sub-linear.
\item \textbf{Constraint 4:} using the vector length should be a better normalisation factor for retrieval.
However, using the raw vector length would violate one of the existing constraints.
Therefore, the document length factor should be used in a sub-linear function, which ensures that repeated appearances of non-query terms are weighted less.
\end{itemize}
New weighting schemes that adhere to all these constraints outperform the best known benchmarks.
\end{document}

Binary file not shown.
