diff --git a/year4/semester1/CT4100: Information Retrieval/notes/CT4100-Notes.pdf b/year4/semester1/CT4100: Information Retrieval/notes/CT4100-Notes.pdf
index 47bab3f1..524c8abc 100644
Binary files a/year4/semester1/CT4100: Information Retrieval/notes/CT4100-Notes.pdf and b/year4/semester1/CT4100: Information Retrieval/notes/CT4100-Notes.pdf differ
diff --git a/year4/semester1/CT4100: Information Retrieval/notes/CT4100-Notes.tex b/year4/semester1/CT4100: Information Retrieval/notes/CT4100-Notes.tex
index 0ece6d48..bd3ca461 100644
--- a/year4/semester1/CT4100: Information Retrieval/notes/CT4100-Notes.tex
+++ b/year4/semester1/CT4100: Information Retrieval/notes/CT4100-Notes.tex
@@ -473,7 +473,7 @@
 plotted against recall.
 In an ideal system, we would have a precision value of 1 for a recall value of 1, i.e., all relevant documents have been returned and no irrelevant documents have been returned.
 
-\begin{tcolorbox}[colback=gray!10, colframe=black, title=Example]
+\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Example}]
 	Given $|D| = 20$ \& $|R| = 10$ and a ranked list of length 10, let the returned ranked list be:
 	$$
 	\mathbf{d_1}, \mathbf{d_2}, d_3, \mathbf{d_4}, d_5, d_6, \mathbf{d_7}, d_8, d_9, d_{10}
 	$$
@@ -542,9 +542,162 @@
 experience.
 Another closely related area is that of information visualisation: ow best to represent the retrieved data for a user etc.
 
+\section{Weighting Schemes}
+\subsection{Re-cap}
+The \textbf{vector space model} attempts to improve upon the Boolean model by removing the limitation of binary weights for index terms.
+Terms can have a non-binary value both in queries \& documents.
+Hence, we can represent documents \& queries as $n$-dimensional vectors:
+$$
+\vec{d_j} = \left( w_{1,j}, w_{2,j}, \dots, w_{n,j} \right)
+$$
+$$
+\vec{q} = \left( w_{1,q}, w_{2,q}, \dots, w_{n,q} \right)
+$$
+We can calculate the similarity between a document and a query by calculating the similarity between their vector representations, which we measure as the cosine of the angle between the two vectors.
+We can derive a formula for this by starting with the formula for the inner product (dot product) of two vectors:
+\begin{align}
+	a \cdot b &= |a| |b| \cos(a,b) \\
+	\Rightarrow \cos(a,b) &= \frac{a \cdot b}{|a| |b|}
+\end{align}
+We can therefore calculate the similarity between a document and a query as:
+\begin{align*}
+	\text{sim}(\vec{d_j}, \vec{q}) = &\frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| |\vec{q}|} \\
+\Rightarrow
+	\text{sim}(\vec{d_j}, \vec{q}) = &\frac{\sum^n_{i=1} w_{i,j} \times w_{i,q}}{\sqrt{\sum^n_{i=1} w_{i,j}^2} \times \sqrt{\sum^n_{i=1} w_{i,q}^2}}
+\end{align*}
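+
+As a quick illustration of the cosine measure, suppose $n = 3$; the weights below are invented purely for this example.
+\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Example}]
+	Let $\vec{d_j} = (1, 2, 0)$ and $\vec{q} = (1, 0, 1)$.
+	Then:
+	$$
+	\text{sim}(\vec{d_j}, \vec{q}) = \frac{(1)(1) + (2)(0) + (0)(1)}{\sqrt{1^2 + 2^2 + 0^2} \times \sqrt{1^2 + 0^2 + 1^2}} = \frac{1}{\sqrt{5} \sqrt{2}} = \frac{1}{\sqrt{10}} \approx 0.32
+	$$
+	The document \& query share only one term, and the cosine measure reflects this with a relatively low similarity score.
+\end{tcolorbox}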
+We need a means of calculating the term weights in the document \& query vector representations.
+A term's frequency within a document quantifies how well that term describes the document:
+the more frequently a term occurs in a document, the better it is at describing that document, and vice versa.
+This frequency is known as the \textbf{term frequency} or \textbf{tf factor}.
+\\\\
+However, if a term occurs frequently across all the documents in the collection, then that term does little to distinguish one document from another.
+This factor is known as the \textbf{inverse document frequency} or \textbf{idf factor}.
+The most commonly used weighting schemes are known as \textbf{tf-idf} weighting schemes.
+For all terms in a document, the weight assigned can be calculated by:
+\begin{align*}
+	w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}
+\end{align*}
+where $f_{i,j}$ is the normalised frequency of the term $t_i$ in the document $d_j$, $N$ is the number of documents in the collection, and $n_i$ is the number of documents that contain the term $t_i$.
+\\\\
+A similar weighting scheme can be used for queries.
+The main difference is that the tf \& idf factors are given less credence, and all terms have an initial value of 0.5 which is increased or decreased according to the tf-idf across the document collection (Salton 1983).
+
+\subsection{Text Properties}
+When considering the properties of a text document, it is important to note that not all words are equally important for capturing the meaning of a document, and that text documents are composed of symbols drawn from a finite alphabet.
+\\\\
+Factors that affect the performance of information retrieval include:
+\begin{itemize}
+	\item What is the distribution of the frequency of different words?
+	\item How fast does the vocabulary size grow with the size of the document collection?
+\end{itemize}
+
+These factors can be used to select appropriate term weights and other aspects of an IR system.
+
+\subsubsection{Word Frequencies}
+A few words are very common: for example, the two most frequent words ``the'' \& ``of'' can together account for about 10\% of word occurrences.
+Most words are very rare: around half the words in a corpus appear only once.
+This is known as a ``heavy-tailed'' or Zipfian distribution.
+\\\\
+\textbf{Zipf's law} gives an approximate model for the distribution of the different words in a document collection.
+It states that when a list of measured values is sorted in decreasing order, the value of the $n^{\text{th}}$ entry is approximately inversely proportional to $n$.
+For a word with rank $r$ (the numerical position of the word in a list sorted by decreasing frequency) and frequency $f$, Zipf's law states that $f \times r$ is approximately constant.
+It is a power law, i.e., it appears as a straight line on a log-log plot.
+\begin{align*}
+	\text{word frequency} \propto \frac{1}{\text{word rank}}
+\end{align*}
+
+\begin{figure}[H]
+	\centering
+	\includegraphics[width=0.8\textwidth]{./images/zipfs_law_brown_corpus.png}
+	\caption{Zipf's Law Modelled on the Brown Corpus}
+\end{figure}
+
+As can be seen above, Zipf's law is an accurate model except at the extremes.
+
+\subsection{Vocabulary Growth}
+The manner in which the size of the vocabulary increases with the size of the document collection has an impact on our choice of indexing strategy \& algorithms.
+However, it is important to note that the size of a vocabulary is not really bounded in the real world due to the existence of misspellings, proper names, document identifiers, etc.
+\\\\
+If $V$ is the size of the vocabulary and $n$ is the length of the document collection in word occurrences, then
+\begin{align*}
+	V = K \cdot n^\beta, \quad 0 < \beta < 1
+\end{align*}
+where $K$ is a constant scaling factor that determines the initial vocabulary size for a small collection (usually in the range 10 to 100), and $\beta$ is a constant controlling the rate at which the vocabulary size increases (usually in the range 0.4 to 0.6).
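+
+To get a feel for this sub-linear growth, the following calculation uses constants picked purely for illustration from the typical ranges given above.
+\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Example}]
+	Let $K = 50$ and $\beta = 0.5$.
+	For a collection of $n = 10^6$ word occurrences, we would expect a vocabulary of roughly
+	$$
+	V = 50 \cdot \left( 10^6 \right)^{0.5} = 50 \times 1\,000 = 50\,000 \text{ terms.}
+	$$
+	Growing the collection a hundredfold to $n = 10^8$ occurrences only grows the expected vocabulary tenfold, to $50 \cdot \left( 10^8 \right)^{0.5} = 500\,000$ terms.
+\end{tcolorbox}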
+
+\subsection{Weighting Schemes}
+The performance of an IR system depends on the quality of the weighting scheme used: we want to assign high weights to those terms with a high resolving power.
+tf-idf is one such approach, wherein the weight of a term is increased with its frequency in a document but decreased again if it is frequent across the collection.
+The ``bag of words'' model is usually adopted, i.e., a document is treated as an unordered collection of words.
+The term independence assumption is also usually adopted, i.e., the occurrence of each word in a document is assumed to be independent of the occurrence of other words.
+
+\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{``Bag of Words'' / Term Independence Example}]
+	If Document 1 contains the text ``Mary is quicker than John'' and Document 2 contains the text ``John is quicker than Mary'', then Document 1 \& Document 2 are viewed as equivalent.
+\end{tcolorbox}
+
+However, it is unlikely that 30 occurrences of a term in a document truly carry 30 times the significance of a single occurrence of that term.
+A common modification is to use the logarithm of the term frequency:
+\begin{align*}
+	\text{If } \textit{tf}_{i,d} > 0:& \quad w_{i,d} = 1 + \log(\textit{tf}_{i,d})\\
+	\text{Otherwise:}& \quad w_{i,d} = 0
+\end{align*}
+
+\subsubsection{Maximum Term Normalisation}
+We often want to normalise term frequencies because we observe higher raw frequencies in longer documents merely because longer documents tend to repeat the same words more often.
+Consider a document $d^\prime$ created by concatenating a document $d$ to itself:
+$d^\prime$ is no more relevant to any query than document $d$, yet according to the vector space type similarity, $\text{sim}(d^\prime, q) \geq \text{sim}(d,q) \, \forall \, q$.
+\\\\
+The formula for the \textbf{maximum term normalisation} of a term $i$ in a document $d$ is usually of the form
+\begin{align*}
+	\textit{ntf} = a + \left( 1 - a \right) \frac{\textit{tf}_{i,d}}{\textit{tf}_{\text{max}}(d)}
+\end{align*}
+where $a$ is a smoothing factor which can be used to dampen the impact of the second term, and $\textit{tf}_{\text{max}}(d)$ is the frequency of the most frequent term in $d$.
+\\\\
+Problems with maximum term normalisation include:
+\begin{itemize}
+	\item Stopword removal may affect the distribution of terms: this normalisation is unstable and may require tuning per collection.
+	\item There is a possibility of outliers with unusually high frequency.
+	\item Documents with a more even distribution of term frequencies should be treated differently to those with a skewed distribution.
+\end{itemize}
+
+More sophisticated forms of normalisation also exist, which we will explore in the future.
+
+\subsubsection{Modern Weighting Schemes}
+Many, if not all, of the developed or learned weighting schemes can be represented in the following format:
+\begin{align*}
+	\text{sim}(q,d) = \sum_{t \in q \cap d} \left( \textit{ntf}(D) \times \textit{gw}_t(C) \times \textit{qw}_t(Q) \right)
+\end{align*}
+where
+\begin{itemize}
+	\item $\textit{ntf}(D)$ is the normalised term frequency in a document $D$;
+	\item $\textit{gw}_t(C)$ is the global weight of a term $t$ across a collection $C$;
+	\item $\textit{qw}_t(Q)$ is the query weight of a term $t$ in a query $Q$.
+\end{itemize}
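+
+As a simple illustration, the basic tf-idf scheme from earlier can be read in this format; here the query weight is taken, for the sake of the example, to be the raw frequency of the term in the query.
+\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Example}]
+	Taking
+	$$
+	\textit{ntf}(D) = f_{t,D}, \quad \textit{gw}_t(C) = \log \frac{N}{n_t}, \quad \textit{qw}_t(Q) = \textit{tf}_{t,Q}
+	$$
+	gives
+	$$
+	\text{sim}(q,d) = \sum_{t \in q \cap d} \left( f_{t,D} \times \log \frac{N}{n_t} \times \textit{tf}_{t,Q} \right)
+	$$
+	i.e., a tf-idf weighted document term multiplied by a query term weight, summed over the terms common to the query \& document.
+\end{tcolorbox}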
+
+The \textbf{Okapi BM25} weighting scheme is a standard benchmark weighting scheme with relatively good performance, although it needs to be tuned per collection:
+\begin{align*}
+	\text{BM25}(Q,D) = \sum_{t \in Q \cap D} \left( \frac{\textit{tf}_{t,D} \cdot \log \left( \frac{N - \textit{df}_t + 0.5}{\textit{df}_t + 0.5} \right) \cdot \textit{tf}_{t, Q}}{\textit{tf}_{t,D} + k_1 \cdot \left( (1-b) + b \cdot \frac{\textit{dl}}{\textit{dl}_\text{avg}} \right)} \right)
+\end{align*}
+where $k_1$ \& $b$ are tuning parameters, $\textit{df}_t$ is the document frequency of the term $t$, $\textit{dl}$ is the document length, and $\textit{dl}_\text{avg}$ is the average document length in the collection.
+\\\\
+The \textbf{Pivoted Normalisation} weighting scheme is also a standard benchmark which needs to be tuned per collection, although it has known issues with its normalisation:
+\begin{align*}
+	\text{piv}(Q,D) = \sum_{t \in Q \cap D} \left( \frac{1 + \log \left( 1 + \log \left( \textit{tf}_{t, D} \right) \right)}{(1 - s) + s \cdot \frac{\textit{dl}}{\textit{dl}_\text{avg}}} \right) \times \log \left( \frac{N+1}{\textit{df}_t} \right) \times \textit{tf}_{t, Q}
+\end{align*}
+where $s$ is a tuning (slope) parameter.
+\\\\
+The \textbf{Axiomatic Approach} to weighting consists of the following constraints:
+\begin{itemize}
+	\item \textbf{Constraint 1:} adding a query term to a document must always increase the score of that document.
+	\item \textbf{Constraint 2:} adding a non-query term to a document must always decrease the score of that document.
+	\item \textbf{Constraint 3:} adding successive occurrences of a term to a document must increase the score of that document less with each successive occurrence.
+	Essentially, any term-frequency factor should be sub-linear.
+	\item \textbf{Constraint 4:} using the vector length should be a better normalisation factor for retrieval; however, using the vector length directly would violate one of the existing constraints.
+	Therefore, the document length factor should be used in a sub-linear function, which ensures that repeated appearances of non-query terms are weighted less.
+\end{itemize}
+
+New weighting schemes that adhere to all of these constraints have been shown to outperform the best known benchmarks.
 
 \end{document}
diff --git a/year4/semester1/CT4100: Information Retrieval/notes/images/zipfs_law_brown_corpus.png b/year4/semester1/CT4100: Information Retrieval/notes/images/zipfs_law_brown_corpus.png
new file mode 100644
index 00000000..49a02dd0
Binary files /dev/null and b/year4/semester1/CT4100: Information Retrieval/notes/images/zipfs_law_brown_corpus.png differ