%! TeX program = lualatex
\documentclass[a4paper,11pt]{article}
% packages
\usepackage{censor}
\StopCensoring
\usepackage{fontspec}
\usepackage{tcolorbox}
\setmainfont{EB Garamond}
% for tironian et fallback
% % \directlua{luaotfload.add_fallback
% % ("emojifallback",
% % {"Noto Serif:mode=harf"}
% % )}
% % \setmainfont{EB Garamond}[RawFeature={fallback=emojifallback}]
\setmonofont[Scale=MatchLowercase]{DejaVu Sans Mono}
\usepackage[a4paper,left=2cm,right=2cm,top=\dimexpr15mm+1.5\baselineskip,bottom=2cm]{geometry}
\setlength{\parindent}{0pt}
\usepackage{fancyhdr} % Headers and footers
\fancyhead[R]{\normalfont \leftmark}
\fancyhead[L]{}
\pagestyle{fancy}
\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage{amsmath}
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{xcolor}
\definecolor{linkblue}{RGB}{0, 64, 128}
\usepackage[final, colorlinks = false, urlcolor = linkblue]{hyperref}
% \newcommand{\secref}[1]{\textbf{§~\nameref{#1}}}
\newcommand{\secref}[1]{\textbf{§\ref{#1}~\nameref{#1}}}
\usepackage{changepage} % adjust margins on the fly
\usepackage{minted}
\usemintedstyle{algol_nu}
\usepackage{pgfplots}
\pgfplotsset{width=\textwidth,compat=1.9}
\usepackage{caption}
\newenvironment{code}{\captionsetup{type=listing}}{}
\captionsetup[listing]{skip=0pt}
\setlength{\abovecaptionskip}{5pt}
\setlength{\belowcaptionskip}{5pt}
\usepackage[yyyymmdd]{datetime}
\renewcommand{\dateseparator}{--}
\usepackage{enumitem}
\usepackage{titlesec}
\author{Andrew Hayes}
\begin{document}
\begin{titlepage}
\begin{center}
\hrule
\vspace*{0.6cm}
\censor{\huge \textbf{CT4100}}
\vspace*{0.6cm}
\hrule
\LARGE
\vspace{0.5cm}
Information Retrieval
\vspace{0.5cm}
\hrule
\vfill
\vfill
\hrule
\begin{minipage}{0.495\textwidth}
\vspace{0.4em}
\raggedright
\normalsize
Name: Andrew Hayes \\
E-mail: \href{mailto:a.hayes18@universityofgalway.ie}{\texttt{a.hayes18@universityofgalway.ie}} \hfill\\
Student ID: 21321503 \hfill
\end{minipage}
\begin{minipage}{0.495\textwidth}
\raggedleft
\vspace*{0.8cm}
\Large
\today
\vspace*{0.6cm}
\end{minipage}
\medskip\hrule
\end{center}
\end{titlepage}
\pagenumbering{roman}
\newpage
\tableofcontents
\newpage
\setcounter{page}{1}
\pagenumbering{arabic}
\section{Introduction}
\subsection{Lecturer Contact Details}
\begin{itemize}
\item Colm O'Riordan.
\item \href{mailto:colm.oriordan@universityofgalway.ie}{\texttt{colm.oriordan@universityofgalway.ie}}.
\end{itemize}
\subsection{Motivations}
\begin{itemize}
\item To study \& analyse techniques for dealing suitably with the large amounts (\& types) of information available.
\item Emphasis on research \& practice in Information Retrieval.
\end{itemize}
\subsection{Related Fields}
\begin{itemize}
\item Artificial Intelligence.
\item Database \& Information Systems.
\item Algorithms.
\item Human-Computer Interaction.
\end{itemize}
\subsection{Recommended Texts}
\begin{itemize}
\item \textit{Modern Information Retrieval} -- Ribeiro-Neto \& Baeza-Yates (several copies in library).
\item \textit{Information Retrieval} -- Grossman.
\item \textit{Introduction to Information Retrieval} -- Christopher Manning.
\item Extra resources such as research papers will be recommended as extra reading.
\end{itemize}
\subsection{Grading}
\begin{itemize}
\item Exam: 70\%.
\item Assignment 1: 30\%.
\item Assignment 2: 30\%.
\end{itemize}
There will be exercise sheets posted for most lectures; these are not mandatory and are intended as a study aid.
\subsection{Introduction to Information Retrieval}
\textbf{Information Retrieval (IR)} deals with identifying relevant information based on users' information needs, e.g.
web search engines, digital libraries, \& recommender systems.
It is finding material (usually documents) of an unstructured nature that satisfies an information need within large
collections (usually stored on computers).
\section{Information Retrieval Models}
\subsection{Introduction to Information Retrieval Models}
\textbf{Data collections} are well-structured collections of related items; items are usually atomic with a
well-defined interpretation.
Data retrieval involves the selection of a fixed set of data based on a well-defined query (e.g., SQL, OQL).
\\\\
\textbf{Information collections} are usually semi-structured or unstructured.
Information Retrieval (IR) involves the retrieval of documents of natural language which is typically not
structured and may be semantically ambiguous.
\subsubsection{Information Retrieval vs Information Filtering}
The main differences between information retrieval \& information filtering are:
\begin{itemize}
\item The nature of the information need.
\item The nature of the document set.
\end{itemize}
Other than these two differences, the same models are used.
Documents \& queries are represented using the same set of techniques and similar comparison algorithms are also
used.
\subsubsection{User Role}
In traditional IR, the user role was reasonably well-defined in that a user:
\begin{itemize}
\item Formulated a query.
\item Viewed the results.
\item Potentially offered feedback.
\item Potentially reformulated their query and repeated steps.
\end{itemize}
In more recent systems, with the increasing popularity of the hypertext paradigm, users usually intersperse
browsing with the traditional querying.
This raises many new difficulties \& challenges.
\subsection{Pre-Processing}
\textbf{Document pre-processing} is the application of a set of well-known techniques to the documents \& queries
prior to any comparison.
This includes, among others:
\begin{itemize}
\item \textbf{Stemming:} the reduction of words to a potentially common root.
The most common stemming algorithms are the Lovins \& Porter algorithms.
E.g. \textit{computerisation},
\textit{computing}, \textit{computers} could all be stemmed to the common form \textit{comput}.
\item \textbf{Stop-word removal:} the removal of very frequent terms from documents, which add little to the
semantics or meaning of the document.
\item \textbf{Thesaurus construction:} the manual or automatic creation of thesauri used to try to identify
synonyms within the documents.
\end{itemize}
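A short sketch of these pre-processing steps is given below; the stop-word list \& suffix-stripping rules are illustrative placeholders rather than a real Lovins or Porter stemmer.
\begin{minted}[linenos, breaklines, frame=single]{python}
# Illustrative pre-processing: tokenisation, stop-word removal and a crude
# suffix-stripping "stemmer" (not a real Lovins or Porter implementation).

STOP_WORDS = {"the", "a", "an", "and", "of", "or", "to", "in", "is"}  # toy list

def crude_stem(word: str) -> str:
    # Strip a few common suffixes; real stemmers apply ordered rule sets.
    for suffix in ("erisation", "ers", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    tokens = [t.lower() for t in text.split() if t.isalpha()]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]

# e.g. maps "computers", "computing", "computerisation" to the root "comput"
print(preprocess("The computers and computing of computerisation"))
\end{minted}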
The \textbf{representation} \& comparison techniques used depend on the information retrieval model chosen,
as does the choice of feedback techniques.
\subsection{Models}
Retrieval models can be broadly categorised as:
\begin{itemize}
\item Boolean:
\begin{itemize}
\item Classical Boolean.
\item Fuzzy Set approach.
\item Extended Boolean.
\end{itemize}
\item Vector:
\begin{itemize}
\item Vector Space approach.
\item Latent Semantic Indexing.
\item Neural Networks.
\end{itemize}
\item Probabilistic:
\begin{itemize}
\item Inference Network.
\item Belief Network.
\end{itemize}
\end{itemize}
We can view any IR model as comprising:
\begin{itemize}
\item $D$ is the set of logical representations within the documents.
\item $Q$ is the set of logical representations of the user information needs (queries).
\item $F$ is a framework for modelling representations ($D$ \& $Q$) and the relationship between $D$ \& $Q$.
\item $R$ is a ranking function which defines an ordering among the documents with regard to any query $q$.
\end{itemize}
We have a set of index terms:
$$
t_1, \dots , t_n
$$
A \textbf{weight} $w_{i,j}$ is assigned to each term $t_i$ occurring in document $d_j$.
We can view a document or query as a vector of weights:
$$
\vec{d_j} = (w_{1,j}, w_{2,j}, w_{3,j}, \dots)
$$
\subsection{Boolean Model}
The \textbf{Boolean model} of information retrieval is based on set theory \& Boolean algebra.
A query is viewed as a Boolean expression.
The model also assumes terms are present or absent, hence term weights $w_{i,j}$ are binary \& discrete, i.e.,
$w_{i,j}$ is an element of $\{0, 1\}$.
\\\\
Advantages of the Boolean model include:
\begin{itemize}
\item Clean formalism.
\item Widespread \& popular.
\item Relatively simple.
\end{itemize}
Disadvantages of the Boolean model include:
\begin{itemize}
\item People often have difficulty formulating Boolean expressions, which makes the model somewhat difficult to use.
\item Documents are considered either relevant or irrelevant; no partial matching allowed.
\item Poor performance.
\item Suffers badly from natural language effects of synonymy etc.
\item No ranking of results.
\item Terms in a document are considered independent of each other.
\end{itemize}
\subsubsection{Example}
$$
q = t_1 \land (t_2 \lor (\neg t_3))
$$
\begin{minted}[linenos, breaklines, frame=single]{sql}
q = t1 AND (t2 OR (NOT t3))
\end{minted}
This can be mapped to what is termed \textbf{disjunctive normal form (DNF)}: a disjunction (logical OR) of
conjunctive components, where each component below is written as a binary vector of weights over $(t_1, t_2, t_3)$.
$$
q = 100 \lor 110 \lor 111
$$
If a document satisfies any of the conjunctive components, it is deemed relevant and returned.
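A sketch of this matching process is given below; the document vectors are illustrative, and each DNF component is a binary weight vector over $(t_1, t_2, t_3)$.
\begin{minted}[linenos, breaklines, frame=single]{python}
# Boolean retrieval sketch: a document matches the query if its binary
# term vector satisfies at least one conjunctive component of the DNF.
# q = t1 AND (t2 OR (NOT t3))  =>  DNF components: 111, 110, 100

dnf_components = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]  # (t1, t2, t3) patterns

documents = {
    "d1": (1, 1, 0),   # contains t1 and t2 but not t3 -> matches 110
    "d2": (1, 0, 1),   # contains t1 and t3 only       -> matches nothing
    "d3": (1, 0, 0),   # contains t1 only              -> matches 100
}

def is_relevant(doc_vector, components):
    # The document is deemed relevant if it equals any DNF component.
    return any(doc_vector == c for c in components)

for name, vec in documents.items():
    print(name, "relevant" if is_relevant(vec, dnf_components) else "not relevant")
\end{minted}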
\subsection{Vector Space Model}
The \textbf{vector space model} attempts to improve upon the Boolean model by removing the limitation of binary
weights for index terms.
Terms can have non-binary weights in both queries \& documents.
Hence, we can represent the documents \& the query as $n$-dimensional vectors.
$$
\vec{d_j} = (w_{1,j}, w_{2,j}, \dots, w_{n,j})
$$
$$
\vec{q} = (w_{1,q}, w_{2,q}, \dots, w_{n,q})
$$
We can calculate the similarity between a document \& a query by calculating the similarity between the vector
representations of the document \& query by measuring the cosine of the angle between the two vectors.
$$
\vec{a} \cdot \vec{b} = \mid \vec{a} \mid \mid \vec{b} \mid \cos (\vec{a}, \vec{b})
$$
$$
\Rightarrow \cos (\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\mid \vec{a} \mid \mid \vec{b} \mid}
$$
We can therefore calculate the similarity between a document and a query as:
$$
\text{sim}(q,d) = \cos (\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\mid \vec{q} \mid \mid \vec{d} \mid}
$$
Considering term weights on the query and documents, we can calculate similarity between the document \& query as:
$$
\text{sim}(q,d) =
\frac
{\sum^N_{i=1} (w_{i,q} \times w_{i,d})}
{\sqrt{\sum^N_{i=1} (w_{i,q})^2} \times \sqrt{\sum^N_{i=1} (w_{i,d})^2} }
$$
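A minimal sketch of this cosine similarity calculation is shown below; the weight vectors are illustrative.
\begin{minted}[linenos, breaklines, frame=single]{python}
import math

# Cosine similarity between a query and a document, each represented as a
# vector of term weights over the same n index terms.

def cosine_sim(q, d):
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0          # avoid division by zero for empty vectors
    return dot / (norm_q * norm_d)

query = [0.0, 1.2, 0.0, 0.8]     # illustrative term weights
doc   = [0.5, 0.9, 0.4, 0.0]
print(round(cosine_sim(query, doc), 3))
\end{minted}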
Advantages of the vector space model over the Boolean model include:
\begin{itemize}
\item Improved performance due to weighting schemes.
\item Partial matching is allowed which gives a natural ranking.
\end{itemize}
The primary disadvantage of the vector space model is that terms are considered to be mutually independent.
\subsubsection{Weighting Schemes}
We need a means to calculate the term weights in the document and query vector representations.
A term's frequency within a document quantifies how well a term describes a document;
the more frequently a term occurs in a document, the better it is at describing that document and vice-versa.
This frequency is known as the \textbf{term frequency} or \textbf{tf factor}.
\\\\
If a term occurs frequently across all the documents, that term does little to distinguish one document from another.
This factor is known as the \textbf{inverse document frequency} or \textbf{idf factor}.
Traditionally, the most commonly-used weighting schemes are known as \textbf{tf-idf} weighting schemes.
\\\\
For all terms in a document, the weight assigned can be calculated as:
$$
w_{i,j} = f_{i,j} \times \log \left( \frac{N}{N_i} \right)
$$
where
\begin{itemize}
\item $f_{i,j}$ is the (possibly normalised) frequency of term $t_i$ in document $d_j$.
\item $N$ is the number of documents in the collection.
\item $N_i$ is the number of documents that contain term $t_i$.
\end{itemize}
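The sketch below applies this tf-idf weighting to a toy document collection; raw term counts are used for $f_{i,j}$ and the documents themselves are illustrative.
\begin{minted}[linenos, breaklines, frame=single]{python}
import math
from collections import Counter

# tf-idf weighting: w_ij = f_ij * log(N / N_i), where f_ij is the frequency
# of term t_i in document d_j, N is the number of documents in the
# collection and N_i is the number of documents containing t_i.

docs = {
    "d1": ["information", "retrieval", "retrieval", "models"],
    "d2": ["boolean", "retrieval", "models"],
    "d3": ["vector", "space", "information"],
}

N = len(docs)
doc_freq = Counter()                       # N_i for each term
for terms in docs.values():
    doc_freq.update(set(terms))

def tf_idf(doc_id):
    tf = Counter(docs[doc_id])             # raw term counts f_ij
    return {t: f * math.log(N / doc_freq[t]) for t, f in tf.items()}

print(tf_idf("d1"))
\end{minted}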
\section{Evaluation of IR Systems}
When evaluating an IR system, we need to consider:
\begin{itemize}
\item The \textbf{functional requirements}: whether or not the system works as intended.
This is done with standard testing techniques.
\item The \textbf{performance:}
\begin{itemize}
\item Response time.
\item Space requirements.
\item Measured by empirical analysis and by the efficiency of the algorithms \& data structures used for
compression, indexing, etc.
\end{itemize}
\item The \textbf{retrieval performance:} how useful is the system?
IR is a highly empirical discipline and there is a long history of the evaluation of retrieval performance.
This is less of an issue in data retrieval systems wherein perfect matching is possible as there exists
a correct answer.
\end{itemize}
\subsection{Test Collections}
Evaluation of IR systems is usually based on a reference \textbf{test collection} involving human evaluations.
The test collection usually comprises:
\begin{itemize}
\item A collection of documents $D$.
\item A set of information needs that can be represented as queries.
\item A list of relevance judgements for each query-document pair.
\end{itemize}
Issues with using test collections include:
\begin{itemize}
\item It can be very costly to obtain relevance judgements; crowd-sourcing \& pooling approaches are often used to make this feasible.
\item Relevance judgements don't have to be binary.
\item Agreement among judges can vary.
\end{itemize}
\textbf{TREC (Text REtrieval Conference)} provides a means to empirically test the performance of systems in
different domains by providing \textit{tracks} consisting of a data set \& test problems.
These tracks include:
\begin{itemize}
\item \textbf{Ad-hoc retrieval:} different tracks have been proposed to test ad-hoc retrieval including the
Web track (retrieval on web corpora) and the Million Query track (large number of queries).
\item \textbf{Interactive Track}: users interact with the system for relevance feedback.
\item \textbf{Contextual Search:} multiple queries over time.
\item \textbf{Entity Retrieval:} the task is to retrieve entities (people, places, organisations).
\item \textbf{Spam Filtering:} identifying \& filtering out non-relevant or harmful content such as email
spam.
\item \textbf{Question Answering (QA):} the goal is to retrieve precise answers to user questions rather than
returning entire documents.
\item \textbf{Cross-Language Retrieval:} the goal is to retrieve relevant documents in a different language
from the query.
Requires machine translation.
\item \textbf{Conversational IR:} retrieving information in conversational IR systems.
\item \textbf{Sentiment Retrieval:} emphasis on identifying opinions \& sentiments.
\item \textbf{Fact Checking:} misinformation track.
\item \textbf{Domain-Specific Retrieval:} e.g., genomic data.
\item Summarisation Tasks.
\end{itemize}
Relevance is assessed for the information need and not the query.
Because tuning \& optimisation can occur for many IR systems, it is considered good practice to tune on one
collection and then test on another.
\\\\
Interaction with an IR system may be a one-off query or an interactive session.
For the former, \textit{quality} of the returned set is the important metric, while for interactive systems other
issues have to be considered: duration of the session, user effort required, etc.
These issues make evaluation of interactive sessions more difficult.
\subsection{Precision \& Recall}
The most commonly used metrics are \textbf{precision} \& \textbf{recall}.
\subsubsection{Unranked Sets}
Given a set $D$ and a query $Q$, let $R$ be the set of documents relevant to $Q$.
Let $A$ be the set actually returned by the system.
\begin{itemize}
\item \textbf{Precision} is defined as $\frac{|R \cap A|}{|A|} = \frac{\text{relevant retrieved documents}}{\text{all retrieved documents}}$, i.e. what fraction of the retrieved documents are relevant.
\item \textbf{Recall} is defined as $\frac{|R \cap A|}{|R|} = \frac{\text{relevant retrieved documents}}{\text{all relevant documents}}$, i.e. what fraction of the relevant documents were returned.
\end{itemize}
Having two separate measures is useful as different IR systems may have different user requirements.
For example, in web search precision is of the greatest importance, but in the legal domain recall is of the greatest
importance.
\\\\
There is a trade-off between the two measures; for example, by returning every document in the set, recall is
maximised (because all relevant documents will be returned) but precision will be poor (because many irrelevant documents will be returned).
Recall is non-decreasing as the number of documents returned increases, while precision usually decreases as the
number of documents returned increases.
\begin{table}[h!]
\centering
\begin{tabular}{|p{0.3\textwidth}|p{0.3\textwidth}|p{0.3\textwidth}|}
\hline
& \textbf{Retrieved} & \textbf{Not Retrieved} \\
\hline
\textbf{Relevant} & True Positive (TP) & False Negative (FN) \\
\hline
\textbf{Non-Relevant} & False Positive (FP) & True Negative (TN) \\
\hline
\end{tabular}
\caption{Confusion Matrix of True/False Positives \& Negatives}
\end{table}
$$
\text{Precision } P = \frac{tp}{tp + fp} = \frac{\text{true positives}}{\text{true positives + false positives}}
$$
$$
\text{Recall } R = \frac{tp}{tp + fn} = \frac{\text{true positives}}{\text{true positives + false negatives}}
$$
The \textbf{accuracy} of a system is the fraction of these classifications that are correct:
$$
\text{Accuracy} = \frac{tp + tn}{tp +fp + fn + tn}
$$
Accuracy is a commonly used evaluation measure in machine learning classification work, but is not a very useful
measure in IR; for example, when searching for relevant documents in a very large set, the number of irrelevant
documents is usually much higher than the number of relevant documents, meaning that a high accuracy score is
attainable by getting true negatives by discarding most documents, even if there aren't many true positives.
\\\\
There are also many single-value measures that combine precision \& recall into one value:
\begin{itemize}
\item F-measure.
\item Balanced F-measure.
\end{itemize}
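A sketch of these set-based measures over illustrative relevant \& retrieved sets is given below; the balanced F-measure is computed as the harmonic mean of precision \& recall.
\begin{minted}[linenos, breaklines, frame=single]{python}
# Set-based evaluation measures for an unranked result set (illustrative data).
relevant  = {"d1", "d2", "d4", "d7"}        # R: documents judged relevant
retrieved = {"d1", "d3", "d4", "d5"}        # A: documents returned by the system
collection_size = 20                        # |D|

tp = len(relevant & retrieved)              # relevant and retrieved
fp = len(retrieved - relevant)              # retrieved but not relevant
fn = len(relevant - retrieved)              # relevant but not retrieved
tn = collection_size - tp - fp - fn         # everything else

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
accuracy  = (tp + tn) / collection_size
f1 = 2 * precision * recall / (precision + recall)  # balanced F-measure

print(precision, recall, accuracy, round(f1, 3))
\end{minted}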
\subsubsection{Evaluation of Ranked Results}
In IR, returned documents are usually ranked.
One way of evaluating ranked results is to use \textbf{Precision-Recall plots}, wherein precision is typically
plotted against recall.
In an ideal system, we would have a precision value of 1 for a recall value of 1, i.e., all relevant documents
have been returned and no irrelevant documents have been returned.
\begin{tcolorbox}[colback=gray!10, colframe=black, title=Example]
Given $|D| = 20$ \& $|R| = 10$ and a ranked list of length 10, let the returned ranked list be:
$$
\mathbf{d_1}, \mathbf{d_2}, d_3, \mathbf{d_4}, d_5, d_6, \mathbf{d_7}, d_8, d_9, d_{10}
$$
where the items in bold are those that are relevant.
\begin{itemize}
\item Considering the list as far as the first document: Precision = 1, Recall = 0.1.
\item As far as the first two documents: Precision = 1, Recall = 0.2.
\item As far as the first three documents: Precision = 0.67, Recall = 0.2.
\end{itemize}
We usually plot precision for recall values of 10\%, 20\%, \dots, 90\%.
\end{tcolorbox}
We typically calculate precision for these recall values over a set of queries to get a truer measure of a system's
performance:
$$
P(r) = \frac{1}{N} \sum^N_{i=1}P_i(r)
$$
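The sketch below computes precision at the standard recall levels for a set of ranked lists and averages across queries, in line with the formula above; the ranked lists \& relevance judgements are illustrative, and precision is simply taken at the first cut-off where each recall level is reached rather than using interpolation.
\begin{minted}[linenos, breaklines, frame=single]{python}
# Precision at standard recall levels (10% ... 90%) for ranked results,
# averaged over a set of queries: P(r) = (1/N) * sum_i P_i(r).

def precision_at_recall(ranked, relevant, level):
    # Precision at the smallest cut-off whose recall reaches `level`.
    hits = 0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        if hits / len(relevant) >= level:
            return hits / k
    return 0.0                     # recall level never reached

queries = [                        # illustrative (ranked list, relevant set) pairs
    (["d1", "d2", "d3", "d4"], {"d1", "d4"}),
    (["d2", "d1", "d5", "d3"], {"d1", "d3", "d9"}),
]

for level in [i / 10 for i in range(1, 10)]:
    avg = sum(precision_at_recall(r, rel, level) for r, rel in queries) / len(queries)
    print(f"recall {level:.1f}: average precision {avg:.2f}")
\end{minted}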
Advantages of Precision-Recall include:
\begin{itemize}
\item Widespread use.
\item It gives a definable measure.
\item It summarises the behaviour of an IR system.
\end{itemize}
Disadvantages of Precision-Recall include:
\begin{itemize}
\item It's not always possible to calculate the recall measure effectively for queries in batch mode.
\item Precision \& recall graphs can only be generated when we have ranking.
\item They're not necessarily of interest to the user.
\end{itemize}
Single-value measures for evaluating ranked results include:
\begin{itemize}
\item Evaluating precision when every new document is retrieved and averaging precision values.
\item Evaluating precision when the first relevant document is retrieved.
\item $R$-precision: calculate precision at rank $R$, where $R$ is the total number of relevant documents for the query.
\item Precision at $k$ (P@k).
\item Mean Average Precision (MAP).
\end{itemize}
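Sketches of two of these single-value measures, precision at $k$ \& Mean Average Precision, are given below; the ranked lists \& relevance judgements are again illustrative.
\begin{minted}[linenos, breaklines, frame=single]{python}
# Single-value measures for ranked results: P@k and Mean Average Precision.

def precision_at_k(ranked, relevant, k):
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    # Average the precision values obtained each time a relevant document
    # is retrieved; unretrieved relevant documents contribute zero.
    hits, precisions = 0, []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

runs = [                           # illustrative (ranked list, relevant set) pairs
    (["d1", "d2", "d3", "d4"], {"d1", "d4"}),
    (["d2", "d1", "d5", "d3"], {"d1", "d3"}),
]

print(precision_at_k(runs[0][0], runs[0][1], k=3))                    # P@3 for one query
print(sum(average_precision(r, rel) for r, rel in runs) / len(runs))  # MAP over all queries
\end{minted}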
Precision histograms are used to compare two algorithms over a set of queries.
We calculate the $R$-precision (or possibly another single summary statistic) of two systems over all queries.
The difference between the two is plotted for each of the queries.
\subsection{User-Oriented Measures}
Let $D$ be the document set, $R$ be the set of relevant documents, $A$ be the answer set returned to the users,
and $U$ be the set of relevant documents previously known to the user.
Let $AU = A \cap U$ be the set of returned documents previously known to the user.
$$
\text{Coverage} = \frac{|AU|}{|U|}
$$
Let \textit{New} refer to the set of relevant documents returned to the user that were previously unknown to the user.
We can define \textbf{novelty} as:
$$
\text{Novelty} = \frac{|\text{New}|}{|\text{New}| + |AU|}
$$
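These two measures can be sketched as simple set computations, as below; the document sets are illustrative and, for simplicity, every returned document is assumed to be relevant.
\begin{minted}[linenos, breaklines, frame=single]{python}
# User-oriented measures: coverage and novelty (illustrative sets).
returned_relevant = {"d1", "d2", "d5", "d8"}   # relevant documents in the answer set A
known_to_user     = {"d1", "d2", "d3"}         # U: relevant documents the user already knew

au  = returned_relevant & known_to_user        # AU: returned documents already known
new = returned_relevant - known_to_user        # New: relevant returned documents that are new

coverage = len(au) / len(known_to_user)        # |AU| / |U|
novelty  = len(new) / (len(new) + len(au))     # |New| / (|New| + |AU|)

print(f"coverage = {coverage:.2f}, novelty = {novelty:.2f}")
\end{minted}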
The issues surrounding interactive sessions are much more difficult to assess.
Much of the work in measuring user satisfaction comes from the field of HCI.
The usability of these systems is usually measured by monitoring user behaviour or via surveys of the user's
experience.
Another closely related area is that of information visualisation: how best to represent the retrieved data for a
user, etc.
\end{document}