%! TeX program = lualatex
|
|
\documentclass[a4paper,11pt]{article}
|
|
% packages
|
|
\usepackage{censor}
|
|
\StopCensoring
|
|
\usepackage{fontspec}
|
|
\usepackage{tcolorbox}
|
|
\setmainfont{EB Garamond}
|
|
% for tironian et fallback
|
|
% % \directlua{luaotfload.add_fallback
|
|
% % ("emojifallback",
|
|
% % {"Noto Serif:mode=harf"}
|
|
% % )}
|
|
% % \setmainfont{EB Garamond}[RawFeature={fallback=emojifallback}]
|
|
|
|
\setmonofont[Scale=MatchLowercase]{DejaVu Sans Mono}
|
|
\usepackage[a4paper,left=2cm,right=2cm,top=\dimexpr15mm+1.5\baselineskip,bottom=2cm]{geometry}
|
|
\setlength{\parindent}{0pt}
|
|
|
|
\usepackage{fancyhdr} % Headers and footers
|
|
\fancyhead[R]{\normalfont \leftmark}
|
|
\fancyhead[L]{}
|
|
\pagestyle{fancy}
|
|
|
|
\usepackage{microtype} % Slightly tweak font spacing for aesthetics
|
|
\usepackage{amsmath}
|
|
\usepackage[english]{babel} % Language hyphenation and typographical rules
|
|
\usepackage{xcolor}
|
|
\definecolor{linkblue}{RGB}{0, 64, 128}
|
|
\usepackage[final, colorlinks = false, urlcolor = linkblue]{hyperref}
|
|
% \newcommand{\secref}[1]{\textbf{§~\nameref{#1}}}
|
|
\newcommand{\secref}[1]{\textbf{§\ref{#1}~\nameref{#1}}}
|
|
|
|
\usepackage{changepage} % adjust margins on the fly
|
|
|
|
\usepackage{minted}
|
|
\usemintedstyle{algol_nu}
|
|
|
|
\usepackage{pgfplots}
|
|
\pgfplotsset{width=\textwidth,compat=1.9}
|
|
|
|
\usepackage{caption}
|
|
\newenvironment{code}{\captionsetup{type=listing}}{}
|
|
\captionsetup[listing]{skip=0pt}
|
|
\setlength{\abovecaptionskip}{5pt}
|
|
\setlength{\belowcaptionskip}{5pt}
|
|
|
|
\usepackage[yyyymmdd]{datetime}
|
|
\renewcommand{\dateseparator}{--}
|
|
|
|
\usepackage{enumitem}
|
|
|
|
\usepackage{titlesec}
|
|
|
|
\author{Andrew Hayes}
|
|
|
|
\begin{document}
|
|
\begin{titlepage}
|
|
\begin{center}
|
|
\hrule
|
|
\vspace*{0.6cm}
|
|
\censor{\huge \textbf{CT4100}}
|
|
\vspace*{0.6cm}
|
|
\hrule
|
|
\LARGE
|
|
\vspace{0.5cm}
|
|
Information Retrieval
|
|
\vspace{0.5cm}
|
|
\hrule
|
|
|
|
\vfill
|
|
\vfill
|
|
|
|
\hrule
|
|
\begin{minipage}{0.495\textwidth}
|
|
\vspace{0.4em}
|
|
\raggedright
|
|
\normalsize
|
|
Name: Andrew Hayes \\
|
|
E-mail: \href{mailto:a.hayes18@universityofgalway.ie}{\texttt{a.hayes18@universityofgalway.ie}} \hfill\\
|
|
Student ID: 21321503 \hfill
|
|
\end{minipage}
|
|
\begin{minipage}{0.495\textwidth}
|
|
\raggedleft
|
|
\vspace*{0.8cm}
|
|
\Large
|
|
\today
|
|
\vspace*{0.6cm}
|
|
\end{minipage}
|
|
\medskip\hrule
|
|
\end{center}
|
|
\end{titlepage}
|
|
|
|
\pagenumbering{roman}
|
|
\newpage
|
|
\tableofcontents
|
|
\newpage
|
|
\setcounter{page}{1}
|
|
\pagenumbering{arabic}
|
|
|
|
\section{Introduction}
|
|
\subsection{Lecturer Contact Details}
|
|
\begin{itemize}
|
|
\item Colm O'Riordan.
|
|
\item \href{mailto:colm.oriordan@universityofgalway.ie}{\texttt{colm.oriordan@universityofgalway.ie}}.
|
|
\end{itemize}
|
|
|
|
\subsection{Motivations}
|
|
\begin{itemize}
|
|
\item To study/analyse techniques for dealing suitably with the large amounts (\& types) of information available.
|
|
\item Emphasis on research \& practice in Information Retrieval.
|
|
\end{itemize}
|
|
|
|
\subsection{Related Fields}
|
|
\begin{itemize}
|
|
\item Artificial Intelligence.
|
|
\item Database \& Information Systems.
|
|
\item Algorithms.
|
|
\item Human-Computer Interaction.
|
|
\end{itemize}
|
|
|
|
\subsection{Recommended Texts}
|
|
\begin{itemize}
|
|
\item \textit{Modern Information Retrieval} -- Ribeiro-Neto \& Baeza-Yates (several copies in library).
|
|
\item \textit{Information Retrieval} -- Grossman.
|
|
\item \textit{Introduction to Information Retrieval} -- Christopher Manning.
|
|
\item Extra resources such as research papers will be recommended as extra reading.
|
|
\end{itemize}
|
|
|
|
\subsection{Grading}
|
|
\begin{itemize}
|
|
\item Exam: 70\%.
|
|
\item Assignment 1: 30\%.
|
|
\item Assignment 2: 30\%.
|
|
\end{itemize}
|
|
|
|
There will be exercise sheets posted for most lectures; these are not mandatory and are intended as a study aid.
|
|
|
|
\subsection{Introduction to Information Retrieval}
|
|
\textbf{Information Retrieval (IR)} deals with identifying relevant information based on users' information needs, e.g.
|
|
web search engines, digital libraries, \& recommender systems.
|
|
It is finding material (usually documents) of an unstructured nature that satisfies an information need within large
|
|
collections (usually stored on computers).
|
|
|
|
\section{Information Retrieval Models}
|
|
\subsection{Introduction to Information Retrieval Models}
|
|
\textbf{Data collections} are well-structured collections of related items; items are usually atomic with a
|
|
well-defined interpretation.
|
|
Data retrieval involves the selection of a fixed set of data based on a well-defined query (e.g., SQL, OQL).
|
|
\\\\
|
|
\textbf{Information collections} are usually semi-structured or unstructured.
|
|
Information Retrieval (IR) involves the retrieval of natural-language documents, which are typically unstructured and may be semantically ambiguous.
|
|
|
|
\subsubsection{Information Retrieval vs Information Filtering}
|
|
The main differences between information retrieval \& information filtering are:
|
|
\begin{itemize}
|
|
\item The nature of the information need.
|
|
\item The nature of the document set.
|
|
\end{itemize}
|
|
|
|
Other than these two differences, the same models are used.
|
|
Documents \& queries are represented using the same set of techniques and similar comparison algorithms are also
|
|
used.
|
|
|
|
\subsubsection{User Role}
|
|
In traditional IR, the user role was reasonably well-defined in that a user:
|
|
\begin{itemize}
|
|
\item Formulated a query.
|
|
\item Viewed the results.
|
|
\item Potentially offered feedback.
|
|
\item Potentially reformulated their query and repeated steps.
|
|
\end{itemize}
|
|
|
|
In more recent systems, with the increasing popularity of the hypertext paradigm, users usually intersperse
|
|
browsing with the traditional querying.
|
|
This raises many new difficulties \& challenges.
|
|
|
|
\subsection{Pre-Processing}
|
|
\textbf{Document pre-processing} is the application of a set of well-known techniques to the documents \& queries
|
|
prior to any comparison.
|
|
This includes, among others:
|
|
\begin{itemize}
|
|
\item \textbf{Stemming:} the reduction of words to a potentially common root.
|
|
The most common stemming algorithms are the Lovins \& Porter algorithms.
|
|
E.g. \textit{computerisation},
|
|
\textit{computing}, \textit{computers} could all be stemmed to the common form \textit{comput}.
|
|
\item \textbf{Stop-word removal:} the removal of very frequent terms from documents, which add little to the semantics or meaning of the document.
|
|
\item \textbf{Thesaurus construction:} the manual or automatic creation of thesauri used to try to identify
|
|
synonyms within the documents.
|
|
\end{itemize}
|
|
|
|
\textbf{Representation} \& comparison technique depends on the information retrieval model chosen.
|
|
The choice of feedback techniques is also dependent on the model chosen.
|
|
|
|
\subsection{Models}
|
|
Retrieval models can be broadly categorised as:
|
|
\begin{itemize}
|
|
\item Boolean:
|
|
\begin{itemize}
|
|
\item Classical Boolean.
|
|
\item Fuzzy Set approach.
|
|
\item Extended Boolean.
|
|
\end{itemize}
|
|
|
|
\item Vector:
|
|
\begin{itemize}
|
|
\item Vector Space approach.
|
|
\item Latent Semantic indexing.
|
|
\item Neural Networks.
|
|
\end{itemize}
|
|
|
|
\item Probabilistic:
|
|
\begin{itemize}
|
|
\item Inference Network.
|
|
\item Belief Network.
|
|
\end{itemize}
|
|
\end{itemize}
|
|
|
|
We can view any IR model as being composed of:
|
|
\begin{itemize}
|
|
\item $D$ is the set of logical representations within the documents.
|
|
\item $Q$ is the set of logical representations of the user information needs (queries).
|
|
\item $F$ is a framework for modelling representations ($D$ \& $Q$) and the relationship between $D$ \& $Q$.
|
|
\item $R$ is a ranking function which defines an ordering among the documents with regard to any query $q$.
|
|
\end{itemize}
|
|
|
|
We have a set of index terms:
|
|
$$
|
|
t_1, \dots , t_n
|
|
$$
|
|
|
|
A \textbf{weight} $w_{i,j}$ is assigned to each term $t_i$ occurring in document $d_j$.
|
|
We can view a document or query as a vector of weights:
|
|
$$
|
|
\vec{d_j} = (w_1, w_2, w_3, \dots)
|
|
$$
|
|
|
|
\subsection{Boolean Model}
|
|
The \textbf{Boolean model} of information retrieval is based on set theory \& Boolean algebra.
|
|
A query is viewed as a Boolean expression.
|
|
The model also assumes terms are present or absent, hence term weights $w_{i,j}$ are binary \& discrete, i.e.,
|
|
$w_{i,j}$ is an element of $\{0, 1\}$.
|
|
\\\\
|
|
Advantages of the Boolean model include:
|
|
\begin{itemize}
|
|
\item Clean formalism.
|
|
\item Widespread \& popular.
|
|
\item Relatively simple.
|
|
\end{itemize}
|
|
|
|
Disadvantages of the Boolean model include:
|
|
\begin{itemize}
|
|
\item People often have difficulty formulating Boolean expressions, which makes the model somewhat difficult to use.
|
|
\item Documents are considered either relevant or irrelevant; no partial matching allowed.
|
|
\item Poor performance.
|
|
\item Suffers badly from natural language effects of synonymy etc.
|
|
\item No ranking of results.
|
|
\item Terms in a document are considered independent of each other.
|
|
\end{itemize}
|
|
|
|
\subsubsection{Example}
|
|
$$
|
|
q = t_1 \land (t_2 \lor (\neg t_3))
|
|
$$
|
|
|
|
\begin{minted}[linenos, breaklines, frame=single]{sql}
|
|
q = t1 AND (t2 OR (NOT t3))
|
|
\end{minted}
|
|
|
|
This can be mapped to what is termed \textbf{disjunctive normal form}, where we have a series of disjunctions
|
|
(or logical ORs) of conjunctions.
|
|
|
|
$$
|
|
q = 100 \lor 110 \lor 111
|
|
$$
|
|
|
|
If a document satisfies any of the components, the document is deemed relevant and returned.
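A minimal sketch of this evaluation in Python, using a hypothetical toy collection in which each document is reduced to a binary incidence vector over the query terms $(t_1, t_2, t_3)$:

\begin{minted}[linenos, breaklines, frame=single]{python}
# Toy documents as binary incidence vectors over (t1, t2, t3);
# the document names and contents are illustrative only.
documents = {
    "d1": (1, 1, 0),   # contains t1 and t2 but not t3
    "d2": (1, 0, 1),   # contains t1 and t3 but not t2
    "d3": (0, 1, 1),   # contains t2 and t3 but not t1
}

# q = t1 AND (t2 OR (NOT t3)) in disjunctive normal form:
# the satisfying assignments 100, 110, 111 from above.
dnf_components = {(1, 0, 0), (1, 1, 0), (1, 1, 1)}

def relevant(vector):
    """A document is returned if it satisfies any conjunctive component."""
    return vector in dnf_components

print([name for name, vec in documents.items() if relevant(vec)])  # ['d1']
\end{minted}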
|
|
|
|
\subsection{Vector Space Model}
|
|
The \textbf{vector space model} attempts to improve upon the Boolean model by removing the limitation of binary
|
|
weights for index terms.
|
|
Terms can have non-binary weights in both queries \& documents.
|
|
Hence, we can represent the documents \& the query as $n$-dimensional vectors.
|
|
|
|
$$
|
|
\vec{d_j} = (w_{1,j}, w_{2,j}, \dots, w_{n,j})
|
|
$$
|
|
$$
|
|
\vec{q} = (w_{1,q}, w_{2,q}, \dots, w_{n,q})
|
|
$$
|
|
|
|
We can calculate the similarity between a document \& a query by calculating the similarity between the vector
|
|
representations of the document \& query by measuring the cosine of the angle between the two vectors.
|
|
$$
|
|
\vec{a} \cdot \vec{b} = \mid \vec{a} \mid \mid \vec{b} \mid \cos (\vec{a}, \vec{b})
|
|
$$
|
|
$$
|
|
\Rightarrow \cos (\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\mid \vec{a} \mid \mid \vec{b} \mid}
|
|
$$
|
|
|
|
We can therefore calculate the similarity between a document and a query as:
|
|
$$
|
|
\text{sim}(q,d) = \cos (\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\mid \vec{q} \mid \mid \vec{d} \mid}
|
|
$$
|
|
|
|
Considering term weights on the query and documents, we can calculate similarity between the document \& query as:
|
|
$$
|
|
\text{sim}(q,d) =
|
|
\frac
|
|
{\sum^N_{i=1} (w_{i,q} \times w_{i,d})}
|
|
{\sqrt{\sum^N_{i=1} (w_{i,q})^2} \times \sqrt{\sum^N_{i=1} (w_{i,d})^2} }
|
|
$$
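A minimal sketch of the cosine computation in Python, using hypothetical weight vectors over a small lexicon:

\begin{minted}[linenos, breaklines, frame=single]{python}
import math

def cosine_sim(q, d):
    """Cosine of the angle between query and document weight vectors."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

# Illustrative weight vectors over a 4-term lexicon.
query    = [0.5, 0.0, 0.8, 0.0]
document = [0.4, 0.2, 0.7, 0.1]
print(round(cosine_sim(query, document), 3))
\end{minted}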
|
|
|
|
Advantages of the vector space model over the Boolean model include:
|
|
\begin{itemize}
|
|
\item Improved performance due to weighting schemes.
|
|
\item Partial matching is allowed which gives a natural ranking.
|
|
\end{itemize}
|
|
|
|
The primary disadvantage of the vector space model is that terms are considered to be mutually independent.
|
|
|
|
\subsubsection{Weighting Schemes}
|
|
We need a means to calculate the term weights in the document and query vector representations.
|
|
A term's frequency within a document quantifies how well a term describes a document;
|
|
the more frequently a term occurs in a document, the better it is at describing that document and vice-versa.
|
|
This frequency is known as the \textbf{term frequency} or \textbf{tf factor}.
|
|
\\\\
|
|
If a term occurs frequently across all the documents, that term does little to distinguish one document from another.
|
|
This factor is known as the \textbf{inverse document frequency} or \textbf{idf-frequency}.
|
|
Traditionally, the most commonly-used weighting schemes are known as \textbf{tf-idf} weighting schemes.
|
|
\\\\
|
|
For all terms in a document, the weight assigned can be calculated as:
|
|
$$
|
|
w_{i,j} = f_{i,j} \times \log \left( \frac{N}{N_i} \right)
|
|
$$
|
|
where
|
|
\begin{itemize}
|
|
\item $f_{i,j}$ is the (possibly normalised) frequency of term $t_i$ in document $d_j$.
|
|
\item $N$ is the number of documents in the collection.
|
|
\item $N_i$ is the number of documents that contain term $t_i$.
|
|
\end{itemize}
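A minimal sketch of this weighting in Python, assuming a hypothetical toy collection of pre-tokenised documents:

\begin{minted}[linenos, breaklines, frame=single]{python}
import math
from collections import Counter

# Hypothetical toy collection of tokenised documents.
docs = {
    "d1": ["information", "retrieval", "models", "retrieval"],
    "d2": ["database", "retrieval", "systems"],
    "d3": ["information", "filtering", "systems"],
}

N = len(docs)                                   # number of documents
# N_i: the number of documents containing each term t_i.
doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))

def tf_idf(tokens):
    """w_{i,j} = f_{i,j} * log(N / N_i) for each term in one document."""
    tf = Counter(tokens)
    return {t: f * math.log(N / doc_freq[t]) for t, f in tf.items()}

print(tf_idf(docs["d1"]))
\end{minted}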
|
|
|
|
\section{Evaluation of IR Systems}
|
|
When evaluating an IR system, we need to consider:
|
|
\begin{itemize}
|
|
\item The \textbf{functional requirements}: whether or not the system works as intended.
|
|
This is done with standard testing techniques.
|
|
\item The \textbf{performance:}
|
|
\begin{itemize}
|
|
\item Response time.
|
|
\item Space requirements.
|
|
\item Measured by empirical analysis and by the efficiency of the algorithms \& data structures used for compression, indexing, etc.
|
|
\end{itemize}
|
|
\item The \textbf{retrieval performance:} how useful is the system?
|
|
IR is a highly empirical discipline and there is a long history of the evaluation of retrieval performance.
|
|
This is less of an issue in data retrieval systems wherein perfect matching is possible as there exists
|
|
a correct answer.
|
|
\end{itemize}
|
|
|
|
\subsection{Test Collections}
|
|
Evaluation of IR systems is usually based on a reference \textbf{test collection} involving human evaluations.
|
|
The test collection usually comprises:
|
|
\begin{itemize}
|
|
\item A collection of documents $D$.
|
|
\item A set of information needs that can be represented as queries.
|
|
\item A list of relevance judgements for each query-document pair.
|
|
\end{itemize}
|
|
|
|
Issues with using test collections include:
|
|
\begin{itemize}
|
|
\item It can be very costly to obtain relevance judgements.
|
|
\item Crowd-sourcing is often used to obtain judgements more cheaply.
\item Pooling approaches are often used so that only a subset of the collection needs to be judged per query.
|
|
\item Relevance judgements don't have to be binary.
|
|
\item Agreement among judges.
|
|
\end{itemize}
|
|
|
|
\textbf{TREC (Text REtrieval Conference)} provides a means to empirically test the performance of systems in
|
|
different domains by providing \textit{tracks} consisting of a data set \& test problems.
|
|
These tracks include:
|
|
\begin{itemize}
|
|
\item \textbf{Ad-hoc retrieval:} different tracks have been proposed to test ad-hoc retrieval including the
|
|
Web track (retrieval on web corpora) and the Million Query track (large number of queries).
|
|
\item \textbf{Interactive Track}: users interact with the system for relevance feedback.
|
|
\item \textbf{Contextual Search:} multiple queries over time.
|
|
\item \textbf{Entity Retrieval:} the task is to retrieve entities (people, places, organisations).
|
|
\item \textbf{Spam Filtering:} identifying \& filtering out non-relevant or harmful content such as email
|
|
spam.
|
|
\item \textbf{Question Answering (QA):} the goal is to retrieve precise answers to user questions rather than
|
|
returning entire documents.
|
|
\item \textbf{Cross-Language Retrieval:} the goal is to retrieve relevant documents in a different language
|
|
from the query.
|
|
Requires machine translation.
|
|
\item \textbf{Conversational IR:} retrieving information in conversational IR systems.
|
|
\item \textbf{Sentiment Retrieval:} emphasis on identifying opinions \& sentiments.
|
|
\item \textbf{Fact Checking:} misinformation track.
|
|
\item \textbf{Domain-Specific Retrieval:} e.g., genomic data.
|
|
\item Summarisation Tasks.
|
|
\end{itemize}
|
|
|
|
Relevance is assessed for the information need and not the query.
|
|
Because tuning \& optimisation can occur for many IR systems, it is considered good practice to tune on one
|
|
collection and then test on another.
|
|
\\\\
|
|
Interaction with an IR system may be a one-off query or an interactive session.
|
|
For the former, \textit{quality} of the returned set is the important metric, while for interactive systems other
|
|
issues have to be considered: duration of the session, user effort required, etc.
|
|
These issues make evaluation of interactive sessions more difficult.
|
|
|
|
\subsection{Precision \& Recall}
|
|
The most commonly used metrics are \textbf{precision} \& \textbf{recall}.
|
|
\subsubsection{Unranked Sets}
|
|
Given a set $D$ and a query $Q$, let $R$ be the set of documents relevant to $Q$.
|
|
Let $A$ be the set actually returned by the system.
|
|
\begin{itemize}
|
|
\item \textbf{Precision} is defined as $\frac{|R \cap A|}{|A|} = \frac{\text{relevant retrieved documents}}{\text{all retrieved documents}}$, i.e. what fraction of the retrieved documents are relevant.
|
|
\item \textbf{Recall} is defined as $\frac{|R \cap A|}{|R|} = \frac{\text{relevant retrieved documents}}{\text{all relevant documents}}$, i.e. what fraction of the relevant documents were returned.
|
|
\end{itemize}
|
|
|
|
Having two separate measures is useful as different IR systems may have different user requirements.
|
|
For example, in web search precision is of the greatest importance, but in the legal domain recall is of the greatest
|
|
importance.
|
|
\\\\
|
|
There is a trade-off between the two measures; for example, by returning every document in the set, recall is
|
|
maximised (because all relevant documents will be returned) but precision will be poor (because many irrelevant documents will be returned).
|
|
Recall is non-decreasing as the number of documents returned increases, while precision usually decreases as the
|
|
number of documents returned increases.
|
|
|
|
\begin{table}[h!]
|
|
\centering
|
|
\begin{tabular}{|p{0.3\textwidth}|p{0.3\textwidth}|p{0.3\textwidth}|}
|
|
\hline
|
|
& \textbf{Retrieved} & \textbf{Not Retrieved} \\
|
|
\hline
|
|
\textbf{Relevant} & True Positive (TP) & False Negative (FN) \\
|
|
\hline
|
|
\textbf{Non-Relevant} & False Positive (FP) & True Negative (TN) \\
|
|
\hline
|
|
\end{tabular}
|
|
\caption{Confusion Matrix of True/False Positives \& Negatives}
|
|
\end{table}
|
|
|
|
$$
|
|
\text{Precision } P = \frac{tp}{tp + fp} = \frac{\text{true positives}}{\text{true positives + false positives}}
|
|
$$
|
|
$$
|
|
\text{Recall } R = \frac{tp}{tp + fn} = \frac{\text{true positives}}{\text{true positives + false negatives}}
|
|
$$
|
|
|
|
The \textbf{accuracy} of a system is the fraction of these classifications that are correct:
|
|
$$
|
|
\text{Accuracy} = \frac{tp + tn}{tp +fp + fn + tn}
|
|
$$
|
|
|
|
Accuracy is a commonly used evaluation measure in machine learning classification work, but is not a very useful
|
|
measure in IR; for example, when searching for relevant documents in a very large set, the number of irrelevant
|
|
documents is usually much higher than the number of relevant documents, meaning that a high accuracy score is
|
|
attainable by getting true negatives by discarding most documents, even if there aren't many true positives.
|
|
\\\\
|
|
There are also many single-value measures that combine precision \& recall into one value:
|
|
\begin{itemize}
|
|
\item F-measure.
|
|
\item Balanced F-measure.
|
|
\end{itemize}
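A minimal sketch of these set-based measures in Python, including the balanced F-measure $F_1 = \frac{2PR}{P + R}$ (the harmonic mean of precision \& recall), using hypothetical relevance judgements:

\begin{minted}[linenos, breaklines, frame=single]{python}
def evaluate(relevant, retrieved):
    """Set-based precision, recall & balanced F-measure (F1)."""
    tp = len(relevant & retrieved)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical relevant set R and answer set A.
R = {"d1", "d2", "d4", "d7"}
A = {"d1", "d3", "d4", "d5"}
print(evaluate(R, A))   # (0.5, 0.5, 0.5)
\end{minted}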
|
|
|
|
\subsubsection{Evaluation of Ranked Results}
|
|
In IR, returned documents are usually ranked.
|
|
One way of evaluating ranked results is to use \textbf{Precision-Recall plots}, wherein precision is typically
|
|
plotted against recall.
|
|
In an ideal system, we would have a precision value of 1 for a recall value of 1, i.e., all relevant documents
|
|
have been returned and no irrelevant documents have been returned.
|
|
|
|
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Example}]
|
|
Given $|D| = 20$ \& $|R| = 10$ and a ranked list of length 10, let the returned ranked list be:
|
|
$$
|
|
\mathbf{d_1}, \mathbf{d_2}, d_3, \mathbf{d_4}, d_5, d_6, \mathbf{d_7}, d_8, d_9, d_{10}
|
|
$$
|
|
|
|
where the items in bold are those that are relevant.
|
|
\begin{itemize}
|
|
\item Considering the list as far as the first document: Precision = 1, Recall = 0.1.
|
|
\item As far as the first two documents: Precision = 1, Recall = 0.2.
|
|
\item As far as the first three documents: Precision = 0.67, Recall = 0.2.
|
|
\end{itemize}
|
|
|
|
We usually plot for recall values = 10\% ... 90\%.
|
|
\end{tcolorbox}
|
|
|
|
We typically calculate precision for these recall values over a set of queries to get a truer measure of a system's
|
|
performance:
|
|
$$
|
|
P(r) = \frac{1}{N} \sum^N_{i=1}P_i(r)
|
|
$$
|
|
|
|
Advantages of Precision-Recall include:
|
|
\begin{itemize}
|
|
\item Widespread use.
|
|
\item It gives a definable measure.
|
|
\item It summarises the behaviour of an IR system.
|
|
\end{itemize}
|
|
|
|
Disadvantages of Precision-Recall include:
|
|
\begin{itemize}
|
|
\item It is not always possible to calculate the recall measure effectively, as the full set of relevant documents must be known; this is generally only feasible for queries evaluated in batch mode over a test collection.
|
|
\item Precision \& recall graphs can only be generated when we have ranking.
|
|
\item They're not necessarily of interest to the user.
|
|
\end{itemize}
|
|
|
|
Single-value measures for evaluating ranked results include:
|
|
\begin{itemize}
|
|
\item Evaluating precision each time a new relevant document is retrieved and averaging these precision values.
|
|
\item Evaluating precision when the first relevant document is retrieved.
|
|
\item $R$-precision: calculate the precision once $R$ documents have been retrieved, where $R$ is the total number of relevant documents for the query.
|
|
\item Precision at $k$ (P@k).
|
|
\item Mean Average Precision (MAP).
|
|
\end{itemize}
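A minimal sketch of precision at $k$ and average precision (the mean of which over a set of queries gives MAP) in Python, using a hypothetical ranked list:

\begin{minted}[linenos, breaklines, frame=single]{python}
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision(ranking, relevant):
    """Average of precision values taken at each relevant document retrieved."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

# Hypothetical ranked list and relevance judgements.
ranking = ["d1", "d2", "d3", "d4", "d7"]
relevant = {"d1", "d4", "d9"}
print(precision_at_k(ranking, relevant, 3))   # 0.333...
print(average_precision(ranking, relevant))   # (1/1 + 2/4) / 3 = 0.5
\end{minted}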
|
|
|
|
Precision histograms are used to compare two algorithms over a set of queries.
|
|
We calculate the $R$-precision (or possibly another single summary statistic) of two systems over all queries.
|
|
The difference between the two is plotted for each of the queries.
|
|
|
|
\subsection{User-Oriented Measures}
|
|
Let $D$ be the document set, $R$ be the set of relevant documents, $A$ be the answer set returned to the users,
|
|
and $U$ be the set of relevant documents previously known to the user.
|
|
Let $AU$ be the set of returned documents previously known to the user.
|
|
$$
|
|
\text{Coverage} = \frac{|AU|}{|U|}
|
|
$$
|
|
Let \textit{New} refer to the set of relevant documents returned to the user that were previously unknown to the user.
|
|
We can define \textbf{novelty} as:
|
|
$$
|
|
\text{Novelty} = \frac{|\text{New}|}{|\text{New}| + |AU|}
|
|
$$
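A minimal sketch of these two measures in Python, interpreting $AU$ as $A \cap U$ and using hypothetical sets:

\begin{minted}[linenos, breaklines, frame=single]{python}
# Hypothetical sets for the user-oriented measures above.
A = {"d1", "d2", "d3", "d5", "d8"}   # answer set returned to the user
R = {"d1", "d2", "d5", "d8", "d9"}   # relevant documents
U = {"d2", "d9"}                     # relevant documents already known to the user

AU = A & U                           # returned documents previously known
new = (A & R) - U                    # relevant returned documents previously unknown

coverage = len(AU) / len(U)
novelty = len(new) / (len(new) + len(AU))
print(coverage, novelty)             # 0.5 0.75
\end{minted}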
|
|
|
|
The issues surrounding interactive sessions are much more difficult to assess.
|
|
Much of the work in measuring user satisfaction comes from the field of HCI.
|
|
The usability of these systems is usually measured by monitoring user behaviour or via surveys of the user's
|
|
experience.
|
|
Another closely related area is that of information visualisation: how best to represent the retrieved data for a user, etc.
|
|
|
|
\section{Weighting Schemes}
|
|
\subsection{Re-cap}
|
|
The \textbf{vector space model} attempts to improve upon the Boolean model by removing the limitation of binary weights for index terms.
|
|
Terms can have a non-binary value both in queries \& documents.
|
|
Hence, we can represent documents \& queries as $n$-dimensional vectors:
|
|
$$
|
|
\vec{d_j} = \left( w_{1,j} , w_{2,j} , \dots , w_{n,j} \right)
|
|
$$
|
|
$$
|
|
\vec{q} = \left( w_{1,q} , w_{2,q} , \dots , w_{n,q} \right)
|
|
$$
|
|
|
|
We can calculate the similarity between a document and a query by calculating the similarity between the vector representations.
|
|
We can measure this similarity by measuring the cosine of the angle between the two vectors.
|
|
We can derive a formula for this by starting with the formula for the inner product (dot product) of two vectors:
|
|
\begin{align}
|
|
a \cdot b = |a| |b| \cos(a,b) \\
|
|
\Rightarrow
|
|
\cos(a,b) = \frac{a \cdot b}{|a| |b|}
|
|
\end{align}
|
|
|
|
We can therefore calculate the similarity between a document and a query as:
|
|
\begin{align*}
|
|
\text{sim}(\vec{d_j}, \vec{q}) = &\frac{d_j \cdot q}{|d_j| |q|} \\
|
|
\Rightarrow
|
|
\text{sim}(\vec{d_j}, \vec{q}) = &\frac{\sum^n_{i=1} w_{i,j} \times w_{i,q}}{\sqrt{\sum^n_{i=1} w_{i,j}^2} \times \sqrt{\sum^n_{i=1} w_{i,q}^2}}
|
|
\end{align*}
|
|
|
|
We need a means to calculate the term weights in the document \& query vector representations.
|
|
A term's frequency within a document quantifies how well a term describes a document.
|
|
The more frequently a term occurs in a document, the better it is at describing that document, and vice-versa.
|
|
This frequency is known as the \textbf{term frequency} or \textbf{tf factor}.
|
|
\\\\
|
|
However, if a term occurs frequently across all the documents, then that term does little to distinguish one document from another.
|
|
This factor is known as the \textbf{inverse document frequency} or \textbf{idf-frequency}.
|
|
The most commonly used weighting schemes are known as \textbf{tf-idf} weighting schemes.
|
|
For all terms in a document, the weight assigned can be calculated by:
|
|
\begin{align*}
|
|
w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}
|
|
\end{align*}
|
|
where $f_{i,j}$ is the normalised frequency of term $t_i$ in document $d_j$, $N$ is the number of documents in the collection, and $n_i$ is the number of documents that contain the term $t_i$.
|
|
\\\\
|
|
A similar weighting scheme can be used for queries.
|
|
The main difference is that the tf \& idf are given less credence, and all terms have an initial value of 0.5 which is increased or decreased according to the tf-idf across the document collection (Salton 1983).
|
|
|
|
\subsection{Text Properties}
|
|
When considering the properties of a text document, it is important to note that not all words are equally important for capturing the meaning of a document and that text documents are comprised of symbols from a finite alphabet.
|
|
\\\\
|
|
Factors that affect the performance of information retrieval include:
|
|
\begin{itemize}
|
|
\item What is the distribution of the frequency of different words?
|
|
\item How fast does vocabulary size grow with the size of a document collection?
|
|
\end{itemize}
|
|
|
|
These factors can be used to select appropriate term weights and other aspects of an IR system.
|
|
|
|
\subsubsection{Word Frequencies}
|
|
A few words are very common, e.g. the two most frequent words ``the'' \& ``of'' can together account for about 10\% of word occurrences.
|
|
Most words are very rare: around half the words in a corpus appear only once, which is known as a ``heavy tailed'' or Zipfian distribution.
|
|
\\\\
|
|
\textbf{Zipf's law} gives an approximate model for the distribution of different words in a document.
|
|
It states that when a list of measured values is sorted in decreasing order, the value of the $n^{\text{th}}$ entry is approximately inversely proportional to $n$.
|
|
For a word with rank $r$ (the numerical position of the word in a list sorted by decreasing frequency) and frequency $f$, Zipf's law states that $f \times r$ will equal a constant.
|
|
It represents a power law, i.e. a straight line on a log-log plot.
|
|
\begin{align*}
|
|
\text{word frequency} \propto \frac{1}{\text{word rank}}
|
|
\end{align*}
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\includegraphics[width=0.8\textwidth]{./images/zipfs_law_brown_corpus.png}
|
|
\caption{Zipf's Law Modelled on the Brown Corpus}
|
|
\end{figure}
|
|
|
|
As can be seen above, Zipf's law is an accurate model except at the extremes.
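A minimal sketch of checking the $f \times r$ product in Python; the tiny hand-made token stream below merely illustrates the computation (a real corpus such as the Brown corpus is needed to observe the law properly):

\begin{minted}[linenos, breaklines, frame=single]{python}
from collections import Counter

tokens = ("the of the a the of to the a of the to and the of a "
          "the and to of the a").split()

# For rank r and frequency f, Zipf's law predicts f * r to be roughly constant.
for rank, (word, freq) in enumerate(Counter(tokens).most_common(), start=1):
    print(rank, word, freq, freq * rank)
\end{minted}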
|
|
|
|
\subsection{Vocabulary Growth}
|
|
The manner in which the size of the vocabulary increases with the size of the document collection has an impact on our choice of indexing strategy \& algorithms.
|
|
However, it is important to note that the size of a vocabulary is not really bounded in the real world due to the existence of misspellings, proper names, document identifiers, etc.
|
|
\\\\
|
|
If $V$ is the size of the vocabulary and $n$ is the length of the document collection in word occurrences, then
|
|
\begin{align*}
|
|
V = K \cdot n^\beta, \quad 0 < \beta < 1
|
|
\end{align*}
|
|
where $K$ is a constant scaling factor that determines the initial vocabulary size of a small collection, usually in the range 10 to 100, and $\beta$ is a constant controlling the rate at which the vocabulary size increases, usually in the range 0.4 to 0.6.
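A minimal numeric illustration of this growth in Python, with hypothetical parameter values taken from the ranges above:

\begin{minted}[linenos, breaklines, frame=single]{python}
# Vocabulary growth V = K * n^beta; K and beta are hypothetical values
# chosen from the ranges quoted above.
K, beta = 50, 0.5

for n in (10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,}  ->  V ~= {K * n ** beta:,.0f}")
\end{minted}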
|
|
|
|
\subsection{Weighting Schemes}
|
|
The quality of performance of an IR system depends on the quality of the weighting scheme; we want to assign high weights to those terms with a high resolving power.
|
|
tf-idf is one such approach wherein weight is increased for frequently occurring terms but decreased again for those that are frequent across the collection.
|
|
The ``bag of words'' model is usually adopted, i.e., that a document can be treated as an unordered collection of words.
|
|
The term independence assumption is also usually adopted, i.e., that the occurrence of each word in a document is independent of the occurrence of other words.
|
|
|
|
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{``Bag of Words'' / Term Independence Example}]
|
|
If Document 1 contains the text ``Mary is quicker than John'' and Document 2 contains the text ``John is quicker than Mary'', then Document 1 \& Document 2 are viewed as equivalent.
|
|
\end{tcolorbox}
|
|
|
|
However, it is unlikely that 30 occurrences of a term in a document truly carries thirty times the significance of a single occurrence of that term.
|
|
A common modification is to use the logarithm of the term frequency:
|
|
\begin{align*}
|
|
\text{If } \textit{tf}_{i,d} > 0 \text{:}& \quad w_{i,d} = 1 + \log(\textit{tf}_{i,d})\\
|
|
\text{Otherwise:}& \quad w_{i,d} = 0
|
|
\end{align*}
|
|
|
|
\subsubsection{Maximum Term Normalisation}
|
|
We often want to normalise term frequencies because we observe higher frequencies in longer documents merely because longer documents tend to repeat the same words more frequently.
|
|
Consider a document $d^\prime$ created by concatenating a document $d$ to itself:
|
|
$d^\prime$ is no more relevant to any query than document $d$, yet according to the vector space type similarity $\text{sim}(d^\prime, q) \geq \text{sim}(d,q) \, \forall \, q$.
|
|
\\\\
|
|
The formula for the \textbf{maximum term normalisation} of a term $i$ in a document $d$ is usually of the form
|
|
\begin{align*}
|
|
\textit{ntf} = a + \left( 1 - a \right) \frac{\textit{tf}_{i,d}}{\textit{tf}_{\max}(d)}
|
|
\end{align*}
|
|
where $a$ is a smoothing factor which can be used to dampen the impact of the second term.
|
|
\\\\
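A minimal sketch of the sub-linear term-frequency factor and the maximum term normalisation in Python; the smoothing factor $a = 0.4$ below is a hypothetical choice, not a value prescribed above:

\begin{minted}[linenos, breaklines, frame=single]{python}
import math
from collections import Counter

def log_tf(tf):
    """Sub-linear term-frequency factor: 1 + log(tf) if tf > 0, else 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

def max_tf_norm(tokens, a=0.4):
    """Maximum term normalisation: ntf = a + (1 - a) * tf / tf_max(d)."""
    tf = Counter(tokens)
    tf_max = max(tf.values())
    return {t: a + (1 - a) * f / tf_max for t, f in tf.items()}

doc = ["retrieval", "models", "retrieval", "evaluation", "retrieval"]
print(max_tf_norm(doc))
print({t: round(log_tf(f), 3) for t, f in Counter(doc).items()})
\end{minted}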
|
|
Problems with maximum term normalisation include:
|
|
\begin{itemize}
|
|
\item Stopword removal may have effects on the distribution of terms: this normalisation is unstable and may require tuning per collection.
|
|
\item There is a possibility of outliers with unusually high frequency.
|
|
\item Those documents with a more even distribution of term frequencies should be treated differently to those with a skewed distribution.
|
|
\end{itemize}
|
|
|
|
More sophisticated forms of normalisation also exist, which we will explore in the future.
|
|
|
|
\subsubsection{Modern Weighting Schemes}
|
|
Many, if not all, of the developed or learned weighting schemes can be represented in the following format
|
|
\begin{align*}
|
|
\text{sim}(q,d) = \sum_{t \in q \cap d} \left( \textit{ntf}(D) \times \textit{gw}_t(C) \times \textit{qw}_t(Q) \right)
|
|
\end{align*}
|
|
where
|
|
\begin{itemize}
|
|
\item $\textit{ntf}(D)$ is the normalised term frequency in a document.
|
|
\item $\textit{gw}_t(C)$ is the global weight of a term across a collection.
|
|
\item $\textit{qw}_t(Q)$ is the query weight of a term in a query $Q$.
|
|
\end{itemize}
|
|
|
|
The \textbf{Okapi BM25} weighting scheme is a standard benchmark weighting scheme with relatively good performance, although it needs to be tuned per collection:
|
|
\begin{align*}
|
|
\text{BM25}(Q,D) = \sum_{t \in Q \cap D} \left( \frac{\textit{tf}_{t,D} \cdot \log \left( \frac{N - \textit{df}_t + 0.5}{\textit{df}_t + 0.5} \right) \cdot \textit{tf}_{t, Q}}{\textit{tf}_{t,D} + k_1 \cdot \left( (1-b) + b \cdot \frac{\textit{dl}}{\textit{dl}_\text{avg}} \right)} \right)
|
|
\end{align*}
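A minimal sketch of this scoring function in Python; the statistics and the parameter values $k_1 = 1.2$, $b = 0.75$ are hypothetical (typical starting points that would normally be tuned per collection):

\begin{minted}[linenos, breaklines, frame=single]{python}
import math

def bm25(query_tf, doc_tf, df, N, dl, dl_avg, k1=1.2, b=0.75):
    """Score one document against one query using the formula above."""
    score = 0.0
    for t, tf_q in query_tf.items():
        if t not in doc_tf or t not in df:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        denom = doc_tf[t] + k1 * ((1 - b) + b * dl / dl_avg)
        score += doc_tf[t] * idf * tf_q / denom
    return score

# Hypothetical statistics for one query-document pair.
print(bm25(query_tf={"information": 1, "retrieval": 1},
           doc_tf={"information": 3, "retrieval": 1, "systems": 2},
           df={"information": 120, "retrieval": 40, "systems": 300},
           N=10_000, dl=6, dl_avg=9))
\end{minted}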
|
|
|
|
The \textbf{Pivoted Normalisation} weighting scheme is also a standard benchmark which needs to be tuned per collection, although it has its own issues with normalisation:
|
|
\begin{align*}
|
|
\text{piv}(Q,D) = \sum_{t \in Q \cap D} \left( \frac{1 + \log \left( 1 + \log \left( \textit{tf}_{t, D} \right) \right)}{(1 - s) + s \cdot \frac{\textit{dl}}{\textit{dl}_\text{avg}}} \right) \times \log \left( \frac{N+1}{\textit{df}_t} \right) \times \textit{tf}_{t, Q}
|
|
\end{align*}
|
|
|
|
The \textbf{Axiomatic Approach} to weighting consists of the following constraints:
|
|
\begin{itemize}
|
|
\item \textbf{Constraint 1:} adding a query term to a document must always increase the score of that document.
|
|
\item \textbf{Constraint 2:} adding a non-query term to a document must always decrease the score of that document.
|
|
\item \textbf{Constraint 3:} adding successive occurrences of a term to a document must increase the score of that document less with each successive occurrence.
|
|
Essentially, any term-frequency factor should be sub-linear.
|
|
\item \textbf{Constraint 4:} using a vector length should be a better normalisation factor for retrieval.
|
|
However, using the vector length will violate one of the existing constraints.
|
|
Therefore, ensuring that the document length factor is used in a sub-linear function will ensure that repeated appearances of non-query terms are weighted less.
|
|
\end{itemize}
|
|
|
|
New weighting schemes that adhere to all these constraints outperform the best known benchmarks.
|
|
|
|
\section{Relevance Feedback}
|
|
We often attempt to improve the performance of an IR system by modifying the user query;
|
|
the new modified query is then re-submitted to the system.
|
|
Typically, the user examines the returned list of documents and marks those which are relevant.
|
|
The new query is usually created by incorporating new terms and re-weighting existing terms.
|
|
The feedback from the user is used to re-calculate the term weights.
|
|
Analysis of the document set can either be \textbf{local analysis} (on the returned set) or \textbf{global analysis} (on the whole document set).
|
|
This feedback allows for the re-formulation of the query, which has the advantage of shielding the user from the task of query reformulation and from the inner details of the comparison algorithm.
|
|
|
|
\subsection{Feedback in the Vector Space Model}
|
|
We assume that relevant documents have similarly weighted term vectors.
|
|
$D_r$ is the set of relevant documents returned, $D_n$ is the set of the non-relevant documents returned, and $C_r$ is the set of relevant documents in the entire collection.
|
|
If we assume that $C_r$ is known for a query $q$, then the best vector for a query to distinguish relevant documents from non-relevant documents is
|
|
\[
|
|
\vec{q} = \left( \frac{1}{\left|C_r\right|} \sum_{d_j \in C_r}d_j \right) - \left( \frac{1}{N - \left|C_r\right|} \sum_{d_j \notin C_r} d_j \right)
|
|
\]
|
|
|
|
However, it is impossible to generate this query as we do not know $C_r$.
|
|
We can, however, estimate $C_r$, as we know $D_r$, which is a subset of $C_r$; the main approach for doing this is the \textbf{Rocchio Algorithm}:
|
|
\[
|
|
\overrightarrow{q_{\text{new}}} = \alpha \overrightarrow{q_\text{original}} + \frac{\beta}{\left| D_r \right|} \sum_{d_j \in D_r} d_j - \frac{\gamma}{\left| D_n \right|} \sum_{d_j \in D_n}d_j
|
|
\]
|
|
|
|
where $\alpha$, $\beta$, \& $\gamma$ are constants which determine the importance of feedback and the relative importance of positive feedback over negative feedback.
|
|
Variants on this algorithm include:
|
|
\begin{itemize}
|
|
\item \textbf{IDE Regular:}
|
|
\[
|
|
\overrightarrow{q_\text{new}} = \alpha \overrightarrow{q_\text{old}} + \beta \sum_{d_j \in D_r} d_j - \gamma \sum_{d_j \in D_n} d_j
|
|
\]
|
|
|
|
\item \textbf{IDE Dec Hi:} (based on the assumption that positive feedback is more useful than negative feedback)
|
|
\[
|
|
\overrightarrow{q_\text{new}} = \alpha \overrightarrow{q_\text{old}} + \beta \sum_{d_j \in D_r} d_j - \gamma \text{MAXNR}(d_j)
|
|
\]
|
|
where $\text{MAXNR}(d_j)$ is the highest ranked non-relevant document.
|
|
\end{itemize}
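A minimal sketch of the Rocchio reformulation in Python over toy term-weight vectors; the constants $\alpha = 1.0$, $\beta = 0.75$, $\gamma = 0.15$ are hypothetical choices:

\begin{minted}[linenos, breaklines, frame=single]{python}
def rocchio(q, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """New query = alpha*q + beta*centroid(D_r) - gamma*centroid(D_n)."""
    n = len(q)
    centroid_r = [sum(d[i] for d in relevant) / len(relevant) for i in range(n)]
    centroid_n = [sum(d[i] for d in non_relevant) / len(non_relevant) for i in range(n)]
    return [alpha * q[i] + beta * centroid_r[i] - gamma * centroid_n[i]
            for i in range(n)]

# Hypothetical weight vectors over a 4-term lexicon.
q_original = [1.0, 0.0, 0.5, 0.0]
D_r = [[0.8, 0.1, 0.6, 0.0], [0.9, 0.0, 0.4, 0.1]]   # judged relevant
D_n = [[0.0, 0.7, 0.1, 0.9]]                          # judged non-relevant
print(rocchio(q_original, D_r, D_n))
\end{minted}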
|
|
|
|
The use of these feedback mechanisms has shown marked improvement in the precision \& recall of IR systems.
Salton indicated in early work on the vector space model that these feedback mechanisms result in improvements in average precision of at least 10\%.
|
|
\\\\
|
|
Precision \& recall are re-calculated for the new returned set, often with respect to the returned document set less the set marked by the user.
|
|
|
|
\subsection{Pseudo-Feedback / Blind Feedback}
|
|
In \textbf{local analysis}, the retrieved documents are examined at query time to determine terms for query expansion.
|
|
We typically develop some form of term-term correlation matrix to quantify the connection between pairs of terms; the query is then expanded to include terms correlated with the query terms.
|
|
|
|
\subsubsection{Association Clusters}
|
|
To create an \textbf{association cluster}, we first create a term $\times$ term matrix $M$ representing the level of association between terms.
|
|
This is usually weighted according to
|
|
\[
|
|
M_{i,j} = \frac{\text{freq}_{i,j}}{\text{freq}_{i} + \text{freq}_{j} - \text{freq}_{i,j}}
|
|
\]
|
|
|
|
To perform query expansion with local analysis, we can develop an association cluster for each term $t_i$ in the query.
|
|
For each term $t_i \in q$ choose the $i^\text{th}$ query term and select the top $N$ values from its row in the term matrix.
|
|
For a query $q$, select a cluster for each query term so that $\left| q \right|$ clusters are formed.
|
|
$N$ is usually small to prevent generation of very large queries.
|
|
We may then either take all terms or just those with the highest summed correlation.
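A minimal sketch of building the association matrix and selecting a cluster in Python, counting (as one possible convention) $\text{freq}_i$ as the number of documents containing term $i$ and $\text{freq}_{i,j}$ as the number containing both; the toy document set is hypothetical:

\begin{minted}[linenos, breaklines, frame=single]{python}
from collections import Counter
from itertools import combinations

# Toy "local" set, e.g. the top-ranked documents returned for a query.
docs = [
    ["neural", "network", "learning", "network"],
    ["network", "retrieval", "learning"],
    ["neural", "learning", "model"],
]

freq, co_freq = Counter(), Counter()
for tokens in docs:
    terms = set(tokens)
    freq.update(terms)
    for ti, tj in combinations(sorted(terms), 2):
        co_freq[(ti, tj)] += 1

def association(ti, tj):
    """M_ij = freq_ij / (freq_i + freq_j - freq_ij)."""
    f_ij = co_freq[tuple(sorted((ti, tj)))]
    return f_ij / (freq[ti] + freq[tj] - f_ij) if f_ij else 0.0

# The association cluster for one query term: its top-N correlated terms.
query_term, N = "learning", 2
scores = {t: association(query_term, t) for t in freq if t != query_term}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:N])
\end{minted}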
|
|
|
|
\subsubsection{Metric Clusters}
|
|
Association clusters do not take into account the position of terms within documents: \textbf{metric clusters} attempt to overcome this limitation.
|
|
Let $\text{dis}(t_i, t_j)$ be the distance between two terms $t_i$ \& $t_j$ in the same document.
|
|
If $t_i$ \& $t_j$ are in different documents, then $\text{dis}(t_i, t_j) = \infty$.
|
|
We can define the term-term correlation matrix by the following equation, and we can define clusters as before:
|
|
\[
|
|
M_{i,j} = \sum_{t_i, t_j \in D_i} \frac{1}{\text{dis}(t_i, t_j)}
|
|
\]
|
|
|
|
\subsubsection{Scalar Clusters}
|
|
\textbf{Scalar clusters} are based on comparing sets of words:
|
|
if two terms have similar neighbourhoods then there is a high correlation between terms.
|
|
Similarity can be based on comparing the two vectors representing the neighbourhoods.
|
|
This measure can be used to define term-term correlation matrices and the procedure can continue as before.
|
|
|
|
\subsection{Global Analysis}
|
|
\textbf{Global analysis} is based on analysis of the whole document collection and not just the returned set.
|
|
A similarity matrix is created with a similar technique to the method used in the vector space comparison.
|
|
We then index each term by the documents in which the term is contained.
|
|
It is then possible to calculate the similarity between two terms by taking some measure of the two vectors, e.g. the dot product.
|
|
To use this to expand a query, we then:
|
|
\begin{enumerate}
|
|
\item Map the query to the document-term space.
|
|
\item Calculate the similarity between the query vector and vectors associated with query terms.
|
|
\item Rank the vectors $\vec{t_i}$ based on similarity.
|
|
\item Choose the top-ranked terms to add to the query.
|
|
\end{enumerate}
|
|
|
|
\subsection{Issues with Feedback}
|
|
The Rocchio \& IDE methods can be used in all vector-based approaches.
|
|
Feedback is an implicit component of many other IR models (e.g., neural networks \& probabilistic models).
|
|
The same approaches with some modifications are used in information filtering.
|
|
Problems that exist in obtaining user feedback include:
|
|
\begin{itemize}
|
|
\item Users tend not to give a high degree of feedback.
|
|
\item Users are typically inconsistent with their feedback.
|
|
\item Explicit user feedback does not have to be strictly binary, we can allow a range of values.
|
|
\item Implicit feedback can also be used, we can make assumptions that a user found an article useful if:
|
|
\begin{itemize}
|
|
\item The user reads the article.
|
|
\item The user spends a certain amount of time reading the article.
|
|
\item The user saves or prints the article.
|
|
\end{itemize}
|
|
|
|
However, these metrics are rarely as trustworthy as explicit feedback.
|
|
\end{itemize}
|
|
|
|
\section{Collaborative Filtering}
|
|
\textbf{Content filtering} is based solely on matching content of items to user's information needs.
|
|
\textbf{Collaborative filtering} collects human judgements and matches people who share the same information needs \& tastes.
|
|
Users share their judgements \& opinions.
|
|
It echoes the ``word of mouth'' principle.
|
|
Advantages of collaborative filtering over content filtering include:
|
|
\begin{itemize}
|
|
\item Support for filtering / retrieval of items where contents cannot be easily analysed in an automated manner.
|
|
\item Ability to filter based on quality / taste.
|
|
\item Recommend items that do not contain content the user was expecting.
|
|
\end{itemize}
|
|
|
|
This approach has been successful in a number of domains -- mainly in recommending books/music/films and in e-commerce domains, e.g. Amazon, Netflix, Spotify, ebay, etc.
|
|
It has also been applied to collaborative browsing \& searching.
|
|
In fact, it can be applied whenever we have some notion of ``ratings'' or ``likes'' or ``relevance'' of items for a set of users.
|
|
\\\\
|
|
The data in collaborative filtering consists of \textbf{users} (a set of user identifiers), \textbf{items} (a set of item identifiers), \& \textbf{ratings by users of items} (numeric values in some pre-defined range).
|
|
We can usually view this as a user-by-item matrix.
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\includegraphics[width=0.8\textwidth]{./images/userbyitemmatrix.png}
|
|
\caption{User-By-Item Matrix Example (ratings from 1 to 5; 0 indicates no rating)}
|
|
\end{figure}
|
|
|
|
With \textbf{explicit ratings}, the user usually provides a single numeric value, although the user may be unwilling to supply many explicit ratings.
|
|
\textbf{Universal queries} are when a gauge set of items is presented to the user for rating.
|
|
Choosing a good gauge set is an open question.
|
|
\textbf{User-selected queries} are when the user chooses which items to rate (often leaving a sparse ratings matrix with many null values).
|
|
\\\\
|
|
\textbf{Implicit ratings} are when the user's recommendation is obtained from purchase records, web logs, time spent reading an item, etc.
|
|
This implicit rating is usually mapped to some numeric scale.
|
|
\\\\
|
|
For \textbf{user-user recommendation} approaches, there are three general steps:
|
|
\begin{enumerate}
|
|
\item \textbf{Calculate user correlation:} Find how \textit{similar} each user is to every other user.
|
|
\item \textbf{Select neighbourhood:} form groups or \textit{neighbourhoods} of users who are similar.
|
|
\item \textbf{Generate prediction:} in each group, \textit{make recommendations} based on what other users in the group have rated.
|
|
\end{enumerate}
|
|
|
|
\subsection{Step 1: Calculate User Correlation}
|
|
Some approaches for finding how similar each user is to every other user include:
|
|
\begin{itemize}
|
|
\item Pearson correlation.
|
|
\item Constrained Pearson correlation.
|
|
\item The Spearman rank correlation.
|
|
\item Vector similarity.
|
|
\end{itemize}
|
|
|
|
\subsubsection{Pearson Correlation}
|
|
\textbf{Pearson correlation} is when a weighted average of deviations from the neighbour's mean is calculated.
|
|
\[
|
|
w_{a,u} = \frac{\sum^m_{i=1} (r_{a,i} - \overline{r}_a) \times (r_{u,i} - \overline{r}_u)}
{\sqrt{\sum^m_{i=1} (r_{a,i} - \overline{r}_a)^2} \times \sqrt{\sum^m_{i=1} (r_{u,i} - \overline{r}_u)^2}}
|
|
\]
|
|
where for $m$ items:
|
|
\begin{itemize}
|
|
\item $r_{a,i}$ is the rating of a user $a$ for an item $i$.
|
|
\item $\overline{r}_a$ is the average rating given by user $a$.
|
|
\item $r_{u,i}$ is the rating of user $u$ for item $i$.
|
|
\item $\overline{r}_u$ is the average rating given by user $u$.
|
|
\end{itemize}
|
|
|
|
\subsubsection{Vector Similarity}
|
|
\textbf{Vector similarity} uses the cosine measure between the user vectors (where users are represented by a vector of ratings for items in the data set) to calculate correlation.
|
|
|
|
\subsection{Step 2: Select Neighbourhood}
|
|
Some approaches for forming groups or \textbf{neighbourhoods} of users who are similar include:
|
|
\begin{itemize}
|
|
\item \textbf{Correlation thresholding:} all neighbours with absolute correlations greater than a specified threshold are selected, say 0.7 if correlations in range 0 to 1.
|
|
\item \textbf{Best-$n$ correlations:} the best $n$ correlates are chosen.
|
|
\end{itemize}
|
|
|
|
A large neighbourhood can result in low-precision results, while a small neighbourhood can result in few or no predictions.
|
|
|
|
\subsection{Step 3: Generate Predictions}
|
|
For some user (the active user) in a group, make recommendations based on what other users in the group have rated which the active user has not rated.
|
|
Approaches for doing so include:
|
|
\begin{itemize}
|
|
\item \textbf{Compute the weighted average} of the user rating using the correlations as the weights.
|
|
This weighted average approach makes an assumption that all users rate items with approximately the same distribution.
|
|
\item \textbf{Compute the weighted mean} of all neighbours' ratings.
|
|
Rather than take the explicit numeric value of a rating, a rating's strength is interpreted as its distance from a neighbour's mean rating.
|
|
This approach attempts to account for the lack of uniformity in ratings.
|
|
|
|
\[
|
|
P_{a,i} = \bar{r}_a + \frac{\sum^n_{u=1} (r_{u,i} - \bar{r}_u) \times w_{a,u}}{\sum^n_{u=1} w_{a,u}}
|
|
\]
|
|
where for $n$ neighbours:
|
|
\begin{itemize}
|
|
\item $\bar{r}_a$ is the average rating given by active user $a$.
|
|
\item $r_{u,i}$ is the rating of user $u$ for item $i$.
|
|
\item $w_{a,u}$ is the similarity between user $u$ and $a$.
|
|
\end{itemize}
|
|
\end{itemize}
|
|
|
|
Note that the Pearson Correlation formula does not explicitly take into account the number of co-rated items by users.
|
|
Thus it is possible to get a high correlation value based on only one co-rated item.
|
|
Often, the Pearson Correlation formula is adjusted to take this into account.
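A minimal sketch of steps 1 \& 3 in Python over a hypothetical ratings dictionary; absolute correlations are used in the denominator of the prediction (a common robustness tweak, not stated above):

\begin{minted}[linenos, breaklines, frame=single]{python}
import math

# Hypothetical user-by-item ratings (missing entries mean "no rating").
ratings = {
    "alice": {"i1": 5, "i2": 3, "i3": 4},
    "bob":   {"i1": 4, "i2": 2, "i3": 5, "i4": 4},
    "carol": {"i1": 2, "i2": 5, "i4": 1},
}

def mean(user):
    vals = ratings[user].values()
    return sum(vals) / len(vals)

def pearson(a, u):
    """Pearson correlation over the items co-rated by users a and u."""
    common = ratings[a].keys() & ratings[u].keys()
    if not common:
        return 0.0
    ra, ru = mean(a), mean(u)
    num = sum((ratings[a][i] - ra) * (ratings[u][i] - ru) for i in common)
    den = (math.sqrt(sum((ratings[a][i] - ra) ** 2 for i in common))
           * math.sqrt(sum((ratings[u][i] - ru) ** 2 for i in common)))
    return num / den if den else 0.0

def predict(a, item, neighbours):
    """Weighted mean of neighbours' deviations from their own mean rating."""
    weights = [(u, pearson(a, u)) for u in neighbours if item in ratings[u]]
    norm = sum(abs(w) for _, w in weights)
    if norm == 0:
        return mean(a)
    return mean(a) + sum(w * (ratings[u][item] - mean(u)) for u, w in weights) / norm

print(round(predict("alice", "i4", ["bob", "carol"]), 2))
\end{minted}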
|
|
|
|
\subsection{Experimental Approach for Testing}
|
|
A known collection of ratings by users over a range of items is decomposed into two disjoint subsets.
|
|
The first set (usually the larger) is used to generate recommendations for items corresponding to those in the smaller set.
|
|
These recommendations are then compared to the actual ratings in the second subset.
|
|
The accuracy \& coverage of a system can thus be ascertained.
|
|
|
|
\subsubsection{Metrics}
|
|
The main metrics used to test the predictions produced are:
|
|
\begin{itemize}
|
|
\item \textbf{Coverage:} a measure of the ability of the system to provide a recommendation on a given item.
|
|
\item \textbf{Accuracy:} a measure of the correctness of the recommendations generated by the system.
|
|
\end{itemize}
|
|
|
|
\textbf{Statistical accuracy metrics} are usually calculated by comparing the ratings generated by the system to user-provided ratings.
|
|
The accuracy is usually presented as the mean absolute error (\textbf{MAE}) between ratings \& predictions.
|
|
\\\\
|
|
Typically, the value of the rating is not that important: it is more important to know if the rating is a useful or a non-useful rating.
|
|
\textbf{Decision support accuracy metrics} measure whether the recommendation is actually useful to the user.
|
|
Many other approaches also exist, including:
|
|
\begin{itemize}
|
|
\item Machine learning approaches.
|
|
\begin{itemize}
|
|
\item Bayesian models.
|
|
\item Clustering models.
|
|
\end{itemize}
|
|
\item Models of how people rate items.
|
|
\item Data mining approaches.
|
|
\item Hybrid models which combine collaborative filtering with content filtering.
|
|
\item Graph decomposition approaches.
|
|
\end{itemize}
|
|
|
|
\subsection{Collaborative Filtering Issues}
|
|
\begin{itemize}
|
|
\item \textbf{Sparsity of Matrix:} in a typical domain, there would be many users \& many items but any user would only have rated a small fraction of all items in the dataset.
|
|
Using a technique such as \textbf{Singular Value Decomposition (SVD)}, the data space can be reduced, and due to this reduction a correlation may be found between similar users who do not have overlapping ratings in the original matrix of ratings.
|
|
|
|
\item \textbf{Size of Matrix:} in general, the matrix is very large, which can affect computational efficiency.
|
|
SVD has been used to improve scalability by dimensionality reduction.
|
|
|
|
\item \textbf{Noise in Matrix:} we need to consider how a user's ratings for items might change over time and how to model such time dependencies; are all ratings honest \& reliable?
|
|
|
|
\item \textbf{Size of Neighbourhood:} while the size of the neighbourhood affects predictions, there is no way to know the ``right'' size.
|
|
We need to consider whether visualisation of the neighbourhood would help, and whether summarisation of the main themes/features of neighbourhoods would help.
|
|
|
|
\item \textbf{How to Gather Ratings:} for new users or new items, perhaps use a weighted average of the global mean \& the user or item mean.
|
|
What if the user is not similar to others?
|
|
\end{itemize}
|
|
|
|
\subsection{Combining Content \& Collaborative Filtering}
|
|
For most items rated in a collaborative filtering domain, content information is also available:
|
|
\begin{itemize}
|
|
\item Books: author, genre, plot summary, language, etc.
|
|
\item Music: artist, genre, sound samples, etc.
|
|
\item Films: director, genre, actors, year, country, etc.
|
|
\end{itemize}
|
|
|
|
Traditionally, content is not used in collaborative filtering, although it could be.
|
|
\\\\
|
|
Different approaches may suffer from different problems, so we can consider combining multiple approaches.
|
|
We can also view collaborative filtering as a machine learning classification problem: for an item, do we classify it as relevant to a user or not?
|
|
\\\\
|
|
Much recent work has been focused on not only giving a recommendation, but also attempting to explain the recommendation to the user.
|
|
Questions arise in how best to ``explain'' or visualise the recommendation.
|
|
|
|
\section{Learning in Information Retrieval}
|
|
Many real-world problems are complex and it is difficult to specify (algorithmically) how to solve many of these problems.
|
|
Learning techniques are used in many domains to find solutions to problems that may not be obvious or clear to human users.
|
|
In general, machine learning involves searching a large space of potential hypotheses or solutions to find the hypothesis or solution that best \textit{explains} or \textit{fits} a set of data and any prior knowledge; a system can be said to learn if its performance on the task improves.
|
|
\\\\
|
|
Machine learning techniques require a training stage before the learned solution can be used on new previously unseen data.
|
|
The training stage consists of a data set of examples which can either be:
|
|
\begin{itemize}
|
|
\item \textbf{Labelled} (supervised learning).
|
|
\item \textbf{Unlabelled} (unsupervised learning).
|
|
\end{itemize}
|
|
|
|
An additional data set must also be used to test the hypothesis/solution.
|
|
\\\\
|
|
\textbf{Symbolic knowledge} is represented in the form of the symbolic descriptions of the learned concepts, e.g., production rules or concept hierarchies.
|
|
\textbf{Sub-symbolic knowledge} is represented in sub-symbolic form not readable by a user, e.g., in the structure, weights, \& biases of the trained network.
|
|
|
|
\subsection{Genetic Algorithms}
|
|
\textbf{Genetic algorithms} are inspired by the Darwinian theory of evolution:
|
|
at each step of the algorithm, the best solutions are selected while the weaker solutions are discarded.
|
|
It uses operators based on crossover \& mutation as the basis of the algorithm to sample the space of solutions.
|
|
The steps of a genetic algorithm are as follows: first, create a random population.
|
|
Then, while a solution has not been found:
|
|
\begin{enumerate}
|
|
\item Calculate the fitness of each individual.
|
|
\item Select the population for reproduction:
|
|
\begin{enumerate}[label=\roman*.]
|
|
\item Perform crossover.
|
|
\item Perform mutation.
|
|
\end{enumerate}
|
|
\item Repeat.
|
|
\end{enumerate}
|
|
|
|
\tikzstyle{process} = [rectangle, minimum width=2cm, minimum height=1cm, text centered, draw=black]
|
|
\tikzstyle{arrow} = [thick,->,>=stealth]
|
|
% \usetikzlibrary{patterns}
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\begin{tikzpicture}[node distance=2cm]
|
|
\node (reproduction) [process] at (0, 2.5) {Reproduction, Crossover, Mutation};
|
|
\node (population) [process] at (-2.5, 0) {population};
|
|
\node (fitness) [process] at (0, -2.5) {Calculate Fitness};
|
|
\node (select) [process] at (2.5, 0) {Select Population};
|
|
|
|
\draw [arrow] (population) -- (fitness);
|
|
\draw [arrow] (fitness) -- (select);
|
|
\draw [arrow] (select) -- (reproduction);
|
|
\draw [arrow] (reproduction) -- (population);
|
|
\end{tikzpicture}
|
|
\caption{Genetic Algorithm Steps}
|
|
\end{figure}
|
|
|
|
Traditionally, solutions are represented in binary.
|
|
A \textbf{genotype} is the encoding or representation of a candidate solution, while a \textbf{phenotype} is the decoding or manifestation of that genotype.
|
|
We need an evaluation function which will discriminate between better and worse solutions.
|
|
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Crossover Examples}]
|
|
Example of one-point crossover:
|
|
\texttt{11001\underline{011}} and \texttt{11011\underline{111}} gives \texttt{11001\underline{111}} and \texttt{11011\underline{011}}.
|
|
\\\\
|
|
Example of $n$-point crossover: \texttt{\underline{110}110\underline{11}0} and \texttt{0001001000} gives \texttt{\underline{110}100\underline{11}00} and \texttt{000\underline{110}10\underline{01}}.
|
|
\end{tcolorbox}
|
|
|
|
\textbf{Mutation} occurs in the genetic algorithm at a much lower rate than crossover.
|
|
Mutation is important because it adds some diversity to the population in the hope that new, better solutions are discovered, and it therefore aids in the evolution of the population.
|
|
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Mutation Example}]
|
|
Example of mutation: \texttt{1\underline{1}001001} $\rightarrow$ \texttt{1\underline{0}001001}.
|
|
\end{tcolorbox}
|
|
|
|
There are two types of selection:
|
|
\begin{itemize}
|
|
\item \textbf{Roulette wheel selection:} each sector in the wheel is proportional to an individual's fitness.
|
|
Select $n$ individuals by means of $n$ roulette turns.
|
|
Each individual is drawn independently.
|
|
\item \textbf{Tournament selection:} a number of individuals are selected at random with replacement from the population.
|
|
The individual with the best score is selected.
|
|
This is repeated $n$ times.
|
|
\end{itemize}
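A minimal sketch of the full loop in Python, combining tournament selection, one-point crossover, \& bitwise mutation on a hypothetical toy problem (maximising the number of 1-bits in the genotype); all parameter settings are illustrative:

\begin{minted}[linenos, breaklines, frame=single]{python}
import random

GENOME_LEN, POP_SIZE, GENERATIONS = 20, 30, 40
P_CROSSOVER, P_MUTATION = 0.9, 0.01

def fitness(genome):
    return sum(genome)                           # toy "one-max" fitness

def tournament(population, k=3):
    return max(random.sample(population, k), key=fitness)

def crossover(a, b):
    point = random.randint(1, GENOME_LEN - 1)    # one-point crossover
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(genome):
    return [1 - g if random.random() < P_MUTATION else g for g in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    next_gen = []
    while len(next_gen) < POP_SIZE:
        p1, p2 = tournament(population), tournament(population)
        c1, c2 = crossover(p1, p2) if random.random() < P_CROSSOVER else (p1[:], p2[:])
        next_gen += [mutate(c1), mutate(c2)]
    population = next_gen[:POP_SIZE]

print(max(fitness(g) for g in population))
\end{minted}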
|
|
|
|
Issues with genetic algorithms include:
|
|
\begin{itemize}
|
|
\item Choice of representation for encoding individuals.
|
|
\item Definition of fitness function.
|
|
\item Definition of selection scheme.
|
|
\item Definition of suitable genetic operators.
|
|
\item Setting of parameters:
|
|
\begin{itemize}
|
|
\item Size of population.
|
|
\item Number of generations.
|
|
\item Probability of crossover.
|
|
\item Probability of mutation.
|
|
\end{itemize}
|
|
\end{itemize}
|
|
|
|
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Case Study 1: Application of Genetic Algorithms to IR}]
|
|
The effectiveness of an IR system is dependent on the quality of the weights assigned to terms in documents.
|
|
We have seen heuristic-based approaches \& their effectiveness and we've seen axiomatic approaches that could be considered.
|
|
\\\\
|
|
Why not learn the weights?
|
|
We have a definition of relevant \& non-relevant documents; we can use MAP or precision@$k$ as fitness.
|
|
Each genotype can be a set of vectors of length $N$ (the size of the lexicon).
|
|
Set all weights randomly initially.
Run the system with a set of queries to obtain fitness; select good chromosomes; crossover; mutate.
This effectively searches the landscape of weights for those that give a good ranking.
|
|
\end{tcolorbox}
|
|
|
|
|
|
\subsection{Genetic Programming}
|
|
\textbf{Genetic programming} applies the approach of the genetic algorithm to the space of possible computer programs.
|
|
``Virtually all problems in artificial intelligence, machine learning, adaptive systems, \& automated learning can be recast as a search for a computer program.
|
|
Genetic programming provides a way to successfully conduct the search for a computer program in the space of computer programs.'' -- Koza.
|
|
\\\\
|
|
A random population of solutions is created which are modelled in a tree structure with operators as internal nodes and operands as leaf nodes.
|
|
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\usetikzlibrary{trees}
|
|
\begin{tikzpicture}
|
|
[
|
|
every node/.style = {draw, shape=rectangle, align=center},
|
|
level distance = 1.5cm,
|
|
sibling distance = 1.5cm,
|
|
edge from parent/.style={draw,-latex}
|
|
]
|
|
\node {+}
|
|
child { node {1} }
|
|
child { node {2} }
|
|
child { node {\textsc{if}}
|
|
child { node {>}
|
|
child { node {\textsc{time}} }
|
|
child { node {10} }
|
|
}
|
|
child { node {3} }
|
|
child { node {4} }
|
|
};
|
|
\end{tikzpicture}
|
|
\caption{\texttt{(+ 1 2 (IF (> TIME 10) 3 4))}}
|
|
\end{figure}
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\includegraphics[width=0.4\textwidth]{./images/crossover.png}
|
|
\caption{Crossover Example}
|
|
\end{figure}
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\includegraphics[width=0.4\textwidth]{./images/mutation.png}
|
|
\caption{Mutation Example}
|
|
\end{figure}
|
|
|
|
The genetic programming flow is as follows:
|
|
\begin{enumerate}
|
|
\item Trees are (usually) created at random.
|
|
\item Evaluate how each tree performs in its environment (using a fitness function).
|
|
\item Selection occurs based on fitness (tournament selection).
|
|
\item Crossover of selected solutions to create new individuals.
|
|
\item Repeat until population is replaced.
|
|
\item Repeat for $N$ generations.
|
|
\end{enumerate}
|
|
|
|
\subsubsection{Anatomy of a Term-Weighting Scheme}
|
|
Typical components of term weighting schemes include:
|
|
\begin{itemize}
|
|
\item Term frequency aspect.
|
|
\item ``Inverse document'' score.
|
|
\item Normalisation factor.
|
|
\end{itemize}
|
|
|
|
The search space should be decomposed accordingly.
|
|
|
|
\subsubsection{Why Separate Learning into Stages?}
|
|
The search space using primitive measures \& functions is extremely large;
|
|
reducing the search space is advantageous as efficiency is increased.
|
|
It eases the analysis of the solutions produced at each stage.
|
|
Comparisons to existing benchmarks at each of these stages can be used to determine if the GP is finding novel solutions or variations on existing solutions.
|
|
It can then be identified from where any improvement in performance is coming.
|
|
|
|
\subsubsection{Learning Each of the Three Parts in Turn}
|
|
\begin{enumerate}
|
|
\item Learn a term-discrimination scheme (i.e., some type of idf) using primitive global measures.
|
|
\begin{itemize}
|
|
\item 8 terminals \& 8 functions.
|
|
\item $T = \{\textit{df}, \textit{cf}, N, V, C, 1, 10, 0.5\}$.
|
|
\item $F = \{+, \times, \div, -, \text{square}(), \text{sqrt}(), \text{ln}(), \text{exp}()\}$.
|
|
\end{itemize}
|
|
|
|
\item Use this global measure and learn a term-frequency aspect.
|
|
\begin{itemize}
|
|
\item 4 terminals \& 8 functions.
|
|
\item $T = \{\textit{tf}, 1, 10, 0.4\}$.
|
|
\item $F = \{+, \times, \div, -, \text{square}(), \text{sqrt}(), \text{ln}(), \text{exp}()\}$.
|
|
\end{itemize}
|
|
|
|
\item Finally, learn a normalisation scheme.
|
|
\begin{itemize}
|
|
\item 6 terminals \& 8 functions.
|
|
\item $T = \{ \text{dl}, \text{dl}_{\text{avg}}, \text{dl}_\text{dev}, 1, 10, 0.5 \}$.
|
|
\item $F = \{ +, \times, \div, -, \text{square}(), \text{sqrt}(), \text{ln}(), \text{exp}() \}$.
|
|
\end{itemize}
|
|
\end{enumerate}
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\includegraphics[width=0.6\textwidth]{./images/threestages.png}
|
|
\caption{Learning Each of the Three Stages in Turn}
|
|
\end{figure}
|
|
|
|
\subsubsection{Details of the Learning Approach}
|
|
\begin{itemize}
|
|
\item 7 global functions were developed on approximately 32,000 OHSUMED documents.
|
|
\begin{itemize}
|
|
\item All validated on a larger unseen collection and the best function taken.
|
|
\item Random population of 100 for 50 generations.
|
|
\item The fitness function used was MAP.
|
|
\end{itemize}
|
|
|
|
\item 7 tf functions were developed on approximately 32,000 LATIMES documents.
|
|
\begin{itemize}
|
|
\item All validated on a larger unseen collection and the best function taken.
|
|
\item Random population of 200 for 25 generations.
|
|
\item The fitness function used was MAP.
|
|
\end{itemize}
|
|
|
|
\item 7 normalisation functions were developed on 3 sets of approximately 10,000 LATIMES documents.
|
|
\begin{itemize}
|
|
\item All validated on a larger unseen collection and the best function taken.
|
|
\item Random population of 200 for 25 generations.
|
|
\item Fitness function used was average MAP over the 3 collections.
|
|
\end{itemize}
|
|
\end{itemize}
|
|
|
|
\subsubsection{Analysis}
|
|
The global function $w_3$ always produces a positive number:
|
|
\[
|
|
w_3 = \sqrt{\frac{\textit{cf}^3_t \cdot N}{\textit{df}^4_t}}
|
|
\]
|
|
|
|
|
|
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Case Study 2: Application of Genetic Programming to IR}]
|
|
Evolutionary computing approaches include:
|
|
\begin{itemize}
|
|
\item Evolutionary strategies.
|
|
\item Genetic algorithms.
|
|
\item Genetic programming.
|
|
\end{itemize}
|
|
|
|
Why genetic programming for IR?
|
|
\begin{itemize}
|
|
\item Produces a symbolic representation of a solution which is useful for further analysis.
|
|
\item Using training data, MAP can be directly optimised (i.e., used as the fitness function).
|
|
\item Solutions produced are often generalisable as solution length (size) can be controlled.
|
|
\end{itemize}
|
|
\end{tcolorbox}
|
|
|
|
|
|
\end{document}
|