%! TeX program = lualatex
\documentclass[a4paper,11pt]{article}

% packages
\usepackage{censor}
\StopCensoring
\usepackage{fontspec}
\setmainfont{EB Garamond}

% for tironian et fallback
% % \directlua{luaotfload.add_fallback
% %   ("emojifallback",
% %     {"Noto Serif:mode=harf"}
% %   )}
% % \setmainfont{EB Garamond}[RawFeature={fallback=emojifallback}]

\setmonofont[Scale=MatchLowercase]{DejaVu Sans Mono}
\usepackage[a4paper,left=2cm,right=2cm,top=\dimexpr15mm+1.5\baselineskip,bottom=2cm]{geometry}
\setlength{\parindent}{0pt}

\usepackage{fancyhdr} % Headers and footers
\fancyhead[R]{\normalfont \leftmark}
\fancyhead[L]{}
\pagestyle{fancy}

\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage{amsmath}
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{xcolor}
\definecolor{linkblue}{RGB}{0, 64, 128}
\usepackage[final, colorlinks = false, urlcolor = linkblue]{hyperref}
% \newcommand{\secref}[1]{\textbf{§~\nameref{#1}}}
\newcommand{\secref}[1]{\textbf{§\ref{#1}~\nameref{#1}}}

\usepackage{changepage} % adjust margins on the fly
\usepackage{minted}
\usemintedstyle{algol_nu}

\usepackage{pgfplots}
\pgfplotsset{width=\textwidth,compat=1.9}

\usepackage{caption}
\newenvironment{code}{\captionsetup{type=listing}}{}
\captionsetup[listing]{skip=0pt}
\setlength{\abovecaptionskip}{5pt}
\setlength{\belowcaptionskip}{5pt}

\usepackage[yyyymmdd]{datetime}
\renewcommand{\dateseparator}{--}
\usepackage{enumitem}
\usepackage{titlesec}

\author{Andrew Hayes}

\begin{document}
\begin{titlepage}
    \begin{center}
        \hrule
        \vspace*{0.6cm}
        \censor{\huge \textbf{CT4100}}
        \vspace*{0.6cm}
        \hrule
        \LARGE
        \vspace{0.5cm}
        Information Retrieval
        \vspace{0.5cm}
        \hrule

        \vfill
        \vfill

        \hrule
        \begin{minipage}{0.495\textwidth}
            \vspace{0.4em}
            \raggedright
            \normalsize
            Name: Andrew Hayes \\
            E-mail: \href{mailto://a.hayes18@universityofgalway.ie}{\texttt{a.hayes18@universityofgalway.ie}} \hfill\\
            Student ID: 21321503 \hfill
        \end{minipage}
        \begin{minipage}{0.495\textwidth}
            \raggedleft
            \vspace*{0.8cm}
            \Large
            \today
            \vspace*{0.6cm}
        \end{minipage}
        \medskip\hrule
    \end{center}
\end{titlepage}

\pagenumbering{roman}
\newpage
\tableofcontents
\newpage
\setcounter{page}{1}
\pagenumbering{arabic}

\section{Introduction}
\subsection{Lecturer Contact Details}
\begin{itemize}
    \item Colm O'Riordan.
    \item \href{mailto://colm.oriordan@universityofgalway.ie}{\texttt{colm.oriordan@universityofgalway.ie}}.
\end{itemize}

\subsection{Motivations}
\begin{itemize}
    \item To study/analyse techniques to deal suitably with large amounts (\& types) of information.
    \item Emphasis on research \& practice in Information Retrieval.
\end{itemize}

\subsection{Related Fields}
\begin{itemize}
    \item Artificial Intelligence.
    \item Database \& Information Systems.
    \item Algorithms.
    \item Human-Computer Interaction.
\end{itemize}

\subsection{Recommended Texts}
\begin{itemize}
    \item \textit{Modern Information Retrieval} -- Ribeiro-Neto \& Baeza-Yates (several copies in library).
    \item \textit{Information Retrieval} -- Grossman.
    \item \textit{Introduction to Information Retrieval} -- Christopher Manning.
    \item Extra resources such as research papers will be recommended as extra reading.
\end{itemize}

\subsection{Grading}
\begin{itemize}
    \item Exam: 70\%.
    \item Assignment 1: 30\%.
    \item Assignment 2: 30\%.
\end{itemize}

There will be exercise sheets posted for most lectures; these are not mandatory and are intended as a study aid.
\subsection{Introduction to Information Retrieval}
\textbf{Information Retrieval (IR)} deals with identifying relevant information based on users' information needs, e.g. web search engines, digital libraries, \& recommender systems.
It is finding material (usually documents) of an unstructured nature that satisfies an information need within large collections (usually stored on computers).

\section{Information Retrieval Models}
\subsection{Introduction to Information Retrieval Models}
\textbf{Data collections} are well-structured collections of related items; items are usually atomic with a well-defined interpretation.
Data retrieval involves the selection of a fixed set of data based on a well-defined query (e.g., SQL, OQL).
\\\\
\textbf{Information collections} are usually semi-structured or unstructured.
Information Retrieval (IR) involves the retrieval of natural-language documents, which are typically unstructured and may be semantically ambiguous.

\subsubsection{Information Retrieval vs Information Filtering}
The main differences between information retrieval \& information filtering are:
\begin{itemize}
    \item The nature of the information need.
    \item The nature of the document set.
\end{itemize}

Other than these two differences, the same models are used: documents \& queries are represented using the same set of techniques, and similar comparison algorithms are also used.

\subsubsection{User Role}
In traditional IR, the user role was reasonably well-defined in that a user:
\begin{itemize}
    \item Formulated a query.
    \item Viewed the results.
    \item Potentially offered feedback.
    \item Potentially reformulated their query and repeated the above steps.
\end{itemize}

In more recent systems, with the increasing popularity of the hypertext paradigm, users usually intersperse browsing with traditional querying.
This raises many new difficulties \& challenges.

\subsection{Pre-Processing}
\textbf{Document pre-processing} is the application of a set of well-known techniques to the documents \& queries prior to any comparison.
This includes, among others:
\begin{itemize}
    \item \textbf{Stemming:} the reduction of words to a potentially common root.
    The most common stemming algorithms are the Lovins \& Porter algorithms.
    E.g. \textit{computerisation}, \textit{computing}, \& \textit{computers} could all be stemmed to the common form \textit{comput}.
    \item \textbf{Stop-word removal:} the removal of very frequent terms from documents, which add little to the semantics of the document.
    \item \textbf{Thesaurus construction:} the manual or automatic creation of thesauri used to try to identify synonyms within the documents.
\end{itemize}
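
The first two of these techniques can be sketched in a few lines of code.
The following is a minimal illustration, assuming the third-party \texttt{nltk} library (which provides an implementation of Porter's algorithm) is installed; the stop-word list here is a small illustrative subset rather than a complete one.

\begin{minted}[linenos, breaklines, frame=single]{python}
from nltk.stem import PorterStemmer

# Small illustrative subset of a stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}

def preprocess(text: str) -> list[str]:
    """Tokenise, remove stop-words, and stem the remaining terms."""
    stemmer = PorterStemmer()
    tokens = text.lower().split()
    return [stemmer.stem(token) for token in tokens if token not in STOP_WORDS]

# The stop-words are dropped and the remaining terms are reduced towards a
# common root (the exact stems depend on the stemmer variant used).
print(preprocess("the computerisation of computing"))
\end{minted}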
\textbf{Representation} \& comparison techniques depend on the information retrieval model chosen; the choice of feedback techniques is likewise dependent on the model chosen.

\subsection{Models}
Retrieval models can be broadly categorised as:
\begin{itemize}
    \item Boolean:
    \begin{itemize}
        \item Classical Boolean.
        \item Fuzzy Set approach.
        \item Extended Boolean.
    \end{itemize}
    \item Vector:
    \begin{itemize}
        \item Vector Space approach.
        \item Latent Semantic Indexing.
        \item Neural Networks.
    \end{itemize}
    \item Probabilistic:
    \begin{itemize}
        \item Inference Network.
        \item Belief Network.
    \end{itemize}
\end{itemize}

We can view any IR model as being composed of:
\begin{itemize}
    \item $D$: the set of logical representations of the documents.
    \item $Q$: the set of logical representations of the user information needs (queries).
    \item $F$: a framework for modelling the representations ($D$ \& $Q$) and the relationship between $D$ \& $Q$.
    \item $R$: a ranking function which defines an ordering among the documents with regard to any query $q$.
\end{itemize}

We have a set of index terms:
$$ t_1, \dots, t_n $$
A \textbf{weight} $w_{i,j}$ is assigned to each term $t_i$ occurring in document $d_j$.
We can view a document or query as a vector of weights:
$$ \vec{d_j} = (w_{1,j}, w_{2,j}, w_{3,j}, \dots) $$

\subsection{Boolean Model}
The \textbf{Boolean model} of information retrieval is based on set theory \& Boolean algebra.
A query is viewed as a Boolean expression.
The model also assumes that terms are either present or absent, hence term weights are binary \& discrete, i.e., $w_{i,j} \in \{0, 1\}$.
\\\\
Advantages of the Boolean model include:
\begin{itemize}
    \item Clean formalism.
    \item Widespread \& popular.
    \item Relatively simple.
\end{itemize}

Disadvantages of the Boolean model include:
\begin{itemize}
    \item People often have difficulty formulating Boolean expressions, which makes the model somewhat difficult to use.
    \item Documents are considered either relevant or irrelevant; no partial matching is allowed.
    \item Poor performance.
    \item Suffers badly from natural-language effects such as synonymy.
    \item No ranking of results.
    \item Terms in a document are considered independent of each other.
\end{itemize}

\subsubsection{Example}
$$ q = t_1 \land (t_2 \lor (\neg t_3)) $$
\begin{minted}[linenos, breaklines, frame=single]{sql}
q = t1 AND (t2 OR (NOT t3))
\end{minted}

This can be mapped to what is termed \textbf{disjunctive normal form}, where we have a series of disjunctions (logical ORs) of conjunctions:
$$ q = 100 \lor 110 \lor 111 $$
where each binary triple gives the presence (1) or absence (0) of $(t_1, t_2, t_3)$; e.g., $110$ denotes a document containing $t_1$ \& $t_2$ but not $t_3$.
If a document satisfies any of the components, the document is deemed relevant and returned.

\subsection{Vector Space Model}
The \textbf{vector space model} attempts to improve upon the Boolean model by removing the limitation of binary weights for index terms.
Terms can have non-binary weights in both queries \& documents.
Hence, we can represent the documents \& the query as $n$-dimensional vectors:
$$ \vec{d_j} = (w_{1,j}, w_{2,j}, \dots, w_{n,j}) $$
$$ \vec{q} = (w_{1,q}, w_{2,q}, \dots, w_{n,q}) $$

We can calculate the similarity between a document \& a query by measuring the cosine of the angle between their vector representations:
$$ \vec{a} \cdot \vec{b} = | \vec{a} | \, | \vec{b} | \cos(\vec{a}, \vec{b}) $$
$$ \Rightarrow \cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{| \vec{a} | \, | \vec{b} |} $$

We can therefore calculate the similarity between a document and a query as:
$$ \text{sim}(q, d) = \cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{| \vec{q} | \, | \vec{d} |} $$

Considering term weights on the query and documents, we can calculate the similarity between the document \& query as:
$$
\text{sim}(q, d) =
\frac{\sum^{n}_{i=1} (w_{i,q} \times w_{i,d})}
{\sqrt{\sum^{n}_{i=1} (w_{i,q})^2} \times \sqrt{\sum^{n}_{i=1} (w_{i,d})^2}}
$$

Advantages of the vector space model over the Boolean model include:
\begin{itemize}
    \item Improved performance due to weighting schemes.
    \item Partial matching is allowed, which gives a natural ranking.
\end{itemize}

The primary disadvantage of the vector space model is that terms are considered to be mutually independent.
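
As an illustration of the similarity calculation above, the following sketch computes $\text{sim}(q, d)$ directly from two weight vectors using only the standard library; the weight vectors themselves are invented for the example.

\begin{minted}[linenos, breaklines, frame=single]{python}
import math

def cosine_similarity(q: list[float], d: list[float]) -> float:
    """sim(q, d) = (q . d) / (|q| |d|)"""
    dot = sum(w_q * w_d for w_q, w_d in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        # A zero vector has no direction; treat its similarity as 0.
        return 0.0
    return dot / (norm_q * norm_d)

# Invented example: weight vectors over three index terms.
query = [0.0, 1.0, 1.0]
document = [1.0, 1.0, 0.0]
print(cosine_similarity(query, document))  # 0.5
\end{minted}

Because term weights are non-negative, the resulting similarity lies in $[0, 1]$, which is what allows it to be used directly as a ranking score.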
\subsubsection{Weighting Schemes}
We need a means of calculating the term weights in the document and query vector representations.
A term's frequency within a document quantifies how well the term describes that document: the more frequently a term occurs in a document, the better it is at describing that document, and vice-versa.
This frequency is known as the \textbf{term frequency} or \textbf{tf factor}.
\\\\
If a term occurs frequently across all the documents, that term does little to distinguish one document from another.
This factor is known as the \textbf{inverse document frequency} or \textbf{idf factor}.
Traditionally, the most commonly-used weighting schemes are known as \textbf{tf-idf} weighting schemes.
\\\\
For all terms in a document, the weight assigned can be calculated as:
$$ w_{i,j} = f_{i,j} \times \log \left( \frac{N}{N_i} \right) $$
where:
\begin{itemize}
    \item $f_{i,j}$ is the (possibly normalised) frequency of term $t_i$ in document $d_j$.
    \item $N$ is the number of documents in the collection.
    \item $N_i$ is the number of documents that contain term $t_i$.
\end{itemize}
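
As a worked example (with invented numbers): in a collection of $N = 1000$ documents, suppose term $t_i$ occurs $f_{i,j} = 5$ times in document $d_j$ and appears in $N_i = 10$ documents overall.
Using base-10 logarithms:
$$ w_{i,j} = 5 \times \log_{10} \left( \frac{1000}{10} \right) = 5 \times 2 = 10 $$
Note that a term occurring in every document ($N_i = N$) receives a weight of $\log(1) = 0$ regardless of how often it occurs within the document, which is exactly the behaviour the idf factor is intended to capture.

\end{document}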