[CT4100]: Week 11 lecture notes + slides

\section{Web Search}
In classical IR, the collection is relatively static.
The goal is to retrieve documents with content that is relevant to the user's information need.
Classic measures of relevance tend to ignore both \textit{context} \& \textit{individuals}.
In web search, the corpus contains both static \& dynamic information.
The goal is to retrieve high-quality results that are relevant to the current need.
The user's information need may be:
\begin{itemize}
\item Informational;
\item Navigational;
\item Transactional.
\end{itemize}
Newer problems also emerge with respect to web search:
\begin{itemize}
\item Distributed data;
\item Volatile data;
\item Large volumes, scaling issues;
\item Redundancy (circa 30--40\% of documents are (near-)duplicates);
\item Quality (hundreds of millions of pages of spam);
\item Diversity (many languages, encodings);
\item Complex graph structure / topology.
\end{itemize}
Web search users:
\begin{itemize}
\item Tend to make ill-defined queries that are short, low-effort, contain imprecise terms, and have sub-optimal syntax;
\item Have a wide variance in terms of needs, expectations, knowledge, \& bandwidth;
\item Have specific behaviour: circa 85\% of people do not look at the second screen of results and circa 80\% of people do not modify the query and instead just follow links.
\end{itemize}
\subsection{Evolution of Web Search}
\begin{itemize}
\item First generation: use only ``on-page'' information such as word frequency etc.
\item Second generation: link analysis, click-through data.
\item Third generation: semantic analysis, integrate multiple sources of information, context analysis (spatial, query stream, personal profiling), aiding the user (re-spelling, query refinement, query suggestion), representation of results / query / collection.
\end{itemize}
\subsection{Anatomy of a Search Engine}
\begin{itemize}
\item \textbf{Spider} (robot/crawler): builds the corpus by recursively following links.
\item \textbf{The indexer:} processes the data and indexes it (fully inverted list); a toy sketch of such an index follows this list.
Different approaches are adopted with respect to stemming, phrases, etc.
\item \textbf{Query processor:} handles query re-formulation, stemming, \& Boolean operators; finds matching documents and ranks them.
\end{itemize}
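As a toy illustration of the ``fully inverted list'' built by the indexer (Python; the corpus representation \& naive tokenisation are assumptions of the sketch, not the lecture's specification):
\begin{verbatim}
from collections import defaultdict

def build_inverted_index(corpus):
    """Build a fully inverted list: term -> postings of (doc_id, position).

    corpus: dict mapping doc_id to its (already crawled) text.
    Tokenisation here is deliberately naive; a real indexer would also
    stem, detect phrases, etc.
    """
    index = defaultdict(list)
    for doc_id, text in corpus.items():
        for position, term in enumerate(text.lower().split()):
            index[term].append((doc_id, position))
    return index
\end{verbatim}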
\subsection{History: Citation Analysis}
Initial work in this area can be traced back to the domain of \textbf{citation analysis}.
The \textbf{impact factor} of a journal is defined as
\[
\text{impact factor} = \frac{A}{B}
\]
where $A$ is the number of current-year citations to articles appearing in the journal during the previous two years and $B$ is the number of articles published in the journal during the previous two years.
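For example (with illustrative numbers): if articles a journal published in 2022 \& 2023 received $A = 250$ citations during 2024, and the journal published $B = 100$ articles in those two years, then its impact factor is $\frac{250}{100} = 2.5$.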
\\\\
\textbf{Co-citation:} if a paper cites two papers $A$ \& $B$, then $A$ \& $B$ are considered related or associated.
The strength of co-citation between $A$ \& $B$ is the number of times they are co-cited.
\\\\
The \textbf{bibliographic coupling} of two documents $A$ \& $B$ is the number of documents cited by \textit{both} $A$ \& $B$.
\subsection{Mining the Web Link Structure}
The main approach in first generation search engines was the \textbf{indexing approach}.
Although useful and effective, there are potentially too many links returned -- how does one return the most relevant?
We usually want to select the most ``authoritative'' pages; hence, the search entails identifying pages that have relevancy \& quality.
In addition to content, web pages also contain many links that connect one page to another.
The web structure implicitly contains a large number of human annotations which can be exploited to infer notions of authority (and by extension quality).
Any link to a page $p$ is considered to be a positive recommendation for that page $p$.
\subsection{The HITS Algorithm}
The \textbf{HITS algorithm} computes hubs \& authorities for a particular topic specified by a normal query.
It first determines a set of relevant pages for that query, called the \textbf{base set} $S$ (constructed by expanding a root set, as described below).
It then analyses the link structure of the web sub-graph defined by $S$ to find authority \& hub pages in the set.
\\\\
The HITS algorithm analyses hyperlinks to identify:
\begin{itemize}
\item Authoritative pages (best sources);
\item Hubs (collections of links).
\end{itemize}
We develop algorithms to exploit the implicit social organisation available in the web link structure.
There exist many problems with identifying authoritative pages:
\begin{itemize}
\item Authoritative pages do not necessarily refer to themselves as such;
\item Many links are purely for navigational purposes;
\item Advertising links carry no real endorsement.
\end{itemize}
Mutually recursive heuristics are used:
\begin{itemize}
\item A good ``authority'' is one that is pointed to by many good hubs;
\item A good ``hub'' is one that points to many good authorities.
\end{itemize}
\subsubsection{Constructing the Sub-Graph}
\begin{enumerate}
\item For a specific query $Q$, let the set of documents returned by a standard search engine (e.g., vector space approach) be called the \textit{root set} $R$.
\item Initialise $S$ to $R$.
\item Add to $S$ all pages pointed to by any page in $R$.
\item Add to $S$ all pages that point to any page in $R$.
\end{enumerate}
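A direct translation of these steps (Python; the link-lookup helpers \texttt{outlinks} \& \texttt{inlinks} are assumed to be available from the crawl, and are not part of the lecture):
\begin{verbatim}
def build_base_set(root_set, outlinks, inlinks):
    """Expand the root set R into the base set S.

    outlinks(p): pages that p points to.
    inlinks(p): pages that point to p.
    """
    s = set(root_set)                 # steps 1-2: initialise S to R
    for page in root_set:
        s.update(outlinks(page))      # step 3: pages pointed to by R
        s.update(inlinks(page))      # step 4: pages that point to R
    return s
\end{verbatim}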
\subsubsection{Iterative Algorithm}
Assign to each page $p \in S$:
\begin{itemize}
\item An authority score $a_p$ (vector $a$);
\item A hub score $h_p$ (vector $h$).
\end{itemize}
Initialise all $a_p$ \& $h_p$ to some constant value.
\subsubsection{HITS Update Rules}
Authorities are pointed to by lots of good hubs:
\[
a_p = \sum_{q: q \rightarrow p } h_q
\]
Hubs point to lots of good authorities:
\[
h_p = \sum_{q: p \rightarrow q} a_q
\]
Define $M$ to be the adjacency matrix for the sub-graph defined by $S$: $M_{i,j} = 1$ for $i \in S, j \in S$ if and only if $i \rightarrow j$.
We can calculate the authority vector $a$ from the matrix $M^{\text{T}}M$.
Similarly, the hub vector $h$ can be calculated from the matrix $MM^{\text{T}}$.
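A minimal sketch of this iterative computation (Python; the adjacency-list representation, iteration count, \& per-round normalisation are assumptions of the sketch -- the scores are re-normalised each round so they converge rather than grow without bound):
\begin{verbatim}
import math

def hits(graph, iterations=50):
    """Compute authority & hub scores for the sub-graph S.

    graph: dict mapping each page to the list of pages it links to.
    Returns (authority, hub) dicts, each normalised to unit length.
    """
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a_p = sum of h_q over all q with an edge q -> p.
        auth = {p: sum(hub[q] for q in pages if p in graph.get(q, ()))
                for p in pages}
        # h_p = sum of a_q over all q with an edge p -> q.
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        # Re-normalise both score vectors to unit length.
        for vec in (auth, hub):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for p in vec:
                vec[p] /= norm
    return auth, hub
\end{verbatim}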
\subsubsection{Other Issues}
Limitations of a link-only approach include:
\begin{itemize}
\item On narrowly-focused query topics, there may not be many exact references and the hubs may provide links to more general pages.
\item Potential drift from main topic.
All links are treated as being equally important; if there is a range of topics in a hub, the focus of the search may drift.
\item Timeliness of recommendation is hard to identify.
\item Sensitivity to malicious attacks.
\item Edges with wrong semantics.
\end{itemize}
\subsection{PageRank}
\subsubsection{Markov Chains}
A \textbf{Markov chain} has two components:
\begin{itemize}
\item A graph / network structure where each node is called a state.
\item A transition probability attached to each link, giving the probability of traversing that link when the chain is in a given state.
\end{itemize}
A sequence of steps through the chain is called a \textit{random walk}.
\subsubsection{Random Surfer Model}
Assume that the web is a Markov chain.
Surfers randomly click on links, where the probability of following any particular outlink from page $A$ is $\frac{1}{n}$, where $n$ is the number of outlinks from $A$.
The surfer occasionally gets \textit{bored} and is moved to another web page (teleported), say $B$, where $B$ is equally likely to be any page.
The \textbf{PageRank} of a web page is the probability that the surfer will visit that page.
\subsubsection{Dangling Pages}
A \textbf{dangling page} is a page with no outgoing links, which therefore cannot pass on rank.
The solution to this is to assume that the page has links to all web pages with equal probability.
\subsubsection{Rank Sink}
Pages in a loop accumulate rank but do not distribute it; this is called a \textbf{rank sink}.
The solution to this is ``teleportation'', i.e., with a certain probability, the surfer can jump to any other web page to get out of the loop.
\subsubsection{PageRank Definition}
\begin{align*}
\text{PR}(W) = \frac{T}{N} + \left( 1 - T \right) \left( \frac{\text{PR}(W_1)}{O(W_1)} + \frac{\text{PR}(W_2)}{O(W_2)} + \cdots + \frac{\text{PR}(W_n)}{O(W_n)} \right)
\end{align*}
where
\begin{itemize}
\item $W$ is a web page;
\item $W_i$ are the web pages that have a link to $W$;
\item $O(W_i)$ is the number of outlinks from $W_i$;
\item $T$ is the teleportation probability;
\item $N$ is the size of the web.
\end{itemize}
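A sketch of computing this definition by simple iteration (Python; the graph representation \& iteration count are assumptions of the sketch; dangling pages are treated as linking to every page, as above):
\begin{verbatim}
def pagerank(graph, teleport=0.15, iterations=50):
    """Iteratively compute PR(W) for every page W.

    graph: dict mapping each page to the list of pages it links to.
    teleport: the teleportation probability T.
    """
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}     # start from a uniform distribution
    for _ in range(iterations):
        new = {p: teleport / n for p in pages}          # the T/N term
        for p in pages:
            out = graph.get(p) or pages  # dangling page: links to all pages
            share = (1 - teleport) * pr[p] / len(out)
            for q in out:                # distribute PR(p) over outlinks
                new[q] += share
        pr = new
    return pr
\end{verbatim}
Each iteration preserves total rank: the teleportation term contributes $T$ and the distributed shares contribute $1 - T$.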
\subsubsection{Efficiency}
Early experiments on Google showed convergence in 52 iterations on a collection with 322 million links;
the number of iterations required for convergence is empirically $O(\log n)$ where $n$ is the number of links.
This is quite efficient.
\subsubsection{Personalised PageRank}
We can bias the behaviour of PageRank by changing the notion of random jumps.
Instead of jumping to a random page on the web, we jump probabilistically to a page chosen from a seed set defined for the user.
This adds rank to pages of interest to the user rather than to random pages.
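In the sketch above, this amounts to concentrating the teleportation term on the seed set (again an illustrative sketch, assuming a non-empty seed set):
\begin{verbatim}
def personalised_pagerank(graph, seed_set, teleport=0.15, iterations=50):
    """As pagerank(), but random jumps land only on the user's seed pages."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    # Teleportation mass is spread over the seed set, not over all pages.
    jump = {p: (teleport / len(seed_set) if p in seed_set else 0.0)
            for p in pages}
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = dict(jump)
        for p in pages:
            out = graph.get(p) or pages  # dangling page: links to all pages
            share = (1 - teleport) * pr[p] / len(out)
            for q in out:
                new[q] += share
        pr = new
    return pr
\end{verbatim}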
\subsubsection{Semantically / Content-Biased PageRank}
PageRank treats all edges as being equally important in its random surfer model (excluding links identified as navigation \& advertising links).
That is, PageRank values are distributed equally across all outgoing edges.
An extra heuristic is that a surfer is more likely to follow a link relating to the content of the current page or passage.
\\\\
The PageRank values propagated from a page sum to one; standard PageRank distributes them equally across outlinks.
We can instead measure the similarity between the context of a link and the linked-to page,
which gives a measure of semantic relatedness between pages / passages.
If users are more likely to navigate to a related page, we can assign PageRank values in proportion to this relative similarity.
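A sketch of the biased distribution step (Python; \texttt{sim} is a placeholder for whatever content-similarity measure is used, and the function names are illustrative):
\begin{verbatim}
def biased_shares(page, outlinks, mass, sim):
    """Split a page's outgoing PageRank mass across its outlinks in
    proportion to content similarity rather than equally.

    mass: the (1 - T) * PR(page) value to distribute.
    sim(a, b): any similarity measure between pages a and b.
    """
    weights = {q: sim(page, q) for q in outlinks}
    total = sum(weights.values()) or 1.0
    return {q: mass * weights[q] / total for q in outlinks}
\end{verbatim}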
\subsection{Temporal Link Analysis}
Link-analysis techniques (e.g., PageRank, HITS) do not take into account associated temporal aspects of the web content.
The goal is to incorporate temporal aspects (e.g., freshness, rate of change) into link-analysis techniques.
Ranking can then be based on pages' authority values as those values change over time.
\\\\
Approach:
\begin{itemize}
\item Add annotations to the graph, i.e., for every edge in the graph and for every vertex of the graph, maintain a set of values regarding the temporal aspects such as:
\begin{itemize}
\item Creation time;
\item Modification times;
\item Last modification time.
\end{itemize}
\item We can define a window of interest, freshness of an edge or node, and activity of an edge or node.
\end{itemize}
\subsection{Ranking Signals}
There are many sources of evidence or \textbf{signals} that can be used to rank a page:
\begin{itemize}
\item Content signals (BM25 \& variants used);
\item Structural signals (anchor text etc.);
\item Web usage (implicit feedback, temporal context);
\item Link-based ranking.
\end{itemize}
How best to combine signals is an open question.
A simple approach is to combine PageRank (link analysis) and BM25 (content signal) linearly:
\[
R(p,q) = \alpha \, \text{BM25}(p,q) + \left( 1 - \alpha \right) \text{PR}(p)
\]
where $\alpha$ is a mixing weight.
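A minimal illustration of this combination (Python; the min--max normalisation is an assumption of the sketch, since raw BM25 \& PageRank scores live on very different scales and the lecture does not specify how they are reconciled):
\begin{verbatim}
def normalise(scores):
    """Min-max normalise a dict of raw scores into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {p: (v - lo) / span for p, v in scores.items()}

def rank(bm25, pr, alpha=0.7):
    """Order pages by alpha * BM25(p, q) + (1 - alpha) * PR(p).

    bm25 and pr are dicts over the same set of pages.
    """
    b, r = normalise(bm25), normalise(pr)
    return sorted(b, key=lambda p: alpha * b[p] + (1 - alpha) * r[p],
                  reverse=True)
\end{verbatim}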
More recently, much attention has been paid to learning-to-rank approaches.
Effectively, we attempt to learn the optimal way to combine the signals.
Many approaches have been adopted:
\begin{itemize}
\item Learning the ranking (NN, SVM, Bayesian networks);
\item Learning the ranking function (Genetic programming).
\end{itemize}
\subsubsection{Evaluation}
In order to compare systems \& algorithms and to be able to guide any learning approaches, we need some means of evaluation.
Commonly used choices include Precision@5 \& Precision@10 (computed as in the sketch after this list):
\begin{itemize}
\item Impossible to obtain recall;
\item Users tend not to care about recall.
\end{itemize}
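As a sketch (Python; the relevance judgements are assumed to come from human assessments or click-through data):
\begin{verbatim}
def precision_at_k(ranked_results, relevant, k=10):
    """Fraction of the top-k results that are relevant (e.g., k = 5 or 10).

    ranked_results: doc ids in ranked order.
    relevant: set of doc ids judged relevant (or clicked, if click-through
    data is used as implicit feedback).
    """
    return sum(1 for doc in ranked_results[:k] if doc in relevant) / k
\end{verbatim}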
In IR, we have test collections \& human evaluations.
In web search, we can exploit click-through data.
Issues with this include the heavy-tailed distribution of queries and obtaining sufficient evidence for rare queries.
A related issue is the evaluation of snippets.
Other issues in web search include:
\begin{itemize}
\item How to deal with duplicated data?
\item How to deal with near-duplicates?
\item Query suggestions:
\begin{itemize}
\item Diversity;
\item Appropriateness of suggestions;
\item Predictive accuracy.
\end{itemize}
\end{itemize}
\textbf{Adversarial search} is the conflict between web search engine designers / creators and the ``search engine optimisation'' community:
\begin{itemize}
\item Recognising spam links;
\item Augmenting link analysis algorithms to deal with such manipulation.
\end{itemize}