\item Predicting whether a query expansion technique has improved the results.
\end{itemize}

\section{Web Search}
In classical IR, the collection is relatively static.
The goal is to retrieve documents with content that is relevant to the user's information need.
Classic measures of relevance tend to ignore both \textit{context} \& \textit{individuals}.
In web search, the corpus contains both static \& dynamic information.
The goal is to retrieve high-quality results that are relevant to the current need.
Information needs may be:
\begin{itemize}
\item Informational;
\item Navigational;
\item Transactional.
\end{itemize}

Newer problems also emerge with respect to web search:
\begin{itemize}
\item Distributed data;
\item Volatile data;
\item Large volumes, scaling issues;
\item Redundancy (circa 30--40\% of documents are (near-)duplicates);
\item Quality (hundreds of millions of pages of spam);
\item Diversity (many languages, encodings);
\item Complex graph structure / topology.
\end{itemize}

Web search users:
\begin{itemize}
\item Tend to make ill-defined queries that are short, low-effort, contain imprecise terms, and have sub-optimal syntax;
\item Have a wide variance in terms of needs, expectations, knowledge, \& bandwidth;
\item Have specific behaviour: circa 85\% of people do not look at the second screen of results, and circa 80\% of people do not modify the query and instead just follow links.
\end{itemize}

\subsection{Evolution of Web Search}
\begin{itemize}
\item First generation: use only ``on-page'' information such as word frequency etc.
\item Second generation: link analysis, click-through data.
\item Third generation: semantic analysis, integrate multiple sources of information, context analysis (spatial, query stream, personal profiling), aiding the user (re-spelling, query refinement, query suggestion), representation of results / query / collection.
\end{itemize}

\subsection{Anatomy of a Search Engine}
\begin{itemize}
\item \textbf{Spider} (robot/crawler): builds the corpus by recursively following links.
\item \textbf{Indexer:} processes the data and indexes it (fully inverted list).
Different approaches are adopted with respect to stemming, phrases, etc.
\item \textbf{Query processor:} query re-formulation, stemming, handling Boolean operators; finds matching documents and ranks them.
\end{itemize}

\subsection{History: Citation Analysis}
Initial work in this area can be traced back to the domain of \textbf{citation analysis}.
The \textbf{impact factor} of a journal is defined as
\[
\text{impact factor} = \frac{A}{B}
\]
where $A$ is the number of current-year citations to articles that appeared in the journal during the previous two years and $B$ is the number of articles published in the journal during the previous two years.
\\\\
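As an illustrative example (made-up numbers, not from the lecture): if articles that a journal published over the previous two years received $A = 250$ citations this year, and the journal published $B = 125$ articles over those two years, then its impact factor is $250 / 125 = 2.0$.
\\\\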
\textbf{Co-citation:} if a paper cites two papers $A$ \& $B$, then they are related or associated.
The strength of co-citation between $A$ \& $B$ is the number of times they are co-cited.
\\\\
The \textbf{bibliographic coupling} of two documents $A$ \& $B$ is the number of documents cited by \textit{both} $A$ \& $B$.
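\\\\
As a minimal sketch (with hypothetical toy data, not from the notes), both measures can be computed directly from the set of references of each citing paper:
\begin{verbatim}
# Toy citation data: each citing paper maps to the set of papers it cites.
cites = {
    "P1": {"A", "B", "C"},
    "P2": {"A", "B"},
    "P3": {"B", "D"},
}

def co_citation(x, y):
    """Number of papers that cite both x and y (co-citation strength)."""
    return sum(1 for refs in cites.values() if x in refs and y in refs)

def bibliographic_coupling(x, y):
    """Number of papers cited by both x and y."""
    return len(cites.get(x, set()) & cites.get(y, set()))

print(co_citation("A", "B"))               # 2 (cited together by P1 and P2)
print(bibliographic_coupling("P1", "P3"))  # 1 (both cite B)
\end{verbatim}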

\subsection{Mining the Web Link Structure}
The main approach in first-generation search engines was the \textbf{indexing approach}.
Although useful and effective, there are potentially too many links returned -- how does one return the most relevant?
We usually want to select the most ``authoritative'' pages; hence, the search entails identifying pages that have both relevance \& quality.
In addition to content, web pages also contain many links that connect one page to another.
The web structure implicitly contains a large number of human annotations which can be exploited to infer notions of authority (and, by extension, quality).
Any link to a page $p$ is considered to be a positive recommendation for that page $p$.

\subsection{The HITS Algorithm}
The \textbf{HITS algorithm} computes hubs \& authorities for a particular topic specified by a normal query.
It first determines a set of relevant pages for that query, called the \textbf{base set} $S$.
It then analyses the link structure of the web sub-graph defined by $S$ to find authority \& hub pages in the set.
\\\\
The HITS algorithm analyses hyperlinks to identify:
\begin{itemize}
\item Authoritative pages (best sources);
\item Hubs (collections of links).
\end{itemize}

We develop algorithms to exploit the implicit social organisation available in the web link structure.
There exist many problems with identifying authoritative pages:
\begin{itemize}
\item Authoritative pages do not necessarily refer to themselves as such;
\item Many links are purely for navigational purposes;
\item Advertising links.
\end{itemize}

Mutually recursive heuristics are used:
\begin{itemize}
\item A good ``authority page'' is one that is pointed to by a number of good hubs;
\item A good ``hub'' is one that contains links to many good authorities.
\end{itemize}

\subsubsection{Constructing the Sub-Graph}
\begin{enumerate}
\item For a specific query $Q$, let the set of documents returned by a standard search engine (e.g., vector space approach) be called the \textit{root set} $R$.
\item Initialise $S$ to $R$.
\item Add to $S$ all pages pointed to by any page in $R$.
\item Add to $S$ all pages that point to any page in $R$.
\end{enumerate}
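
A minimal sketch of this expansion step, assuming hypothetical \texttt{out\_links} / \texttt{in\_links} dictionaries mapping each page to the pages it links to / is linked from:
\begin{verbatim}
def build_base_set(root_set, out_links, in_links):
    """Expand the root set R into the base set S (the HITS sub-graph)."""
    S = set(root_set)
    for page in root_set:
        S |= set(out_links.get(page, ()))  # pages pointed to by pages in R
        S |= set(in_links.get(page, ()))   # pages that point to pages in R
    return S
\end{verbatim}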

\subsubsection{Iterative Algorithm}
Assign to each page $p \in S$:
\begin{itemize}
\item An authority score $a_p$ (vector $a$);
\item A hub score $h_p$ (vector $h$).
\end{itemize}

Initialise all $a_p$ \& $h_p$ to some constant value.

\subsubsection{HITS Update Rules}
Authorities are pointed to by lots of good hubs:
\[
a_p = \sum_{q: q \rightarrow p} h_q
\]

Hubs point to lots of good authorities:
\[
h_p = \sum_{q: p \rightarrow q} a_q
\]

Define $M$ to be the adjacency matrix for the sub-graph defined by $S$: $M_{i,j} = 1$ for $i \in S, j \in S$ if and only if $i \rightarrow j$.
We can calculate the authority vector $a$ from the matrix $M^{\text{T}}M$.
Similarly, the hub vector $h$ can be calculated from the matrix $MM^{\text{T}}$.
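\\\\
Below is a minimal iterative sketch of these update rules; the toy graph and the normalisation step are illustrative additions (normalisation is standard in HITS but is not spelled out in the notes above):
\begin{verbatim}
import math

def hits(out_links, iterations=20):
    """Iterative HITS over the sub-graph S.
    out_links maps each page to the list of pages it links to."""
    pages = set(out_links) | {q for qs in out_links.values() for q in qs}
    auth = {p: 1.0 for p in pages}  # authority scores a_p
    hub = {p: 1.0 for p in pages}   # hub scores h_p

    for _ in range(iterations):
        # a_p = sum of h_q over all q with q -> p
        auth = {p: sum(hub[q] for q in pages if p in out_links.get(q, ()))
                for p in pages}
        # h_p = sum of a_q over all q with p -> q
        hub = {p: sum(auth[q] for q in out_links.get(p, ())) for p in pages}
        # Normalise so the scores stay bounded (standard HITS step).
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub

# Hypothetical toy sub-graph S
authorities, hubs = hits({"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []})
\end{verbatim}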

\subsubsection{Other Issues}
Limitations of a link-only approach include:
\begin{itemize}
\item On narrowly-focused query topics, there may not be many exact references and the hubs may provide links to more general pages.
\item Potential drift from the main topic.
All links are treated as being equally important; if there is a range of topics in a hub, the focus of the search may drift.
\item Timeliness of a recommendation is hard to identify.
\item Sensitivity to malicious attack.
\item Edges with the wrong semantics.
\end{itemize}

\subsection{PageRank}
\subsubsection{Markov Chains}
A \textbf{Markov chain} has two components:
\begin{itemize}
\item A graph / network structure where each node is called a state;
\item A transition probability for each link, i.e., the probability of traversing that link given that the chain is in the link's source state.
\end{itemize}

A sequence of steps through the chain is called a \textit{random walk}.

\subsubsection{Random Surfer Model}
Assume that the web is a Markov chain.
Surfers randomly click on links, where the probability of following a given outlink from page $A$ is $\frac{1}{n}$ if there are $n$ outlinks from $A$.
The surfer occasionally gets \textit{bored} and is moved to another web page (teleported), say $B$, where $B$ is equally likely to be any page.
The \textbf{PageRank} of a web page is the probability that the surfer will visit that page.

\subsubsection{Dangling Pages}
A \textbf{dangling page} is a page with no outgoing links, which therefore cannot pass on rank.
The solution to this is to assume that the page has links to all web pages with equal probability.

\subsubsection{Rank Sink}
Pages in a loop accumulate rank but do not distribute it; this is called a \textbf{rank sink}.
The solution to this is ``teleportation'', i.e., with a certain probability, the surfer can jump to any other web page to get out of the loop.

\subsubsection{PageRank Definition}
\[
\text{PR}(W) = \frac{T}{N} + \left( 1 - T \right) \left( \frac{\text{PR}(W_1)}{O(W_1)} + \frac{\text{PR}(W_2)}{O(W_2)} + \cdots + \frac{\text{PR}(W_n)}{O(W_n)} \right)
\]
where
\begin{itemize}
\item $W$ is a web page;
\item $W_i$ are the web pages that have a link to $W$;
\item $O(W_i)$ is the number of outlinks from $W_i$;
\item $T$ is the teleportation probability;
\item $N$ is the size of the web.
\end{itemize}
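
A minimal iterative sketch of this definition (the toy graph is hypothetical; dangling pages are handled as described above, by spreading their rank evenly over all pages):
\begin{verbatim}
def pagerank(out_links, T=0.15, iterations=50):
    """Iterative PageRank with teleportation probability T."""
    pages = set(out_links) | {q for qs in out_links.values() for q in qs}
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}

    for _ in range(iterations):
        # Rank held by dangling pages is spread evenly over all pages.
        dangling = sum(pr[p] for p in pages if not out_links.get(p))
        new = {p: T / N + (1 - T) * dangling / N for p in pages}
        for p in pages:
            for q in out_links.get(p, ()):
                new[q] += (1 - T) * pr[p] / len(out_links[p])
        pr = new
    return pr

# Hypothetical toy web graph
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
\end{verbatim}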

\subsubsection{Efficiency}
Early experiments on Google showed convergence in 52 iterations on a collection with 322 million links;
the number of iterations required for convergence is empirically $O(\log n)$, where $n$ is the number of links.
This is quite efficient.

\subsubsection{Personalised PageRank}
We can bias the behaviour of PageRank by changing the notion of random jumps.
Instead of jumping to a random page on the web, we jump probabilistically to a page chosen from a seed set defined for a user.
We thus add rank to pages of interest to the user rather than to a random page.

\subsubsection{Semantically / Content-Biased PageRank}
PageRank treats all edges as being equally important in its random surfer model (excluding links identified as navigation \& advertising links).
That is, PageRank values are distributed equally across all outgoing edges.
An extra heuristic is that a surfer is more likely to follow a link relating to the content of the current page or passage.
\\\\
The PageRank values propagated from a page sum to one; standard PageRank assigns equal values to each outgoing edge.
We can measure a similarity between the context of a link and the linked-to page.
This gives a measure of semantic relatedness between pages / passages.
If users are more likely to navigate to a related page, we can assign PageRank values in proportion to the relative similarity.
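\\\\
A minimal sketch of this idea, assuming a hypothetical \texttt{similarity()} function between a link's context and the linked-to page:
\begin{verbatim}
def biased_transition_probs(page, out_links, similarity):
    """Distribute a page's rank over its outlinks in proportion to
    content similarity rather than uniformly."""
    targets = out_links[page]
    sims = {q: similarity(page, q) for q in targets}
    total = sum(sims.values()) or 1.0
    # Each outgoing edge gets a share proportional to its similarity;
    # the shares still sum to one, as in standard PageRank.
    return {q: s / total for q, s in sims.items()}
\end{verbatim}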

\subsection{Temporal Link Analysis}
Link-analysis techniques (e.g., PageRank, HITS) do not take into account the associated temporal aspects of web content.
The goal is to incorporate temporal aspects (e.g., freshness, rate of change) into link-analysis techniques, with ranking based on the pages' authority values as those values change over time.
\\\\
Approach:
\begin{itemize}
\item Add annotations to the graph, i.e., for every edge and every vertex of the graph, maintain a set of values regarding temporal aspects such as:
\begin{itemize}
\item Creation time;
\item Modification times;
\item Last modification time.
\end{itemize}
\item We can then define a window of interest, the freshness of an edge or node, and the activity of an edge or node.
\end{itemize}

\subsection{Ranking Signals}
There are many sources of evidence or \textbf{signals} that can be used to rank a page:
\begin{itemize}
\item Content signals (BM25 \& variants are used);
\item Structural signals (anchor text etc.);
\item Web usage (implicit feedback, temporal context);
\item Link-based ranking.
\end{itemize}

How best to combine signals is an open question.
A simple approach is to combine PageRank (link analysis) and BM25 (content signal) linearly:
\[
R(p,q) = a \cdot \text{BM25}(p,q) + (1-a) \cdot \text{PR}(p)
\]
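
A minimal sketch of this linear combination (the weight $a = 0.7$ and the assumption that both scores are already normalised to comparable ranges are illustrative, not from the notes):
\begin{verbatim}
def combined_score(bm25_score, pagerank_score, a=0.7):
    """Linear combination of a content signal and a link signal.
    Assumes both scores are normalised to comparable ranges."""
    return a * bm25_score + (1 - a) * pagerank_score

# e.g. combined_score(0.62, 0.15) == 0.479
\end{verbatim}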

More recently, much attention has been paid to learning-to-rank approaches.
Effectively, we attempt to learn the optimal way to combine the signals.
Many approaches have been adopted:
\begin{itemize}
\item Learning the ranking (NN, SVM, Bayesian networks);
\item Learning the ranking function (genetic programming).
\end{itemize}

\subsubsection{Evaluation}
In order to compare systems \& algorithms and to be able to guide any learning approaches, we need some means of evaluation.
Commonly used choices include Precision@5 \& Precision@10, because:
\begin{itemize}
\item It is impossible to obtain recall on the web;
\item Users tend not to care about recall.
\end{itemize}
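
A minimal sketch of Precision@$k$ over a ranked result list (the document identifiers and relevance judgements below are toy assumptions):
\begin{verbatim}
def precision_at_k(ranked_results, relevant, k=10):
    """Fraction of the top-k ranked results that are relevant."""
    top_k = ranked_results[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

# e.g. precision_at_k(["d3", "d7", "d1", "d9", "d2"], {"d3", "d1"}, k=5) == 0.4
\end{verbatim}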

In IR, we have test collections \& human evaluations.
In web search, we can exploit click-through data.
Issues with this include the heavy-tailed distribution of queries and having sufficient evidence.
A related issue is the evaluation of snippets.
Other issues in web search include:
\begin{itemize}
\item How to deal with duplicated data?
\item How to deal with near-duplicates?
\item Query suggestions?
\begin{itemize}
\item Diversity;
\item Appropriate suggestions;
\item Predictive accuracy.
\end{itemize}
\end{itemize}

\textbf{Adversarial search} is the conflict between web search engine designers / creators and the ``search engine optimisation'' community:
\begin{itemize}
\item Recognising spam links;
\item Augmenting link analysis algorithms to deal with such manipulation.
\end{itemize}