diff --git a/year4/semester1/CT4100: Information Retrieval/materials/09. Web Search/websearch.pdf b/year4/semester1/CT4100: Information Retrieval/materials/09. Web Search/websearch.pdf
new file mode 100644
index 00000000..d9e23095
Binary files /dev/null and b/year4/semester1/CT4100: Information Retrieval/materials/09. Web Search/websearch.pdf differ
diff --git a/year4/semester1/CT4100: Information Retrieval/notes/CT4100-Notes.pdf b/year4/semester1/CT4100: Information Retrieval/notes/CT4100-Notes.pdf
index 3ccd5354..32dba6fa 100644
Binary files a/year4/semester1/CT4100: Information Retrieval/notes/CT4100-Notes.pdf and b/year4/semester1/CT4100: Information Retrieval/notes/CT4100-Notes.pdf differ
diff --git a/year4/semester1/CT4100: Information Retrieval/notes/CT4100-Notes.tex b/year4/semester1/CT4100: Information Retrieval/notes/CT4100-Notes.tex
index 303e9a60..ea9eec6a 100644
--- a/year4/semester1/CT4100: Information Retrieval/notes/CT4100-Notes.tex
+++ b/year4/semester1/CT4100: Information Retrieval/notes/CT4100-Notes.tex
@@ -1605,7 +1605,266 @@ Can you identify approaches that may be of use in:
 \item Predicting whether a query expansion technique has improved the results.
 \end{itemize}
 
+\section{Web Search}
+In classical IR, the collection is relatively static.
+The goal is to retrieve documents with content that is relevant to the user's information need.
+Classic measures of relevance tend to ignore both \textit{context} \& \textit{individuals}.
+In web search, the corpus contains both static \& dynamic information.
+The goal is to retrieve high-quality results that are relevant to the current need.
+The user's information need may be:
+\begin{itemize}
+    \item Informational;
+    \item Navigational;
+    \item Transactional.
+\end{itemize}
+Newer problems also emerge with respect to web search:
+\begin{itemize}
+    \item Distributed data;
+    \item Volatile data;
+    \item Large volumes, scaling issues;
+    \item Redundancy (circa 30--40\% of documents are (near-)duplicates);
+    \item Quality (hundreds of millions of pages of spam);
+    \item Diversity (many languages, encodings);
+    \item Complex graph structure / topology.
+\end{itemize}
+
+Web search users:
+\begin{itemize}
+    \item Tend to make ill-defined queries that are short, low-effort, contain imprecise terms, and have sub-optimal syntax;
+    \item Have a wide variance in terms of needs, expectations, knowledge, \& bandwidth;
+    \item Have specific behaviour: circa 85\% of people do not look at the second screen of results and circa 80\% of people do not modify the query and instead just follow links.
+\end{itemize}
+
+\subsection{Evolution of Web Search}
+\begin{itemize}
+    \item First generation: use only ``on-page'' information such as word frequency, etc.
+    \item Second generation: link analysis, click-through data.
+    \item Third generation: semantic analysis, integrate multiple sources of information, context analysis (spatial, query stream, personal profiling), aiding the user (re-spelling, query refinement, query suggestion), representation of results / query / collection.
+\end{itemize}
+
+\subsection{Anatomy of a Search Engine}
+\begin{itemize}
+    \item \textbf{Spider} (robot/crawler): builds the corpus by recursively following links.
+    \item \textbf{Indexer}: processes the data and indexes it (fully inverted list).
+    Different approaches are adopted with respect to stemming, phrases, etc.
+    \item \textbf{Query processor}: handles query re-formulation, stemming, \& Boolean operators; finds matching documents and ranks them.
+\end{itemize}
+
+\subsection{History: Citation Analysis}
+Initial work in this area can be traced back to the domain of \textbf{citation analysis}.
+The \textbf{impact factor} of a journal is defined as
+\[
+    \text{impact factor} = \frac{A}{B}
+\]
+where $A$ is the number of current-year citations to articles appearing in the journal during the previous two years and $B$ is the number of articles published in the journal during the previous two years.
+\\\\
+\textbf{Co-citation}: if a paper cites two papers $A$ \& $B$, then $A$ \& $B$ are said to be related or associated.
+The strength of co-citation between $A$ \& $B$ is the number of times they are co-cited.
+\\\\
+The \textbf{bibliographic coupling} of two documents $A$ \& $B$ is the number of documents cited by \textit{both} $A$ \& $B$.
+
+\subsection{Mining the Web Link Structure}
+The main approach in first-generation search engines was the \textbf{indexing approach}.
+Although useful and effective, it potentially returns far too many results -- how does one return the most relevant?
+We usually want to select the most ``authoritative'' pages; hence, the search entails identifying pages that have both relevance \& quality.
+In addition to content, web pages also contain many links that connect one page to another.
+The web structure implicitly contains a large number of human annotations which can be exploited to infer notions of authority (and, by extension, quality).
+Any link to a page $p$ is considered to be a positive recommendation for that page $p$.
+
+\subsection{The HITS Algorithm}
+The \textbf{HITS algorithm} computes hubs \& authorities for a particular topic specified by a normal query.
+It first determines a set of relevant pages for that query called the \textbf{base set} $S$.
+It then analyses the link structure of the web sub-graph defined by $S$ to find authority \& hub pages in that set.
+\\\\
+The HITS algorithm analyses hyperlinks to identify:
+\begin{itemize}
+    \item Authoritative pages (best sources);
+    \item Hubs (collections of links).
+\end{itemize}
+
+We develop algorithms to exploit the implicit social organisation available in the web link structure.
+There exist many problems with identifying authoritative pages:
+\begin{itemize}
+    \item Authoritative pages do not necessarily refer to themselves as such;
+    \item Many links are purely for navigational purposes;
+    \item Advertising links.
+\end{itemize}
+
+Mutually recursive heuristics are used:
+\begin{itemize}
+    \item A good ``authority'' is a page that is pointed to by many good hubs;
+    \item A good ``hub'' is a page that points to many good authorities.
+\end{itemize}
+
+\subsubsection{Constructing the Sub-Graph}
+\begin{enumerate}
+    \item For a specific query $Q$, let the set of documents returned by a standard search engine (e.g., vector space approach) be called the \textit{root set} $R$.
+    \item Initialise $S$ to $R$.
+    \item Add to $S$ all pages pointed to by any page in $R$.
+    \item Add to $S$ all pages that point to any page in $R$.
+\end{enumerate}
+
+\subsubsection{Iterative Algorithm}
+Assign to each page $p \in S$:
+\begin{itemize}
+    \item An authority score $a_p$ (vector $a$);
+    \item A hub score $h_p$ (vector $h$).
+\end{itemize}
+
+Initialise all $a_p$ \& $h_p$ to some constant value.
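+
+The following is a minimal illustrative sketch (in Python) of this set-up; the helpers \texttt{search\_engine\_top\_k}, \texttt{get\_outlinks}, and \texttt{get\_inlinks} are hypothetical stand-ins for a standard engine and a link database, and the cut-off \texttt{k} is an arbitrary illustrative choice.
+\begin{verbatim}
+def build_base_set(query, k=200):
+    """Construct the base set S from the root set R for a query."""
+    # 1. Root set R: pages returned by a standard (e.g., vector space) engine.
+    R = set(search_engine_top_k(query, k))   # hypothetical helper
+    # 2. Initialise S to R.
+    S = set(R)
+    # 3. Add all pages pointed to by any page in R.
+    # 4. Add all pages that point to any page in R.
+    for page in R:
+        S.update(get_outlinks(page))         # hypothetical helper
+        S.update(get_inlinks(page))          # hypothetical helper
+    return S
+
+def init_scores(S):
+    """Initialise every authority and hub score to a constant value."""
+    a = {p: 1.0 for p in S}
+    h = {p: 1.0 for p in S}
+    return a, h
+\end{verbatim}
+The update rules of the next subsubsection are then applied repeatedly to $a$ \& $h$ (with normalisation) until the scores converge.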
+
+\subsubsection{HITS Update Rules}
+Authorities are pointed to by lots of good hubs:
+\[
+    a_p = \sum_{q: q \rightarrow p} h_q
+\]
+
+Hubs point to lots of good authorities:
+\[
+    h_p = \sum_{q: p \rightarrow q} a_q
+\]
+
+Define $M$ to be the adjacency matrix for the sub-graph defined by $S$: $M_{i,j} = 1$ for $i \in S, j \in S$ if and only if $i \rightarrow j$.
+In matrix form, the update rules are $a = M^{\text{T}}h$ and $h = Ma$; substituting one into the other gives $a = M^{\text{T}}Ma$ and $h = MM^{\text{T}}h$.
+Hence, after normalisation, the authority vector $a$ converges to the principal eigenvector of $M^{\text{T}}M$, and the hub vector $h$ converges to the principal eigenvector of $MM^{\text{T}}$.
+
+\subsubsection{Other Issues}
+Limitations of a link-only approach include:
+\begin{itemize}
+    \item On narrowly-focused query topics, there may not be many exact references and the hubs may provide links to more general pages.
+    \item Potential drift from the main topic.
+    All links are treated as being equally important; if there is a range of topics in a hub, the focus of the search may drift.
+    \item Timeliness of a recommendation is hard to identify.
+    \item Sensitivity to malicious attack.
+    \item Edges with the wrong semantics.
+\end{itemize}
+
+\subsection{PageRank}
+\subsubsection{Markov Chains}
+A \textbf{Markov chain} has two components:
+\begin{itemize}
+    \item A graph / network structure where each node is called a state.
+    \item A transition probability of traversing a link given that the chain is in a state.
+\end{itemize}
+
+A sequence of steps through the chain is called a \textit{random walk}.
+
+\subsubsection{Random Surfer Model}
+Assume that the web is a Markov chain.
+Surfers randomly click on links: if a page $A$ has $n$ outlinks, each is followed with probability $\frac{1}{n}$.
+The surfer occasionally gets \textit{bored} and is moved to another web page (teleported), say $B$, where $B$ is equally likely to be any page.
+The \textbf{PageRank} of a web page is the long-run probability that the surfer visits that page.
+
+\subsubsection{Dangling Pages}
+A \textbf{dangling page} is a page with no outgoing links that therefore cannot pass on rank.
+The solution to this is to assume that the page has links to all web pages with equal probability.
+
+\subsubsection{Rank Sink}
+Pages in a loop accumulate rank but do not distribute it; this is called a \textbf{rank sink}.
+The solution to this is ``teleportation'', i.e., with a certain probability, the surfer can jump to any other web page to get out of the loop.
+
+\subsubsection{PageRank Definition}
+\begin{align*}
+    \text{PR}(W) = \frac{T}{N} + \left( 1 - T \right) \left( \frac{\text{PR}(W_1)}{O(W_1)} + \frac{\text{PR}(W_2)}{O(W_2)} + \cdots + \frac{\text{PR}(W_n)}{O(W_n)} \right)
+\end{align*}
+where
+\begin{itemize}
+    \item $W$ is a web page;
+    \item $W_i$ are the web pages that have a link to $W$;
+    \item $O(W_i)$ is the number of outlinks from $W_i$;
+    \item $T$ is the teleportation probability;
+    \item $N$ is the size of the web.
+\end{itemize}
+
+\subsubsection{Efficiency}
+Early experiments on Google showed convergence in 52 iterations on a collection with 322 million links;
+the number of iterations required for convergence is empirically $O(\log n)$, where $n$ is the number of links.
+This is quite efficient.
+
+\subsubsection{Personalised PageRank}
+We can bias the behaviour of PageRank by changing the notion of random jumps.
+Instead of jumping to a random page on the web, we jump probabilistically to a page chosen from a seed set defined for a user.
+This adds rank to pages of interest to the user rather than to random pages.
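+
+As a concrete illustration of the PageRank definition above, the following is a minimal sketch (in Python) of the computation by repeated iteration; it assumes the whole graph fits in memory as a dictionary mapping each page to the set of pages it links to (every page appearing as a link target must also be a key), which is of course a simplification of any web-scale implementation.
+\begin{verbatim}
+def pagerank(outlinks, T=0.15, iterations=50):
+    """Iteratively compute PR(W) = T/N + (1-T) * sum(PR(W_i)/O(W_i))."""
+    pages = list(outlinks)
+    N = len(pages)
+    pr = {p: 1.0 / N for p in pages}          # start from a uniform distribution
+    for _ in range(iterations):
+        new_pr = {p: T / N for p in pages}    # teleportation term T/N
+        for p in pages:
+            targets = outlinks[p]
+            if not targets:                   # dangling page: assume links to
+                for q in pages:               # all pages with equal probability
+                    new_pr[q] += (1 - T) * pr[p] / N
+            else:
+                share = (1 - T) * pr[p] / len(targets)
+                for q in targets:             # pass (1-T) * PR(p)/O(p) along
+                    new_pr[q] += share        # each outlink
+        pr = new_pr
+    return pr
+\end{verbatim}
+The teleportation probability \texttt{T} and the fixed iteration count are illustrative choices; in practice, iteration continues until the scores change by less than some tolerance.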
+
+\subsubsection{Semantically / Content-Biased PageRank}
+PageRank treats all edges as being equally important in its random surfer model (excluding links identified as navigation \& advertising links).
+That is, PageRank values are distributed equally across all outgoing edges.
+An extra heuristic is that a surfer is more likely to follow a link relating to the content of the current page or passage.
+\\\\
+The PageRank values propagated from a page sum to one; standard PageRank distributes them equally across the outgoing edges.
+We can measure the similarity between the context of a link and the linked-to page.
+This gives a measure of semantic relatedness between pages / passages.
+If users are more likely to navigate to a related page, we can assign PageRank values in proportion to the relative similarity.
+
+\subsection{Temporal Link Analysis}
+Link-analysis techniques (e.g., PageRank, HITS) do not take into account the temporal aspects of web content.
+The goal is to incorporate temporal aspects (e.g., freshness, rate of change) into link-analysis techniques.
+Ranking is then based on the pages' authority values as those values change over time.
+\\\\
+Approach:
+\begin{itemize}
+    \item Add annotations to the graph, i.e., for every edge in the graph and for every vertex of the graph, maintain a set of values regarding the temporal aspects, such as:
+    \begin{itemize}
+        \item Creation time;
+        \item Modification times;
+        \item Last modification time.
+    \end{itemize}
+    \item We can then define a window of interest, the freshness of an edge or node, and the activity of an edge or node.
+\end{itemize}
+
+\subsection{Ranking Signals}
+There are many sources of evidence or \textbf{signals} that can be used to rank a page:
+\begin{itemize}
+    \item Content signals (BM25 \& variants used);
+    \item Structural signals (anchor text, etc.);
+    \item Web usage (implicit feedback, temporal context);
+    \item Link-based ranking.
+\end{itemize}
+
+How best to combine signals is an open question.
+A simple approach is to combine PageRank (link analysis) and BM25 (content signal) linearly:
+\[
+    R(p,q) = a \cdot \text{BM25}(p,q) + (1-a) \cdot \text{PR}(p)
+\]
+where $a \in [0,1]$ is a weighting parameter (a short illustrative sketch of this combination is given at the end of this section).
+
+More recently, much attention has been paid to learning-to-rank approaches.
+Effectively, we attempt to learn the optimal way to combine the signals.
+Many approaches have been adopted:
+\begin{itemize}
+    \item Learning the ranking (neural networks, SVMs, Bayesian networks);
+    \item Learning the ranking function (genetic programming).
+\end{itemize}
+
+\subsubsection{Evaluation}
+In order to compare systems \& algorithms and to be able to guide any learning approaches, we need some means of evaluation.
+Commonly used measures are Precision@5 \& Precision@10 because:
+\begin{itemize}
+    \item It is impossible to obtain recall on the web;
+    \item Users tend not to care about recall.
+\end{itemize}
+
+In classical IR, we have test collections \& human evaluations.
+In web search, we can also exploit click-through data.
+Issues with this include the heavy-tailed distribution of queries and gathering sufficient evidence per query.
+A related issue is the evaluation of snippets.
+Other issues in web search include:
+\begin{itemize}
+    \item How to deal with duplicated data?
+    \item How to deal with near-duplicates?
+    \item Query suggestions:
+    \begin{itemize}
+        \item Diversity;
+        \item Appropriate suggestions;
+        \item Predictive accuracy.
+    \end{itemize}
+\end{itemize}
+
+\textbf{Adversarial search} is the conflict between web search engine designers / creators and the ``search engine optimisation'' community:
+\begin{itemize}
+    \item Recognising spam links;
+    \item Augmenting link analysis algorithms to deal with such manipulation.
+\end{itemize}
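+
+Returning to the combination of ranking signals discussed above, the following is a small illustrative sketch (in Python) of the linear combination $R(p,q) = a \cdot \text{BM25}(p,q) + (1-a) \cdot \text{PR}(p)$; the \texttt{bm25\_score} function and \texttt{pagerank} dictionary are assumed to be available from an existing index and a prior PageRank computation, and the min--max normalisation step is an added assumption (the raw BM25 and PageRank scores are on very different scales).
+\begin{verbatim}
+def combined_rank(query, candidates, bm25_score, pagerank, a=0.8):
+    """Rank candidate pages by a * BM25(p, q) + (1 - a) * PR(p)."""
+    def normalise(scores):
+        # Min-max normalise so the two signals are comparable
+        # (an illustrative assumption, not part of the formula itself).
+        lo, hi = min(scores.values()), max(scores.values())
+        if hi == lo:
+            return {p: 0.0 for p in scores}
+        return {p: (s - lo) / (hi - lo) for p, s in scores.items()}
+
+    bm25 = normalise({p: bm25_score(p, query) for p in candidates})
+    pr = normalise({p: pagerank[p] for p in candidates})
+    combined = {p: a * bm25[p] + (1 - a) * pr[p] for p in candidates}
+    return sorted(candidates, key=lambda p: combined[p], reverse=True)
+\end{verbatim}
+The weight \texttt{a} here is arbitrary; choosing it well is exactly the signal-combination problem that learning-to-rank approaches attempt to solve.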