[CT4100]: Add Week 5 lecture materials & notes

This commit is contained in:
2024-10-17 08:20:49 +01:00
parent 1c8fe23c17
commit c62071cae4
3 changed files with 112 additions and 1 deletions


@@ -679,7 +679,7 @@ where
The \textbf{Okapi BM25} weighting scheme is a standard benchmark weighting scheme with relatively good performance, although it needs to be tuned per collection:
\begin{align*}
\text{BM25}(Q,D) = \sum_{t \in Q \cap D} \left( \frac{\textit{tf}_{t,D} \cdot \log \left( \frac{N - \textit{df}_t + 0.5}{\textit{df}_t + 0.5} \right) \cdot \textit{tf}_{t, Q}}{\textit{tf}_{t,D} + k_1 \cdot \left( (1-b) + b \cdot \frac{\textit{dl}}{\textit{dl}_\text{avg}} \right)} \right)
\end{align*}
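As a concrete illustration, the following is a minimal Python sketch of this scoring function.
The function names and data layout are assumptions, and $k_1 = 1.2$, $b = 0.75$ are merely common default values, not values from the lecture:
\begin{verbatim}
import math

def bm25_term(tf_d, tf_q, df, N, dl, dl_avg, k1=1.2, b=0.75):
    # Contribution of a single term t to BM25(Q, D).
    idf = math.log((N - df + 0.5) / (df + 0.5))
    norm = tf_d + k1 * ((1 - b) + b * (dl / dl_avg))
    return (tf_d * idf * tf_q) / norm

def bm25(query_tf, doc_tf, df, N, dl, dl_avg):
    # Sum over the terms appearing in both the query and the document.
    return sum(bm25_term(doc_tf[t], tf_q, df[t], N, dl, dl_avg)
               for t, tf_q in query_tf.items() if t in doc_tf)
\end{verbatim}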
The \textbf{Pivoted Normalisation} weighting scheme is also a standard benchmark which needs to be tuned per collection, although it has some issues with normalisation:
@@ -700,4 +700,115 @@ The \textbf{Axiomatic Approach} to weighting consists of the following constraints
New weighting schemes that adhere to all these constraints outperform the best known benchmarks.
\section{Relevance Feedback}
We often attempt to improve the performance of an IR system by modifying the user query;
the new modified query is then re-submitted to the system.
Typically, the user examines the returned list of documents and marks those which are relevant.
The new query is usually created by incorporating new terms and re-weighting existing terms.
The feedback from the user is used to re-calculate the term weights.
Analysis of the document set can either be \textbf{local analysis} (on the returned set) or \textbf{global analysis} (on the whole document set).
This feedback allows for the re-formulation of the query, which has the advantage of shielding the user from the task of query reformulation and from the inner details of the comparison algorithm.
\subsection{Feedback in the Vector Space Model}
We assume that relevant documents have similarly weighted term vectors.
$D_r$ is the set of relevant documents returned, $D_n$ is the set of the non-relevant documents returned, and $C_r$ is the set of relevant documents in the entire collection.
If we assume that $C_r$ is known for a query $q$, then the best vector for a query to distinguish relevant documents from non-relevant documents is
\[
\vec{q} = \left( \frac{1}{\left|C_r\right|} \sum_{d_j \in C_r}d_j \right) - \left( \frac{1}{N - \left|C_r\right|} \sum_{d_j \notin C_r} d_j \right)
\]
However, it is impossible to generate this query as we do not know $C_r$.
We can, however, estimate $C_r$, since we know $D_r$, which is a subset of $C_r$; the main approach for doing this is the \textbf{Rocchio Algorithm}:
\[
\overrightarrow{q_{\text{new}}} = \alpha \overrightarrow{q_\text{original}} + \frac{\beta}{\left| D_r \right|} \sum_{d_j \in D_r} d_j - \frac{\gamma}{\left| D_n \right|} \sum_{d_j \in D_n}d_j
\]
where $\alpha$, $\beta$, \& $\gamma$ are constants which determine the importance of feedback and the relative importance of positive feedback over negative feedback.
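A minimal Python sketch of this update, assuming sparse vectors represented as dictionaries mapping terms to weights; the values of $\alpha$, $\beta$, \& $\gamma$ below are illustrative, not prescribed by the lecture:
\begin{verbatim}
def rocchio(q, d_rel, d_nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    # q, and each document in d_rel / d_nonrel, is a dict: term -> weight.
    new_q = {t: alpha * w for t, w in q.items()}
    for docs, coeff in ((d_rel, beta / len(d_rel)),
                        (d_nonrel, -gamma / len(d_nonrel))):
        for d in docs:
            for t, w in d.items():
                new_q[t] = new_q.get(t, 0.0) + coeff * w
    # Negative weights are commonly clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0}
\end{verbatim}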
Variants on this algorithm include:
\begin{itemize}
\item \textbf{IDE Regular:}
\[
\overrightarrow{q_\text{new}} = \alpha \overrightarrow{q_\text{old}} + \beta \sum_{d_j \in D_r} d_j - \gamma \sum_{d_j \in D_n} d_j
\]
\item \textbf{IDE Dec Hi:} (based on the assumption that positive feedback is more useful than negative feedback)
\[
\overrightarrow{q_\text{new}} = \alpha \overrightarrow{q_\text{old}} + \beta \sum_{d_j \in D_r} d_j - \gamma \text{MAXNR}(d_j)
\]
where $\text{MAXNR}(d_j)$ is the highest-ranked non-relevant document (a code sketch of this variant follows the list).
\end{itemize}
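For comparison, a sketch of the IDE Dec Hi variant under the same assumptions as the Rocchio sketch above; \texttt{max\_nonrel} stands for $\text{MAXNR}(d_j)$:
\begin{verbatim}
def ide_dec_hi(q, d_rel, max_nonrel, alpha=1.0, beta=1.0, gamma=1.0):
    # Like Rocchio, but without the 1/|D| normalisation, and only the
    # highest-ranked non-relevant document is subtracted.
    new_q = {t: alpha * w for t, w in q.items()}
    for d in d_rel:
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w
    for t, w in max_nonrel.items():
        new_q[t] = new_q.get(t, 0.0) - gamma * w
    return {t: w for t, w in new_q.items() if w > 0}
\end{verbatim}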
The use of these feedback mechanisms has shown marked improvement in the precision \& recall of IR systems.
Salton indicated in early work on the vector space model that these feedback mechanisms result in an improvement in average precision of at least 10\%.
\\\\
Precision \& recall are re-calculated for the new returned set, often with respect to the returned document set less the set marked by the user.
\subsection{Pseudo-Feedback / Blind Feedback}
In \textbf{local analysis}, the retrieved documents are examined at query time to determine terms for query expansion.
We typically develop some form of term-term correlation matrix.
Having quantified the connection between two terms, we expand the query to include terms correlated with the query terms.
\subsubsection{Association Clusters}
To create an \textbf{association cluster}, we first create a term $\times$ term matrix $M$ which represents the level of association between terms.
This is usually weighted according to
\[
M_{i,j} = \frac{\text{freq}_{i,j}}{\text{freq}_{i} + \text{freq}_{j} - \text{freq}_{i,j}}
\]
To perform query expansion with local analysis, we develop an association cluster for each term $t_i$ in the query: for each $t_i \in q$, we select the top $N$ values from its row in the term-term matrix, so that $\left| q \right|$ clusters are formed (one per query term).
$N$ is usually small to prevent generation of very large queries.
We may then either take all terms or just those with the highest summed correlation.
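A Python sketch of this procedure; here the frequencies are read as document counts ($\text{freq}_i$ being the number of documents containing $t_i$, and $\text{freq}_{i,j}$ the number containing both terms), which is one possible interpretation, and all names are illustrative:
\begin{verbatim}
from collections import defaultdict
from itertools import combinations

def association_matrix(docs):
    # M[ti][tj] = freq_ij / (freq_i + freq_j - freq_ij).
    freq = defaultdict(int)
    co = defaultdict(int)
    for doc in docs:                      # doc: iterable of terms
        terms = set(doc)
        for t in terms:
            freq[t] += 1
        for ti, tj in combinations(sorted(terms), 2):
            co[(ti, tj)] += 1
    M = defaultdict(dict)
    for (ti, tj), f_ij in co.items():
        val = f_ij / (freq[ti] + freq[tj] - f_ij)
        M[ti][tj] = M[tj][ti] = val
    return M

def expand_query(query_terms, M, n=3):
    # One cluster per query term: its top-n neighbours in M.
    expanded = set(query_terms)
    for t in query_terms:
        neighbours = sorted(M.get(t, {}).items(),
                            key=lambda kv: kv[1], reverse=True)
        expanded.update(term for term, _ in neighbours[:n])
    return expanded
\end{verbatim}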
\subsubsection{Metric Clusters}
Association clusters do not take into account the position of terms within documents: \textbf{metric clusters} attempt to overcome this limitation.
Let $\text{dis}(t_i, t_j)$ be the distance between two terms $t_i$ \& $t_j$ in the same document.
If $t_i$ \& $t_j$ are in different documents, then $\text{dis}(t_i, t_j) = \infty$.
We can define the term-term correlation matrix by the following equation, and we can define clusters as before:
\[
M_{i,j} = \sum_{t_i, t_j \in D_i} \frac{1}{\text{dis}(t_i, t_j)}
\]
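A sketch of this correlation for a single term pair; here $\text{dis}(t_i, t_j)$ is taken as the minimum distance between occurrences of the two terms within a document, which is one plausible reading (summing over all occurrence pairs would be another):
\begin{verbatim}
def metric_correlation(docs, ti, tj):
    # Sum of 1/dis(ti, tj) over documents containing both terms.
    total = 0.0
    for doc in docs:                  # doc: list of terms in order
        pos_i = [k for k, t in enumerate(doc) if t == ti]
        pos_j = [k for k, t in enumerate(doc) if t == tj]
        if pos_i and pos_j:
            dis = min(abs(a - b) for a in pos_i for b in pos_j)
            total += 1.0 / dis
    return total
\end{verbatim}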
\subsubsection{Scalar Clusters}
\textbf{Scalar clusters} are based on comparing sets of words:
if two terms have similar neighbourhoods then there is a high correlation between terms.
Similarity can be based on comparing the two vectors representing the neighbourhoods.
This measure can be used to define term-term correlation matrices and the procedure can continue as before.
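A sketch of the neighbourhood comparison via cosine similarity, assuming a term's neighbourhood is its row in a term-term correlation matrix $M$ such as those defined above:
\begin{verbatim}
import math

def scalar_similarity(M, ti, tj):
    # Cosine similarity between the neighbourhood (row) vectors
    # of ti and tj in a term-term correlation matrix M.
    row_i, row_j = M.get(ti, {}), M.get(tj, {})
    dot = sum(w * row_j.get(t, 0.0) for t, w in row_i.items())
    norm_i = math.sqrt(sum(w * w for w in row_i.values()))
    norm_j = math.sqrt(sum(w * w for w in row_j.values()))
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0
\end{verbatim}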
\subsection{Global Analysis}
\textbf{Global analysis} is based on analysis of the whole document collection and not just the returned set.
A similarity matrix is created using a technique similar to that used in the vector space comparison.
We then index each term by the documents in which the term is contained.
It is then possible to calculate the similarity between two terms by taking some measure of the two vectors, e.g. the dot product.
To use this to expand a query, we then:
\begin{enumerate}
\item Map the query to the document-term space.
\item Calculate the similarity between the query vector and vectors associated with query terms.
\item Rank the vectors $\vec{t_i}$ based on similarity.
\item Choose the top-ranked terms to add to the query.
\end{enumerate}
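A minimal sketch of these steps, assuming each term is indexed as a sparse vector over the documents containing it and using the dot product as the similarity measure; all names and the data layout are illustrative:
\begin{verbatim}
from collections import defaultdict

def global_expansion(term_doc, query_terms, n=5):
    # term_doc: dict term -> {doc_id: weight}, i.e. each term indexed
    # by the documents in which it is contained.
    # 1. Map the query into document-term space by summing the
    #    vectors of its terms.
    q_vec = defaultdict(float)
    for t in query_terms:
        for doc_id, w in term_doc.get(t, {}).items():
            q_vec[doc_id] += w
    # 2-3. Score and rank every candidate term by dot product
    #      with the query vector.
    scores = {t: sum(w * q_vec.get(doc_id, 0.0)
                     for doc_id, w in vec.items())
              for t, vec in term_doc.items() if t not in query_terms}
    # 4. Add the top-ranked terms to the query.
    top = sorted(scores, key=scores.get, reverse=True)[:n]
    return list(query_terms) + top
\end{verbatim}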
\subsection{Issues with Feedback}
The Rocchio \& IDE methods can be used in all vector-based approaches.
Feedback is an implicit component of many other IR models (e.g., neural networks \& probabilistic models).
The same approaches with some modifications are used in information filtering.
Problems that exist in obtaining user feedback include:
\begin{itemize}
\item Users tend not to give a high degree of feedback.
\item Users are typically inconsistent with their feedback.
\item Explicit user feedback does not have to be strictly binary; we can allow a range of values.
\item Implicit feedback can also be used; we can assume that a user found an article useful if:
\begin{itemize}
\item The user reads the article.
\item The user spends a certain amount of time reading the article.
\item The user saves or prints the article.
\end{itemize}
However, these metrics are rarely as trustworthy as explicit feedback.
\end{itemize}
\end{document}