[CT4100]: Add Week 7 lecture notes

This commit is contained in:
2024-10-24 20:41:12 +01:00
parent df4f467d78
commit 865505673a
5 changed files with 255 additions and 0 deletions

We can also view collaborative filtering as a machine learning classification problem.
Much recent work has focused not only on giving a recommendation, but also on attempting to explain the recommendation to the user.
Questions arise as to how best to ``explain'' or visualise the recommendation.
\section{Learning in Information Retrieval}
Many real-world problems are complex and it is difficult to specify (algorithmically) how to solve many of these problems.
Learning techniques are used in many domains to find solutions to problems that may not be obvious or clear to human users.
In general, machine learning involves searching a large space of potential hypotheses or solutions to find the one that best \textit{explains} or \textit{fits} a set of data and any prior knowledge; a system can be said to learn if its performance improves with experience.
\\\\
Machine learning techniques require a training stage before the learned solution can be applied to new, previously unseen data.
The training stage consists of a data set of examples which can either be:
\begin{itemize}
\item \textbf{Labelled} (supervised learning).
\item \textbf{Unlabelled} (unsupervised learning).
\end{itemize}
An additional data set must also be used to test the hypothesis/solution.
\\\\
\textbf{Symbolic knowledge} is represented in the form of the symbolic descriptions of the learned concepts, e.g., production rules or concept hierarchies.
\textbf{Sub-symbolic knowledge} is represented in sub-symbolic form not readable by a user, e.g., in the structure, weights, \& biases of the trained network.
\subsection{Genetic Algorithms}
\textbf{Genetic algorithms} are inspired by the Darwinian theory of evolution:
at each step of the algorithm, the best solutions are selected while the weaker solutions are discarded.
It uses operators based on crossover \& mutation as the basis of the algorithm to sample the space of solutions.
The steps of a genetic algorithm are as follows: first, create a random population.
Then, while a solution has not been found:
\begin{enumerate}
\item Calculate the fitness of each individual.
\item Select the population for reproduction:
\begin{enumerate}[label=\roman*.]
\item Perform crossover.
\item Perform mutation.
\end{enumerate}
\item Repeat.
\end{enumerate}
\tikzstyle{process} = [rectangle, minimum width=2cm, minimum height=1cm, text centered, draw=black]
\tikzstyle{arrow} = [thick,->,>=stealth]
% \usetikzlibrary{patterns}
\begin{figure}[H]
\centering
\begin{tikzpicture}[node distance=2cm]
\node (reproduction) [process] at (0, 2.5) {Reproduction, Crossover, Mutation};
\node (population) [process] at (-2.5, 0) {Population};
\node (fitness) [process] at (0, -2.5) {Calculate Fitness};
\node (select) [process] at (2.5, 0) {Select Population};
\draw [arrow] (population) -- (fitness);
\draw [arrow] (fitness) -- (select);
\draw [arrow] (select) -- (reproduction);
\draw [arrow] (reproduction) -- (population);
\end{tikzpicture}
\caption{Genetic Algorithm Steps}
\end{figure}
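The cycle in the figure can be sketched as a short Python program. The OneMax fitness (counting 1-bits) and all parameter values here are illustrative stand-ins, not part of the lecture material:

```python
import random

def fitness(individual):
    # Toy "OneMax" fitness: count of 1-bits; a real application would
    # plug in a domain-specific evaluation here.
    return sum(individual)

def tournament_select(population, k=3):
    # Pick k individuals at random (with replacement), keep the fittest.
    contenders = [random.choice(population) for _ in range(k)]
    return max(contenders, key=fitness)

def crossover(a, b):
    # One-point crossover: swap the tails after a random cut point.
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(individual, rate=0.01):
    # Flip each bit independently with a low probability.
    return [bit ^ 1 if random.random() < rate else bit for bit in individual]

def genetic_algorithm(pop_size=40, length=20, generations=50):
    # Create a random population, then iterate the select/crossover/mutate cycle.
    population = [[random.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        next_gen = []
        while len(next_gen) < pop_size:
            a, b = tournament_select(population), tournament_select(population)
            c1, c2 = crossover(a, b)
            next_gen += [mutate(c1), mutate(c2)]
        population = next_gen[:pop_size]
    return max(population, key=fitness)
```

The termination condition here is a fixed number of generations; in practice the loop can also stop as soon as a satisfactory solution is found.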
Traditionally, solutions are represented in binary.
A \textbf{genotype} is the encoding or representation of a solution; a \textbf{phenotype} is the decoding or manifestation of that genotype.
We need an evaluation function which will discriminate between better and worse solutions.
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Crossover Examples}]
Example of one-point crossover:
\texttt{11001\underline{011}} and \texttt{11011\underline{111}} gives \texttt{11001\underline{111}} and \texttt{11011\underline{011}}.
\\\\
Example of $n$-point crossover: \texttt{\underline{110}110\underline{11}01} and \texttt{0001001000} gives \texttt{\underline{110}100\underline{11}00} and \texttt{000\underline{110}10\underline{01}}.
\end{tcolorbox}
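The crossover operators above can be sketched in a few lines of Python (function names are illustrative):

```python
def one_point_crossover(a, b, point):
    # Exchange the tails of two equal-length bit strings after `point`.
    return a[:point] + b[point:], b[:point] + a[point:]

def n_point_crossover(a, b, points):
    # Alternate which parent supplies each segment, switching parent
    # at every cut point in `points`.
    offspring1, offspring2 = [], []
    take_from_a = True
    cuts = [0] + list(points) + [len(a)]
    for start, end in zip(cuts, cuts[1:]):
        src1, src2 = (a, b) if take_from_a else (b, a)
        offspring1.append(src1[start:end])
        offspring2.append(src2[start:end])
        take_from_a = not take_from_a
    return "".join(offspring1), "".join(offspring2)
```

For example, `one_point_crossover("11001011", "11011111", 5)` reproduces the one-point example in the box above.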
\textbf{Mutation} occurs in the genetic algorithm at a much lower rate than crossover.
It adds diversity to the population in the hope that new, better solutions are discovered, and therefore aids the evolution of the population.
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Mutation Example}]
Example of mutation: \texttt{1\underline{1}001001} $\rightarrow$ \texttt{1\underline{0}001001}.
\end{tcolorbox}
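A single-point mutation like the one in the box is just a bit flip at a chosen index (a minimal sketch):

```python
def point_mutate(bits, index):
    # Flip the single bit at `index` of a bit string.
    return bits[:index] + ("0" if bits[index] == "1" else "1") + bits[index + 1:]
```

`point_mutate("11001001", 1)` yields the example above; in a full algorithm the index is chosen at random and the flip is applied with low probability.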
There are two types of selection:
\begin{itemize}
\item \textbf{Roulette wheel selection:} each sector in the wheel is proportional to an individual's fitness.
Select $n$ individuals by means of $n$ roulette turns.
Each individual is drawn independently.
\item \textbf{Tournament selection:} a number of individuals are selected at random with replacement from the population.
The individual with the best score is selected.
This is repeated $n$ times.
\end{itemize}
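Roulette wheel selection can be sketched as follows; each spin lands in a sector proportional to fitness, and the $n$ spins are independent (function names are illustrative):

```python
import random

def roulette_select(population, fitnesses, n):
    # Each individual occupies a wheel sector proportional to its fitness;
    # n independent spins draw n individuals (repeats allowed).
    total = sum(fitnesses)
    chosen = []
    for _ in range(n):
        spin = random.uniform(0, total)
        cumulative = 0.0
        for individual, fit in zip(population, fitnesses):
            cumulative += fit
            if spin <= cumulative:
                chosen.append(individual)
                break
    return chosen
```

In practice `random.choices(population, weights=fitnesses, k=n)` performs the same weighted draw; the explicit loop just makes the wheel visible.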
Issues with genetic algorithms include:
\begin{itemize}
\item Choice of representation for encoding individuals.
\item Definition of fitness function.
\item Definition of selection scheme.
\item Definition of suitable genetic operators.
\item Setting of parameters:
\begin{itemize}
\item Size of population.
\item Number of generations.
\item Probability of crossover.
\item Probability of mutation.
\end{itemize}
\end{itemize}
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Case Study 1: Application of Genetic Algorithms to IR}]
The effectiveness of an IR system is dependent on the quality of the weights assigned to terms in documents.
We have seen heuristic-based approaches and their effectiveness, and we have seen axiomatic approaches that could be considered.
\\\\
Why not learn the weights?
We have a definition of relevant \& non-relevant documents; we can use MAP or precision@$k$ as fitness.
Each genotype can be a set of vectors of length $N$ (the size of the lexicon).
Set all weights randomly initially.
Run the system with a set of queries to obtain fitness; select good chromosomes; crossover; mutate.
Effectively searching the landscape for weights to give a good ranking.
\end{tcolorbox}
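Using MAP (or precision@$k$) as the fitness amounts to scoring each chromosome by the rankings it produces. A minimal sketch, encoding relevance judgements as 1/0 lists in ranked order:

```python
def average_precision(ranked_relevance):
    # ranked_relevance: 1/0 relevance judgements in ranked order.
    # AP averages the precision at each rank where a relevant document occurs.
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(runs):
    # MAP over a set of queries: the fitness used to score a chromosome.
    return sum(average_precision(r) for r in runs) / len(runs)
```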
\subsection{Genetic Programming}
\textbf{Genetic programming} applies the approach of the genetic algorithm to the space of possible computer programs.
``Virtually all problems in artificial intelligence, machine learning, adaptive systems, \& automated learning can be recast as a search for a computer program.
Genetic programming provides a way to successfully conduct the search for a computer program in the space of computer programs.'' -- Koza.
\\\\
A random population of solutions is created which are modelled in a tree structure with operators as internal nodes and operands as leaf nodes.
\begin{figure}[H]
\centering
\usetikzlibrary{trees}
\begin{tikzpicture}
[
every node/.style = {draw, shape=rectangle, align=center},
level distance = 1.5cm,
sibling distance = 1.5cm,
edge from parent/.style={draw,-latex}
]
\node {+}
child { node {1} }
child { node {2} }
child { node {\textsc{if}}
child { node {>}
child { node {\textsc{time}} }
child { node {10} }
}
child { node {3} }
child { node {4} }
};
\end{tikzpicture}
\caption{\texttt{(+ 1 2 (IF (> TIME 10) 3 4))}}
\end{figure}
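The tree in the figure can be represented as nested tuples and evaluated with a small interpreter sketch (the representation and function names are illustrative):

```python
def evaluate(node, env):
    # Leaves are numbers or variable names; internal nodes are (op, children...).
    if isinstance(node, (int, float)):
        return node
    if isinstance(node, str):
        return env[node]
    op, *args = node
    vals = [evaluate(a, env) for a in args]
    if op == "+":
        return sum(vals)
    if op == ">":
        return vals[0] > vals[1]
    if op == "IF":
        return vals[1] if vals[0] else vals[2]
    raise ValueError(f"unknown operator {op}")

# The program from the caption: (+ 1 2 (IF (> TIME 10) 3 4))
tree = ("+", 1, 2, ("IF", (">", "TIME", 10), 3, 4))
```

Crossover on such genotypes swaps subtrees between two programs, and mutation replaces a subtree with a randomly generated one.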
\begin{figure}[H]
\centering
\includegraphics[width=0.4\textwidth]{./images/crossover.png}
\caption{Crossover Example}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.4\textwidth]{./images/mutation.png}
\caption{Mutation Example}
\end{figure}
The genetic programming flow is as follows:
\begin{enumerate}
\item Trees are (usually) created at random.
\item Evaluate how each tree performs in its environment (using a fitness function).
\item Selection occurs based on fitness (tournament selection).
\item Crossover of selected solutions to create new individuals.
\item Repeat until population is replaced.
\item Repeat for $N$ generations.
\end{enumerate}
\subsubsection{Anatomy of a Term-Weighting Scheme}
Typical components of term weighting schemes include:
\begin{itemize}
\item Term frequency aspect.
\item ``Inverse document'' score.
\item Normalisation factor.
\end{itemize}
The search space should be decomposed accordingly.
\subsubsection{Why Separate Learning into Stages?}
The search space using primitive measures \& functions is extremely large;
reducing the search space is advantageous as efficiency is increased.
It eases the analysis of the solutions produced at each stage.
Comparisons to existing benchmarks at each stage can be used to determine whether the GP is finding novel solutions or variations on existing ones, and to identify where any improvement in performance comes from.
\subsubsection{Learning Each of the Three Parts in Turn}
\begin{enumerate}
\item Learn a term-discrimination scheme (i.e., some type of idf) using primitive global measures.
\begin{itemize}
\item 8 terminals \& 8 functions.
\item $T = \{\textit{df}, \textit{cf}, N, V, C, 1, 10, 0.5\}$.
\item $F = \{+, \times, \div, -, \text{square}(), \text{sqrt}(), \text{ln}(), \text{exp}()\}$.
\end{itemize}
\item Use this global measure and learn a term-frequency aspect.
\begin{itemize}
\item 4 terminals \& 8 functions.
\item $T = \{\textit{tf}, 1, 10, 0.4\}$.
\item $F = \{+, \times, \div, -, \text{square}(), \text{sqrt}(), \text{ln}(), \text{exp}()\}$.
\end{itemize}
\item Finally, learn a normalisation scheme.
\begin{itemize}
\item 6 terminals \& 8 functions.
\item $T = \{ \text{dl}, \text{dl}_{\text{avg}}, \text{dl}_\text{dev}, 1, 10, 0.5 \}$.
\item $F = \{ +, \times, \div, -, \text{square}(), \text{sqrt}(), \text{ln}(), \text{exp}() \}$.
\end{itemize}
\end{enumerate}
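A hypothetical composition of the three stages, assuming the parts combine multiplicatively as in tf--idf-style schemes; the individual functions are illustrative placeholders, not the functions actually evolved by the GP:

```python
import math

def tf_aspect(tf, k=1.2):
    # Stage 2 placeholder: a saturating term-frequency aspect.
    return tf / (tf + k)

def global_score(df, N):
    # Stage 1 placeholder: an idf-style term-discrimination score
    # over the primitive global measures.
    return math.log(N / df)

def length_norm(dl, dl_avg, b=0.75):
    # Stage 3 placeholder: a pivoted document-length normalisation factor.
    return (1 - b) + b * (dl / dl_avg)

def term_weight(tf, df, N, dl, dl_avg):
    # Multiplicative composition of the three learned parts.
    return (tf_aspect(tf) / length_norm(dl, dl_avg)) * global_score(df, N)
```

Each stage is learned in turn, holding the previously learned parts fixed, so improvements can be attributed to the stage just evolved.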
\begin{figure}[H]
\centering
\includegraphics[width=0.6\textwidth]{./images/threestages.png}
\caption{Learning Each of the Three Stages in Turn}
\end{figure}
\subsubsection{Details of the Learning Approach}
\begin{itemize}
\item 7 global functions were developed on $\sim$32,000 OHSUMED documents.
\begin{itemize}
\item All validated on a larger unseen collection and the best function taken.
\item Random population of 100 for 50 generations.
\item The fitness function used was MAP.
\end{itemize}
\item 7 tf functions were developed on $\sim$32,000 LATIMES documents.
\begin{itemize}
\item All validated on a larger unseen collection and the best function taken.
\item Random population of 200 for 25 generations.
\item The fitness function used was MAP.
\end{itemize}
\item 7 normalisation functions were developed on 3 collections of $\sim$10,000 LATIMES documents.
\begin{itemize}
\item All validated on a larger unseen collection and the best function taken.
\item Random population of 200 for 25 generations.
\item Fitness function used was average MAP over the 3 collections.
\end{itemize}
\end{itemize}
\subsubsection{Analysis}
The global function $w_3$ always produces a positive number:
\[
w_3 = \sqrt{\frac{\textit{cf}^3_t \cdot N}{\textit{df}^4_t}}
\]
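The positivity of $w_3$ is easy to check numerically (a minimal sketch of the formula above):

```python
import math

def w3(cf, df, N):
    # w3 = sqrt(cf^3 * N / df^4); since cf, df, and N are positive counts,
    # the radicand is positive and so is the score.
    return math.sqrt((cf ** 3) * N / df ** 4)
```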
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Case Study 2: Application of Genetic Programming to IR}]
Evolutionary computing approaches include:
\begin{itemize}
\item Evolutionary strategies.
\item Genetic algorithms.
\item Genetic programming.
\end{itemize}
Why genetic programming for IR?
\begin{itemize}
\item Produces a symbolic representation of a solution which is useful for further analysis.
\item Using training data, MAP can be directly optimised (i.e., used as the fitness function).
\item Solutions produced are often generalisable as solution length (size) can be controlled.
\end{itemize}
\end{tcolorbox}
\end{document}
