[CT4100]: Add Week 7 lecture notes
This commit is contained in:
Binary file not shown.
@ -973,6 +973,261 @@ We can also view collaborative filtering as a machine learning classification pr
Much recent work has been focused on not only giving a recommendation, but also attempting to explain the recommendation to the user.
Questions arise in how best to ``explain'' or visualise the recommendation.

\section{Learning in Information Retrieval}
Many real-world problems are complex, and it is difficult to specify (algorithmically) how to solve them.
Learning techniques are used in many domains to find solutions to problems that may not be obvious or clear to human users.
In general, machine learning involves searching a large space of potential hypotheses or solutions to find the one that best \textit{explains} or \textit{fits} a set of data and any prior knowledge; a system can be said to learn if its performance improves.
\\\\
Machine learning techniques require a training stage before the learned solution can be used on new, previously unseen data.
The training stage consists of a data set of examples which can either be:
\begin{itemize}
\item \textbf{Labelled} (supervised learning).
\item \textbf{Unlabelled} (unsupervised learning).
\end{itemize}

An additional data set must also be used to test the hypothesis/solution.
\\\\
\textbf{Symbolic knowledge} is represented in the form of symbolic descriptions of the learned concepts, e.g., production rules or concept hierarchies.
\textbf{Sub-symbolic knowledge} is represented in a form that is not readable by a user, e.g., in the structure, weights, \& biases of a trained network.

\subsection{Genetic Algorithms}
\textbf{Genetic algorithms} are inspired by the Darwinian theory of evolution:
at each step of the algorithm, the best solutions are selected while the weaker solutions are discarded.
Crossover \& mutation operators form the basis of the algorithm and are used to sample the space of solutions.
The steps of a genetic algorithm are as follows: first, create a random population.
Then, while a solution has not been found:
\begin{enumerate}
\item Calculate the fitness of each individual.
\item Select the population for reproduction:
\begin{enumerate}[label=\roman*.]
\item Perform crossover.
\item Perform mutation.
\end{enumerate}
\item Repeat.
\end{enumerate}

\tikzstyle{process} = [rectangle, minimum width=2cm, minimum height=1cm, text centered, draw=black]
\tikzstyle{arrow} = [thick,->,>=stealth]
% \usetikzlibrary{patterns}

\begin{figure}[H]
\centering
\begin{tikzpicture}[node distance=2cm]
\node (reproduction) [process] at (0, 2.5) {Reproduction, Crossover, Mutation};
\node (population) [process] at (-2.5, 0) {Population};
\node (fitness) [process] at (0, -2.5) {Calculate Fitness};
\node (select) [process] at (2.5, 0) {Select Population};

\draw [arrow] (population) -- (fitness);
\draw [arrow] (fitness) -- (select);
\draw [arrow] (select) -- (reproduction);
\draw [arrow] (reproduction) -- (population);
\end{tikzpicture}
\caption{Genetic Algorithm Steps}
\end{figure}

Traditionally, solutions are represented in binary.
A \textbf{genotype} is the encoding or representation of a solution, while a \textbf{phenotype} is the decoding or manifestation of that genotype.
We need an evaluation function which will discriminate between better and worse solutions.
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Crossover Examples}]
Example of one-point crossover:
\texttt{11001\underline{011}} and \texttt{11011\underline{111}} gives \texttt{11001\underline{111}} and \texttt{11011\underline{011}}.
\\\\
Example of $n$-point crossover: \texttt{\underline{110}110\underline{11}01} and \texttt{0001001000} gives \texttt{\underline{110}100\underline{11}00} and \texttt{000\underline{110}10\underline{01}}.
\end{tcolorbox}
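As a minimal sketch of the boxed examples, assuming bit-string individuals and explicitly supplied cut points (the function names are our own, not from the notes):
\begin{verbatim}
def one_point_crossover(a, b, point):
    """Swap the tails of two parents after the cut point."""
    return a[:point] + b[point:], b[:point] + a[point:]

def n_point_crossover(a, b, points):
    """Alternate which parent contributes each segment between cuts."""
    child1, child2, prev, from_a = "", "", 0, True
    for point in points + [len(a)]:
        src1, src2 = (a, b) if from_a else (b, a)
        child1 += src1[prev:point]
        child2 += src2[prev:point]
        prev, from_a = point, not from_a
    return child1, child2

# Reproduces the boxed examples:
print(one_point_crossover("11001011", "11011111", 5))
# -> ('11001111', '11011011')
print(n_point_crossover("1101101101", "0001001000", [3, 6, 8]))
# -> ('1101001100', '0001101001')
\end{verbatim}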

\textbf{Mutation} occurs in the genetic algorithm at a much lower rate than crossover.
It is important for adding diversity to the population, in the hope that new and better solutions are discovered; it therefore aids the evolution of the population.
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Mutation Example}]
Example of mutation: \texttt{1\underline{1}001001} $\rightarrow$ \texttt{1\underline{0}001001}.
\end{tcolorbox}
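A corresponding sketch of bit-flip mutation; the per-bit rate shown is illustrative, chosen much lower than the crossover rate:
\begin{verbatim}
import random

def mutate(individual, rate=0.01):
    """Flip each bit independently with a small probability."""
    return "".join(
        ("1" if bit == "0" else "0") if random.random() < rate else bit
        for bit in individual
    )

print(mutate("11001001", rate=0.2))  # occasionally flips a bit
\end{verbatim}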

There are two types of selection, both of which are sketched in code after this list:
\begin{itemize}
\item \textbf{Roulette wheel selection:} each sector in the wheel is proportional to an individual's fitness.
Select $n$ individuals by means of $n$ roulette turns.
Each individual is drawn independently.
\item \textbf{Tournament selection:} a number of individuals are selected at random with replacement from the population.
The individual with the best score is selected.
This is repeated $n$ times.
\end{itemize}
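A minimal sketch of both schemes, assuming non-negative fitness values where higher is better (the tournament size of 2 is illustrative):
\begin{verbatim}
import random

def roulette_selection(population, fitness, n):
    """Draw n individuals independently, each with probability
    proportional to its share of the total fitness."""
    return random.choices(population, weights=fitness, k=n)

def tournament_selection(population, fitness, n, size=2):
    """Repeat n times: sample `size` individuals with replacement
    and keep the one with the best score."""
    winners = []
    for _ in range(n):
        picks = random.choices(range(len(population)), k=size)
        winners.append(population[max(picks, key=lambda i: fitness[i])])
    return winners

population = ["1100", "0110", "1111", "0001"]
fitness = [2.0, 3.0, 4.0, 1.0]
print(roulette_selection(population, fitness, n=4))
print(tournament_selection(population, fitness, n=4))
\end{verbatim}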

Issues with genetic algorithms include:
\begin{itemize}
\item Choice of representation for encoding individuals.
\item Definition of fitness function.
\item Definition of selection scheme.
\item Definition of suitable genetic operators.
\item Setting of parameters:
\begin{itemize}
\item Size of population.
\item Number of generations.
\item Probability of crossover.
\item Probability of mutation.
\end{itemize}
\end{itemize}

\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Case Study 1: Application of Genetic Algorithms to IR}]
The effectiveness of an IR system is dependent on the quality of the weights assigned to terms in documents.
We have seen heuristic-based approaches \& their effectiveness, and we have seen axiomatic approaches that could be considered.
\\\\
Why not learn the weights?
We have a definition of relevant \& non-relevant documents; we can use MAP or precision@$k$ as fitness.
Each genotype can be a set of vectors of length $N$ (the size of the lexicon).
Set all weights randomly initially.
Run the system with a set of queries to obtain fitness; select good chromosomes; crossover; mutate.
This effectively searches the space of weights for those that give a good ranking.
\end{tcolorbox}
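A minimal sketch of the fitness evaluation this case study describes, under our own simplifying assumptions: a chromosome assigns one weight per lexicon term, documents are scored by summing the weights of matched query terms, and precision@$k$ is the fitness (all names are illustrative):
\begin{verbatim}
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def fitness(chromosome, docs, query, relevant, k=3):
    """Rank documents by the summed chromosome weights of the
    query terms they contain, then score the ranking.
    `docs` maps a document id to the set of terms it contains."""
    def score(doc_id):
        return sum(chromosome.get(t, 0.0) for t in docs[doc_id] & query)
    ranked = sorted(docs, key=score, reverse=True)
    return precision_at_k(ranked, relevant, k)
\end{verbatim}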
\subsection{Genetic Programming}
\textbf{Genetic programming} applies the approach of the genetic algorithm to the space of possible computer programs.
``Virtually all problems in artificial intelligence, machine learning, adaptive systems, \& automated learning can be recast as a search for a computer program.
Genetic programming provides a way to successfully conduct the search for a computer program in the space of computer programs.'' -- Koza.
\\\\
A random population of solutions is created; these are modelled as tree structures, with operators as internal nodes and operands as leaf nodes.

\begin{figure}[H]
\centering
\usetikzlibrary{trees}
\begin{tikzpicture}
[
every node/.style = {draw, shape=rectangle, align=center},
level distance = 1.5cm,
sibling distance = 1.5cm,
edge from parent/.style={draw,-latex}
]
\node {+}
child { node {1} }
child { node {2} }
child { node {\textsc{if}}
child { node {$>$}
child { node {\textsc{time}} }
child { node {10} }
}
child { node {3} }
child { node {4} }
};
\end{tikzpicture}
\caption{\texttt{(+ 1 2 (IF (> TIME 10) 3 4))}}
\end{figure}
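The captioned program can be represented and evaluated with a short sketch, using nested tuples for GP trees; the primitive set shown is illustrative:
\begin{verbatim}
# Evaluate the GP tree (+ 1 2 (IF (> TIME 10) 3 4)); trees are nested
# tuples of the form (operator, child, ...), terminals are numbers or
# variable names looked up in an environment.
import operator

PRIMITIVES = {
    "+": lambda *args: sum(args),
    ">": operator.gt,
    "IF": lambda cond, then, alt: then if cond else alt,
}

def evaluate(tree, env):
    if isinstance(tree, tuple):
        op, *children = tree
        return PRIMITIVES[op](*(evaluate(c, env) for c in children))
    if isinstance(tree, str):      # a terminal variable, e.g. TIME
        return env[tree]
    return tree                    # a numeric constant

tree = ("+", 1, 2, ("IF", (">", "TIME", 10), 3, 4))
print(evaluate(tree, {"TIME": 12}))   # 1 + 2 + 3 = 6
print(evaluate(tree, {"TIME": 7}))    # 1 + 2 + 4 = 7
\end{verbatim}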

\begin{figure}[H]
\centering
\includegraphics[width=0.4\textwidth]{./images/crossover.png}
\caption{Crossover Example}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=0.4\textwidth]{./images/mutation.png}
\caption{Mutation Example}
\end{figure}

The genetic programming flow is as follows:
\begin{enumerate}
\item Trees are (usually) created at random.
\item Evaluate how each tree performs in its environment (using a fitness function).
\item Selection occurs based on fitness (tournament selection).
\item Crossover of selected solutions to create new individuals.
\item Repeat until the population is replaced.
\item Repeat for $N$ generations.
\end{enumerate}

\subsubsection{Anatomy of a Term-Weighting Scheme}
Typical components of term weighting schemes include:
\begin{itemize}
\item Term frequency aspect.
\item ``Inverse document'' score.
\item Normalisation factor.
\end{itemize}

The search space should be decomposed accordingly.
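As a hedged illustration (this composition is ours, not from the notes), the classic cosine-normalised tf-idf scheme instantiates all three components:
\[
w_{t,d} = \frac{\overbrace{\mathit{tf}_{t,d}}^{\text{tf aspect}} \cdot \overbrace{\log \frac{N}{\mathit{df}_t}}^{\text{``inverse document'' score}}}{\underbrace{\sqrt{\sum_{s \in d} \left( \mathit{tf}_{s,d} \log \frac{N}{\mathit{df}_s} \right)^2}}_{\text{normalisation factor}}}
\]
Learning can then search for each component separately rather than for the whole formula at once.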
\subsubsection{Why Separate Learning into Stages?}
The search space using primitive measures \& functions is extremely large;
reducing the search space is advantageous as efficiency is increased.
It also eases the analysis of the solutions produced at each stage.
Comparisons to existing benchmarks at each of these stages can be used to determine whether the GP is finding novel solutions or variations on existing solutions.
It can then be identified where any improvement in performance comes from.
\subsubsection{Learning Each of the Three Parts in Turn}
\begin{enumerate}
\item Learn a term-discrimination scheme (i.e., some type of idf) using primitive global measures (random tree creation from these sets is sketched after this list).
\begin{itemize}
\item 8 terminals \& 8 functions.
\item $T = \{\textit{df}, \textit{cf}, N, V, C, 1, 10, 0.5\}$.
\item $F = \{+, \times, \div, -, \text{square}(), \text{sqrt}(), \text{ln}(), \text{exp}()\}$.
\end{itemize}

\item Use this global measure and learn a term-frequency aspect.
\begin{itemize}
\item 4 terminals \& 8 functions.
\item $T = \{\textit{tf}, 1, 10, 0.4\}$.
\item $F = \{+, \times, \div, -, \text{square}(), \text{sqrt}(), \text{ln}(), \text{exp}()\}$.
\end{itemize}

\item Finally, learn a normalisation scheme.
\begin{itemize}
\item 6 terminals \& 8 functions.
\item $T = \{ \text{dl}, \text{dl}_{\text{avg}}, \text{dl}_\text{dev}, 1, 10, 0.5 \}$.
\item $F = \{ +, \times, \div, -, \text{square}(), \text{sqrt}(), \text{ln}(), \text{exp}() \}$.
\end{itemize}
\end{enumerate}
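A minimal sketch of creating one random stage-1 tree (a ``grow''-style initialisation) from the terminal \& function sets above; the depth limit and probabilities are illustrative:
\begin{verbatim}
import random

TERMINALS = ["df", "cf", "N", "V", "C", 1, 10, 0.5]
FUNCTIONS = {"+": 2, "*": 2, "/": 2, "-": 2,
             "square": 1, "sqrt": 1, "ln": 1, "exp": 1}  # name -> arity

def random_tree(max_depth):
    """Pick a terminal at the depth limit (or at random early);
    otherwise pick a function and recurse for each argument."""
    if max_depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    name, arity = random.choice(list(FUNCTIONS.items()))
    return (name, *(random_tree(max_depth - 1) for _ in range(arity)))

print(random_tree(3))   # a nested tuple such as ('sqrt', ('*', 'df', 10))
\end{verbatim}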
\begin{figure}[H]
\centering
\includegraphics[width=0.6\textwidth]{./images/threestages.png}
\caption{Learning Each of the Three Stages in Turn}
\end{figure}

\subsubsection{Details of the Learning Approach}
\begin{itemize}
\item 7 global functions were developed on $\sim$32,000 OHSUMED documents.
\begin{itemize}
\item All were validated on a larger unseen collection and the best function taken.
\item Random population of 100 for 50 generations.
\item The fitness function used was MAP.
\end{itemize}

\item 7 tf functions were developed on $\sim$32,000 LATIMES documents.
\begin{itemize}
\item All were validated on a larger unseen collection and the best function taken.
\item Random population of 200 for 25 generations.
\item The fitness function used was MAP.
\end{itemize}

\item 7 normalisation functions were developed on 3 $\times$ $\sim$10,000 LATIMES documents.
\begin{itemize}
\item All were validated on a larger unseen collection and the best function taken.
\item Random population of 200 for 25 generations.
\item The fitness function used was the average MAP over the 3 collections.
\end{itemize}
\end{itemize}

\subsubsection{Analysis}
The global function $w_3$ always produces a positive number, since $\textit{cf}_t$, $\textit{df}_t$, \& $N$ are all positive counts:
\[
w_3 = \sqrt{\frac{\textit{cf}^3_t \cdot N}{\textit{df}^4_t}}
\]
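As a quick worked instance (the numbers are illustrative, not from the notes): for a term with $\textit{cf}_t = 100$, $\textit{df}_t = 10$, \& $N = 1000$,
\[
w_3 = \sqrt{\frac{100^3 \cdot 1000}{10^4}} = \sqrt{10^5} \approx 316.2,
\]
so terms that occur frequently overall but are concentrated in few documents receive large weights.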
\begin{tcolorbox}[colback=gray!10, colframe=black, title=\textbf{Case Study 2: Application of Genetic Programming to IR}]
Evolutionary computing approaches include:
\begin{itemize}
\item Evolutionary strategies.
\item Genetic algorithms.
\item Genetic programming.
\end{itemize}

Why genetic programming for IR?
\begin{itemize}
\item It produces a symbolic representation of a solution, which is useful for further analysis.
\item Using training data, MAP can be directly optimised (i.e., used as the fitness function).
\item Solutions produced are often generalisable, as solution length (size) can be controlled.
\end{itemize}
\end{tcolorbox}

\end{document}
Binary file not shown.
After Width: | Height: | Size: 78 KiB |
Binary file not shown.
After Width: | Height: | Size: 37 KiB |
Binary file not shown.
After Width: | Height: | Size: 126 KiB |