%! TeX program = lualatex
\documentclass[a4paper,11pt]{article}
% packages
\usepackage{censor}
\StopCensoring
\usepackage{fontspec}
\setmainfont{EB Garamond}
% for tironian et fallback
% % \directlua{luaotfload.add_fallback
% % ("emojifallback",
% % {"Noto Serif:mode=harf"}
% % )}
% % \setmainfont{EB Garamond}[RawFeature={fallback=emojifallback}]
\setmonofont[Scale=MatchLowercase]{Deja Vu Sans Mono}
\usepackage[a4paper,left=2cm,right=2cm,top=\dimexpr15mm+1.5\baselineskip,bottom=2cm]{geometry}
\setlength{\parindent}{0pt}
\usepackage{fancyhdr} % Headers and footers
\fancyhead[R]{\normalfont \leftmark}
\fancyhead[L]{}
\pagestyle{fancy}
\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{xcolor}
\definecolor{linkblue}{RGB}{0, 64, 128}
\usepackage[final, colorlinks = false, urlcolor = linkblue]{hyperref}
% \newcommand{\secref}[1]{\textbf{§~\nameref{#1}}}
\newcommand{\secref}[1]{\textbf{§\ref{#1}~\nameref{#1}}}
\usepackage{changepage} % adjust margins on the fly
\usepackage{minted}
\usemintedstyle{algol_nu}
\usepackage{pgfplots}
\pgfplotsset{width=\textwidth,compat=1.9}
\usepackage{caption}
\newenvironment{code}{\captionsetup{type=listing}}{}
\captionsetup[listing]{skip=0pt}
\setlength{\abovecaptionskip}{5pt}
\setlength{\belowcaptionskip}{5pt}
\usepackage[yyyymmdd]{datetime}
\renewcommand{\dateseparator}{--}
\usepackage{enumitem}
\usepackage{titlesec}
\author{Andrew Hayes}
\begin{document}
\begin{titlepage}
\begin{center}
\hrule
\vspace*{0.6cm}
\censor{\huge \textbf{CT4101}}
\vspace*{0.6cm}
\hrule
\LARGE
\vspace{0.5cm}
Machine Learning
\vspace{0.5cm}
\hrule
\vfill
\vfill
\hrule
\begin{minipage}{0.495\textwidth}
\vspace{0.4em}
\raggedright
\normalsize
Name: Andrew Hayes \\
E-mail: \href{mailto:a.hayes18@universityofgalway.ie}{\texttt{a.hayes18@universityofgalway.ie}} \hfill\\
Student ID: 21321503 \hfill
\end{minipage}
\begin{minipage}{0.495\textwidth}
\raggedleft
\vspace*{0.8cm}
\Large
\today
\vspace*{0.6cm}
\end{minipage}
\medskip\hrule
\end{center}
\end{titlepage}
\pagenumbering{roman}
\newpage
\tableofcontents
\newpage
\setcounter{page}{1}
\pagenumbering{arabic}
\section{Introduction}
\subsection{Lecturer Contact Details}
\begin{itemize}
\item Dr. Frank Glavin.
\item \href{mailto:frank.glavin@universityofgalway.ie}{\texttt{frank.glavin@universityofgalway.ie}}
\end{itemize}
\subsection{Grading}
\begin{itemize}
\item Continuous Assessment: 30\% (2 assignments, worth 15\% each).
\item Written Exam: 70\% (the last two years' exam papers are the most relevant).
\end{itemize}
\subsection{Module Overview}
\textbf{Machine Learning (ML)} allows computer programs to improve their performance with experience (i.e., data).
This module is targeted at learners with no prior ML experience, but with university experience of mathematics \&
statistics and \textbf{strong} programming skills.
The focus of this module is on practical applications of commonly used ML algorithms, including deep learning
applied to computer vision.
Students will learn to use modern ML frameworks (e.g., scikit-learn, TensorFlow / Keras) to train \& evaluate
models for common categories of ML task including classification, clustering, \& regression.
\subsubsection{Learning Objectives}
On successful completion, a student should be able to:
\begin{enumerate}
\item Explain the details of commonly used Machine Learning algorithms.
\item Apply modern frameworks to develop models for common categories of Machine Learning task, including
classification, clustering, \& regression.
\item Understand how Deep Learning can be applied to computer vision tasks.
\item Pre-process datasets for Machine Learning tasks using techniques such as normalisation \& feature
selection.
\item Select appropriate algorithms \& evaluation metrics for a given dataset \& task.
\item Choose appropriate hyperparameters for a range of Machine Learning algorithms.
\item Evaluate \& interpret the results produced by Machine Learning models.
\item Diagnose \& address commonly encountered problems with Machine Learning models.
\item Discuss ethical issues \& emerging trends in Machine Learning.
\end{enumerate}
\section{What is Machine Learning?}
There are many possible definitions for ``machine learning'':
\begin{itemize}
\item Samuel, 1959: ``Field of study that gives computers the ability to learn without being explicitly
programmed''.
\item Witten \& Frank, 1999: ``Learning is changing behaviour in a way that makes \textit{performance} better
in the future''.
\item Mitchell, 1997: ``Improvement with experience at some task''.
In a well-defined ML problem, a program improves at task $T$, with regard to \textbf{performance} measure $P$,
based on experience $E$.
\item Artificial Intelligence $\neq$ Machine Learning $\neq$ Deep Learning.
\item Artificial Intelligence $\supset$ Machine Learning $\supset$ Deep Learning.
\end{itemize}
Machine Learning techniques include:
\begin{itemize}
\item Supervised learning.
\item Unsupervised learning.
\item Semi-Supervised learning.
\item Reinforcement learning.
\end{itemize}
Major types of ML task include:
\begin{enumerate}
\item Classification.
\item Regression.
\item Clustering.
\item Co-Training.
\item Relationship discovery.
\item Reinforcement learning.
\end{enumerate}
Techniques for these tasks include:
\begin{enumerate}
\item \textbf{Supervised learning:}
\begin{itemize}
\item \textbf{Classification:} decision trees, SVMs.
\item \textbf{Regression:} linear regression, neural nets, $k$-NN (good for classification too).
\end{itemize}
\item \textbf{Unsupervised learning:}
\begin{itemize}
\item \textbf{Clustering:} $k$-Means, EM-clustering.
\item \textbf{Relationship discovery:} association rules, Bayesian nets.
\end{itemize}
\item \textbf{Semi-Supervised learning:}
\begin{itemize}
\item \textbf{Learning from part-labelled data:} co-training, transductive learning (combines ideas
from clustering \& classification).
\end{itemize}
\item \textbf{Reward-Based:}
\begin{itemize}
\item \textbf{Reinforcement learning:} Q-learning, SARSA.
\end{itemize}
\end{enumerate}
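As an illustration of supervised classification, the following is a minimal sketch of $k$-NN using only the Python
standard library.
The function name \mintinline{python}{knn_predict} and the toy data are hypothetical and chosen purely for
illustration; in practice a framework such as scikit-learn would normally be used instead.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every labelled training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbours is the prediction
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy dataset: two well-separated groups of 2-D points
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["a", "a", "a", "b", "b", "b"]

prediction_near_a = knn_predict(train_points, train_labels, (1.2, 1.4))
prediction_near_b = knn_predict(train_points, train_labels, (6.4, 6.2))
```

The labelled training points play the role of the experience $E$: the prediction for a new point depends entirely on
the labelled examples supplied.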
In all cases, the machine searches for a \textbf{hypothesis} that best describes the data presented to it.
Choices to be made include:
\begin{itemize}
\item How is the hypothesis expressed? e.g., mathematical equation, logic rules, diagrammatic form, table,
parameters of a model (e.g. weights of an ANN), etc.
\item How is search carried out? e.g., systematic (breadth-first or depth-first) or heuristic (most promising
first).
\item How do we measure the quality of a hypothesis?
\item What is an appropriate format for the data?
\item How much data is required?
\end{itemize}
To apply ML, we need to know:
\begin{itemize}
\item How to formulate a problem.
\item How to prepare the data.
\item How to select an appropriate algorithm.
\item How to interpret the results.
\end{itemize}
To evaluate results \& compare methods, we need to know:
\begin{itemize}
\item The separation between training, testing, \& validation.
\item Performance measures such as simple metrics, statistical tests, \& graphical methods.
\item How to improve performance.
\item Ensemble methods.
\item Theoretical bounds on performance.
\end{itemize}
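The separation between training, validation, \& testing data can be sketched as a simple shuffled split.
The function name \mintinline{python}{split_dataset} and the 60/20/20 proportions are assumptions made for
illustration only, not values prescribed by the module.

```python
import random


def split_dataset(examples, train_fraction=0.6, validation_fraction=0.2, seed=0):
    """Shuffle the examples and split them into train, validation, and test lists."""
    shuffled = list(examples)
    # A seeded generator makes the split reproducible
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    # Whatever remains is held back for final testing
    test = shuffled[n_train + n_validation:]
    return train, validation, test


examples = list(range(100))  # stand-in for 100 labelled examples
train, validation, test = split_dataset(examples)
```

Each example ends up in exactly one of the three subsets, so no example is used both to train a model and to evaluate it.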
\subsection{Data Mining}
\textbf{Data Mining} is the process of extracting interesting knowledge from large, unstructured datasets.
This knowledge is typically non-obvious, comprehensible, meaningful, \& useful.
\\\\
The storage ``law'' states that storage capacity doubles every year, faster than Moore's ``law'', which may result
in write-only ``data tombs''.
Therefore, developments in ML may be essential to be able to process \& exploit this lost data.
\subsection{Big Data}
\textbf{Big Data} consists of datasets of scale \& complexity such that they can be difficult to process using
current standard methods.
The data scale dimensions are affected by one or more of the ``3 Vs'':
\begin{itemize}
\item \textbf{Volume:} terabytes \& up.
\item \textbf{Velocity:} from batch to streaming data.
\item \textbf{Variety:} numeric, video, sensor, unstructured text, etc.
\end{itemize}
It is also fashionable to add more ``Vs'' that are not key:
\begin{itemize}
\item \textbf{Veracity:} quality \& uncertainty associated with items.
\item \textbf{Variability:} change / inconsistency over time.
\item \textbf{Value:} for the organisation.
\end{itemize}
Key techniques for handling big data include: sampling, inductive learning, clustering, associations, \& distributed
programming methods.
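\\\\
As one possible illustration of sampling applied to streaming data, the sketch below uses reservoir sampling
(``Algorithm R'') to keep a fixed-size uniform random sample of a stream without storing the whole stream.
Reservoir sampling is not named in these notes; it is swapped in here as one example of a sampling technique, and
the function name \mintinline{python}{reservoir_sample} is hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items
            reservoir.append(item)
        else:
            # Replace a stored item with probability k / (index + 1)
            j = rng.randint(0, index)  # both endpoints are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


# The iterator stands in for a data stream that is only seen once
sample = reservoir_sample(iter(range(10_000)), k=5)
```

Only \mintinline{python}{k} items are held in memory at any time, regardless of how long the stream is.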
\section{Introduction to Python}
\textbf{Python} is a general-purpose high-level programming language, created by Guido van Rossum and first released in 1991.
Python programs are interpreted by an \textit{interpreter}, e.g. \textbf{CPython} -- the reference implementation
supported by the Python Software Foundation.
CPython is both a compiler and an interpreter as it first compiles Python code into bytecode before interpreting it.
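The compilation step can be observed with the standard library \mintinline{python}{dis} module.
The following is a small sketch that assumes it is run under CPython; the function \mintinline{python}{add_one} is
hypothetical, and the exact instructions produced differ between Python versions.

```python
import dis


def add_one(x):
    return x + 1


# CPython has already compiled the function body to bytecode,
# which is stored as raw bytes on the function's code object
bytecode = add_one.__code__.co_code

# dis.get_instructions decodes those bytes into named instructions
instruction_names = [instruction.opname for instruction in dis.get_instructions(add_one)]
```

Calling \mintinline{python}{dis.dis(add_one)} prints the same instructions in a human-readable table.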
\\\\
Python interpreters are available for a wide variety of operating systems \& platforms.
Python supports multiple programming paradigms, including procedural programming, object-oriented programming, \&
functional programming.
Python is \textbf{dynamically typed}, unlike languages such as C, C++, \& Java which are \textit{statically typed},
meaning that many common error checks are deferred until runtime in Python, whereas in a statically typed language like Java
these checks are performed during compilation.
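A minimal sketch of a check being deferred until runtime is given below; the function name
\mintinline{python}{describe} is hypothetical.
The type mismatch is only detected when the offending call is actually executed.

```python
def describe(count):
    # Nothing checks the type of `count` until this line actually runs
    return "count: " + count


ok = describe("3")  # str + str works

try:
    describe(3)  # str + int is only detected when this call executes
    error_name = None
except TypeError as error:
    error_name = type(error).__name__
```

A Java compiler would reject the equivalent of the second call before the program ever ran.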
\\\\
Python uses \textbf{garbage collection}, meaning that memory management is handled automatically and there is no need for
the programmer to manually allocate \& de-allocate chunks of memory.
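The effect of automatic memory management can be observed with a weak reference, as in the sketch below.
The class \mintinline{python}{Buffer} is hypothetical, and the behaviour shown assumes CPython, which typically
reclaims an object as soon as its last reference disappears.

```python
import gc
import weakref


class Buffer:
    """Hypothetical class standing in for any object that holds memory."""

    def __init__(self, size):
        self.data = bytearray(size)


buffer = Buffer(1024)
watcher = weakref.ref(buffer)  # a weak reference does not keep the object alive
alive_before = watcher() is not None

del buffer    # drop the only strong reference; there is no explicit free() call
gc.collect()  # ask the collector to run now, in case the object was not already reclaimed
alive_after = watcher() is not None
```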
\\\\
Python is used for all kinds of computational tasks, including:
\begin{itemize}
\item Scientific computing.
\item Data analytics.
\item Artificial Intelligence \& Machine Learning.
\item Computer vision.
\item Web development / web apps.
\item Mobile applications.
\item Desktop GUI applications.
\end{itemize}
While having relatively simple syntax and being easy to learn for beginners, Python also has very advanced
functionality.
It is one of the most widely used programming languages, being both open source \& freely available.
Python programs will run almost anywhere that there is an installation of the Python interpreter.
In contrast, many languages such as C or C++ have separate binaries that must be compiled for each specific platform
\& operating system.
\\\\
Python has a wide array of libraries available, most of which are free \& open source.
Python programs are usually much shorter than the equivalent Java or C++ code, meaning less code to write and
faster development times for experienced Python developers.
Its brevity also means that the code is easier to maintain, debug, \& refactor as much less source code is required
to be read for these tasks.
Python code can also be run without the need for ahead-of-time compilation (as in C or C++), allowing for faster
iterations over code versions \& faster testing.
Python can also be easily extended \& integrated with software written in many other programming languages.
\\\\
Drawbacks of using Python include:
\begin{itemize}
\item \textbf{Efficiency:} Program execution speed in Python is typically much slower than in lower-level
languages such as C or C++.
The relative execution speed of Python compared to C or C++ depends a lot on coding practices and the
specific application being considered.
\item \textbf{Memory Management} in Python is less efficient than well-written C or C++
code, although these efficiency concerns are not usually a major issue, as compute power \& memory are now
relatively cheap on desktop, laptop, \& server systems.
Python is used in the backend of large web services such as Spotify \& Instagram, and performs
adequately.
However, these performance concerns may mean that Python is unsuitable for some performance-critical
applications, e.g. resource-intensive scientific computing, embedded devices, automotive, etc.
Faster alternative Python implementations such as \textbf{PyPy} are also available, with PyPy
providing an average of a four-fold speedup by implementing advanced compilation techniques.
It's also possible to call code that is implemented in C within Python to speed up performance-critical
sections of your program.
\item \textbf{Dynamic typing} can make code more difficult to write \& debug compared to statically-typed
languages, wherein the compiler checks that all variable types match before the code is executed.
\item \textbf{Python2 vs Python3:} There are two major versions of Python in widespread use that are not
compatible with each other due to several changes that were made when Python3 was introduced.
This means that some libraries that were originally written in Python2 have not been ported over to
Python3.
Python2 is now mostly used only in legacy business applications, while most new development is in
Python3.
Python2 has not been supported or received updates since 2020.
\end{itemize}
\subsection{Running Python Programs}
Python programs can be executed in a variety of different ways:
\begin{itemize}
\item through the Python interactive shell on your local machine.
\item through remote Python interactive shells that are accessible through web browsers.
\item by using the console of your operating system to launch a standalone Python script (\verb|.py| file).
\item by using an IDE to launch a \verb|.py| file.
\item as GUI applications using libraries such as Tkinter or PyQt.
\item as web applications that provide services to other computers, e.g. by using the Flask framework to create
a web server with content that can be accessed using web browsers.
\item through Jupyter / JupyterLab notebooks, either hosted locally on your machine or cloud-based Jupyter
notebook execution environments such as Google Colab, Microsoft Azure Notebooks, Binder, etc.
\end{itemize}
\subsection{Hello World}
The following program writes ``Hello World!'' to the screen.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
print("Hello World!")
\end{minted}
\caption{\texttt{helloworld.py}}
\end{code}
\subsection{PEP 8 Style Guide}
\textbf{PEPs (Python Enhancement Proposals)} describe \& document the way in which the Python language evolves over time, e.g. addition of new features, backwards compatibility policy, etc.
PEPs can be proposed, then accepted or rejected.
The full list is available at \url{https://www.python.org/dev/peps/}.
\textbf{PEP 8} gives coding conventions for the Python code comprising the standard library in the main Python
distribution. See: \url{https://www.python.org/dev/peps/pep-0008/}.
It contains conventions for the user-defined names (e.g., variables, functions, packages), as well as code layout,
line length, use of blank lines, style of comments, etc.
\\\\
Many professional Python developers \& companies adhere to (at least some of) the PEP 8 conventions.
It is important to learn to follow these conventions from the start, especially if you want to work with other
programmers, as experienced Python developers will often flag violations of the PEP 8 conventions during code
reviews.
Of course, many companies \& open-source software projects have defined their own internal coding style guidelines
which take precedence over PEP 8 in the case of conflicts.
Following PEP 8 conventions is relatively easy if you are using a good IDE, e.g. PyCharm automatically finds \&
alerts you to violations of the PEP 8 conventions.
\subsubsection{Variable Naming Conventions}
According to PEP 8, variable names ``should be lowercase, with words separated by underscores as necessary to
improve readability'', i.e. \mintinline{python}{snake_case}.
``Never use the characters \verb|l|, \verb|O|, or \verb|I| as single-character variable names.
In some fonts, these characters are indistinguishable from the numerals one \& zero.
When tempted to use \verb|l|, use \verb|L| instead''.
According to PEP 8, different naming conventions are used for different identifiers, e.g.:
``Class names should normally use the CapWords convention''.
This helps programmers to quickly \& easily distinguish which category an identifier name represents.
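The conventions above can be sketched in a short example.
All of the identifiers used here (\mintinline{python}{TrainingExample}, \mintinline{python}{count_features}, etc.)
are hypothetical and exist only to show the naming styles.

```python
class TrainingExample:  # class name: CapWords
    def __init__(self, feature_values, class_label):
        # attribute names: lowercase with underscores
        self.feature_values = feature_values
        self.class_label = class_label


def count_features(example):  # function name: lowercase with underscores
    return len(example.feature_values)


# variable names: lowercase with underscores (snake_case)
first_example = TrainingExample([5.1, 3.5], "setosa")
feature_count = count_features(first_example)
```

Even without seeing the definitions, a reader can tell that \mintinline{python}{TrainingExample} is a class and that
\mintinline{python}{first_example} is a variable.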
\subsection{Dynamic Typing}
In Python, variable names can point to objects of any type.
Built-in data types in Python include \mintinline{python}{str}, \mintinline{python}{int}, \mintinline{python}{float},
etc.
Each type can hold a different type of data.
As we saw, \mintinline{python}{str} can hold any combination of characters.
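\\\\
A minimal sketch of a single name pointing to objects of different types over its lifetime is given below; the
variable name \mintinline{python}{value} is arbitrary.

```python
value = "hello"  # the name refers to a str object
first_type = type(value).__name__

value = 42  # the same name now refers to an int object
second_type = type(value).__name__

value = 3.14  # and now to a float object
third_type = type(value).__name__
```

The type belongs to the object rather than to the name, which is why no type declaration is needed when rebinding
\mintinline{python}{value}.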
\end{document}