%! TeX program = lualatex
\documentclass[a4paper,11pt]{article}
% packages
\usepackage{censor}
\StopCensoring
\usepackage{fontspec}
\setmainfont{EB Garamond}
% for tironian et fallback
% % \directlua{luaotfload.add_fallback
% % ("emojifallback",
% % {"Noto Serif:mode=harf"}
% % )}
% % \setmainfont{EB Garamond}[RawFeature={fallback=emojifallback}]

\setmonofont[Scale=MatchLowercase]{Deja Vu Sans Mono}
\usepackage[a4paper,left=2cm,right=2cm,top=\dimexpr15mm+1.5\baselineskip,bottom=2cm]{geometry}
\setlength{\parindent}{0pt}

\usepackage{fancyhdr} % Headers and footers
\fancyhead[R]{\normalfont \leftmark}
\fancyhead[L]{}
\pagestyle{fancy}

\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{xcolor}
\definecolor{linkblue}{RGB}{0, 64, 128}
\usepackage[final, colorlinks = false, urlcolor = linkblue]{hyperref}
% \newcommand{\secref}[1]{\textbf{§~\nameref{#1}}}
\newcommand{\secref}[1]{\textbf{§\ref{#1}~\nameref{#1}}}

\usepackage{changepage} % adjust margins on the fly

\usepackage{minted}
\usemintedstyle{algol_nu}

\usepackage{pgfplots}
\pgfplotsset{width=\textwidth,compat=1.9}

\usepackage{caption}
\newenvironment{code}{\captionsetup{type=listing}}{}
\captionsetup[listing]{skip=0pt}
\setlength{\abovecaptionskip}{5pt}
\setlength{\belowcaptionskip}{5pt}

\usepackage[yyyymmdd]{datetime}
\renewcommand{\dateseparator}{--}

\usepackage{enumitem}

\usepackage{titlesec}

\author{Andrew Hayes}

\begin{document}
\begin{titlepage}
    \begin{center}
        \hrule
        \vspace*{0.6cm}
        \censor{\huge \textbf{CT4101}}
        \vspace*{0.6cm}
        \hrule
        \LARGE
        \vspace{0.5cm}
        Machine Learning
        \vspace{0.5cm}
        \hrule

        \vfill
        \vfill

        \hrule
        \begin{minipage}{0.495\textwidth}
            \vspace{0.4em}
            \raggedright
            \normalsize
            Name: Andrew Hayes \\
            E-mail: \href{mailto:a.hayes18@universityofgalway.ie}{\texttt{a.hayes18@universityofgalway.ie}} \hfill\\
            Student ID: 21321503 \hfill
        \end{minipage}
        \begin{minipage}{0.495\textwidth}
            \raggedleft
            \vspace*{0.8cm}
            \Large
            \today
            \vspace*{0.6cm}
        \end{minipage}
        \medskip\hrule
    \end{center}
\end{titlepage}

\pagenumbering{roman}
\newpage
\tableofcontents
\newpage
\setcounter{page}{1}
\pagenumbering{arabic}

\section{Introduction}
\subsection{Lecturer Contact Details}
\begin{itemize}
    \item Dr. Frank Glavin.
    \item \href{mailto:frank.glavin@universityofgalway.ie}{\texttt{frank.glavin@universityofgalway.ie}}
\end{itemize}

\subsection{Grading}
\begin{itemize}
    \item Continuous Assessment: 30\% (2 assignments, worth 15\% each).
    \item Written Exam: 70\% (the last 2 years' exam papers are the most relevant).
\end{itemize}

\subsection{Module Overview}
\textbf{Machine Learning (ML)} allows computer programs to improve their performance with experience (i.e., data).
This module is targeted at learners with no prior ML experience, but with university experience of mathematics \&
statistics and \textbf{strong} programming skills.
The focus of this module is on practical applications of commonly used ML algorithms, including deep learning
applied to computer vision.
Students will learn to use modern ML frameworks (e.g., scikit-learn, TensorFlow / Keras) to train \& evaluate
models for common categories of ML task, including classification, clustering, \& regression.

\subsubsection{Learning Objectives}
On successful completion, a student should be able to:
\begin{enumerate}
    \item Explain the details of commonly used Machine Learning algorithms.
    \item Apply modern frameworks to develop models for common categories of Machine Learning task, including
          classification, clustering, \& regression.
    \item Understand how Deep Learning can be applied to computer vision tasks.
    \item Pre-process datasets for Machine Learning tasks using techniques such as normalisation \& feature
          selection.
    \item Select appropriate algorithms \& evaluation metrics for a given dataset \& task.
    \item Choose appropriate hyperparameters for a range of Machine Learning algorithms.
    \item Evaluate \& interpret the results produced by Machine Learning models.
    \item Diagnose \& address commonly encountered problems with Machine Learning models.
    \item Discuss ethical issues \& emerging trends in Machine Learning.
\end{enumerate}

\section{What is Machine Learning?}
There are many possible definitions for ``machine learning'':
\begin{itemize}
    \item Samuel, 1959: ``Field of study that gives computers the ability to learn without being explicitly
          programmed''.
    \item Witten \& Frank, 1999: ``Learning is changing behaviour in a way that makes \textit{performance} better
          in the future''.
    \item Mitchell, 1997: ``Improvement with experience at some task''.
          In a well-defined ML problem, performance at task $T$, as measured by \textbf{performance} measure $P$,
          improves with experience $E$.
    \item Artificial Intelligence $\neq$ Machine Learning $\neq$ Deep Learning.
    \item Artificial Intelligence $\supset$ Machine Learning $\supset$ Deep Learning.
\end{itemize}

Machine Learning techniques include:
\begin{itemize}
    \item Supervised learning.
    \item Unsupervised learning.
    \item Semi-Supervised learning.
    \item Reinforcement learning.
\end{itemize}

Major types of ML task include:
\begin{enumerate}
    \item Classification.
    \item Regression.
    \item Clustering.
    \item Co-Training.
    \item Relationship discovery.
    \item Reinforcement learning.
\end{enumerate}

Techniques for these tasks include:
\begin{enumerate}
    \item \textbf{Supervised learning:}
          \begin{itemize}
              \item \textbf{Classification:} decision trees, SVMs.
              \item \textbf{Regression:} linear regression, neural nets, $k$-NN (good for classification too).
          \end{itemize}

    \item \textbf{Unsupervised learning:}
          \begin{itemize}
              \item \textbf{Clustering:} $k$-Means, EM-clustering.
              \item \textbf{Relationship discovery:} association rules, Bayesian networks.
          \end{itemize}

    \item \textbf{Semi-Supervised learning:}
          \begin{itemize}
              \item \textbf{Learning from part-labelled data:} co-training, transductive learning (combines ideas
                    from clustering \& classification).
          \end{itemize}

    \item \textbf{Reward-Based:}
          \begin{itemize}
              \item \textbf{Reinforcement learning:} Q-learning, SARSA.
          \end{itemize}
\end{enumerate}

In all cases, the machine searches for a \textbf{hypothesis} that best describes the data presented to it.
Choices to be made include:
\begin{itemize}
    \item How is the hypothesis expressed? e.g., mathematical equation, logic rules, diagrammatic form, table,
          parameters of a model (e.g., weights of an ANN), etc.
    \item How is the search carried out? e.g., systematic (breadth-first or depth-first) or heuristic (most promising
          first).
    \item How do we measure the quality of a hypothesis?
    \item What is an appropriate format for the data?
    \item How much data is required?
\end{itemize}

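The choices above can be made concrete with a deliberately tiny sketch.
It assumes that the hypothesis is a single model parameter (the slope $w$ of the line $y = wx$), that quality is measured by the mean squared error, and that the search is a systematic scan over a fixed list of candidate slopes.
The data points \& the names \mintinline{python}{mean_squared_error} \& \mintinline{python}{candidate_slopes} are invented for illustration and are not part of the module material.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
# Toy data, invented for illustration: y is roughly 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]


def mean_squared_error(w, xs, ys):
    """Quality measure: average squared gap between w * x and y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothesis space: the slopes 0.0, 0.1, ..., 4.0, scanned systematically.
candidate_slopes = [i / 10 for i in range(41)]

# Search: keep the candidate hypothesis with the lowest error.
best_slope = min(candidate_slopes, key=lambda w: mean_squared_error(w, xs, ys))

print(f"best slope: {best_slope}")
\end{minted}
\caption{Sketch: searching a small hypothesis space for the best slope}
\end{code}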
To apply ML, we need to know:
\begin{itemize}
    \item How to formulate a problem.
    \item How to prepare the data.
    \item How to select an appropriate algorithm.
    \item How to interpret the results.
\end{itemize}

To evaluate results \& compare methods, we need to know:
\begin{itemize}
    \item The separation between training, testing, \& validation.
    \item Performance measures such as simple metrics, statistical tests, \& graphical methods.
    \item How to improve performance.
    \item Ensemble methods.
    \item Theoretical bounds on performance.
\end{itemize}

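As a minimal sketch of the separation between training, validation, \& testing data, the function below shuffles a dataset and cuts it into three disjoint parts.
The function name \mintinline{python}{split_dataset}, the 60/20/20 proportions, \& the fixed seed are assumptions made for illustration only; ML frameworks provide their own utilities for this.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
import random


def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of the examples and cut it into three disjoint parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for testing
    return train, validation, test


train, validation, test = split_dataset(range(10))
print(len(train), len(validation), len(test))
\end{minted}
\caption{Sketch: splitting a dataset into training, validation, \& test sets}
\end{code}
Keeping the three parts disjoint is the point of the exercise: a model evaluated on examples that it was trained on will appear better than it really is.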
\subsection{Data Mining}
\textbf{Data Mining} is the process of extracting interesting knowledge from large, unstructured datasets.
This knowledge is typically non-obvious, comprehensible, meaningful, \& useful.
\\\\
The storage ``law'' states that storage capacity doubles every year, faster than Moore's ``law'', which may result
in write-only ``data tombs''.
Therefore, developments in ML may be essential to be able to process \& exploit this otherwise lost data.

\subsection{Big Data}
\textbf{Big Data} consists of datasets of scale \& complexity such that they can be difficult to process using
current standard methods.
The data scale dimensions are affected by one or more of the ``3 Vs'':
\begin{itemize}
    \item \textbf{Volume:} terabytes \& up.
    \item \textbf{Velocity:} from batch to streaming data.
    \item \textbf{Variety:} numeric, video, sensor, unstructured text, etc.
\end{itemize}

It is also fashionable to add more ``Vs'' that are not key:
\begin{itemize}
    \item \textbf{Veracity:} quality \& uncertainty associated with items.
    \item \textbf{Variability:} change / inconsistency over time.
    \item \textbf{Value:} for the organisation.
\end{itemize}

Key techniques for handling big data include: sampling, inductive learning, clustering, associations, \& distributed
programming methods.

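To illustrate how sampling can cope with streaming data, the sketch below uses reservoir sampling, one possible technique that is not named in the notes above.
It keeps a uniform random sample of $k$ items from a stream whose length is not known in advance, without storing the whole stream.
The function name \mintinline{python}{reservoir_sample} \& the fixed seed are illustrative assumptions.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
import random


def reservoir_sample(stream, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


sample = reservoir_sample(range(100_000), k=5)
print(sample)
\end{minted}
\caption{Sketch: reservoir sampling over a stream}
\end{code}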
\section{Introduction to Python}
\textbf{Python} is a general-purpose high-level programming language, first released by Guido van Rossum in 1991.
Python programs are executed by an \textit{interpreter}, e.g. \textbf{CPython}, the reference implementation
supported by the Python Software Foundation.
CPython is both a compiler and an interpreter, as it first compiles Python code into bytecode before interpreting it.
\\\\
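The two stages can be observed separately using the standard library.
In the sketch below, the built-in \mintinline{python}{compile()} produces a code object, the \mintinline{python}{dis} module prints its bytecode, \& \mintinline{python}{exec()} then runs it.
The variable names \mintinline{python}{price} \& \mintinline{python}{quantity} are invented for illustration, and the exact instructions printed differ between Python versions.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
import dis

source = "total = price * quantity"

# compile() performs the compilation step on its own and returns a code object.
code_object = compile(source, "<example>", "exec")

# dis.dis() prints the bytecode instructions that the interpreter will execute.
dis.dis(code_object)

# Executing the code object is the "interpreting" half of the pipeline.
namespace = {"price": 2.5, "quantity": 4}
exec(code_object, namespace)
print(namespace["total"])
\end{minted}
\caption{Sketch: inspecting the bytecode that CPython compiles}
\end{code}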
Python interpreters are available for a wide variety of operating systems \& platforms.
Python supports multiple programming paradigms, including procedural programming, object-oriented programming, \&
functional programming.
Python is \textbf{dynamically typed}, unlike languages such as C, C++, \& Java, which are \textit{statically typed}.
This means that many common error checks are deferred until runtime in Python, whereas in a statically typed language
like Java these checks are performed during compilation.
\\\\
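A small sketch of what ``deferred until runtime'' means in practice: the call with a string argument below is accepted when the file is loaded, \& the mistake only surfaces when the multiplication is attempted.
The function \mintinline{python}{add_tax} \& its tax rate are hypothetical.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
def add_tax(price):
    # No type is declared for `price`, so nothing is checked until this line runs.
    return price * 1.23


print(add_tax(100))  # works: an int is a valid operand

try:
    add_tax("100")  # a str is only rejected once the multiplication is attempted
except TypeError as error:
    print(f"caught at runtime: {error}")
\end{minted}
\caption{Sketch: a type error that is only detected at runtime}
\end{code}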
Python uses \textbf{garbage collection}, meaning that memory management is handled automatically and there is no need for
the programmer to manually allocate \& de-allocate chunks of memory.
\\\\
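As a rough illustration, a weak reference can be used to watch an object being reclaimed once nothing refers to it any more.
The class \mintinline{python}{Buffer} is a hypothetical stand-in; the explicit \mintinline{python}{gc.collect()} call is included because not every Python implementation frees objects immediately.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
import gc
import weakref


class Buffer:
    """Hypothetical stand-in for an object that owns some memory."""


buffer = Buffer()
# A weak reference lets us observe the object without keeping it alive.
watcher = weakref.ref(buffer)
print(watcher() is buffer)

# Dropping the last strong reference makes the object unreachable ...
del buffer
# ... and an explicit collection pass covers implementations that free lazily.
gc.collect()
print(watcher() is None)
\end{minted}
\caption{Sketch: observing automatic memory reclamation}
\end{code}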
Python is used for all kinds of computational tasks, including:
\begin{itemize}
    \item Scientific computing.
    \item Data analytics.
    \item Artificial Intelligence \& Machine Learning.
    \item Computer vision.
    \item Web development / web apps.
    \item Mobile applications.
    \item Desktop GUI applications.
\end{itemize}

While having relatively simple syntax and being easy to learn for beginners, Python also has very advanced
functionality.
It is one of the most widely used programming languages, being both open source \& freely available.
Python programs will run almost anywhere that there is an installation of the Python interpreter.
In contrast, programs written in languages such as C or C++ must be compiled into separate binaries for each
specific platform \& operating system.
\\\\
Python has a wide array of libraries available, most of which are free \& open source.
Python programs are usually much shorter than the equivalent Java or C++ code, meaning less code to write and
faster development times for experienced Python developers.
Its brevity also means that the code is easier to maintain, debug, \& refactor, as much less source code has
to be read for these tasks.
Python code can also be run without the need for ahead-of-time compilation (as in C or C++), allowing for faster
iterations over code versions \& faster testing.
Python can also be easily extended \& integrated with software written in many other programming languages.
\\\\
Drawbacks of using Python include:
\begin{itemize}
    \item \textbf{Efficiency:} Program execution speed in Python is typically a lot slower than in lower-level
          languages such as C or C++.
          The relative execution speed of Python compared to C or C++ depends a lot on coding practices and the
          specific application being considered.

    \item \textbf{Memory management} in Python is less efficient than in well-written C or C++
          code, although these efficiency concerns are not usually a major issue, as compute power \& memory are now
          relatively cheap on desktop, laptop, \& server systems.
          Python is used in the backend of large web services such as Spotify \& Instagram, and performs
          adequately.
          However, these performance concerns may mean that Python is unsuitable for some performance-critical
          applications, e.g. resource-intensive scientific computing, embedded devices, automotive systems, etc.
          Faster alternative Python implementations such as \textbf{PyPy} are also available, with PyPy
          providing an average four-fold speedup by implementing advanced compilation techniques.
          It is also possible to call code that is implemented in C from within Python to speed up
          performance-critical sections of your program.

    \item \textbf{Dynamic typing} can make code more difficult to write \& debug compared to statically-typed
          languages, wherein the compiler checks that all variable types match before the code is executed.

    \item \textbf{Python2 vs Python3:} There are two major versions of Python in widespread use that are not
          compatible with each other due to several changes that were made when Python3 was introduced.
          This means that some libraries that were originally written in Python2 have not been ported over to
          Python3.
          Python2 is now mostly used only in legacy business applications, while most new development is in
          Python3.
          As of 2020, Python2 is no longer supported and no longer receives updates.
\end{itemize}

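The gap between interpreted Python \& code implemented in C can be seen without leaving the standard library.
The sketch below times a hand-written loop against the built-in \mintinline{python}{sum()}, which is implemented in C in CPython.
The helper \mintinline{python}{python_loop_sum} is written for illustration, and the measured times depend entirely on the machine \& Python version, so no particular ratio should be expected.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
import timeit

numbers = list(range(100_000))


def python_loop_sum(values):
    """Add the values one at a time in interpreted Python code."""
    total = 0
    for value in values:
        total += value
    return total


# Both approaches must agree before their speed is compared.
assert python_loop_sum(numbers) == sum(numbers)

loop_seconds = timeit.timeit(lambda: python_loop_sum(numbers), number=20)
builtin_seconds = timeit.timeit(lambda: sum(numbers), number=20)
print(f"Python loop: {loop_seconds:.4f} s, built-in sum: {builtin_seconds:.4f} s")
\end{minted}
\caption{Sketch: timing an interpreted loop against a C-implemented built-in}
\end{code}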
\subsection{Running Python Programs}
Python programs can be executed in a variety of different ways:
\begin{itemize}
    \item through the Python interactive shell on your local machine.
    \item through remote Python interactive shells that are accessible through web browsers.
    \item by using the console of your operating system to launch a standalone Python script (\verb|.py| file).
    \item by using an IDE to launch a \verb|.py| file.
    \item as GUI applications using libraries such as Tkinter or PyQt.
    \item as web applications that provide services to other computers, e.g. by using the Flask framework to create
          a web server with content that can be accessed using web browsers.
    \item through Jupyter / JupyterLab notebooks, either hosted locally on your machine or cloud-based Jupyter
          notebook execution environments such as Google Colab, Microsoft Azure Notebooks, Binder, etc.
\end{itemize}

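As a sketch of the standalone-script route, the file below (hypothetically saved as \verb|greet.py|) could be launched from a console with \verb|python greet.py Ada|.
The names \mintinline{python}{greet} \& \mintinline{python}{main} are illustrative; the \mintinline{python}{__name__} check is a common convention that lets the same file also be imported from an interactive shell or a notebook without running the script body.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
import sys


def greet(name):
    """Build the greeting so that it can be reused from a shell or another module."""
    return f"Hello, {name}!"


def main(argv):
    # Use the first command-line argument if one is given, otherwise a default.
    name = argv[1] if len(argv) > 1 else "World"
    print(greet(name))


if __name__ == "__main__":
    # True only when the file is launched as a script, not when it is imported.
    main(sys.argv)
\end{minted}
\caption{Sketch: a standalone script that can also be imported}
\end{code}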
\subsection{Hello World}
The following program writes ``Hello World!'' to the screen.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
print("Hello World!")
\end{minted}
\caption{\texttt{helloworld.py}}
\end{code}

\subsection{PEP 8 Style Guide}
\textbf{PEPs (Python Enhancement Proposals)} describe \& document the way in which the Python language evolves over
time, e.g. the addition of new features, backwards-compatibility policy, etc.
PEPs can be proposed, then accepted or rejected.
The full list is available at \url{https://www.python.org/dev/peps/}.
\textbf{PEP 8} gives coding conventions for the Python code comprising the standard library in the main Python
distribution. See: \url{https://www.python.org/dev/peps/pep-0008/}.
It contains conventions for user-defined names (e.g., variables, functions, packages), as well as code layout,
line length, use of blank lines, style of comments, etc.
\\\\
Many professional Python developers \& companies adhere to (at least some of) the PEP 8 conventions.
It is important to learn to follow these conventions from the start, especially if you want to work with other
programmers, as experienced Python developers will often flag violations of the PEP 8 conventions during code
reviews.
Of course, many companies \& open-source software projects have defined their own internal coding style guidelines
which take precedence over PEP 8 in the case of conflicts.
Following PEP 8 conventions is relatively easy if you are using a good IDE, e.g. PyCharm automatically finds \&
alerts you to violations of the PEP 8 conventions.

\subsubsection{Variable Naming Conventions}
According to PEP 8, variable names ``should be lowercase, with words separated by underscores as necessary to
improve readability'', i.e. \mintinline{python}{snake_case}.
``Never use the characters \verb|l|, \verb|O|, or \verb|I| as single-character variable names.
In some fonts, these characters are indistinguishable from the numerals one \& zero.
When tempted to use \verb|l|, use \verb|L| instead''.
According to PEP 8, different naming conventions are used for different identifiers, e.g.:
``Class names should normally use the CapWords convention''.
This helps programmers to quickly \& easily distinguish which category an identifier name represents.

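The conventions above can be seen side by side in a short sketch.
All of the identifiers below are invented for illustration; the upper-case style for constants is also taken from PEP 8, although it is not quoted above.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
MAX_RETRIES = 3  # constant: upper case with underscores


class DataLoader:  # class name: CapWords
    def __init__(self, batch_size):
        self.batch_size = batch_size  # attribute: snake_case

    def load_batch(self, row_count):  # method and argument: snake_case
        return min(self.batch_size, row_count)


training_loader = DataLoader(batch_size=32)  # variable: snake_case
print(training_loader.load_batch(row_count=100))
\end{minted}
\caption{Sketch: PEP 8 naming conventions for different kinds of identifier}
\end{code}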
\subsection{Dynamic Typing}
In Python, variable names can point to objects of any type.
Built-in data types in Python include \mintinline{python}{str}, \mintinline{python}{int}, \mintinline{python}{float},
etc.
Each type can hold a different type of data.
As we saw, \mintinline{python}{str} can hold any combination of characters.

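A short sketch of a single name pointing to objects of different types over its lifetime; the name \mintinline{python}{value} is arbitrary, \& the built-in \mintinline{python}{type()} reports the type of whatever object the name currently points to.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
value = "forty-two"  # the name points to a str object
print(type(value).__name__)

value = 42  # the same name now points to an int object
print(type(value).__name__)

value = 42 / 5  # true division produces a float object
print(type(value).__name__)
\end{minted}
\caption{Sketch: rebinding one name to objects of different types}
\end{code}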
\end{document}