%! TeX program = lualatex
\documentclass[a4paper,11pt]{article}
% packages
\usepackage{censor}
\StopCensoring
\usepackage{fontspec}
\setmainfont{EB Garamond}
% for tironian et fallback
% % \directlua{luaotfload.add_fallback
% % ("emojifallback",
% %      {"Noto Serif:mode=harf"}
% % )}
% % \setmainfont{EB Garamond}[RawFeature={fallback=emojifallback}]

\setmonofont[Scale=MatchLowercase]{Deja Vu Sans Mono}
\usepackage[a4paper,left=2cm,right=2cm,top=\dimexpr15mm+1.5\baselineskip,bottom=2cm]{geometry}
\setlength{\parindent}{0pt}

\usepackage{fancyhdr}       % Headers and footers
\fancyhead[R]{\normalfont \leftmark}
\fancyhead[L]{}
\pagestyle{fancy}

\usepackage{microtype}      % Slightly tweak font spacing for aesthetics
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{xcolor}
\definecolor{linkblue}{RGB}{0, 64, 128}
\usepackage[final, colorlinks = false, urlcolor = linkblue]{hyperref}
% \newcommand{\secref}[1]{\textbf{§~\nameref{#1}}}
\newcommand{\secref}[1]{\textbf{§\ref{#1}~\nameref{#1}}}

\usepackage{changepage}     % adjust margins on the fly

\usepackage{minted}
\usemintedstyle{algol_nu}

\usepackage{pgfplots}
\pgfplotsset{width=\textwidth,compat=1.9}

\usepackage{caption}
\newenvironment{code}{\captionsetup{type=listing}}{}
\captionsetup[listing]{skip=0pt}
\setlength{\abovecaptionskip}{5pt}
\setlength{\belowcaptionskip}{5pt}

\usepackage[yyyymmdd]{datetime}
\renewcommand{\dateseparator}{--}

\usepackage{enumitem}

\usepackage{titlesec}

\author{Andrew Hayes}

\begin{document}
\begin{titlepage}
    \begin{center}
        \hrule
        \vspace*{0.6cm}
        \censor{\huge \textbf{CT4101}}
        \vspace*{0.6cm}
        \hrule
        \LARGE
        \vspace{0.5cm}
            Machine Learning
        \vspace{0.5cm}
        \hrule

        \vfill
        \vfill

        \hrule
        \begin{minipage}{0.495\textwidth}
            \vspace{0.4em}
            \raggedright
            \normalsize
            Name: Andrew Hayes \\
            E-mail: \href{mailto://a.hayes18@universityofgalway.ie}{\texttt{a.hayes18@universityofgalway.ie}}  \hfill\\
            Student ID: 21321503 \hfill
        \end{minipage}
        \begin{minipage}{0.495\textwidth}
            \raggedleft
            \vspace*{0.8cm}
            \Large
            \today
            \vspace*{0.6cm}
        \end{minipage}
        \medskip\hrule
    \end{center}
\end{titlepage}

\pagenumbering{roman}
\newpage
\tableofcontents
\newpage
\setcounter{page}{1}
\pagenumbering{arabic}

\section{Introduction}
\subsection{Lecturer Contact Details}
\begin{itemize}
    \item   Dr. Frank Glavin.
    \item   \href{mailto://frank.glavin@universityofgalway.ie}{\texttt{frank.glavin@universityofgalway.ie}}
\end{itemize}

\subsection{Grading}
\begin{itemize}
    \item   Continuous Assessment: 30\% (2 assignments, worth 15\% each).
    \item   Written Exam: 70\% (the last two years' exam papers are the most relevant).
\end{itemize}

\subsection{Module Overview}
\textbf{Machine Learning (ML)} allows computer programs to improve their performance with experience (i.e., data).
This module is targeted at learners with no prior ML experience, but with university experience of mathematics \&
statistics and \textbf{strong} programming skills.
The focus of this module is on practical applications of commonly used ML algorithms, including deep learning
applied to computer vision.
Students will learn to use modern ML frameworks (e.g., scikit-learn, TensorFlow / Keras) to train \& evaluate
models for common categories of ML task including classification, clustering, \& regression.

\subsubsection{Learning Objectives}
On successful completion, a student should be able to:
\begin{enumerate}
    \item   Explain the details of commonly used Machine Learning algorithms.
    \item   Apply modern frameworks to develop models for common categories of Machine Learning task, including
            classification, clustering, \& regression.
    \item   Understand how Deep Learning can be applied to computer vision tasks.
    \item   Pre-process datasets for Machine Learning tasks using techniques such as normalisation \& feature
            selection.
    \item   Select appropriate algorithms \& evaluation metrics for a given dataset \& task.
    \item   Choose appropriate hyperparameters for a range of Machine Learning algorithms.
    \item   Evaluate \& interpret the results produced by Machine Learning models.
    \item   Diagnose \& address commonly encountered problems with Machine Learning models.
    \item   Discuss ethical issues \& emerging trends in Machine Learning.
\end{enumerate}

\section{What is Machine Learning?}
There are many possible definitions for ``machine learning'':
\begin{itemize}
    \item   Samuel, 1959: ``Field of study that gives computers the ability to learn without being explicitly
            programmed''.
    \item   Witten \& Frank, 1999: ``Learning is changing behaviour in a way that makes \textit{performance} better
            in the future''.
    \item   Mitchell, 1997: ``Improvement with experience at some task''.
            In a well-defined ML problem, a program improves at task $T$ with regard to \textbf{performance}
            measure $P$, based on experience $E$.
    \item   Artificial Intelligence $\neq$ Machine Learning $\neq$ Deep Learning.
    \item   Artificial Intelligence $\supset$ Machine Learning $\supset$ Deep Learning.
\end{itemize}

Machine Learning techniques include:
\begin{itemize}
    \item   Supervised learning.
    \item   Unsupervised learning.
    \item   Semi-Supervised learning.
    \item   Reinforcement learning.
\end{itemize}

Major types of ML task include:
\begin{enumerate}
    \item   Classification.
    \item   Regression.
    \item   Clustering.
    \item   Co-Training.
    \item   Relationship discovery.
    \item   Reinforcement learning.
\end{enumerate}

Techniques for these tasks include:
\begin{enumerate}
    \item   \textbf{Supervised learning:}
            \begin{itemize}
                \item   \textbf{Classification:} decision trees, SVMs.
                \item   \textbf{Regression:} linear regression, neural nets, $k$-NN (good for classification too).
            \end{itemize}

    \item   \textbf{Unsupervised learning:}
            \begin{itemize}
                \item   \textbf{Clustering:} $k$-Means, EM-clustering.
                \item   \textbf{Relationship discovery:} association rules, Bayesian nets.
            \end{itemize}

    \item   \textbf{Semi-Supervised learning:}
            \begin{itemize}
                \item   \textbf{Learning from part-labelled data:} co-training, transductive learning (combines ideas
                        from clustering \& classification).
            \end{itemize}

    \item   \textbf{Reward-Based:}
            \begin{itemize}
                \item   \textbf{Reinforcement learning:} Q-learning, SARSA.
            \end{itemize}
\end{enumerate}

In all cases, the machine searches for a \textbf{hypothesis} that best describes the data presented to it.
Choices to be made include:
\begin{itemize}
    \item   How is the hypothesis expressed? e.g., mathematical equation, logic rules, diagrammatic form, table,
            parameters of a model (e.g. weights of an ANN), etc.
    \item   How is search carried out? e.g., systematic (breadth-first or depth-first) or heuristic (most promising
            first).
    \item   How do we measure the quality of a hypothesis?
    \item   What is an appropriate format for the data?
    \item   How much data is required?
\end{itemize}
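
As a minimal sketch of these choices (not taken from the lecture material), the listing below expresses a hypothesis
as the single parameter $w$ of the model $y = wx$, searches a small set of candidate values systematically, \&
measures the quality of each hypothesis by its sum of squared errors.
The data points, the candidate values, \& the function names are assumptions made purely for illustration.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
# Toy data, roughly following y = 2x (assumed for illustration).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def squared_error(w, points):
    """Quality measure: sum of squared errors of the hypothesis y = w * x."""
    return sum((y - w * x) ** 2 for x, y in points)


# Hypothesis space: candidate values 0.0, 0.5, ..., 4.0 for the parameter w.
candidates = [i * 0.5 for i in range(9)]

# Systematic search: evaluate every candidate, keep the one with the lowest error.
best_w = min(candidates, key=lambda w: squared_error(w, data))
print(best_w)
\end{minted}
\caption{Sketch of a systematic search over a small hypothesis space}
\end{code}
A heuristic search would instead evaluate only the most promising candidates, which matters once the hypothesis
space is too large to enumerate.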

To apply ML, we need to know:
\begin{itemize}
    \item   How to formulate a problem.
    \item   How to prepare the data.
    \item   How to select an appropriate algorithm.
    \item   How to interpret the results.
\end{itemize}

To evaluate results \& compare methods, we need to know:
\begin{itemize}
    \item   The separation between training, testing, \& validation.
    \item   Performance measures such as simple metrics, statistical tests, \& graphical methods.
    \item   How to improve performance.
    \item   Ensemble methods.
    \item   Theoretical bounds on performance.
\end{itemize}
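
The separation between training, validation, \& testing data can be sketched with the standard library alone.
The function name, the 60/20/20 proportions, \& the fixed seed below are assumptions made for illustration; ML
frameworks provide their own utilities for this step.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
import random


def split_dataset(examples, train_frac=0.6, validation_frac=0.2, seed=0):
    """Shuffle a copy of the examples and cut it into three disjoint parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_validation = int(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test


train, validation, test = split_dataset(range(10))
print(len(train), len(validation), len(test))
\end{minted}
\caption{Sketch of a training / validation / test split}
\end{code}
Keeping the three parts disjoint means that the data used to measure performance has never been seen during training.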

\subsection{Data Mining}
\textbf{Data Mining} is the process of extracting interesting knowledge from large, unstructured datasets.
This knowledge is typically non-obvious, comprehensible, meaningful, \& useful.
\\\\
The storage ``law'' states that storage capacity doubles every year, faster than Moore's ``law'', which may result
in write-only ``data tombs''.
Therefore, developments in ML may be essential for processing \& exploiting this otherwise lost data.

\subsection{Big Data}
\textbf{Big Data} consists of datasets of scale \& complexity such that they can be difficult to process using
current standard methods.
The data scale dimensions are affected by one or more of the ``3 Vs'':
\begin{itemize}
    \item   \textbf{Volume:} terabytes \& up.
    \item   \textbf{Velocity:} from batch to streaming data.
    \item   \textbf{Variety:} numeric, video, sensor, unstructured text, etc.
\end{itemize}

It is also fashionable to add more ``Vs'' that are not key:
\begin{itemize}
    \item   \textbf{Veracity:} quality \& uncertainty associated with items.
    \item   \textbf{Variability:} change / inconsistency over time.
    \item   \textbf{Value:} for the organisation.
\end{itemize}

Key techniques for handling big data include: sampling, inductive learning, clustering, associations, \& distributed
programming methods.

\section{Introduction to Python}
\textbf{Python} is a general-purpose high-level programming language, first created by Guido van Rossum in 1991.
Python programs are interpreted by an \textit{interpreter}, e.g. \textbf{CPython} -- the reference implementation
supported by the Python Software Foundation.
CPython is both a compiler and an interpreter as it first compiles Python code into bytecode before interpreting it.
\\\\
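The compile-then-interpret behaviour described above can be observed from within Python itself.
The following is a minimal sketch using the built-in \mintinline{python}{compile()} \& \mintinline{python}{exec()}
functions and the standard-library \mintinline{python}{dis} module; the source string \& variable names are
illustrative assumptions, and the exact instructions printed differ between CPython versions.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
import dis

source = "total = 1 + 2"

# compile() turns source text into a code object that holds bytecode.
code_object = compile(source, "<example>", "exec")
print(type(code_object.co_code))  # the raw bytecode is stored as bytes

# dis.dis() prints a human-readable listing of the bytecode instructions.
dis.dis(code_object)

# exec() asks the interpreter to run the compiled bytecode.
namespace = {}
exec(code_object, namespace)
print(namespace["total"])
\end{minted}
\caption{Sketch of inspecting CPython bytecode}
\end{code}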
Python interpreters are available for a wide variety of operating systems \& platforms.
Python supports multiple programming paradigms, including procedural programming, object-oriented programming, \&
functional programming.
Python is \textbf{dynamically typed}, unlike languages such as C, C++, \& Java which are \textit{statically typed},
meaning that many common error checks are deferred until runtime in Python, whereas in a statically typed language like Java
these checks are performed during compilation.
\\\\
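As a rough illustration of the multiple paradigms mentioned above, the following sketch computes the same sum of
squares in a procedural, an object-oriented, \& a functional style.
The class \& variable names are hypothetical, chosen only for this example.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
from functools import reduce

numbers = [1, 2, 3, 4]

# Procedural: a sequence of statements that updates a variable step by step.
total_procedural = 0
for n in numbers:
    total_procedural += n * n


# Object-oriented: data and behaviour are bundled together in a class.
class SquareSummer:
    def __init__(self, values):
        self.values = values

    def total(self):
        return sum(v * v for v in self.values)


total_object_oriented = SquareSummer(numbers).total()

# Functional: the result is built by combining functions, with no variable
# being reassigned along the way.
total_functional = reduce(lambda acc, n: acc + n * n, numbers, 0)

print(total_procedural, total_object_oriented, total_functional)
\end{minted}
\caption{Sketch of one computation in three programming paradigms}
\end{code}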
Python uses \textbf{garbage collection}, meaning that memory management is handled automatically and there is no need for
the programmer to manually allocate \& de-allocate chunks of memory.
\\\\
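Automatic memory management can be observed with a weak reference, which refers to an object without keeping it
alive.
The sketch below assumes CPython, where an object is normally reclaimed as soon as its last strong reference
disappears; the class name is hypothetical, and the explicit \mintinline{python}{gc.collect()} call is included
only in case reclamation is not immediate.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
import gc
import weakref


class Node:
    """A small placeholder class (hypothetical, for illustration only)."""


node = Node()
# A weak reference lets us observe the object without keeping it alive.
observer = weakref.ref(node)
print(observer() is node)

# Remove the only strong reference; no explicit de-allocation call is needed.
del node
gc.collect()  # request a collection pass, in case reclamation is not immediate
print(observer() is None)
\end{minted}
\caption{Sketch of automatic memory reclamation}
\end{code}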
Python is used for all kinds of computational tasks, including:
\begin{itemize}
    \item   Scientific computing.
    \item   Data analytics.
    \item   Artificial Intelligence \& Machine Learning.
    \item   Computer vision.
    \item   Web development / web apps.
    \item   Mobile applications.
    \item   Desktop GUI applications.
\end{itemize}

While having relatively simple syntax and being easy to learn for beginners, Python also has very advanced
functionality.
It is one of the most widely used programming languages, being both open source \& freely available.
Python programs will run almost anywhere that there is an installation of the Python interpreter.
In contrast, many languages such as C or C++ have separate binaries that must be compiled for each specific platform
\& operating system.
\\\\
Python has a wide array of libraries available, most of which are free \& open source.
Python programs are usually much shorter than the equivalent Java or C++ code, meaning less code to write and
faster development times for experienced Python developers.
Its brevity also means that the code is easier to maintain, debug, \& refactor as much less source code is required
to be read for these tasks.
Python code can also be run without the need for ahead-of-time compilation (as in C or C++), allowing for faster
iterations over code versions \& faster testing.
Python can also be easily extended \& integrated with software written in many other programming languages.
\\\\
Drawbacks of using Python include:
\begin{itemize}
    \item   \textbf{Efficiency:} Program execution speed in Python is typically a lot slower than in lower-level
            languages such as C or C++.
            The relative execution speed of Python compared to C or C++ depends a lot on coding practices and the
            specific application being considered.

    \item   \textbf{Memory Management} in Python is less efficient than in well-written C or C++
            code, although these efficiency concerns are not usually a major issue, as compute power \& memory are
            now relatively cheap on desktop, laptop, \& server systems.
            Python is used in the backend of large web services such as Spotify \& Instagram, and performs
            adequately.
            However, these performance concerns may mean that Python is unsuitable for some performance-critical
            applications, e.g. resource-intensive scientific computing, embedded devices, automotive, etc.
            Faster alternative Python implementations such as \textbf{PyPy} are also available, with PyPy
            providing an average of a four-fold speedup by implementing advanced compilation techniques.
            It's also possible to call code that is implemented in C within Python to speed up performance-critical
            sections of your program.

    \item   \textbf{Dynamic typing} can make code more difficult to write \& debug compared to statically-typed
            languages, wherein the compiler checks that all variable types match before the code is executed.

    \item   \textbf{Python2 vs Python3:} There are two major versions of Python in widespread use that are not
            compatible with each other due to several changes that were made when Python3 was introduced.
            This means that some libraries that were originally written in Python2 have not been ported over to
            Python3.
            Python2 is now mostly used only in legacy business applications, while most new development is in
            Python3.
            Python2 has not been supported or received updates since 2020.
\end{itemize}
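
The suggestion above of moving performance-critical work into C can be illustrated without writing any C: in
CPython, built-in functions such as \mintinline{python}{sum()} are themselves implemented in C.
The sketch below times a pure-Python loop against the built-in; the helper name \& the input size are assumptions
for illustration, and the timings printed will vary between machines \& interpreters.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
import timeit

values = list(range(100_000))


def python_loop_sum(numbers):
    """Add the numbers one at a time in interpreted Python code."""
    total = 0
    for n in numbers:
        total += n
    return total


# Both approaches must agree on the result.
assert python_loop_sum(values) == sum(values)

# Time 20 runs of each; the exact figures depend on the machine and interpreter.
loop_seconds = timeit.timeit(lambda: python_loop_sum(values), number=20)
builtin_seconds = timeit.timeit(lambda: sum(values), number=20)
print(f"Python loop: {loop_seconds:.4f} s, built-in sum: {builtin_seconds:.4f} s")
\end{minted}
\caption{Sketch comparing a Python loop with a C-implemented built-in}
\end{code}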

\subsection{Running Python Programs}
Python programs can be executed in a variety of different ways:
\begin{itemize}
    \item   through the Python interactive shell on your local machine.
    \item   through remote Python interactive shells that are accessible through web browsers.
    \item   by using the console of your operating system to launch a standalone Python script (\verb|.py| file).
    \item   by using an IDE to launch a \verb|.py| file.
    \item   as GUI applications using libraries such as Tkinter or PyQt.
    \item   as web applications that provide services to other computers, e.g. by using the Flask framework to create
            a web server with content that can be accessed using web browsers.
    \item   through Jupyter / JupyterLab notebooks, either hosted locally on your machine or cloud-based Jupyter
            notebook execution environments such as Google Colab, Microsoft Azure Notebooks, Binder, etc.
\end{itemize}
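
As a sketch of the standalone-script option, the hypothetical file below could be launched from the console with
\verb|python3 greet.py Ada| (the file name, function name, \& argument are assumptions for illustration).
The final \mintinline{python}{if} block is a common idiom that lets the same file be run as a script or imported
as a module.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
import sys


def main(arguments):
    """Build a greeting from optional command-line arguments."""
    name = arguments[0] if arguments else "World"
    return f"Hello {name}!"


# This block runs only when the file is launched as a script,
# not when it is imported as a module by another program.
if __name__ == "__main__":
    print(main(sys.argv[1:]))
\end{minted}
\caption{\texttt{greet.py} (hypothetical example script)}
\end{code}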

\subsection{Hello World}
The following program writes ``Hello World!'' to the screen.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
print("Hello World!")
\end{minted}
\caption{\texttt{helloworld.py}}
\end{code}

\subsection{PEP 8 Style Guide}
\textbf{PEPs (Python Enhancement Proposals)} describe \& document the way in which the Python language evolves over
time, e.g. the addition of new features, the backwards-compatibility policy, etc.
PEPs can be proposed, then accepted or rejected.
The full list is available at \url{https://www.python.org/dev/peps/}.
\textbf{PEP 8} gives coding conventions for the Python code comprising the standard library in the main Python
distribution. See: \url{https://www.python.org/dev/peps/pep-0008/}.
It contains conventions for the user-defined names (e.g., variables, functions, packages), as well as code layout,
line length, use of blank lines, style of comments, etc.
\\\\
Many professional Python developers \& companies adhere to (at least some of) the PEP 8 conventions.
It is important to learn to follow these conventions from the start, especially if you want to work with other
programmers, as experienced Python developers will often flag violations of the PEP 8 conventions during code
reviews.
Of course, many companies \& open-source software projects have defined their own internal coding style guidelines
which take precedence over PEP 8 in the case of conflicts.
Following PEP 8 conventions is relatively easy if you are using a good IDE, e.g. PyCharm automatically finds \&
alerts you to violations of the PEP 8 conventions.

\subsubsection{Variable Naming Conventions}
According to PEP 8, variable names ``should be lowercase, with words separated by underscores as necessary to
improve readability'', i.e. \mintinline{python}{snake_case}.
``Never use the characters \verb|l|, \verb|O|, or \verb|I| as single-character variable names.
In some fonts, these characters are indistinguishable from the numerals one \& zero.
When tempted to use \verb|l|, use \verb|L| instead''.
According to PEP 8, different naming conventions are used for different identifiers, e.g.:
``Class names should normally use the CapWords convention''.
This helps programmers to quickly \& easily distinguish which category an identifier name represents.

\subsection{Dynamic Typing}
In Python, variable names can point to objects of any type.
Built-in data types in Python include \mintinline{python}{str}, \mintinline{python}{int}, \mintinline{python}{float},
etc.
Each type can hold a different type of data.
As we saw, \mintinline{python}{str} can hold any combination of characters.
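
A minimal sketch of dynamic typing follows: the same name is bound to objects of three different types in turn,
and a type error in a function call is only detected when the offending line actually runs.
The variable \& function names are illustrative assumptions.
\begin{code}
\begin{minted}[linenos, breaklines, frame=single]{python}
value = 42
print(type(value).__name__)

# The same name can later be bound to an object of a different type.
value = "forty-two"
print(type(value).__name__)

value = 4.2
print(type(value).__name__)


def add_one(x):
    return x + 1


# Type errors surface only when the offending line actually runs.
try:
    add_one("forty-two")
except TypeError as error:
    print(f"TypeError: {error}")
\end{minted}
\caption{Sketch of dynamic typing}
\end{code}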


\end{document}