%! TeX program = lualatex
\documentclass[a4paper,11pt]{article}

% packages
\usepackage{fontspec}
\setmainfont{EB Garamond}
% for tironian et fallback
% % \directlua{luaotfload.add_fallback
% % ("emojifallback",
% % {"Noto Serif:mode=harf"}
% % )}
% % \setmainfont{EB Garamond}[RawFeature={fallback=emojifallback}]
\setmonofont[Scale=MatchLowercase]{DejaVu Sans Mono}

\usepackage[a4paper,left=2cm,right=2cm,top=\dimexpr15mm+1.5\baselineskip,bottom=2cm]{geometry}
\setlength{\parindent}{0pt}

\usepackage{fancyhdr}       % Headers and footers
\fancyhead[R]{\normalfont \leftmark}
\fancyhead[L]{}
\pagestyle{fancy}

\usepackage{microtype}      % Slightly tweak font spacing for aesthetics
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage[final, colorlinks = true, urlcolor = blue, linkcolor = black]{hyperref}
\usepackage{changepage}     % adjust margins on the fly
\usepackage{amsmath}

\usepackage{minted}
\usemintedstyle{algol_nu}
\usepackage{xcolor}

\usepackage{algpseudocode}
\usepackage{tkz-graph}
\usetikzlibrary{positioning, fit, shapes.geometric}
\usepackage{pgfplots}
\pgfplotsset{width=\textwidth,compat=1.9}

\usepackage{caption}
\newenvironment{code}{\captionsetup{type=listing}}{}

\usepackage[yyyymmdd]{datetime}
\renewcommand{\dateseparator}{-}
\usepackage{titlesec}

\begin{document}
\begin{titlepage}
    \begin{center}
        \hrule
        \vspace*{0.6cm}
        \huge \textbf{CT3532}
        \vspace*{0.6cm}
        \hrule
        \LARGE
        \vspace{0.5cm}
        DATABASE SYSTEMS II
        \vspace{0.5cm}
        \hrule
        \vfill
        \includegraphics[width=0.7\textwidth]{images/db.png}
        \vfill
        \Large
        \vspace{0.5cm}
        \hrule
        \vspace{0.5cm}
        \textbf{Andreas Ó hAoḋa}
        % \vspace{0.5cm}
        % \hrule
        % \vspace{0.5cm}
        \normalsize
        University of Galway
        \today
        \vspace{0.5cm}
        \hrule
    \end{center}
\end{titlepage}

\pagenumbering{roman}
\newpage
\tableofcontents
\newpage
\setcounter{page}{1}
\pagenumbering{arabic}

\section{Introduction}
\subsection{Recommended Texts}
\begin{itemize}
    \item \emph{Fundamentals of Database Systems} by Elmasri and Navathe 005.74 ELM
    \item \emph{Database system concepts} by Silberschatz, A. 005.74 SIL
\end{itemize}

\subsection{Assessment}
Continuous Assessment accounts for 30\% of the final grade, and the exam accounts for the remaining 70\%.
Plagiarism of assignments is not permitted; this is strictly enforced.

\subsubsection{Assignments}

\section{Design}
\subsection{Re-cap}
\textbf{Normalisation} can be used to develop a normalised relational schema given the universal relation, and to verify the correctness of a relational schema developed from conceptual design.
We decompose relations so that they satisfy successively more restrictive normal forms.
\\\\
Desirable properties of a relational schema include:
\begin{itemize}
    \item \textbf{Clear semantics of a relation:} The \textbf{semantics} of a relation refers to how the attributes grouped together in a relation are to be interpreted.
          If ER modelling is done carefully and the mapping is undertaken correctly, it is likely that the semantics of the resulting relation will be \emph{clear}.
          One should try to design a relation so that it is easy to explain its meaning.
    \item \textbf{Reducing the number of redundant values in tuples:} The presence of redundancy leads to wasted storage space and the potential for anomalies (deletion, update, insertion).
          One should try to design relations so that no anomalies may occur.
          If an anomaly can occur, it should be noted.
          Normalisation will remove many of the potential anomalies.
    \item \textbf{Reducing the number of null values in tuples:} Having null values is often necessary, but it can create problems, such as:
          \begin{itemize}
              \item Wasted space.
              \item Different interpretations, i.e.: attribute does not apply to this tuple, attribute value is unknown, attribute value is known but absent, etc.
          \end{itemize}
    \item \textbf{Disallowing the possibility of generating spurious tuples:} If a relation $R$ is decomposed into $R_1$ \& $R_2$ and connected via a primary key -- foreign key pair, then performing an equi-join between $R_1$ \& $R_2$ on the involved keys should not produce tuples that were not in the original relation $R$.
\end{itemize}

More formally, we typically have a relation $R$ and a set of functional dependencies $F$ defined over $R$.
We wish to create a decomposition $D = \{R_1, R_2, \dots, R_n\}$, and we wish to guarantee certain properties of this decomposition.
We require that all attributes in the original $R$ be maintained in the decomposition, \textit{id est}: $R = R_1 \cup R_2 \cup \dots \cup R_n$.
\begin{itemize}
    \item A relation is said to be in the \textbf{First Normal Form (1NF)} if there are no repeating fields.
    \item A relation is said to be in the \textbf{Second Normal Form (2NF)} if it is in 1NF and if every non-prime attribute is fully functionally dependent on the key.
    \item A relation is said to be in the \textbf{Third Normal Form (3NF)} if it is in 2NF and if no non-prime attribute is transitively dependent on the key.
    \item A relation is said to be in the \textbf{Boyce-Codd Normal Form (BCNF)} if the relation is in 3NF and if every determinant is a candidate key.
\end{itemize}

\subsubsection{Example}
\begin{table}[H]
    \centering
    \begin{tabular}{lll}
        \hline
        StudentNo & Major & Advisor   \\ \hline
        123       & I.T.  & Smith     \\
        123       & Econ  & Murphy    \\
        444       & Biol. & O' Reilly \\
        617       & I.T.  & Jones     \\
        829       & I.T.  & Smith     \\ \hline
    \end{tabular}
    \caption{Sample Data}
\end{table}

Constraints:
\begin{itemize}
    \item A student may have more than one major.
    \item For each major, a student can have only one advisor.
    \item Each major can have several advisors.
    \item Each advisor advises one major.
    \item Each advisor can advise several students.
\end{itemize}

Functional dependencies:
\begin{itemize}
    \item \{StudentNo, Major\} $\rightarrow$ \{Advisor\}
    \item \{Advisor\} $\rightarrow$ \{Major\}
\end{itemize}

An update anomaly may exist: if student 444 changes major, we lose the information that O' Reilly supervises Biology.
To solve this, we can decompose the tables so as to satisfy BCNF:
\begin{itemize}
    \item TAKES: \underline{StudentNo, Advisor}
    \item ADVISES: \underline{Advisor}, Major
\end{itemize}

\subsubsection{General Rule}
Consider a relation $R$ with functional dependencies $F$.
If $X \rightarrow Y$ violates BCNF, decompose $R$ into:
\begin{itemize}
    \item $\{R - Y\}$
    \item $\{XY\}$
\end{itemize}

\subsubsection{Exercise}
Let $R = \{A, B, C, D, E, F, G, H\}$.
The functional dependencies defined over $R$ are:
\begin{itemize}
    \item $A \rightarrow D$
    \item $B \rightarrow E$
    \item $E \rightarrow F$
    \item $F \rightarrow G$
    \item $F \rightarrow H$
    \item $\{A, B\} \rightarrow C$
    \item $C \rightarrow A$
\end{itemize}

\begin{tikzpicture}
    \SetGraphUnit{2}
    \SetUpEdge[style={->}]
    \Vertices{circle}{A, B, C, D, E, F, G, H}
    \node[ellipse, draw=black, fit=(A) (B), inner sep=-1mm] (AB) {};
    \Edge(A)(D)
    \Edge(B)(E)
    \Edge(E)(F)
    \Edge(F)(G)
    \Edge(F)(H)
    \Edge(AB)(C)
    \Edge(C)(A)
\end{tikzpicture}

Decompose $R$ such that BCNF is satisfied.
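The following is a minimal Python sketch (not part of the lecture material) of the split rule above, applied to the StudentNo/Major/Advisor example; the function name \mintinline{python}{bcnf_split} and the set-based representation of schemata are illustrative assumptions only.

\begin{minted}[linenos, breaklines, frame=single]{python}
# A minimal sketch of the general BCNF split rule: given a relation R and a
# functional dependency X -> Y that violates BCNF, produce {R - Y} and {X u Y}.

def bcnf_split(R, X, Y):
    """Split R on the violating dependency X -> Y."""
    return R - Y, X | Y

# The StudentNo/Major/Advisor example: Advisor -> Major violates BCNF because
# Advisor is a determinant but not a candidate key of the original relation.
R = {"StudentNo", "Major", "Advisor"}
X, Y = {"Advisor"}, {"Major"}

takes, advises = bcnf_split(R, X, Y)
print(takes)    # TAKES:   {'StudentNo', 'Advisor'}
print(advises)  # ADVISES: {'Advisor', 'Major'}
\end{minted}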
\section{Design by Synthesis}
\subsection{Background}
Typically, we have the relation $R$ and a set of functional dependencies $F$.
We wish to create a decomposition $D = R_1, R_2, \dots, R_m$.
Clearly, all attributes of $R$ must occur in at least one schema $R_i$, \textit{id est}: $\bigcup^{m}_{i=1} R_i = R$.
This is known as the \textbf{attribute preservation} constraint.
\\\\
A \textbf{functional dependency} is a constraint between two sets of attributes.
A functional dependency $X \rightarrow Y$ holds if, for all tuples $t_1$ \& $t_2$ such that $t_1[X] = t_2[X]$, it is also the case that $t_1[Y] = t_2[Y]$.
We usually only specify the obvious functional dependencies; there may be many more.
Given a set of functional dependencies $F$, the \textbf{closure of \textit{F}}, denoted $F^+$, refers to all dependencies that can be derived from $F$.

\subsubsection{Armstrong's Axioms}
\textbf{Armstrong's Axioms} are a set of inference rules that allow us to deduce all functional dependencies from a given initial set.
They are:
\begin{itemize}
    \item \textbf{Reflexivity:} if $X \supseteq Y$, then $X \rightarrow Y$.
    \item \textbf{Augmentation:} if $X \rightarrow Y$, then $XZ \rightarrow YZ$.
    \item \textbf{Transitivity:} if $X \rightarrow Y$, $Y \rightarrow Z$, then $X \rightarrow Z$.
    \item \textbf{Projectivity:} if $X \rightarrow YZ$, then $X \rightarrow Z$.
    \item \textbf{Additivity:} if $X \rightarrow Y$, $X \rightarrow Z$, then $X \rightarrow YZ$.
    \item \textbf{Pseudo-transitivity:} if $X \rightarrow Y$, $WY \rightarrow Z$, then $WX \rightarrow Z$.
\end{itemize}

The first three rules have been shown to be \textbf{sound} \& \textbf{complete}:
\begin{itemize}
    \item \textbf{Sound:} Given a set $F$ on a relation $R$, any dependency we can infer from $F$ using the first three rules holds for every state $r$ of $R$ that satisfies the dependencies in $F$.
    \item \textbf{Complete:} We can use the first three rules repeatedly to infer all possible dependencies that can be inferred from $F$.
\end{itemize}

For any set of attributes $A$, we can infer $A^+$, the set of attributes that are functionally determined by $A$ given a set of functional dependencies.
% skipped page 8 here

\subsubsection{Cover Sets}
A set of functional dependencies $F$ \textbf{covers} a set of functional dependencies $E$ if every functional dependency in $E$ is in $F^+$.
\\\\
Two sets of functional dependencies $E$ \& $F$ are equivalent if $E^+ = F^+$.
We can check whether $F$ covers $E$ by calculating $A^+$ with respect to $F$ for each functional dependency $A \rightarrow B$ in $E$, and then checking that $A^+$ includes all the attributes of $B$.
\\\\
A set of functional dependencies $F$ is \textbf{minimal} if:
\begin{itemize}
    \item Every functional dependency in $F$ has a single attribute for its right-hand side.
    \item We cannot remove any dependency from $F$ and maintain a set of dependencies equivalent to $F$.
    \item We cannot replace any dependency $X \rightarrow A$ with a dependency $Y \rightarrow A$, where $Y \subset X$, and still maintain a set of dependencies equivalent to $F$.
\end{itemize}

All functional dependencies $X \rightarrow Y$ specified in $F$ should exist in one of the schemata $R_i$, or should be inferable from the dependencies in $R_i$; this is known as the \textbf{dependency preservation} constraint.
Each functional dependency specifies some constraint; if a dependency is lost in the decomposition, then we must enforce that constraint in some other manner, which can be inefficient.
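The attribute-closure and cover-set checks described above lend themselves to a short program. The following is a minimal sketch (not from the notes), representing each functional dependency as a pair of \mintinline{python}{frozenset}s; the helper names \mintinline{python}{closure} and \mintinline{python}{covers} are assumptions of this sketch.

\begin{minted}[linenos, breaklines, frame=single]{python}
# Compute the attribute closure A+ under a set of functional dependencies F,
# and use it to check whether F covers another set of dependencies E.

def closure(attrs, F):
    """Return A+: all attributes functionally determined by attrs under F."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in F:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def covers(F, E):
    """F covers E if, for every A -> B in E, B is contained in A+ w.r.t. F."""
    return all(rhs <= closure(lhs, F) for lhs, rhs in E)

# Example: F = {A -> B, B -> C} covers E = {A -> C} (by transitivity).
F = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
E = [(frozenset("A"), frozenset("C"))]
print(closure({"A"}, F))  # {'A', 'B', 'C'}
print(covers(F, E))       # True
\end{minted}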
Given $F$ \& $R$, the \textbf{projection} of $F$ on $R_i$, denoted $\pi_{R_i}(F)$ where $R_i$ is a subset of $R$, is the set of dependencies $X \rightarrow Y$ in $F^+$ such that the attributes $X \cup Y \subseteq R_i$.
A decomposition of $R$ is dependency-preserving if:
$$ ((\pi_{R_1}(F)) \cup \dots \cup (\pi_{R_m}(F)))^+ = F^+ $$

\textbf{Theorem:} It is always possible to find a decomposition $D$ with respect to $F$ such that:
\begin{enumerate}
    \item The decomposition is dependency-preserving.
    \item All $R_i$ in $D$ are in 3NF.
\end{enumerate}

We can always guarantee a dependency-preserving decomposition to 3NF.
\\\\
\textbf{Algorithm:}
\begin{enumerate}
    \item Find a minimal cover set $G$ for $F$.
    \item For each left-hand side $X$ of a functional dependency in $G$, create a relation $X \cup A_1 \cup A_2 \cup \dots \cup A_m$ in $D$, where $X \rightarrow A_1$, $X \rightarrow A_2, \dots$ are the only dependencies in $G$ with $X$ as a left-hand side.
    \item Group any remaining attributes into a single relation.
\end{enumerate}

\subsubsection{Lossless Joins}
Consider the following relation:
\begin{itemize}
    \item EMPPROJ: \underline{ssn, pnumber}, hours, ename, pname, plocation
\end{itemize}
and its decomposition to:
\begin{itemize}
    \item EMPPROJ1: \underline{ename, plocation}
    \item EMPLOCAN: \underline{ssn, pno}, hrs, pname, plocation
\end{itemize}
If we perform a natural join on these relations, we may generate spurious tuples.
When a natural join is issued against relations, no spurious tuples should be generated.
A decomposition $D = \{R_1, R_2, \dots, R_m\}$ of $R$ has the \textbf{lossless join} (or non-additive join) property with regard to $F$ on $R$ if, for every instance $r$ of $R$ that satisfies $F$, the following holds:
$$ \bowtie (\pi_{R_1}(r), \dots, \pi_{R_m}(r)) = r $$
We can automate a procedure for testing for the lossless property.
We can also automate the decomposition of $R$ into $R_1, \dots, R_m$ such that it possesses the lossless join property.
\\\\
A decomposition $D = \{R_1, R_2\}$ has the lossless property if \& only if:
\begin{itemize}
    \item The functional dependency $(R_1 \cap R_2) \rightarrow \{R_1 - R_2\}$ is in $F^+$, \emph{or}
    \item The functional dependency $(R_1 \cap R_2) \rightarrow \{R_2 - R_1\}$ is in $F^+$.
\end{itemize}
Furthermore, if a decomposition has the lossless property and we decompose one of the $R_i$ such that this also is a lossless decomposition, then replacing that decomposition of $R_i$ in the original decomposition will result in a lossless decomposition.
\\\\
\textbf{Algorithm:} To decompose to BCNF:
\begin{enumerate}
    \item Let $D = R$.
    \item While there is a schema $B$ in $D$ that violates BCNF, do:
          \begin{enumerate}
              \item Find the functional dependency $(X \rightarrow Y)$ that violates BCNF.
              \item Replace $B$ with $(B-Y)$ \& $(X \cup Y)$.
          \end{enumerate}
\end{enumerate}
This guarantees a decomposition such that all attributes are preserved, the lossless join property is enforced, and all $R_i$ are in BCNF.
It is not always possible to decompose $R$ into a set of $R_i$ such that all $R_i$ satisfy BCNF and the properties of lossless joins \& dependency preservation are maintained.
We can guarantee a decomposition such that:
\begin{itemize}
    \item All attributes are preserved.
    \item All relations are in 3NF.
    \item All functional dependencies are maintained.
    \item The lossless join property is maintained.
\end{itemize}
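A minimal Python sketch of the BCNF decomposition algorithm above is given below (not part of the notes). Functional dependencies are represented as pairs of \mintinline{python}{frozenset}s; for brevity, only the dependencies given in $F$ are checked for violations, rather than all of $F^+$, which is a simplifying assumption of this sketch.

\begin{minted}[linenos, breaklines, frame=single]{python}
# Sketch of BCNF decomposition: while some schema B in D violates BCNF on a
# dependency X -> Y, replace B with (B - Y) and (X u Y).

def closure(attrs, F):
    """Attribute closure of attrs under F (same helper as in the earlier sketch)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in F:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violation(B, F):
    """Return an FD (X, Y) from F that violates BCNF on schema B, or None."""
    for lhs, rhs in F:
        extra = (rhs & B) - lhs
        if lhs <= B and extra and not B <= closure(lhs, F):
            return lhs, extra          # X is not a superkey of B
    return None

def decompose_bcnf(R, F):
    D = [frozenset(R)]
    while True:
        for i, B in enumerate(D):
            violation = bcnf_violation(B, F)
            if violation is not None:
                X, Y = violation
                D[i:i + 1] = [B - Y, X | Y]   # replace B with (B - Y) and (X u Y)
                break
        else:
            return D

# Example: R = {A, B, C} with A -> B and B -> C decomposes to {A, B} and {B, C}.
F = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
print(decompose_bcnf({"A", "B", "C"}, F))
\end{minted}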
\textbf{Algorithm:} Finding a key for a relation schema $R$:
\begin{enumerate}
    \item Set $K := R$.
    \item For each attribute $A \in K$, compute $(K-A)^+$ with respect to the set of functional dependencies.
          If $(K-A)^+$ contains all the attributes in $R$, then set $K := K - \{A\}$.
\end{enumerate}

Given a set of functional dependencies $F$, we can develop a minimal cover set.
Using this, we can decompose $R$ into a set of relations such that all attributes are preserved, all functional dependencies are preserved, the decomposition has the lossless join property, and all relations are in 3NF.
The advantages of this are that it provides a good database design and can be automated.
The primary disadvantage is that often, numerous good designs are possible.

\section{B Trees \& B+ Trees}
\subsection{Generalised Search Tree}
In a \textbf{Generalised Search Tree}, each node has the format $P_1, K_1, P_2, K_2, \dots, P_{n-1}, K_{n-1}, P_n$, where $P_i$ is a \textbf{tree pointer} and $K_i$ is a \textbf{search value}.
Hence, the number of values per node depends on the size of the key field, block size, \& block pointer size.
The following constraints hold:
\begin{itemize}
    \item $K_1 < K_2 < \dots < K_{n-1} < K_n$.
    \item For all values $x$ in a sub-tree pointed to by $P_i$, $K_{i-1} < x < K_i$.
    \item The number of tree pointers per node is known as the \textbf{order}, $\rho$, of the tree.
\end{itemize}

\subsubsection{Efficiency}
For a generalised search tree, $T(N) = O(\log(N))$, assuming a balanced tree.
In order to guarantee this efficiency in searching \& other operations, we need techniques to ensure that the tree is always balanced.

\subsection{B Trees}
A \textbf{B tree} is a balanced generalised search tree.
B trees can be viewed as a dynamic multi-level index.
The properties of a search tree still hold, and the algorithms for insertion \& deletion of values are modified in order to keep the tree balanced.
The node structure contains a record pointer for each key value.
The node structure is as follows:
$$ P_1, \langle K_1, Pr_1 \rangle, P_2, \langle K_2, Pr_2 \rangle, \dots, P_{n-1}, \langle K_{n-1}, Pr_{n-1} \rangle, P_n $$

\subsubsection{Example}
Consider a B tree of order 3 (two values and three tree pointers per node/block).
Insert records with key values: 10, 6, 8, 14, 4, 16, 19, 11, 21.

\subsubsection{Algorithm to Insert a Value into a B Tree}
\begin{enumerate}
    \item Find the appropriate leaf-level node to insert the value.
    \item If space remains in the leaf-level node, then insert the new value in the correct location.
    \item If no space remains, we need to deal with collisions.
\end{enumerate}

\subsubsection{Dealing with Collisions}
\begin{enumerate}
    \item Split the node into left \& right nodes.
    \item Propagate the middle value up a level and place its value in a node there.
          Note that this propagation may cause further propagations and even the creation of a new root node.
    \item Place the values less than the middle value in the left node.
    \item Place the values greater than the middle value in the right node.
\end{enumerate}
This maintains the balanced nature of the tree, and $O(\log_\rho(N))$ for search, insertion, \& deletion.
However, there is always potential for unused space in the tree.
Empirical analysis has shown that B trees remain 69\% full given random insertions \& deletions.

\subsubsection{Exercise}
Can you define an algorithm for deletion (at a high level)?
How much work is needed in the various cases (best, average, worst)?
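The split step in the collision-handling procedure above can be sketched as follows; this is an illustrative fragment only (keys without tree or record pointers), and the function name \mintinline{python}{split_node} is an assumption of the sketch rather than part of the notes.

\begin{minted}[linenos, breaklines, frame=single]{python}
# Sketch of the node-split step: when a node overflows, the middle value is
# propagated up a level, values less than it go to the left node, and values
# greater than it go to the right node.

def split_node(keys):
    """keys: sorted list holding one value too many for the node."""
    mid = len(keys) // 2
    left, middle, right = keys[:mid], keys[mid], keys[mid + 1:]
    return left, middle, right

# Order-3 node (two values per node): inserting 8 after 10 and 6 overflows the
# node [6, 10], giving [6, 8, 10]; 8 is propagated up a level.
print(split_node([6, 8, 10]))   # ([6], 8, [10])
\end{minted}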
\subsection{B+ Trees}
The most commonly used index type is the \textbf{B+ tree} -- a dynamic, multi-level index.
B+ trees differ from B trees in terms of structure, and have slightly more complicated insertion \& deletion algorithms.
B+ trees offer increased efficiency over a B tree and ensure a higher order $\rho$.

\subsubsection{Node Structure}
B+ trees have two different node structures: internal nodes \& leaf-level nodes.
The internal node structure is:
$$ P_1, K_1, P_2, K_2, \dots, P_{n-1}, K_{n-1}, P_n $$
All record pointers are maintained at the leaf level in a B+ tree; there are no record pointers in the internal nodes.
Internal nodes therefore hold less information per entry, and hence more search values per node.
\\\\
One tree pointer is maintained at each leaf-level node, which points to the next leaf-level node.
Note that there is only one tree pointer per node at the leaf level.
Each leaf-level node's structure is:
$$ \langle K_1, Pr_1 \rangle, \langle K_2, Pr_2 \rangle, \dots, \langle K_m, Pr_m \rangle, P_{\text{next}} $$

\subsubsection{Example}
Let the block size $B = 512$ bytes, the block pointer size $P = 6$ bytes, \& the key size $K = 10$ bytes.
Assume 30,000 records as before.
Assume that the tree is 69\% full.
How many blocks will the tree require?
How many block accesses will a search require?

\subsubsection{Example}
A tree of order $\rho$ has at most $\rho - 1$ search values per node.
For a B+ tree, there are two types of tree nodes; hence there are two different orders: $\rho$ \& $\rho_{\text{leaf}}$.
To calculate $\rho_{\text{leaf}}$ (with a record pointer size of $|Pr| = 7$ bytes):
$$ |P| + (\rho_{\text{leaf}})(|K| + |Pr|) \leq B $$
$$ \Rightarrow 17(\rho_{\text{leaf}}) \leq 506 $$
$$ \Rightarrow \rho_{\text{leaf}} = 29 $$
Given a fill factor of 69\%:
\begin{itemize}
    \item Each internal node will have, on average, 22 pointers.
    \item Each leaf-level node will have, on average, 20 pointers.
\end{itemize}
\begin{itemize}
    \item Root: 1 node, 21 entries, 22 pointers.
    \item Level 1: 22 nodes, 462 entries, 484 pointers.
    \item Level 2: 484 nodes, \dots, etc.
    \item Leaf Level: \dots
\end{itemize}
Hence, 4 levels are sufficient.
The number of block accesses $= 4 + 1$.
The number of blocks is $1 + 22 + 484 + \dots$

\section{Hash Tables}
\subsection{Introduction}
Can we improve upon logarithmic searching?
\textbf{Hashing} is a technique that attempts to provide constant time for searching \& insertion, i.e. $O(1)$.
The basic idea for searching \& insertion is to apply a hash function to the search field of the record; the return value of the hash function is used to reference a location in the hash table.
\\\\
Care should be taken in designing a hash function.
We usually require a \textbf{fair} hash function.
This is difficult to guarantee if there is no or limited information available about the type of data to be stored.
Often, heuristics can be used if domain knowledge is available.
We can have either internal hashing (i.e., to some data structure in memory) or external hashing (i.e., to file locations).
We must consider the size of the original table or file.

\subsection{Approaches}
\begin{enumerate}
    \item Create a hash table containing $N$ addressable ``slots'', each of which can contain one record.
    \item Create a hash function that returns a value to be used in insertion \& searching.
          The value returned by the hash function must be in the correct range, i.e. the address space of the hash table.
\end{enumerate}
If the range of the keys is that of the address space of the table, we can guarantee constant-time lookup.
However, this is usually not the case, as the address space of the table is much smaller than that of the search field.
\\\\
With numeric keys, we can use modulo-division or truncation for hashing.
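As an illustration of the two steps above, the following is a minimal sketch (not from the notes) of a fixed-size table using modulo-division hashing for numeric keys, with chaining used to resolve collisions (collision resolution is discussed below); the table size \mintinline{python}{N} and the sample keys are arbitrary choices for the sketch.

\begin{minted}[linenos, breaklines, frame=single]{python}
# Static hashing with modulo-division and chaining.

N = 7                               # number of addressable slots
table = [[] for _ in range(N)]      # each slot holds a chain of (key, record)

def h(key):
    """Modulo-division hash function for numeric keys."""
    return key % N

def insert(key, record):
    table[h(key)].append((key, record))        # chain on collision

def search(key):
    return [rec for k, rec in table[h(key)] if k == key]

for key in (10, 6, 8, 14, 4, 16, 19):
    insert(key, "record %d" % key)
print(search(14))   # ['record 14']
\end{minted}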
With character keys, we must first convert them to an integer value: this can be achieved by multiplying the ASCII codes of the characters together and then applying modulo-division.
However, we cannot guarantee constant-time performance, as collisions will occur, i.e. two records with different search values being hashed to the same location in the table; we require a collision resolution policy.
Efficiency then will depend on the number of collisions.
The number of collisions depends primarily on the load factor $\lambda$ of the file:
$$ \lambda = \frac{\text{Number of records}}{\text{Number of slots}} $$

\subsubsection{Collision Resolution Policies}
\begin{itemize}
    \item \textbf{Chaining:} if a location is full, add the item to a linked list.
          Performance degrades if the load factor is high.
          The lookup time is on average $1 + \lambda$.
    \item \textbf{Linear probing:} if a location is full, then check in a linear manner for the next free space.
          This can degrade to a linear scan.
          The expected number of probes is $\frac{1}{2}\left(1 + \frac{1}{1-\lambda}\right)$ for a successful search and $\frac{1}{2}\left(1 + \frac{1}{(1-\lambda)^2}\right)$ for an unsuccessful search.
          One big disadvantage of this approach is that it leads to the formation of clusters.
    \item \textbf{Quadratic probing:} if location $x$ is full, check location $x + 1^2$, then location $x + 2^2$, and in general location $x + i^2$ on the $i^{\text{th}}$ probe.
          This results in less clustering.
    \item \textbf{Double hashing:} if location $x$ is occupied, then apply a second hash function.
          This can help guarantee an even distribution (a fairer hash function).
\end{itemize}

\subsection{Dynamic Hashing}
The cases that we've considered thus far deal with the idea of a \textbf{fixed hash table}: this is referred to as \textbf{static hashing}.
Problems arise if the database grows larger than planned: there are too many overflow buckets and performance degrades.
A more suitable approach is \textbf{dynamic hashing}, wherein the table or file can be re-sized as needed.

\subsubsection{General Approach}
\begin{enumerate}
    \item Use a family of hash functions $h_0$, $h_1$, $h_2$, etc., where $h_{i+1}$ is a refinement of $h_i$, e.g., $K \bmod 2^i$.
    \item Develop a base hash function that maps the key to a positive integer.
    \item Use $h_0(x) = x \bmod 2^b$ for a chosen $b$.
          There will be $2^b$ buckets initially.
          We can effectively double the size of the table by incrementing $b$.
\end{enumerate}
We only double the number of buckets conceptually when re-organising; we do not actually double the number of buckets in practice, as it may not be needed.

\subsection{Dynamic Hashing Approaches}
Common dynamic hashing approaches include extendible hashing \& linear hashing.
\textbf{Extendible hashing} involves re-organising the buckets when \& where needed, whereas \textbf{linear hashing} involves re-organising buckets when, but not where, needed.

\subsubsection{Extendible Hashing}
\begin{itemize}
    \item When a bucket overflows, split that bucket in two.
          A directory is used to achieve this conceptual doubling.
    \item If a collision or overflow occurs, we don't re-organise the file by doubling the number of buckets, as this would be too expensive.
          Instead, we maintain a directory of pointers to buckets; we can then effectively double the number of buckets by doubling the directory, splitting just the bucket that overflowed.
          Doubling the directory is much cheaper than doubling the file, as the directory is much smaller than the file.
    \item On overflow, we split the bucket by allocating a new bucket and redistributing its contents.
          We double the directory size if necessary.
\end{itemize}
We maintain a \textbf{local depth} for each bucket, effectively the number of bits needed to hash an item there.
We also maintain a \textbf{global depth} for the directory, which is the number of bits used in indexing items.
We can use these values to determine when to split the directory.
\begin{itemize}
    \item If overflow occurs in a bucket where the local depth $=$ the global depth, then split the bucket, redistribute its contents, and double the directory.
    \item If overflow occurs in a bucket where the local depth $<$ the global depth, then split the bucket, redistribute its contents, and increase the local depth.
\end{itemize}
If the directory can fit in memory, then retrieval for point queries can be achieved with one disk read.

\subsection{Linear Hashing}
\textbf{Linear hashing} is another approach to indexing a dynamic file.
It is similar to dynamic hashing in that a family of hash functions is used ($h_i = K \bmod (2^i \times M)$), but differs in that no directory (index) is needed.
Initially, we create a file of $M$ buckets; $K \bmod M$ is a suitable hash function.
We will use a family of such functions $K \bmod (2^i \times M)$, with $i = 0$ initially.
We can view the hashing as comprising a sequence of phases: in phase $j$, the hash functions $K \bmod (2^j \times M)$ \& $K \bmod (2^{j+1} \times M)$ are used.
\\\\
We \textbf{split a bucket} by redistributing the records into two buckets: the original one \& a new one.
In phase $j$, to determine which records remain in the original bucket and which go into the new one, we use $h_{j+1}(K) = K \bmod (2^{j+1} \times M)$ to calculate their address.
Irrespective of the bucket which causes the overflow, we always split the next bucket in a \textbf{linear order}.
We begin with bucket 0, and keep track of which bucket to split next, $p$.
At the end of a phase, when $p$ is equal to the number of buckets present at the start of the phase, we reset $p$ and a new phase begins ($j$ is incremented).

\section{Joins}
Many approaches \& algorithms can be used to perform \textbf{joins}.

\subsection{Nested Loop Join}
To perform the join $r \bowtie s$:
\begin{minted}[texcl, mathescape, linenos, breaklines, frame=single]{text}
for each tuple t_r in r do:
    for each tuple t_s in s do:
        if t_r and t_s satisfy join condition:
            add (t_r, t_s) to result
    end
end
\end{minted}
This is an expensive approach; every pair of tuples is checked to see if they satisfy the join condition.
If one of the relations fits in memory, it is beneficial to use this relation in the inner loop (known as the \textbf{inner relation}).

\subsection{Block Nested Loop Join}
The \textbf{block nested loop join} is a variation on the nested loop join that increases efficiency by reducing the number of block accesses.
\begin{minted}[texcl, mathescape, linenos, breaklines, frame=single]{text}
for each block B_r in r do:
    for each block B_s in s do:
        for each tuple t_r in B_r do:
            for each tuple t_s in B_s do:
                if t_r and t_s satisfy join condition:
                    add (t_r, t_s) to result
            end
        end
    end
end
\end{minted}

\subsection{Indexed Nested Loop Join}
If there is an index available for the inner table in a nested loop join, we can replace file scans with index accesses.

\subsection{Merge Join}
If both relations are sorted on the joining attribute, then we can merge the relations.
The technique is identical to merging two sorted lists (such as in the ``merge'' step of a Merge-Sort algorithm).
Merge joins are much more efficient than a nested loop join.
They can also be computed for relations that are not ordered on the joining attribute, but which have indexes on the joining attribute.
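The merge step can be sketched as follows; this is an illustrative Python sketch (not from the notes), assuming both inputs are lists already sorted on the join attribute, with \mintinline{python}{key} extracting that attribute. Runs of equal join values on both sides are paired to handle duplicates.

\begin{minted}[linenos, breaklines, frame=single]{python}
# Sketch of a merge join on a single join attribute over two sorted inputs.

def merge_join(r, s, key):
    result, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if key(r[i]) < key(s[j]):
            i += 1
        elif key(r[i]) > key(s[j]):
            j += 1
        else:
            # Gather the run of equal join values on each side and pair them.
            k = key(r[i])
            i2, j2 = i, j
            while i2 < len(r) and key(r[i2]) == k:
                i2 += 1
            while j2 < len(s) and key(s[j2]) == k:
                j2 += 1
            for t_r in r[i:i2]:
                for t_s in s[j:j2]:
                    result.append((t_r, t_s))
            i, j = i2, j2
    return result

# Tuples of the form (join value, payload), already sorted on the join value.
r = [(1, "a"), (2, "b"), (2, "c")]
s = [(2, "x"), (3, "y")]
print(merge_join(r, s, key=lambda t: t[0]))
# [((2, 'b'), (2, 'x')), ((2, 'c'), (2, 'x'))]
\end{minted}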
\subsection{Hash Joins}
\begin{enumerate}
    \item Create a hashing function which maps the join attribute(s) to partitions in a range $1 \dots N$.
    \item For all tuples in $r$, hash the tuples to partitions $H_{ri}$.
    \item For all tuples in $s$, hash the tuples to partitions $H_{si}$.
    \item For $i = 1$ to $N$, join the partitions $H_{ri}$ \& $H_{si}$.
\end{enumerate}

\section{Sorting}
Sorting is a very important operation because it is used if a query specifies \mintinline{SQL}{ORDER BY}, and it is used prior to relational operators (e.g. join) to allow more efficient processing of the operation.
We can sort a relation in two ways:
\begin{itemize}
    \item \textbf{Physically:} the actual order of tuples is re-arranged on the disk.
    \item \textbf{Logically:} build an index and sort the index entries.
\end{itemize}
When the relation to be sorted fits in memory, we can use standard sorting techniques (such as Quicksort).
However, when the relation doesn't fit in memory, we have to use other approaches, such as the \textbf{external sort merge}, which is essentially an $N$-way merge, an extension of the idea in the merge step of the merge sort algorithm.
\begin{minted}[texcl, mathescape, linenos, breaklines, frame=single]{text}
i := 0;
M := number of page frames in main memory buffer
repeat
    read M blocks of the relation
    sort the M blocks in memory
    write the sorted data to file R_i
    i := i + 1
until end of relation

read the first block of each R_i into memory
repeat
    choose the first tuple (in sort order) from among the pages
    write the tuple to output
    remove the tuple from the buffer
    if any buffer R_i is empty and not eof(R_i):
        read the next block from R_i into memory
until all pages are empty
\end{minted}

\section{Parallel Databases}
Characteristics of \textbf{parallel databases} include:
\begin{itemize}
    \item Increased transaction requirements.
    \item Increased volumes of data, particularly in data-warehousing.
    \item Many queries lend themselves easily to parallel execution.
    \item The time required to retrieve relations from disk can be reduced by partitioning the relations onto a set of disks.
    \item Horizontal partitioning is usually used: subsets of a relation are sent to different disks.
\end{itemize}

\subsection{Query Types}
Common types of queries include:
\begin{itemize}
    \item \textbf{Batch processing:} scanning an entire relation.
    \item \textbf{Point-Queries:} return all tuples that match some value.
    \item \textbf{Range-Queries:} return all tuples with some value in some range.
\end{itemize}

\subsection{Partitioning Approaches}
\subsubsection{Round Robin}
With \textbf{Round Robin}, the relation is scanned in order.
Assuming $n$ disks, the $i^{\text{th}}$ tuple is sent to disk $D_{i \bmod n}$.
Round Robin guarantees an even distribution of tuples across the disks.
\\\\
Round Robin is useful for batch processing, but is not very suitable for either point or range queries, as all disks have to be accessed.

\subsubsection{Hash Partitioning}
In \textbf{hash partitioning}, we choose attributes to act as partitioning attributes.
We define a hash function with range $0 \dots n-1$, assuming $n$ disks.
Each tuple is placed according to the result of the hash function.
\\\\
Hash partitioning is very useful if a point query is based on a partitioning attribute.
It is usually useful for batch querying if a fair hash function is used, but is poor for range querying.
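The following is a minimal sketch (not from the notes) of round-robin and hash partitioning across $n$ disks, with each disk modelled as a Python list; the choice of the first field as the partitioning attribute and the use of Python's built-in \mintinline{python}{hash} are assumptions of the sketch.

\begin{minted}[linenos, breaklines, frame=single]{python}
# Round-robin and hash partitioning of tuples across n disks.

n = 3                                     # number of disks
disks_rr = [[] for _ in range(n)]         # round-robin placement
disks_hashed = [[] for _ in range(n)]     # hash placement

tuples = [(101, "a"), (205, "b"), (117, "c"), (342, "d"), (98, "e")]

# Round robin: the i-th tuple is sent to disk D_(i mod n).
for i, t in enumerate(tuples):
    disks_rr[i % n].append(t)

# Hash partitioning: each tuple is placed according to a hash of the
# partitioning attribute (here, the first field).
for t in tuples:
    disks_hashed[hash(t[0]) % n].append(t)

print(disks_rr)       # an even spread: the disks hold 2, 2, and 1 tuples
print(disks_hashed)   # spread depends on the hash of the partitioning attribute
\end{minted}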
\subsubsection{Range Partitioning}
In \textbf{range partitioning}, a partitioning attribute is first chosen.
A partitioning vector $\langle v_0, v_1, \dots, v_{n-2} \rangle$ is defined, and tuples are placed according to the value of the partitioning attribute: if $t[\text{partitioning attribute}] < v_0$, then we place tuple $t$ on disk $D_0$; if $v_{i-1} \leq t[\text{partitioning attribute}] < v_i$, we place $t$ on disk $D_i$.
\\\\
Range partitioning is useful for both point \& range querying, but can lead to inefficiency in range querying if many tuples satisfy the condition.

\subsection{Types of Parallelism}
\subsubsection{Inter-Query Parallelism}
In \textbf{inter-query parallelism}, different transactions run in parallel on different processors, thus increasing the transaction throughput, although the times for individual queries remain the same.
Inter-query parallelism is the easiest form of parallelism to implement.

\subsubsection{Intra-Query Parallelism}
\textbf{Intra-query parallelism} allows us to run a single query in parallel on multiple processors (\& disks), which can speed up the running time of a query.
Parallel execution can be achieved by parallelising individual components, which is called \textbf{intra-operation parallelism}.
Parallel execution can also be achieved by evaluating portions of the query in parallel, which is called \textbf{inter-operation parallelism}.
Both of these types can be combined.
\end{document}