[CT414]: Finish Assignment 2
@ -22,8 +22,8 @@ public class MapReduceFiles {
return;
}

int[] mapSizes = {1000, 2000, 5000, 10000};
int[] reduceSizes = {100, 200, 500, 1000};
int[] mapSizes = {1000, 2000, 4500, 4750, 5000, 5250, 5500, 10000};
int[] reduceSizes = {100, 150, 200, 500, 1000};

System.out.println("===== Starting Grid Search =====");

@ -1,17 +1,41 @@
MapLines,ReduceWords,MapTime,GroupTime,ReduceTime,TotalTime
1000,100,2099,406,335,2840
1000,200,1610,454,198,2262
1000,500,1388,452,46,1886
1000,1000,1538,302,48,1888
2000,100,1726,314,263,2303
2000,200,1512,323,62,1897
2000,500,1669,334,46,2049
2000,1000,1762,279,113,2154
5000,100,1291,331,92,1714
5000,200,1877,368,67,2312
5000,500,1640,396,41,2077
5000,1000,1439,365,193,1997
10000,100,1285,359,94,1738
10000,200,1598,359,98,2055
10000,500,1489,314,68,1871
10000,1000,1460,332,47,1839
1000,100,1534,269,230,2033
1000,150,1192,352,145,1689
1000,200,1129,243,49,1421
1000,500,1168,281,36,1485
1000,1000,1425,237,69,1731
2000,100,1104,290,159,1553
2000,150,1158,303,59,1520
2000,200,1216,269,46,1531
2000,500,1202,260,36,1498
2000,1000,1202,264,37,1503
4500,100,1145,252,171,1568
4500,150,1089,238,59,1386
4500,200,995,294,214,1503
4500,500,863,295,188,1346
4500,1000,1183,250,55,1488
4750,100,1016,267,172,1455
4750,150,1078,228,59,1365
4750,200,1041,260,47,1348
4750,500,1110,278,36,1424
4750,1000,975,263,35,1273
5000,100,1069,277,74,1420
5000,150,1253,224,54,1531
5000,200,874,306,218,1398
5000,500,1118,252,36,1406
5000,1000,1006,244,47,1297
5250,100,1165,225,72,1462
5250,150,1008,272,173,1453
5250,200,1054,250,47,1351
5250,500,1134,275,37,1446
5250,1000,1027,280,42,1349
5500,100,976,249,71,1296
5500,150,1062,285,228,1575
5500,200,882,290,47,1219
5500,500,1254,257,36,1547
5500,1000,999,308,39,1346
10000,100,1167,360,74,1601
10000,150,1093,270,61,1424
10000,200,1075,296,57,1428
10000,500,1161,279,39,1479
10000,1000,1154,255,38,1447

@ -143,13 +143,41 @@ The distributed MapReduce saw significant improvements in the map \& group phase
After implementing the changes requested in steps 2--6 of the assignment specification, I implemented a grid-search function that tests a range of values for the number of lines of text per map thread and the number of words per reduce thread.
The results of this grid-search were exported to a CSV file for analysis.
I then wrote a Python script to visualise the parameter combinations using heatmaps.
Heatmaps showing the results for the map phase and the reduce phase individually, as well as a table of the results from the CSV file, can be found in the Appendix.
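
As a rough illustration of the analysis step described above, a minimal Python sketch using only the standard library could load the CSV, pick out the fastest combination, and pivot the total times into the grid shape that a heatmap needs. This is not the actual \texttt{plots.py}, which is listed in the Appendix; the inlined rows are only an excerpt of the results (the 4,750-line and 5,500-line rows), and the names \texttt{load\_rows}, \texttt{best\_combination}, and \texttt{pivot} are hypothetical.

```python
import csv
import io

# Excerpt of performance_results.csv (4750- and 5500-line rows only).
CSV_TEXT = """\
MapLines,ReduceWords,MapTime,GroupTime,ReduceTime,TotalTime
4750,100,1016,267,172,1455
4750,150,1078,228,59,1365
4750,200,1041,260,47,1348
4750,500,1110,278,36,1424
4750,1000,975,263,35,1273
5500,100,976,249,71,1296
5500,150,1062,285,228,1575
5500,200,882,290,47,1219
5500,500,1254,257,36,1547
5500,1000,999,308,39,1346
"""


def load_rows(text):
    """Parse the CSV text into a list of dicts with integer values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: int(value) for key, value in row.items()} for row in reader]


def best_combination(rows):
    """Return the row with the smallest TotalTime."""
    return min(rows, key=lambda row: row["TotalTime"])


def pivot(rows, value="TotalTime"):
    """Build grid[map_lines][reduce_words] -> value for plotting."""
    grid = {}
    for row in rows:
        grid.setdefault(row["MapLines"], {})[row["ReduceWords"]] = row[value]
    return grid


rows = load_rows(CSV_TEXT)
best = best_combination(rows)
grid = pivot(rows)
print(best["MapLines"], best["ReduceWords"], best["TotalTime"])
```

Passing \texttt{value="MapTime"} or \texttt{value="ReduceTime"} to \texttt{pivot} would give the corresponding per-phase grids.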

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{./images/gridsearch.png}
\includegraphics[width=0.8\textwidth]{./images/gridsearch.png}
\caption{Running the grid-search and plotting the results}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{./images/total_time_heatmap.png}
\caption{Heatmap of total time taken by each parameter combination}
\end{figure}

As can be seen from the heatmap above, the best result achieved by the distributed MapReduce approach was 1,219 milliseconds with the parameter combination of 200 words per reduce thread and 5,500 lines per map thread.
Not only is this over 26 times faster than the earlier 31,874 millisecond result ($\frac{31874}{1219} \approx 26.15$), but it also beats the original brute force result of 1,307 milliseconds and the non-distributed MapReduce result of 1,372 milliseconds.
We can tell that this isn't a fluke, as it is consistent with other nearby values in the heatmap, such as 1,296 milliseconds (5,500 lines per map thread, 100 words per reduce thread) and 1,273 milliseconds (4,750 lines per map thread, 1,000 words per reduce thread).
\\\\
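To make the arithmetic behind these comparisons explicit, the following minimal Python sketch computes the speed-up factor and the percentage reduction in run time of the best grid-search result against each of the three timings quoted above. The labels in \texttt{baselines} and the helper names \texttt{speedup} and \texttt{percent\_reduction} are hypothetical and purely illustrative.

```python
# Best grid-search result: 5,500 lines per map thread, 200 words per reduce thread.
BEST_MS = 1219

# Timings quoted in the text, in milliseconds (labels are illustrative).
baselines = {
    "31,874 ms reference timing": 31874,
    "brute force": 1307,
    "non-distributed MapReduce": 1372,
}


def speedup(baseline_ms, new_ms):
    """How many times faster the new run is than the baseline."""
    return baseline_ms / new_ms


def percent_reduction(baseline_ms, new_ms):
    """Percentage of the baseline run time that was saved."""
    return (baseline_ms - new_ms) / baseline_ms * 100


for name, baseline_ms in baselines.items():
    print(
        f"{name}: {speedup(baseline_ms, BEST_MS):.2f}x faster, "
        f"{percent_reduction(baseline_ms, BEST_MS):.1f}% less time"
    )
```

Reporting both figures avoids ambiguity: a ratio of about 26 corresponds to a run time roughly 96\% shorter, whereas the margin over brute force is only about 7\%.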
What I find most interesting about these results is that they are so similar to the brute force results; while the fully-optimised \& tuned distributed MapReduce does beat brute force, it only does so by a narrow margin of 88 milliseconds (about 7\%).
I have a few ideas as to why this might be the case:
the brute force approach may be better suited to CPU caching because it processes the text sequentially, or the overhead of running so many threads on the CPU of a single laptop may be high, but I think that the main reason is that the dataset isn't actually very big.
While 10 large books may seem to be quite a lot of data, MapReduce was created to deal with petabytes of data, not a few megabytes.
I would hypothesise that the real performance benefits of distributed MapReduce would only become clear if the testing were repeated on at least a couple of gigabytes of data.

\section{Appendix: Source Code}
\begin{code}
\inputminted[linenos, breaklines, frame=single]{java}{../code/MapReduceFiles.java}
\caption{\texttt{MapReduceFiles.java}}
\end{code}

\begin{code}
\inputminted[linenos, breaklines, frame=single]{python}{../code/plots.py}
\caption{\texttt{plots.py}}
\end{code}

\begin{table}[H]
\centering
\pgfplotstabletypeset[
@ -168,13 +196,6 @@ I then wrote a Python script to visualise the parameter combinations using heatm
\caption{Results written to \texttt{performance\_results.csv}}
\end{table}


\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{./images/total_time_heatmap.png}
\caption{Heatmap of total time taken by each parameter combination}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{./images/map_time_heatmap.png}
@ -187,18 +208,6 @@ I then wrote a Python script to visualise the parameter combinations using heatm
\caption{Heatmap of time taken during the reduce phase by each parameter combination}
\end{figure}

\section{Appendix: Source Code}
\begin{code}
\inputminted[linenos, breaklines, frame=single]{java}{../code/MapReduceFiles.java}
\caption{\texttt{MapReduceFiles.java}}
\end{code}

\begin{code}
\inputminted[linenos, breaklines, frame=single]{python}{../code/plots.py}
\caption{\texttt{plots.py}}
\end{code}

\end{document}