[CT414]: Finish Assignment 2

2025-03-24 01:22:25 +00:00
parent be4f004bcd
commit 59a466c782
8 changed files with 71 additions and 38 deletions


@@ -22,8 +22,8 @@ public class MapReduceFiles {
 return;
 }
-int[] mapSizes = {1000, 2000, 5000, 10000};
+int[] mapSizes = {1000, 2000, 4500, 4750, 5000, 5250, 5500, 10000};
-int[] reduceSizes = {100, 200, 500, 1000};
+int[] reduceSizes = {100, 150, 200, 500, 1000};
 System.out.println("===== Starting Grid Search =====");


@@ -1,17 +1,41 @@
 MapLines,ReduceWords,MapTime,GroupTime,ReduceTime,TotalTime
-1000,100,2099,406,335,2840
-1000,200,1610,454,198,2262
-1000,500,1388,452,46,1886
-1000,1000,1538,302,48,1888
-2000,100,1726,314,263,2303
-2000,200,1512,323,62,1897
-2000,500,1669,334,46,2049
-2000,1000,1762,279,113,2154
-5000,100,1291,331,92,1714
-5000,200,1877,368,67,2312
-5000,500,1640,396,41,2077
-5000,1000,1439,365,193,1997
-10000,100,1285,359,94,1738
-10000,200,1598,359,98,2055
-10000,500,1489,314,68,1871
-10000,1000,1460,332,47,1839
+1000,100,1534,269,230,2033
+1000,150,1192,352,145,1689
+1000,200,1129,243,49,1421
+1000,500,1168,281,36,1485
+1000,1000,1425,237,69,1731
+2000,100,1104,290,159,1553
+2000,150,1158,303,59,1520
+2000,200,1216,269,46,1531
+2000,500,1202,260,36,1498
+2000,1000,1202,264,37,1503
+4500,100,1145,252,171,1568
+4500,150,1089,238,59,1386
+4500,200,995,294,214,1503
+4500,500,863,295,188,1346
+4500,1000,1183,250,55,1488
+4750,100,1016,267,172,1455
+4750,150,1078,228,59,1365
+4750,200,1041,260,47,1348
+4750,500,1110,278,36,1424
+4750,1000,975,263,35,1273
+5000,100,1069,277,74,1420
+5000,150,1253,224,54,1531
+5000,200,874,306,218,1398
+5000,500,1118,252,36,1406
+5000,1000,1006,244,47,1297
+5250,100,1165,225,72,1462
+5250,150,1008,272,173,1453
+5250,200,1054,250,47,1351
+5250,500,1134,275,37,1446
+5250,1000,1027,280,42,1349
+5500,100,976,249,71,1296
+5500,150,1062,285,228,1575
+5500,200,882,290,47,1219
+5500,500,1254,257,36,1547
+5500,1000,999,308,39,1346
+10000,100,1167,360,74,1601
+10000,150,1093,270,61,1424
+10000,200,1075,296,57,1428
+10000,500,1161,279,39,1479
+10000,1000,1154,255,38,1447
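
Note on reading these results: the fastest parameter combination can be picked out of the CSV with a few lines of standard-library Python. The sketch below is illustrative only and is not the `plots.py` script from this commit. The helper names `load_rows`, `best_combination` and `phases_add_up` are hypothetical, and `SAMPLE` holds just five rows copied from the updated results rather than the full 40-row file.

```python
import csv
import io

# Five rows copied from the updated results above (the real file has 40 data rows).
SAMPLE = """MapLines,ReduceWords,MapTime,GroupTime,ReduceTime,TotalTime
4750,1000,975,263,35,1273
5000,1000,1006,244,47,1297
5500,100,976,249,71,1296
5500,200,882,290,47,1219
10000,150,1093,270,61,1424
"""


def load_rows(csv_text):
    """Parse the CSV text into a list of dicts with integer values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{key: int(value) for key, value in row.items()} for row in reader]


def best_combination(rows):
    """Return the row with the lowest TotalTime (milliseconds)."""
    return min(rows, key=lambda row: row["TotalTime"])


def phases_add_up(row):
    """Check that the three phase timings sum to the recorded total."""
    return row["MapTime"] + row["GroupTime"] + row["ReduceTime"] == row["TotalTime"]


rows = load_rows(SAMPLE)
best = best_combination(rows)
print(best["MapLines"], best["ReduceWords"], best["TotalTime"])
```

To run the same check on the whole grid, the text of the exported CSV file could be passed to `load_rows` in place of `SAMPLE`.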



@@ -143,13 +143,41 @@ The distributed MapReduce saw significant improvements in the map \& group phase
 After implementing the requested changes in steps 2--6 of the assignment specification, I then implemented a grid-search function which tested a range of values for the number of lines of text per map thread and the number of words per reduce thread.
 The results of this grid-search were exported to a CSV file for analysis.
 I then wrote a Python script to visualise the parameter combinations using heatmaps.
+Heatmaps that contain results pertaining only to the map phase and the reduce phase, as well as a table of the results from the CSV file, can be found in the Appendix.
 \begin{figure}[H]
 \centering
-\includegraphics[width=\textwidth]{./images/gridsearch.png}
+\includegraphics[width=0.8\textwidth]{./images/gridsearch.png}
 \caption{Running the grid-search and plotting the results}
 \end{figure}
+\begin{figure}[H]
+\centering
+\includegraphics[width=0.8\textwidth]{./images/total_time_heatmap.png}
+\caption{Heatmap of total time taken by each parameter combination}
+\end{figure}
+As can be seen from the heatmap above, the best result achieved by the distributed MapReduce approach was 1,219 milliseconds, with the parameter combination of 5,500 lines per map thread and 200 words per reduce thread.
+Not only is this roughly a 26-fold speedup ($\frac{31874}{1219} \approx 26.15$), but it also beats the original brute force result of 1,307 milliseconds and the non-distributed MapReduce result of 1,372 milliseconds.
+This is unlikely to be a fluke, as it is consistent with other nearby values in the heatmap, which show similar times such as 1,296 milliseconds and 1,273 milliseconds.
+\\\\
+What I find most interesting about these results is that they are so similar to the brute force results; while the fully-optimised \& tuned distributed MapReduce does beat brute force, it only does so by a narrow margin.
+I have a few ideas as to why this might be the case:
+the brute force approach may be better suited to CPU caching because it processes the data sequentially, or the overhead of creating and scheduling many threads on a single laptop CPU may be eating into the gains, but I think that the main reason is that the dataset isn't actually very big.
+While 10 large books may seem to be quite a lot of data, MapReduce was created to deal with petabytes of data, not a few megabytes.
+I would hypothesise that the real performance benefits of distributed MapReduce would only become clear if the testing were repeated on at least a couple of gigabytes of data.
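
Note on the arithmetic in the paragraph above: the comparisons can be checked with a short Python sketch. It is illustrative only. The function names `speedup` and `percent_reduction` are hypothetical, and the timings are the figures quoted in the report text (31,874 ms as the report's baseline, 1,307 ms for brute force, 1,372 ms for non-distributed MapReduce, 1,219 ms for the best tuned run).

```python
def speedup(baseline_ms, tuned_ms):
    """Return how many times faster the tuned run is than the baseline."""
    return baseline_ms / tuned_ms


def percent_reduction(baseline_ms, tuned_ms):
    """Return the percentage of the baseline runtime that was eliminated."""
    return (baseline_ms - tuned_ms) / baseline_ms * 100


BEST_MS = 1219  # best grid-search result: 5,500 lines per map, 200 words per reduce

# Baselines as quoted in the report text.
BASELINES = [
    ("baseline used in the report", 31874),
    ("brute force", 1307),
    ("non-distributed MapReduce", 1372),
]

for label, baseline_ms in BASELINES:
    print(
        f"{label}: {speedup(baseline_ms, BEST_MS):.2f}x faster, "
        f"{percent_reduction(baseline_ms, BEST_MS):.1f}% less time"
    )
```

Reporting the ratio as a speedup factor keeps it separate from a percentage reduction in runtime, which can never exceed 100\%.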
+\section{Appendix: Source Code}
+\begin{code}
+\inputminted[linenos, breaklines, frame=single]{java}{../code/MapReduceFiles.java}
+\caption{\texttt{MapReduceFiles.java}}
+\end{code}
+\begin{code}
+\inputminted[linenos, breaklines, frame=single]{python}{../code/plots.py}
+\caption{\texttt{plots.py}}
+\end{code}
 \begin{table}[H]
 \centering
 \pgfplotstabletypeset[
@@ -168,13 +196,6 @@ I then wrote a Python script to visualise the parameter combinations using heatm
 \caption{Results written to \texttt{performance\_results.csv}}
 \end{table}
-\begin{figure}[H]
-\centering
-\includegraphics[width=0.8\textwidth]{./images/total_time_heatmap.png}
-\caption{Heatmap of total time taken by each parameter combination}
-\end{figure}
 \begin{figure}[H]
 \centering
 \includegraphics[width=0.8\textwidth]{./images/map_time_heatmap.png}
@@ -187,18 +208,6 @@ I then wrote a Python script to visualise the parameter combinations using heatm
 \caption{Heatmap of time taken during the reduce phase by each parameter combination}
 \end{figure}
-\section{Appendix: Source Code}
-\begin{code}
-\inputminted[linenos, breaklines, frame=single]{java}{../code/MapReduceFiles.java}
-\caption{\texttt{MapReduceFiles.java}}
-\end{code}
-\begin{code}
-\inputminted[linenos, breaklines, frame=single]{python}{../code/plots.py}
-\caption{\texttt{plots.py}}
-\end{code}
 \end{document}

Binary file not shown. Before: 702 KiB. After: 697 KiB.

Binary file not shown. Before: 39 KiB. After: 57 KiB.

Binary file not shown. Before: 33 KiB. After: 48 KiB.

Binary file not shown. Before: 38 KiB. After: 60 KiB.