EMR 002 Large Scale Sorting on EMR
1. Problem Statement
Sorting is any process of arranging items systematically. Sorting has two common, yet distinct, meanings: (1) ordering - arranging items in a sequence ordered by some criterion; and (2) categorizing - grouping items with similar properties. In this tutorial, we are going to sort a very large amount of test data so that the items in the output are arranged in a particular order.
The sorting problem becomes a big data problem when the combined size of the input / intermediate / output data exceeds the storage space available on a single machine.
2. Theories
We encourage you to read the Wikipedia articles on sorting and on external sorting to understand what sorting is, as well as the commonly used sorting algorithms. In particular, you need to understand that external sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and instead must reside in slower external memory, usually a hard disk drive.
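As a concrete (if tiny) illustration of external sorting, the following Python sketch sorts a text file that is too large to hold in memory by spilling sorted runs to disk and then merging them. The chunk size and file handling are purely illustrative and are not how Hadoop implements sorting.

```python
# External merge sort sketch: sort fixed-size chunks in memory, spill each
# sorted chunk ("run") to disk, then k-way merge the runs into the output.
# Assumes every input line ends with a newline character.
import heapq, os, tempfile
from itertools import islice

def external_sort(input_path, output_path, lines_per_chunk=1_000_000):
    run_files = []
    # Phase 1: read the input in chunks that fit in memory, sort each chunk,
    # and spill it to disk as a sorted run.
    with open(input_path) as src:
        while True:
            chunk = list(islice(src, lines_per_chunk))
            if not chunk:
                break
            chunk.sort()
            with tempfile.NamedTemporaryFile("w", delete=False, suffix=".run") as run:
                run.writelines(chunk)
                run_files.append(run.name)
    # Phase 2: k-way merge of all sorted runs into the final output.
    runs = [open(name) for name in run_files]
    with open(output_path, "w") as out:
        out.writelines(heapq.merge(*runs))
    for r in runs:
        r.close()
    for name in run_files:
        os.remove(name)
```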
The above-mentioned theories deal with sorting data on a single machine. When the combined size of the input / intermediate / output data is beyond the storage capacity of a single machine, parallel and distributed sorting is required. In general, this is done on a cluster with multiple nodes, where each node initially holds a portion of the input data to be sorted. Parallel and distributed sorting typically involves the following steps:
- Pre-Processing - Optional pre-processing on each node, for example, sorting the input data locally.
- Partitioning - Each node represents a partition in the sorting network, receiving only a certain range of the data to be sorted. Assuming that we have 10 nodes and the range of the input data is from 1 to 9,999,999, the 1st node / partition receives only the data from 1 to 1,000,000, while the 10th node / partition receives only the data from 9,000,000 to 9,999,999.
- Mapping - Each node sends each input item to its destination node, based on the item's value. Each node stores the data it receives as intermediate data on local storage.
- Sorting - Each node sorts its intermediate data to produce the final sorted output.
During the mapping process, input data is read from disk on the source node, sent over the network, and written to disk on the destination node. During the sorting phase, intermediate data is read from disk, sorted, and written to disk on each node. Therefore, parallel and distributed sorting usually requires 2X disk I/O (both reads and writes) and 1X network I/O. 3X disk space is usually required to store the input, intermediate, and output data, unless part of the input and intermediate data are deleted from disk during the sorting process.
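To make the four steps above concrete, here is a minimal single-process Python sketch that simulates range-partitioned sorting across a handful of "nodes". All names, counts, and partition boundaries are illustrative and unrelated to how TeraSort implements partitioning.

```python
import random

NUM_NODES = 10
MAX_KEY = 9_999_999
RANGE_PER_NODE = MAX_KEY // NUM_NODES + 1   # 1,000,000 keys per partition

# Initially, each "node" holds an unsorted portion of the input data.
node_inputs = [[random.randint(1, MAX_KEY) for _ in range(100_000)]
               for _ in range(NUM_NODES)]

def destination(key):
    """Partitioning: each contiguous key range maps to one node."""
    return min((key - 1) // RANGE_PER_NODE, NUM_NODES - 1)

# Mapping: every node sends each item to its destination node, which keeps
# the received items as intermediate data.
intermediate = [[] for _ in range(NUM_NODES)]
for portion in node_inputs:
    for key in portion:
        intermediate[destination(key)].append(key)

# Sorting: each node sorts its own intermediate data locally.
outputs = [sorted(part) for part in intermediate]

# Because the partitions cover disjoint, increasing key ranges, concatenating
# the per-node outputs yields a globally sorted result.
merged = [key for part in outputs for key in part]
assert merged == sorted(merged)
print(f"{len(merged)} records globally sorted across {NUM_NODES} partitions")
```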
In the industry, the Sort Benchmark compares external sorting algorithms implemented using finely tuned hardware and software. Each winning system provides detailed test reports describing both the hardware and software. If you are interested in the parallel and distributed sorting problem, you should read these test reports.
3. Demo/Test Introduction
In this demo, we will use Hadoop MapReduce to sort 2 TB of test data on an EMR cluster. Because the size of the input and output data (4 TB total) is beyond the storage capacity of a regular computer, sorting at this scale is usually done in a parallel and distributed manner.
The system used to perform the above-mentioned sorting is an EMR cluster with one m3.xlarge master node and two i2.4xlarge core nodes. The HDFS replication factor is set to 1 to avoid over-utilization of storage space. We used the teragen, terasort, and teravalidate applications in the Hadoop MapReduce examples to generate the input, sort the data, and validate the output.
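A quick back-of-the-envelope calculation explains both the 2 TB figure and the choice of dfs.replication=1. TeraGen writes fixed 100-byte records, and the HDFS capacity figure below is taken from the dfsadmin report in step (3).

```python
# Storage arithmetic for this demo, in plain bytes.
rows = 20_000_000_000              # rows passed to teragen in step (4)
record_bytes = 100                 # each TeraGen/TeraSort record is 100 bytes
input_bytes = rows * record_bytes  # 2.0e12 bytes, i.e. 2 TB of input
output_bytes = input_bytes         # terasort writes the same volume back to HDFS
hdfs_capacity = 6_383_165_308_928  # "Configured Capacity" from the dfsadmin report

print(input_bytes + output_bytes)        # 4.0e12 bytes stored with replication 1
print(2 * (input_bytes + output_bytes))  # 8.0e12 bytes would be needed with replication 2
print(hdfs_capacity)                     # only ~6.4e12 bytes are available
```

With a replication factor of 2, the input and output alone would exceed the cluster's HDFS capacity, before even counting the intermediate data written to local disk during the shuffle.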
4. Demo/Test Procedure
(1) Launch an EMR cluster with 1 master node (m3.xlarge) and 2 core nodes (i2.4xlarge). The release version is emr-5.8.0. It should be noted that in an EMR cluster the master node does not participate in the computation and is not an HDFS data node. Each i2.4xlarge node has 16 vCPUs, 122 GB of memory, and 4 x 800 GB of SSD storage. When there are fewer than 4 core nodes in an EMR cluster, EMR automatically sets dfs.replication=1.
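If you prefer to launch the same cluster programmatically rather than through the EMR console, a boto3 sketch along the following lines works. The cluster name, key pair name, and IAM role names are placeholders, and the default EMR roles are assumed to exist in the account.

```python
# Sketch of launching the demo cluster with boto3 (RunJobFlow).
import boto3

emr = boto3.client("emr", region_name="ap-southeast-2")

response = emr.run_job_flow(
    Name="large-scale-sorting",            # placeholder name
    ReleaseLabel="emr-5.8.0",
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "i2.4xlarge", "InstanceCount": 2},
        ],
        "Ec2KeyName": "my-key-pair",            # placeholder key pair
        "KeepJobFlowAliveWhenNoSteps": True,    # keep the cluster running for SSH
        # Depending on the account, "Ec2SubnetId" may also need to be specified.
    },
    JobFlowRole="EMR_EC2_DefaultRole",          # assumes default EMR roles exist
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print("Cluster ID:", response["JobFlowId"])
```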
(2) SSH into the master node of the EMR cluster with "hadoop" as the username.
(3) Use the following command to view HDFS capacity. As shown in the command output, HDFS contains two data nodes, with 2.90 TB storage on each node.
```
[hadoop@ip-172-31-21-222 ~]$ hdfs dfsadmin -report
Configured Capacity: 6383165308928 (5.81 TB)
Present Capacity: 6382674655506 (5.81 TB)
DFS Remaining: 6381873709056 (5.80 TB)
DFS Used: 800946450 (763.84 MB)
DFS Used%: 0.01%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (2):
Name: 172.31.21.165:50010 (ip-172-31-21-165.ap-southeast-2.compute.internal)
Hostname: ip-172-31-21-165.ap-southeast-2.compute.internal
Decommission Status : Normal
Configured Capacity: 3191582654464 (2.90 TB)
DFS Used: 451154026 (430.25 MB)
Non DFS Used: 245215126 (233.86 MB)
DFS Remaining: 3190886285312 (2.90 TB)
DFS Used%: 0.01%
DFS Remaining%: 99.98%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Sep 08 00:55:59 UTC 2017
Name: 172.31.27.186:50010 (ip-172-31-27-186.ap-southeast-2.compute.internal)
Hostname: ip-172-31-27-186.ap-southeast-2.compute.internal
Decommission Status : Normal
Configured Capacity: 3191582654464 (2.90 TB)
DFS Used: 349792424 (333.59 MB)
Non DFS Used: 245438296 (234.07 MB)
DFS Remaining: 3190987423744 (2.90 TB)
DFS Used%: 0.01%
DFS Remaining%: 99.98%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Sep 08 00:56:00 UTC 2017
```
(4) Use the following command to generate the input data.
```
[hadoop@ip-172-31-21-222 ~]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 20000000000 /teraInput
17/09/08 01:15:40 INFO impl.TimelineClientImpl: Timeline service address: http://ip-172-31-21-222.ap-southeast-2.compute.internal:8188/ws/v1/timeline/
17/09/08 01:15:40 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-21-222.ap-southeast-2.compute.internal/172.31.21.222:8032
17/09/08 01:15:41 INFO terasort.TeraSort: Generating 20000000000 using 64
17/09/08 01:15:41 INFO mapreduce.JobSubmitter: number of splits:64
17/09/08 01:15:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1504831629578_0001
17/09/08 01:15:41 INFO impl.YarnClientImpl: Submitted application application_1504831629578_0001
17/09/08 01:15:42 INFO mapreduce.Job: The url to track the job: http://ip-172-31-21-222.ap-southeast-2.compute.internal:20888/proxy/application_1504831629578_0001/
17/09/08 01:15:42 INFO mapreduce.Job: Running job: job_1504831629578_0001
17/09/08 01:15:49 INFO mapreduce.Job: Job job_1504831629578_0001 running in uber mode : false
17/09/08 01:15:49 INFO mapreduce.Job: map 0% reduce 0%
17/09/08 01:16:12 INFO mapreduce.Job: map 1% reduce 0%
17/09/08 01:16:38 INFO mapreduce.Job: map 2% reduce 0%
17/09/08 01:16:55 INFO mapreduce.Job: map 3% reduce 0%
17/09/08 01:17:09 INFO mapreduce.Job: map 4% reduce 0%
......
17/09/08 01:38:56 INFO mapreduce.Job: map 96% reduce 0%
17/09/08 01:39:18 INFO mapreduce.Job: map 97% reduce 0%
17/09/08 01:39:39 INFO mapreduce.Job: map 98% reduce 0%
17/09/08 01:40:41 INFO mapreduce.Job: map 99% reduce 0%
17/09/08 01:42:12 INFO mapreduce.Job: map 100% reduce 0%
17/09/08 01:42:59 INFO mapreduce.Job: Job job_1504831629578_0001 completed successfully
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=8154998
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=5614
HDFS: Number of bytes written=2000000000000
HDFS: Number of read operations=256
HDFS: Number of large read operations=0
HDFS: Number of write operations=128
Job Counters
Killed map tasks=2
Launched map tasks=66
Other local map tasks=66
Total time spent by all maps in occupied slots (ms)=9785465766
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=85837419
Total vcore-milliseconds taken by all map tasks=85837419
Total megabyte-milliseconds taken by all map tasks=313134904512
Map-Reduce Framework
Map input records=20000000000
Map output records=20000000000
Input split bytes=5614
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=136647
CPU time spent (ms)=36582810
Physical memory (bytes) snapshot=63729070080
Virtual memory (bytes) snapshot=330655215616
Total committed heap usage (bytes)=64965050368
org.apache.hadoop.examples.terasort.TeraGen$Counters
CHECKSUM=6056631481885948571
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=2000000000000
```
The data generation phase took approximately 30 minutes to complete. You can view the various metrics on the EMR cluster using the "Monitoring" tab in the EMR console. In the data generation phase, it is useful to look into the "HDFS Utilization" and "HDFS Bytes Written" graphs.
On the node level, we can also use the CloudWatch console to view the CPU utilization, network I/O, and disk I/O on each node. In terms of CPU utilization, both core nodes reached 100% CPU utilization, while the CPU utilization on the master node was very low. In terms of disk I/O, both core nodes had a large amount of data written to disk, while there was no data written to disk on the master node. In terms of network I/O, there was very little network I/O on all nodes, as compared to the level of disk I/O observed on the core nodes.
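If you prefer to pull these node-level metrics programmatically instead of browsing the CloudWatch console, a boto3 sketch along the following lines works. The instance ID is a placeholder for one of the core nodes.

```python
# Sketch of retrieving per-node CPU utilization for the last hour from CloudWatch.
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-2")
now = datetime.utcnow()

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",   # also try DiskWriteBytes, NetworkIn, NetworkOut
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), "%")
```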
(5) Use the following command to sort the input data.
```
[hadoop@ip-172-31-21-222 ~]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort /teraInput /teraOutput
17/09/08 02:01:12 INFO terasort.TeraSort: starting
17/09/08 02:01:14 INFO input.FileInputFormat: Total input paths to process : 64
Spent 753ms computing base-splits.
Spent 89ms computing TeraScheduler splits.
Computing input splits took 843ms
Sampling 10 splits of 14912
Making 31 from 100000 sampled records
Computing parititions took 676ms
Spent 1520ms computing partitions.
17/09/08 02:01:15 INFO impl.TimelineClientImpl: Timeline service address: http://ip-172-31-21-222.ap-southeast-2.compute.internal:8188/ws/v1/timeline/
17/09/08 02:01:15 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-21-222.ap-southeast-2.compute.internal/172.31.21.222:8032
17/09/08 02:01:16 INFO mapreduce.JobSubmitter: number of splits:14912
17/09/08 02:01:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1504831629578_0002
17/09/08 02:01:16 INFO impl.YarnClientImpl: Submitted application application_1504831629578_0002
17/09/08 02:01:16 INFO mapreduce.Job: The url to track the job: http://ip-172-31-21-222.ap-southeast-2.compute.internal:20888/proxy/application_1504831629578_0002/
17/09/08 02:01:16 INFO mapreduce.Job: Running job: job_1504831629578_0002
17/09/08 02:01:31 INFO mapreduce.Job: Job job_1504831629578_0002 running in uber mode : false
17/09/08 02:01:31 INFO mapreduce.Job: map 0% reduce 0%
17/09/08 02:02:36 INFO mapreduce.Job: map 1% reduce 0%
17/09/08 02:04:05 INFO mapreduce.Job: map 2% reduce 0%
17/09/08 02:05:50 INFO mapreduce.Job: map 3% reduce 0%
17/09/08 02:07:25 INFO mapreduce.Job: map 4% reduce 0%
......
17/09/08 05:35:28 INFO mapreduce.Job: map 100% reduce 96%
17/09/08 05:36:22 INFO mapreduce.Job: map 100% reduce 97%
17/09/08 05:37:19 INFO mapreduce.Job: map 100% reduce 98%
17/09/08 05:38:48 INFO mapreduce.Job: map 100% reduce 99%
17/09/08 05:40:36 INFO mapreduce.Job: map 100% reduce 100%
17/09/08 05:42:05 INFO mapreduce.Job: Job job_1504831629578_0002 completed successfully
17/09/08 05:42:05 INFO mapreduce.Job: Counters: 51
File System Counters
FILE: Number of bytes read=909878072144
FILE: Number of bytes written=1792802087913
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2000002206976
HDFS: Number of bytes written=2000000000000
HDFS: Number of read operations=44829
HDFS: Number of large read operations=0
HDFS: Number of write operations=62
Job Counters
Killed map tasks=1
Launched map tasks=14912
Launched reduce tasks=31
Data-local map tasks=14875
Rack-local map tasks=37
Total time spent by all maps in occupied slots (ms)=45190274994
Total time spent by all reduces in occupied slots (ms)=42127669668
Total time spent by all map tasks (ms)=396405921
Total time spent by all reduce tasks (ms)=184770481
Total vcore-milliseconds taken by all map tasks=396405921
Total vcore-milliseconds taken by all reduce tasks=184770481
Total megabyte-milliseconds taken by all map tasks=1446088799808
Total megabyte-milliseconds taken by all reduce tasks=1348085429376
Map-Reduce Framework
Map input records=20000000000
Map output records=20000000000
Map output bytes=2040000000000
Map output materialized bytes=880993050187
Input split bytes=2206976
Combine input records=0
Combine output records=0
Reduce input groups=20000000000
Reduce shuffle bytes=880993050187
Reduce input records=20000000000
Reduce output records=20000000000
Spilled Records=40000000000
Shuffled Maps =462272
Failed Shuffles=0
Merged Map outputs=462272
GC time elapsed (ms)=18623264
CPU time spent (ms)=275541400
Physical memory (bytes) snapshot=16447003381760
Virtual memory (bytes) snapshot=77316479946752
Total committed heap usage (bytes)=37070665940992
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=2000000000000
File Output Format Counters
Bytes Written=2000000000000
17/09/08 05:42:05 INFO terasort.TeraSort: done
```
The sort phase took 3 hours and 42 minutes to complete. Again, you can use the CloudWatch console to view the CPU utilization, disk I/O, and network I/O on the various nodes of the EMR cluster while the job is running.
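For a rough sense of sustained throughput, the job counters above can be turned into aggregate I/O rates. This is back-of-the-envelope arithmetic only, using the elapsed time reported above.

```python
# Back-of-the-envelope throughput from the terasort counters above.
elapsed_s = 3 * 3600 + 42 * 60        # roughly 3 hours and 42 minutes
hdfs_read = 2_000_002_206_976         # "HDFS: Number of bytes read"
hdfs_written = 2_000_000_000_000      # "HDFS: Number of bytes written"
shuffled = 880_993_050_187            # "Reduce shuffle bytes" (materialized map output)

print(f"HDFS read:    {hdfs_read / elapsed_s / 1e6:.0f} MB/s aggregate")     # ~150 MB/s
print(f"HDFS written: {hdfs_written / elapsed_s / 1e6:.0f} MB/s aggregate")  # ~150 MB/s
print(f"shuffled:     {shuffled / elapsed_s / 1e6:.0f} MB/s aggregate")      # ~66 MB/s
# Split across the two core nodes, this is roughly 75 MB/s of HDFS reads plus
# 75 MB/s of HDFS writes per node, on top of local spill and shuffle traffic.
```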
(6) Use the following command to validate the sorted results.
```
[hadoop@ip-172-31-21-222 ~]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teravalidate /teraOutput /teraValidate
17/09/08 05:42:52 INFO impl.TimelineClientImpl: Timeline service address: http://ip-172-31-21-222.ap-southeast-2.compute.internal:8188/ws/v1/timeline/
17/09/08 05:42:52 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-21-222.ap-southeast-2.compute.internal/172.31.21.222:8032
17/09/08 05:42:54 INFO input.FileInputFormat: Total input paths to process : 31
Spent 585ms computing base-splits.
Spent 4ms computing TeraScheduler splits.
17/09/08 05:42:54 INFO mapreduce.JobSubmitter: number of splits:31
17/09/08 05:42:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1504831629578_0003
17/09/08 05:42:54 INFO impl.YarnClientImpl: Submitted application application_1504831629578_0003
17/09/08 05:42:54 INFO mapreduce.Job: The url to track the job: http://ip-172-31-21-222.ap-southeast-2.compute.internal:20888/proxy/application_1504831629578_0003/
17/09/08 05:42:54 INFO mapreduce.Job: Running job: job_1504831629578_0003
17/09/08 05:43:02 INFO mapreduce.Job: Job job_1504831629578_0003 running in uber mode : false
17/09/08 05:43:02 INFO mapreduce.Job: map 0% reduce 0%
17/09/08 05:43:23 INFO mapreduce.Job: map 1% reduce 0%
17/09/08 05:43:33 INFO mapreduce.Job: map 2% reduce 0%
17/09/08 05:43:43 INFO mapreduce.Job: map 3% reduce 0%
17/09/08 05:43:53 INFO mapreduce.Job: map 4% reduce 0%
......
17/09/08 05:54:49 INFO mapreduce.Job: map 98% reduce 29%
17/09/08 05:54:50 INFO mapreduce.Job: map 98% reduce 31%
17/09/08 05:54:51 INFO mapreduce.Job: map 99% reduce 31%
17/09/08 05:54:53 INFO mapreduce.Job: map 99% reduce 32%
17/09/08 05:54:56 INFO mapreduce.Job: map 100% reduce 100%
17/09/08 05:54:57 INFO mapreduce.Job: Job job_1504831629578_0003 completed successfully
17/09/08 05:54:57 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=1613
FILE: Number of bytes written=4095530
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2000000004619
HDFS: Number of bytes written=27
HDFS: Number of read operations=96
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=31
Launched reduce tasks=1
Data-local map tasks=31
Total time spent by all maps in occupied slots (ms)=2387093196
Total time spent by all reduces in occupied slots (ms)=16527720
Total time spent by all map tasks (ms)=20939414
Total time spent by all reduce tasks (ms)=72490
Total vcore-milliseconds taken by all map tasks=20939414
Total vcore-milliseconds taken by all reduce tasks=72490
Total megabyte-milliseconds taken by all map tasks=76386982272
Total megabyte-milliseconds taken by all reduce tasks=528887040
Map-Reduce Framework
Map input records=20000000000
Map output records=93
Map output bytes=2604
Map output materialized bytes=3039
Input split bytes=4619
Combine input records=0
Combine output records=0
Reduce input groups=63
Reduce shuffle bytes=3039
Reduce input records=93
Reduce output records=1
Spilled Records=186
Shuffled Maps =31
Failed Shuffles=0
Merged Map outputs=31
GC time elapsed (ms)=76486
CPU time spent (ms)=16490120
Physical memory (bytes) snapshot=49538342912
Virtual memory (bytes) snapshot=168361086976
Total committed heap usage (bytes)=62638784512
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=2000000000000
File Output Format Counters
Bytes Written=27
[hadoop@ip-172-31-21-222 ~]$ hadoop fs -ls hdfs:///teraValidate
Found 2 items
-rw-r--r-- 1 hadoop hadoop 0 2017-09-08 05:54 hdfs:///teraValidate/_SUCCESS
-rw-r--r-- 1 hadoop hadoop 27 2017-09-08 05:54 hdfs:///teraValidate/part-r-00000
[hadoop@ip-172-31-21-222 ~]$ hadoop fs -cat hdfs:///teraValidate/part-r-00000
checksum 2540d7a3e4d427a9b
```
The validation output contains only a single checksum record and no out-of-order error records, which indicates that the 2 TB of data in /teraOutput is globally sorted.