A Hadoop job is typically data-intensive, and compressing its data (with SSE4.2 acceleration where the codec can use it) can speed up I/O operations, save storage space, and speed up data transfers across the network. The reduced I/O and network load can bring significant performance improvements. The tradeoff is that CPU utilization and processing time may increase, because data must be compressed and decompressed throughout Hadoop's MapReduce pipeline.
SSE4 (Streaming SIMD Extensions 4) is a CPU instruction set introduced with the Intel Core microarchitecture and AMD K10 (K8L). Intel SSE4 consists of 54 instructions: a subset of 47 instructions is referred to as SSE4.1, and SSE4.2 consists of the remaining 7 instructions, which improve the performance of text processing and some application-specific operations such as CRC32 computation.
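One concrete example of an SSE4.2 instruction that matters for data pipelines is crc32, which computes the CRC32C (Castagnoli) checksum in hardware. On x86 CPUs with SSE4.2, HotSpot typically intrinsifies the JDK's java.util.zip.CRC32C class with this instruction. A minimal sketch, plain JDK with no Hadoop dependency (the class name Crc32cDemo is ours):

```java
import java.util.zip.CRC32C;

// Demo class (not part of Hadoop): computes a CRC32C checksum.
// On x86 CPUs with SSE4.2, the JVM typically replaces this with the
// hardware crc32 instruction, one concrete way SSE4.2 speeds up data paths.
public class Crc32cDemo {
    public static long checksumOf(byte[] data) {
        CRC32C crc = new CRC32C(); // Castagnoli polynomial, same as SSE4.2 crc32
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(checksumOf("hello".getBytes())));
    }
}
```

Whether the hardware path is actually taken depends on the CPU and JVM; the checksum value itself is the same either way.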
Compression Codec
Compression in Hadoop can significantly improve end-to-end processing time whenever I/O or the network is the bottleneck in a MapReduce job. A compression format is commonly referred to as a codec (short for coder-decoder): a set of compiled, ready-to-use libraries that a developer can invoke programmatically to perform data compression and decompression in a MapReduce job. The built-in codecs in Apache Hadoop include Gzip, Bzip2, Snappy, LZO, etc.
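To make the coder-decoder idea concrete, here is a minimal gzip round trip using the JDK's own GZIPOutputStream/GZIPInputStream rather than Hadoop's codec API (the class name GzipRoundTrip is ours; Hadoop's GzipCodec wraps the same DEFLATE algorithm behind its CompressionCodec interface):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Demo class (not Hadoop's codec API): gzip-compress and decompress a byte array.
public class GzipRoundTrip {
    public static byte[] compress(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data); // the "coder" half of the codec
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static byte[] decompress(byte[] gzipped) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            return gz.readAllBytes(); // the "decoder" half
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] input = "hadoop hadoop hadoop hadoop hadoop".getBytes();
        byte[] packed = compress(input);
        System.out.println(input.length + " bytes -> " + packed.length + " bytes");
    }
}
```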
Configuration
SSE4.2 in Hadoop
| Item | Value | Description |
| ---- | ----- | ----------- |
| XX   | XX    | XX          |
| XX   | XX    | XX          |

TBD....
Data Compression in Hadoop
Compression Format Options in Hadoop
| Compression Format | Algorithm | Extension | Splittable | Java/Native Implementation | Codec Class |
| --- | --- | --- | --- | --- | --- |
| gzip | DEFLATE | .gz | No | Yes/Yes | org.apache.hadoop.io.compress.GzipCodec |
| IGzip | IGzip | .lgz | ??? | ???? | org.apache.hadoop.io.compress.IGzipCodec |
| bzip2 | bzip2 | .bz2 | Yes | Yes/No | org.apache.hadoop.io.compress.BZip2Codec |
| LZO | LZO | .lzo | No | No/Yes | com.hadoop.compression.lzo.LzopCodec |
| LZ4 | LZ4 | .lz4 | No | No/Yes | org.apache.hadoop.io.compress.Lz4Codec |
| Snappy | Snappy | .snappy | No | No/Yes | org.apache.hadoop.io.compress.SnappyCodec |
Note: Typically, a single large compressed file is stored in HDFS with many data blocks that are distributed across many data nodes. When the file has been compressed by using one of the splittable algorithms, the data blocks can be decompressed in parallel by multiple MapReduce tasks. However, if a file has been compressed by a non-splittable algorithm, Hadoop must pull all the blocks together and use a single MapReduce task to decompress them.
Compression Phase in Hadoop MR Job Cycle
| Phase in MR | Config Item | Value | Description |
| --- | --- | --- | --- |
| Input Data to Map | (file extension recognized automatically for decompression) | .gz, .lgz, etc. | Format is inferred from the file extension. Note: for SequenceFile, the header carries the compression information (compression, block compression, and compression codec) |
| | | one of the supported compression codecs | Compression format |
| Intermediate (Map) Output | mapreduce.map.output.compress | false (default) / true | Should the outputs of the maps be compressed before being sent across the network? |
| | mapreduce.map.output.compress.codec | one of the defined codecs | Compression format |
| Final (Reduce) Output | mapreduce.output.fileoutputformat.compress | false (default) / true | Should the job outputs be compressed? |
| | mapreduce.output.fileoutputformat.compress.type | RECORD | If the job outputs are to be compressed as SequenceFiles, how should they be compressed? One of NONE, RECORD, or BLOCK |
| | mapreduce.output.fileoutputformat.compress.codec | one of the defined codecs | Compression format |
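As an illustrative example of setting these items (the property names and codec classes are the standard Hadoop ones; whether to set them cluster-wide in mapred-site.xml or per job is a deployment choice, and the codec choices here are arbitrary), map output and final output compression might be enabled like this:

```xml
<!-- Illustrative snippet for mapred-site.xml or a per-job Configuration -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```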
Test Category
| Test Category | Test Scenario | Description |
| --- | --- | --- |
| Functional Test | Compression Codec Test | Test each supported compression codec with SSE4.2 enabled/disabled, to validate that compression works in both cases |
| Functional Test | Data Compression Test in different MR phases | Test supported data compression in the three MapReduce phases (input data compression, map output data compression, final reduce output data compression), to validate that data compression works in each individual phase |
| Performance Test | Data Compression Performance | Test data compression at different data sizes with SSE4.2 enabled/disabled, and collect performance data to compare the gap/improvement |
Functional Test
Test Scenarios
| Scenarios | Description | Comments |
| --- | --- | --- |
| 1# | Enable data compression in the Map input phase via MR automatic detection | SSE4.2 Enabled |
| 2# | Enable data compression in the Map output (Shuffle) phase via MR job configuration | SSE4.2 Enabled |
| 3# | Enable data compression in the Final Reduce output phase via MR job configuration | SSE4.2 Enabled |
| 4# | Enable data compression in all phases combined via MR job configuration | SSE4.2 Enabled |
| 5# | Enable data compression in the Map input phase via MR automatic detection | SSE4.2 Disabled |
| 6# | Enable data compression in the Map output (Shuffle) phase via MR job configuration | SSE4.2 Disabled |
| 7# | Enable data compression in the Final Reduce output phase via MR job configuration | SSE4.2 Disabled |
| 8# | Enable data compression in all phases combined via MR job configuration | SSE4.2 Disabled |
Test Criteria

| Items | Description |
| --- | --- |
| Compression Ratio | (uncompressed data size - compressed data size) / uncompressed data size, expressed as a percentage |
| Correctness | Data compression correctness (decompressed output matches the original data) |
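The ratio above is straightforward to compute; a small sketch (the class name CompressionRatio is ours), e.g. 100,000 bytes compressed down to 25,000 bytes saves 75%:

```java
// Demo helper for the compression ratio defined above.
public class CompressionRatio {
    // Space saved as a percentage:
    // (uncompressed - compressed) / uncompressed * 100
    public static double savedPercent(long uncompressedBytes, long compressedBytes) {
        return 100.0 * (uncompressedBytes - compressedBytes) / uncompressedBytes;
    }

    public static void main(String[] args) {
        System.out.println(savedPercent(100_000, 25_000)); // prints 75.0
    }
}
```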
Test Cases Matrix
MapReduce Phase with SSE4.2 Enabled (result columns to be filled in):

| Compression Type | Benchmark | Map Input | Map Output | Final Reduce Output |
| --- | --- | --- | --- | --- |
| Gzip | wordcount | | | |
| IGzip | wordcount | | | |
| Bzip2 | wordcount | | | |
| LZO | wordcount | | | |
| LZ4 | wordcount | | | |
| Snappy | wordcount | | | |
MapReduce Phase with SSE4.2 Disabled (result columns to be filled in):

| Compression Type | Benchmark | Map Input | Map Output | Final Reduce Output |
| --- | --- | --- | --- | --- |
| Gzip | wordcount | | | |
| IGzip | wordcount | | | |
| Bzip2 | wordcount | | | |
| LZO | wordcount | | | |
| LZ4 | wordcount | | | |
| Snappy | wordcount | | | |
Performance Test
Performance Metrics
| Items | Description |
| --- | --- |
| Map Time | Total MapReduce Map phase execution time (min) |
| Reduce Time | Total MapReduce Reduce phase execution time (min) |
| Job Time | Total MapReduce job execution time (min) |
Test Cases Matrix
MapReduce Phase with SSE4.2 Enabled (result columns to be filled in):

| Compression Type | Benchmark | Data Size | Map Input | Map Output | Final Reduce Output |
| --- | --- | --- | --- | --- | --- |
| Gzip | DFSIO-Read | 100GB | | | |
| Gzip | DFSIO-Write | 100GB | | | |
| Gzip | Terasort | 100GB | | | |
| IGzip | DFSIO-Read | 100GB | | | |
| IGzip | DFSIO-Write | 100GB | | | |
| IGzip | Terasort | 100GB | | | |
| Bzip2 | DFSIO-Read | 100GB | | | |
| Bzip2 | DFSIO-Write | 100GB | | | |
| Bzip2 | Terasort | 100GB | | | |
| LZO | DFSIO-Read | 100GB | | | |
| LZO | DFSIO-Write | 100GB | | | |
| LZO | Terasort | 100GB | | | |
| LZ4 | DFSIO-Read | 100GB | | | |
| LZ4 | DFSIO-Write | 100GB | | | |
| LZ4 | Terasort | 100GB | | | |
| Snappy | DFSIO-Read | 100GB | | | |
| Snappy | DFSIO-Write | 100GB | | | |
| Snappy | Terasort | 100GB | | | |
MapReduce Phase with SSE4.2 Disabled (result columns to be filled in):

| Compression Type | Benchmark | Data Size | Map Input | Map Output | Final Reduce Output |
| --- | --- | --- | --- | --- | --- |
| Gzip | DFSIO-Read | 100GB | | | |
| Gzip | DFSIO-Write | 100GB | | | |
| Gzip | Terasort | 100GB | | | |
| IGzip | DFSIO-Read | 100GB | | | |
| IGzip | DFSIO-Write | 100GB | | | |
| IGzip | Terasort | 100GB | | | |
| Bzip2 | DFSIO-Read | 100GB | | | |
| Bzip2 | DFSIO-Write | 100GB | | | |
| Bzip2 | Terasort | 100GB | | | |
| LZO | DFSIO-Read | 100GB | | | |
| LZO | DFSIO-Write | 100GB | | | |
| LZO | Terasort | 100GB | | | |
| LZ4 | DFSIO-Read | 100GB | | | |
| LZ4 | DFSIO-Write | 100GB | | | |
| LZ4 | Terasort | 100GB | | | |
| Snappy | DFSIO-Read | 100GB | | | |
| Snappy | DFSIO-Write | 100GB | | | |
| Snappy | Terasort | 100GB | | | |
Reliability Test
Issue Tracking
SSE4.2 Tips
How to identify whether the CPU supports the SSE4.2 instruction set?
Check the output of "cat /proc/cpuinfo | grep sse4_2 | wc -l"; if the value is larger than 0 (one matching line per CPU core), the CPU supports the SSE4.2 instruction set.