QA SSE 4.2 Accelerated Compression Codec - cto-bdt-qa/bdt-qa GitHub Wiki

Introduction

IDH/CDH Joint Enhancement Feature: SSE 4.2 accelerated compression codec support (Intel license issue, IGzip compression format).

Hadoop jobs are data-intensive. Compressing data with an SSE4.2-accelerated codec can speed up I/O operations, save storage space, and speed up data transfers across the network, and the reduced I/O and network load can bring significant performance improvements. The tradeoff is that CPU utilization and processing time may increase because of data compression and decompression in Hadoop's MapReduce pipeline.

Schedule & Status

| QA Start Date | QA Test Plan | QA Test Scenarios/Cases | QA End Date |
| --- | --- | --- | --- |
|  |  |  |  |




Ownership

| Dev Owner | QA Owner |
| --- | --- |
| Todd Lipcon <[email protected]> | Zhou, Yi A <[email protected]> |

Related Terminology

| Items | Description |
| --- | --- |
| SSE 4.2 | SSE4 (Streaming SIMD Extensions 4) is a CPU instruction set used in the Intel Core microarchitecture and AMD K10 (K8L). Intel SSE4 consists of 54 instructions. A subset of 47 instructions is referred to as SSE4.1. SSE4.2 consists of the remaining 7 instructions, which improve the performance of text processing and some application-specific operations. |
| Compression Codec | Compression in Hadoop can significantly improve end-to-end processing time whenever I/O or network issues are the bottleneck in a MapReduce job. A compression format is commonly referred to as a codec, short for coder-decoder: a set of compiled, ready-to-use Java libraries that a developer can invoke programmatically to compress and decompress data in a MapReduce job. Codecs commonly used with Apache Hadoop include Gzip, Bzip2, Snappy, and LZO. |

Configuration

SSE4.2 in Hadoop

| Item | Value | Description |
| --- | --- | --- |
| XX | XX | XX |
| XX | XX | XX |

TBD....

Data Compression in Hadoop

Compression Format Options in Hadoop

| Compression Format | Algorithm | Extension | Splittable | Java/Native Implementation | Codec Class |
| --- | --- | --- | --- | --- | --- |
| gzip | DEFLATE | .gz | No | Yes/Yes | org.apache.hadoop.io.compress.GzipCodec |
| IGzip | IGzip | .lgz | ??? | ???? | org.apache.hadoop.io.compress.IGzipCodec |
| bzip2 | bzip2 | .bz2 | Yes | Yes/No | org.apache.hadoop.io.compress.BZip2Codec |
| LZO | LZO | .lzo | No | No/Yes | com.hadoop.compression.lzo.LzopCodec |
| LZ4 | LZ4 | .lz4 | No | No/Yes | org.apache.hadoop.io.compress.Lz4Codec |
| Snappy | Snappy | .snappy | No | No/Yes | org.apache.hadoop.io.compress.SnappyCodec |

Note: Typically, a single large compressed file is stored in HDFS with many data blocks that are distributed across many data nodes. When the file has been compressed by using one of the splittable algorithms, the data blocks can be decompressed in parallel by multiple MapReduce tasks. However, if a file has been compressed by a non-splittable algorithm, Hadoop must pull all the blocks together and use a single MapReduce task to decompress them.
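
The effect of splittability on parallelism can be sketched with a rough estimate. The following is a minimal illustration, not Hadoop's actual split calculation: it assumes the split size equals the HDFS block size, and the function name `estimate_map_tasks` and the 128 MB block size are hypothetical choices made for the example.

```python
def estimate_map_tasks(file_size_bytes: int, block_size_bytes: int, splittable: bool) -> int:
    """Roughly estimate how many map tasks can read one compressed file.

    Assumption: one split per HDFS block for splittable formats,
    and exactly one split for non-splittable formats.
    """
    if block_size_bytes <= 0:
        raise ValueError("block_size_bytes must be positive")
    if file_size_bytes <= 0:
        return 0
    if not splittable:
        # A non-splittable file must be decompressed by a single task.
        return 1
    # Integer ceiling division: one task per (possibly partial) block.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


GB = 1024 ** 3
MB = 1024 ** 2
print(estimate_map_tasks(1 * GB, 128 * MB, splittable=True))   # 8 (splittable, e.g. bzip2)
print(estimate_map_tasks(1 * GB, 128 * MB, splittable=False))  # 1 (non-splittable, e.g. gzip)
```

Under these assumptions, a 1 GB bzip2 file can be read by eight tasks in parallel, while a 1 GB gzip file is read by one.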

Compression Phase in Hadoop MR Job Cycle

| Phase in MR | Config Item | Value | Description |
| --- | --- | --- | --- |
| Input Data to Map | File extension recognized automatically for decompression | .gz, .lgz, etc. | File extension of a supported format. Note: for a SequenceFile, the header holds this information (compression, block compression, and compression codec). |
|  |  | One of the supported compression codecs | Compression format |
| Intermediate (Map) Output | mapreduce.map.output.compress | false (default) / true | Should the outputs of the maps be compressed before being sent across the network? |
|  | mapreduce.map.output.compress.codec | One of the codec classes defined above | Compression format |
| Final (Reduce) Output | mapreduce.output.fileoutputformat.compress | false (default) / true | Should the job outputs be compressed? |
|  | mapreduce.output.fileoutputformat.compress.type | RECORD | If the job outputs are to be compressed as SequenceFiles, how should they be compressed? Should be one of NONE, RECORD or BLOCK. |
|  | mapreduce.output.fileoutputformat.compress.codec | One of the codec classes defined above | Compression format |
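
For test automation, the properties above can be assembled into command-line options. The sketch below is one possible helper, not part of Hadoop: the function name `compression_job_options` is hypothetical, and it assumes the job under test accepts generic options in the `-D property=value` form. The property names and codec class names are the ones listed in this page.

```python
def compression_job_options(map_output_codec=None, final_output_codec=None,
                            output_type="RECORD"):
    """Build a list of -D options that enable compression for an MR job.

    map_output_codec / final_output_codec are codec class names;
    passing None leaves that phase uncompressed (the default).
    """
    if output_type not in ("NONE", "RECORD", "BLOCK"):
        raise ValueError("output_type must be NONE, RECORD or BLOCK")

    properties = {}
    if map_output_codec:
        properties["mapreduce.map.output.compress"] = "true"
        properties["mapreduce.map.output.compress.codec"] = map_output_codec
    if final_output_codec:
        properties["mapreduce.output.fileoutputformat.compress"] = "true"
        properties["mapreduce.output.fileoutputformat.compress.type"] = output_type
        properties["mapreduce.output.fileoutputformat.compress.codec"] = final_output_codec

    options = []
    for name, value in properties.items():
        options += ["-D", f"{name}={value}"]
    return options


# Scenario: compress only the map (shuffle) output with Snappy.
print(" ".join(compression_job_options(
    map_output_codec="org.apache.hadoop.io.compress.SnappyCodec")))
```

Passing both codecs at once covers the combined-phase scenarios.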

Test Category

| Test Category | Test Scenario | Description |
| --- | --- | --- |
| Functional Test | Compression Codec Test | Test the supported compression codec functions with SSE4.2 enabled and disabled, to validate that compression works in both cases. |
|  | Data Compression Test in different MR phases | Test the supported data compression in the three MapReduce phases: input data compression, Map output data compression, and final Reduce output data compression. The goal is to validate that data compression works during each individual MR phase. |
| Performance Test | Data Compression performance | Test data compression at different data sizes with SSE4.2 enabled and disabled, and collect performance data to compare the gap/improvement. |

Functional Test

Test Scenarios

| Scenarios | Description | Comments |
| --- | --- | --- |
| 1# | Enable data compression in the Map input phase via automatic detection by MR | SSE4.2 Enabled |
| 2# | Enable data compression in the Map output (Shuffle) phase via MR job configuration | SSE4.2 Enabled |
| 3# | Enable data compression in the final Reduce output phase via MR job configuration | SSE4.2 Enabled |
| 4# | Enable data compression in combined phases via MR job configuration | SSE4.2 Enabled |
| 5# | Enable data compression in the Map input phase via automatic detection by MR | SSE4.2 Disabled |
| 6# | Enable data compression in the Map output (Shuffle) phase via MR job configuration | SSE4.2 Disabled |
| 7# | Enable data compression in the final Reduce output phase via MR job configuration | SSE4.2 Disabled |
| 8# | Enable data compression in combined phases via MR job configuration | SSE4.2 Disabled |

Test Criteria

| Items | Description |
| --- | --- |
| Compression Ratio | (uncompressed data size - compressed data size) / uncompressed data size, expressed as a percentage |
| Correctness | Correctness of the compressed data (it decompresses back to the original) |
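
Both criteria can be checked with a few lines of code. The sketch below uses Python's standard `gzip` module as a stand-in codec; it does not exercise Hadoop's codec classes or IGzip, and the function names are hypothetical.

```python
import gzip


def compression_ratio_percent(uncompressed_size: int, compressed_size: int) -> float:
    """(uncompressed - compressed) / uncompressed, as a percentage."""
    if uncompressed_size <= 0:
        raise ValueError("uncompressed_size must be positive")
    return (uncompressed_size - compressed_size) * 100.0 / uncompressed_size


def round_trip_is_correct(data: bytes) -> bool:
    """Correctness check: decompressing the compressed bytes restores the input."""
    return gzip.decompress(gzip.compress(data)) == data


# Illustrative input only; real tests would use the benchmark data sets.
sample = b"hadoop mapreduce wordcount " * 1000
compressed = gzip.compress(sample)
print(f"ratio: {compression_ratio_percent(len(sample), len(compressed)):.1f}%")
print("correct:", round_trip_is_correct(sample))
```

With this definition a higher percentage means better compression, and a negative value means the compressed output is larger than the input.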

Test Cases Matrix

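MapReduce Phase with SSE4.2 Enabled

| Compression Type | Benchmark | Map Input | Map Output | Final Reduce Output |
| --- | --- | --- | --- | --- |
| Gzip | wordcount |  |  |  |
| IGzip | wordcount |  |  |  |
| Bzip2 | wordcount |  |  |  |
| LZO | wordcount |  |  |  |
| LZ4 | wordcount |  |  |  |
| Snappy | wordcount |  |  |  |
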
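MapReduce Phase with SSE4.2 Disabled

| Compression Type | Benchmark | Map Input | Map Output | Final Reduce Output |
| --- | --- | --- | --- | --- |
| Gzip | wordcount |  |  |  |
| IGzip | wordcount |  |  |  |
| Bzip2 | wordcount |  |  |  |
| LZO | wordcount |  |  |  |
| LZ4 | wordcount |  |  |  |
| Snappy | wordcount |  |  |  |
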
Performance Test

Performance Metrics

| Items | Description |
| --- | --- |
| Map Time | Total execution time of the MapReduce Map phase (min) |
| Reduce Time | Total execution time of the MapReduce Reduce phase (min) |
| Job Time | Total execution time of the MapReduce job (min) |
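
To compare runs with SSE4.2 enabled and disabled, the three metrics can be reduced to a relative improvement per metric. This is a sketch under stated assumptions: the page does not define how the gap is computed, so the formula (time saved relative to the SSE4.2-disabled run), the dictionary keys, and the function name are all hypothetical. The numbers in the usage line are placeholders, not measurements.

```python
METRICS = ("map_time", "reduce_time", "job_time")  # hypothetical keys, in minutes


def improvement_percent(disabled_minutes: float, enabled_minutes: float) -> float:
    """Time saved with SSE4.2 enabled, relative to the disabled run (assumed formula)."""
    if disabled_minutes <= 0:
        raise ValueError("disabled_minutes must be positive")
    return (disabled_minutes - enabled_minutes) * 100.0 / disabled_minutes


def compare_runs(disabled: dict, enabled: dict) -> dict:
    """Return the improvement per metric for one codec/benchmark pair."""
    return {m: improvement_percent(disabled[m], enabled[m]) for m in METRICS}


# Placeholder values for illustration only.
print(compare_runs(
    {"map_time": 10.0, "reduce_time": 5.0, "job_time": 16.0},
    {"map_time": 8.0, "reduce_time": 5.0, "job_time": 14.0},
))
```

A positive value means the SSE4.2-enabled run was faster; a negative value means it was slower.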

Test Cases Matrix

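MapReduce Phase with SSE4.2 Enabled

| Compression Type | Benchmark | Data Size | Map Input | Map Output | Final Reduce Output |
| --- | --- | --- | --- | --- | --- |
| Gzip | DFSIO-Read | 100GB |  |  |  |
|  | DFSIO-Write | 100GB |  |  |  |
|  | Terasort | 100GB |  |  |  |
| IGzip | DFSIO-Read | 100GB |  |  |  |
|  | DFSIO-Write | 100GB |  |  |  |
|  | Terasort | 100GB |  |  |  |
| Bzip2 | DFSIO-Read | 100GB |  |  |  |
|  | DFSIO-Write | 100GB |  |  |  |
|  | Terasort | 100GB |  |  |  |
| LZO | DFSIO-Read | 100GB |  |  |  |
|  | DFSIO-Write | 100GB |  |  |  |
|  | Terasort | 100GB |  |  |  |
| LZ4 | DFSIO-Read | 100GB |  |  |  |
|  | DFSIO-Write | 100GB |  |  |  |
|  | Terasort | 100GB |  |  |  |
| Snappy | DFSIO-Read | 100GB |  |  |  |
|  | DFSIO-Write | 100GB |  |  |  |
|  | Terasort | 100GB |  |  |  |
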
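MapReduce Phase with SSE4.2 Disabled

| Compression Type | Benchmark | Data Size | Map Input | Map Output | Final Reduce Output |
| --- | --- | --- | --- | --- | --- |
| Gzip | DFSIO-Read | 100GB |  |  |  |
|  | DFSIO-Write | 100GB |  |  |  |
|  | Terasort | 100GB |  |  |  |
| IGzip | DFSIO-Read | 100GB |  |  |  |
|  | DFSIO-Write | 100GB |  |  |  |
|  | Terasort | 100GB |  |  |  |
| Bzip2 | DFSIO-Read | 100GB |  |  |  |
|  | DFSIO-Write | 100GB |  |  |  |
|  | Terasort | 100GB |  |  |  |
| LZO | DFSIO-Read | 100GB |  |  |  |
|  | DFSIO-Write | 100GB |  |  |  |
|  | Terasort | 100GB |  |  |  |
| LZ4 | DFSIO-Read | 100GB |  |  |  |
|  | DFSIO-Write | 100GB |  |  |  |
|  | Terasort | 100GB |  |  |  |
| Snappy | DFSIO-Read | 100GB |  |  |  |
|  | DFSIO-Write | 100GB |  |  |  |
|  | Terasort | 100GB |  |  |  |
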
Reliability Test

Issue Tracking

SSE4.2 Tips

How to identify whether the CPU supports the SSE4.2 instruction set?

Check the output of "cat /proc/cpuinfo | grep sse4_2 | wc -l". If the value is greater than 0, the CPU supports the SSE4.2 instruction set.
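
The same check can be scripted as a test precondition. The sketch below parses the `flags` lines of `/proc/cpuinfo` on Linux and looks for the exact `sse4_2` token; the function name `cpu_supports_sse42` is hypothetical.

```python
def cpu_supports_sse42(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in the cpuinfo text lists sse4_2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Match the whole flag name so that other text containing
        # the substring does not produce a false positive.
        if key.strip() == "flags" and "sse4_2" in value.split():
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as cpuinfo:
            print("SSE4.2 supported:", cpu_supports_sse42(cpuinfo.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

Unlike the line count from the shell command, this returns a single boolean regardless of how many logical CPUs are listed.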
