A Hadoop job is typically data-intensive, and compressing its data (with SSE4.2 acceleration where the codec can use it) can speed up I/O operations, save storage space, and speed up data transfers across the network. The reduced I/O and network load can bring significant performance improvements. The tradeoff is that CPU utilization and processing time may increase, because data must be compressed and decompressed throughout Hadoop's MapReduce pipeline.
SSE4 (Streaming SIMD Extensions 4) is a CPU instruction set introduced with the Intel Core microarchitecture and AMD K10 (K8L). Intel SSE4 consists of 54 instructions: a subset of 47 instructions is referred to as SSE4.1, and SSE4.2 consists of the remaining 7 instructions, which improve the performance of text processing and some application-specific operations such as CRC32 computation.
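One concrete example of an SSE4.2 instruction that matters for data pipelines is crc32, which computes the CRC32C (Castagnoli) checksum in hardware. On x86 CPUs with SSE4.2, HotSpot typically intrinsifies the JDK's java.util.zip.CRC32C class with this instruction. A minimal sketch, plain JDK with no Hadoop dependency (the class name Crc32cDemo is ours):

```java
import java.util.zip.CRC32C;

// Demo class (not part of Hadoop): computes a CRC32C checksum.
// On x86 CPUs with SSE4.2, the JVM typically replaces this with the
// hardware crc32 instruction, one concrete way SSE4.2 speeds up data paths.
public class Crc32cDemo {
    public static long checksumOf(byte[] data) {
        CRC32C crc = new CRC32C(); // Castagnoli polynomial, same as SSE4.2 crc32
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(checksumOf("hello".getBytes())));
    }
}
```

Whether the hardware path is actually taken depends on the CPU and JVM; the checksum value itself is the same either way.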
Compression Codec
Compression in Hadoop can significantly improve end-to-end processing time whenever I/O or the network is the bottleneck in a MapReduce job. A compression format is commonly referred to as a codec (short for coder-decoder): a set of compiled, ready-to-use libraries that a developer can invoke programmatically to perform data compression and decompression in a MapReduce job. The built-in codecs in Apache Hadoop include Gzip, Bzip2, Snappy, LZO, etc.
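To make the coder-decoder idea concrete, here is a minimal gzip round trip using the JDK's own GZIPOutputStream/GZIPInputStream rather than Hadoop's codec API (the class name GzipRoundTrip is ours; Hadoop's GzipCodec wraps the same DEFLATE algorithm behind its CompressionCodec interface):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Demo class (not Hadoop's codec API): gzip-compress and decompress a byte array.
public class GzipRoundTrip {
    public static byte[] compress(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data); // the "coder" half of the codec
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static byte[] decompress(byte[] gzipped) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            return gz.readAllBytes(); // the "decoder" half
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] input = "hadoop hadoop hadoop hadoop hadoop".getBytes();
        byte[] packed = compress(input);
        System.out.println(input.length + " bytes -> " + packed.length + " bytes");
    }
}
```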
Configuration
SSE4.2 in Hadoop
| Item | Value | Description |
| ---- | ----- | ----------- |
| XX   | XX    | XX          |
| XX   | XX    | XX          |

TBD....
Data Compression in Hadoop
Compression Format Options in Hadoop
| Compression Format | Algorithm | Extension | Splittable | Java/Native Implementation | Codec Class |
| --- | --- | --- | --- | --- | --- |
| gzip | DEFLATE | .gz | No | Yes/Yes | org.apache.hadoop.io.compress.GzipCodec |
| IGzip | IGzip | .lgz | ??? | ???? | org.apache.hadoop.io.compress.IGzipCodec |
| bzip2 | bzip2 | .bz2 | Yes | Yes/No | org.apache.hadoop.io.compress.BZip2Codec |
| LZO | LZO | .lzo | No | No/Yes | com.hadoop.compression.lzo.LzopCodec |
| LZ4 | LZ4 | .lz4 | No | No/Yes | org.apache.hadoop.io.compress.Lz4Codec |
| Snappy | Snappy | .snappy | No | No/Yes | org.apache.hadoop.io.compress.SnappyCodec |
Note: Typically, a single large compressed file is stored in HDFS with many data blocks that are distributed across many data nodes. When the file has been compressed by using one of the splittable algorithms, the data blocks can be decompressed in parallel by multiple MapReduce tasks. However, if a file has been compressed by a non-splittable algorithm, Hadoop must pull all the blocks together and use a single MapReduce task to decompress them.
Compression Phase in Hadoop MR Job Cycle
| Phase in MR | Config Item | Value | Description |
| --- | --- | --- | --- |
| Input Data to Map | (file extension recognized automatically for decompression) | .gz, .lgz, etc. | Format is inferred from the file extension. Note: for SequenceFile, the header carries the compression information (compression, block compression, and compression codec) |
| | | one of the supported compression codecs | Compression format |
| Intermediate (Map) Output | mapreduce.map.output.compress | false (default) / true | Should the outputs of the maps be compressed before being sent across the network? |
| | mapreduce.map.output.compress.codec | one of the defined codecs | Compression format |
| Final (Reduce) Output | mapreduce.output.fileoutputformat.compress | false (default) / true | Should the job outputs be compressed? |
| | mapreduce.output.fileoutputformat.compress.type | RECORD | If the job outputs are to be compressed as SequenceFiles, how should they be compressed? One of NONE, RECORD, or BLOCK |
| | mapreduce.output.fileoutputformat.compress.codec | one of the defined codecs | Compression format |
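As an illustrative example of setting these items (the property names and codec classes are the standard Hadoop ones; whether to set them cluster-wide in mapred-site.xml or per job is a deployment choice, and the codec choices here are arbitrary), map output and final output compression might be enabled like this:

```xml
<!-- Illustrative snippet for mapred-site.xml or a per-job Configuration -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```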
Test Category
| Test Category | Test Scenario | Description |
| --- | --- | --- |
| Functional Test | Compression Codec Test | Test each supported compression codec with SSE4.2 enabled/disabled, to validate that compression works in both cases |
| Functional Test | Data Compression Test in different MR phases | Test supported data compression in the three MapReduce phases (input data compression, map output data compression, final reduce output data compression), to validate that data compression works in each individual phase |
| Performance Test | Data Compression Performance | Test data compression at different data sizes with SSE4.2 enabled/disabled, and collect performance data to compare the gap/improvement |
Functional Test
Test Scenarios
| Scenarios | Description | Comments |
| --- | --- | --- |
| 1# | Enable data compression in the Map input phase via MR automatic detection | SSE4.2 Enabled |
| 2# | Enable data compression in the Map output (Shuffle) phase via MR job configuration | SSE4.2 Enabled |
| 3# | Enable data compression in the Final Reduce output phase via MR job configuration | SSE4.2 Enabled |
| 4# | Enable data compression in all phases combined via MR job configuration | SSE4.2 Enabled |
| 5# | Enable data compression in the Map input phase via MR automatic detection | SSE4.2 Disabled |
| 6# | Enable data compression in the Map output (Shuffle) phase via MR job configuration | SSE4.2 Disabled |
| 7# | Enable data compression in the Final Reduce output phase via MR job configuration | SSE4.2 Disabled |
| 8# | Enable data compression in all phases combined via MR job configuration | SSE4.2 Disabled |
Test Criteria

| Items | Description |
| --- | --- |
| Compression Ratio | (uncompressed data size - compressed data size) / uncompressed data size, expressed as a percentage |
| Correctness | Data compression correctness (decompressed output matches the original data) |
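The ratio above is straightforward to compute; a small sketch (the class name CompressionRatio is ours), e.g. 100,000 bytes compressed down to 25,000 bytes saves 75%:

```java
// Demo helper for the compression ratio defined above.
public class CompressionRatio {
    // Space saved as a percentage:
    // (uncompressed - compressed) / uncompressed * 100
    public static double savedPercent(long uncompressedBytes, long compressedBytes) {
        return 100.0 * (uncompressedBytes - compressedBytes) / uncompressedBytes;
    }

    public static void main(String[] args) {
        System.out.println(savedPercent(100_000, 25_000)); // prints 75.0
    }
}
```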
Test Cases Matrix
MapReduce Phase with SSE4.2 Enabled (result columns to be filled in):

| Compression Type | Benchmark | Map Input | Map Output | Final Reduce Output |
| --- | --- | --- | --- | --- |
| Gzip | wordcount | | | |
| IGzip | wordcount | | | |
| Bzip2 | wordcount | | | |
| LZO | wordcount | | | |
| LZ4 | wordcount | | | |
| Snappy | wordcount | | | |
MapReduce Phase with SSE4.2 Disabled (result columns to be filled in):

| Compression Type | Benchmark | Map Input | Map Output | Final Reduce Output |
| --- | --- | --- | --- | --- |
| Gzip | wordcount | | | |
| IGzip | wordcount | | | |
| Bzip2 | wordcount | | | |
| LZO | wordcount | | | |
| LZ4 | wordcount | | | |
| Snappy | wordcount | | | |
Performance Test
Performance Metrics
| Items | Description |
| --- | --- |
| Map Time | Total MapReduce Map phase execution time (min) |
| Reduce Time | Total MapReduce Reduce phase execution time (min) |
| Job Time | Total MapReduce job execution time (min) |
Test Cases Matrix
MapReduce Phase with SSE4.2 Enabled (result columns to be filled in):

| Compression Type | Benchmark | Data Size | Map Input | Map Output | Final Reduce Output |
| --- | --- | --- | --- | --- | --- |
| Gzip | DFSIO-Read | 100GB | | | |
| Gzip | DFSIO-Write | 100GB | | | |
| Gzip | Terasort | 100GB | | | |
| IGzip | DFSIO-Read | 100GB | | | |
| IGzip | DFSIO-Write | 100GB | | | |
| IGzip | Terasort | 100GB | | | |
| Bzip2 | DFSIO-Read | 100GB | | | |
| Bzip2 | DFSIO-Write | 100GB | | | |
| Bzip2 | Terasort | 100GB | | | |
| LZO | DFSIO-Read | 100GB | | | |
| LZO | DFSIO-Write | 100GB | | | |
| LZO | Terasort | 100GB | | | |
| LZ4 | DFSIO-Read | 100GB | | | |
| LZ4 | DFSIO-Write | 100GB | | | |
| LZ4 | Terasort | 100GB | | | |
| Snappy | DFSIO-Read | 100GB | | | |
| Snappy | DFSIO-Write | 100GB | | | |
| Snappy | Terasort | 100GB | | | |
MapReduce Phase with SSE4.2 Disabled (result columns to be filled in):

| Compression Type | Benchmark | Data Size | Map Input | Map Output | Final Reduce Output |
| --- | --- | --- | --- | --- | --- |
| Gzip | DFSIO-Read | 100GB | | | |
| Gzip | DFSIO-Write | 100GB | | | |
| Gzip | Terasort | 100GB | | | |
| IGzip | DFSIO-Read | 100GB | | | |
| IGzip | DFSIO-Write | 100GB | | | |
| IGzip | Terasort | 100GB | | | |
| Bzip2 | DFSIO-Read | 100GB | | | |
| Bzip2 | DFSIO-Write | 100GB | | | |
| Bzip2 | Terasort | 100GB | | | |
| LZO | DFSIO-Read | 100GB | | | |
| LZO | DFSIO-Write | 100GB | | | |
| LZO | Terasort | 100GB | | | |
| LZ4 | DFSIO-Read | 100GB | | | |
| LZ4 | DFSIO-Write | 100GB | | | |
| LZ4 | Terasort | 100GB | | | |
| Snappy | DFSIO-Read | 100GB | | | |
| Snappy | DFSIO-Write | 100GB | | | |
| Snappy | Terasort | 100GB | | | |
Reliability Test
Issue Tracking
SSE4.2 Tips
How to identify whether the CPU supports the SSE4.2 instruction set?
Check the output of "cat /proc/cpuinfo | grep sse4_2 | wc -l"; if the value is larger than 0 (one matching line per CPU core), the CPU supports the SSE4.2 instruction set.