Word Count Input - aerospike-community/aerospike-hadoop GitHub Wiki

In this example, we read test:words data from Aerospike, do a word count on the data and write the output to HDFS in /tmp/output.

test:words data is organized by Primary Key = path:linenum where in our example path is /home/hdclient/tmp/input. In our example, each line is contained in bin1. A line is read and tokenized into words before doing the word count.

Word Count Input Example

Source code:

~/aerospike-hadoop/examples/word_count_input/src/main/java/com/aerospike/hadoop/examples/wordcountinput/WordCountInput.java

Steps to run the example:

hdclient@ztg-client:~/aerospike-hadoop$ cd ~
hdclient@ztg-client:~$ cd aerospike-hadoop/
hdclient@ztg-client:~/aerospike-hadoop$ hadoop jar ./examples/word_count_input/build/libs/word_count_input.jar -D aerospike.input.namespace=test -D aerospike.input.setname=words -D aerospike.input.operation=scan /tmp/output
15/06/29 14:27:08 INFO wordcountinput.WordCountInput: run starting
. . .
. . .
15/06/29 14:27:14 INFO wordcountinput.WordCountInput: finished

hdclient@ztg-client:~/aerospike-hadoop$ hdfs dfs -ls /tmp/output

Found 2 items
-rw-r--r--   3 hdclient supergroup          0 2015-06-29 14:27 /tmp/output/_SUCCESS
-rw-r--r--   3 hdclient supergroup      52510 2015-06-29 14:27 /tmp/output/part-00000

hdclient@ztg-client:~/aerospike-hadoop$ hdfs dfs -cat /tmp/output/part-00000

(   4080
. . .
. . .
writes_master   2040
writes_reply    2040
zombie  2040
{bar}   340
{test}  340