Examples Overview - aerospike-community/aerospike-hadoop GitHub Wiki
Setup
Before we can get into building and running these examples from the source code, we need a standalone Apache Hadoop multi-node cluster, accessible from an edge node. In the Test Cluster Setup sections, we will achieve this topology and configuration.
Aerospike Hadooop Connector Usage Examples
In these hands-on examples, we first load the required test data into our Apache Hadoop cluster and our Aerospike server. We will then demonstrate the following examples.
Word count input from Aerospike
Read text data from records in Aerospike and count unique words in Hadoop, output results to HDFS.
Sample output in HDFS:
$hadoop fs -cat /tmp/output/part*|more
--A 1
--And 2
--But 1
Word count output to Aerospike
Read text data from file in HDFS, count unique words in Hadoop, output results to Aerospike.
Sample output: aql> select * from test.counts
+----------------+-------+
| word | count |
+----------------+-------+
| "bowsy" | 1 |
| "heel." | 1 |
| "anything." | 2 |
Aggregating Integers
Read integer data from records in Aerospike and aggregate by key (arbitrarily chosen to be modulo 3163 of the value) - find count, min, max and sum for each key. Write aggregation results to HDFS.
Sample output: hadoop fs -cat /tmp/output/part*|more
0 32 0 98053 1568848
6 32 6 98059 1569040
12 32 12 98065 1569232
Sessionization & Joining
This example comprises analyzing web server log files for identifying sessions, creating user profiles and joining example.
This is a three part example. It uses web server log data stored in HDFS. In step 1, the log data is sessionized and output to Aerospike. In step 2, userids in the web server logs are assigned a profile (age and gender, generated randomly). Finally in step 3, sessionization data is again extracted from web server logs (just like step 1) and then joined with userid profile data read from Aerospike, the joined data is re-written back to Aerospike into a new set.
Sessionization
Typical log line entries as shown below are used to extract session information.
946628 - - [15/Jun/1998:22:00:01 +0000] "GET /images/blank.gif HTTP/1.1" 200 85
Sample output: select * from test.sessions
+---------+--------------+--------------+-------+
| userid | start | end | nhits |
+---------+--------------+--------------+-------+
| 1593760 | 897967790000 | 897968633000 | 131 |
| 1591365 | 897948422000 | 897949662000 | 195 |
Creating user profiles
Extract unique userid from log file data in HDFS, assign random age and gender and write to Aerospike.
Sample output: select * from test.profiles
+---------+-----+--------+
| userid | age | isMale |
+---------+-----+--------+
| 1520488 | 28 | 0 |
| 1597667 | 47 | 1 |
External Join
Sessionize same as part 1, read profile data from part 2, join the two
and write result to new set in Aerospike.
Sample output: select * from test.sessions
+---------+--------------+--------------+-------+
| userid | start | end | nhits |
+---------+--------------+--------------+-------+
| 1593760 | 897967790000 | 897968633000 | 131 |
| 1591365 | 897948422000 | 897949662000 | 195 |
| 1261950 | 897982435000 | 897982608000 | 70 |