Module 1: Lab #1 - SnehaMishra28/BigData_Programming_Summer2018 GitHub Wiki

Team: 12
Professor: Yugyung Lee

Name: Sneha Mishra
Class ID: 11
Email: [email protected]
MyGitHub

Technical Partner:
Name: Aditya Soman
Class ID: 19
Email: [email protected]
GitHub

YouTube Link explaining the Lab work can be found here

The report for the Lab work is here

Objective

Hadoop MapReduce Algorithm
Use Case Based No SQL Comparison

Question 1

Solution 1

Use Case: Finding Facebook common friends

Input File

Mapper Class Code

Reducer Class Code

Driver Class Code

Execution

Exit Code 0

Output Generated

Output File

MapReduce Diagram

Question 2

Consider one of the use cases from the below link:
https://umkc.box.com/s/q64fvjm6yd454w5v3ky0he4854g6m1fq

These use cases were discussed in Lecture 1: Cassandra.

Consider one of the use case and use a simple dataset. Describe the use case considered based on your assumptions, report the dataset, its fields, datatype etc.
Use HBase to implement a Solution for the use case. Report at least 3 queries, their input and output. The query’s relevance towards solving the use case is important.
Use Cassandra to implement a Solution for the use case. Report at least 3 queries, their input and output. The query’s relevance towards solving the use case is important.
Compare Cassandra and HBase for your use case. Present a table with comparison of your use case being implemented in both NO SQL Systems.

Solution 2

Use Case: Facebook Messaging - Inbox/ Term Search

Part 1

When you login to Facebook and click Messages => "See all messages", you will see a search box... you can search your inbox in there..
The basic idea is that you use the user id as the partition key, and then all the information you need for an inbox search will be clustered as rows in that partition. You can then set up multiple tables like this with different types of data clustered in the partition to support different types of searches. Since Cassandra can access a partition in essentially constant time even with millions of users, the system can scale and remain fast as you add nodes and users.

Part 2

HBase implementation:

Step 1:

Create table named 'facebook'

Empty Table

Update table (put values and alter columns)

Describe Table details

Step 2: Queries

Query 1: PrefixFilter: This filter takes one argument as a prefix of a row key. It returns solely those key-values present in the very row that starts with the specified row prefix

Query 2: MultipleColumnPrefixFilter: This filter takes a listing of column prefixes. It returns key-values that are present in the very column that starts with any of the specified column prefixes. every column prefixes should be a form qualifier.

Query 3: ColumnCountGetFilter: This filter takes one argument a limit. It returns the primary limit number of columns within the table.

Query 4: Filter applied column family wise

Part 3

Cassandra implementation:

Step 1: Create Keyspace and Table named facebook

The table facebook is created with messages from different Users as columns (user1, user2 etc).

Insert data in the created Table

Final updated Table with data

Final created Keyspace description

Step 2: Perform queries on the created facebook Table to get the desired output.

Query 1: Term search - performs the text search to find the particular text sent by any user

Query 2: User search - returns all the contents from the selected User

Query 3: Interaction Search - returns all the conversation with all the users for a particular User ID

Part 4

Comparison of Cassandra and HBase based on the selected use case:

HBase:

Has a simpler consistency model than Cassandra.
Very good scalability and performance for their data patterns.
Most feature rich for their requirements: auto load balancing and failover, compression support, multiple shards per server, etc.
HDFS, the filesystem used by HBase, supports replication, end-to-end checksums, and automatic rebalancing.
Facebook's operational teams have a lot of experience using HDFS because Facebook is a big user of Hadoop and Hadoop uses HDFS as its distributed file system.

Cassandra:

Wide-column store based on ideas of BigTable and DynamoDB.
SQL-like DML and DDL statements (CQL).
APIs and other access methods - Proprietary protocol, Thrift.
No Server-side scripts.
User concepts - Access rights for users can be defined per object.