Module 1: Lab #1 - SnehaMishra28/BigData_Programming_Summer2018 GitHub Wiki
Team: 12
Professor: Yugyung Lee
Name: Sneha Mishra
Class ID: 11
Email: [email protected]
MyGitHub
Technical Partner:
Name: Aditya Soman
Class ID: 19
Email: [email protected]
GitHub
YouTube Link explaining the Lab work can be found here
The report for the Lab work is here
Objective
- Hadoop MapReduce Algorithm
- Use Case Based No SQL Comparison
Question 1
Solution 1
Use Case: Finding Facebook common friends
Input File
Mapper Class Code
Reducer Class Code
Driver Class Code
Execution
Exit Code 0
Output Generated
Output File
MapReduce Diagram
Question 2
Consider one of the use cases from the below link:
https://umkc.box.com/s/q64fvjm6yd454w5v3ky0he4854g6m1fq
These use cases were discussed in Lecture 1: Cassandra.
- Consider one of the use case and use a simple dataset. Describe the use case considered based on your assumptions, report the dataset, its fields, datatype etc.
- Use HBase to implement a Solution for the use case. Report at least 3 queries, their input and output. The query’s relevance towards solving the use case is important.
- Use Cassandra to implement a Solution for the use case. Report at least 3 queries, their input and output. The query’s relevance towards solving the use case is important.
- Compare Cassandra and HBase for your use case. Present a table with comparison of your use case being implemented in both NO SQL Systems.
Solution 2
Use Case: Facebook Messaging - Inbox/ Term Search
Part 1
When you login to Facebook and click Messages => "See all messages", you will see a search box... you can search your inbox in there..
The basic idea is that you use the user id as the partition key, and then all the information you need for an inbox search will be clustered as rows in that partition. You can then set up multiple tables like this with different types of data clustered in the partition to support different types of searches. Since Cassandra can access a partition in essentially constant time even with millions of users, the system can scale and remain fast as you add nodes and users.
Part 2
HBase implementation:
Step 1:
Create table named 'facebook'
Empty Table
Update table (put values and alter columns)
Describe Table details
Step 2: Queries
Query 1: PrefixFilter: This filter takes one argument as a prefix of a row key. It returns solely those key-values present in the very row that starts with the specified row prefix
Query 2: MultipleColumnPrefixFilter: This filter takes a listing of column prefixes. It returns key-values that are present in the very column that starts with any of the specified column prefixes. every column prefixes should be a form qualifier.
Query 3: ColumnCountGetFilter: This filter takes one argument a limit. It returns the primary limit number of columns within the table.
Query 4: Filter applied column family wise
Part 3
Cassandra implementation:
Step 1: Create Keyspace and Table named facebook
The table facebook is created with messages from different Users as columns (user1, user2 etc).
Insert data in the created Table
Final updated Table with data
Final created Keyspace description
Step 2: Perform queries on the created facebook Table to get the desired output.
Query 1: Term search - performs the text search to find the particular text sent by any user
Query 2: User search - returns all the contents from the selected User
Query 3: Interaction Search - returns all the conversation with all the users for a particular User ID
Part 4
Comparison of Cassandra and HBase based on the selected use case:
HBase:
- Has a simpler consistency model than Cassandra.
- Very good scalability and performance for their data patterns.
- Most feature rich for their requirements: auto load balancing and failover, compression support, multiple shards per server, etc.
- HDFS, the filesystem used by HBase, supports replication, end-to-end checksums, and automatic rebalancing.
- Facebook's operational teams have a lot of experience using HDFS because Facebook is a big user of Hadoop and Hadoop uses HDFS as its distributed file system.
Cassandra:
- Wide-column store based on ideas of BigTable and DynamoDB.
- SQL-like DML and DDL statements (CQL).
- APIs and other access methods - Proprietary protocol, Thrift.
- No Server-side scripts.
- User concepts - Access rights for users can be defined per object.
References
- https://stackoverflow.com/questions/28130774/how-did-facebook-use-cassandra-for-inbox-search-if-caasandra-has-no-search-capa
- https://www.quora.com/What-is-inbox-search-on-Facebook
- http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
- http://horicky.blogspot.com/2010/10/bigtable-model-with-cassandra-and-hbase.html
- https://learnhbase.wordpress.com/2013/03/02/hbase-shell-commands/
- https://drill.apache.org/docs/querying-hbase/
- https://acadgild.com/blog/different-types-of-filters-in-hbase-shell