
ICP5:

Sqoop introduction:

- Hadoop is great for storing massive volumes of data in HDFS.
- It provides a scalable processing environment for structured and unstructured data.
- But it is batch-oriented, so it is not suitable for interactive query applications.
- Sqoop acts as an ETL tool used to copy data between HDFS and SQL databases.

Part1:

1. Create a table in MySQL and import it into HDFS through Sqoop.
2. Export the table from HDFS back to MySQL.

Creation of the table in MySQL:
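A minimal sketch of this step, assuming a database `icp5` and a small `employee` table; the actual schema used in the exercise may differ.

```bash
# Create a sample database and table in MySQL (names are illustrative).
mysql -u root -p -e "
CREATE DATABASE IF NOT EXISTS icp5;
USE icp5;
CREATE TABLE employee (
  id   INT PRIMARY KEY,
  name VARCHAR(50),
  dept VARCHAR(30)
);
INSERT INTO employee VALUES (1, 'alice', 'cs'), (2, 'bob', 'ee');"
```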

Importing into HDFS using Sqoop:
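A sketch of the corresponding Sqoop import, reusing the assumed `icp5.employee` table from above; the connection string, credentials, and target directory will vary per cluster.

```bash
# Pull the MySQL table into HDFS as comma-delimited text files.
sqoop import \
  --connect jdbc:mysql://localhost:3306/icp5 \
  --username root -P \
  --table employee \
  --target-dir /user/cloudera/employee \
  -m 1
```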

Part2: Create Hive tables using HBase:
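One way to do this is Hive's HBaseStorageHandler; the sketch below creates a managed Hive table whose rows are stored in a new HBase table. All names here are assumptions.

```bash
# Hive table backed by HBase: the row key plus an "info" column family.
hive -e "
CREATE TABLE hbase_employee (id INT, name STRING, dept STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name,info:dept')
TBLPROPERTIES ('hbase.table.name' = 'employee_hbase');"
```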

Creating the target table in MySQL:
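Something like the following, reusing the assumed `icp5` schema: an empty table with the same layout as the source.

```bash
# Empty target table with the same structure as the original.
mysql -u root -p -e "
USE icp5;
CREATE TABLE employee_export LIKE employee;"
```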

Exporting from HDFS to MySQL:
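A sketch of the Sqoop export; it reads the comma-delimited files that the earlier import wrote to HDFS (the path is an assumption).

```bash
# Push the HDFS files back into the MySQL target table.
sqoop export \
  --connect jdbc:mysql://localhost:3306/icp5 \
  --username root -P \
  --table employee_export \
  --export-dir /user/cloudera/employee \
  --input-fields-terminated-by ',' \
  -m 1
```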

The exported table in MySQL:

Part3:

Choose one of the following datasets:

I chose the stock dataset and downloaded it.

Create a table in Hive and load the dataset:
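A sketch under the assumption that the stock data is a CSV with date, symbol, price, and volume columns; adjust the schema and the local path to the actual file.

```bash
# Hive table for the stock CSV, then a load from the local filesystem.
hive -e "
CREATE TABLE stocks (
  trade_date  STRING,
  symbol      STRING,
  open_price  DOUBLE,
  high_price  DOUBLE,
  low_price   DOUBLE,
  close_price DOUBLE,
  volume      BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count'='1');
LOAD DATA LOCAL INPATH '/home/cloudera/stocks.csv' INTO TABLE stocks;"
```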

Creating the table in MySQL to receive the Hive data:
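The MySQL side needs a table whose columns line up with the Hive schema; this mirrors the assumed `stocks` layout above.

```bash
# MySQL table matching the Hive stocks schema column for column.
mysql -u root -p -e "
USE icp5;
CREATE TABLE stocks (
  trade_date  VARCHAR(10),
  symbol      VARCHAR(10),
  open_price  DOUBLE,
  high_price  DOUBLE,
  low_price   DOUBLE,
  close_price DOUBLE,
  volume      BIGINT
);"
```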

Exporting from Hive to MySQL using Sqoop:
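A sketch of the export; `/user/hive/warehouse/stocks` is Hive's default warehouse location for a managed table and may differ on your cluster.

```bash
# Export the Hive table's underlying comma-delimited files to MySQL.
sqoop export \
  --connect jdbc:mysql://localhost:3306/icp5 \
  --username root -P \
  --table stocks \
  --export-dir /user/hive/warehouse/stocks \
  --input-fields-terminated-by ',' \
  -m 1
```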

Form 3 intuitive questions from your dataset:

1. Statistics
2. WordCount
3. Identifying a pattern

1. Statistical query in Hive using the stock table:
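For example, per-symbol summary statistics over the assumed `stocks` schema:

```bash
# Min/max/average prices per symbol.
hive -e "
SELECT symbol,
       MIN(low_price)   AS min_low,
       MAX(high_price)  AS max_high,
       AVG(close_price) AS avg_close
FROM stocks
GROUP BY symbol;"
```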

2. WordCount query in Hive:
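A word-count-style query over this dataset amounts to counting occurrences of each distinct value; counting symbols is one natural choice, again assuming the `stocks` schema above.

```bash
# "WordCount" over the symbol column: occurrences per symbol.
hive -e "
SELECT symbol, COUNT(*) AS cnt
FROM stocks
GROUP BY symbol
ORDER BY cnt DESC;"
```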

WordCount Output:

3. Identifying a pattern:

I used the LIKE operator in Hive to find values containing "24" anywhere in the stock dataset.
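A sketch of that query; which column to match against is an assumption, here the close price cast to a string.

```bash
# Rows whose close price contains the digits "24" anywhere.
hive -e "
SELECT *
FROM stocks
WHERE CAST(close_price AS STRING) LIKE '%24%';"
```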

This is how the result of the pattern query is shown in Hive.

BONUS:

1. Save your query results into a Hive table.
2. Use complex datatypes in your queries.

I saved the pattern result into a separate Hive table using a single query, as follows:
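A CREATE TABLE AS SELECT (CTAS) does this in one statement; the table name `stocks_pattern` is illustrative.

```bash
# Persist the LIKE-pattern result into its own Hive table.
hive -e "
CREATE TABLE stocks_pattern AS
SELECT *
FROM stocks
WHERE CAST(close_price AS STRING) LIKE '%24%';"
```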

The pattern output, now stored in its own table, is shown as follows:

Second half of the bonus:

2. Use complex datatypes in your queries:

I created a table with complex datatypes in Hive as follows:
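A sketch of such a table using Hive's ARRAY, MAP, and STRUCT types; the columns are assumptions fitted to the stock theme.

```bash
# Complex-typed table: a list of closes, a map of stats, a struct of metadata.
hive -e "
CREATE TABLE stock_complex (
  symbol       STRING,
  daily_closes ARRAY<DOUBLE>,
  stats        MAP<STRING, DOUBLE>,
  info         STRUCT<exchange:STRING, sector:STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':';"
```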

Using the complex datatype columns in a query as follows:
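For example, indexing into each complex column:

```bash
# Array indexing, map lookup, and struct field access in one query.
hive -e "
SELECT symbol,
       daily_closes[0] AS first_close,
       stats['avg']    AS avg_close,
       info.exchange   AS exchange
FROM stock_complex;"
```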

Grouping on the complex datatype column with GROUP BY in Hive:
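Grouping directly by an ARRAY column is not supported in many Hive versions, so the usual idiom is to explode it with a LATERAL VIEW first; a sketch:

```bash
# Flatten the array, then group by the individual close values.
hive -e "
SELECT close_value, COUNT(*) AS cnt
FROM stock_complex
LATERAL VIEW explode(daily_closes) t AS close_value
GROUP BY close_value;"
```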

Thus the import and export jobs are run with the Sqoop command in Hadoop.