Big_Data_Programming_ICP_5

Title: Sqoop Introduction

Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back into relational databases.

Why Sqoop?

Many organizations deploy Hadoop to analyze vast streams of information. Once the Big Data stores and analysis engines of the Hadoop ecosystem came into the picture, a tool was needed to move data between those stores and the outside world. Sqoop fills this place in the Hadoop ecosystem, providing practical interaction between relational database servers and Hadoop's HDFS.

Sqoop Features:

- Allows data imports and exports
- Parallelizes data transfer
- Copies data quickly
- Makes data analysis more efficient
- Mitigates excessive loads on external systems

In today's class, we're going to perform a few basic operations such as import, export, and a word count in Hive.

1. First, we log into the MySQL server on the Cloudera cluster and list the databases and tables available in Cloudera's MySQL server.
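A minimal sketch of this step on the Cloudera QuickStart VM might look like the following (the root/cloudera credentials and the retail_db database are QuickStart defaults, not taken from this wiki; adjust them for your cluster):

```bash
# Connect to the local MySQL server and list the available databases
mysql -u root -pcloudera -e 'SHOW DATABASES;'

# List the tables in one of the pre-loaded databases
mysql -u root -pcloudera -e 'USE retail_db; SHOW TABLES;'
```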

2. Next, we create a database named sqoop and a table named acad. We then insert values into the table and run a SELECT to verify that the values were inserted correctly.
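A hedged sketch of the SQL for this step is shown below; the acad column layout and the sample rows are only illustrative, since the exact schema is not listed in this wiki:

```bash
mysql -u root -pcloudera -e '
CREATE DATABASE IF NOT EXISTS sqoop;
USE sqoop;
-- Illustrative schema; the real acad table may differ
CREATE TABLE acad (id INT, name VARCHAR(50));
INSERT INTO acad VALUES (1, "hadoop"), (2, "hive"), (3, "sqoop");
-- Verify that the rows landed in the table
SELECT * FROM acad;'
```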

3. Sqoop import from RDBMS to HDFS.

As there is not much data to transfer, there is only one split, and we specify only one mapper, so we see a single map-only MapReduce job.
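A sketch of the import command, assuming the sqoop database and acad table created above, the QuickStart credentials, and a single mapper:

```bash
sqoop import \
  --connect jdbc:mysql://localhost/sqoop \
  --username root --password cloudera \
  --table acad \
  --target-dir /user/cloudera/acad \
  -m 1   # one mapper, so one map-only job and one output file
```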

Once the Sqoop import is complete, we verify that the transfer to the target directory was successful. The output image below is shown for reference.
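One way to check the target directory from the shell (the part-m-00000 file name assumes the single-mapper import sketched above):

```bash
# List the imported files and print their contents
hdfs dfs -ls /user/cloudera/acad
hdfs dfs -cat /user/cloudera/acad/part-m-00000
```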

4. Next, we run the import-all-tables command to import all tables from the retail_db database into HDFS. Because retail_db contains many tables and we set the number of mappers to 1, Sqoop runs one MapReduce job per table transferred into HDFS.
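A sketch of the command, again assuming the QuickStart retail_db database and credentials; --warehouse-dir gives each table its own subdirectory under the chosen path:

```bash
sqoop import-all-tables \
  --connect jdbc:mysql://localhost/retail_db \
  --username root --password cloudera \
  --warehouse-dir /user/cloudera/retail_db \
  -m 1   # one mapper per table, so one job per table
```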

5. Next, we create a Persons1 table in MySQL and will export the Persons data from Hive into it.
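The MySQL table that receives the export might be created roughly as follows; the column layout is an assumption (a typical Persons schema) and must match the columns of the Hive Persons table:

```bash
mysql -u root -pcloudera -e '
USE sqoop;
-- Target table for the Sqoop export; columns are illustrative
CREATE TABLE Persons1 (
  id        INT,
  lastname  VARCHAR(50),
  firstname VARCHAR(50),
  address   VARCHAR(100),
  city      VARCHAR(50)
);'
```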

6. Next, we run the table-schema.hql script.
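The contents of table-schema.hql are not shown in this wiki; a hypothetical version that defines a matching comma-delimited Hive table could look like this:

```bash
# Hypothetical table-schema.hql; the real script may differ
cat > table-schema.hql <<'EOF'
CREATE TABLE IF NOT EXISTS Persons (
  id        INT,
  lastname  STRING,
  firstname STRING,
  address   STRING,
  city      STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
EOF

# Run the script against Hive
hive -f table-schema.hql
```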

7. Creating the Hive table and exporting its data to MySQL.
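A sketch of the export, assuming the Hive table's data sits in the default warehouse path and is comma-delimited as defined above:

```bash
sqoop export \
  --connect jdbc:mysql://localhost/sqoop \
  --username root --password cloudera \
  --table Persons1 \
  --export-dir /user/hive/warehouse/persons \
  --input-fields-terminated-by ','
```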

8. Computing statistics for a table in Hive: Statistics such as the number of rows in a table or partition and the histograms of a particular column of interest are important in many ways. One of the key use cases of statistics is query optimization: statistics serve as input to the optimizer's cost functions so that it can compare different plans and choose among them. Statistics may also directly answer users' queries; users can quickly get results for some queries by reading the stored statistics rather than firing long-running execution plans.
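In Hive this is done with ANALYZE TABLE; a minimal sketch against the stocks table used in the following steps:

```bash
hive -e '
-- Gather table-level statistics (row count, file sizes, ...)
ANALYZE TABLE stocks COMPUTE STATISTICS;
-- Gather column-level statistics for the optimizer
ANALYZE TABLE stocks COMPUTE STATISTICS FOR COLUMNS;
-- The collected statistics show up in the table parameters
DESCRIBE FORMATTED stocks;'
```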

9. Performing a word count in Hive on the stocks table. We split the date column on '-', as no other column contains a separator. A lateral view lets us apply the table-generating function explode, which takes the array produced by the split function and outputs a new row for every element.
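A sketch of the query, assuming the column is literally named date (hence the backticks):

```bash
hive -e '
SELECT word, COUNT(*) AS cnt
FROM stocks
LATERAL VIEW explode(split(`date`, "-")) d AS word
GROUP BY word;'
```

Here split turns each date string into an array of its year, month, and day parts, explode emits one row per array element, and the outer query counts how often each value appears.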

10. Identifying patterns in Hive: Hive's regular expression and pattern-matching functions identify precise patterns of characters in a given string. They are useful for extracting substrings and validating existing data, for example validating dates, performing range checks, checking for particular characters, and extracting specific characters from the data. Here we retrieve rows matching the pattern %-%: in the stocks data, the date column contains '-', so we filter for rows where the date column matches that pattern.
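Two hedged variants of the filter are sketched below: a simple LIKE on the '%-%' pattern from the text, and an RLIKE regular expression that also validates the yyyy-mm-dd shape (the column name date is assumed as above):

```bash
# Rows whose date column contains a '-' anywhere
hive -e 'SELECT * FROM stocks WHERE `date` LIKE "%-%" LIMIT 10;'

# Stricter check: date must match the yyyy-mm-dd pattern
hive -e 'SELECT * FROM stocks WHERE `date` RLIKE "^[0-9]{4}-[0-9]{2}-[0-9]{2}$" LIMIT 10;'
```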