SQOOP:

Introduction:

Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases. Sqoop works with relational databases such as Teradata, Oracle, MySQL, and PostgreSQL. Many organizations deploy Hadoop to analyze vast streams of information. When the big data stores and analyzers of the Hadoop ecosystem came into the picture, a tool was needed to move data between these stores and the external world. Here Sqoop occupies a place in the Hadoop ecosystem, providing feasible interaction between relational database servers and Hadoop's HDFS.

Aim:

To import and export tables between MySQL and HDFS, create Hive tables using an HQL script, and import and export tables between Hive and relational databases using Sqoop.

Tools:

  • Hive
  • Linux
  • Sqoop
  • MySQL

Tasks:

Task 1:

Use Sqoop to import and export MySQL tables to HDFS. For this, we first need to start MySQL on the command line.

Before importing from MySQL, the following commands are run to set up the source database:

To connect to the database:
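
On the Cloudera VM this is typically done with the MySQL command-line client (the root user is an assumption, matching the connection strings used later):

mysql -u root -p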

Created the database sqoop with the following query:
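
A minimal sketch of those statements (the database name sqoop comes from the text above):

CREATE DATABASE sqoop;
USE sqoop;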

Now create a table in MySQL using the following query:
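
A plausible schema for the emp table (the exact columns are an assumption; the layout below matches the fields commonly used in Sqoop tutorials):

CREATE TABLE emp (
  id INT NOT NULL PRIMARY KEY,
  name VARCHAR(50),
  deg VARCHAR(50),
  salary INT,
  dept VARCHAR(10)
);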

Inserted data into the table emp with the following query:
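
Illustrative inserts, assuming the schema sketched above (the actual rows are not shown on this page):

INSERT INTO emp VALUES (1201, 'gopal', 'manager', 50000, 'TP');
INSERT INTO emp VALUES (1202, 'manisha', 'proof reader', 50000, 'TP');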

Show the table contents:
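
The standard query to display the rows:

SELECT * FROM emp;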

Import a table into HDFS using Sqoop from MySQL: the Sqoop 'import' tool imports table data from the table into the Hadoop file system as a text file or a binary file.
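
A basic form of the import, assuming the database and table created above (one mapper is used since the table is small):

sqoop import \
--connect jdbc:mysql://localhost/sqoop \
--username root \
--table emp \
--m 1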

We can specify the target directory while importing table data into HDFS using the Sqoop import tool. Following is the syntax to specify the target directory as option to the Sqoop import command.

sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \
--target-dir /queryresult

The command below shows that the table was imported into HDFS:
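
One way to verify the import from the terminal (the part-m-* file name pattern is the MapReduce default that Sqoop produces):

hdfs dfs -cat /queryresult/part-m-*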

We can import a subset of a table using the 'where' clause in the Sqoop import tool. It executes the corresponding SQL query on the respective database server and stores the result in a target directory in HDFS. Example:

sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \
--where "city = 'sec-bad'" \
--target-dir /wherequery

Export table: I have created an employee table.
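
A sketch of the export command, assuming the employee data is staged in HDFS under /emp/emp_data and that a matching employee table already exists in MySQL (both paths and names are assumptions):

sqoop export \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table employee \
--export-dir /emp/emp_data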

Task2:

Create Hive tables through an HQL script, and use Sqoop to import and export tables to relational databases.

As with scripts in other languages such as SQL and Unix shell, Hive scripts are used to execute a set of Hive commands collectively. This reduces the time and effort invested in writing and executing each command manually.

Open a terminal in your Cloudera CDH4 distribution and give the below command to create a Hive script. Command: gedit filename.sql

We can write the queries in the script file and run it on Hive using the following command. Here tables-schema.hql is the HQL file, which is in the Downloads folder of my VM.
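
A minimal sketch of what tables-schema.hql might contain (the actual schema in the file is an assumption; a comma-delimited emp table is shown):

CREATE TABLE IF NOT EXISTS emp (
  id INT,
  name STRING,
  deg STRING,
  salary INT,
  dept STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;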

We can run the HQL file using the following command:
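
hive -f runs every statement in the given script; the path below assumes the Downloads folder mentioned above:

hive -f /home/cloudera/Downloads/tables-schema.hql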

Below is the command for loading the data file into HDFS, which is then used by the HQL file:
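
A sketch of this step, assuming a comma-delimited emp.txt data file (the file name and paths are assumptions):

hdfs dfs -put /home/cloudera/Downloads/emp.txt /user/cloudera/emp.txt

The HQL script can then pick this file up with a LOAD DATA INPATH statement pointing at the same HDFS path.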

Here I created a new table empnew in MySQL to receive the data exported from the Hive table, with the following command:
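
A sketch of the MySQL table, assuming it mirrors the Hive emp schema so the exported columns line up:

CREATE TABLE empnew (
  id INT,
  name VARCHAR(50),
  deg VARCHAR(50),
  salary INT,
  dept VARCHAR(10)
);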

I already imported the emp table into Hive from MySQL using Sqoop; now I am using the same table to export back to MySQL.

Below is the table I am going to export back to MySQL from Hive:

Below is the Sqoop command that is executed in the terminal to export the Hive table into MySQL:
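
A sketch of the export, assuming the Hive table sits in the default warehouse location and is comma-delimited (both are assumptions about this setup):

sqoop export \
--connect jdbc:mysql://localhost/sqoop \
--username root \
--table empnew \
--export-dir /user/hive/warehouse/emp \
--input-fields-terminated-by ','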

Below is the SQL query that displays the contents of the emp3 table exported into MySQL from HDFS:
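
The query itself is a plain select on the target table:

SELECT * FROM emp3;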

Task3:

We were asked to choose a table of our own; I chose the employees table and exported it from Hive to MySQL. We were also asked to display a word count, statistics, and pattern identification:

Created the employees table in Hive and loaded data into it:
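
A sketch of the Hive DDL and load, assuming a comma-delimited employees file (the columns and path are assumptions):

CREATE TABLE IF NOT EXISTS employees (
  id INT,
  name STRING,
  dept STRING,
  salary INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/employees.csv' INTO TABLE employees;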

The image below shows the employees table in Hive:

Now I am exporting the employees table from Hive into MySQL using the following Sqoop command:
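
A sketch of that command, assuming a MySQL table named employees already exists with a matching schema and that the Hive table sits in the default warehouse path:

sqoop export \
--connect jdbc:mysql://localhost/sqoop \
--username root \
--table employees \
--export-dir /user/hive/warehouse/employees \
--input-fields-terminated-by ','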

I am running the following queries on the twitter table after loading it into MySQL:

SQL query to get statistics from the shakespere table:
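
A sketch of such a statistics query, assuming the shakespere table holds (word, freq) pairs produced by a word count job:

SELECT COUNT(*) AS distinct_words,
       MIN(freq) AS min_freq,
       MAX(freq) AS max_freq,
       AVG(freq) AS avg_freq
FROM shakespere;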

Following is the SQL query to get the word count from the shakespere data:
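
A sketch under the same (word, freq) assumption, listing the most frequent words:

SELECT word, SUM(freq) AS total
FROM shakespere
GROUP BY word
ORDER BY total DESC
LIMIT 10;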

Following is the query to identify a pattern in the stocks data:
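
A sketch of a pattern query, assuming the stocks table has a symbol column; LIKE filters rows whose symbol matches the given pattern:

SELECT *
FROM stocks
WHERE symbol LIKE 'A%';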

Limitations of using Sqoop:

Even though Sqoop has very strong advantages to its name, it does have some inherent disadvantages, which can be summarized as:

  • It uses a JDBC connection to connect with RDBMS-based data stores, and this can be inefficient and less performant.

  • For performing analysis, it executes various MapReduce jobs and, at times, this can be time-consuming when there are a lot of joins or when the data is in a denormalized fashion.

  • Being used for bulk transfers of data, it can put undue pressure on the source data store, which is not ideal if these stores are heavily used by the main business application.

References:

https://umkc.app.box.com/s/s0ks4vj80c1emckty607vjab0m4mouca

https://umkc.app.box.com/s/qqvm7xtajoqx3hqnrga87o8lbj446ns8

https://www.tutorialspoint.com/sqoop/sqoop_export.htm

Conclusion :

Though Sqoop is very useful for data transfer between HDFS and other databases, its JDBC-based transfers can be less efficient and underperform for bulk tables.