ICP 05 : Sqoop - acikgozmehmet/BigDataProgramming GitHub Wiki

ICP 05: Sqoop is a tool for transferring data between Hadoop and relational databases such as MySQL.

Objectives

  1. Installation: Verifying Sqoop and MySQL installation
  2. Use Sqoop to import and export MySQL tables to HDFS.
  3. Create Hive tables through an HQL script; use Sqoop to import and export tables to relational databases.
  4. Perform three queries from databases

1. Installation: Verifying Sqoop and MySQL installation

sudo service mysqld start

https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-05/Documentation/VirtualBox_cloudera-quickstart-vm-5.13.0-0-virtualbox_24_02_2020_18_12_17.png

mysql -uroot -pcloudera

https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-05/Documentation/VirtualBox_cloudera-quickstart-vm-5.13.0-0-virtualbox_24_02_2020_18_13_28.png

2. Use Sqoop to import and export MySQL tables to HDFS.

Step 1: List databases already in MySQL:

show databases;

https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-05/Documentation/VirtualBox_cloudera-quickstart-vm-5.13.0-0-virtualbox_24_02_2020_18_14_20.png

Step 2: Create a new database named db1:

create database db1;

Step 3: Set the created database as your current database:

use db1;

Step 4: Create a new table called student:

create table student (id INT NOT NULL, name VARCHAR(20), major VARCHAR(20), gpa FLOAT, tutorId INT, PRIMARY KEY(id));

Step 5: Insert some data into the student table:
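The exact rows used in the assignment are not shown; the following INSERT is an illustrative sketch with sample values matching the student schema:

```sql
-- Sample values only; the original data is not recorded in this writeup
INSERT INTO student (id, name, major, gpa, tutorId) VALUES
  (1, 'Alice', 'CS',   3.8, 1000),
  (2, 'Bob',   'Math', 3.2, 1001),
  (3, 'Carol', 'CS',   3.6, 1000);
```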

Step 6: Print the table and make sure the values are inserted:
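A plain SELECT is enough to confirm the inserts:

```sql
SELECT * FROM student;
```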

Step 7: We now have our table in MySQL; next, push it to HDFS using Sqoop:

sqoop import --connect jdbc:mysql://localhost/db1 --username root --password cloudera --table student --m 1

Please check out Step 9 for the output.

Step 8: Check the pushed folder in HDFS:

hadoop fs -ls

hadoop fs -ls student/

Please check out Step 9 for the output.

Step 9: Print the content inside the pushed file:

hadoop fs -cat student/*

Step 10: Create a new table called student_exported_from_hadoop:

mysql> use db1;

mysql> create table student_exported_from_hadoop (id INT NOT NULL, name VARCHAR(20), major VARCHAR(20), gpa FLOAT, tutorId INT, PRIMARY KEY(id));

Step 11: Export data from HDFS into MySQL:

sqoop export --connect jdbc:mysql://localhost/db1 --username root --password cloudera --table student_exported_from_hadoop --export-dir student/part-m-00000

Step 12: Verify that the HDFS data was successfully exported to the newly created table in MySQL:

use db1;

show tables;

select * from student_exported_from_hadoop;

3. Create Hive Tables, Use Sqoop to import and export tables to Relational Databases

Step 1: Start the Hive shell:

hive

Please check out Step 4 for output

Step 2: Create a new table called ogrenci:

hive> CREATE TABLE ogrenci(id INT, name STRING, major STRING, gpa FLOAT, tutorId INT) row format delimited fields terminated by ',' stored as textfile;

Please check out Step 4 for output

Step 3: Verify whether table has been created:

hive> show tables;

Please check out Step 4 for output

Step 4: Load the data from HDFS into the created Hive table:

hive> LOAD DATA INPATH 'student/' INTO TABLE ogrenci;

Step 5: Verify whether the data has been imported:

hive> select * from ogrenci limit 10;

Step 6: Verify the table inside hive warehouse:

hadoop fs -ls /user/hive/warehouse/

Step 7: Create a table in MySQL:

create table ogrenci (id INT NOT NULL, name VARCHAR(20), major VARCHAR(20), gpa FLOAT, tutorId INT, PRIMARY KEY(id));

Step 8: Export the Hive table into the created MySQL table using Sqoop:

sqoop export --connect jdbc:mysql://localhost/db1 --username root --password cloudera --table ogrenci --export-dir /user/hive/warehouse/ogrenci -m 1

Step 9: Verify the exported data from HDFS in MySQL:

mysql> show tables;

mysql> select * from ogrenci;

4. Perform three queries from databases

Query-1: Statistics Queries:

Definition: Statistics is the discipline concerned with the collection, organization, analysis, interpretation, and presentation of data. Applying statistics to a scientific, industrial, or social problem conventionally begins with a statistical population or a statistical model to be studied.

hive> analyze table ogrenci compute statistics;
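Once the statistics have been computed, they are stored in the table metadata; a standard way to inspect them (row count, total size, and so on, under Table Parameters) is:

```sql
hive> describe formatted ogrenci;
```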

Query-2: WordCount Queries:

To count occurrences of each name in the table:

hive> select word, count(1) as count from (select explode(split(name, '\\s')) as word from ogrenci) temptable group by word;

Query-3: Identifying Pattern Queries:

Definition: SQL pattern matching allows you to search for patterns in data if you don't know the exact word or phrase you are seeking. This kind of SQL query uses wildcard characters to match a pattern, rather than specifying it exactly. For example, you can use the wildcard "C%" to match any string beginning with a capital C.

hive> select * from ogrenci where tutorId = 1000 OR major LIKE 'C%';

References:

http://sqoop.apache.org/

https://www.tutorialspoint.com/sqoop/index.htm

https://en.wikipedia.org/wiki/Statistics

https://www.lifewire.com/pattern-matching-in-sql-server-queries-1019799