ICP 05 : Sqoop - acikgozmehmet/BigDataProgramming GitHub Wiki
ICP 05: Sqoop is a tool for transferring bulk data between Hadoop and relational databases such as MySQL
Objectives
- Installation: Verifying the Sqoop and MySQL installations
- Use Sqoop to import and export MySQL tables to HDFS
- Create Hive tables through an HQL script; use Sqoop to import and export tables to relational databases
- Perform three queries on the databases
1. Installation: Verifying Sqoop and MySQL installation
sudo service mysqld start
mysql -uroot -pcloudera
2. Use Sqoop to import and export MySQL tables to HDFS
Step 1: List databases already in MySQL:
show databases;
Step 2: Create a new database named db1:
create database db1;
Step 3: Set the created database as your current database:
use db1;
Step 4: Create a new table called student:
create table student (id INT NOT NULL, name VARCHAR(20), major VARCHAR(20), gpa FLOAT, tutorId INT, PRIMARY KEY(id));
Step 5: Insert some data into the student table:
Step 6: Print the table and make sure the values are inserted:
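The actual statements for Steps 5 and 6 are not shown above; a minimal sketch with hypothetical rows (the names and values are illustrative, not from the original exercise):

```sql
-- Step 5: insert some sample data (hypothetical rows).
INSERT INTO student (id, name, major, gpa, tutorId) VALUES
  (1, 'Ali', 'CS', 3.6, 1000),
  (2, 'Ayse', 'EE', 3.2, 1001),
  (3, 'Mehmet', 'CS', 3.9, 1000);

-- Step 6: print the table to confirm the inserts.
SELECT * FROM student;
```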
Step 7: We have our table in MySQL; now we need to push it to HDFS using Sqoop:
sqoop import --connect jdbc:mysql://localhost/db1 --username root --password cloudera --table student -m 1
Please check out step 9 for output
Step 8: Check in the pushed folder in hdfs:
hadoop fs -ls
hadoop fs -ls student/
Please check out step 9 for output
Step 9: Print the content inside the pushed file:
hadoop fs -cat student/*
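Sqoop writes the imported rows as comma-delimited text in part files such as part-m-00000. A local sketch of what that file looks like and how to pull one column out of it with awk (the sample rows are hypothetical):

```shell
# Simulate a Sqoop-generated part file locally (hypothetical rows,
# in the same column order as the student table: id,name,major,gpa,tutorId).
printf '1,Ali,CS,3.6,1000\n2,Ayse,EE,3.2,1001\n' > part-m-00000

# Extract the name column (field 2) from the comma-delimited records.
awk -F',' '{ print $2 }' part-m-00000
```

This comma-delimited layout is also why the later export back into MySQL works without extra delimiter options.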
Step 10: Create a new table called student_exported_from_hadoop:
mysql> use db1;
mysql> create table student_exported_from_hadoop (id INT NOT NULL, name VARCHAR(20), major VARCHAR(20), gpa FLOAT, tutorId INT, PRIMARY KEY(id));
Step 11: Export data from HDFS into MySQL:
sqoop export --connect jdbc:mysql://localhost/db1 --username root --password cloudera --table student_exported_from_hadoop --export-dir student/part-m-00000
Step 12: Verify whether the HDFS data was successfully exported to the newly created table in MySQL:
use db1;
show tables;
select * from student_exported_from_hadoop;
3. Create Hive Tables, Use Sqoop to import and export tables to Relational Databases
Step 1: Start the Hive shell:
hive
Please check out Step 4 for output
Step 2: Create a new table called ogrenci:
hive> CREATE TABLE ogrenci(id INT, name STRING, major STRING, gpa FLOAT, tutorId INT) row format delimited fields terminated by ',' stored as textfile;
Please check out Step 4 for output
Step 3: Verify whether table has been created:
hive> show tables;
Please check out Step 4 for output
Step 4: Load data from HDFS into the created Hive table (note that LOAD DATA INPATH moves, rather than copies, the files from 'student/' into the Hive warehouse):
hive> LOAD DATA INPATH 'student/' INTO TABLE ogrenci;
Step 5: Verify whether the data has been imported:
hive> select * from ogrenci limit 10;
Step 6: Verify the table inside hive warehouse:
hadoop fs -ls /user/hive/warehouse/
Step 7: Create a table in MySQL:
create table ogrenci (id INT NOT NULL, name VARCHAR(20), major VARCHAR(20), gpa FLOAT, tutorId INT, PRIMARY KEY(id));
Step 8: Export hive table into created MySQL table using sqoop:
sqoop export --connect jdbc:mysql://localhost/db1 --username root --password cloudera --table ogrenci --export-dir /user/hive/warehouse/ogrenci -m 1
Step 9: Verify the exported data from HDFS in MySQL:
mysql> show tables;
mysql> select * from ogrenci;
4. Perform three queries from databases
Query-1: Statistics Queries:
Definition: Statistics is the discipline that concerns the collection, organization, analysis, interpretation and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied.
hive> analyze table ogrenci compute statistics;
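Once computed, the statistics are stored in the table's metadata rather than printed as a query result; a sketch of how they can be inspected from the Hive shell (the numRows, numFiles, and totalSize figures appear under Table Parameters):

```sql
-- Optionally gather column-level statistics as well.
ANALYZE TABLE ogrenci COMPUTE STATISTICS FOR COLUMNS;

-- Inspect the gathered statistics (numFiles, numRows, totalSize, ...).
DESCRIBE FORMATTED ogrenci;
```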
Query-2: WordCount Queries:
To count the occurrences of each name in the table:
hive> select word, count(1) as count from (select explode(split(name, '\\s')) as word from ogrenci) temptable group by word;
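The same group-and-count logic can be checked locally on a plain-text extract of the name column using standard shell tools (the sample names are hypothetical):

```shell
# Hypothetical name column extracted from the table's CSV export.
printf 'Ali\nAyse\nAli\n' > names.txt

# Count occurrences of each name, mirroring the Hive group-by.
sort names.txt | uniq -c | sort -rn
```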
Query-3: Identifying Pattern Queries:
Definition: SQL pattern matching allows you to search for patterns in data if you don't know the exact word or phrase you are seeking. This kind of SQL query uses wildcard characters to match a pattern, rather than specifying it exactly. For example, you can use the wildcard "C%" to match any string beginning with a capital C.
hive> select * from ogrenci where tutorId = 1000 OR major LIKE 'C%';
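The LIKE 'C%' predicate matches any string beginning with a capital C; the same pattern can be sketched on plain text with grep, where '%' corresponds to the anchored regex ^C (the sample majors are hypothetical):

```shell
# Hypothetical major values; LIKE 'C%' corresponds to the regex ^C.
printf 'CS\nEE\nCE\nMath\n' > majors.txt
grep '^C' majors.txt
```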
References:
- Sqoop Tutorial (TutorialsPoint): https://www.tutorialspoint.com/sqoop/index.htm
- Statistics (Wikipedia): https://en.wikipedia.org/wiki/Statistics
- Pattern Matching in SQL Server Queries (Lifewire): https://www.lifewire.com/pattern-matching-in-sql-server-queries-1019799