Apache Pig - kaushikdas/TechnicalWritings GitHub Wiki
Why Pig?
- Removes the need to write mappers and reducers by hand, which is cumbersome
- Provides a scripting language, Pig Latin, for writing SQL-like code that runs as MapReduce jobs
- UDFs (user-defined functions) make it highly extensible
Where does Pig sit?
+----------------------------+
|            Pig             |
+----------------+-----------+
|   MapReduce    |    Tez    |  <---- underlying execution engine (MapReduce or Tez)
+----------------+-----------+
|            YARN            |
+----------------------------+
|            HDFS            |
+----------------------------+
Activity
Find old 5-star movies
/**
 * Create a _relation_ named ratings by loading the u.data file from
 * *HDFS*. By default Pig expects the file to be TAB-delimited, which
 * this file already is.
 * The AS clause supplies the schema: it assigns a name and a type to
 * each field as the file is read from HDFS.
 **/
ratings = LOAD '/user/maria_dev/ml-100k/u.data' -- path in HDFS
    AS (userID:int, movieID:int, rating:int, ratingTime:int);

metaData = LOAD '/user/maria_dev/ml-100k/u.item'
    USING PigStorage('|') -- this file uses | as the delimiter
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray,
        videoRelease:chararray, imdbLink:chararray);

nameLookup = FOREACH metaData GENERATE movieID, movieTitle,
    ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;

ratingsByMovie = GROUP ratings BY movieID;

avgRatings = FOREACH ratingsByMovie
    GENERATE group AS movieID, -- group is the field holding the grouping key (movieID)
             AVG(ratings.rating) AS avgRating;

fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;

fiveStarsWithData = JOIN fiveStarMovies BY movieID, nameLookup BY movieID;

oldestFiveStarMovies = ORDER fiveStarsWithData BY nameLookup::releaseTime;

DUMP oldestFiveStarMovies;
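To make the data flow of the script easier to follow, here is a minimal Python sketch of the same GROUP, AVG, FILTER, JOIN and ORDER steps on a tiny in-memory sample. It only illustrates the logic and is not how Pig executes the script. The sample rows, the function name `oldest_five_star_movies` and the `threshold` parameter are made up for this example and do not come from u.data or u.item.

```python
from collections import defaultdict

# Hypothetical sample rows shaped like the two relations in the script.
ratings = [  # (userID, movieID, rating, ratingTime)
    (1, 10, 5, 881250949),
    (2, 10, 4, 881250950),
    (1, 20, 3, 881250951),
    (3, 20, 4, 881250952),
    (2, 30, 5, 881250953),
]
name_lookup = [  # (movieID, movieTitle, releaseTime)
    (10, "Movie A (1995)", 788918400),
    (20, "Movie B (1940)", -946771200),
    (30, "Movie C (1934)", -1136073600),
]


def oldest_five_star_movies(ratings, name_lookup, threshold=4.0):
    # GROUP ratings BY movieID
    ratings_by_movie = defaultdict(list)
    for _user_id, movie_id, rating, _rating_time in ratings:
        ratings_by_movie[movie_id].append(rating)

    # FOREACH ... GENERATE group AS movieID, AVG(ratings.rating) AS avgRating
    avg_ratings = {m: sum(r) / len(r) for m, r in ratings_by_movie.items()}

    # FILTER avgRatings BY avgRating > 4.0
    five_star = {m: avg for m, avg in avg_ratings.items() if avg > threshold}

    # JOIN ... BY movieID (inner join); the key appears once per input relation,
    # which is why movieID shows up twice in each result tuple.
    joined = [
        (movie_id, five_star[movie_id], movie_id, title, release_time)
        for movie_id, title, release_time in name_lookup
        if movie_id in five_star
    ]

    # ORDER ... BY nameLookup::releaseTime (oldest first)
    return sorted(joined, key=lambda row: row[4])


for row in oldest_five_star_movies(ratings, name_lookup):
    print(row)
```

In this sample, movie 20 averages 3.5 and is dropped by the filter. The remaining rows have the same five-field shape as the tuples that DUMP prints.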
Execution
Connect to the sandbox over SSH, or open localhost:4200 (Shell-In-A-Box), and sign in as maria_dev/maria_dev:
[maria_dev@sandbox ~]$ hadoop fs -copyFromLocal ml-100k/u.item ml-100k/u.item # copy u.item to HDFS
[maria_dev@sandbox ~]$ pig -x tez -f fiveStarMovies.pig
20/03/05 17:31:31 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
20/03/05 17:31:31 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
20/03/05 17:31:31 INFO pig.ExecTypeProvider: Trying ExecType : TEZ_LOCAL
20/03/05 17:31:31 INFO pig.ExecTypeProvider: Trying ExecType : TEZ
20/03/05 17:31:31 INFO pig.ExecTypeProvider: Picked TEZ as the ExecType
~
~
Input(s):
Successfully read 100003 records (2079229 bytes) from: "/user/maria_dev/ml-100k/u.data"
Successfully read 1682 records (236344 bytes) from: "/user/maria_dev/ml-100k/u.item"
Output(s):
Successfully stored 132 records (6741 bytes) in: "hdfs://sandbox.hortonworks.com:8020/tmp/temp657015348/tmp-562690490"
2020-03-05 17:32:07,588 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2020-03-05 17:32:07,588 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(493,4.15,493,Thin Man, The (1934),-1136073600) # actual script result begin
(604,4.012345679012346,604,It Happened One Night (1934),-1136073600)
(615,4.0508474576271185,615,39 Steps, The (1935),-1104537600)
(1203,4.0476190476190474,1203,Top Hat (1935),-1104537600)
(613,4.037037037037037,613,My Man Godfrey (1936),-1073001600)
(633,4.057971014492754,633,Christmas Carol, A (1938),-1009843200)
(132,4.0772357723577235,132,Wizard of Oz, The (1939),-978307200)
~
~
(1191,4.333333333333333,1191,Letter From Death Row, A (1998),886291200)
(1594,4.5,1594,Everest (1998),889488000)
(315,4.1,315,Apt Pupil (1998),909100800) # actual script result end
2020-03-05 17:32:07,904 [main] INFO org.apache.pig.Main - Pig script completed in 36 seconds and 385 milliseconds (36385 ms)
2020-03-05 17:32:07,915 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher - Shutting down thread pool
2020-03-05 17:32:07,942 [pool-1-thread-1] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager - Shutting down Tez session org.apache.tez.client.TezClient@5a45a709
2020-03-05 17:32:07,953 [pool-1-thread-1] INFO org.apache.tez.client.TezClient - Shutting down Tez Session, sessionName=PigLatin:fiveStarMovies.pig, applicationId=application_1583424002174_0003
[maria_dev@sandbox ~]$
The -f option specifies the Pig script file, and -x tez selects Tez as the execution engine. MapReduce can be selected instead with -x mapreduce, but with Tez execution can be roughly 10 times faster.
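The last field of each result tuple in the output above is releaseTime, the value produced by ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')). The Python sketch below shows how such values map back to dates. It assumes the dates were interpreted as midnight UTC, which matches the values in the log. The helper names `to_unix_time` and `to_release_date` are made up for this example, and the month abbreviations assume the default English (C) locale.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_time(release_date: str) -> int:
    """Parse a 'dd-MMM-yyyy' string (e.g. '01-Jan-1934') as midnight UTC
    and return seconds since the Unix epoch (negative before 1970)."""
    parsed = datetime.strptime(release_date, "%d-%b-%Y").replace(tzinfo=timezone.utc)
    return int((parsed - EPOCH).total_seconds())


def to_release_date(release_time: int) -> str:
    """Convert epoch seconds back to a 'dd-MMM-yyyy' string in UTC."""
    return (EPOCH + timedelta(seconds=release_time)).strftime("%d-%b-%Y")


# Values taken from the DUMP output above.
print(to_release_date(-1136073600))  # Thin Man, The (1934)
print(to_release_date(886291200))    # Letter From Death Row, A (1998)
print(to_unix_time("01-Jan-1934"))
```

Release dates before 1970 give negative timestamps, which is why the oldest movies come first when the relation is ordered by releaseTime in ascending order.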