Apache Pig - kaushikdas/TechnicalWritings GitHub Wiki

Why Pig?

  • Removes the need to write mappers and reducers by hand
  • Provides a scripting language, Pig Latin, for writing SQL-like code that runs as MapReduce jobs
  • UDFs (user-defined functions) make it highly extensible

Where does Pig sit?

+----------------------------+
|            Pig             |
+----------------+-----------+
|    MapReduce   |    Tez    |  <---- underlying execution engine (MapReduce or Tez)
+----------------+-----------+
|           YARN             |
+----------------------------+
|           HDFS             |
+----------------------------+

Activity

Find the oldest 5-star movies (average rating above 4.0)

/**
 * Create a relation named ratings by loading the file u.data from
 * HDFS. By default Pig expects the file to be TAB-delimited, which
 * this file already is.
 * The schema given in the AS clause assigns a name and a type to
 * each field of the relation.
 **/
ratings = LOAD '/user/maria_dev/ml-100k/u.data' -- path in HDFS
     AS (userID:int, movieID:int, rating:int, ratingTime:int);

metaData = LOAD '/user/maria_dev/ml-100k/u.item'
    USING PigStorage('|') -- this file uses | as the delimiter
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray,
        videoRelease:chararray, imdbLink:chararray);

nameLookup = FOREACH metaData GENERATE movieID, movieTitle,
        ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;

ratingsByMovie = GROUP ratings BY movieID;

avgRatings = FOREACH ratingsByMovie 
    GENERATE group AS movieID,    -- group is the name Pig gives to the grouping key
             AVG(ratings.rating) AS avgRating;

fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;

fiveStarsWithData = JOIN fiveStarMovies BY movieID, nameLookup BY movieID;

oldestFiveStarMovies = ORDER fiveStarsWithData BY nameLookup::releaseTime;

DUMP oldestFiveStarMovies;
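
To make the data flow of the script easier to follow, here is a minimal plain-Python sketch of the same steps on a tiny in-memory sample. It only mirrors what each relation contains and is not how Pig executes the script. The sample rows, the movie titles, and the helper to_unix_time are hypothetical, and the date conversion assumes UTC and English month abbreviations.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical rows shaped like u.data: (userID, movieID, rating, ratingTime)
ratings = [
    (1, 10, 5, 881250949),
    (2, 10, 4, 881250950),
    (1, 20, 3, 881250951),
    (3, 20, 4, 881250952),
    (2, 30, 5, 881250953),
]

# Hypothetical rows shaped like the first fields of u.item:
# (movieID, movieTitle, releaseDate)
meta_data = [
    (10, "Movie A (1995)", "01-Jan-1995"),
    (20, "Movie B (1940)", "01-Jan-1940"),
    (30, "Movie C (1934)", "01-Jan-1934"),
]


def to_unix_time(release_date):
    """Rough stand-in for ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')), assuming UTC."""
    parsed = datetime.strptime(release_date, "%d-%b-%Y").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


# nameLookup: movieID -> (movieTitle, releaseTime)
name_lookup = {
    movie_id: (title, to_unix_time(date)) for movie_id, title, date in meta_data
}

# ratingsByMovie: GROUP collects all ratings that share a movieID into one bag
ratings_by_movie = defaultdict(list)
for _user_id, movie_id, rating, _rating_time in ratings:
    ratings_by_movie[movie_id].append(rating)

# avgRatings: one (movieID, avgRating) pair per group
avg_ratings = {m: sum(bag) / len(bag) for m, bag in ratings_by_movie.items()}

# fiveStarMovies: keep only groups whose average is above 4.0
five_star_movies = {m: avg for m, avg in avg_ratings.items() if avg > 4.0}

# fiveStarsWithData: inner join on movieID; like Pig, the key appears
# once from each side of the join
five_stars_with_data = [
    (m, avg, m, *name_lookup[m])
    for m, avg in five_star_movies.items()
    if m in name_lookup
]

# oldestFiveStarMovies: sort by releaseTime, the fifth field
oldest_five_star_movies = sorted(five_stars_with_data, key=lambda row: row[4])

for row in oldest_five_star_movies:
    print(row)
```

With this sample, movie 20 is dropped by the filter (its average is 3.5), and the 1934 title is printed before the 1995 one. Each printed tuple has the same five fields as the tuples produced by DUMP.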

Execution

Connect to the sandbox using SSH or connect to localhost:4200 (Shell-In-A-Box) and sign in as maria_dev/maria_dev:

[maria_dev@sandbox ~]$ hadoop fs -copyFromLocal ml-100k/u.item ml-100k/u.item    # copy u.item to HDFS
[maria_dev@sandbox ~]$ pig -x tez -f fiveStarMovies.pig
20/03/05 17:31:31 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
20/03/05 17:31:31 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
20/03/05 17:31:31 INFO pig.ExecTypeProvider: Trying ExecType : TEZ_LOCAL
20/03/05 17:31:31 INFO pig.ExecTypeProvider: Trying ExecType : TEZ
20/03/05 17:31:31 INFO pig.ExecTypeProvider: Picked TEZ as the ExecType
~
~
Input(s):
Successfully read 100003 records (2079229 bytes) from: "/user/maria_dev/ml-100k/u.data"
Successfully read 1682 records (236344 bytes) from: "/user/maria_dev/ml-100k/u.item"

Output(s):
Successfully stored 132 records (6741 bytes) in: "hdfs://sandbox.hortonworks.com:8020/tmp/temp657015348/tmp-562690490"

2020-03-05 17:32:07,588 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2020-03-05 17:32:07,588 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(493,4.15,493,Thin Man, The (1934),-1136073600)    # actual script result begin
(604,4.012345679012346,604,It Happened One Night (1934),-1136073600)
(615,4.0508474576271185,615,39 Steps, The (1935),-1104537600)
(1203,4.0476190476190474,1203,Top Hat (1935),-1104537600)
(613,4.037037037037037,613,My Man Godfrey (1936),-1073001600)
(633,4.057971014492754,633,Christmas Carol, A (1938),-1009843200)
(132,4.0772357723577235,132,Wizard of Oz, The (1939),-978307200)
~
~
(1191,4.333333333333333,1191,Letter From Death Row, A (1998),886291200)
(1594,4.5,1594,Everest (1998),889488000)
(315,4.1,315,Apt Pupil (1998),909100800)   # actual script result end
2020-03-05 17:32:07,904 [main] INFO  org.apache.pig.Main - Pig script completed in 36 seconds and 385 milliseconds (36385 ms)
2020-03-05 17:32:07,915 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher - Shutting down thread pool
2020-03-05 17:32:07,942 [pool-1-thread-1] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager - Shutting down Tez session org.apache.tez.client.TezClient@5a45a709
2020-03-05 17:32:07,953 [pool-1-thread-1] INFO  org.apache.tez.client.TezClient - Shutting down Tez Session, sessionName=PigLatin:fiveStarMovies.pig, applicationId=application_1583424002174_0003
[maria_dev@sandbox ~]$

The -f option specifies the Pig script file, and -x tez selects Tez as the execution engine. MapReduce can be selected instead with -x mr, but the same script can run around 10 times faster on Tez.
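
The last field of each tuple in the DUMP output is releaseTime in Unix seconds, so movies released before 1970 show up as negative numbers. As a quick sanity check, the short Python sketch below converts a few of those values back to calendar dates. The helper unix_to_date is hypothetical, and the conversion assumes the values are in UTC, which is consistent with the numbers in the log.

```python
from datetime import datetime, timedelta, timezone

# Unix time counts seconds from this instant; earlier dates are negative.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def unix_to_date(seconds):
    """Convert Unix seconds (possibly negative) to an ISO date string in UTC."""
    return (EPOCH + timedelta(seconds=seconds)).date().isoformat()


# releaseTime values copied from the DUMP output
samples = {
    "Thin Man, The (1934)": -1136073600,
    "39 Steps, The (1935)": -1104537600,
    "Everest (1998)": 889488000,
}

for title, release_time in samples.items():
    print(title, unix_to_date(release_time))
```

The two 1930s titles map to 1 January of their release year, and Everest maps to 1998-03-10. Adding a timedelta to the epoch is used here instead of datetime.fromtimestamp because the latter can fail on negative values on some platforms.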

Good Tutorials

  1. https://www.cloudera.com/tutorials/beginners-guide-to-apache-pig.html
  2. https://www.cloudera.com/tutorials/how-to-process-data-with-apache-pig.html