EMR 001 Compiling MapReduce Applications - qyjohn/AWS_Tutorials GitHub Wiki
The following tutorial works on Ubuntu 18.04.
First of all, install OpenJDK 8 and Apache Maven:
sudo apt update
sudo apt install openjdk-8-jdk-headless maven
Use the following command to create an empty Maven project in your home folder.
cd ~
mvn archetype:generate -DgroupId=net.qyjohn.emr -DartifactId=demo -DinteractiveMode=false
This will create a demo folder, with some skeleton code.
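For reference, the layout generated by the default quickstart archetype typically looks like the following (exact file names may vary slightly between archetype versions):

```
demo/
├── pom.xml
└── src
    ├── main/java/net/qyjohn/emr/App.java
    └── test/java/net/qyjohn/emr/AppTest.java
```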
cd demo
Replace pom.xml with the following content:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>net.qyjohn.emr</groupId>
<artifactId>demo</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>firstMapReduce</name>
<properties>
<maven-compiler-plugin.version>3.1</maven-compiler-plugin.version>
<java.version>1.8</java.version>
<hadoop.version>2.8.3</hadoop.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-common</artifactId>
<version>${hadoop.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>${maven-compiler-plugin.version}</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
Create file src/main/java/net/qyjohn/emr/WordCount.java, with the following content:
package net.qyjohn.emr;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
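To see what the two classes above actually compute, the same word-count logic can be sketched as plain Java without any Hadoop dependencies. This is only an illustration of the algorithm (class name and sample text are made up, not part of the project):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class WordCountSketch {

    public static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // What TokenizerMapper does: split each line on whitespace
            // and emit a (word, 1) pair for every token.
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                // What IntSumReducer does: sum up the 1s emitted per word.
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {"to be or not to be", "to be is to do"};
        System.out.println(count(lines));
    }
}
```

On the cluster, the mapper's (word, 1) pairs are shuffled by key before the reducer runs; the combiner (here the same class as the reducer) pre-sums counts on the map side to cut shuffle traffic.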
Use the following command to build the demo project:
mvn package
After this you will get a jar file target/demo-0.0.1-SNAPSHOT.jar. This is your application, ready to be used on your EMR cluster.
If you instead install Maven directly on your EMR cluster and compile the project there, you are likely to get the following error:
[INFO] Changes detected - recompiling the module!
[WARNING] File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!
[INFO] Compiling 2 source files to /home/hadoop/demo/target/classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:46 min
[INFO] Finished at: 2020-04-23T05:45:07Z
[INFO] Final Memory: 14M/223M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project demo: Fatal error compiling: invalid target release: 1.8 -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
The reason is that "sudo yum install apache-maven" also downgrades the default JDK (and JRE) to version 1.7. Beyond breaking the build, this downgrade breaks many daemons running on the EMR cluster, because most of them use JARs compiled with Java 1.8.
Copy target/demo-0.0.1-SNAPSHOT.jar to your EMR cluster as /home/hadoop/demo-0.0.1-SNAPSHOT.jar. Then get some test data from Project Gutenberg:
cd ~
wget https://www.gutenberg.org/files/1342/1342-0.txt
wget http://www.gutenberg.org/cache/epub/25525/pg25525.txt
Copy the data to HDFS, into a folder named "input":
hdfs dfs -mkdir input
hdfs dfs -put 1342-0.txt input
hdfs dfs -put pg25525.txt input
Now run the example, with folder "output" as the output destination:
hadoop jar /home/hadoop/demo-0.0.1-SNAPSHOT.jar net.qyjohn.emr.WordCount /user/hadoop/input /user/hadoop/output
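When the job finishes, the "output" folder contains one part-r-NNNNN file per reducer. With the default TextOutputFormat, each line is the key, a tab, then the value. As a small illustration, the lines can be read back into a map like this (class name and sample lines are made up for the example):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OutputParser {

    // Parse reducer output lines of the form "word<TAB>count".
    public static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            counts.put(line.substring(0, tab),
                       Integer.parseInt(line.substring(tab + 1)));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the format IntSumReducer emits.
        Map<String, Integer> counts = parse(List.of("Bennet\t323", "Darcy\t418"));
        System.out.println(counts);
    }
}
```

To inspect the real output on the cluster, you can list and cat the part files with hdfs dfs -ls output and hdfs dfs -cat output/part-r-00000.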
We can also run this against input and output on S3 (replace my-bucket-name with your own bucket; the output prefix must not already exist, or the job will fail):
aws s3 cp 1342-0.txt s3://my-bucket-name/input/1.txt
aws s3 cp pg25525.txt s3://my-bucket-name/input/2.txt
hadoop jar /home/hadoop/demo-0.0.1-SNAPSHOT.jar net.qyjohn.emr.WordCount s3://my-bucket-name/input s3://my-bucket-name/output
When the input and output are on S3, the job's stdout includes S3 counters. These counters also appear in the CloudWatch metrics for the EMR cluster.
File System Counters
FILE: Number of bytes read=120947
FILE: Number of bytes written=1097938
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=184
HDFS: Number of bytes written=0
HDFS: Number of read operations=2
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
S3: Number of bytes read=825894
S3: Number of bytes written=155214
S3: Number of read operations=0
S3: Number of large read operations=0
S3: Number of write operations=0