EMR 001 Compiling MapReduce Applications - qyjohn/AWS_Tutorials GitHub Wiki
The following tutorial works on Ubuntu 18.04.
First of all, install OpenJDK 8 and Apache Maven:
sudo apt update
sudo apt install openjdk-8-jdk-headless maven
Use the following command to create an empty Maven project in your home folder.
cd ~
mvn archetype:generate -DgroupId=net.qyjohn.emr -DartifactId=demo -DinteractiveMode=false
This will create a demo folder, with some skeleton code.
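For reference, the layout generated by the default quickstart archetype typically looks like the following (exact file names may vary slightly between archetype versions):

```
demo/
├── pom.xml
└── src
    ├── main/java/net/qyjohn/emr/App.java
    └── test/java/net/qyjohn/emr/AppTest.java
```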
cd demo
Replace pom.xml with the following content:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>net.qyjohn.emr</groupId>
<artifactId>demo</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>firstMapReduce</name>
<properties>
<maven-compiler-plugin.version>3.1</maven-compiler-plugin.version>
<java.version>1.8</java.version>
<hadoop.version>2.8.3</hadoop.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-common</artifactId>
<version>${hadoop.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>${maven-compiler-plugin.version}</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
Create file src/main/java/net/qyjohn/emr/WordCount.java, with the following content:
package net.qyjohn.emr;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
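To see what the two classes above actually compute, the same word-count logic can be sketched as plain Java without any Hadoop dependencies. This is only an illustration of the algorithm (class name and sample text are made up, not part of the project):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class WordCountSketch {

    public static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // What TokenizerMapper does: split each line on whitespace
            // and emit a (word, 1) pair for every token.
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                // What IntSumReducer does: sum up the 1s emitted per word.
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {"to be or not to be", "to be is to do"};
        System.out.println(count(lines));
    }
}
```

On the cluster, the mapper's (word, 1) pairs are shuffled by key before the reducer runs; the combiner (here the same class as the reducer) pre-sums counts on the map side to cut shuffle traffic.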
Use the following command to build the demo project:
mvn package
After this you will get a jar file target/demo-0.0.1-SNAPSHOT.jar. This is your application, ready to be used on your EMR cluster.
If you instead install Maven directly on your EMR cluster and compile the project there, you are likely to get the following error:
[INFO] Changes detected - recompiling the module!
[WARNING] File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!
[INFO] Compiling 2 source files to /home/hadoop/demo/target/classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:46 min
[INFO] Finished at: 2020-04-23T05:45:07Z
[INFO] Final Memory: 14M/223M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project demo: Fatal error compiling: invalid target release: 1.8 -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
The reason is that "sudo yum install apache-maven" also downgrades the default JDK (and JRE) to version 1.7. Beyond breaking the build, this downgrade breaks many daemons running on the EMR cluster, because most of them use JARs compiled with Java 1.8.
Copy target/demo-0.0.1-SNAPSHOT.jar to your EMR cluster as /home/hadoop/demo-0.0.1-SNAPSHOT.jar. Then get some test data from Project Gutenberg:
cd ~
wget https://www.gutenberg.org/files/1342/1342-0.txt
wget http://www.gutenberg.org/cache/epub/25525/pg25525.txt
Copy the data to HDFS, into a folder named "input":
hdfs dfs -mkdir input
hdfs dfs -put 1342-0.txt input
hdfs dfs -put pg25525.txt input
Now run the example, with folder "output" as the output destination:
hadoop jar /home/hadoop/demo-0.0.1-SNAPSHOT.jar net.qyjohn.emr.WordCount /user/hadoop/input /user/hadoop/output
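When the job finishes, the "output" folder contains one part-r-NNNNN file per reducer. With the default TextOutputFormat, each line is the key, a tab, then the value. As a small illustration, the lines can be read back into a map like this (class name and sample lines are made up for the example):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OutputParser {

    // Parse reducer output lines of the form "word<TAB>count".
    public static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            counts.put(line.substring(0, tab),
                       Integer.parseInt(line.substring(tab + 1)));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the format IntSumReducer emits.
        Map<String, Integer> counts = parse(List.of("Bennet\t323", "Darcy\t418"));
        System.out.println(counts);
    }
}
```

To inspect the real output on the cluster, you can list and cat the part files with hdfs dfs -ls output and hdfs dfs -cat output/part-r-00000.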
We can also run this against input and output on S3 (replace my-bucket-name with your own bucket; the output prefix must not already exist, or the job will fail):
aws s3 cp 1342-0.txt s3://my-bucket-name/input/1.txt
aws s3 cp pg25525.txt s3://my-bucket-name/input/2.txt
hadoop jar /home/hadoop/demo-0.0.1-SNAPSHOT.jar net.qyjohn.emr.WordCount s3://my-bucket-name/input s3://my-bucket-name/output
When the input and output are on S3, the job's stdout includes S3 counters. These counters also appear in the CloudWatch metrics for the EMR cluster.
File System Counters
FILE: Number of bytes read=120947
FILE: Number of bytes written=1097938
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=184
HDFS: Number of bytes written=0
HDFS: Number of read operations=2
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
S3: Number of bytes read=825894
S3: Number of bytes written=155214
S3: Number of read operations=0
S3: Number of large read operations=0
S3: Number of write operations=0