Hadoop command 1 - vaquarkhan/Apache-Kafka-poc-and-notes GitHub Wiki

hadoop dfs

Inspect files

-ls : list all files in the given path

-cat : print the file on stdout

-tail [-f] : output the last part of the file

-du : show space utilization

Create/remove files

-mkdir : create a directory

-mv : move (rename) files

-cp : copy files

-rmr : remove files and directories recursively (newer Hadoop releases use -rm -r instead)

Copy/Put files from a remote machine into the HADOOP cluster

-copyFromLocal : copy a local file to the HDFS

-copyToLocal : copy a file on the HDFS to the local disk

HELP

-help [cmd] : show usage for cmd, or for all commands if cmd is omitted

Examples:

hadoop dfs -ls /

hadoop dfs -copyFromLocal myfile remotefile

Launching Hadoop Jobs - Command line

Copy the jar file of your job to the client machine (let's call it machine_name):

scp localJarFile studentXX@machine_name:~/

SSH to machine_name:

ssh studentXX@machine_name

Launch the job:

hadoop jar jarFile.jar ClassNameWithPackage [job args]

Note that the job fails at submission if the output directory already exists, so if it is left over from an earlier run (and you don't want it) you need to remove it first:

hadoop dfs -rmr output

Example:

hadoop jar jarFile.jar fr.eurecom.dsg.WordCount /user/hadoop/wikismall.xml output 2
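If you want to script the launch step, the command line can be assembled programmatically. The following is a minimal sketch, assuming the `hadoop jar jarFile.jar ClassNameWithPackage [job args]` pattern shown above; the class name `HadoopJarCommand`, the method `build`, and the jar name `jarFile.jar` are hypothetical placeholders, not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HadoopJarCommand {

    /**
     * Builds the argument list: hadoop jar <jarFile> <mainClass> [job args].
     * This only assembles strings; it does not contact a cluster.
     */
    public static List<String> build(String jarFile, String mainClass, String... jobArgs) {
        List<String> cmd = new ArrayList<>(Arrays.asList("hadoop", "jar", jarFile, mainClass));
        cmd.addAll(Arrays.asList(jobArgs));
        return cmd;
    }

    public static void main(String[] args) {
        // jarFile.jar is a placeholder name (assumption), as in the usage pattern above.
        List<String> cmd = build("jarFile.jar", "fr.eurecom.dsg.WordCount",
                "/user/hadoop/wikismall.xml", "output", "2");
        System.out.println(String.join(" ", cmd));
        // To actually launch it on a machine with the hadoop CLI installed, one could use:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Keeping the arguments in a list rather than one string avoids quoting problems when a path contains spaces.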

Reading (Textual) Input Data in the Mapper

This is the class you're looking for: org.apache.hadoop.mapreduce.lib.input.TextInputFormat (it takes no type parameters of its own; it fixes them to LongWritable and Text)

Precisely, this is the class hierarchy:

java.lang.Object

org.apache.hadoop.mapreduce.InputFormat<K,V>

org.apache.hadoop.mapreduce.lib.input.FileInputFormat<LongWritable,Text>

org.apache.hadoop.mapreduce.lib.input.TextInputFormat

Basically, this is an InputFormat specifically designed for plain text files. Files are broken into lines; either a linefeed or a carriage return signals the end of a line. Keys are the byte offset of each line in the file, and values are the text of the line. Your Mapper must therefore declare the following input types:

Key Type: LongWritable

Value Type: Text
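To make the (offset, line) records concrete, here is a plain-Java sketch that mimics how a text file is turned into key/value pairs. It is an illustration only and not Hadoop's implementation: the class `LineRecordSketch` and its method `records` are hypothetical names, it uses `Long` and `String` in place of LongWritable and Text, and it ignores input splits.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LineRecordSketch {

    /** Splits raw bytes into (byte offset, line text) pairs; LF, CR, or CR+LF ends a line. */
    public static List<Map.Entry<Long, String>> records(byte[] data) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        int start = 0; // byte offset where the current line begins
        int i = 0;
        while (i < data.length) {
            byte b = data[i];
            if (b == '\n' || b == '\r') {
                out.add(entry(data, start, i));
                // Treat CR immediately followed by LF as a single terminator.
                if (b == '\r' && i + 1 < data.length && data[i + 1] == '\n') {
                    i++;
                }
                i++;
                start = i;
            } else {
                i++;
            }
        }
        if (start < data.length) {
            out.add(entry(data, start, data.length)); // last line without a terminator
        }
        return out;
    }

    private static Map.Entry<Long, String> entry(byte[] data, int start, int end) {
        String line = new String(data, start, end - start, StandardCharsets.UTF_8);
        return new AbstractMap.SimpleImmutableEntry<>((long) start, line);
    }

    public static void main(String[] args) {
        byte[] data = "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8);
        for (Map.Entry<Long, String> e : records(data)) {
            System.out.println(e.getKey() + "\t" + e.getValue()); // keys here are 0 and 11
        }
    }
}
```

Note that the key is a byte offset, not a line number, so multi-byte characters shift the keys of the lines that follow them.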

Writing (Textual) Output Data in the Reducer

This is the class you're looking for: org.apache.hadoop.mapreduce.lib.output.TextOutputFormat<K,V>

Precisely, this is the class hierarchy:

java.lang.Object

org.apache.hadoop.mapreduce.OutputFormat<K,V>

org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<K,V>

org.apache.hadoop.mapreduce.lib.output.TextOutputFormat<K,V>

Essentially, this OutputFormat writes plain text files, one key/value pair per line with the key and value separated by a tab by default. TextOutputFormat calls toString() on each key and value it writes, so any (Writable) type can be used.
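The line layout can be pictured with a small plain-Java sketch. This is an approximation under stated assumptions, not Hadoop's code: `TextOutputSketch` and `formatLine` are hypothetical names, plain objects stand in for Writable types, and a Java `null` stands in for an absent key or value.

```java
public class TextOutputSketch {

    /** Default separator between key and value (a tab). */
    public static final String SEPARATOR = "\t";

    /**
     * Formats one output line from a key/value pair by calling toString() on each.
     * If one side is null, only the other side is written; if both are null, nothing is.
     */
    public static String formatLine(Object key, Object value) {
        if (key == null && value == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        if (key != null) {
            sb.append(key.toString());
        }
        if (key != null && value != null) {
            sb.append(SEPARATOR);
        }
        if (value != null) {
            sb.append(value.toString());
        }
        return sb.append('\n').toString();
    }

    public static void main(String[] args) {
        // A word-count style pair: the word as key, its count as value.
        System.out.print(formatLine("hadoop", 3));
    }
}
```

Because only toString() is used, a custom value type controls its own textual form simply by overriding toString().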
