Hadoop command 1 - vaquarkhan/Apache-Kafka-poc-and-notes GitHub Wiki
hadoop dfs
Inspect files
-ls <path> : list all files in <path>
-cat <src> : print <src> on stdout
-tail [-f] <file> : output the last part of <file>
-du <path> : show <path> space utilization
Create/remove files
-mkdir <path> : create a directory
-mv <src> <dest> : move (rename) files
-cp <src> <dest> : copy files
-rmr <path> : remove files and directories recursively
Copy/Put files from a remote machine into the Hadoop cluster
-copyFromLocal <localsrc> <dest> : copy a local file to the HDFS
-copyToLocal <src> <localdest> : copy a file on the HDFS to the local disk
HELP
-help [cmd] : show the usage of cmd (or of every command if cmd is omitted)
Examples:
hadoop dfs -ls /
hadoop dfs -copyFromLocal myfile remotefile
Launching Hadoop Jobs - Command line
Copy the jar file of your job to the client machine (let's call it machine_name): scp localJarFile studentXX@machine_name:~/
SSH to machine_name: ssh studentXX@machine_name
Launch the job: hadoop jar jarFile.jar ClassNameWithPackage [job args]
Note that if the output directory exists (and you don't want it) you need to remove it:
hadoop dfs -rmr output
Example:
hadoop jar jarFile.jar fr.eurecom.dsg.WordCount /user/hadoop/wikismall.xml output 2
Reading (Textual) Input Data in the Mapper
This is the class you're looking for: org.apache.hadoop.mapreduce.lib.input.TextInputFormat
Precisely, this is the class hierarchy:
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<LongWritable,Text>
org.apache.hadoop.mapreduce.lib.input.TextInputFormat
Basically, this is an InputFormat specifically designed for plain text files. Files are broken into lines, with either a linefeed or a carriage return signaling the end of a line. Each key is the byte offset of the line within the file, and each value is the text of that line. Your mapper's input types must therefore be:
Key Type: LongWritable
Value Type: Text
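To make the key/value convention concrete, here is a minimal plain-Java sketch that mimics the records a mapper would see. It does not use Hadoop at all: the class `TextInputSketch` and the method `readRecords` are hypothetical names invented for this illustration, a `Long` stands in for `LongWritable` and a `String` for `Text`. It also assumes that a carriage return followed by a linefeed counts as a single line terminator.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java simulation (no Hadoop dependency) of the records described above:
// key = byte offset of the line in the file, value = text of the line.
class TextInputSketch {

    // Hypothetical helper: splits raw file bytes into (offset, line) pairs.
    static List<Map.Entry<Long, String>> readRecords(byte[] data) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        int start = 0; // byte offset where the current line begins
        int i = 0;
        while (i < data.length) {
            byte b = data[i];
            if (b == '\n' || b == '\r') {
                String line = new String(data, start, i - start, StandardCharsets.UTF_8);
                records.add(new AbstractMap.SimpleImmutableEntry<>(Long.valueOf((long) start), line));
                // Assumption: "\r\n" is treated as one terminator, not two.
                if (b == '\r' && i + 1 < data.length && data[i + 1] == '\n') {
                    i++;
                }
                i++;
                start = i;
            } else {
                i++;
            }
        }
        // A last line without a trailing terminator is still a record.
        if (start < data.length) {
            String line = new String(data, start, data.length - start, StandardCharsets.UTF_8);
            records.add(new AbstractMap.SimpleImmutableEntry<>(Long.valueOf((long) start), line));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] file = "hello world\nfoo\nbar baz\n".getBytes(StandardCharsets.UTF_8);
        // Prints one "offset<TAB>line" pair per record.
        for (Map.Entry<Long, String> r : readRecords(file)) {
            System.out.println(r.getKey() + "\t" + r.getValue());
        }
    }
}
```

For the sample input the offsets are 0, 12 and 16: each key is where its line starts in bytes, not a line number. That is why a mapper that only cares about the text usually ignores the key.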
Writing (Textual) Output Data in the Reducer
This is the class you're looking for: org.apache.hadoop.mapreduce.lib.output.TextOutputFormat<K,V>
Precisely, this is the class hierarchy:
java.lang.Object
org.apache.hadoop.mapreduce.OutputFormat<K,V>
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<K,V>
org.apache.hadoop.mapreduce.lib.output.TextOutputFormat<K,V>
Essentially, this OutputFormat writes plain text files. TextOutputFormat calls toString() on each output key and value, so any (Writable) type can be used. Each record is written on its own line, with the key and the value separated by a tab by default.
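The toString() behaviour can be sketched in plain Java without any Hadoop dependency. The class `TextOutputSketch` and the method `formatRecord` below are hypothetical names made up for this illustration, and the sketch assumes a tab between key and value and one record per line; ordinary Java objects stand in for Writable types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java simulation (no Hadoop dependency) of how a text output format
// renders a record: key.toString(), a separator, value.toString(), newline.
class TextOutputSketch {

    // Hypothetical helper: formats one (key, value) pair as one output line.
    static String formatRecord(Object key, Object value, String separator) {
        return key.toString() + separator + value.toString() + "\n";
    }

    public static void main(String[] args) {
        // Stand-in for reducer output, e.g. word counts.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("hadoop", 3);
        counts.put("hdfs", 1);

        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Assumption: tab separator between key and value.
            out.append(formatRecord(e.getKey(), e.getValue(), "\t"));
        }
        System.out.print(out);
    }
}
```

Because only toString() is involved, a custom Writable used as a reducer output type should override toString() if its text output is meant to be readable.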