The following experiments were conducted on Linux Mint 20 Cinnamon with 2 processors, 5.5 GB RAM, and 70 GB storage. The commands used in these experiments are agnostic of the platform on which Hadoop was set up, unless mentioned otherwise. In addition, each of the experiments shown on this page makes the following assumptions:
~/src: The entire src directory of this repository was cloned to the home directory of the Linux machine.
These values differ across various development environments. Replace these values wherever necessary.
NOTE: To run these experiments, a Hadoop Development Environment is required. This guide can help you get started if you do not have a Hadoop Development Environment.
Run each of these commands to start HDFS and YARN:

    start-dfs.sh
    start-yarn.sh

For these experiments, it is recommended to open the Terminal from the present working directory and then run the above commands.

start-dfs.sh: Starts the Hadoop Distributed File System (HDFS). This launches the NameNode, DataNode, and SecondaryNameNode daemons.

start-yarn.sh: Starts Hadoop YARN (Yet Another Resource Negotiator). YARN manages computing resources in clusters. Running this command launches the ResourceManager and NodeManager daemons.
To check the status of the Hadoop daemons, run the jps command (the Java Virtual Machine Process Status Tool). For example:

    $ jps
    2560 NodeManager
    2706 Jps
    2453 ResourceManager
    2021 DataNode
    2168 SecondaryNameNode
    1930 NameNode
Ensure that all five daemons and Jps are available. The numbers on the left are the process IDs and may differ across environments.
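As a convenience, this check can be scripted. The snippet below is a hypothetical helper (not part of the repository) that scans jps output for the five expected daemons; the sample output above is hard-coded here for illustration, and in practice you would replace it with the live output of jps:

```shell
# Hypothetical helper: verify that all five Hadoop daemons appear in jps output.
# In practice, replace the hard-coded sample with: jps_output="$(jps)"
jps_output='2560 NodeManager
2706 Jps
2453 ResourceManager
2021 DataNode
2168 SecondaryNameNode
1930 NameNode'

missing=""
for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # -w matches whole words, so "NameNode" does not match "SecondaryNameNode"
    echo "$jps_output" | grep -qw "$daemon" || missing="$missing $daemon"
done

if [ -z "$missing" ]; then
    echo "all five daemons are running"
else
    echo "missing daemons:$missing"
fi
```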
Hadoop Streaming runs executable mappers and reducers, but the scripts in this repository are not executable yet.
For each of the 4 Python files in the directory, add a shebang (interpreter directive) at the beginning and leave a blank line after it (the interpreter path differs across platforms):
    #!/usr/bin/python

    # Rest of the Python code
The first two bytes, #!, indicate that the Unix/Linux program loader should interpret the rest of the line as a command that launches the interpreter with which the program is executed. For example, #!/usr/bin/python runs the code with the python executable in /usr/bin.
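If you are curious, the two magic bytes can be inspected directly. The snippet below uses a hypothetical throwaway file in /tmp purely for illustration:

```shell
# Throwaway script for illustration (hypothetical path; any script works).
printf '#!/usr/bin/python\nprint("hi")\n' > /tmp/shebang_demo.py

# Read just the first two bytes: the interpreter directive's magic bytes.
head -c 2 /tmp/shebang_demo.py
```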
Then, for each of the files, run the following commands (this marks the files as executable):
    chmod +x mapper.py
    chmod +x reducer.py
    chmod +x nextpass.py
    chmod +x reset.py
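The shebang-plus-chmod workflow, and the way Hadoop Streaming drives executables, can be sketched end to end with a toy mapper and reducer. These are hypothetical throwaway scripts in /tmp, not the repository's Apriori code; the final line simulates the streaming pipeline locally (map, sort, reduce):

```shell
# Toy mapper: emit each comma-separated item with count 1.
# (#!/usr/bin/env python3 is a portable variant of the #!/usr/bin/python form.)
cat > /tmp/demo_mapper.py <<'EOF'
#!/usr/bin/env python3
import sys
for line in sys.stdin:
    for item in line.strip().split(','):
        print(f"{item}\t1")
EOF

# Toy reducer: sum the counts per item (streaming input arrives sorted by key).
cat > /tmp/demo_reducer.py <<'EOF'
#!/usr/bin/env python3
import sys
counts = {}
for line in sys.stdin:
    item, n = line.rsplit('\t', 1)
    counts[item] = counts.get(item, 0) + int(n)
for item in sorted(counts):
    print(f"{item}\t{counts[item]}")
EOF

# Same step as above: mark the scripts executable.
chmod +x /tmp/demo_mapper.py /tmp/demo_reducer.py

# Simulate the streaming pipeline locally: map | sort | reduce
printf 'a,b\na,c\n' | /tmp/demo_mapper.py | sort | /tmp/demo_reducer.py
```

This local `mapper | sort | reducer` pipeline is a common way to smoke-test streaming scripts before submitting them to the cluster.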
You can either use the dataset generator included here or download a dataset available online. (If you choose the latter, please abide by the licensing conditions if any).
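For a quick local test, a tiny dataset can also be written by hand. The format below (one transaction per line, items separated by commas) is assumed purely for illustration and may differ from what the included generator produces:

```shell
# Hypothetical stand-in for a generated dataset: a tiny transactions CSV.
# Format assumed for illustration only: one transaction per line, comma-separated items.
cat > /tmp/csv_dataset.csv <<'EOF'
milk,bread,eggs
milk,bread
bread,eggs
milk,eggs
EOF

wc -l < /tmp/csv_dataset.csv   # number of transactions
```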
Upload the dataset into HDFS.
For example, suppose the dataset is ~/src/csv_dataset.csv, the destination in HDFS is /dataset/ (a directory that does not yet exist), and ~/src is the present working directory where the commands are executed. Then the following commands copy the dataset into HDFS:
    hdfs dfs -mkdir /dataset
    hdfs dfs -put csv_dataset.csv /dataset/csv_dataset.csv
Note that the source and destination file names need not be the same when using hdfs dfs -put.
HDFS can be accessed using a web browser. With default settings, the NameNode web interface (typically http://localhost:9870 for Hadoop 3.x) should open HDFS. This URL may differ for different Hadoop configurations.
To browse the file system, go to Utilities, then select 'Browse the file system'.
Hadoop MapReduce jobs are generally written in Java, so a jar (Java ARchive) file is required to run them. In this case, the jobs are written in Python 3 and the jar file used is hadoop-streaming-x.y.z.jar, where x.y.z represents the Hadoop version.
The command to execute the MapReduce code is:
    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
        -libjars custom.jar \
        -file apriori_settings.json \
        -file discarded_items.txt \
        -input /dataset/csv_dataset.csv \
        -mapper mapper.py \
        -reducer reducer.py \
        -output /output1 \
        -outputformat CustomMultiOutputFormat
Replace the following values if they differ in your environment:
When the MapReduce code executes successfully, the output directory in HDFS (/output1 in this case) will contain 3 files:
To see the frequent itemsets: go to frequent and download part-00000.
To see the discarded itemsets: go to discarded and download part-00000.
To run the next pass more efficiently, it is recommended to copy the list of discarded itemsets to your present working directory. The steps are as follows:
In this case, the discarded itemsets are in the /output1/discarded/part-00000 file, which is copied to the present working directory using:
    hdfs dfs -copyToLocal /output1/discarded/part-00000
This command copies the HDFS file /output1/discarded/part-00000 to the present working directory.
Now configure the next pass (for example, using the nextpass.py script made executable earlier).
Repeat these steps of running the MapReduce code until:
These two commands stop the YARN service and HDFS, respectively:

    stop-yarn.sh
    stop-dfs.sh