Experiments: Running the MapReduce code (Linux) - BurraAbhishek/Python_Hadoop_MapReduce_MarketBasketAnalysis Wiki

Original URL: https://github.com/BurraAbhishek/Python_Hadoop_MapReduce_MarketBasketAnalysis/wiki/Experiments:-Running-the-MapReduce-code-(Linux)

The following experiments were conducted on Linux Mint 20 Cinnamon with 2 processors, 5.5 GB RAM, and 70 GB storage. For each experiment, the commands are agnostic of the platform on which Hadoop was set up, unless mentioned otherwise. In addition, each of the experiments shown on this page makes the following assumptions:

These values differ across various development environments. Replace these values wherever necessary.

NOTE: To run these experiments, a Hadoop Development Environment is required. This guide can help you get started if you do not have a Hadoop Development Environment.

Starting Hadoop Distributed File System

Run each of these commands to start HDFS:

start-dfs.sh
start-yarn.sh

For these experiments, it is recommended to open the Terminal in the present working directory before running the above commands.

start-dfs.sh: Starts the Distributed File System. This starts the following daemons:

- NameNode
- DataNode
- SecondaryNameNode

start-yarn.sh: Starts Hadoop YARN (Yet Another Resource Negotiator). YARN manages computing resources in clusters. Running this command starts the following daemons:

- ResourceManager
- NodeManager

To check the status of the Hadoop daemons, type the command jps. jps is the Java Virtual Machine Process Status Tool. For example:

$ jps
2560 NodeManager
2706 Jps
2453 ResourceManager
2021 DataNode
2168 SecondaryNameNode
1930 NameNode

Ensure that all five daemons and Jps are available. The numbers on the left are the process IDs and may differ across environments.
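If desired, this check can be scripted. The snippet below is a minimal sketch (not part of the repository) that parses jps-style output; in practice the output would be captured with `subprocess.run(["jps"], capture_output=True, text=True).stdout` instead of the hard-coded sample:

```python
# Sketch: verify that all expected Hadoop daemons appear in `jps` output.
# The sample output below is illustrative; capture real output with
# subprocess.run(["jps"], capture_output=True, text=True).stdout.

EXPECTED = {"NameNode", "DataNode", "SecondaryNameNode",
            "ResourceManager", "NodeManager"}

def missing_daemons(jps_output: str) -> set:
    """Return the set of expected daemons absent from jps output."""
    running = {line.split()[1] for line in jps_output.splitlines() if line.split()}
    return EXPECTED - running

sample = """2560 NodeManager
2706 Jps
2453 ResourceManager
2021 DataNode
2168 SecondaryNameNode
1930 NameNode"""

print(missing_daemons(sample))  # set() means all daemons are up
```

An empty result means HDFS and YARN started correctly; any daemon named in the result failed to start.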

Preparing the code

Hadoop Streaming runs executable mappers and reducers, but the Python files in this repository are not executables by default.

For each of the 4 Python files in the directory, add a shebang line specifying the interpreter at the beginning, followed by a blank line (the interpreter path differs across platforms).

For example,

#!/usr/bin/python

# Rest of the Python code

The first two bytes #! indicate that the Unix/Linux program loader should interpret the rest of the line as the command to launch the interpreter that executes the program. For example, #!/usr/bin/python runs the code with the python executable in /usr/bin.

Then, for each of the files, run the following commands (these mark the files as executable):

chmod +x mapper.py
chmod +x reducer.py
chmod +x nextpass.py
chmod +x reset.py
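To illustrate the shape of a Streaming executable, here is a toy mapper (a sketch only; the repository's mapper.py implements the actual market-basket logic). Streaming mappers read raw lines on stdin and emit tab-separated key/value pairs on stdout:

```python
#!/usr/bin/python
# Toy Hadoop Streaming mapper (illustrative only; the repository's mapper.py
# implements the actual market-basket logic). A Streaming mapper reads raw
# input lines on stdin and prints tab-separated key/value pairs to stdout.
import sys

def map_line(line):
    """Emit (item, 1) for every item in a comma-separated transaction."""
    return [(item.strip(), 1) for item in line.strip().split(",") if item.strip()]

if __name__ == "__main__":
    for line in sys.stdin:
        for key, value in map_line(line):
            print(f"{key}\t{value}")
```

The shebang plus the chmod +x step above are what allow Hadoop Streaming to launch the file directly as a process.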

Preparing a dataset

You can either use the dataset generator included here or download a dataset available online. (If you choose the latter, please abide by the licensing conditions if any).

Upload the dataset into HDFS.

For example, if the dataset is ~/src/csv_dataset.csv and the destination in HDFS is /dataset/ (a directory that does not yet exist), where ~/src is the present working directory where the commands are executed, then the following commands copy the dataset into HDFS:

hdfs dfs -mkdir /dataset
hdfs dfs -put csv_dataset.csv /dataset/csv_dataset.csv

Note that the file names need not be the same when using hdfs dfs -put; the last argument specifies the destination name in HDFS.

HDFS using GUI

HDFS can be accessed using a web browser. If default settings are used, then the URL

localhost:9870

should open the HDFS web UI.

This URL may differ for different Hadoop configurations.

To browse the file system, go to Utilities, then select 'Browse the file system'.

Running the MapReduce code

Hadoop MapReduce jobs are generally written in Java. Therefore, a jar (Java ARchive) file is required to run these jobs. In this case, the jobs are written in Python3 and the jar file used is hadoop-streaming-x.y.z.jar, where x.y.z represents the Hadoop version.

The command to execute the MapReduce code is:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
-libjars custom.jar \
-file apriori_settings.json \
-file discarded_items.txt \
-input /dataset/csv_dataset.csv \
-mapper mapper.py \
-reducer reducer.py \
-output /output1 \
-outputformat CustomMultiOutputFormat

Replace:

Replace if different:
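Before submitting the job to the cluster, the mapper and reducer can be smoke-tested locally by simulating Hadoop Streaming's sort-and-shuffle step (the shell equivalent is `cat data.csv | ./mapper.py | sort | ./reducer.py`). The sketch below uses toy stand-ins for the repository's mapper.py and reducer.py; the point is the tab-separated, sorted-by-key contract between the two phases:

```python
# Simulate the Hadoop Streaming pipeline (mapper -> sort -> reducer) locally.
# The toy mapper/reducer below stand in for the repository's mapper.py and
# reducer.py; Streaming sorts the mapper output by key before the reducer
# sees it, which is what sorted() emulates here.

def mapper(lines):
    for line in lines:
        for item in line.strip().split(","):
            if item.strip():
                yield f"{item.strip()}\t1"

def reducer(sorted_pairs):
    # Because input is sorted, all values for one key arrive consecutively.
    current, count = None, 0
    for pair in sorted_pairs:
        key, value = pair.split("\t")
        if key == current:
            count += int(value)
        else:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = key, int(value)
    if current is not None:
        yield f"{current}\t{count}"

transactions = ["milk,bread", "bread,eggs", "milk,bread"]
shuffled = sorted(mapper(transactions))   # Hadoop sorts by key between phases
print(list(reducer(shuffled)))            # ['bread\t3', 'eggs\t1', 'milk\t2']
```

If the local pipeline produces sensible output, most formatting bugs (missing tabs, unsorted assumptions) are caught before a cluster run.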

When the MapReduce code executes successfully, the directory /output1 in HDFS (in this case, output1) will contain the following:

- _SUCCESS (an empty marker file indicating the job completed)
- frequent/part-00000 (the frequent itemsets)
- discarded/part-00000 (the discarded itemsets)

To see the frequent itemsets: go to frequent and download part-00000.

To see the discarded itemsets: go to discarded and download part-00000.
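A downloaded part-00000 file can be inspected with a few lines of Python. The sketch below assumes each line has the usual Hadoop Streaming output shape `<itemset>\t<count>` (an assumption; adjust the parsing if the repository emits a different format):

```python
# Sketch: parse a downloaded part-00000 file into a dict.
# Assumes each line is "<itemset>\t<count>", the usual Hadoop Streaming
# output shape; adjust if the job emits a different format.

def parse_part_file(text):
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition("\t")
        counts[key] = int(value)
    return counts

sample = "milk\t42\nbread\t37\nmilk,bread\t19"
print(parse_part_file(sample))  # {'milk': 42, 'bread': 37, 'milk,bread': 19}
```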

To run the next pass more efficiently, it is recommended to copy the list of discarded itemsets to your present working directory. The steps are as follows:

  1. Either move or delete the existing part-00000 file. In this case, the file is deleted using:
rm part-00000
  2. Copy the file from HDFS using:
hdfs dfs -copyToLocal /output1/discarded/part-00000

This command copies the file /output1/discarded/part-00000 in HDFS to the present working directory.

Now configure the next pass by running

./nextpass.py

Repeat these steps of running the MapReduce code until no new frequent itemsets are generated (the Apriori algorithm terminates when a pass produces no frequent itemsets).
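For intuition, the loop that the repeated MapReduce runs implement is the Apriori pass loop, condensed here into plain Python (illustrative only; in the real setup each MapReduce job performs one pass over the full dataset, with nextpass.py preparing the next one):

```python
# Condensed sketch of the Apriori pass loop that the repeated MapReduce
# runs implement. Each while-iteration corresponds to one MapReduce job.
def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    frequent = []
    k = 1
    current = [frozenset([i]) for i in sorted(items)]  # 1-itemset candidates
    while current:
        # One "pass": count support for this pass's candidate itemsets.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        passed = [c for c, n in counts.items() if n >= min_support]
        frequent.extend(passed)
        # Build (k+1)-candidates from this pass's frequent itemsets.
        k += 1
        current = [frozenset(c) for c in
                   sorted({tuple(sorted(a | b)) for a in passed for b in passed
                           if len(a | b) == k})]
        # The loop (and the sequence of MapReduce jobs) ends when no
        # candidates survive.
    return frequent

txns = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread", "eggs"}]
print([sorted(s) for s in apriori(txns, 2)])
```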

Shutting down HDFS

These two commands stop the YARN service and HDFS respectively:

stop-yarn.sh
stop-dfs.sh