Project 3 – Distributed Processing with MapReduce (Hadoop)
Course: ST0263 – Special Topics in Telematics, 2025-1
University: EAFIT University
Due Date: June 2, 2025
General Objective
To implement a complete distributed data processing pipeline using HDFS and the MapReduce model. The goal is to understand how batch architectures and distributed storage systems work from the ground up.
Project Overview
This project simulates a real-world batch processing flow using Hadoop’s MapReduce framework. It allows students to gain hands-on experience with the key stages of distributed data processing:
- Data Acquisition: Manual download of open data in CSV, JSON, or plain text format.
- HDFS Upload: Data is uploaded to the Hadoop Distributed File System (HDFS), either manually or via a reproducible script.
- MapReduce Processing: Data is analyzed using one or more MapReduce programs implemented in Java or Python (using MRJob). At least one job should produce meaningful results (aggregation, filtering, counting, statistics, etc.).
- Result Delivery: Results are saved back to HDFS and exported to CSV. A lightweight API (Flask or FastAPI) is used to serve the results.
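The MapReduce Processing stage can be illustrated with a minimal local sketch. It is plain Python, not MRJob or Hadoop, and only imitates the map, shuffle, and reduce phases in a single process. The two-column `station,temperature` input and the function names are hypothetical, chosen purely for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, float]]:
    """Map phase: emit (station, temperature) for one CSV line."""
    parts = line.strip().split(",")
    if len(parts) != 2:
        return  # skip malformed rows
    station, value = parts
    try:
        yield station, float(value)
    except ValueError:
        return  # skip the header row or non-numeric values


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Shuffle phase: group all emitted values by key."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key: str, values: List[float]) -> Tuple[str, float]:
    """Reduce phase: average the values collected for one key."""
    return key, sum(values) / len(values)


def run_job(lines: Iterable[str]) -> Dict[str, float]:
    """Run map, shuffle, and reduce over an in-memory list of lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the framework performs the shuffle and distributes the mapper and reducer across nodes; only the per-record and per-key logic carries over from this sketch.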
Deliverables
The GitHub repository must include:
- MapReduce code (.java or .py)
- HDFS upload scripts (if used)
- Sample input and output files
- Code for the API to visualize the results
- A detailed README.md with setup and execution instructions
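A reproducible HDFS upload script could look roughly like the sketch below. It assumes the `hdfs` command-line client is available on the machine running the script, and it uses the standard `hdfs dfs -mkdir -p` and `hdfs dfs -put -f` shell commands. The function names and any paths passed in are hypothetical.

```python
import subprocess
from typing import List


def build_upload_commands(local_path: str, hdfs_dir: str) -> List[List[str]]:
    """Return the HDFS shell commands needed to upload one local file."""
    return [
        # Create the target directory (and parents) if it does not exist.
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        # Copy the file into HDFS, overwriting any previous copy.
        ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir],
    ]


def upload(local_path: str, hdfs_dir: str, dry_run: bool = False) -> List[List[str]]:
    """Run the upload commands, or only print them when dry_run is True."""
    commands = build_upload_commands(local_path, hdfs_dir)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if the command fails.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `upload("data/input.csv", "/user/hadoop/input", dry_run=True)` prints the two commands without touching the cluster, which is a convenient way to check the script before running it for real.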
Video Presentation
Maximum duration: 10 minutes
Must include:
- Description of the dataset and the motivation behind its selection
- Explanation of the data upload process
- Detailed walkthrough of the MapReduce logic
- Presentation and interpretation of the results
Scope
- Run and test MapReduce programs on a Hadoop cluster (e.g., Amazon EMR)
- Work with structured or semi-structured real-world datasets
- Use HDFS as the main data storage layer
- Demonstrate a working end-to-end pipeline: ingestion → processing → output
- Provide a minimal API for data visualization
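The core of a minimal API can be sketched without committing to a web framework: read the exported CSV and turn it into a JSON payload. The sketch below uses only the standard library. In Flask or FastAPI, a route handler would return the same data. The function names and the payload shape (`count`, `results`) are assumptions for illustration, not a required format.

```python
import csv
import io
import json
from typing import Dict, List


def load_results(csv_text: str) -> List[Dict[str, str]]:
    """Parse exported CSV text (first row is the header) into a list of records."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def results_payload(csv_text: str) -> str:
    """Build the JSON body an API endpoint could return for the results."""
    rows = load_results(csv_text)
    return json.dumps({"count": len(rows), "results": rows})
```

Values stay as strings because `csv.DictReader` does not infer types; converting numeric columns is left to the specific dataset.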
Suggested Data Sources
You may choose data from any of the following open and free APIs or repositories:
- Weather Data
- Mobility and Transport
- Financial Data
- Public Health
- E-commerce
Feel free to clone this page and adjust it according to your specific dataset, tools, or project focus.