Course :: Big Data Workshop - up1/training-courses GitHub Wiki
Course :: Big Data Workshop (Data pipeline, processing and messaging)
Outline for 2 days
Day 1 :: Foundations and Batch Processing
- Introduction to Big Data & Data Pipelines
- Overview of Big Data ecosystem
- Architecture of modern data pipelines
- Roles of Hadoop, YARN, Spark, Flink, Kafka
- Hadoop and YARN Fundamentals
- HDFS (Hadoop Distributed File System) architecture
- MapReduce
- Working with HDFS
- YARN (Yet Another Resource Negotiator) resource management
- Working with YARN operations
- Workshop
- Cluster setup with Docker
- Apache Spark
- Spark core concepts (RDD, DataFrame, DAG)
- Batch processing
- Memory management
- Spark SQL and transformations
- ETL (Extract, Transform, Load)
- Workshop
- Cluster setup with Docker
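The Spark topics above revolve around transformations over distributed collections. As a conceptual warm-up (not Spark's actual API), here is a plain-Python sketch of the extract → transform → load pattern that Spark's RDD and DataFrame operations parallelize across a cluster; the data and field names are made up for illustration:

```python
# Plain-Python sketch of an ETL flow; Spark would distribute these
# map/filter/aggregate steps over cluster workers. Illustrative only.
from functools import reduce

raw_lines = [                      # extract: stand-in for reading files from HDFS
    "2024-01-01,click,3",
    "2024-01-01,view,7",
    "bad record",
    "2024-01-02,click,5",
]

def parse(line):
    """Parse a CSV line into (date, event, count); drop malformed rows."""
    parts = line.split(",")
    return (parts[0], parts[1], int(parts[2])) if len(parts) == 3 else None

records = [r for r in (parse(l) for l in raw_lines) if r]  # transform: parse + drop bad rows
clicks = [r for r in records if r[1] == "click"]           # filter, like a WHERE clause
total = reduce(lambda acc, r: acc + r[2], clicks, 0)       # aggregate, like SUM(...)
print(total)  # 8
```

In Spark the same shape appears as `rdd.map(parse).filter(...).reduce(...)`, with the runtime handling partitioning and shuffles.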
Day 2 :: Real-time Processing and Messaging
- Apache Kafka for Messaging
- Kafka architecture: Broker, Topic, Producer, Consumer
- Kafka cluster
- Kafka monitoring
- Workshop with Kafka
- Cluster setup with Docker
- Streaming logs into Kafka topics
- Apache Flink for Real-time Processing
- Flink vs Spark: Key differences
- Event-driven architecture
- Flink architecture
- Workshop
- Cluster setup with Docker
- Real-time data stream processing
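The core idea behind Flink's windowed aggregations — grouping events by event time into fixed windows — can be sketched in plain Python. This is a conceptual illustration only (Flink's real API differs, and the events/timestamps here are invented):

```python
# Plain-Python sketch of a tumbling-window count over an event stream.
# Flink performs this continuously and fault-tolerantly; here we just
# assign each event to a fixed 5-second window and count per key.
from collections import defaultdict

events = [  # (event_time_seconds, key) -- made-up sample stream
    (1, "error"), (3, "info"), (7, "error"), (8, "error"), (12, "info"),
]

WINDOW = 5  # tumbling window size in seconds

counts = defaultdict(int)
for ts, key in events:
    window_start = (ts // WINDOW) * WINDOW  # event-time window assignment
    counts[(window_start, key)] += 1

print(sorted(counts.items()))
```

A real Flink job would express the same thing with a keyed stream and `TumblingEventTimeWindows`, reading from a Kafka source instead of an in-memory list.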
Outline for 3 days
Day 3 :: Optimization and End-to-End Pipelines
- Optimization & Monitoring
- Performance tuning tips
- Metrics, logging, and health checks
- Building a Complete Pipeline
- Kafka → Spark/Flink → Output (DB or file)
- Orchestration and monitoring tips
- Workshop
- Create a working data pipeline
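The Kafka → Spark/Flink → Output shape of the workshop pipeline can be sketched end to end in plain Python. Everything here is a toy stand-in: an in-memory queue plays the Kafka topic, a loop plays the processing engine, and a dict plays the output database.

```python
# Toy end-to-end pipeline: queue = "Kafka topic", loop = "Spark/Flink",
# dict = "output DB". No real Kafka/Spark/Flink involved; illustrative only.
import json
from queue import Queue

topic = Queue()                           # the "Kafka topic"
for msg in ['{"user": "a", "amount": 10}',
            '{"user": "b", "amount": 5}',
            '{"user": "a", "amount": 7}']:
    topic.put(msg)                        # producer side

sink = {}                                 # the "database"
while not topic.empty():
    record = json.loads(topic.get())      # consumer side
    user = record["user"]                 # processing: key extraction
    sink[user] = sink.get(user, 0) + record["amount"]  # running aggregate

print(sink)  # {'a': 17, 'b': 5}
```

In the workshop, each stand-in is replaced by the real component: a Dockerized Kafka broker, a Spark or Flink job as the consumer, and a database or file as the sink.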