
Course :: Big Data Workshop (Data pipelines, processing, and messaging)

Outline for 2 days

Day 1 :: Foundations and batch processing

  • Introduction to Big Data & Data Pipelines
    • Overview of Big Data ecosystem
    • Architecture of modern data pipelines
    • Roles of Hadoop, YARN, Spark, Flink, and Kafka
  • Hadoop and YARN Fundamentals
    • HDFS (Hadoop Distributed File System) architecture
    • MapReduce (see the word-count sketch below)
    • Working with HDFS
    • YARN (Yet Another Resource Negotiator) resource management
    • Working with YARN operations
    • Workshop
      • Cluster setup with Docker
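
To make the MapReduce topic concrete, a minimal Hadoop Streaming word count in Python is sketched below. The script name, input/output paths, and the streaming jar location are placeholders that depend on the Docker image used in the workshop.

```python
#!/usr/bin/env python3
# wordcount.py: word-count mapper and reducer for Hadoop Streaming (hypothetical file name).
# Submitted with something like (the jar path varies by distribution / Docker image):
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#     -files wordcount.py \
#     -mapper  "python3 wordcount.py map" \
#     -reducer "python3 wordcount.py reduce" \
#     -input /data/logs -output /data/wordcount
import sys


def mapper():
    # Map phase: emit one "<word>\t1" line per word; Hadoop sorts by key before the reduce phase.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer():
    # Reduce phase: input arrives grouped by key, so sum each consecutive run of the same word.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```
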
  • Apache Spark
    • Spark core concepts (RDD, DataFrame, DAG)
    • Batch processing
    • Memory management
    • Spark SQL and transformations
    • ETL (Extract, Transform, Load); see the PySpark sketch below
    • Workshop
      • Cluster setup with Docker
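
A minimal PySpark batch ETL sketch tying the Spark topics together: extract a CSV, transform it with the DataFrame API and Spark SQL, and load the result as Parquet. The file paths, column names, and master URL are placeholders, not part of the workshop material.

```python
# Batch ETL sketch with PySpark: extract, transform, load.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("batch-etl-demo")
         .master("local[*]")          # or spark://spark-master:7077 inside the Docker cluster
         .getOrCreate())

# Extract: CSV with a header row; schema inferred for brevity (define it explicitly in production).
orders = spark.read.option("header", True).option("inferSchema", True).csv("/data/orders.csv")

# Transform: DataFrame API (lazy; this only builds the DAG, nothing runs until an action).
daily = (orders
         .withColumn("order_date", F.to_date("order_ts"))
         .groupBy("order_date")
         .agg(F.sum("amount").alias("revenue")))

# The same aggregation expressed in Spark SQL via a temporary view.
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT to_date(order_ts) AS order_date, SUM(amount) AS revenue
    FROM orders GROUP BY to_date(order_ts)
""").show()

# Load: write the result as Parquet, partitioned by date.
daily.write.mode("overwrite").partitionBy("order_date").parquet("/data/daily_revenue")
spark.stop()
```
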

Day 2 :: Real-time Processing and Messaging

  • Apache Kafka for Messaging
    • Kafka architecture: Broker, Topic, Producer, Consumer
    • Kafka cluster
    • Kafka monitoring
    • Workshop with Kafka
      • Cluster setup with Docker
      • Streaming logs into Kafka topics (see the producer/consumer sketch below)
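
A small sketch of the log-streaming exercise, assuming the kafka-python client and a broker reachable at kafka:9092 from the Docker setup; the topic name and log path are placeholders, and a real collector would tail the file continuously instead of reading it once.

```python
# Publish each line of a log file to a Kafka topic, then read the records back.
import json
import time
from kafka import KafkaProducer, KafkaConsumer

BROKER, TOPIC = "kafka:9092", "app-logs"

producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Produce: one JSON record per log line.
with open("/var/log/app.log") as log:
    for line in log:
        producer.send(TOPIC, {"ts": time.time(), "line": line.rstrip()})
producer.flush()   # make sure everything is on the broker before reading back

# Consume: start from the earliest offset and stop after 5 s without new records.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(f"partition={record.partition} offset={record.offset} value={record.value}")
```
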
  • Apache Flink for Real-time Processing
    • Flink vs Spark: Key differences
    • Event-driven architecture
    • Flink architecture
    • Workshop
      • Cluster setup with Docker
      • Real-time data stream processing (see the PyFlink sketch below)
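
A minimal PyFlink DataStream sketch of stateful stream processing: a keyed running word count. It uses an in-memory collection as the source, since wiring in a Kafka source additionally needs the Flink Kafka connector jar on the classpath; the sample lines and job name are illustrative only.

```python
# Keyed streaming word count with the PyFlink DataStream API.
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)   # keep the demo output ordering readable

# Source: a few log-like lines; in the workshop this would be a Kafka topic.
lines = env.from_collection(
    ["error disk full", "info user login", "error disk full"],
    type_info=Types.STRING())


def split(line):
    # One output element per word in the input line.
    for word in line.split():
        yield word


counts = (lines
          .flat_map(split, output_type=Types.STRING())
          .map(lambda w: (w, 1), output_type=Types.TUPLE([Types.STRING(), Types.INT()]))
          .key_by(lambda pair: pair[0])                   # partition the stream by word
          .reduce(lambda a, b: (a[0], a[1] + b[1])))      # running count per key

counts.print()            # sink: print each updated (word, count) pair
env.execute("streaming-word-count")
```
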

Outline for 3 days (Days 1 and 2 as above, plus Day 3 below)

Day 3 :: Optimization, monitoring, and the end-to-end pipeline

  • Optimization & Monitoring
    • Performance tuning tips
    • Metrics, logging, and health checks
  • Building a Complete Pipeline
    • Kafka → Spark/Flink → Output (DB or file)
    • Orchestration and monitoring tips
    • Workshop
      • Create a working data pipeline (see the Structured Streaming sketch below)
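
One possible shape for the Day 3 pipeline: Kafka as the source, Spark Structured Streaming for processing, and Parquet files as the sink. It assumes the spark-sql-kafka package is on the Spark classpath; the broker address, topic, JSON schema, and paths are placeholders for the workshop cluster.

```python
# Kafka -> Spark Structured Streaming -> Parquet files.
# Run with the Kafka source package, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 pipeline.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-to-files").getOrCreate()

# Expected shape of the JSON payload on the topic (placeholder schema).
schema = StructType([
    StructField("event_ts", TimestampType()),
    StructField("user", StringType()),
    StructField("amount", DoubleType()),
])

# Source: subscribe to the Kafka topic; Kafka exposes the payload as a binary `value` column.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "events")
          .option("startingOffsets", "earliest")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Transform: revenue per user over tumbling one-minute windows, tolerating 5 minutes of lateness.
revenue = (events
           .withWatermark("event_ts", "5 minutes")
           .groupBy(F.window("event_ts", "1 minute"), "user")
           .agg(F.sum("amount").alias("revenue")))

# Sink: append finalized windows to Parquet; the checkpoint directory makes restarts safe.
query = (revenue.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "/data/pipeline/revenue")
         .option("checkpointLocation", "/data/pipeline/_checkpoints")
         .trigger(processingTime="30 seconds")
         .start())
query.awaitTermination()
```

Writing to a database instead of files would keep the same source and transform stages and replace the sink with a foreachBatch JDBC write, which is a natural extension exercise for the workshop.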