
Course :: Big Data Workshop (Data pipelines, processing, and messaging)

Outline for 2 days

Day 1 :: Foundations and batch processing

  • Introduction to Big Data & Data Pipelines
    • Overview of Big Data ecosystem
    • Architecture of modern data pipelines
    • Roles of Hadoop, YARN, Spark, Flink, and Kafka
  • Hadoop and YARN Fundamentals
    • HDFS (Hadoop Distributed File System) architecture
    • MapReduce (see the word-count sketch below)
    • Working with HDFS
    • YARN (Yet Another Resource Negotiator) resource management
    • Working with YARN operations
    • Workshop
      • Cluster setup with Docker
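
To make the MapReduce topic concrete, a minimal Hadoop Streaming word count in Python is sketched below. The script name, input/output paths, and the streaming jar location are placeholders that depend on the Docker image used in the workshop.

```python
#!/usr/bin/env python3
# wordcount.py: word-count mapper and reducer for Hadoop Streaming (hypothetical file name).
# Submitted with something like (the jar path varies by distribution / Docker image):
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#     -files wordcount.py \
#     -mapper  "python3 wordcount.py map" \
#     -reducer "python3 wordcount.py reduce" \
#     -input /data/logs -output /data/wordcount
import sys


def mapper():
    # Map phase: emit one "<word>\t1" line per word; Hadoop sorts by key before the reduce phase.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer():
    # Reduce phase: input arrives grouped by key, so sum each consecutive run of the same word.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```
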
  • Apache Spark
    • Spark core concepts (RDD, DataFrame, DAG)
    • Batch processing
    • Memory management
    • Spark SQL and transformations
    • ETL (Extract, Transform, Load); see the PySpark sketch below
    • Workshop
      • Cluster setup with Docker
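
A minimal PySpark batch ETL sketch tying the Spark topics together: extract a CSV, transform it with the DataFrame API and Spark SQL, and load the result as Parquet. The file paths, column names, and master URL are placeholders, not part of the workshop material.

```python
# Batch ETL sketch with PySpark: extract, transform, load.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("batch-etl-demo")
         .master("local[*]")          # or spark://spark-master:7077 inside the Docker cluster
         .getOrCreate())

# Extract: CSV with a header row; schema inferred for brevity (define it explicitly in production).
orders = spark.read.option("header", True).option("inferSchema", True).csv("/data/orders.csv")

# Transform: DataFrame API (lazy; this only builds the DAG, nothing runs until an action).
daily = (orders
         .withColumn("order_date", F.to_date("order_ts"))
         .groupBy("order_date")
         .agg(F.sum("amount").alias("revenue")))

# The same aggregation expressed in Spark SQL via a temporary view.
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT to_date(order_ts) AS order_date, SUM(amount) AS revenue
    FROM orders GROUP BY to_date(order_ts)
""").show()

# Load: write the result as Parquet, partitioned by date.
daily.write.mode("overwrite").partitionBy("order_date").parquet("/data/daily_revenue")
spark.stop()
```
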

Day 2 :: Real-time Processing and Messaging

  • Apache Kafka for Messaging
    • Kafka architecture: Broker, Topic, Producer, Consumer
    • Kafka cluster
    • Kafka monitoring
    • Workshop with Kafka
      • Cluster setup with Docker
      • Streaming logs into Kafka topics (see the producer/consumer sketch below)
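
A small sketch of the log-streaming exercise, assuming the kafka-python client and a broker reachable at kafka:9092 from the Docker setup; the topic name and log path are placeholders, and a real collector would tail the file continuously instead of reading it once.

```python
# Publish each line of a log file to a Kafka topic, then read the records back.
import json
import time
from kafka import KafkaProducer, KafkaConsumer

BROKER, TOPIC = "kafka:9092", "app-logs"

producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Produce: one JSON record per log line.
with open("/var/log/app.log") as log:
    for line in log:
        producer.send(TOPIC, {"ts": time.time(), "line": line.rstrip()})
producer.flush()   # make sure everything is on the broker before reading back

# Consume: start from the earliest offset and stop after 5 s without new records.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(f"partition={record.partition} offset={record.offset} value={record.value}")
```
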
  • Apache Flink for Real-time Processing
    • Flink vs Spark: Key differences
    • Event-driven architecture
    • Flink architecture
    • Workshop
      • Cluster setup with Docker
      • Real-time data stream processing (see the PyFlink sketch below)
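
A minimal PyFlink DataStream sketch of stateful stream processing: a keyed running word count. It uses an in-memory collection as the source, since wiring in a Kafka source additionally needs the Flink Kafka connector jar on the classpath; the sample lines and job name are illustrative only.

```python
# Keyed streaming word count with the PyFlink DataStream API.
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)   # keep the demo output ordering readable

# Source: a few log-like lines; in the workshop this would be a Kafka topic.
lines = env.from_collection(
    ["error disk full", "info user login", "error disk full"],
    type_info=Types.STRING())


def split(line):
    # One output element per word in the input line.
    for word in line.split():
        yield word


counts = (lines
          .flat_map(split, output_type=Types.STRING())
          .map(lambda w: (w, 1), output_type=Types.TUPLE([Types.STRING(), Types.INT()]))
          .key_by(lambda pair: pair[0])                   # partition the stream by word
          .reduce(lambda a, b: (a[0], a[1] + b[1])))      # running count per key

counts.print()            # sink: print each updated (word, count) pair
env.execute("streaming-word-count")
```
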

Outline for 3 days (Days 1 and 2 as above, plus Day 3 below)

Day 3 :: Optimization, monitoring, and the end-to-end pipeline

  • Optimization & Monitoring
    • Performance tuning tips
    • Metrics, logging, and health checks
  • Building a Complete Pipeline
    • Kafka → Spark/Flink → Output (DB or file)
    • Orchestration and monitoring tips
    • Workshop
      • Create a working data pipeline (see the Structured Streaming sketch below)
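
One possible shape for the Day 3 pipeline: Kafka as the source, Spark Structured Streaming for processing, and Parquet files as the sink. It assumes the spark-sql-kafka package is on the Spark classpath; the broker address, topic, JSON schema, and paths are placeholders for the workshop cluster.

```python
# Kafka -> Spark Structured Streaming -> Parquet files.
# Run with the Kafka source package, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 pipeline.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-to-files").getOrCreate()

# Expected shape of the JSON payload on the topic (placeholder schema).
schema = StructType([
    StructField("event_ts", TimestampType()),
    StructField("user", StringType()),
    StructField("amount", DoubleType()),
])

# Source: subscribe to the Kafka topic; Kafka exposes the payload as a binary `value` column.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "events")
          .option("startingOffsets", "earliest")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Transform: revenue per user over tumbling one-minute windows, tolerating 5 minutes of lateness.
revenue = (events
           .withWatermark("event_ts", "5 minutes")
           .groupBy(F.window("event_ts", "1 minute"), "user")
           .agg(F.sum("amount").alias("revenue")))

# Sink: append finalized windows to Parquet; the checkpoint directory makes restarts safe.
query = (revenue.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "/data/pipeline/revenue")
         .option("checkpointLocation", "/data/pipeline/_checkpoints")
         .trigger(processingTime="30 seconds")
         .start())
query.awaitTermination()
```

Writing to a database instead of files would keep the same source and transform stages and replace the sink with a foreachBatch JDBC write, which is a natural extension exercise for the workshop.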