Home
1- How Spark Works
- Resource Allocation Across Applications
- The Spark Application
  - The DAG
  - Jobs
  - Stages
  - Tasks
2- Spark APIs
- Dataframes (TODO: take parts from DataFrames, Datasets and Spark SQL, using existing Spark-related material from other wiki pages)
- Datasets (TODO: take parts from DataFrames, Datasets and Spark SQL)
- RDDs vs Dataframes vs Datasets
3- Working with Key/Value Data (TODO: complete the pending parts and add images where required; this topic is still unfinished)
- The Goldilocks Example
- Actions on Key/Value Pairs
- What's So Dangerous About the groupByKey Function
- Choosing an Aggregation Operation
- Multiple RDD Operations
- Partitioners and Key/Value Data
- Dictionary of Ordered RDD Operations
- Secondary Sort and repartitionAndSortWithinPartitions
- Straggler Detection and Unbalanced Data
4- Effective Transformations
- Narrow Versus Wide Transformations
  - Implications for Performance
  - Implications for Fault Tolerance
  - The Special Case of coalesce
- Reusing Existing Objects
- Using Smaller Data Structures
- Iterator-to-Iterator Transformations with mapPartitions
  - What Is an Iterator-to-Iterator Transformation?
  - Space and Time Advantages
  - An Example
- Broadcast Variables
- Accumulators
- Cases for Reuse
- Deciding if Recompute Is Inexpensive Enough
- Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files
5- Joins
6- Interview Questions
- Block 1
- Block 2