Home
1- How Spark Works
- Resource Allocation Across Applications
- The Spark Application
  - The DAG
  - Jobs
  - Stages
  - Tasks
2- Spark APIs
- Dataframes (TODO: take parts from DataFrames, Datasets and Spark SQL, using existing Spark-related material from other wiki pages)
- Datasets (TODO: take parts from DataFrames, Datasets and Spark SQL)
- RDDs vs Dataframes vs Datasets
3- Working with Key/Value Data (TODO: complete the pending parts and add images where required; this topic is still unfinished)
- The Goldilocks Example
- Actions on Key/Value Pairs
- What's So Dangerous About the groupByKey Function
- Choosing an Aggregation Operation
- Multiple RDD Operations
- Partitioners and Key/Value Data
- Dictionary of Ordered RDD Operations
- Secondary Sort and repartitionAndSortWithinPartitions
- Straggler Detection and Unbalanced Data
4- Effective Transformations
- Narrow Versus Wide Transformations
  - Implications for Performance
  - Implications for Fault Tolerance
  - The Special Case of coalesce
- Reusing Existing Objects
- Using Smaller Data Structures
- Iterator-to-Iterator Transformations with mapPartitions
  - What Is an Iterator-to-Iterator Transformation?
  - Space and Time Advantages
  - An Example
- Broadcast Variables
- Accumulators
- Cases for Reuse
- Deciding if Recompute Is Inexpensive Enough
- Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files
5- Joins
6- Interview Questions
- Block 1
- Block 2