Resilient Distributed Datasets (RDDs) - awantik/spark GitHub Wiki

  • Spark revolves around the concept of a resilient distributed dataset (RDD).
  • RDD is a fault-tolerant collection of elements that can be operated on in parallel.
  • There are two ways of creating an RDD:
    • parallelizing an existing collection in your driver program
    • referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, etc.

### Parallelized Collections

  • Created by calling SparkContext’s parallelize method on an existing iterable or collection in your driver program.

  • The collection is copied to form a distributed dataset that can be operated on in parallel.

    data = ["Hello", "World", "Here", "I", "come"]
    distData = sc.parallelize(data)

  • We can set the number of partitions manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).

### External Datasets

  • PySpark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

    distFile = sc.textFile("data.txt")

  • textFile() takes a URI as its argument: either a local path on the machine, or a URI such as hdfs:// or s3a://.

  • Important points for reading files:

    • If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
    • All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
    • The textFile method also takes an optional second argument for controlling the number of partitions of the file.
    • SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs.

### RDD Operations

  • RDDs support two types of operations:

    • transformations, which create a new dataset from an existing one, and
    • actions, which return a value to the driver program after running a computation on the dataset.
  • All transformations in Spark are lazy, in that they do not compute their results right away.

    • They just remember the transformations applied to some base dataset (e.g. a file).
    • The transformations are only computed when an action requires a result to be returned to the driver program.
    • This design enables Spark to run more efficiently.
  • By default, each transformed RDD may be recomputed each time you run an action on it.

    • We can persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
    • There is also support for persisting RDDs on disk, or replicated across multiple nodes.