RDD Resilient Distributed Dataset - ayushmathur94/Spark GitHub Wiki
RDD
RDD, short for Resilient Distributed Dataset, is Spark's core abstraction: an immutable (unchanging once created), distributed collection of objects. Each RDD is split into multiple partitions (smaller units), which may be computed on different nodes of the cluster.
An RDD can be created in the following distinct ways:
1.) By loading an external dataset (eg. calling the sc.textFile() method)
2.) By distributing a collection of objects (eg. a list or set) from the driver program (eg. invoking the parallelize() method)
3.) By applying transformation operations on existing RDDs.
Every Spark program that works with RDDs follows the same pattern:
• Create some input RDDs from external data.
• Transform them to define new RDDs using transformations such as filter().
• Ask Spark to persist() any intermediate RDDs that will need to be reused.
• Launch actions such as count() and first() to kick off a parallel computation, which Spark then optimizes and executes.
Creating RDDs
The simplest way to create an RDD is to take an existing collection in your program and pass it to SparkContext's parallelize() method.
parallelize() method in Java
JavaRDD<String> lines = sc.parallelize(Arrays.asList("pandas", "i like pandas"));
parallelize() method in Scala
val lines = sc.parallelize(List("pandas", "i like pandas"))
parallelize() method in Python
lines = sc.parallelize(["pandas", "i like pandas"])
RDDs support two types of operations: • Transformations • Actions