RDD Resilient Distributed Dataset - ayushmathur94/Spark GitHub Wiki
RDD
RDD, short for Resilient Distributed Dataset, is Spark's core abstraction: an immutable (unchanging once created), distributed collection of objects. Each RDD is split into multiple partitions (smaller units), which may be computed on different nodes of the cluster.
An RDD can be created in the following distinct ways:
1.) By loading an external dataset (eg. calling the sc.textFile() method)
2.) By distributing a collection of objects (eg. a list or set) from the driver program (eg. invoking the parallelize() method)
3.) By applying transformation operations on existing RDDs.
Every Spark program that works with RDDs follows the same pattern:
• Create some input RDDs from external data.
• Transform them to define new RDDs using transformations such as filter().
• Ask Spark to persist() any intermediate RDDs that will need to be reused.
• Launch actions such as count() and first() to kick off a parallel computation, which Spark then optimizes and executes.
Creating RDDs
The simplest way to create an RDD is to take an existing collection in your program and pass it to SparkContext's parallelize() method.
parallelize() method in Java
JavaRDD<String> lines = sc.parallelize(Arrays.asList("pandas", "i like pandas"));
parallelize() method in Scala
val lines = sc.parallelize(List("pandas", "i like pandas"))
parallelize() method in Python
lines = sc.parallelize(["pandas", "i like pandas"])
RDDs support two types of operations: • Transformations • Actions