# Resilient Distributed Datasets (RDD) - Introduction
RDD (Resilient Distributed Dataset) is the fundamental data structure in Apache Spark for distributed data processing. An RDD is an immutable, distributed collection of objects and serves as Spark's primary data abstraction. RDDs are partitioned across the nodes of a cluster so they can be processed in parallel.
RDDs are fault tolerant: they can recover from node failures and keep the data available for processing. This resilience comes from lineage information, which records the sequence of transformations used to build an RDD so that lost partitions can be recomputed.
In this documentation, we will explore the basics of RDDs, their creation, operations, and transformations using Java.
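As a quick illustration of lineage, every RDD can print the chain of transformations it was built from with `toDebugString()`; this is the information Spark replays to recompute lost partitions. The snippet below is a minimal sketch that assumes a `JavaSparkContext` named `sc` has already been created, as in the full examples that follow:

```java
// Minimal sketch: assumes an existing JavaSparkContext named `sc`
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
JavaRDD<Integer> evens = numbers.filter(x -> x % 2 == 0);
JavaRDD<Integer> doubled = evens.map(x -> x * 2);

// toDebugString() prints the recorded lineage (parallelize -> filter -> map),
// which Spark uses to recompute lost partitions after a node failure.
System.out.println(doubled.toDebugString());
```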
You can create an RDD from an existing collection in your program. Here's an example using Java:
```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDCreationExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RDDCreationExample").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Create an RDD from a Java collection (List)
        JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Perform operations on the RDD (e.g., map, filter, reduce)
        JavaRDD<Integer> squaredRDD = rdd.map(x -> x * x);

        // Collect and print the results
        List<Integer> squaredList = squaredRDD.collect();
        for (Integer num : squaredList) {
            System.out.println(num);
        }

        // Stop the SparkContext
        sc.stop();
    }
}
```
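If you need control over the parallelism, `parallelize` also accepts an explicit number of partitions as a second argument. A brief sketch (the partition count here is only illustrative):

```java
// Distribute the same data across 3 partitions (the count is illustrative)
JavaRDD<Integer> partitionedRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5), 3);
System.out.println("Partitions: " + partitionedRDD.getNumPartitions());
```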
You can create RDDs by loading data from external sources, such as files. Here's an example of loading data from a text file:
```java
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ExternalDataRDDExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ExternalDataRDDExample").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load data from an external text file
        JavaRDD<String> textFileRDD = sc.textFile("data/input.txt");

        // Perform operations on the RDD (e.g., filter, map)
        JavaRDD<String> filteredRDD = textFileRDD.filter(line -> line.contains("Spark"));

        // Collect and print the results
        List<String> filteredList = filteredRDD.collect();
        for (String line : filteredList) {
            System.out.println(line);
        }

        // Stop the SparkContext
        sc.stop();
    }
}
```
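`textFile` also accepts a minimum number of partitions as a second argument, which helps spread large files across the cluster. A brief sketch (the path and partition count are placeholders):

```java
// Read the file into at least 4 partitions (path and count are placeholders)
JavaRDD<String> partitionedText = sc.textFile("data/input.txt", 4);
System.out.println("Partitions: " + partitionedText.getNumPartitions());
```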
RDDs support two types of operations: transformations and actions.
Transformations are operations that create a new RDD from an existing one. They are lazy, meaning they are not executed immediately but are remembered and only run when an action is performed. Examples of transformations include `map`, `filter`, and `groupBy`.
Here's an example of a transformation:
```java
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
JavaRDD<Integer> squaredRDD = rdd.map(x -> x * x);
```
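For completeness, here is a brief sketch of the other transformations mentioned above, `filter` and `groupBy`, applied to the same `rdd` (the grouping function used here is just an example):

```java
// Keep only the even numbers; still a transformation, so nothing executes yet
JavaRDD<Integer> evenRDD = rdd.filter(x -> x % 2 == 0);

// Group the numbers by parity (requires org.apache.spark.api.java.JavaPairRDD)
JavaPairRDD<Boolean, Iterable<Integer>> groupedRDD = rdd.groupBy(x -> x % 2 == 0);
```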
Actions are operations that trigger the execution of transformations and return a result to the driver program or write data to an external storage system. Examples of actions include `collect`, `count`, and `saveAsTextFile`.
Here's an example of an action:
```java
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
long count = rdd.count();
```
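`reduce` and `saveAsTextFile` are also actions; a short sketch using the same `rdd` (the output path is a placeholder):

```java
// reduce aggregates the elements across the cluster and returns the result to the driver
Integer sum = rdd.reduce((a, b) -> a + b);

// saveAsTextFile writes the RDD to external storage; the path is a placeholder
rdd.saveAsTextFile("output/numbers");
```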
RDDs can be persisted in memory for faster access if they are going to be used multiple times. You can use the `persist` method to specify the storage level (e.g., `MEMORY_ONLY`, `MEMORY_ONLY_SER`, `DISK_ONLY`) and decide whether to store the data serialized or not.
```java
// Note: StorageLevel lives in org.apache.spark.storage.StorageLevel
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
rdd.persist(StorageLevel.MEMORY_ONLY_SER()); // Persist the RDD in memory in serialized form
```
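A persisted RDD is computed the first time an action uses it; later actions then read the cached data instead of recomputing it. Call `unpersist` when the data is no longer needed. A minimal sketch continuing the snippet above:

```java
// The first action computes and caches the RDD; the second reuses the cached data
long count = rdd.count();
Integer sum = rdd.reduce((a, b) -> a + b);

// Release the cached data when it is no longer needed
rdd.unpersist();
```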
RDDs are a powerful abstraction for distributed data processing in Apache Spark. They provide resilience, parallelism, and the ability to perform complex operations on distributed data. In this documentation, we've covered the basics of RDD creation, transformations, actions, and persistence. Spark's RDDs serve as the foundation for building scalable and fault-tolerant data processing applications.