# Resilient Distributed Datasets (RDD) - Introduction
RDD (Resilient Distributed Dataset) is the fundamental data structure in Apache Spark for distributed data processing. An RDD is an immutable, distributed collection of objects and serves as Spark's primary data abstraction. RDDs are partitioned across the nodes of a cluster so they can be processed in parallel.
RDDs are fault tolerant: they can recover from node failures and keep the data available for processing. This resilience comes from lineage information, which records the sequence of transformations used to build an RDD so that lost partitions can be recomputed.
In this documentation, we will explore the basics of RDDs, their creation, operations, and transformations using Java.
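As a quick illustration of lineage, every RDD can print the chain of transformations it was built from with `toDebugString()`; this is the information Spark replays to recompute lost partitions. The snippet below is a minimal sketch that assumes a `JavaSparkContext` named `sc` has already been created, as in the full examples that follow:

```java
// Minimal sketch: assumes an existing JavaSparkContext named `sc`
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
JavaRDD<Integer> evens = numbers.filter(x -> x % 2 == 0);
JavaRDD<Integer> doubled = evens.map(x -> x * 2);

// toDebugString() prints the recorded lineage (parallelize -> filter -> map),
// which Spark uses to recompute lost partitions after a node failure.
System.out.println(doubled.toDebugString());
```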
You can create an RDD from an existing collection in your program. Here's an example using Java:
```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDCreationExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RDDCreationExample").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Create an RDD from a Java collection (List)
        JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Perform operations on the RDD (e.g., map, filter, reduce)
        JavaRDD<Integer> squaredRDD = rdd.map(x -> x * x);

        // Collect and print the results
        List<Integer> squaredList = squaredRDD.collect();
        for (Integer num : squaredList) {
            System.out.println(num);
        }

        // Stop the SparkContext
        sc.stop();
    }
}
```
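If you need control over the parallelism, `parallelize` also accepts an explicit number of partitions as a second argument. A brief sketch (the partition count here is only illustrative):

```java
// Distribute the same data across 3 partitions (the count is illustrative)
JavaRDD<Integer> partitionedRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5), 3);
System.out.println("Partitions: " + partitionedRDD.getNumPartitions());
```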
You can create RDDs by loading data from external sources, such as files. Here's an example of loading data from a text file:
```java
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ExternalDataRDDExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ExternalDataRDDExample").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load data from an external text file
        JavaRDD<String> textFileRDD = sc.textFile("data/input.txt");

        // Perform operations on the RDD (e.g., filter, map)
        JavaRDD<String> filteredRDD = textFileRDD.filter(line -> line.contains("Spark"));

        // Collect and print the results
        List<String> filteredList = filteredRDD.collect();
        for (String line : filteredList) {
            System.out.println(line);
        }

        // Stop the SparkContext
        sc.stop();
    }
}
```
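`textFile` also accepts a minimum number of partitions as a second argument, which helps spread large files across the cluster. A brief sketch (the path and partition count are placeholders):

```java
// Read the file into at least 4 partitions (path and count are placeholders)
JavaRDD<String> partitionedText = sc.textFile("data/input.txt", 4);
System.out.println("Partitions: " + partitionedText.getNumPartitions());
```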
RDDs support two types of operations: transformations and actions.
Transformations are operations that create a new RDD from an existing one. They are lazy, meaning they are not executed immediately but are remembered and only run when an action is performed. Examples of transformations include `map`, `filter`, and `groupBy`.
Here's an example of a transformation:
```java
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
JavaRDD<Integer> squaredRDD = rdd.map(x -> x * x);
```
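For completeness, here is a brief sketch of the other transformations mentioned above, `filter` and `groupBy`, applied to the same `rdd` (the grouping function used here is just an example):

```java
// Keep only the even numbers; still a transformation, so nothing executes yet
JavaRDD<Integer> evenRDD = rdd.filter(x -> x % 2 == 0);

// Group the numbers by parity (requires org.apache.spark.api.java.JavaPairRDD)
JavaPairRDD<Boolean, Iterable<Integer>> groupedRDD = rdd.groupBy(x -> x % 2 == 0);
```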
Actions are operations that trigger the execution of transformations and return a result to the driver program or write data to an external storage system. Examples of actions include `collect`, `count`, and `saveAsTextFile`.
Here's an example of an action:
```java
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
long count = rdd.count();
```
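`reduce` and `saveAsTextFile` are also actions; a short sketch using the same `rdd` (the output path is a placeholder):

```java
// reduce aggregates the elements across the cluster and returns the result to the driver
Integer sum = rdd.reduce((a, b) -> a + b);

// saveAsTextFile writes the RDD to external storage; the path is a placeholder
rdd.saveAsTextFile("output/numbers");
```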
RDDs can be persisted in memory for faster access if they are going to be used multiple times. You can use the `persist` method to specify the storage level (e.g., `MEMORY_ONLY`, `MEMORY_ONLY_SER`, `DISK_ONLY`) and decide whether to store the data serialized or not.
```java
// Note: StorageLevel lives in org.apache.spark.storage.StorageLevel
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
rdd.persist(StorageLevel.MEMORY_ONLY_SER()); // Persist the RDD in memory in serialized form
```
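A persisted RDD is computed the first time an action uses it; later actions then read the cached data instead of recomputing it. Call `unpersist` when the data is no longer needed. A minimal sketch continuing the snippet above:

```java
// The first action computes and caches the RDD; the second reuses the cached data
long count = rdd.count();
Integer sum = rdd.reduce((a, b) -> a + b);

// Release the cached data when it is no longer needed
rdd.unpersist();
```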
RDDs are a powerful abstraction for distributed data processing in Apache Spark. They provide resilience, parallelism, and the ability to perform complex operations on distributed data. In this documentation, we've covered the basics of RDD creation, transformations, actions, and persistence. Spark's RDDs serve as the foundation for building scalable and fault-tolerant data processing applications.