Transformations - ayushmathur94/Spark GitHub Wiki

Transformations

A transformation returns a new RDD: it applies a function to a dataset and produces a new dataset. For example, suppose we have a log file, log.txt, containing a number of messages, and we want to select only the error messages. We can use the filter() transformation.

Java Code

JavaRDD<String> inputRDD = sc.textFile("log.txt");
JavaRDD<String> errorsRDD = inputRDD.filter(
    new Function<String, Boolean>() {
        public Boolean call(String x) {
            return x.contains("error");
        }
    });

Scala Code

val inputRDD = sc.textFile("log.txt")
val errorsRDD = inputRDD.filter(line => line.contains("error"))

Python Code

inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)

The filter() operation does not mutate the existing inputRDD; instead, it returns a pointer to an entirely new RDD, leaving inputRDD available for further operations.
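To see this non-mutating behavior concretely, here is a minimal sketch in plain Python (no Spark installation assumed); filtering a list with a comprehension mirrors how filter() leaves the original RDD untouched and yields a new collection:

```python
# Plain-Python analogy for RDD.filter(): the source collection is not
# mutated; filtering produces an entirely new collection.
log_lines = [
    "error: disk full",
    "info: job started",
    "error: timeout",
]

# Analogous to errorsRDD = inputRDD.filter(lambda x: "error" in x)
error_lines = [line for line in log_lines if "error" in line]

print(error_lines)     # the new, filtered collection
print(len(log_lines))  # the original is unchanged: still 3 lines
```

In Spark, this property means you can keep deriving different RDDs (for example, a warningsRDD) from the same inputRDD without re-reading the file.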