Transformations - ayushmathur94/Spark GitHub Wiki
A transformation returns a new RDD. That is, it applies a function to a dataset and returns a new dataset. For example, suppose we have a log file, log.txt, containing a number of messages, and we want to select only the error messages. We can use the filter() transformation.
Java Code
JavaRDD<String> inputRDD = sc.textFile("log.txt");
JavaRDD<String> errorsRDD = inputRDD.filter(
    new Function<String, Boolean>() {
        public Boolean call(String x) {
            return x.contains("error");
        }
    });
Scala Code
val inputRDD = sc.textFile("log.txt")
val errorsRDD = inputRDD.filter(line => line.contains("error"))
Python Code
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
The filter() operation does not mutate the existing inputRDD. Instead, it returns a pointer to an entirely new RDD; inputRDD can still be reused later in the program.
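The same idea can be sketched in plain Python, without Spark: filtering produces a new collection and leaves the input untouched, just as filter() returns a new RDD. The sample log lines below are invented for illustration.

```python
# Made-up log lines standing in for the contents of log.txt.
log_lines = [
    "INFO: job started",
    "error: stage 1 failed",
    "INFO: retrying",
    "error: executor lost",
]

# Like RDD.filter(), this builds a new collection of matching lines.
error_lines = [line for line in log_lines if "error" in line]

print(error_lines)               # only the lines containing "error"
print(len(log_lines))            # 4 -- the original list is unchanged
print(error_lines is log_lines)  # False -- a new object, not a mutation
```

In Spark the analogy goes one step further: errorsRDD is only a recipe for the filtered data, and nothing is computed until an action such as count() or collect() is called on it.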