Difference between ShuffledRDD, MapPartitionsRDD and ParallelCollectionRDD
-
ShuffledRDD : A ShuffledRDD is created when data is shuffled over the cluster. Any transformation that shuffles your data (e.g. join, groupBy, repartition, etc.) will create a ShuffledRDD.
-
MapPartitionsRDD : A MapPartitionsRDD is created when you use the mapPartitions transformation; other narrow transformations such as map and filter produce it as well.
-
ParallelCollectionRDD : A ParallelCollectionRDD is created when you build an RDD from an in-memory collection, e.g. with sc.parallelize. The sketch below shows all three types.
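A minimal sketch that prints the concrete class of each RDD (the object name, local master and sample data are illustrative assumptions, not from these notes):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddTypesDemo { // hypothetical name, for illustration only
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-types").setMaster("local[*]"))

    // ParallelCollectionRDD: created from an in-memory collection
    val parallel = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    println(parallel.getClass.getSimpleName)  // ParallelCollectionRDD

    // MapPartitionsRDD: created by mapPartitions (map and filter produce it too)
    val mapped = parallel.mapPartitions(_.map { case (k, v) => (k, v * 2) })
    println(mapped.getClass.getSimpleName)    // MapPartitionsRDD

    // ShuffledRDD: created by a shuffling transformation such as groupByKey
    val shuffled = mapped.groupByKey()
    println(shuffled.getClass.getSimpleName)  // ShuffledRDD

    // toDebugString prints the whole lineage, with these class names in it
    println(shuffled.toDebugString)

    sc.stop()
  }
}
```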
See also: https://github.com/JerryLead/SparkInternals/tree/master/markdown
A reduceByKey operation still involves a shuffle, since all items with the same key must end up in the same partition.
However, it is a much smaller shuffle than a groupByKey: reduceByKey performs the reduction within each partition before shuffling, which cuts down the amount of data sent over the network.
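A word-count sketch of the difference, reusing the sc from the snippet above (the sample data is made up):

```scala
// reduceByKey combines counts within each partition (map-side combine)
// before the shuffle, so far less data crosses the network than with
// groupByKey, which ships every individual (word, 1) pair.
val words = sc.parallelize(Seq("spark", "kafka", "spark", "rdd", "kafka", "spark"))
val pairs = words.map(w => (w, 1))

// Preferred: partial sums are computed locally, then merged after the shuffle
val counts = pairs.reduceByKey(_ + _)

// Works, but shuffles every value for each key before summing
val countsViaGroup = pairs.groupByKey().mapValues(_.sum)

counts.collect().foreach(println)  // e.g. (spark,3), (kafka,2), (rdd,1)
```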
-
By default, Spark creates one partition for each HDFS block of the file (64 MB by default in Hadoop 1.x; 128 MB in Hadoop 2.x and later). You can also pass a second argument as the number of partitions when creating the RDD. Let's see an example of creating an RDD from a text file.
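For example (the HDFS path here is hypothetical; the second argument to textFile is the minimum number of partitions):

```scala
// One partition per HDFS block by default
val logs = sc.textFile("hdfs:///data/logs/app.log")

// Ask for at least 10 partitions instead
val logs10 = sc.textFile("hdfs:///data/logs/app.log", 10)
println(logs10.getNumPartitions)  // >= 10
```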
-
Use coalesce to decrease the number of partitions: prefer coalesce over repartition when reducing the number of partitions of an RDD. coalesce is useful because, unlike repartition, it does not shuffle data over the network by default.
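A small sketch of the difference (the element count and partition counts are made up):

```scala
val bigRdd = sc.parallelize(1 to 1000000, 100)

// coalesce(10) merges existing partitions locally; no shuffle by default
val fewer = bigRdd.coalesce(10)
println(fewer.getNumPartitions)  // 10

// repartition(10) is coalesce(10, shuffle = true): a full network shuffle,
// needed when increasing the partition count or rebalancing skewed data
val reshuffled = bigRdd.repartition(10)
```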