Spark - prabhu914/Hadoop-Interview-Question GitHub Wiki

RDD and partitions:

RDDs are just collections of partitioned data.

Partitions:

1) Logical division of the data.

2) All input, intermediate, and output data can be represented as partitions.

3) Partitions are the basic unit of parallelism.

There are multiple chunks of data in HDFS. Using Hadoop's InputFormat API, the framework creates partitions. By default the number of partitions depends on the number of chunks (a one-to-one mapping), but we can change it.

So Spark internally uses Hadoop APIs to partition the data.
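The chunk-to-partition mapping described above can be sketched in plain Python. This is not the Spark or Hadoop API, just a toy model: `default_partitions` and `repartition` are hypothetical helper names showing the default one-to-one mapping and what it means to ask for a different partition count.

```python
# Toy model of partitioning (NOT the Spark API): by default each HDFS
# chunk becomes one partition, but the partition count can be changed.

def default_partitions(chunks):
    """One-to-one mapping: each chunk becomes one partition."""
    return [list(chunk) for chunk in chunks]

def repartition(chunks, num_partitions):
    """Redistribute all records round-robin into num_partitions partitions,
    analogous to asking Spark for a different number of partitions."""
    parts = [[] for _ in range(num_partitions)]
    all_records = (rec for chunk in chunks for rec in chunk)
    for i, record in enumerate(all_records):
        parts[i % num_partitions].append(record)
    return parts

chunks = [["a", "b"], ["c", "d"], ["e", "f"]]   # three HDFS chunks
assert len(default_partitions(chunks)) == 3      # default: 3 partitions
assert len(repartition(chunks, 2)) == 2          # changed: 2 partitions
```

In real Spark code the equivalent knobs are the `minPartitions` argument to `sc.textFile` and the `repartition`/`coalesce` transformations on an RDD.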

Logical division of data means a partition does not itself contain the data; when the data is needed, it is fetched from the chunks (the physical division).
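That "logical division" idea can be sketched with lazy readers: a partition holds only a recipe for reading its chunk, and records are pulled from storage on demand. The `make_partition` helper and the `/hdfs/chunk-*` paths below are made up for illustration; they are not Spark internals.

```python
# Toy model of a logical partition (NOT Spark internals): the partition
# stores no records, only how to read them; the physical read is deferred.

def make_partition(chunk_path, storage):
    """Return a lazy reader over one chunk of physically stored data."""
    def read():
        for record in storage[chunk_path]:   # physical read happens here
            yield record
    return read

storage = {"/hdfs/chunk-0": ["x", "y"], "/hdfs/chunk-1": ["z"]}
partitions = [make_partition(path, storage) for path in storage]  # no data copied yet
assert [rec for part in partitions for rec in part()] == ["x", "y", "z"]
```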

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
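The batching step above can be sketched as grouping timestamped records into fixed-interval micro-batches. This is a minimal pure-Python illustration of the idea, assuming a hypothetical `micro_batches` helper; in real Spark Streaming the batch interval is set when creating the `StreamingContext`.

```python
# Toy model of Spark Streaming micro-batching (NOT the Spark API):
# incoming records are grouped into fixed-interval batches, and each
# batch is then processed as one small job by the engine.

def micro_batches(timestamped_records, batch_interval):
    """Group (timestamp, record) pairs into batches of batch_interval seconds."""
    batches = {}
    for ts, record in timestamped_records:
        batches.setdefault(ts // batch_interval, []).append(record)
    return [batches[key] for key in sorted(batches)]

stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
assert micro_batches(stream, 1) == [["a", "b"], ["c"], ["d"]]
```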