Spark config details - animeshtrivedi/notes GitHub Wiki
`spark.shuffle.sort.initialBufferSize` (default: 4096): the initial size of the in-memory sort array used by the sort-based shuffle; the array grows as more records are inserted.
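If needed, it can be set like any other Spark config. A minimal sketch (the value 8192 here is a placeholder, not a tuning recommendation):

```scala
import org.apache.spark.SparkConf

// Hypothetical example: start the sort array at 8192 entries instead of 4096
val conf = new SparkConf()
  .set("spark.shuffle.sort.initialBufferSize", "8192")
```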
Is `spark.default.parallelism` important? It is used in Spark SQL code, for example when merging Parquet schemas. `ParquetFileFormat` has:
```scala
val numParallelism = Math.min(Math.max(partialFileStatusInfo.size, 1),
  sparkSession.sparkContext.defaultParallelism)
```
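The clamping above can be sketched standalone with hypothetical numbers: the parallelism used for schema merging is at least 1, at most the number of files, and never exceeds the cluster's default parallelism.

```scala
// Hypothetical values, not from the source
val numFiles = 3            // stands in for partialFileStatusInfo.size
val defaultParallelism = 8  // stands in for sparkContext.defaultParallelism
val numParallelism = Math.min(Math.max(numFiles, 1), defaultParallelism)
// numParallelism == 3: bounded below by 1, bounded above by defaultParallelism
```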
which goes to `SparkContext`, which has:
```scala
def defaultParallelism: Int = {
  assertNotStopped()
  taskScheduler.defaultParallelism
}
```
which goes to `TaskScheduler`, which has:
```scala
// Get the default level of parallelism to use in the cluster, as a hint for sizing jobs.
def defaultParallelism(): Int
```
which goes to `TaskSchedulerImpl`, which has:
```scala
override def defaultParallelism(): Int = backend.defaultParallelism()
```
which goes to `SchedulerBackend`, which has:
```scala
def defaultParallelism(): Int
```
which goes to `CoarseGrainedSchedulerBackend`, which has:
```scala
override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}
```
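So if `spark.default.parallelism` is unset, the fallback is `max(total executor cores, 2)`. A sketch with hypothetical numbers:

```scala
// Hypothetical: 4 executors x 8 cores registered => totalCoreCount of 32
val totalCoreCount = 32
val parallelism = math.max(totalCoreCount, 2)  // 32
// On a tiny test cluster with 1 registered core, the floor of 2 applies:
val testParallelism = math.max(1, 2)           // 2
```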
Recommendation: always set `spark.default.parallelism` to a value appropriate for your cluster and workload; otherwise the default (for coarse-grained backends) is `max(total executor cores, 2)`, which may not be what you want.
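For example, a minimal sketch of setting it explicitly on a `SparkConf` (the app name and the value 200 are placeholders; tune the value to your cluster):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("example")                    // hypothetical app name
  .set("spark.default.parallelism", "200")  // placeholder value
```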