Spark config details - animeshtrivedi/notes GitHub Wiki

spark.shuffle.sort.initialBufferSize (default: 4096): the initial size of the in-memory pointer array used by the sort-based shuffle writer; it grows as more records are inserted.
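For illustration, the setting can be overridden on a SparkConf like any other config key (a minimal sketch; the value 8192 is an arbitrary example, not a recommendation):

```scala
import org.apache.spark.SparkConf

// Sketch: override the initial sort-buffer size explicitly.
// 8192 is an illustrative value, not a tuning recommendation.
val conf = new SparkConf()
  .setAppName("shuffle-config-example")
  .set("spark.shuffle.sort.initialBufferSize", "8192")
```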


Is spark.default.parallelism important? It is used in Spark SQL code, for example during Parquet schema merging: ParquetFileFormat has

 val numParallelism = Math.min(Math.max(partialFileStatusInfo.size, 1),
      sparkSession.sparkContext.defaultParallelism)
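The effect of this expression can be shown with a small standalone function (plain Scala, no Spark needed): the number of file-status entries is clamped to the range [1, defaultParallelism].

```scala
// Standalone sketch of the clamp above: one task per file-status entry,
// but never fewer than 1 and never more than the default parallelism.
def numParallelism(partialFileStatusCount: Int, defaultParallelism: Int): Int =
  math.min(math.max(partialFileStatusCount, 1), defaultParallelism)

println(numParallelism(0, 8))   // no files: still 1 task -> 1
println(numParallelism(5, 8))   // fewer files than parallelism -> 5
println(numParallelism(100, 8)) // capped at defaultParallelism -> 8
```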

which goes to SparkContext, which has

def defaultParallelism: Int = {
    assertNotStopped()
    taskScheduler.defaultParallelism
  }

which goes to TaskScheduler, which has

// Get the default level of parallelism to use in the cluster, as a hint for sizing jobs.
def defaultParallelism(): Int

which goes to TaskSchedulerImpl, which has

override def defaultParallelism(): Int = backend.defaultParallelism()

which goes to SchedulerBackend, which has

def defaultParallelism(): Int

which goes to CoarseGrainedSchedulerBackend, which has

override def defaultParallelism(): Int = {
    conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
  }
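So when spark.default.parallelism is unset, the default falls back to max(total core count, 2). That fallback can be sketched as a standalone function over a plain Map (an illustration, not Spark's actual SparkConf API):

```scala
// Sketch of the fallback in CoarseGrainedSchedulerBackend.defaultParallelism,
// written over a plain Map so it runs without Spark.
def defaultParallelism(conf: Map[String, String], totalCoreCount: Int): Int =
  conf.get("spark.default.parallelism").map(_.toInt)
    .getOrElse(math.max(totalCoreCount, 2))

println(defaultParallelism(Map.empty, 16)) // unset: total core count -> 16
println(defaultParallelism(Map.empty, 1))  // unset, 1 core: floor of 2 -> 2
println(defaultParallelism(Map("spark.default.parallelism" -> "200"), 16)) // explicit value wins -> 200
```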

Recommendation - always set spark.default.parallelism explicitly to a value appropriate for the workload, rather than relying on the core-count fallback.