Code commentary
Hidden configuration parameters

Spark configurations are very poorly documented, especially the performance-related ones. Here is an example where a module-specific thread count is set for the shuffle, file, or netty module, with the default capped at 8 threads. This is stupid, as we often run on powerful machines with more than 40 cores; there is of course no way netty with just 8 threads can do justice to a 100 Gbps network.
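These knobs are settable, just undocumented. A sketch, assuming the spark.&lt;module&gt;.io.{server,client}Threads keys that TransportConf reads (module being shuffle, rpc, or files); when the keys are absent, the default is min(#cores, 8) netty event-loop threads, hence the 8-thread cap:

```scala
import org.apache.spark.SparkConf

// Override the hidden per-module netty thread counts. When these keys are
// absent, TransportConf falls back to min(available cores, 8) threads.
val conf = new SparkConf()
  .set("spark.shuffle.io.serverThreads", "32") // shuffle server event loops
  .set("spark.shuffle.io.clientThreads", "32") // shuffle client event loops
  .set("spark.rpc.io.serverThreads", "32")     // RPC server event loops
```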
There is no obvious reason why this is not open to loading on demand, or to loading from a user-specified class:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L39
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchange.scala#L59
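Spark already loads some components by reflection from a config key (spark.serializer being the obvious one), so the pattern is right there. A hypothetical sketch of it; the key spark.sql.exchange.impl and the ExchangePlugin trait below are made up for illustration:

```scala
// Made-up plugin trait standing in for a user-replaceable exchange.
trait ExchangePlugin {
  def name: String
}

final class DefaultExchange extends ExchangePlugin {
  override def name: String = "default"
}

// Resolve the implementation class from configuration and instantiate it
// by reflection (Spark proper would use Utils.classForName here).
def loadExchange(conf: Map[String, String]): ExchangePlugin = {
  val className = conf.getOrElse("spark.sql.exchange.impl", "DefaultExchange")
  Class.forName(className)
    .getDeclaredConstructor()
    .newInstance()
    .asInstanceOf[ExchangePlugin]
}
```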
This comment, from org.apache.spark.unsafe.Platform, says it all:

```java
/**
 * Uses internal JDK APIs to allocate a DirectByteBuffer while ignoring the JVM's
 * MaxDirectMemorySize limit (the default limit is too low and we do not want to require users
 * to increase it).
 */
```
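A minimal sketch of the trick being described, assuming JDK 8 (newer JDKs need --add-opens java.base/java.nio=ALL-UNNAMED): grab sun.misc.Unsafe, allocate raw memory, and wrap it with DirectByteBuffer's package-private (long, int) constructor, so ByteBuffer.allocateDirect()'s MaxDirectMemorySize accounting never runs. Spark's Platform additionally attaches a Cleaner so the memory is eventually freed; that part is omitted here.

```scala
import java.nio.ByteBuffer
import sun.misc.Unsafe

// Fish out the Unsafe singleton by reflection.
val unsafeField = classOf[Unsafe].getDeclaredField("theUnsafe")
unsafeField.setAccessible(true)
val unsafe = unsafeField.get(null).asInstanceOf[Unsafe]

// Allocate `size` bytes of raw memory and wrap them in a ByteBuffer via
// DirectByteBuffer's hidden (long address, int capacity) constructor.
// Nothing here touches the MaxDirectMemorySize counter. Without a Cleaner
// the memory leaks unless unsafe.freeMemory(address) is called later.
def allocateDirectBuffer(size: Int): ByteBuffer = {
  val ctor = Class.forName("java.nio.DirectByteBuffer")
    .getDeclaredConstructor(classOf[Long], classOf[Int])
  ctor.setAccessible(true)
  val address = unsafe.allocateMemory(size)
  ctor.newInstance(Long.box(address), Int.box(size)).asInstanceOf[ByteBuffer]
}
```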
Broadcast (once it is opened up as a pluggable interface) and shuffle implementations should get up-calls from the Executors about the pertinent events.
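Purely as a sketch of what is being asked for here; no such interface exists in Spark, and all names below are hypothetical:

```scala
// Hypothetical up-call interface that Executors could drive, so pluggable
// broadcast/shuffle backends learn about executor lifecycle events.
trait ExecutorUpcalls {
  def onExecutorStart(executorId: String): Unit
  def onTaskCompleted(taskId: Long, succeeded: Boolean): Unit
  def onExecutorShutdown(executorId: String): Unit
}
```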
Example: a broadcast variable. The broadcast data might be serialized by Kryo (not sure!), but the Broadcast class variable itself is transmitted using the Java serializer.
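A minimal usage sketch. As far as I can tell, the broadcast value goes through the configured spark.serializer (so Kryo, if enabled), while the Broadcast handle travels inside the task closure, which always uses the Java closure serializer; treat that as an assumption consistent with the note above:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf()
    .setMaster("local[*]")
    .setAppName("broadcast-example")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))

// The Map value is broadcast once per executor; the `lookup` handle itself
// ships to tasks inside the (Java-serialized) closure below.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(k => lookup.value.getOrElse(k, 0))
  .collect()
```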
Another example: the default memory mode for the columnar batch is hard-coded in VectorizedParquetRecordReader:

```java
public class VectorizedParquetRecordReader extends SpecificParquetRecordReaderBase<Object> {
  /**
   * The default config on whether columnarBatch should be offheap.
   */
  private static final MemoryMode DEFAULT_MEMORY_MODE = MemoryMode.ON_HEAP;
```
The constant below determines how many rows are read in a batch. It is hard-coded to 4096. Why? Who knows; 4096 is a nice number ;)
```java
public final class ColumnarBatch {
  private static final int DEFAULT_BATCH_SIZE = 4 * 1024;
```
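For what it is worth, later Spark releases (2.4+, if I remember right) expose this as spark.sql.parquet.columnarReaderBatchSize, still defaulting to 4096; in the version commented on here it is simply hard-coded. A sketch, assuming a live SparkSession named spark:

```scala
// Bump the vectorized parquet reader batch size from 4096 to 8192 rows.
// The key exists in newer Spark releases only; /path/to/data is a placeholder.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", "8192")
val df = spark.read.parquet("/path/to/data")
```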