Code commentary
Hidden configuration parameters

Spark configurations are very poorly documented, especially the performance-related ones. Here is an example where a module-specific thread count is set for the shuffle, file, or netty module, with the default capped at 8 threads. This is stupid, as we often run on powerful machines with more than 40 cores; there is of course no way netty with just 8 threads can do justice to a 100 Gbps network.
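These knobs are settable, just undocumented. A sketch, assuming the spark.&lt;module&gt;.io.{server,client}Threads keys that TransportConf reads (module being shuffle, rpc, or files); when the keys are absent, the default is min(#cores, 8) netty event-loop threads, hence the 8-thread cap:

```scala
import org.apache.spark.SparkConf

// Override the hidden per-module netty thread counts. When these keys are
// absent, TransportConf falls back to min(available cores, 8) threads.
val conf = new SparkConf()
  .set("spark.shuffle.io.serverThreads", "32") // shuffle server event loops
  .set("spark.shuffle.io.clientThreads", "32") // shuffle client event loops
  .set("spark.rpc.io.serverThreads", "32")     // RPC server event loops
```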
There is no obvious reason why this is not open to loading on demand, or to loading from a user-specified class:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L39
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchange.scala#L59
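Spark already loads some components by reflection from a config key (spark.serializer being the obvious one), so the pattern is right there. A hypothetical sketch of it; the key spark.sql.exchange.impl and the ExchangePlugin trait below are made up for illustration:

```scala
// Made-up plugin trait standing in for a user-replaceable exchange.
trait ExchangePlugin {
  def name: String
}

final class DefaultExchange extends ExchangePlugin {
  override def name: String = "default"
}

// Resolve the implementation class from configuration and instantiate it
// by reflection (Spark proper would use Utils.classForName here).
def loadExchange(conf: Map[String, String]): ExchangePlugin = {
  val className = conf.getOrElse("spark.sql.exchange.impl", "DefaultExchange")
  Class.forName(className)
    .getDeclaredConstructor()
    .newInstance()
    .asInstanceOf[ExchangePlugin]
}
```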
This comment, from org.apache.spark.unsafe.Platform, says it all:

```java
/**
 * Uses internal JDK APIs to allocate a DirectByteBuffer while ignoring the JVM's
 * MaxDirectMemorySize limit (the default limit is too low and we do not want to require users
 * to increase it).
 */
```
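A minimal sketch of the trick being described, assuming JDK 8 (newer JDKs need --add-opens java.base/java.nio=ALL-UNNAMED): grab sun.misc.Unsafe, allocate raw memory, and wrap it with DirectByteBuffer's package-private (long, int) constructor, so ByteBuffer.allocateDirect()'s MaxDirectMemorySize accounting never runs. Spark's Platform additionally attaches a Cleaner so the memory is eventually freed; that part is omitted here.

```scala
import java.nio.ByteBuffer
import sun.misc.Unsafe

// Fish out the Unsafe singleton by reflection.
val unsafeField = classOf[Unsafe].getDeclaredField("theUnsafe")
unsafeField.setAccessible(true)
val unsafe = unsafeField.get(null).asInstanceOf[Unsafe]

// Allocate `size` bytes of raw memory and wrap them in a ByteBuffer via
// DirectByteBuffer's hidden (long address, int capacity) constructor.
// Nothing here touches the MaxDirectMemorySize counter. Without a Cleaner
// the memory leaks unless unsafe.freeMemory(address) is called later.
def allocateDirectBuffer(size: Int): ByteBuffer = {
  val ctor = Class.forName("java.nio.DirectByteBuffer")
    .getDeclaredConstructor(classOf[Long], classOf[Int])
  ctor.setAccessible(true)
  val address = unsafe.allocateMemory(size)
  ctor.newInstance(Long.box(address), Int.box(size)).asInstanceOf[ByteBuffer]
}
```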
Broadcast (once it is opened up as a pluggable interface) and shuffle implementations should get up-calls from the Executors about the pertinent events.
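Purely as a sketch of what is being asked for here; no such interface exists in Spark, and all names below are hypothetical:

```scala
// Hypothetical up-call interface that Executors could drive, so pluggable
// broadcast/shuffle backends learn about executor lifecycle events.
trait ExecutorUpcalls {
  def onExecutorStart(executorId: String): Unit
  def onTaskCompleted(taskId: Long, succeeded: Boolean): Unit
  def onExecutorShutdown(executorId: String): Unit
}
```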
Example: a broadcast variable. The broadcast data might be serialized by Kryo (not sure!), but the Broadcast class variable itself is transmitted using the Java serializer.
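A minimal usage sketch. As far as I can tell, the broadcast value goes through the configured spark.serializer (so Kryo, if enabled), while the Broadcast handle travels inside the task closure, which always uses the Java closure serializer; treat that as an assumption consistent with the note above:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf()
    .setMaster("local[*]")
    .setAppName("broadcast-example")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))

// The Map value is broadcast once per executor; the `lookup` handle itself
// ships to tasks inside the (Java-serialized) closure below.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(k => lookup.value.getOrElse(k, 0))
  .collect()
```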
Another example: the default memory mode for the columnar batch is hard-coded in VectorizedParquetRecordReader:

```java
public class VectorizedParquetRecordReader extends SpecificParquetRecordReaderBase<Object> {
  /**
   * The default config on whether columnarBatch should be offheap.
   */
  private static final MemoryMode DEFAULT_MEMORY_MODE = MemoryMode.ON_HEAP;
```
The constant below determines how many rows are read in a batch. It is hard-coded to 4096. Why? Who knows; 4096 is a nice number ;)
```java
public final class ColumnarBatch {
  private static final int DEFAULT_BATCH_SIZE = 4 * 1024;
```
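For what it is worth, later Spark releases (2.4+, if I remember right) expose this as spark.sql.parquet.columnarReaderBatchSize, still defaulting to 4096; in the version commented on here it is simply hard-coded. A sketch, assuming a live SparkSession named spark:

```scala
// Bump the vectorized parquet reader batch size from 4096 to 8192 rows.
// The key exists in newer Spark releases only; /path/to/data is a placeholder.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", "8192")
val df = spark.read.parquet("/path/to/data")
```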