Adding a NULL sink to Spark/SQL file IO
Often for benchmarking I need a null sink. The advantage of a null sink is that it decouples storage overheads from the compute, and it helps to trigger computation in Spark/SQL code. Using a null sink, for example, I can test how fast Spark can read a source file (e.g. Parquet).
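As a rough illustration, a read benchmark with such a sink could look like the sketch below. This is a hypothetical invocation, not taken from this wiki: the input and output paths are made up, and it assumes the sink is selected via the fully-qualified class name of its `FileFormat` implementation.

```scala
// Hypothetical read-benchmark sketch: read Parquet, discard all rows in the null sink.
// Assumes `spark` is an existing SparkSession (e.g. in spark-shell).
val df = spark.read.parquet("/data/input.parquet") // made-up input path

val start = System.nanoTime()
df.write
  .format("org.apache.spark.sql.execution.datasources.atr.AtrFileFormat")
  .save("/tmp/null-sink") // the API requires a path, but nothing real is written
val seconds = (System.nanoTime() - start) / 1e9

println(f"read + discard took $seconds%.2f s")
```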
My null sink is called `AtrFileFormat` and it is located in the package `org.apache.spark.sql.execution.datasources.atr`. The code is at:
https://gist.github.com/animeshtrivedi/8fab18a325a5817a11437a5f6f7437f3
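For orientation, a minimal skeleton of such a `FileFormat` might look like the sketch below. It is written against the Spark 2.2-era internal API (these interfaces live in an internal package and change between releases), and it is only an approximation of the gist, not a copy of it:

```scala
package org.apache.spark.sql.execution.datasources.atr

import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriter, OutputWriterFactory}
import org.apache.spark.sql.types.StructType

// Sketch of a write-only FileFormat that discards all data.
class AtrFileFormat extends FileFormat {

  // A pure sink has nothing to read back, so there is no schema to infer.
  override def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] = None

  // Hand Spark a factory whose writers throw everything away.
  override def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory =
    new OutputWriterFactory {
      override def getFileExtension(context: TaskAttemptContext): String = ".atr"

      override def newInstance(
          path: String,
          dataSchema: StructType,
          context: TaskAttemptContext): OutputWriter =
        new AtrOutputWriter(path, context)
    }
}
```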
The idea behind it is to return an `OutputWriterFactory` that returns a writer capable of writing `Row`/`InternalRow` objects to a storage system. In our case we return `AtrOutputWriter`. The code is at:
https://gist.github.com/animeshtrivedi/d3876938412bf8b8adcdb5e56e4c3066
The writer class does some accounting: it counts how many rows it was asked to write, and from that estimates the rough cost of the code path in time per row (assuming that the only thing happening in the system is IO).
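A minimal sketch of such a writer is below, again against the Spark 2.2-era internal API (older releases take `Row` instead of `InternalRow`, and newer ones change the interface again). The accounting shown here (a row counter plus wall-clock timing from construction to `close()`) is my guess at the gist's bookkeeping, not a copy of it:

```scala
package org.apache.spark.sql.execution.datasources.atr

import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.OutputWriter

// Null writer sketch: every row is dropped, but we keep enough state
// to report the row count and an approximate per-row cost.
class AtrOutputWriter(path: String, context: TaskAttemptContext)
  extends OutputWriter {

  private var rows: Long = 0L
  private val startNs: Long = System.nanoTime()

  // Discard the row; only bump the counter.
  override def write(row: InternalRow): Unit = {
    rows += 1
  }

  override def close(): Unit = {
    val elapsedNs = System.nanoTime() - startNs
    val nsPerRow = if (rows > 0) elapsedNs / rows else 0L
    // With no real IO, this approximates the per-row cost of Spark's
    // write path itself for this task.
    System.err.println(s"AtrOutputWriter: wrote $rows rows, ~$nsPerRow ns/row ($path)")
  }
}
```

Timing the writer's whole lifetime rather than each `write()` call keeps the measurement overhead off the per-row fast path, so the probe itself does not distort the numbers it reports.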