# Adding a NULL sink to Spark/SQL file IO

Often for benchmarking I need a null sink. The advantage of a null sink is that it decouples storage overheads from the compute, and it gives a cheap way to trigger computation in otherwise lazily evaluated Spark/SQL code. With a null sink, for example, I can test how fast Spark can read a source file (e.g. Parquet).

My null sink is called AtrFileFormat, and it lives in the package org.apache.spark.sql.execution.datasources.atr (inside Spark's own namespace, so that it can use the internal FileFormat API). The code is at: https://gist.github.com/animeshtrivedi/8fab18a325a5817a11437a5f6f7437f3
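The gist has the full version; as a rough sketch of the shape, written against the Spark 2.2-era FileFormat API (the method signatures differ across Spark versions), the format only needs a schema-less inferSchema and a prepareWrite that hands back the writer factory:

```scala
package org.apache.spark.sql.execution.datasources.atr

import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriter, OutputWriterFactory}
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.types.StructType

class AtrFileFormat extends FileFormat with DataSourceRegister {

  // Short name is optional sugar; it needs a META-INF/services entry
  // before .format("atr") resolves to this class.
  override def shortName(): String = "atr"

  // This is a write-only sink, so there is no schema to infer on read.
  override def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] = None

  // Hand back a factory whose writers silently discard all rows.
  override def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory = new OutputWriterFactory {

    override def getFileExtension(context: TaskAttemptContext): String = ".atr"

    override def newInstance(
        path: String,
        dataSchema: StructType,
        context: TaskAttemptContext): OutputWriter =
      new AtrOutputWriter(path, context)
  }
}
```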

The idea is to return an OutputWriterFactory, which in turn produces a writer capable of writing Row/InternalRow to a storage system. In our case the factory returns an AtrOutputWriter. The code is at: https://gist.github.com/animeshtrivedi/d3876938412bf8b8adcdb5e56e4c3066
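A minimal sketch of such a writer, again assuming the Spark 2.2-era OutputWriter API where write takes an InternalRow. The row-count and timing accounting mirrors what the gist does, but the exact fields and the log message here are illustrative:

```scala
package org.apache.spark.sql.execution.datasources.atr

import org.apache.hadoop.mapreduce.TaskAttemptContext

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.OutputWriter

class AtrOutputWriter(path: String, context: TaskAttemptContext)
  extends OutputWriter with Logging {

  private var rowCount = 0L
  private val startNs = System.nanoTime()

  // Discard the row; only the counter is updated, so the cost of this
  // call is (roughly) the fixed per-row overhead of the write path.
  override def write(row: InternalRow): Unit = {
    rowCount += 1
  }

  // On close, report the rows seen and a rough time/row for this task.
  override def close(): Unit = {
    val elapsedNs = System.nanoTime() - startNs
    val nsPerRow = if (rowCount > 0) elapsedNs / rowCount else 0L
    logInfo(s"$path: discarded $rowCount rows in ${elapsedNs / 1000000} ms " +
      s"(~$nsPerRow ns/row)")
  }
}
```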

The writer class does some accounting: it counts how many rows it was asked to write, and from that derives the rough cost of the write code path in time/row (assuming that the only thing happening in the system is IO).
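Putting it together, a read benchmark looks something like the sketch below (paths are illustrative). Using the fully qualified class name avoids having to register a short name; an output path still has to be supplied because the writer API requires one:

```scala
// Read a Parquet source and discard every row through the null sink.
// The write forces full materialization, so end-to-end time is
// dominated by the Parquet read path.
val df = spark.read.parquet("/data/source.parquet")

df.write
  .mode("overwrite")
  .format("org.apache.spark.sql.execution.datasources.atr.AtrFileFormat")
  .save("/tmp/atr-null-sink") // required by the API, nothing useful lands here
```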