Ideas - animeshtrivedi/notes GitHub Wiki
-
Since SQL serialization always produces (int, binary) pairs, why not write the two parts separately, in two different files? They have to be read in sync, but that should be OK. That eliminates the interleaved 4-byte int reads from the data file.
-
For a zero-copy SQL serializer, we can then just copy the data from the binary file and materialize only the 'int' bytes read from the metadata file.
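A minimal sketch of the split-file idea, assuming each row is an opaque byte[] preceded by its 4-byte length. The class name `SplitStreamSketch` and the methods `writeRows` and `readRows` are hypothetical, not part of Spark or any existing shuffler; in-memory streams stand in for the two files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SplitStreamSketch {
    // Hypothetical writer: row lengths go to the metadata stream,
    // raw row bytes go to the data stream with no framing in between.
    static void writeRows(List<byte[]> rows, DataOutputStream meta, DataOutputStream data)
            throws IOException {
        for (byte[] row : rows) {
            meta.writeInt(row.length); // 4-byte length, metadata only
            data.write(row);           // payload only
        }
    }

    // Hypothetical reader: walks both streams in sync. Each length read
    // from the metadata stream says how many payload bytes to take next.
    static List<byte[]> readRows(DataInputStream meta, DataInputStream data)
            throws IOException {
        List<byte[]> rows = new ArrayList<>();
        while (true) {
            int len;
            try {
                len = meta.readInt();
            } catch (EOFException endOfMetadata) {
                break; // no more rows
            }
            byte[] row = new byte[len];
            data.readFully(row);
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> rows = new ArrayList<>();
        rows.add(new byte[] {1, 2, 3});
        rows.add(new byte[] {4, 5});

        ByteArrayOutputStream metaBytes = new ByteArrayOutputStream();
        ByteArrayOutputStream dataBytes = new ByteArrayOutputStream();
        writeRows(rows, new DataOutputStream(metaBytes), new DataOutputStream(dataBytes));

        List<byte[]> back = readRows(
                new DataInputStream(new ByteArrayInputStream(metaBytes.toByteArray())),
                new DataInputStream(new ByteArrayInputStream(dataBytes.toByteArray())));
        System.out.println("rows read back: " + back.size());
    }
}
```

The data stream ends up holding only payload bytes, so a whole run of rows could be moved in one bulk copy, with the lengths consulted only when individual rows need to be materialized.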
-
RDMA local memory copy when the node is local: is it holding up performance with multiple cores? In network mode these operations would be posted asynchronously, or wouldn't they?
Feb 2nd
Current status: network time is disproportionate to the total compute time. What can we do?
- make sure that the number of serializer instances is sane; cache them if possible;
- implement multistream sensibly;
- write your own shuffler with split file streams for SQL;
- profile the code path without IO: how much lower can you push it, to make the network prominent?
- can we profile vanilla Spark as well?
- think about the theme of the SQL paper: what are we trying to achieve here?
- write a Java copy benchmark;
- write a join benchmark for a single machine.
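A possible starting point for the Java copy benchmark item, as a rough sketch only. The class name `CopyBenchSketch`, the 1 MiB buffer size, and the iteration count are arbitrary assumptions. It times `System.arraycopy` between heap arrays and `ByteBuffer.put` between direct buffers using `System.nanoTime`; a serious measurement would want a proper harness such as JMH.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class CopyBenchSketch {
    // Copies src into dst `iterations` times; returns elapsed nanoseconds.
    static long timeArrayCopy(byte[] src, byte[] dst, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            System.arraycopy(src, 0, dst, 0, src.length);
        }
        return System.nanoTime() - start;
    }

    // Same loop for ByteBuffers (heap or direct). Both buffers are reset
    // with clear() so every iteration copies the full capacity of src.
    static long timeBufferCopy(ByteBuffer src, ByteBuffer dst, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            src.clear();
            dst.clear();
            dst.put(src);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int size = 1 << 20;    // 1 MiB per copy (arbitrary choice)
        int iterations = 1000; // arbitrary choice

        byte[] heapSrc = new byte[size];
        byte[] heapDst = new byte[size];
        Arrays.fill(heapSrc, (byte) 7);

        ByteBuffer directSrc = ByteBuffer.allocateDirect(size);
        ByteBuffer directDst = ByteBuffer.allocateDirect(size);

        // Warm-up pass so the JIT has compiled the loops before timing.
        timeArrayCopy(heapSrc, heapDst, iterations);
        timeBufferCopy(directSrc, directDst, iterations);

        long heapNs = timeArrayCopy(heapSrc, heapDst, iterations);
        long directNs = timeBufferCopy(directSrc, directDst, iterations);

        double totalBytes = (double) size * iterations;
        // bytes per nanosecond is numerically equal to GB/s
        System.out.printf("heap arraycopy : %.2f GB/s%n", totalBytes / heapNs);
        System.out.printf("direct put     : %.2f GB/s%n", totalBytes / directNs);
    }
}
```

Running several instances in parallel threads would be one way to probe the multi-core local-copy question raised earlier on this page.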