Ideas - animeshtrivedi/notes GitHub Wiki

  • Since SQL serialization is always (int, binary), why not write the two parts to separate files? They have to be read in sync, but that should be OK. That eliminates the 4-byte int reads interleaved with the data.

  • For a zero-copy SQL serializer, we can then just copy the data from the binary file and materialize only the 'int' bytes read from the metadata file.

  • RDMA local memory copy when the node is local: is it holding up performance with multiple cores? In network mode, are these operations posted asynchronously, or not?
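
A minimal sketch of the split (int, binary) idea from the first two bullets, to make the "read in sync" part concrete. Everything here is an assumption for illustration: the class name `SplitStreamSketch`, the `Split` holder, and the use of in-memory `ByteBuffer`s in place of the two files. It is not the actual Spark serializer format. Lengths go to one buffer and record bytes to another; the reader walks both in lockstep and hands out slices that share the payload memory instead of copying it.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class SplitStreamSketch {
    /** Hypothetical holder for the two streams; in the note these would be two files. */
    public static final class Split {
        public final ByteBuffer lengths; // metadata stream: one 4-byte int per record
        public final ByteBuffer payload; // data stream: record bytes back to back
        Split(ByteBuffer lengths, ByteBuffer payload) {
            this.lengths = lengths;
            this.payload = payload;
        }
    }

    /** Writes each record's length to one buffer and its bytes to the other. */
    public static Split write(List<byte[]> records) {
        int total = 0;
        for (byte[] r : records) {
            total += r.length;
        }
        ByteBuffer lengths = ByteBuffer.allocate(records.size() * Integer.BYTES);
        ByteBuffer payload = ByteBuffer.allocate(total);
        for (byte[] r : records) {
            lengths.putInt(r.length);
            payload.put(r);
        }
        lengths.flip(); // switch both buffers from writing to reading
        payload.flip();
        return new Split(lengths, payload);
    }

    /** Reads both buffers in sync; each result is a view on payload, not a copy. */
    public static List<ByteBuffer> read(ByteBuffer lengths, ByteBuffer payload) {
        List<ByteBuffer> out = new ArrayList<>();
        while (lengths.remaining() >= Integer.BYTES) {
            int len = lengths.getInt();        // the only bytes we materialize
            ByteBuffer view = payload.slice(); // shares memory with payload
            view.limit(len);
            out.add(view);
            payload.position(payload.position() + len);
        }
        return out;
    }

    public static void main(String[] args) {
        Split s = write(Arrays.asList(new byte[] {1, 2, 3}, new byte[] {4, 5}));
        List<ByteBuffer> records = read(s.lengths, s.payload);
        System.out.println("records read: " + records.size());
    }
}
```

With real files, the payload buffer could plausibly be a memory-mapped region so the views stay zero-copy, but that part is untested here.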


Feb 2nd

Current status: network time is disproportionate to the total compute time. What can we do?

  • make sure that the number of serializer instances is sane; cache them if possible;
  • implement multistream sensibly;
  • write your own shuffler with split file streams for SQL;
  • profile the code path without IO: how far down can you push the time, so that the network cost becomes prominent?
  • can we profile vanilla Spark as well?
  • think about the theme of the SQL paper: what are we trying to achieve here?
  • write a Java copy benchmark;
  • write a join benchmark for a single machine.
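
A possible starting point for the Java copy benchmark item, as a rough sketch only. The class name `CopyBenchSketch`, the 64 MiB buffer size, and the iteration counts are all assumptions. It times heap-to-heap `System.arraycopy` with `System.nanoTime`, which is crude: a harness such as JMH would handle warm-up and JIT effects more carefully.

```java
import java.util.Arrays;

public final class CopyBenchSketch {
    /** Copies src into dst `iterations` times and returns the elapsed nanoseconds. */
    public static long timeArrayCopy(byte[] src, byte[] dst, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            System.arraycopy(src, 0, dst, 0, src.length);
        }
        return System.nanoTime() - start;
    }

    /** Converts a byte count and a duration into GiB per second. */
    public static double gibPerSecond(long bytes, long nanos) {
        return (bytes / (double) (1L << 30)) / (nanos / 1e9);
    }

    public static void main(String[] args) {
        int size = 64 << 20;  // 64 MiB per copy (assumed; larger than typical caches)
        int iterations = 20;  // assumed; raise for more stable numbers
        byte[] src = new byte[size];
        byte[] dst = new byte[size];
        Arrays.fill(src, (byte) 1); // touch the pages before timing

        timeArrayCopy(src, dst, 5); // warm-up so the JIT compiles the loop
        long nanos = timeArrayCopy(src, dst, iterations);
        System.out.printf("arraycopy: %.2f GiB/s%n",
                gibPerSecond((long) size * iterations, nanos));
    }
}
```

Running several threads, each with its own buffers, would be one way to check whether local copies hold up performance across cores.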