Arrow

notes

  • If you get Netty errors about missing functions, delete the old Netty jar files.

The C++ build command I use (Release build, SSE on, install into my local prefix):

cmake -DCMAKE_BUILD_TYPE=Release -DARROW_USE_SSE=ON -DCMAKE_INSTALL_PREFIX=/home/atr/local/ ..

My local patch: it pins the compiler to g++, adds aggressive optimization flags, and switches the read-path mmap from MAP_PRIVATE to MAP_SHARED | MAP_POPULATE:

atr@flex13:~/zrl/github/external/apache/arrow$ git diff
diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
index 43215b63..284bc97d 100644
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@@ -28,6 +28,8 @@ string(REGEX MATCH
 
 project(arrow VERSION "${ARROW_BASE_VERSION}")
 
+set(CMAKE_CXX_COMPILER g++)
+
 set(ARROW_VERSION_MAJOR "${arrow_VERSION_MAJOR}")
 set(ARROW_VERSION_MINOR "${arrow_VERSION_MINOR}")
 set(ARROW_VERSION_PATCH "${arrow_VERSION_PATCH}")
@@ -360,7 +362,7 @@ set(CMAKE_C_FLAGS "${CMAKE_CXX_FLAGS}")
 string(REPLACE "-std=c++11" "" CMAKE_C_FLAGS ${CMAKE_C_FLAGS})
 
 # Add C++-only flags, like -std=c++11
-set(CMAKE_CXX_FLAGS "${CXX_ONLY_FLAGS} ${CMAKE_CXX_FLAGS}")
+set(CMAKE_CXX_FLAGS "${CXX_ONLY_FLAGS} ${CMAKE_CXX_FLAGS}  -O2 -Ofast -ffast-math -funroll-loops -march=native")
 
 # ASAN / TSAN / UBSAN
 if(ARROW_FUZZING)
@@ -386,6 +388,7 @@ if ("${ARROW_GENERATE_COVERAGE}")
   endif()
 endif()
 
+
 # CMAKE_CXX_FLAGS now fully assembled
 message(STATUS "CMAKE_CXX_FLAGS: ${CMAKE_CXX_FLAGS}")
 
diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc
index 6cd1546f..db2d5b30 100644
--- a/cpp/src/arrow/io/file.cc
+++ b/cpp/src/arrow/io/file.cc
@@ -376,7 +376,8 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer {
       is_mutable_ = true;
     } else {
       prot_flags_ = PROT_READ;
-      map_mode_ = MAP_PRIVATE;  // Changes are not to be committed back to the file
+      std::cout << "Are we here, in mmap reading mode, I am changing it MAP_SHARED | MAP_POPULATE \n";
+      map_mode_ = MAP_SHARED | MAP_POPULATE; //PRIVATE;  // Changes are not to be committed back to the file
       RETURN_NOT_OK(file_->OpenReadable(path));
 
       is_mutable_ = false;
atr@flex13:~/zrl/github/external/apache/arrow$ 

Java API docs: https://arrow.apache.org/docs/java/

Notes related to Apache Arrow

My repo is: https://github.com/animeshtrivedi/ArrowExample

code-notes

Round 2

  • io.netty.util.internal.PlatformDependent is used to handle the Unsafe-style memory operations (a sketch follows the code below).
  • IntHolder and NullableIntHolder are just wrapper classes that bundle the value, the isSet (validity) flag, and the width together. These types of holder classes are auto-generated (see the holder sketch below).
  • Once data is serialized, an ArrowRecordBatch becomes an ArrowBlock, which is nothing but (offset, metadataLength, bodyLength). The ArrowBlock entries are read up front when a file is opened.
  public void writeBatch() throws IOException {
    ensureStarted();
    try (ArrowRecordBatch batch = unloader.getRecordBatch()) {
      writeRecordBatch(batch);
    }
  }

  protected void writeRecordBatch(ArrowRecordBatch batch) throws IOException {
    ArrowBlock block = MessageSerializer.serialize(out, batch);
    LOGGER.debug(String.format("RecordBatch at %d, metadata: %d, body: %d",
        block.getOffset(), block.getMetadataLength(), block.getBodyLength()));
    recordBlocks.add(block);
  }
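
A minimal sketch of the kind of Unsafe-style access this wraps (a toy example, not Arrow code; allocateMemory/putInt/getInt/freeMemory are real PlatformDependent methods, the 16-byte size is made up):

import io.netty.util.internal.PlatformDependent;

public class UnsafePeek {
  public static void main(String[] args) {
    // Allocate 16 bytes off-heap, purely for illustration.
    long addr = PlatformDependent.allocateMemory(16);
    try {
      // Raw, bounds-unchecked access at an absolute address -- the kind of
      // call Arrow's buffer get/set methods bottom out in.
      PlatformDependent.putInt(addr, 42);
      System.out.println(PlatformDependent.getInt(addr));  // prints 42
    } finally {
      PlatformDependent.freeMemory(addr);
    }
  }
}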
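
And a sketch of a holder in use (assumes a populated IntVector named vector; in newer Arrow Java, IntVector.get(int, NullableIntHolder) fills the holder in place instead of returning a boxed value):

// Assumes a populated org.apache.arrow.vector.IntVector named 'vector'.
NullableIntHolder holder = new NullableIntHolder();  // org.apache.arrow.vector.holders
vector.get(0, holder);        // fills the holder in place, no boxing, no null check
if (holder.isSet == 1) {      // isSet mirrors the validity bitmap
  System.out.println(holder.value);
}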

What is the difference between a Field and a MinorType?
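
My understanding: a Field is schema metadata (name + ArrowType + nullability + children for nested types), while MinorType is an enum naming the concrete vector implementation, and each MinorType maps to a default ArrowType. A small sketch (the Arrow classes are real; the field name is made up):

import org.apache.arrow.vector.types.Types.MinorType;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

// A Field: name + type + nullability (+ children; null here for a flat column).
Field intField = new Field("myInt",
    new FieldType(true, new ArrowType.Int(32, true), /*dictionary=*/null),
    /*children=*/null);

// A MinorType: an enum tag for the vector class; it hands back its default ArrowType.
ArrowType t = MinorType.INT.getType();   // same as new ArrowType.Int(32, true)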

How are values and validity bitmaps encoded?

ValueVector.setInitialCapacity(int valueCount) sizes the "value" buffer from the count, and the "validity" buffer as ceil(valueCount / 8.0), as shown in BaseValueVector:

 /* number of bytes for the validity buffer for the given valueCount */
  protected static int getValidityBufferSizeFromCount(final int valueCount) {
    return (int) Math.ceil(valueCount / 8.0);
  }
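
So the validity bitmap costs one bit per value, rounded up to whole bytes. A quick worked check of that formula:

// One validity bit per value, rounded up to whole bytes.
System.out.println((int) Math.ceil(100 / 8.0));    // 13 bytes for 100 values
System.out.println((int) Math.ceil(1024 / 8.0));   // 128 bytes for 1024 values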

ArrowWriter, ArrowFileWriter, and ArrowStreamWriter (and associated readers)

There is an ArrowWriter abstract class that is implemented by ArrowFileWriter and ArrowStreamWriter. I am assuming that the file writer can seek and do footer checks, while the stream writer is geared toward network/RPC-style data transfer.

The general pattern of use is:

this.arrowFileWriter.start();

loop {
  // populate the vectors and set the row count, then:
  this.arrowFileWriter.writeBatch();
}

this.arrowFileWriter.end();
this.arrowFileWriter.close();
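
A minimal end-to-end sketch of that pattern (the file name, batch count, and single int column are my assumptions; imports match newer Arrow Java, where older versions kept the writer under org.apache.arrow.vector.file):

import java.io.FileOutputStream;
import java.util.Collections;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

public class WriterSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema(Collections.singletonList(
        new Field("myInt", new FieldType(true, new ArrowType.Int(32, true), null), null)));
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
         FileOutputStream file = new FileOutputStream("/tmp/example.arrow");
         ArrowFileWriter writer = new ArrowFileWriter(root, null, file.getChannel())) {
      writer.start();
      IntVector v = (IntVector) root.getVector("myInt");
      for (int batch = 0; batch < 4; batch++) {   // 4 batches of 1024 rows, arbitrary
        v.allocateNew(1024);
        for (int i = 0; i < 1024; i++) v.setSafe(i, i);
        v.setValueCount(1024);
        root.setRowCount(1024);
        writer.writeBatch();                      // each call records one ArrowBlock
      }
      writer.end();
    }
  }
}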

What is the status of Spark support for Arrow? There is a pandas-related blog post: https://arrow.apache.org/blog/2017/07/26/spark-arrow/ from an IBM engineer.

What is the difference between org.apache.arrow.vector.VectorSchemaRoot and org.apache.arrow.vector.types.pojo.Schema?

// This is derived from the Parquet schema
this.arrowSchema = new Schema(childrenBuilder.build(), null);

then

// This takes the schema + the root allocator
this.arrowVectorSchemaRoot = VectorSchemaRoot.create(this.arrowSchema, this.ra);
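
So my reading: the Schema is pure metadata (an ordered list of Fields), while the VectorSchemaRoot pairs that Schema with one actually-allocated ValueVector per Field. A small sketch (imports as in the writer sketch above):

// Schema: metadata only, no memory allocated yet.
Schema schema = new Schema(Collections.singletonList(
    new Field("myInt", new FieldType(true, new ArrowType.Int(32, true), null), null)));

// VectorSchemaRoot: the schema plus one allocated vector per field.
try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
     VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
  IntVector v = (IntVector) root.getVector("myInt");    // the backing vector for "myInt"
  System.out.println(root.getSchema().equals(schema));  // true: the root carries the schema
}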

Parquet and Arrow formats

The way things are written right now, the Arrow blocks mirror the Parquet layout. So here is how many row groups we get from Parquet:

Test prep finished, starting the execution now ...
 	 setting up row count as 1441650
 	 setting up row count as 1441650
 	 setting up row count as 1441650
 	 setting up row count as 1441650
 	 setting up row count as 1441650
 	 setting up row count as 1441650
 	 setting up row count as 1441650
 	 setting up row count as 1440100
 	 setting up row count as 1441650
 	 setting up row count as 1441650
 	 setting up row count as 1441650
 	 setting up row count as 1441650
 	 setting up row count as 702149
	 [0] totalRows: 18000399 || ints: 172712190 , long 18000399 , float4 0 , double 206280767 , binary 0 binarySize 0 || runtimeInNS 26408439683 , totalBytesProcessed 2485098088 , bandwidth 0.75 Gbps.

Here is what the Arrow reader side looks like. (Who sets these 13 blocks? Per writeRecordBatch above, the writer records one ArrowBlock per writeBatch call, so the block count mirrors the Parquet row groups.)

Number of arrow block are : 13
	 arrow row count is 1441650
	 arrow row count is 1441650
	 arrow row count is 1441650
	 arrow row count is 1441650
	 arrow row count is 1441650
	 arrow row count is 1441650
	 arrow row count is 1441650
	 arrow row count is 1440100
	 arrow row count is 1441650
	 arrow row count is 1441650
	 arrow row count is 1441650
	 arrow row count is 1441650
	 arrow row count is 702149
	 [0] totalRows: 18000399 || ints: 172712190 , long 18000399 , float4 0 , double 206280767 , binary 0 binarySize 0 || runtimeInNS 4545486155 , totalBytesProcessed 2485098088 , bandwidth 4.37 Gbps.
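
A minimal sketch of the reading side (the file name is an assumption; ArrowFileReader, getRecordBlocks(), and loadNextBatch() are the real API, though older Arrow versions kept the reader under org.apache.arrow.vector.file):

import java.io.FileInputStream;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;

public class ReaderSketch {
  public static void main(String[] args) throws Exception {
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         FileInputStream file = new FileInputStream("/tmp/example.arrow");
         ArrowFileReader reader = new ArrowFileReader(file.getChannel(), allocator)) {
      VectorSchemaRoot root = reader.getVectorSchemaRoot();
      System.out.println("blocks: " + reader.getRecordBlocks().size());  // e.g., 13 above
      while (reader.loadNextBatch()) {     // loads one ArrowBlock at a time into the root
        System.out.println("row count " + root.getRowCount());
      }
    }
  }
}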

Best possible split size?

this.arrowVectorSchemaRoot.setRowCount(int) takes an int, which means I cannot save 100 billion rows in a single block. So how do I know what is the best possible split size?
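
The obvious workaround is to split the logical row count into int-sized batches; a hedged sketch (the 64K batch size is an arbitrary choice, not a recommendation; root and writer are as in the writer sketch above):

long totalRows = 100_000_000_000L;        // 100 billion logical rows
int batchSize = 64 * 1024;                // arbitrary per-block size; must fit in an int
for (long start = 0; start < totalRows; start += batchSize) {
  int n = (int) Math.min(batchSize, totalRows - start);
  // ... populate the vectors for rows [start, start + n) ...
  root.setRowCount(n);
  writer.writeBatch();                    // each call emits one ArrowBlock
}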