# Arrow
- If you get Netty errors about missing functions, delete the old Netty jars.
The cmake configuration used to build Arrow C++:
```
cmake -DCMAKE_BUILD_TYPE=Release -DARROW_USE_SSE=ON -DCMAKE_INSTALL_PREFIX=/home/atr/local/ ..
```
My local changes to the Arrow C++ tree:
```
atr@flex13:~/zrl/github/external/apache/arrow$ git diff
diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
index 43215b63..284bc97d 100644
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@@ -28,6 +28,8 @@ string(REGEX MATCH
project(arrow VERSION "${ARROW_BASE_VERSION}")
+set(CMAKE_CXX_COMPILER g++)
+
set(ARROW_VERSION_MAJOR "${arrow_VERSION_MAJOR}")
set(ARROW_VERSION_MINOR "${arrow_VERSION_MINOR}")
set(ARROW_VERSION_PATCH "${arrow_VERSION_PATCH}")
@@ -360,7 +362,7 @@ set(CMAKE_C_FLAGS "${CMAKE_CXX_FLAGS}")
string(REPLACE "-std=c++11" "" CMAKE_C_FLAGS ${CMAKE_C_FLAGS})
# Add C++-only flags, like -std=c++11
-set(CMAKE_CXX_FLAGS "${CXX_ONLY_FLAGS} ${CMAKE_CXX_FLAGS}")
+set(CMAKE_CXX_FLAGS "${CXX_ONLY_FLAGS} ${CMAKE_CXX_FLAGS} -O2 -Ofast -ffast-math -funroll-loops -march=native")
# ASAN / TSAN / UBSAN
if(ARROW_FUZZING)
@@ -386,6 +388,7 @@ if ("${ARROW_GENERATE_COVERAGE}")
endif()
endif()
+
# CMAKE_CXX_FLAGS now fully assembled
message(STATUS "CMAKE_CXX_FLAGS: ${CMAKE_CXX_FLAGS}")
diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc
index 6cd1546f..db2d5b30 100644
--- a/cpp/src/arrow/io/file.cc
+++ b/cpp/src/arrow/io/file.cc
@@ -376,7 +376,8 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer {
is_mutable_ = true;
} else {
prot_flags_ = PROT_READ;
- map_mode_ = MAP_PRIVATE; // Changes are not to be committed back to the file
+ std::cout << "Are we here, in mmap reading mode, I am changing it MAP_SHARED | MAP_POPULATE \n";
+ map_mode_ = MAP_SHARED | MAP_POPULATE; //PRIVATE; // Changes are not to be committed back to the file
RETURN_NOT_OK(file_->OpenReadable(path));
is_mutable_ = false;
atr@flex13:~/zrl/github/external/apache/arrow$
```
## Notes related to Apache Arrow
- Java docs: https://arrow.apache.org/docs/java/
- My example repo: https://github.com/animeshtrivedi/ArrowExample
## Round 2
- `io.netty.util.internal.PlatformDependent` is used to handle the Unsafe operations.
- `IntHolder` and `NullableIntHolder` are just wrapper classes that hold the value, the set flag, and the size together (quick sketch below). These kinds of holder classes are autogenerated.
- Once data is serialized, an `ArrowRecordBatch` becomes an `ArrowBlock`, which is nothing but an offset, a metadataLength, and a bodyLength. The `ArrowBlock`s are read at the beginning.
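A quick sketch of a holder round-trip (my own example, not from the Arrow sources; it assumes a recent Arrow Java release where the int vector class is `IntVector`, while older 0.x releases called it `NullableIntVector`):
```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.holders.NullableIntHolder;

public class HolderExample {
  public static void main(String[] args) {
    // A minimal holder round-trip: the holder bundles the value and its validity flag.
    try (RootAllocator ra = new RootAllocator(Long.MAX_VALUE);
         IntVector vector = new IntVector("ints", ra)) {
      vector.allocateNew(2);
      NullableIntHolder holder = new NullableIntHolder();
      holder.isSet = 1;            // 1 = value present, 0 = null
      holder.value = 42;
      vector.setSafe(0, holder);
      vector.setNull(1);
      vector.setValueCount(2);
      vector.get(0, holder);       // reads the value and validity back into the holder
      System.out.println("isSet=" + holder.isSet + " value=" + holder.value);
    }
  }
}
```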
The `ArrowBlock` bookkeeping happens in `ArrowWriter`:
```java
public void writeBatch() throws IOException {
  ensureStarted();
  try (ArrowRecordBatch batch = unloader.getRecordBatch()) {
    writeRecordBatch(batch);
  }
}

protected void writeRecordBatch(ArrowRecordBatch batch) throws IOException {
  ArrowBlock block = MessageSerializer.serialize(out, batch);
  LOGGER.debug(String.format("RecordBatch at %d, metadata: %d, body: %d",
      block.getOffset(), block.getMetadataLength(), block.getBodyLength()));
  recordBlocks.add(block);
}
```
What is the difference between a `Field` and a `MinorType`? How are values and bitmaps encoded?
`ValueVector.setInitialCapacity(int value)` sets the value buffer size to `value` entries, and the validity buffer size to `value/8.0` bytes (one validity bit per value), as shown in `BaseValueVector`:
```java
/* number of bytes for the validity buffer for the given valueCount */
protected static int getValidityBufferSizeFromCount(final int valueCount) {
  return (int) Math.ceil(valueCount / 8.0);
}
```
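So 1,000 values need `ceil(1000/8.0)` = 125 validity bytes. A small sketch to check both buffer sizes (my own example; note that the allocator may round capacities up to the next power of two, so the printed values can exceed the minimum):
```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class CapacityExample {
  public static void main(String[] args) {
    try (RootAllocator ra = new RootAllocator(Long.MAX_VALUE);
         IntVector vector = new IntVector("ints", ra)) {
      vector.setInitialCapacity(1000); // room for 1000 values
      vector.allocateNew();
      // minimum sizes: data = 1000 * 4 bytes, validity = ceil(1000/8.0) = 125 bytes;
      // the allocator may round both up to a power of two.
      System.out.println("data bytes: " + vector.getDataBuffer().capacity());
      System.out.println("validity bytes: " + vector.getValidityBuffer().capacity());
    }
  }
}
```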
## ArrowWriter, ArrowFileWriter, and ArrowStreamWriter (and associated readers)

There is an `ArrowWriter` abstract class that is implemented by `ArrowFileWriter` and `ArrowStreamWriter`. I assume the file writer can seek and do footer checks, while the stream writer is more concerned with network/RPC-style data transfer.
The general pattern of work is:
```
this.arrowFileWriter.start();
loop {
    this.arrowFileWriter.writeBatch();
    this.arrowFileWriter.writeBatch();
    ...
    this.arrowFileWriter.writeBatch();
}
this.arrowFileWriter.end();
this.arrowFileWriter.close();
```
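Here is a minimal end-to-end sketch of that pattern (my own example: the file name and the single int column are made up; the package paths match recent Arrow releases, older 0.x releases kept the writer under `org.apache.arrow.vector.file`):
```java
import java.io.FileOutputStream;
import java.util.Arrays;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

public class WriterExample {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema(Arrays.asList(
        new Field("a", FieldType.nullable(new ArrowType.Int(32, true)), null)));
    try (RootAllocator ra = new RootAllocator(Long.MAX_VALUE);
         VectorSchemaRoot root = VectorSchemaRoot.create(schema, ra);
         FileOutputStream file = new FileOutputStream("example.arrow");
         ArrowFileWriter writer = new ArrowFileWriter(root, null, file.getChannel())) {
      writer.start();
      for (int batch = 0; batch < 3; batch++) {   // each iteration emits one ArrowBlock
        IntVector a = (IntVector) root.getVector("a");
        a.allocateNew(10);
        for (int i = 0; i < 10; i++) a.setSafe(i, batch * 10 + i);
        root.setRowCount(10);
        writer.writeBatch();
      }
      writer.end();
    }
  }
}
```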
What is the status of the Spark support for Arrow? There is a pandas/PySpark blog post from IBM: https://arrow.apache.org/blog/2017/07/26/spark-arrow/
What is the difference between `org.apache.arrow.vector.VectorSchemaRoot` and `org.apache.arrow.vector.types.pojo.Schema`?
```java
// This is derived from the Parquet schema
this.arrowSchema = new Schema(childrenBuilder.build(), null);
// then, this takes the schema + the root allocator
this.arrowVectorSchemaRoot = VectorSchemaRoot.create(this.arrowSchema, this.ra);
```
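My understanding: a `Schema` is pure metadata, an ordered list of `Field`s (name, type, nullability), while a `VectorSchemaRoot` binds a schema to actual allocated `FieldVector`s plus a row count. A rough sketch, reusing the imports from the writer example above:
```java
// Schema: metadata only, no buffers; it can be serialized (e.g. schema.toJson()).
Schema schema = new Schema(Arrays.asList(
    new Field("a", FieldType.nullable(new ArrowType.Int(32, true)), null)));

// VectorSchemaRoot: the schema plus one FieldVector per field, backed by memory
// from the allocator; this is the unit that writeBatch()/loadNextBatch() operate on.
try (RootAllocator ra = new RootAllocator(Long.MAX_VALUE);
     VectorSchemaRoot root = VectorSchemaRoot.create(schema, ra)) {
  System.out.println(root.getSchema());              // same fields as `schema`
  System.out.println(root.getFieldVectors().size()); // 1 vector, ready to be filled
}
```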
## Parquet and Arrow formats
The way things are written right now, the Arrow blocks mirror how the Parquet file was laid out. Here is how many row groups we get from Parquet:
```
Test prep finished, starting the execution now ...
setting up row count as 1441650
setting up row count as 1441650
setting up row count as 1441650
setting up row count as 1441650
setting up row count as 1441650
setting up row count as 1441650
setting up row count as 1441650
setting up row count as 1440100
setting up row count as 1441650
setting up row count as 1441650
setting up row count as 1441650
setting up row count as 1441650
setting up row count as 702149
[0] totalRows: 18000399 || ints: 172712190 , long 18000399 , float4 0 , double 206280767 , binary 0 binarySize 0 || runtimeInNS 26408439683 , totalBytesProcessed 2485098088 , bandwidth 0.75 Gbps.
```
Here is what the Arrow reader side looks like. Who sets the number of blocks? Each `writeBatch()` call records one `ArrowBlock` in the footer (see `writeRecordBatch` above), so there is one block per batch written:
```
Number of arrow block are : 13
arrow row count is 1441650
arrow row count is 1441650
arrow row count is 1441650
arrow row count is 1441650
arrow row count is 1441650
arrow row count is 1441650
arrow row count is 1441650
arrow row count is 1440100
arrow row count is 1441650
arrow row count is 1441650
arrow row count is 1441650
arrow row count is 1441650
arrow row count is 702149
[0] totalRows: 18000399 || ints: 172712190 , long 18000399 , float4 0 , double 206280767 , binary 0 binarySize 0 || runtimeInNS 4545486155 , totalBytesProcessed 2485098088 , bandwidth 4.37 Gbps.
```
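A minimal reader-side sketch for the file written above (my own example; `getRecordBlocks()` returns the `ArrowBlock` list that `writeRecordBatch` stored in the footer, which is where the 13 comes from):
```java
import java.io.FileInputStream;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;

public class ReaderExample {
  public static void main(String[] args) throws Exception {
    try (RootAllocator ra = new RootAllocator(Long.MAX_VALUE);
         FileInputStream file = new FileInputStream("example.arrow");
         ArrowFileReader reader = new ArrowFileReader(file.getChannel(), ra)) {
      // one ArrowBlock per writeBatch() call on the writer side
      System.out.println("Number of arrow blocks: " + reader.getRecordBlocks().size());
      VectorSchemaRoot root = reader.getVectorSchemaRoot();
      while (reader.loadNextBatch()) {   // loads one block into the root
        System.out.println("arrow row count is " + root.getRowCount());
      }
    }
  }
}
```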
## Best possible split size?

`VectorSchemaRoot.setRowCount()` takes an integer, so that means I cannot save 100 billion rows in a single block. So how do I know what is the best possible split size?
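A rough chunking sketch under that constraint (my own example; the batch size of 2^20 rows is an arbitrary choice of mine, not an Arrow recommendation, and `root`/`writer` are set up as in the writer sketch above):
```java
static void writeChunked(VectorSchemaRoot root, ArrowFileWriter writer, long totalRows)
    throws IOException {
  final int batchSize = 1 << 20; // each block's row count must fit in an int
  for (long start = 0; start < totalRows; start += batchSize) {
    int rows = (int) Math.min((long) batchSize, totalRows - start);
    // ... fill the vectors for rows [start, start + rows) here ...
    root.setRowCount(rows);
    writer.writeBatch(); // one ArrowBlock per loop iteration
  }
}
```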