code notes - animeshtrivedi/notes GitHub Wiki
Field
/**
* ----------------------------------------------------------------------
* A field represents a named column in a record / row batch or child of a
* nested type.
*
* - children is only for nested Arrow arrays
* - For primitive types, children will have length 0
* - nullable should default to true in general
*/
public final class Field extends Table
Field carries the layout details, with entries such as BIT, DATA, OFFSET, LENGTH. As I understand it, this dictates how the data is encoded and stored. For example, non-nullable content will NOT have a BIT map, and only variable-length types such as VARBINARY have OFFSET and LENGTH. The corresponding data locations can then be acquired from the FieldVector class.
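To make the layout idea concrete, here is a minimal sketch of how a nullable variable-length column could be split into a validity bitmap, an offsets array, and a data buffer. It is a toy model in plain Java, not Arrow's classes: the name VarWidthLayoutSketch and its fields are hypothetical, and it assumes one validity bit per value (least significant bit first) and valueCount + 1 offsets.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Toy model of a nullable variable-width column (hypothetical, not an Arrow class).
class VarWidthLayoutSketch {
    final byte[] validity; // one bit per value, LSB first; 1 means "not null"
    final int[] offsets;   // valueCount + 1 entries pointing into data
    final byte[] data;     // all value bytes, concatenated

    VarWidthLayoutSketch(String[] values) {
        int n = values.length;
        validity = new byte[(n + 7) / 8];
        offsets = new int[n + 1];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < n; i++) {
            if (values[i] != null) {
                validity[i / 8] |= (byte) (1 << (i % 8)); // mark the slot as set
                byte[] bytes = values[i].getBytes(StandardCharsets.UTF_8);
                out.write(bytes, 0, bytes.length);
            }
            // A null value occupies no data bytes, so its end offset equals its start.
            offsets[i + 1] = out.size();
        }
        data = out.toByteArray();
    }

    boolean isNull(int i) {
        return (validity[i / 8] & (1 << (i % 8))) == 0;
    }

    String get(int i) {
        if (isNull(i)) {
            return null;
        }
        int start = offsets[i];
        int length = offsets[i + 1] - start;
        return new String(data, start, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        VarWidthLayoutSketch col = new VarWidthLayoutSketch(new String[] {"ab", null, "cde"});
        System.out.println(java.util.Arrays.toString(col.offsets));
        System.out.println(col.isNull(1) + " " + col.get(2));
    }
}
```

A non-nullable column in this model would simply drop the validity array, and a fixed-width column would drop the offsets, which matches the note above about which buffers a given type needs.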
ValueVector:
/**
* An abstraction that is used to store a sequence of values in an individual column.
*
* A {@link ValueVector value vector} stores underlying data in-memory in a columnar fashion that is compact and
* efficient. The column whose data is stored, is referred by {@link #getField()}.
*
* It is important that vector is allocated before attempting to read or write.
*
* There are a few "rules" around vectors:
*
* <ul>
* <li>values need to be written in order (e.g. index 0, 1, 2, 5)</li>
* <li>null vectors start with all values as null before writing anything</li>
* <li>for variable width types, the offset vector should be all zeros before writing</li>
* <li>you must call setValueCount before a vector can be read</li>
* <li>you should never write to a vector once it has been read.</li>
* </ul>
*
* Please note that the current implementation doesn't enforce those rules, hence we may find few places that
* deviate from these rules (e.g. offset vectors in Variable Length and Repeated vector)
*
* This interface "should" strive to guarantee this order of operation:
* <blockquote>
* allocate > mutate > setvaluecount > access > clear (or allocate to start the process over).
* </blockquote>
*/
public interface ValueVector extends Closeable, Iterable<ValueVector>
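The javadoc says the implementation does not enforce these rules. As a way to internalize the intended order (allocate > mutate > setValueCount > access > clear), here is a small toy vector that does enforce it. This is a sketch in plain Java, not Arrow code: the class LifecycleSketch and its State enum are hypothetical, and only the method names are borrowed from the rules above.

```java
// Toy int vector that enforces the documented order of operations.
// Hypothetical illustration; Arrow's own vectors do not perform these checks.
class LifecycleSketch {
    enum State { CLEARED, ALLOCATED, READABLE }

    private State state = State.CLEARED;
    private int[] values = new int[0];
    private int valueCount = 0;

    // Step 1: allocate before any read or write.
    void allocateNew(int capacity) {
        values = new int[capacity];
        valueCount = 0;
        state = State.ALLOCATED;
    }

    // Step 2: mutate; not allowed before allocation or after the vector became readable.
    void set(int index, int value) {
        if (state != State.ALLOCATED) {
            throw new IllegalStateException("set() requires an allocated vector that has not been read");
        }
        values[index] = value;
    }

    // Step 3: fix the number of values; this makes the vector readable.
    void setValueCount(int count) {
        if (state != State.ALLOCATED) {
            throw new IllegalStateException("setValueCount() requires an allocated vector");
        }
        valueCount = count;
        state = State.READABLE;
    }

    // Step 4: access; only after setValueCount.
    int get(int index) {
        if (state != State.READABLE) {
            throw new IllegalStateException("get() requires setValueCount() first");
        }
        if (index < 0 || index >= valueCount) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return values[index];
    }

    // Step 5: clear; allocateNew may then start the cycle over.
    void clear() {
        values = new int[0];
        valueCount = 0;
        state = State.CLEARED;
    }

    public static void main(String[] args) {
        LifecycleSketch v = new LifecycleSketch();
        v.allocateNew(4);
        v.set(0, 10);
        v.set(1, 20);
        v.setValueCount(2);
        System.out.println(v.get(0) + v.get(1));
        v.clear();
    }
}
```

With this model, writing after setValueCount or reading before it fails fast, which is exactly the class of mistake the real interface leaves to the caller.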
ValueVector is also implemented by IntVector (and the vectors for the other data types) and extended by FieldVector:
/**
* A vector corresponding to a Field in the schema
* It has inner vectors backed by buffers (validity, offsets, data, ...)
*/
public interface FieldVector extends ValueVector
There are a bunch of Writers and Readers around
public interface BaseWriter extends AutoCloseable, Positionable
public interface BaseReader extends Positionable
How do I use them?
ArrowBlock does not have much in its definition:
public class ArrowBlock implements FBSerializable {
private final long offset;
private final int metadataLength;
private final long bodyLength;
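The three fields are enough to locate a block inside a file. Below is a minimal sketch of that arithmetic, assuming the body starts directly after the metadata (that is, metadataLength already includes any padding). BlockSketch is a hypothetical stand-in with the same three fields, not the real ArrowBlock class.

```java
// Hypothetical stand-in for ArrowBlock, used only to show the position arithmetic.
class BlockSketch {
    final long offset;        // where the block starts in the file
    final int metadataLength; // size of the message metadata
    final long bodyLength;    // size of the buffers that follow the metadata

    BlockSketch(long offset, int metadataLength, long bodyLength) {
        this.offset = offset;
        this.metadataLength = metadataLength;
        this.bodyLength = bodyLength;
    }

    // Assumption: the body begins right after the metadata.
    long bodyStart() {
        return offset + metadataLength;
    }

    // First byte position after this block.
    long end() {
        return offset + metadataLength + bodyLength;
    }

    public static void main(String[] args) {
        BlockSketch block = new BlockSketch(8, 136, 1024);
        System.out.println(block.bodyStart() + " " + block.end());
    }
}
```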
ArrowReader is also a DictionaryProvider, and is in turn extended by ArrowFileReader and ArrowStreamReader:
public abstract class ArrowReader<T extends ReadChannel> implements DictionaryProvider, AutoCloseable
VectorSchemaRoot is where all the schema- and buffer-related objects are acquired.
Is the row count tracked on a block-by-block basis? Yes. The loadNextBatch function in ArrowReader first resets the count with root.setRowCount(0); and then loads the next batch, which I presume sets the row count:
public Boolean visit(ArrowRecordBatch message) {
try {
loader.load(message);
There is a VectorLoader class that loads the record batch into the root's vectors and sets the row count.
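The per-batch row count can be modeled with a toy reader: each call to loadNextBatch resets the count to zero, loads the next batch if there is one, and sets the count from that batch. This is a sketch under those assumptions in plain Java, not the Arrow API; BatchLoopSketch and countRows are hypothetical names, and a batch is reduced to a single int[] column.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy reader that mimics the "reset, then load, then set row count" pattern.
// Hypothetical illustration only; real batches hold many vectors, not one int[].
class BatchLoopSketch {
    private final Iterator<int[]> batches;
    private int[] column = new int[0];
    private int rowCount = 0;

    BatchLoopSketch(List<int[]> batches) {
        this.batches = batches.iterator();
    }

    boolean loadNextBatch() {
        rowCount = 0;                // reset first, as loadNextBatch does on the root
        if (!batches.hasNext()) {
            return false;            // no more batches; the row count stays 0
        }
        column = batches.next();     // "load" the batch
        rowCount = column.length;    // the loader sets the row count for this batch
        return true;
    }

    int getRowCount() {
        return rowCount;
    }

    // Total rows must be accumulated by the caller, one batch at a time.
    static long countRows(List<int[]> batches) {
        BatchLoopSketch reader = new BatchLoopSketch(batches);
        long total = 0;
        while (reader.loadNextBatch()) {
            total += reader.getRowCount();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countRows(Arrays.asList(new int[3], new int[2])));
    }
}
```

The point of the model is that the row count only ever describes the batch currently loaded, so a total has to be summed across calls.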