code notes - animeshtrivedi/notes GitHub Wiki

Field

/**
 * ----------------------------------------------------------------------
 * A field represents a named column in a record / row batch or child of a
 * nested type.
 *
 * - children is only for nested Arrow arrays
 * - For primitive types, children will have length 0
 * - nullable should default to true in general
 */
public final class Field extends Table

Field carries details about the layout, with entries such as BIT, DATA, OFFSET, LENGTH, etc. By my understanding, this dictates how the data is encoded and stored. For example, non-nullable content will NOT have a BIT map, and only variable-length data types like VARBINARY have OFFSET and LENGTH. The corresponding data locations can then be acquired from the FieldVector class.
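To make the BIT / OFFSET / DATA idea concrete, here is a minimal sketch in plain Java of how a nullable variable-length column could be laid out in three buffers. The class `ToyVarBinaryColumn` and its methods are hypothetical and do not use the Arrow library; the real vectors keep these buffers in allocator-managed memory, so treat this only as an illustration of the encoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model (not Arrow API): one nullable variable-length column
// stored as a validity bitmap (BIT), an offsets buffer (OFFSET) and a data buffer (DATA).
final class ToyVarBinaryColumn {
    final byte[] validity; // one bit per value, 1 = value present
    final int[] offsets;   // valueCount + 1 entries pointing into data
    final byte[] data;     // all values concatenated back to back

    ToyVarBinaryColumn(List<String> values) {
        int n = values.size();
        validity = new byte[(n + 7) / 8];
        offsets = new int[n + 1];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < n; i++) {
            String v = values.get(i);
            if (v != null) {
                validity[i / 8] |= (byte) (1 << (i % 8)); // mark slot i as set
                byte[] bytes = v.getBytes(StandardCharsets.UTF_8);
                out.write(bytes, 0, bytes.length);
            }
            offsets[i + 1] = out.size(); // a null slot contributes zero bytes
        }
        data = out.toByteArray();
    }

    boolean isNull(int i) {
        return (validity[i / 8] & (1 << (i % 8))) == 0;
    }

    String get(int i) {
        if (isNull(i)) {
            return null;
        }
        int start = offsets[i];
        int length = offsets[i + 1] - start;
        return new String(data, start, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ToyVarBinaryColumn col = new ToyVarBinaryColumn(Arrays.asList("ab", null, "cde"));
        for (int i = 0; i < 3; i++) {
            System.out.println(i + " -> " + col.get(i));
        }
    }
}
```

A fixed-width column would need no offsets buffer in this model, because slot i can be located directly at i times the value width.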

ValueVector

/**
 * An abstraction that is used to store a sequence of values in an individual column.
 *
 * A {@link ValueVector value vector} stores underlying data in-memory in a columnar fashion that is compact and
 * efficient. The column whose data is stored, is referred by {@link #getField()}.
 *
 * It is important that vector is allocated before attempting to read or write.
 *
 * There are a few "rules" around vectors:
 *
 * <ul>
 * <li>values need to be written in order (e.g. index 0, 1, 2, 5)</li>
 * <li>null vectors start with all values as null before writing anything</li>
 * <li>for variable width types, the offset vector should be all zeros before writing</li>
 * <li>you must call setValueCount before a vector can be read</li>
 * <li>you should never write to a vector once it has been read.</li>
 * </ul>
 *
 * Please note that the current implementation doesn't enforce those rules, hence we may find few places that
 * deviate from these rules (e.g. offset vectors in Variable Length and Repeated vector)
 *
 * This interface "should" strive to guarantee this order of operation:
 * <blockquote>
 * allocate &gt; mutate &gt; setvaluecount &gt; access &gt; clear (or allocate to start the process over).
 * </blockquote>
 */
public interface ValueVector extends Closeable, Iterable<ValueVector>
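The Javadoc says the rules are not enforced by the current implementation. As a way to picture the intended order (allocate, mutate, setValueCount, access, clear), here is a small sketch of a vector that does enforce it with a state machine. `ToyIntVector` is a hypothetical stand-in written in plain Java, not the Arrow class; its method names are chosen to follow the steps named in the Javadoc.

```java
// Hypothetical toy vector (not Arrow API) that enforces the documented order:
// allocate > mutate > setValueCount > access > clear.
final class ToyIntVector {
    private enum State { CLEARED, ALLOCATED, READABLE }

    private State state = State.CLEARED;
    private int[] data;
    private int valueCount;

    void allocateNew(int capacity) {
        data = new int[capacity];
        valueCount = 0;
        state = State.ALLOCATED; // allocating again restarts the cycle
    }

    void set(int index, int value) {
        if (state != State.ALLOCATED) {
            throw new IllegalStateException("writes are only allowed between allocate and setValueCount");
        }
        data[index] = value;
    }

    void setValueCount(int count) {
        if (state != State.ALLOCATED) {
            throw new IllegalStateException("setValueCount requires an allocated vector");
        }
        valueCount = count;
        state = State.READABLE;
    }

    int get(int index) {
        if (state != State.READABLE) {
            throw new IllegalStateException("call setValueCount before reading");
        }
        if (index < 0 || index >= valueCount) {
            throw new IndexOutOfBoundsException("index " + index + ", valueCount " + valueCount);
        }
        return data[index];
    }

    void clear() {
        data = null;
        valueCount = 0;
        state = State.CLEARED;
    }

    public static void main(String[] args) {
        ToyIntVector v = new ToyIntVector();
        v.allocateNew(4);
        v.set(0, 7);
        v.set(1, 9);
        v.setValueCount(2);
        System.out.println(v.get(0) + ", " + v.get(1));
        v.clear();
    }
}
```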

ValueVector is extended by FieldVector and implemented by IntVector (and the vectors for the other data types).

/**
 * A vector corresponding to a Field in the schema
 * It has inner vectors backed by buffers (validity, offsets, data, ...)
 */
public interface FieldVector extends ValueVector

There are a bunch of Writers and Readers around:

public interface BaseWriter extends AutoCloseable, Positionable

public interface BaseReader extends Positionable

How do I use them?

ArrowBlock does not have much in its definition:

public class ArrowBlock implements FBSerializable {

  private final long offset;
  private final int metadataLength;
  private final long bodyLength;
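As far as I can tell, these three fields are enough to locate one record batch inside a file. The sketch below assumes that the metadata starts at `offset` and that the body follows it immediately; that layout is my assumption here, and `ToyBlock` is a hypothetical stand-in, not the Arrow class.

```java
// Hypothetical stand-in for the three ArrowBlock fields (not the Arrow class).
// Assumption: metadata occupies [offset, offset + metadataLength) and the
// body follows directly after it.
final class ToyBlock {
    final long offset;
    final int metadataLength;
    final long bodyLength;

    ToyBlock(long offset, int metadataLength, long bodyLength) {
        this.offset = offset;
        this.metadataLength = metadataLength;
        this.bodyLength = bodyLength;
    }

    long metadataStart() {
        return offset;
    }

    long bodyStart() {
        return offset + metadataLength;
    }

    // First byte position after this block.
    long end() {
        return offset + metadataLength + bodyLength;
    }

    public static void main(String[] args) {
        ToyBlock block = new ToyBlock(1024L, 256, 4096L);
        System.out.println("metadata at " + block.metadataStart()
                + ", body at " + block.bodyStart()
                + ", block ends at " + block.end());
    }
}
```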

ArrowReader is also a DictionaryProvider; it is extended by ArrowFileReader and ArrowStreamReader.

public abstract class ArrowReader<T extends ReadChannel> implements DictionaryProvider, AutoCloseable

VectorSchemaRoot is where all the schema- and buffer-related objects are acquired.

Is the row count maintained on a block-by-block basis? Yes. The loadNextBatch function in ArrowReader calls root.setRowCount(0) to reset the count and then loads the next batch, which I presume sets the row count.

public Boolean visit(ArrowRecordBatch message) {
        try {
          loader.load(message);

There is a VectorLoader class that loads "something" and sets the row count.
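The per-batch behaviour described above can be pictured with a small model: reset the row count, load the next batch if there is one, and let the load step set the new row count. `ToyRoot` and `ToyBatchReader` are hypothetical classes in plain Java that only mimic the flow as I understand it; they are not the Arrow implementation.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the root: one int column plus a row count.
final class ToyRoot {
    private int rowCount;
    private int[] column = new int[0];

    void setRowCount(int rowCount) {
        this.rowCount = rowCount;
    }

    int getRowCount() {
        return rowCount;
    }

    // Plays the role of the loader: installs the batch and sets the row count.
    void load(int[] batch) {
        column = batch;
        rowCount = batch.length;
    }

    int get(int row) {
        if (row < 0 || row >= rowCount) {
            throw new IndexOutOfBoundsException("row " + row + ", rowCount " + rowCount);
        }
        return column[row];
    }
}

// Hypothetical reader that hands out batches one at a time through a single root.
final class ToyBatchReader {
    private final Iterator<int[]> batches;
    private final ToyRoot root = new ToyRoot();

    ToyBatchReader(List<int[]> batches) {
        this.batches = batches.iterator();
    }

    ToyRoot getRoot() {
        return root;
    }

    boolean loadNextBatch() {
        root.setRowCount(0);          // reset before loading, as in the note above
        if (!batches.hasNext()) {
            return false;             // no more batches, row count stays 0
        }
        root.load(batches.next());    // loading sets the row count for this batch
        return true;
    }

    public static void main(String[] args) {
        ToyBatchReader reader = new ToyBatchReader(
                Arrays.asList(new int[] {1, 2, 3}, new int[] {4, 5}));
        ToyRoot root = reader.getRoot();
        long total = 0;
        while (reader.loadNextBatch()) {
            total += root.getRowCount();
        }
        System.out.println("total rows: " + total);
    }
}
```

In this model the row count always describes the batch currently held by the root, so a total across the file has to be accumulated by the caller.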
