NMatrix Developer Guide

The goal of this page is to give a general overview of NMatrix for those who may want to hack on the code, and to make it easier for new contributors to jump right in. This is obviously still a work in progress, but will hopefully fill out over time.

Portions of this guide were taken from the NMatrix Red Book by John Woods.

Terminology

#### dtype

ext/nmatrix/nmatrix.h

The data type stored in a matrix, e.g., int8, int16, float64, rational128, Ruby object. These are generally represented by symbols, and include both the basic type and the number of bits in their names (except for Ruby objects, which vary system to system). These are represented by enum dtype_t in nmatrix.h.

#### stype

ext/nmatrix/nmatrix.h

The storage type of a matrix, such as dense, list-of-lists, or “new” Yale. We generally refer to these as simply dense, list, or yale. These are represented by enum stype_t in nmatrix.h.

#### itype

ext/nmatrix/nmatrix.h

Yale matrices have this property, which is similar to dtype but indicates the type of integer storage used for the matrix indices. The itype of a matrix is a function of the matrix’s shape. itype and dtype may not be interchanged in function pointer arrays except under very specific circumstances. These are represented by enum itype_t in nmatrix.h.

#### shape

The dimensions of a matrix, e.g., n-by-m, given in the order [rows, columns, …].

Namespaces and Prefixes

While most of NMatrix is now written in C++, it is important that compatibility be maintained with C. If a templated global C++ function needs to be accessed from C (e.g., for either the C API or the Ruby interface), the following conventions are used (a short sketch follows the list below):

  • The C++ function is placed in a namespace (e.g., namespace nm { }) or is declared static if possible. The C function receives the prefix nm_, e.g., nm_multiply() (this function also happens to be static).
  • C macros are capitalized and generally have the prefix NM_, as with NM_DTYPE().
  • C functions (and macros, for consistency) are placed within extern "C" { } blocks to turn off C++ name mangling.
  • C macros (in extern blocks) may represent C++ constants (which are always defined in namespace nm {} or a child thereof).
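
As a rough illustration of these conventions (the names bar and nm_bar are hypothetical, not functions from the NMatrix source; only dtype_t is a real NMatrix type):

```cpp
// Hypothetical sketch of the naming conventions above.
#include <cstddef>
#include "nmatrix.h"   // assumed to provide dtype_t (ext/nmatrix/nmatrix.h)

namespace nm {
  // Templated C++ implementation lives inside the nm namespace.
  template <typename DType>
  void bar(DType* elements, size_t count) { /* ... */ }
}

extern "C" {
  // The C-visible wrapper gets the nm_ prefix and is not name-mangled.
  void nm_bar(void* elements, size_t count, dtype_t dtype);

  // C macros are capitalized and prefixed with NM_, e.g. NM_DTYPE(val).
}
```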

Components

Dense matrices

This row-major, contiguous format is the most basic and fully developed of the NMatrix types. At one dimension, these are used to represent vectors; with two or more dimensions, they are the default NMatrix type.
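
For intuition, here is what “row-major, contiguous” means for a two-dimensional matrix (an illustrative helper, not code from the NMatrix source):

```cpp
#include <cstddef>

// Offset of element (i, j) in a row-major, contiguous 2-D store with
// `cols` columns: rows are laid out one after another in a single buffer.
size_t dense_offset(size_t i, size_t j, size_t cols) {
  return i * cols + j;
}
```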

List matrices

List matrices are n-dimensional sparse types represented by singly-linked lists of key-value pairs. The lists are sorted by key. Non-existent keys indicate “empty” cells/rows/columns.
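
A rough sketch of that structure (illustrative only; the actual node and list types in the list storage code differ in detail):

```cpp
#include <cstddef>

// One key/value pair in a sorted singly-linked list. For an outer dimension,
// val points to another list; for the innermost dimension, it points to an
// element. Keys missing from the list are "empty".
typedef struct ll_node {
  size_t          key;
  void*           val;
  struct ll_node* next;  // next stored key, in ascending order
} ll_node;

typedef struct {
  ll_node* first;
} ll_list;
```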

Empty need not mean all-zeros; list matrices allow the user to set the default value, so you may have a list matrix consisting almost entirely of 1s, consuming minimal space.

List is generally used for the process of constructing matrices, which are then cast to either yale or dense.

Yale matrices

There are actually two types of Yale matrices, “old” and “new.” Most sparse matrix algorithms can handle either. We use the latter.

##### “Old” Yale

This is the format used by SciPy and MATLAB. Briefly, three arrays are required: IA, JA, and A. A contains the non-zero entries of the matrix in row-major order. IA stores the indices of the first entry in each row; in other words, IA[1] - IA[0] gives the number of non-zero entries in the first row, IA[2] - IA[1] gives the number of non-zero entries in the second row, etc.

The entries in JA correspond one-to-one with the entries in A, giving the column index of each entry. Thus, IA[0] points not only to the first value in A and the first non-zero entry in the matrix, but also to the first stored column index in JA.

The last value in IA gives the number of non-zero entries in the matrix.
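
As a concrete illustration (not taken from the NMatrix source), here is an “old” Yale layout for a small matrix, with each row’s entries kept in column order:

```cpp
// The 4x4 matrix
//
//   [ 1 0 2 0 ]
//   [ 0 3 0 0 ]
//   [ 0 0 0 4 ]
//   [ 0 5 0 6 ]
//
// stored in "old" Yale form. A holds the non-zeros in row-major order,
// JA holds the column index of each entry of A, and IA[i] is the index in
// A/JA where row i begins; the final IA entry is the total non-zero count.
int    IA[] = { 0, 2, 3, 4, 6 };
int    JA[] = { 0, 2, 1, 3, 1, 3 };
double A[]  = { 1, 2, 3, 4, 5, 6 };
```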

In theory, rows need not be sorted by column; however, if they are sorted, retrieval is O(log2(n)) instead of O(n).

##### “New” Yale

“New” Yale is based on the observation that diagonal entries are accessed more frequently than non-diagonal entries. Thus, diagonal entries are stored separately and statically; an n-by-m matrix will therefore always have at least n+1 entries, and these diagonal entries can be accessed in constant time.

In theory, “new” Yale matrices may have a non-zero default value; however, most or all published sparse algorithms depend upon a zero default, so NMatrix expects and requires a zero default.

“New” Yale combines the IA and JA arrays into a single IJA array. The IA portion is always n entries long, corresponding exactly to the D (diagonal) portion of A (which is stored at the beginning of A). Thus, A[1] gives the [1,1] position in the matrix, and IJA[1] gives the index of the first non-diagonal non-zero in the second row.
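
Since D sits at the front of A, reading a diagonal entry is a single array access, as in the following sketch (an illustrative helper, not an accessor from the Yale storage code):

```cpp
#include <cstddef>

// Diagonal entry of row i in a "new" Yale matrix: the D portion occupies
// the first entries of A, so no search through IJA is needed.
template <typename DType>
DType yale_diagonal(const DType* A, size_t i) {
  return A[i];
}
```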

Separating D from the rest of A is a zero location, which determines the representation of 0 used by the matrix. As mentioned, this could theoretically be a non-zero value; but that would mess up sparse matrix multiplication, among other things.

Separating IA from JA in IJA is NDNZ, the number of non-diagonal non-zero entries in the matrix. This is the same value that terminates the IA array in “old” Yale.

What happens if the matrix is not square? First, remember that the IA portion of IJA and the D portion of A must line up. Second, remember that you need an IA entry for each row. A “tall” matrix (i.e., one with more rows than columns) will therefore have at least one undefined entry in D.

What about a “short” matrix? That is, what if you have three rows and five columns? Then the A and IJA arrays will look very similar to a square 3x3 matrix, with only three positions for the diagonal and IA arrays.

If you want to see how a Yale matrix is stored, try creating one in your favorite Ruby console, and then call #extend NMatrix::YaleFunctions. That will load up a bevy of debugging functions for Yale, such as yale_ija, yale_ia, yale_ja, yale_d, yale_lu (for lower and upper), yale_a, and so on. You can then view examples of “new” Yale storage really easily.

Template tables

While compilers typically optimize switch() statements well, we decided that they were awkward to write over and over again. Copying NArray, we make use of function pointer arrays, indexed on enumerators, in place of conditionals. These arrays are declared, statically and locally, by macros (most of which are defined in data/data.h).

For example, let’s say you want to call a templated function which varies for each dtype. The C++ function might look like this:

```cpp
template <typename DType>
void* foo(void* data_ptr) {
  // Interpret the opaque pointer as the concrete element type.
  DType data = *reinterpret_cast<DType*>(data_ptr);
  // Allocate room for one DType with ruby's ALLOC_N (note the cast, since
  // ALLOC_N(char, ...) yields a char*).
  DType* ret = reinterpret_cast<DType*>(ALLOC_N(char, sizeof(DType)));
  *ret = data + 4;
  return reinterpret_cast<void*>(ret);
}
```

Because this C++ function is going to be called from C, the signature needs to look the same for foo<int8_t> as it does for foo<Complex64>. Therefore, we use void*.

Note: At some point in the future, these should all be converted to anonymous structs or something similar.

If we weren’t using a C accessor, we could define it like a normal C++ template (e.g., template <typename DType> DType foo(const DType& data_ptr) { … }). However, we would then need to place it in a header file in order to prevent run-time unresolved symbol errors.

The C accessor for foo could look like this:

```cpp
static void* foo(void* data_ptr, dtype_t dtype) {
  // Declares a static local array `ttable` of foo<T> instantiations, indexed by dtype.
  DTYPE_TEMPLATE_TABLE(nm::foo, void*, void* data_ptr)
  return ttable[dtype](data_ptr);
}
```

and, of course, it must go in an extern "C" {} block.

If you want to name the table something other than ttable, use NAMED_DTYPE_TEMPLATE_TABLE(my_name, ...) instead.
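
A hedged sketch of the named variant, assuming it takes the table name followed by the same arguments as DTYPE_TEMPLATE_TABLE (check data/data.h for the exact signature):

```cpp
static void* foo_with_named_table(void* data_ptr, dtype_t dtype) {
  NAMED_DTYPE_TEMPLATE_TABLE(my_table, nm::foo, void*, void* data_ptr)
  return my_table[dtype](data_ptr);
}
```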

It’s worth looking at how these template table macros are defined (see data/data.h). Each template parameter means another dimension of function pointers, so these tables grow exponentially with the number of parameters. As a result, we forbid use of more than three template arguments in template tables. You will generally only see three for functions that involve a yale matrix, which has to be templated on both itype and dtype.

Common Template Tables:

  • NAMED_DTYPE_TEMPLATE_TABLE: unary operations on dense and list matrices (depending only on dtype).
  • NAMED_LR_DTYPE_TEMPLATE_TABLE: binary operations on dense and list matrices (like left_matrix == right_matrix).
  • NAMED_OP_LR_DTYPE_TEMPLATE_TABLE: element-wise binary operations on dense and list matrices (like left_matrix + right_matrix).
  • NAMED_LI_DTYPE_TEMPLATE_TABLE: unary operations for Yale matrices, which must be indexed on itype and dtype.
  • NAMED_LRI_DTYPE_TEMPLATE_TABLE: binary operations for Yale matrices, which must be indexed on itype and on the left and right dtypes. This one actually has a bug: it assumes that both the left and right matrices have the same itype. Since itype depends only on shape, it’s possible for the left and right matrices to actually have different itypes.
  • NAMED_OP_LR_DTYPE_TEMPLATE_TABLE: binary element-wise operations for Yale matrices. These are guaranteed to have the same itype, so the template tables only need to be indexed on left and right dtype and the operator (e.g., +).

Interacting with the ruby garbage collector

#### Background

From within C/C++ code, ruby exposes objects as the typedef VALUE, which as of this writing is an unsigned long. Internally, this might be just the (slightly modified) value of a number (for things like fixnums), or a pointer to a ruby data structure cast to an integer.

The ruby garbage collector is a mark and sweep type collector. For ruby programs (ignoring C extensions for the moment), this means, in the simplest sense, that the interpreter keeps track of all objects in existence, and when garbage collection runs, a first pass marks all objects that are accessible from the code, and a second pass frees up all the objects that are no longer accessible. In a C extension, this might cause problems: the C code might use ruby VALUEs that aren't accessible from the ruby side (and thus wouldn't be marked), but still shouldn't be freed up. There are a number of mechanisms in place to prevent this problem:

  • Garbage collection only runs once C code has returned to ruby, or during a call to a ruby C API method. This means that you don't need to worry about something like a dedicated GC thread starting garbage collection at any arbitrary point in your function; only defined points can be problematic.

  • Data_Wrap_Struct: this is a ruby C API method that is used to wrap a C struct in a ruby VALUE (see ruby's README.ext for more details). It allows you to pass a marking function and a freeing function. If the ruby garbage collector marks this ruby VALUE, then the marking function will be called. By creating an appropriate marking function, it's possible to mark VALUEs hidden in the C struct and prevent them from being garbage collected. For NMatrix, this mechanism is key to the implementation of object-dtype NMatrix objects. (A sketch of this pattern appears after this list.)

  • Ruby checks the stack for VALUEs and pointers to VALUEs still in use by your C code. This is pretty neat. If, for instance, your code looks like this:

```cpp
VALUE x = ...;
rb_call_some_c_api_method();
return x;
```

Then ruby should see x on the stack and make sure not to garbage collect it during that API call. The same is true if x is a VALUE* pointing to some VALUE(s) on the heap.
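
To make the Data_Wrap_Struct mechanism concrete, here is a hedged sketch of wrapping a struct that hides VALUEs; the struct and functions are illustrative, not NMatrix's actual implementation:

```cpp
#include <ruby.h>

typedef struct {
  size_t count;
  VALUE* elements;  // ruby objects invisible to ruby's stack scanning
} wrapped_t;

// Called during the GC mark phase whenever the wrapper VALUE is marked.
static void wrapped_mark(void* p) {
  wrapped_t* w = (wrapped_t*)p;
  for (size_t i = 0; i < w->count; ++i)
    rb_gc_mark(w->elements[i]);  // keep the hidden VALUEs alive
}

// Called when the wrapper VALUE itself is garbage collected.
static void wrapped_free(void* p) {
  wrapped_t* w = (wrapped_t*)p;
  xfree(w->elements);
  xfree(w);
}

static VALUE wrap(VALUE klass, wrapped_t* w) {
  return Data_Wrap_Struct(klass, wrapped_mark, wrapped_free, w);
}
```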

The problem

Two cases aren't sufficiently dealt with by these mechanisms.

  • You have a pointer on the stack to some struct that internally contains VALUEs, but you don't have a pointer to those VALUEs (or the VALUEs themselves) on the stack, and you want to make a ruby C API call. This could be solved simply by putting the VALUEs on the stack before the API call, were it not for the second problem.

  • Optimizing compilers. If you're running the compiler with any optimizations turned on, it's hard to guarantee that any particular VALUE is actually on the stack when you need it to be. Given that NMatrix is a library for scientific computing, in which it's common to be CPU-limited, turning off optimizations is not ideal.

The typical solution

The typical solution to the problem of the optimizing compiler is to mark VALUEs as volatile, a keyword that (simplistically) indicates that some code that the compiler doesn't know about (whether hardware, another thread, etc.) might interact with the variable declared volatile. This generally means that the compiler won't optimize volatile variables out because there might be some unintended side effect.

To solve the problem using volatile:

  • Find everywhere there's a call to a ruby API method (or a call to an NMatrix method that calls a ruby API method, etc.).
  • Before each call, ensure that all VALUEs in use by the code (whether declared directly or as part of a struct, etc.) are stored in a volatile variable on the stack, as in the sketch below.
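
A minimal sketch of that approach (illustrative only; as explained below, NMatrix does not actually do this):

```cpp
#include <ruby.h>

VALUE build_pair(VALUE a, VALUE b) {
  // Copying into volatile locals is meant to keep the VALUEs on the stack,
  // where ruby's conservative scan can see them if rb_ary_new() triggers GC.
  volatile VALUE va = a;
  volatile VALUE vb = b;
  VALUE ary = rb_ary_new();
  rb_ary_push(ary, va);
  rb_ary_push(ary, vb);
  return ary;
}
```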

However, it's not completely clear whether this will prevent all optimizations that would cause issues with the garbage collector. Even if volatile does prevent all problematic optimizations, it's not clear that this is desirable from a performance perspective (however, more testing would be needed to figure this out). A reasonable interpretation of recent C++ specifications might also be that use of volatile is discouraged except for hardware interactions. Thus, just marking all VALUEs volatile is perhaps not ideal.

The NMatrix solution

NMatrix has chosen instead to use a static data structure that tracks all VALUEs currently in use by the code and that can tell the garbage collector to mark these at the appropriate time. (This structure is created in ruby_nmatrix.c in __nm_initialize_value_container and marked by __nm_mark_value_container.) This structure is currently just a linked list implementation of a stack of VALUEs in use. While this in theory is not great because it requires a linear-time search for a VALUE in order to remove it when it's no longer needed, in practice, this removal usually happens in the reverse order that VALUEs were added, so only the top of the stack needs to be checked.

In order for this solution to work, a function must ensure that any VALUEs in use are registered in this data structure before any ruby API call that could cause garbage collection. To do this, there are functions nm_register_value(VALUE&), nm_register_values(VALUE*, size_t), nm_register_storage(stype_t, STORAGE*), and nm_register_nmatrix(NMATRIX*).

Here's an (admittedly contrived) example:

```cpp
VALUE* do_some_calculation() {
  VALUE result = some_other_function();
  nm_register_value(result);              // protect result across the allocation below
  VALUE* result_holder = NM_ALLOC(VALUE); // may trigger GC if memory is low
  *result_holder = result;
  nm_unregister_value(result);            // matching unregister once the danger has passed
  return result_holder;
}
```

In this example, with optimizations on, it's possible that result would be garbage collected during the NM_ALLOC macro, which just expands to the ruby ALLOC macro, which in turn expands to an allocation function that can run garbage collection if memory is low. The nm_register_value call ensures that result is safe during that ruby API call.

Note from this example that a call to nm_unregister_value is also required. All of the nm_register_ functions have a matching nm_unregister_ partner. This should be called after the last dangerous ruby API call to remove the VALUE from the registration data structure. Otherwise the VALUE will never be freed up during GC and will effectively be leaked. This is analogous to malloc and free, although one important difference is that by convention, a VALUE registered in a function should always be unregistered in that same function (since it's only necessary to maintain the registration during ruby API calls). This makes it easier to avoid memory leaks resulting from having to track registrations across function boundaries. For the current implementation of the registration structure (a stack), it's most efficient to unregister VALUEs in reverse order from their registrations, so if there's a choice, do this.
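
For instance, a function juggling several VALUEs might pair its registrations like this (a sketch using the functions listed above; the surrounding code is hypothetical):

```cpp
#include <ruby.h>
// nm_register_* / nm_unregister_* are declared in NMatrix's C/C++ headers.

void frob(VALUE* vals, size_t n, VALUE extra) {
  nm_register_values(vals, n);   // protect the whole array first
  nm_register_value(extra);
  // A ruby API call that could trigger GC while vals and extra are in use:
  rb_funcall(extra, rb_intern("to_s"), 0);
  nm_unregister_value(extra);    // unregister in reverse order of registration
  nm_unregister_values(vals, n);
}
```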

What happens when we're not sure?

The registration approach is complicated in that it's not always easy to know whether a given registration is required or not, since the appearance of bugs will depend on the precise timing of when GC runs and what compiler optimizations are happening. It's undesirable to have too many extra registrations because each registration imposes a slight performance decrease, and NMatrix is designed to be high-performance. For cases where we're not sure that a registration is needed or isn't needed, we've added a preprocessor macro NM_CONSERVATIVE, which just deletes a statement. For example:

```cpp
NM_CONSERVATIVE(nm_register_value(result));
```

would make this registration not happen in production code. This does make the code messier, though, so if you're totally sure a registration isn't needed, don't put it in at all.

For debugging purposes, you can turn back on all the NM_CONSERVATIVE statements by defining NM_GC_CONSERVATIVE during compilation. This is the first thing to try if you're seeing a bug where it looks like something is being garbage collected too soon. If this fixes it, it should then be possible to script going through all the NM_CONSERVATIVE statements and disable each one in turn to find which is the problematic one.

Also note that when marking a registration with NM_CONSERVATIVE, its partner unregistration should also be marked conservative. While unregistering a VALUE that wasn't registered won't cause a serious problem, it may incur a performance overhead from checking through all the registered VALUEs to see if any match.
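
So a conservatively-marked pair looks like this (a fragment mirroring the earlier contrived example):

```cpp
NM_CONSERVATIVE(nm_register_value(result));
VALUE* result_holder = NM_ALLOC(VALUE);        // the possibly-dangerous call
*result_holder = result;
NM_CONSERVATIVE(nm_unregister_value(result));
```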

NMatrix memory allocation macros

As an additional debugging tool, we've defined NMatrix-specific allocation macros that mirror the ruby ones. For example, NM_ALLOC(type) does the same thing as ruby's ALLOC(type). We've also added NM_FREE to replace xfree. Use these NMatrix-specific ones in NMatrix code. When debugging memory issues, this allows easier instrumentation of allocations and deallocations throughout the NMatrix code without having to worry about redefining ALLOC everywhere that its definition is included.
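
For example (a trivial sketch; NM_ALLOC and NM_FREE are the macros described above):

```cpp
VALUE* slot = NM_ALLOC(VALUE);  // behaves like ruby's ALLOC(VALUE)
*slot = Qnil;
// ...
NM_FREE(slot);                  // used in place of xfree(slot)
```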

Code style

We generally follow common ruby style conventions.

  • Use two spaces for indentation both in ruby and C/C++ code.
  • GitHub's Ruby Styleguide is generally pretty good. We just don't use TomDoc.
  • For tests, use rspec, preferring newer expect syntax over should. See Issue #150.
  • Documentation currently uses rdoc, but will eventually switch to YARD once C++ support is in place.

When in doubt, check nearby code and follow its example!

Testing

We aim for complete unit test coverage of NMatrix. If submitting a pull request, please include tests for whatever functionality you're adding. If you're fixing a bug, write a test that reveals the bug. If it's a new feature, in addition to fine-grained testing of the components, add tests for all the required functionality of the feature.
