Notes 2013.06.29

See: Primitive Operations


Register Machine

Kona Primitive Style: https://github.com/kevinlawler/kona/wiki/Primitive-operations

The point of the primitive style is to perform operations on vectors as fast as the physical device will allow. This is accomplished by operating on arrays of fixed-width types: items stored side by side in memory, with no intervening space, and correctly memory-aligned. In the current system, iterating over such storage with a plain C loop, perhaps optimized by -O3, is enough to go as fast as possible, which is the limit of how fast data can be retrieved from memory (the memory bottleneck). The tricky part is handling the iteration correctly for all types. The existing implementation essentially has a for loop for each possible combination, which becomes impractical once a few more base types are added.
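For concreteness, a minimal standalone sketch (not Kona code; the names are invented) of the kind of plain typed loop this means, and of why the combinations multiply once more base types exist: each pairing of element types wants its own loop body.

#include <stddef.h>

//one tight loop for the long long + long long case; -O3 vectorizes this
void add_JJ(const long long *x, const long long *y, long long *out, size_t n)
{
  for (size_t i = 0; i < n; i++) out[i] = x[i] + y[i];
}

//the same verb for double + long long already needs another loop...
void add_FJ(const double *x, const long long *y, double *out, size_t n)
{
  for (size_t i = 0; i < n; i++) out[i] = x[i] + (double)y[i];
}

//...and so on for every (verb, left type, right type) combination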

The current implementation of Kona handles operations such as "+" by taking in any two K structures as arguments and then resolving everything from there within a single, contained function. This system works pretty well. Adverbs are handled by applying these functions in various ways. Optimized adverbs (for, say, +/) are created by special-casing the code. (I think versions of Arthur's K must also have special-cased adverbs, regardless of whether he had a different adverb-application system, since some verb-adverb combinations were clearly optimized while others clearly weren't.)

Kona Integer Types
  8-byte: long long

Kona Float Types
  8-byte: double

Expanded Register Integer Types
  1-bit         1-byte       2-byte   4-byte   8-byte
  bit/boolean   char/byte    short    int      long long

Expanded Register Float Types
  4-byte   8-byte
  float    double

Type Containers:
atoms
vectors
lists (including nested)


 2+*|1-a+b /parsing this likely needs a stack
       a+b /but we're only concerned about this piece. there's a lot going on here
           /this is any one of {{atom,vector,list} x {integer_types} x {float_types}}^2
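To get a feel for the size of that space: with 3 containers and 7 expanded element types on each side, a single dyadic verb already has 21 x 21 = 441 input shapes. A hedged sketch (the enum names here are invented) of packing container and element type into one dispatch key, so the case analysis happens once per call rather than inside the loop:

enum container { ATOM, VECTOR, LIST };                      //3 containers
enum elemtype  { BIT, BYTE, SHORT, INT, LONGLONG,           //5 integer widths
                 FLOAT, DOUBLE, ELEMTYPE_COUNT };           //2 float widths

int dispatch_key(int lc, int lt, int rc, int rt)            //(container, type) of each side
{
  int left  = lc * ELEMTYPE_COUNT + lt;                     //3*7 = 21 cases per argument
  int right = rc * ELEMTYPE_COUNT + rt;
  return left * (3 * ELEMTYPE_COUNT) + right;               //21*21 = 441 cases per verb
}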

A new Kona engine would probably use a stack. One subpiece of that stack-based execution engine would likely be a register machine implemented to handle combinations of input types to primitives. This register machine should operate at the limit implied by the memory bottleneck. In other words, it shouldn't be any slower than a plain C loop.

The write-up describes how such a register machine might look. It is not really even a "virtual register machine", in the sense that it does not execute a full language of virtual instructions, but it is modeled after a register machine. It seems to me to be a model that works for handling the large number of combinations of types and containers. Other models may be better. This is an experiment. It is possible the register machine model is totally wrong and should not be used.

The expansion of types makes the old method undesirable for primitive operations such as addition. A register-like machine makes more sense. Probably this register-like machine can call up to the overriding stack engine, for instance during conformable list evaluation. The separation is not yet clear.



Recommendations:

0. Read data into registers using memcpy (or whichever way is best).

0. Make it branchless. Don't use "if". Use function pointers as necessary. If you have to use cases, use a switch instead of if/else.

0. Make it thread-safe. Make code functional. Keep variables function-local.

0. Handle conformable lists sensibly, potentially by calling up to the stack machine.

0. Don't use % modulo to resolve differences in conformable object size. It's slow (see the sketch after this list).

0. Handle adverb operations sensibly. +/a a+/b a+/:b a+\:b a+/:/:b a+'b etc. Is it impossible to do this gracefully and automatically for all adverbs, without special-casing the code?

0. Identify "reusable" incoming data structures and use those to store output when possible, instead of allocating a new structure.

0. Do not use the "long double", 80-bit, or 128-bit floating-point types (as of 2013.06.23). Today these types are handled in software and are slow. Use of "double" or "long long" does not seem to slow anything down.

0. Avoid "__thread" whenever possible. It's slow.

0. Handle infinities correctly, particularly when casting down.
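On the modulo recommendation: the usual trick is to cycle the shorter operand with a compare-and-reset counter rather than computing i % m every iteration, which drags a division into the inner loop. A minimal standalone sketch (not Kona code; the exact conformability rules are beside the point here):

#include <stddef.h>

//add a length-n vector to a length-m vector (m <= n), repeating the shorter
//one cyclically; the counter j wraps by comparison instead of by %
void add_cyclic(const long long *x, size_t n,
                const long long *y, size_t m, long long *out)
{
  size_t j = 0;
  for (size_t i = 0; i < n; i++)
  {
    out[i] = x[i] + y[j];
    j++;
    if (j == m) j = 0;   //a predicted compare is far cheaper than the division % implies
  }
}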

Notes:

0. You'll probably need a NOOP function.

0. All integer types can cast to "long long", and so on. If all machines had a 128-bit floating-point register, you could mostly solve the problem by casting all intermediate inputs to 128-bit floats (even 64-bit ints) and then casting the 128-bit quad down to the desired format in the correct way. For now some combination of casts to 64-bit integer and 64-bit double may suffice. Because of the bottleneck in reading data from memory, performing these casting operations in the CPU likely happens fast enough to add zero overhead. Is this true or not?
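As a concrete reading of the casting note: loads would widen every narrower integer type into the same 64-bit register, and the save step would narrow back down, which is also where the infinity handling from the recommendations would live. A hedged sketch with invented names and a placeholder clamping policy (not K's actual integer-infinity conventions):

#include <limits.h>
#include <stddef.h>

//widen on load: every integer source type lands in one 64-bit register type
long long load_short(const short *v, size_t i) { return (long long)v[i]; }
long long load_int  (const int   *v, size_t i) { return (long long)v[i]; }

//narrow on save: clamp rather than truncate bits (placeholder policy)
int save_int(long long r)
{
  if (r > INT_MAX) return INT_MAX;   //stand-in for positive integer infinity
  if (r < INT_MIN) return INT_MIN;   //stand-in for negative integer infinity
  return (int)r;
}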


//Drop this in as main.c in Kona, and
//if Kona hasn't changed too much, this should work

#include "incs.h"
#include "k.h"

//__thread slows this down too much
//long double slows this down appreciably
//double does not slow this down
//int64 does not slow this down


//direct assignment with correct typing is significantly faster than memcpy
C LEFTFIXEDWIDTH;  //how wide is one piece of data?
C RIGHTFIXEDWIDTH;
C SAVEFIXEDWIDTH;

I LEFTLOADI = 0;   //position in the vector
I RIGHTLOADI = 0;
I SAVEI = 0;

I myregA = 0;  //registers. floats will need their own registers.
I myregB = 0; 
I myregC = 0; 

K leftArg;
K rightArg;
K saveArg;


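//each function below acts like one register-machine "instruction":
//load the left operand, load the right operand, apply the op, store the result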
void LEFT_LOAD_J() 
{
  myregA = kI(leftArg)[LEFTLOADI];
  //memcpy(&myregA, kI(leftArg)+LEFTLOADI, LEFTFIXEDWIDTH);//slow
  LEFTLOADI++;
}
void RIGHT_LOAD_J() 
{

  myregB = kI(rightArg)[RIGHTLOADI];
  //memcpy(&myregB, kI(rightArg)+RIGHTLOADI, RIGHTFIXEDWIDTH);//slow
  RIGHTLOADI++;
}

void MYOP_ADD_J()
{
  myregC = myregA + myregB;
}

void MYSAVE_J()
{
  kI(saveArg)[SAVEI] = myregC;
  //memcpy(kI(saveArg)+SAVEI, &myregC, SAVEFIXEDWIDTH);//slow
  SAVEI++;
}

int main(int argc,S*argv)
{
  kinit();
  args(argc,argv);
  boilerplate();


  K a = _(a: !1e6);   //count up from 0
  K b = _(b: 1+!1e6); //count up from 1
  K c = _(c: &1e6);   //zeroes vector


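//run the whole register-machine pass twice (DO(2,...)); TIME wraps the
//million-element inner loop each pass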
DO(2,

  LEFTLOADI = 0;
  RIGHTLOADI = 0;
  SAVEI = 0;
  leftArg = a;
  rightArg = b;
  saveArg = c;

  //LEFTFIXEDWIDTH = 8;
  //RIGHTFIXEDWIDTH = 8;
  //SAVEFIXEDWIDTH = 8;

  void (*LEFT_LOAD)();
  void (*RIGHT_LOAD)();
  void (*MYOP)();
  void (*MYSAVE)();

  //the cases could be the primitives (+ * - % ...)
  //I also like the idea of passing a struct/array that has these funcs specified
  //then calling the functions via the struct.members/struct->pointers
  //that might not be possible actually, as you need to vary based on input type

  SW(1)
  {
    CS(1,
      LEFT_LOAD = LEFT_LOAD_J;
      RIGHT_LOAD = RIGHT_LOAD_J;
      MYOP = MYOP_ADD_J;
      MYSAVE = MYSAVE_J;
    )
  }

  TIME(
    DO(1000000,
      LEFT_LOAD();
      RIGHT_LOAD();
      MYOP();
      MYSAVE();
    )
  )
)
  show(saveArg);

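//for comparison, run Kona's existing + on the same vectors (also twice, timed)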
DO(2, TIME(_(a+b)))


  attend(); //loop on stdin/inet
  R 0;
}

Things I Would Change About the Way Kona Implements K

1. Don't zero memory out. Assume memory is malloc-like uninitialized (don't calloc by default)
2. Reuse vectors marked for death (the output of a+b can be the same vector as, say, b, if b's refcount is 1; see the sketch after this list)
3. Assume malloc/mmap never returns null (true on some systems?). Simply crash if out-of-memory and can't react. Don't check everywhere
3b. Alternatively, assume Kona's memory returner never returns null, and handle memory problems there. In other words, don't use macros everywhere. Wait to mark data structures as "in use" until the function returns cleanly; that way anything that failed was just writing over "garbage", and all memory is garbage until marked otherwise. You could handle this in the main loop, maybe by setting the reference count to 1. Nested objects? (this fails)
3c. "malloc" never returns zero, except when it sometimes does; it's complicated: see "malloc never fails" http://www.scvalex.net/posts/6/
3d. There may be a way around some of this, by mapping and unlinking a big file, maybe on tmpfs. See notes on KSWAP.
4. Store more values explicitly (function valence) instead of implicitly. Speed is more important.
5. Store the memory manager block size explicitly; no log2 stuff. Store whether a block is a file mmap. Probably add a few bytes for things like mutexes (?), sorted/union/group descriptors, etc.
6. Check for type conformance as part of the dispatch table? (but that needs more than a yes/no)
7. Implement optimized dictionaries before writing all the dictionary stuff.
8. Build some manager (compare PHP's C/C++) that gracefully releases memory if an error is detected and you have to leave your function. With a stack architecture you probably get this for free or nearly free. The stack can handle freeing memory during error bubble-up. We should be able to remove M() and U() calls from the code. See also remarks on not checking for failed mallocs
9. Allow adverbs etc. to handle type checking outside verbs
10. Classify verbs to allow something more general to handle "atomic" verbs. Arthur says: atomic iff a v b is a v' b.
11. Be more mindful of 32-bit and 64-bit differences; use sizeof more often.
12. More types? Make it easy to drop in new types (a workflow for `dates` and such, based on a root class).
13. Multithreading.
14. Would C++-style template macros solve the Q type problem? Possibly, but probably poorly. A register machine seems correct.
15. Syms should probably be stored/interned as [unsigned] autoincrement integers (1, 2, ...), not pointers. Maybe.
16. ...Stack-based execution? Or not? Probably yes (or something similar, like a register machine). See: https://groups.google.com/d/msg/kona-dev/acPihIZjx2Q/YD8YIOUdY7sJ
17. If using stack or register based architecture, get the virtual machine code correct up front
18. Probably the way types are handled in Q is that bool, short, int, long, etc. are all cast to long, since that is fine inside the register as long as the shorts are coming out of main memory at machine speed.
19. Our C-based optimized amend is so complicated it's probably the wrong way to do it (though it's already built and nice to have now). Bakul points out that copying the assigned-to/assigned-from (depending on what you want) can be very nice. Arthur seems to suggest amends were accomplished by always copying. Our amend is hugely complicated. Note: the deep mremap issues we were solving were sidestepped by Arthur by copying the vector.
20. Our valence calculation is also probably the wrong implementation.
21. Not understanding nv->v, va->v style tables probably led to incorrect models.
22. The dv_ex/vf_ex family, etc. are probably the result of not getting the stack machine right.
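
On point 2: a minimal, self-contained sketch of reusing a dying argument's storage, using an invented struct rather than Kona's actual K object (and, per point 3, not checking malloc):

#include <stdlib.h>

typedef struct vec { long long *data; size_t n; int refcount; } vec;

//elementwise x+y for equal-length vectors; if y is about to die (refcount 1),
//write the result into y's storage instead of allocating a fresh vector
vec *vec_add(vec *x, vec *y)
{
  vec *out = y;                                    //reuse y by default
  if (y->refcount != 1)                            //someone else still sees y
  {
    out = malloc(sizeof *out);                     //per point 3: no null checks
    out->n = y->n;
    out->refcount = 1;
    out->data = malloc(out->n * sizeof *out->data);
  }
  for (size_t i = 0; i < y->n; i++) out->data[i] = x->data[i] + y->data[i];
  return out;
}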
