2.5 LispE vs. Python: A Stochastic Gradient Descent Comparison

What makes an interpreted language fast?


This is a question I've been asking myself for years. LispE is not my first programming language, far from it. In the past I created Tamgu, a rather rich language that merges a Prolog, a Haskell and a kind of Python interpreter into a single syntax.

Although LispE borrows many elements from Tamgu, it offers a much more streamlined implementation. Initial design choices sometimes have pernicious effects that end up weighing down an approach, yet once they are caught in the web of kernel code they become unbreakable. Sometimes the best solution is to start over.

The choice we made in the case of LispE is to use C++ classes in the spirit of functional programming. Thus, each object in LispE is an instance of a particular class associated with an eval method (see Interpreter). Furthermore, each instruction of the language is a particular method of the List class. In fact, it only takes a few minutes to enrich the language with a new method.
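To make this concrete, here is a simplified sketch of what this organization looks like. This is not the actual LispE header, whose classes are much richer; it only illustrates the principle of self-evaluating objects and of instructions as methods:

//Simplified sketch, not the actual LispE class hierarchy.
//Every object derives from a common Element class and knows how to
//evaluate itself; an instruction is a method dispatched on the List
//that contains it.
#include <vector>

class LispE;   //the interpreter object: stack, symbol tables, pools, etc.

class Element {
public:
    virtual ~Element() {}
    //Atoms such as numbers or strings evaluate to themselves by default
    virtual Element* eval(LispE* lisp) {
        return this;
    }
};

class List : public Element {
public:
    std::vector<Element*> items;

    //Placeholder: the real List class has one such method per instruction
    //of the language (plus, setq, cond, etc.)
    virtual Element* evall_instruction(LispE* lisp) {
        return this;
    }

    //A list evaluates itself by calling the method that corresponds
    //to its head instruction
    Element* eval(LispE* lisp) {
        return evall_instruction(lisp);
    }
};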

Thus each object knows how to evaluate itself, so there is no virtual machine to execute the instructions centrally, which has the effect of making multi-tasking almost transparent. On the other hand, the question of efficiency arises immediately.

Is this implementation fast?

Stochastic Gradient Descent

To answer this question, we have implemented a Stochastic Gradient Descent algorithm. The choice of this algorithm is not accidental: it is absolutely fundamental in Artificial Intelligence today, including in the most advanced approaches such as Transformers. Obviously, the code we propose here is unlikely to appear in any professional codebase; in Python, a real implementation would use PyTorch or numpy.

Still, this algorithm is complex enough that a comparison between LispE and Python is as interesting as it is meaningful.

We will not describe the two programs in detail.

Their code is available here:

However, we have tried to keep the same coding style on both sides.

Least Squares Loss Function

We assume that our data have the following structure:

# X and Y have the same size
X = [[1, 0.72, 0.32], [1, 0.75, 0.12], [1, 0.53, 0.65] etc.]
Y = [6.93, 5.99, 1.46, 1.44, 4.51, 1.25, 2.53, 6.88 etc.]

# a has the same number of elements as one element of X
a = [0.1, 0.1, 0.1] 

For example, the least squares loss function:

𝑆 = ∑(𝑦𝑖 - <𝑎, 𝑥𝑖>)²

is implemented as follows in Python:

def Loss(a, X, Y):
    # sAX contains the calculation of: <a, xi>
    sAX = []
    for x in X:
        sAX.append(sum([w * e for (w, e) in zip(a, x)]))
    return sum([(yy - e)**2 for (yy, e) in zip(Y, sAX)])

We use list comprehensions whose elements are combined via the zip function.

In LispE, the code is very similar:

(setq Loss (sum (zipwith (λ(x y) (• (• y - (sum (• a * x))) ^^ 2)) X Y)))

Note that the computation of sAX is implicit in LispE. Indeed, the language knows by default how to multiply two lists element by element. The • operator makes it possible to write a mathematical formula in its infix form:

(• a * x) actually corresponds to: (* a x)
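For reference, gradient descent minimizes this loss by repeatedly updating a against its gradient, with a learning rate η (we do not reproduce that part of the two programs here):

∇𝑆(𝑎) = -2 ∑(𝑦𝑖 - <𝑎, 𝑥𝑖>) 𝑥𝑖

𝑎 ← 𝑎 - η ∇𝑆(𝑎)

In the stochastic variant, the sum runs over a single example (or a small batch) drawn at each iteration rather than over the whole dataset.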

When we run both codes on the same machine, here an iMac equipped with an Intel Core i7 quad-core processor at 4.2 GHz with 32 GB of memory, we get the following results:

LispE:  9ms
Python: 45ms

The difference is significant: a ratio of 1 to 5.

Fun Facts

The number of iterations needed to reach the same values differs slightly between Python and LispE. For some reason, it takes 449 iterations with Python and 452 with LispE.

Still, Python and LispE use the same 64-bit float representation, but rounding errors tend to stack up to the point that a bit can be flipped deep in the mantissa, which for some reason does not happen on the Python side, blurring the comparison between the two loss values.
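The phenomenon is easy to reproduce: with 64-bit floats, even summing the same values in a different order is usually enough to change the last bits of the result. Here is a small, self-contained illustration (not taken from either program):

//Summing the same 64-bit floats in two different orders typically
//yields results that differ in the last bits.
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> v;
    for (int i = 1; i <= 1000; i++)
        v.push_back(1.0 / i);

    double forward = 0, backward = 0;
    for (size_t i = 0; i < v.size(); i++)
        forward += v[i];
    for (size_t i = v.size(); i-- > 0; )
        backward += v[i];

    //The two sums usually differ by a few units in the last place
    printf("%.17g\n%.17g\n", forward, backward);
    return 0;
}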

However, we also tested LispE with 32-bit floats (simply replace numbers with floats).

In that case the results are quite different... REALLY different.

It then takes only 175 iterations to converge, in 3ms.

Reasons for such a difference

To be honest, the first version of LispE did not offer such performance. In fact, we followed the well-known principle that premature optimization is the root of all evil. We focused on the language implementation first before addressing performance issues. Fortunately, the implementation choices turned out to be quite sound (the result of years of trial and error). Only a few minor local changes were needed to improve the performance.

Specialized Types

The first choice was to provide LispE with specialized lists in the form of dedicated C++ classes, each with its own method overloads.

Thus, the Numbers class offers specific methods for addition or multiplication, which minimizes the overhead that generic access to a list would entail.

//A list of type Numbers has its own addition method
//We can also detect the type of e in order to choose the most
//efficient way to compute the addition
Element* Numbers::plus_direct(LispE* lisp, Element* e) {
    switch (e->type) {
        case t_numbers: {
            //We "cast" e into an object of the same type
            Numbers* n = (Numbers*)e;
            long szl = list.size();
            long i = n->list.size();

            szl = lmin(szl, i);

            //The addition of the two lists is done locally
            for (i = 0; i < szl; i++) {
                list.vector[i] += n->list[i];
            }
            return this;
etc.

The fact that the addition or multiplication of two lists is done directly, almost without intermediate objects, has a very significant impact on performance: almost 40% compared to the initial version.

In Python, this absence of specialized objects is largely made up for by numpy.

Object pools

C++ is often contrasted with programming languages whose memory is managed by a garbage collector. As usual, this is only partly true. Of course, C++ forces programmers to carefully analyze their code to check that every object they create is eventually cleaned up. Memory leaks are not an urban legend; they are one of the main reasons why C++ has the reputation of being a difficult language.

But there is a second aspect that is often forgotten: memory fragmentation. The allocation and deallocation of objects can have a disastrous impact on execution speed. Since objects are rarely exactly the same size, their creation and destruction leave gaps, which forces the system to regularly reorganize memory to avoid saturating it too quickly. These operations end up imposing an extravagant cost on execution, dividing performance by two or three.

The solution is to reuse objects as much as possible. Instead of destroying an unused object, it is much more effective to place it in a pool of objects of the same type, from which new objects can be drawn when needed. Of course, only the most frequent objects, such as numbers, strings or lists, are handled through these pools, since the pools themselves can eat up a sizeable part of the memory. At the end of the execution, each pool can be destroyed in a single step.
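As an illustration, a minimal pool for number objects could look like the following sketch (hypothetical names, not the actual LispE pool code):

//Minimal sketch of an object pool (hypothetical names, not the LispE code).
//Instead of deleting a Number, we push it back into a pool and recycle it
//on the next allocation, avoiding both malloc/free traffic and fragmentation.
#include <vector>

class Number {
public:
    double value;
    Number(double v) : value(v) {}
};

class NumberPool {
    std::vector<Number*> available;
public:
    Number* provide(double v) {
        if (available.empty())
            return new Number(v);        //pool empty: allocate normally
        Number* n = available.back();    //otherwise recycle an old object
        available.pop_back();
        n->value = v;
        return n;
    }

    void release(Number* n) {
        available.push_back(n);          //"destruction" is just a push
    }

    ~NumberPool() {
        //the whole pool can be cleaned up in one step at the end
        for (Number* n : available)
            delete n;
    }
};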

In LispE we have about twenty pools for the most frequent types (see lispe.h).

The use of these pools in LispE has reduced the computation time by a factor of 3.

Custom dictionary: binHash

Each call to a function or lambda requires variables to be stored in the execution stack. We have created a custom dictionary whose keys are 16-bit integers. Actually, variable and function names are all encoded as 16-bit integers. This allows the use of 65,536 different names, which is roughly the size of a good English dictionary.

This dictionary is composed of a vector where each element is a table of 64 elements. Hence, this vector has a maximum size of 1024 tables (1024 × 64 = 65,536). Note that these dictionaries are also managed as a pool, which makes it possible to reuse these structures at will.

To position an element in a dictionary, you just have to calculate the index of its table in the vector and then its position in this table:

//Compute indexes and positions for "name":

index = name >> 6;      //i.e. the quotient of the division by 64
position = name & 0x3F; //i.e. the remainder of the division by 64

As we can see, computing a key is almost straightforward. Moreover, since variables are necessarily unique at any given time t, collisions cannot be a problem.
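In outline, the structure behaves like the following sketch (hypothetical code; the actual binHash in lispe.h is richer, with pooled and iterable tables):

//Simplified sketch of the idea behind binHash.
//Keys are 16-bit name codes; each bucket is a flat table of 64 slots,
//allocated only when a name in its range is first used.
#include <cstdint>
#include <vector>

class Element;

class binSketch {
    std::vector<Element**> tables;   //at most 1024 tables of 64 entries each
public:
    ~binSketch() {
        for (Element** t : tables)
            delete[] t;
    }

    Element* search(uint16_t name) {
        uint16_t index = name >> 6;               //division by 64
        if (index >= tables.size() || tables[index] == nullptr)
            return nullptr;
        return tables[index][name & 0x3F];        //remainder of the division by 64
    }

    void store(uint16_t name, Element* value) {
        uint16_t index = name >> 6;
        if (index >= tables.size())
            tables.resize(index + 1, nullptr);
        if (tables[index] == nullptr) {
            tables[index] = new Element*[64];
            for (int i = 0; i < 64; i++)
                tables[index][i] = nullptr;
        }
        tables[index][name & 0x3F] = value;
    }
};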

The initial version was based on a std::unordered_map; replacing it with this structure divided the execution time by 2.

Reference Counter

Finally, another aspect we have worked on a lot is how the reference counter is handled (see status in the Element class). In an interpreter, the reference counter of an element stored in a container is incremented only once, when it is stored. However, when you delete the container, you have to go through each element and decrement its reference counter. This deletion can trigger a recursive destruction traversal of every element of every sub-container, which can take time.

However, since we use specialized lists, which contain values and not objects, their destruction does not require any traversal and can be done in one single step. Consequently, their creation and destruction are much faster than those of a generic list. This choice reduces the execution time by almost 20%, which again significantly improves performance.
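Schematically, the difference looks like this (hypothetical code, not the actual LispE classes):

//A generic list stores Element* values: releasing it means visiting every
//element, decrementing its reference counter and possibly recursing into
//sub-containers. A specialized Numbers list stores plain doubles: releasing
//it is a single deallocation, with nothing to traverse.
#include <cstdint>
#include <vector>

class Element {
public:
    uint16_t status;                     //the reference counter
    Element() : status(0) {}
    virtual ~Element() {}

    void decrement() {
        if (status > 0)
            status--;
        if (status == 0)
            delete this;                 //may recursively free sub-containers
    }
};

class GenericList : public Element {
public:
    std::vector<Element*> items;         //objects, each with its own counter

    ~GenericList() {
        //releasing a generic list requires traversing all its elements
        for (Element* e : items)
            e->decrement();
    }
};

class Numbers : public Element {
public:
    std::vector<double> list;            //plain values, no reference counters
    //default destructor: a single deallocation, nothing to traverse
};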

Conclusion

The first version of LispE took 85ms to run the same gradient descent code. Today it takes only 9ms.

The changes presented here are not the only ones that have improved performance. Other aspects have also been considered, such as reducing the number of arguments in function calls to reduce the size of the interpreter's execution stack. We also tried to reduce the number of if statements in the code as much as possible, following Daniel Lemire's advice, to limit the side effects of out-of-order execution on modern processors. But while these changes have improved efficiency, none has had the spectacular effect of the changes described here.

C++ is a language known for producing some of the most efficient binaries in the world (along with C and Rust). But the fact remains that sometimes it takes only a few small changes to achieve the most surprising improvements.
