Sample ~ Data Encoding

If you haven't read the [[introduction|Tutorial--Introduction]] and [[first example|Tutorial--Example-(Part-1)]] already, it's a good time to do so, as this benchmark description assumes familiarity with the ildl-plugin.

We will show a benchmark that uses the transformation developed during the example section. You will finally see why we spent so much energy defining the large transformation description object in the tutorial: it brings speedups of up to 12x.

The benchmark files are available in the ildl-plugin repository, in the tests/benchmarks/src/ildl/benchmark/gcd directory. If you have imported the ildl-* projects in the Scala IDE, this benchmark is available under the ildl-benchmarks project, in the src directory, in package ildl.benchmark.gcd.

Problem

Have you noticed...?

Have you noticed that, in the introductory example, we defined complex numbers with integer components instead of the usual floating-point ones? Did it seem peculiar?

Complex numbers with integers as both their real and imaginary parts are known as Gaussian integers, and are a countable subset of all complex numbers. The operations defined on Gaussian integers are similar to complex number operations, with one exception: division is not precise, as it rounds the result to the nearest Gaussian integer.

This matches the behaviour of integers:

scala> 5 / 2
res0: Int = 2
scala> 5 % 2
res1: Int = 1

The interesting thing about Gaussian integers is that we can define the Greatest Common Divisor (GCD), which is computed exactly as for integers. The following Stack Exchange page explains more: http://math.stackexchange.com/questions/82350/gcd-of-gaussian-integers.
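
To make the rounded division concrete, below is a minimal sketch of division and remainder on Gaussian integers represented as (Int, Int) pairs. This is our own illustration rather than code from the benchmark, and the helper names divide and remainder are ours:

// Illustration only: Gaussian integer division rounds each quotient
// component to the nearest integer; the remainder is what is left over.
def divide(n1: (Int, Int), n2: (Int, Int)): (Int, Int) = {
  val (a, b) = n1
  val (c, d) = n2
  // (a + bi) / (c + di) = ((ac + bd) + (bc - ad)i) / (c^2 + d^2)
  val denom = (c * c + d * d).toDouble
  (math.round((a * c + b * d) / denom).toInt,
   math.round((b * c - a * d) / denom).toInt)
}

def remainder(n1: (Int, Int), n2: (Int, Int)): (Int, Int) = {
  val (q1, q2) = divide(n1, n2)
  val (c, d) = n2
  // remainder = n1 - quotient * n2, using complex multiplication
  (n1._1 - (q1 * c - q2 * d), n1._2 - (q1 * d + q2 * c))
}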

Solution

The following snippet is the Euclidean algorithm that finds the Greatest Common Divisor (GCD) for integers:

def gcd_direct(n1: Int, n2: Int): Int = {
  @annotation.tailrec
  def gcd_inner(n1: Int, n2: Int): Int = {
    val remainder = n1 % n2
    if (remainder == 0) n2 else gcd_inner(n2, remainder)
  }

  if (n1 >= n2)
    gcd_inner(n1, n2)
  else
    gcd_inner(n2, n1)
}

The code comprises two parts:

  • a recursive inner loop, which computes the remainder of dividing the two numbers and stops only when it is zero; we mark it as @tailrec in this snippet so the Scala compiler verifies it can compile the recursion down to a loop;
  • initial setup code, which makes sure the first argument passed to the inner loop is always greater than or equal to the second one.
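
For instance, an illustrative call (not part of the benchmark code):

gcd_direct(48, 18)  // 48 % 18 = 12, 18 % 12 = 6, 12 % 6 = 0, so the result is 6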

This method generalizes directly to Gaussian Integers:

def gcd_direct(n1: (Int, Int), n2: (Int, Int)): (Int, Int) = {
  @annotation.tailrec
  def gcd_inner(n1: (Int, Int), n2: (Int, Int)): (Int, Int) = {
    val remainder = n1 % n2
    if (remainder.norm == 0) n2 else gcd_inner(n2, remainder)
  }
  if (n1.norm >= n2.norm)
    gcd_inner(n1, n2)
  else
    gcd_inner(n2, n1)
}
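
This version relies on % and norm being available on (Int, Int) pairs. A minimal sketch of such an enrichment, reusing the remainder helper sketched in the Problem section, could look like the following (our own illustration; the object and class names are ours, and the benchmark defines these operations in its own sources):

object GaussianOps {
  // Illustration only: make norm and % available on pairs of integers.
  implicit class GaussianIntegerOps(n: (Int, Int)) {
    // the norm of a Gaussian integer a + bi is a^2 + b^2
    def norm: Long = n._1.toLong * n._1 + n._2.toLong * n._2
    // remainder is the helper sketched earlier
    def %(that: (Int, Int)): (Int, Int) = remainder(n, that)
  }
}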

The code for gcd_direct, along with its ildl-transformed counterparts, is located in the GreatestCommonDivisor.scala file.

The performance of this code is suboptimal, as with each iteration, new pairs of integers are allocated and discarded. We will now dive into how this can be optimized.

Transformations

It would be nice to write the code above (which is a straightforward encoding of the algorithm from mathematics) but execute an improved version, where pairs of integers are replaced by unboxed values. To this end, we can extend the IntPairAsComplex transformation object, which we coded in the example sections, with division and remainder operations, producing the IntPairAsGaussianInteger transformation:

adrt(IntPairAsGaussianInteger) {
  def gcd_adrt(n1: (Int, Int), n2: (Int, Int)): (Int, Int) = {
    @annotation.tailrec
    def gcd_inner(n1: (Int, Int), n2: (Int, Int)): (Int, Int) = {
      val remainder = n1 % n2
      if (remainder.norm == 0) n2 else gcd_inner(n2, remainder)
    }
    if (n1.norm >= n2.norm)
      gcd_inner(n1, n2)
    else
      gcd_inner(n2, n1)
  }
}
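
At the call site nothing changes: gcd_adrt keeps the same pair-based signature, and the ildl-plugin introduces the conversions between the pair and the encoded representation at the boundary of the adrt scope. An illustrative call (values chosen arbitrarily):

val d: (Int, Int) = gcd_adrt((12, 8), (4, 0))  // callers keep passing ordinary pairs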

This code runs 12x faster than the original code. A natural question to ask is:

Where does this ridiculous speedup come from?

Well, that's a good question. We set out to find exactly what produces this speedup. To do this, we created two intermediate transformations, which gradually optimize different aspects of the integer pairs: first we specialize the pairs, then we encode the result as a java.lang.Long, and finally we switch to scala.Long, which the Scala compiler is able to unbox:

  step1.IntPairAsGaussianInteger:  (3, 4) ===> Complex(3, 4)
  step2.IntPairAsGaussianInteger:  (3, 4) ===> java.lang.Long
  step3.IntPairAsGaussianInteger:  (3, 4) ===> scala.Long

We will now show each step individually.

Step 1: Specializing the Tuple of Integers

(3, 4) ===> Complex(3, 4)

The Tuple2 class in the Scala library is generic, which allows us to use it to store different types of data, from integers to strings and to any object we define. However, this comes at the price of boxing the primitive values.

But the type parameters of the Tuple2 class are marked as @specialized, so there should be no boxing, right?

That's true. But there is another aspect to this: due to Scala bug SI-3585, the specialized class contains redundant fields: in our case, two generic fields in addition to the two specialized ones. This increases the heap footprint and triggers garbage collection cycles more often. Thus, we replaced the tuple with the Complex class, specialized by hand:

  case class Complex(_1: Int, _2: Int)

We will not show the entire transformation object here, but interested readers can find it in the step1 directory.

As we will see later, this brought a 2.2x speed increase, so it was a good decision to use a hand-specialized class.

Step 2: Encoding the Gaussian Integer

Knowing we would eventually want to encode the data as a long integer, we chose java.lang.Long as a middle step between the Complex case class and the unboxed long integer:

  • like Complex, it is a heap-based object
  • unlike Complex, and like scala.Long, it requires encoding logic

Based on this decision, we developed step2 of our transformation: encoding pairs of integers as java.lang.Long.
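
For reference, here is one possible way to pack a pair of integers into a java.lang.Long, with the real part in the upper 32 bits and the imaginary part in the lower 32 bits. This is a sketch under our own assumptions, not necessarily the exact encoding used by step2:

// Illustration only: encode (re, im) as a boxed 64-bit value and decode it back.
def pack(re: Int, im: Int): java.lang.Long =
  java.lang.Long.valueOf((re.toLong << 32) | (im.toLong & 0xFFFFFFFFL))

def unpackRe(l: java.lang.Long): Int = (l.longValue >>> 32).toInt
def unpackIm(l: java.lang.Long): Int = l.longValue.toInt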

Unfortunately, this actually brought a 15% slowdown, corresponding to the extra cost of encoding the Gaussian integer values into long numbers. Armed with this knowledge, we went on to step3.

Step 3: Unboxing the Gaussian Integer

Encoding the Gaussian integer as a long integer slows down operations. However, it also opens the door to unboxing, namely passing the long integer on the stack between methods, thus completely side-stepping the cost of object allocation.

Unboxing is achieved by transforming the Gaussian integers to scala.Long, a Scala type that corresponds to both the primitive long and the boxed java.lang.Long: the Scala compiler unboxes values of type scala.Long to long whenever possible and, when they interact with erased generics (which require boxing), converts them to java.lang.Long. This is completely transparent to the programmer, as the scala.Long type covers both cases.
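
For example, as an illustration of standard Scala behaviour (unrelated to the benchmark code):

def increment(x: Long): Long = x + 1   // operates on the primitive long, no allocation
val xs: List[Long] = List(1L, 2L, 3L)  // erased generics: the elements are boxed as java.lang.Long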

By transforming our pairs of integers to scala.Long, we allowed the Scala compiler to unbox the encoded Gaussian Integers, completely avoiding garbage collection. This produced a speedup of 6x compared to the encoded Gaussian Integers (step2) and 12x compared to the original code.

The transformation description object is stored in step3 of our transformation.

Benchmark

To run the benchmarks in a local installation, do:

$ cd ildl-plugin
$ sbt 'ildl-benchmarks/runMain ildl.benchmark.gcd.BenchmarkRunner'

Results:

| Benchmark              | Time (ms) | Speedup | Garbage* (MB) | GC time* (ms) | Garbage+ (MB) | GC time+ (ms) |
| ---------------------- | --------- | ------- | ------------- | ------------- | ------------- | ------------- |
| 10K GCD runs, original | 28.1      | none    | 0             | 0             | 13.5          | 13            |
| 10K GCD runs, step 1   | 12.5      | 2.2x    | 0             | 0             | 2.5           | 10            |
| 10K GCD runs, step 2   | 15.0      | 1.9x    | 0             | 0             | 8.7           | 11            |
| 10K GCD runs, step 3   |  2.2      | 12.7x   | 0             | 0             | 0.5           | 9             |

  • * garbage collected and GC pause time during the benchmark run
  • + garbage collected and GC pause time between benchmark runs (mainly preparing the data)

Conclusion: the speedups observed range between 2.2x and 12.7x.
