Sample ~ Deforestation - miniboxing/ildl-plugin GitHub Wiki
If you haven't read the [[introduction|Tutorial--Introduction]] and the [[first example|Tutorial--Example-(Part-1)]] already, now is a good time to do so, as this benchmark description assumes familiarity with the `ildl-plugin`.
This section presents another `adrt` use case: collection deforestation. In the process, we will show how the `adrt` scope enables programmers to write transformations that match and surpass dedicated optimization tools, such as the `scala-blitz` optimizer.
The benchmark files are available in the `ildl-plugin` repository, in the `tests/benchmarks/src/ildl/benchmark/deforest` directory. If you have imported the `ildl-*` projects in the Scala IDE, this benchmark is available under the `ildl-benchmarks` project, in the `src` directory, in package `ildl.benchmark.deforest`.
In this section we will look at the least squares method for fitting a straight line that best approximates a given set of `(x, y)` points. The benchmarked methods are located in the `LeastSquares.scala` file and are called from `Benchmark.scala` using the ScalaMeter benchmarking framework.
Given a `List[(Double, Double)]`, we can compute the slope and the offset of a straight line that best approximates the data:
```scala
def leastSquaresDirect(data: List[(Double, Double)]): (Double, Double) = {
  val size  = data.length
  val sumx  = data.map(_._1).sum
  val sumy  = data.map(_._2).sum
  val sumxy = data.map(p => p._1 * p._2).sum
  val sumxx = data.map(p => p._1 * p._1).sum
  val slope  = (size * sumxy - sumx * sumy) / (size * sumxx - sumx * sumx)
  val offset = (sumy * sumxx - sumx * sumxy) / (size * sumxx - sumx * sumx)
  (slope, offset)
}
```
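Since the formulas are easy to get subtly wrong, here is a quick sanity check (ours, not part of the benchmark suite): points sampled exactly from a known line should yield that line's slope and offset. The snippet repeats the definition above so it is self-contained.

```scala
// Sanity check for leastSquaresDirect (definition repeated from above so the
// snippet stands alone). Points sampled from y = 2x + 1 should be fitted
// with slope 2 and offset 1.
def leastSquaresDirect(data: List[(Double, Double)]): (Double, Double) = {
  val size  = data.length
  val sumx  = data.map(_._1).sum
  val sumy  = data.map(_._2).sum
  val sumxy = data.map(p => p._1 * p._2).sum
  val sumxx = data.map(p => p._1 * p._1).sum
  val slope  = (size * sumxy - sumx * sumy) / (size * sumxx - sumx * sumx)
  val offset = (sumy * sumxx - sumx * sumxy) / (size * sumxx - sumx * sumx)
  (slope, offset)
}

val points = List.tabulate(100)(i => (i.toDouble, 2.0 * i + 1.0))
val (slope, offset) = leastSquaresDirect(points)
// slope ≈ 2.0, offset ≈ 1.0, up to floating-point error
```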
Of course, the ultimate goal is performance. Yet, with eager collections, performance is compromised:

```scala
scala> val list = (1 to 1000000).toList
val list: List[Int] = ...

scala> list.map(_ + 1).map(_ * 2).sum
val res0: Int = ...
```
In the above code, the first call to `map` creates a new collection, where each element has been incremented by one. This collection is used by the second `map` call, leading to the creation of a third list, where each element is multiplied by 2. Already at this point, the first and second collections can be safely discarded -- and will indeed be removed from the heap through garbage collection. Later, the call to `sum` produces a single value, summing up all the elements of the third collection and making it redundant as well. At this point, the code has created two intermediate collections (three, counting the original list), which took time and memory to create and will later take time to garbage collect.
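To make the cost concrete, the pipeline above can be collapsed by hand into a single `foldLeft` that allocates no intermediate lists. The following sketch (ours, for illustration only) checks that the two forms agree:

```scala
// Hand-fused equivalent of list.map(_ + 1).map(_ * 2).sum:
// one traversal, zero intermediate collections.
val list = (1 to 1000).toList

val chained = list.map(_ + 1).map(_ * 2).sum                   // builds two extra lists
val fused   = list.foldLeft(0)((acc, x) => acc + (x + 1) * 2)  // builds none

// both equal 1003000
```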
The natural answer is to eliminate the intermediate collections produced by successive comprehensions applied to a collection, an optimization known as deforestation. While we cannot afford a full explanation of deforestation here, it is well described in the literature: http://homepages.inf.ed.ac.uk/wadler/topics/deforestation.html
One of the ways this can be achieved is through the `adrt` scope:

```scala
adrt(erased.ListAsLazyList) {
  def leastSquaresADRTGeneric(data: List[(Double, Double)]): (Double, Double) = {
    // same code as before:
    val size  = data.length
    val sumx  = data.map(_._1).sum
    val sumy  = data.map(_._2).sum
    val sumxy = data.map(p => p._1 * p._2).sum
    val sumxx = data.map(p => p._1 * p._1).sum
    val slope  = (size * sumxy - sumx * sumy) / (size * sumxx - sumx * sumx)
    val offset = (sumy * sumxx - sumx * sumxy) / (size * sumxx - sumx * sumx)
    (slope, offset)
  }
}
```
Next, we will see how this is achieved.
We tried several approaches to optimizing the least squares regression:

- using the scala-blitz tool;
- using `adrt` scopes, in two variants: erased and miniboxed;
- manually rewriting the code by hand.

We will present each approach with its explanation.
The scala-blitz tool promises to auto-magically improve collection performance, just by wrapping the code in its `optimize` scope:

```scala
def leastSquaresBlitz(data: List[(Double, Double)]): (Double, Double) =
  optimize {
    // same code as before:
    val size  = data.length
    val sumx  = data.map(_._1).sum
    val sumy  = data.map(_._2).sum
    val sumxy = data.map(p => p._1 * p._2).sum
    val sumxx = data.map(p => p._1 * p._1).sum
    val slope  = (size * sumxy - sumx * sumy) / (size * sumxx - sumx * sumx)
    val offset = (sumy * sumxx - sumx * sumxy) / (size * sumxx - sumx * sumx)
    (slope, offset)
  }
```
As part of the benchmarking process, aside from the original benchmark, we have also tested `scala-blitz`. The code can be found here: https://github.com/miniboxing/ildl-plugin/blob/oopsla15/tests/benchmarks/src/ildl/benchmark/deforest/LeastSquares.scala#L171-L182. The `scala-blitz` tool was quickly able to cut the execution time by 2.4x. This raised two questions:

- Can our `adrt` scope do the same?
- Can it do even better?
A simple solution that avoids intermediate lists is to collect all the functions mapped over a collection and only compute the result when necessary. In our example, a result is only needed when calling `sum`. To do this, we create a `LazyList` interface:
```scala
/**
 * This is the lazy list we're planning to use instead of
 * scala.collection.immutable.List[T].
 *
 * The list can be in one of two states:
 *  - it is just a wrapper over a List[T], with no accumulated maps;
 *  - it has accumulated maps, so it is a List[T] plus a function that
 *    composes the accumulated maps.
 */
sealed trait LazyList[T] {
  /** Map */
  def map[U, That](f: T => U): LazyList[U]

  /** Fold */
  def foldLeft[U](z: U)(f: (U, T) => U): U

  /** Length */
  def length: Int

  /** Force: get a list */
  def force: List[T]
}
```
This interface has two possible implementations:

- a simple wrapper over a list, which we call `LazyListWrapper`, and
- a container holding a list plus the composition of all the functions that have to be applied, which we call `LazyListMapper`.
The `LazyListWrapper` implementation is shown below:

```scala
class LazyListWrapper[T](list: List[T]) extends LazyList[T] {
  def map[U, That](f: T => U) =
    new LazyListMapper(list, f)

  def foldLeft[U](z: U)(f: (U, T) => U): U = {
    var lst = list
    var acc = z
    while (lst != Nil) {
      acc = f(acc, lst.head)
      lst = lst.tail
    }
    acc
  }

  def length: Int =
    list.length // since we don't support filter yet

  def force: List[T] = list
}
```
The `LazyListWrapper` class corresponds to a freshly created `LazyList`, which has not seen any `map` calls so far. In contrast, the `LazyListMapper` carries an aggregate function, which composes all the functions mapped up to that point:
```scala
class LazyListMapper[T, To](list: List[To], fs: To => T) extends LazyList[T] {
  def map[U, That](f: T => U) =
    new LazyListMapper(list, fs andThen f)

  def foldLeft[U](z: U)(f: (U, T) => U): U = {
    var lst = list
    var acc = z
    while (lst != Nil) {
      acc = f(acc, fs(lst.head))
      lst = lst.tail
    }
    acc
  }

  def length: Int =
    list.length // since we don't support filter yet

  def force: List[T] = list.map(fs)
}
```
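To see the laziness in action, here is a small self-contained replica of the two classes above (renamed `Demo*`, since recent Scala versions ship their own `scala.collection.immutable.LazyList`, and simplified to use the standard `foldLeft` internally). A counter records how many times the underlying list is actually walked:

```scala
// Minimal replica of LazyList/LazyListWrapper/LazyListMapper, renamed to
// avoid clashing with the stdlib LazyList. The counter shows that map calls
// allocate no lists and the data is traversed exactly once, at foldLeft.
var traversals = 0

sealed trait DemoLazy[T] {
  def map[U](f: T => U): DemoLazy[U]
  def foldLeft[U](z: U)(f: (U, T) => U): U
}

class DemoWrapper[T](list: List[T]) extends DemoLazy[T] {
  def map[U](f: T => U) = new DemoMapper(list, f)
  def foldLeft[U](z: U)(f: (U, T) => U): U = {
    traversals += 1
    list.foldLeft(z)(f)
  }
}

class DemoMapper[T, From](list: List[From], fs: From => T) extends DemoLazy[T] {
  def map[U](f: T => U) = new DemoMapper(list, fs andThen f)
  def foldLeft[U](z: U)(f: (U, T) => U): U = {
    traversals += 1
    list.foldLeft(z)((acc, x) => f(acc, fs(x)))
  }
}

// The two map calls only compose functions; nothing is traversed yet:
val pipeline = new DemoWrapper(List(1, 2, 3)).map(_ + 1).map(_ * 2)
val result   = pipeline.foldLeft(0)(_ + _)  // the single traversal happens here
// result == 18, traversals == 1
```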
These classes form the basic building blocks of our transformation. As with the other benchmarks, we have split the benchmark into several steps:

```
List[T] ===> LazyList[T] ===> LazyList[@miniboxed T] -+-> manual traversal -+-> manual fusion
   \              ^                    ^              /                     /
    \___step1____/                    /              /                     /
     erased.ListAsLazyList           /           manual                manual
      \                             /        transformation        transformation
       \___step2___________________/
        miniboxed.ListAsLazyList
```
We will now go through each transformation.
Can we automate this transformation? Yes: we can use the `adrt` scope to automate the transformation from `List[T]` to `LazyList[T]`.
To start, we need to specify how a list is transformed:

```scala
object ListAsLazyList extends TransformationDescription {
  // conversions:
  def toRepr[T](list: List[T]): LazyList[T] @high = new LazyListWrapper(list)
  def fromRepr[T](lazylist: LazyList[T] @high): List[T] = lazylist.force
}
```
The generic `fromRepr` and `toRepr` methods allow the `adrt` scope to transform between `List[T]` and `LazyList[T]`. That is great, but there is a problem: the scope also needs to handle the `List[T]` operations, such as `map`:

```scala
def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That
```
To do so, we add the following operations to the transformation object:

```scala
object ListAsLazyList extends TransformationDescription {
  // conversions:
  def toRepr[T](list: List[T]): LazyList[T] @high = new LazyListWrapper(list)
  def fromRepr[T](lazylist: LazyList[T] @high): List[T] = lazylist.force

  // operations:
  // optimizing the length method:
  def extension_length[T](lazylist: LazyList[T] @high) =
    lazylist.length

  // optimizing the map method:
  def extension_map[T, U, That](lazylist: LazyList[T] @high)
      (f: T => U)(implicit bf: CanBuildFrom[List[T], U, That]): LazyList[U] @high =
    ...

  // optimizing the foldLeft method:
  def extension_foldLeft[T, U](lazylist: LazyList[T] @high)
      (z: U)(f: (U, T) => U): U =
    lazylist.foldLeft(z)(f)

  // optimizing the sum method:
  def extension_sum[T, U >: T](lazylist: LazyList[T] @high)
      (implicit num: Numeric[U]): U =
    lazylist.foldLeft(num.zero)(num.plus)
}
```
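The `extension_sum` redirection relies on the standard identity that summing is folding with a `Numeric` instance. As a plain-Scala check of that identity (outside the plugin, using only the standard library):

```scala
// sum expressed as foldLeft(num.zero)(num.plus), for any Numeric element type.
def sumViaFold[T](list: List[T])(implicit num: Numeric[T]): T =
  list.foldLeft(num.zero)(num.plus)

val intSum    = sumViaFold(List(1, 2, 3))       // same as List(1, 2, 3).sum
val doubleSum = sumViaFold(List(1.5, 2.5, 3.0)) // same as the .sum result
```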
This code describes exactly how to perform the `List[T]` operations directly on the `LazyList[T]` representation. The mechanism is further described on the [[Transformation Description page|Details-~-Transformation-Description]]. But let us focus on the body of the `extension_map` method:
```scala
def extension_map[T, U, That](lazylist: LazyList[T] @high)
    (f: T => U)(implicit bf: CanBuildFrom[List[T], U, That]): LazyList[U] @high = {
  // sanity check => we could accept arbitrary CanBuildFrom objects,
  // but that would make the transformation slightly more complex
  assert(bf == List.ReusableCBF, "The LazyList transformation only supports " +
    "using the default `CanBuildFrom` object " +
    "from the Scala collections library.")
  lazylist.map(f)
}
```
There are several things to notice here:

- the `map` method of `List[T]` takes more parameters than the `LazyList[T]` `map`, specifically the `CanBuildFrom` object;
- the `CanBuildFrom` object is matched against the standard collections object that would produce another `List[U]` -- so we cannot inject a custom `CanBuildFrom` object;
- at the expense of the principle of least surprise, we could enforce the previous point in the type system:
```scala
// optimizing the map method:
def extension_map[T, U, That](lazylist: LazyList[T] @high)
    (f: T => U)(implicit bf: List.ReusableCBF.type): LazyList[U] @high = {
```
This would prevent other `CanBuildFrom` objects from type-checking and would therefore produce the warning we have seen before whenever `map` is invoked with a custom `CanBuildFrom` object.
Another thing to notice is the definition of the `extension_foldLeft` method. This is the signature of `foldLeft` in `List[A]`:

```scala
def foldLeft[B](z: B)(f: (B, A) => B): B = ...
```
If we look at the `extension_foldLeft` method, we can see that it has two type parameters instead of one:

```scala
def extension_foldLeft[T, U](lazylist: LazyList[T] @high)
    (z: U)(f: (U, T) => U): U =
  lazylist.foldLeft(z)(f)
```
Although the arguments must match exactly, the type parameters will be re-inferred when redirecting methods. This allows programmers to add or remove type parameters when defining the transformation description object.
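As a plain-Scala analogy (our sketch; the actual redirection is performed by the plugin), a forwarding method may declare two type parameters even though `foldLeft` on `List` declares only one, because the compiler re-infers the parameters afresh at each call site:

```scala
// List[T].foldLeft has one type parameter; this forwarder declares two.
// Both are inferred at the call site, so the redirection type-checks.
def redirectFoldLeft[T, U](list: List[T])(z: U)(f: (U, T) => U): U =
  list.foldLeft(z)(f)

val total = redirectFoldLeft(List(1, 2, 3))(0)(_ + _)  // T = Int, U = Int inferred
// total == 6
```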
Using this `ildl` transformation improved the running time by almost 10x compared to the scala-blitz optimizer.
During a discussion with @DarkDimius, one of the authors of `scala-blitz`, he pointed out that, for the least squares example, `scala-blitz` only performs specialization, not deforestation. Therefore, a natural question is: can we specialize `LazyList` and its subclasses?
That is a good question. We tried doing just that:

- we created a miniboxed version of `LazyList`;
- we created a transformation description object that uses the `LazySpecializedList`.
Adding another version of the least squares method was very easy:

```scala
adrt(miniboxed.ListAsLazyList) {
  def leastSquaresADRTSpecialized(data: List[(Double, Double)]): (Double, Double) = {
    ... // same code as before
  }
}
```
This version further improved the running time by 50% compared to the erased version of the `LazyList`.
As a curiosity, the Scala compiler has 25 phases, but when both the `ildl-plugin` and the `miniboxing-plugin` are active, they add 21 extra phases: 15 for the miniboxing transformation and 6 for the `ildl-plugin`. Of course, including these plugins in the Scala compiler itself would reduce the number of phases considerably. :)
To understand the best-case scenario, we also wrote the list traversals by hand:
```scala
/** Least squares with manual traversal */
def leastSquaresManual1(data: List[(Double, Double)]): (Double, Double) = {
  val size = data.length
  var list = data
  var sumx = 0.0
  var sumy = 0.0
  var sumxy = 0.0
  var sumxx = 0.0

  // val sumx = data.map(_._1).sum
  list = data
  while (!list.isEmpty) {
    sumx = sumx + list.head._1
    list = list.tail
  }

  // val sumy = data.map(_._2).sum
  list = data
  while (!list.isEmpty) {
    sumy = sumy + list.head._2
    list = list.tail
  }

  // val sumxy = data.map(p => p._1 * p._2).sum
  list = data
  while (!list.isEmpty) {
    sumxy = sumxy + list.head._1 * list.head._2
    list = list.tail
  }

  // val sumxx = data.map(p => p._1 * p._1).sum
  list = data
  while (!list.isEmpty) {
    sumxx = sumxx + list.head._1 * list.head._1
    list = list.tail
  }

  val slope  = (size * sumxy - sumx * sumy) / (size * sumxx - sumx * sumx)
  val offset = (sumy * sumxx - sumx * sumxy) / (size * sumxx - sumx * sumx)
  (slope, offset)
}
```
This is the manual counterpart of the automated deforestation that we have seen in the previous examples. It reduces the running time further, since the automated version still pays some abstraction cost for applying the composed functions instead of manipulating the elements directly. This example shows the additional benefit of inlining the functions.
The code can be found here: https://github.com/miniboxing/ildl-plugin/blob/oopsla15/tests/benchmarks/src/ildl/benchmark/deforest/LeastSquares.scala#L185-L221
Finally, to understand the most that can be achieved, we fused all the operations into a single traversal:
```scala
/** Least squares with manual traversal + fusion */
def leastSquaresManual2(data: List[(Double, Double)]): (Double, Double) = {
  val size = data.length
  var list = data
  var sumx = 0.0
  var sumy = 0.0
  var sumxy = 0.0
  var sumxx = 0.0

  // val sumx  = data.map(_._1).sum
  // val sumy  = data.map(_._2).sum
  // val sumxy = data.map(p => p._1 * p._2).sum
  // val sumxx = data.map(p => p._1 * p._1).sum
  list = data
  while (!list.isEmpty) {
    sumx  = sumx  + list.head._1
    sumy  = sumy  + list.head._2
    sumxy = sumxy + list.head._1 * list.head._2
    sumxx = sumxx + list.head._1 * list.head._1
    list = list.tail
  }

  val slope  = (size * sumxy - sumx * sumy) / (size * sumxx - sumx * sumx)
  val offset = (sumy * sumxx - sumx * sumxy) / (size * sumxx - sumx * sumx)
  (slope, offset)
}
```
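As a quick correctness check of the fused loop (ours, not part of the benchmark suite; the definition is repeated so the snippet is self-contained), points sampled exactly from y = 3x - 4 should be recovered exactly:

```scala
// Self-contained copy of the fused loop, checked against a known line.
def leastSquaresManual2(data: List[(Double, Double)]): (Double, Double) = {
  val size = data.length
  var list = data
  var sumx = 0.0
  var sumy = 0.0
  var sumxy = 0.0
  var sumxx = 0.0
  while (!list.isEmpty) {
    sumx  = sumx  + list.head._1
    sumy  = sumy  + list.head._2
    sumxy = sumxy + list.head._1 * list.head._2
    sumxx = sumxx + list.head._1 * list.head._1
    list = list.tail
  }
  val slope  = (size * sumxy - sumx * sumy) / (size * sumxx - sumx * sumx)
  val offset = (sumy * sumxx - sumx * sumxy) / (size * sumxx - sumx * sumx)
  (slope, offset)
}

val pts = List.tabulate(50)(i => (i.toDouble, 3.0 * i - 4.0))
val (m, b) = leastSquaresManual2(pts)
// m ≈ 3.0, b ≈ -4.0
```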
This is the most effective transformation, since it performs both horizontal and vertical fusion. The list is traversed only once instead of four times, as a result of horizontal fusion. However, the conditions for which this technique is applicable are very restrictive (e.g. a closed world restriction). This example is meant to give a lower bound on the running time, assuming unlimited power to transform the program.
The code can be found here: https://github.com/miniboxing/ildl-plugin/blob/oopsla15/tests/benchmarks/src/ildl/benchmark/deforest/LeastSquares.scala#L223-L249
Okay, what are the results? The benchmark can be run from `sbt`:

```
$ cd ildl-plugin
$ sbt 'ildl-benchmarks/runMain ildl.benchmark.deforest.BenchmarkRunner'
```
The results obtained on our test machine for 5 million elements are:

| Benchmark              | Time (ms) | Speedup | Garbage* (MB) | GC time* (ms) | Garbage+ (MB) | GC time+ (ms) |
| ---------------------- | --------- | ------- | ------------- | ------------- | ------------- | ------------- |
| LSLR, original         | 8264      | none    | 1166          | 7547          | 809           | 5317          |
| LSLR, scala-blitz      | 3464      | 2.4x    | 468           | 2936          | 1165          | 5236          |
| LSLR, adrt generic     | 429       | 19.3x   | 701           | 3             | 933           | 5210          |
| LSLR, adrt miniboxed   | 280       | 29.5x   | 0             | 0             | 701           | 5193          |
| LSLR, manual traversal | 195       | 42.4x   | 0             | 0             | 702           | 5269          |
| LSLR, manual fusion    | 79        | 105.0x  | 0             | 0             | 702           | 5282          |

- `*` garbage and GC pause time during the benchmark run
- `+` garbage and GC pause time between the benchmark runs (mainly preparing the data)
Note that the benchmark runs for sizes between 1 million elements and 5 million elements, but here we report only numbers for 5M elements.
There are some observations / fine-print disclaimers:

- Our simple `LazyList[T]` only handles `map`, but it could handle operations such as `filter`, `groupBy` and `foldRight` as well, by following the examples in the literature.
- The deforestation example here should not be mistaken for fusion (which is only done in the last manual transformation). In a fusion transformation, the five values in the least squares method (`size`, `sumx`, `sumy`, `sumxx` and `sumxy`) are computed in a single list traversal, making the execution much more efficient. However, since the `sum` method needs to return a value, this is currently not possible with the `adrt` infrastructure.
- The numbers we obtained this time are significantly better than the ones reported in the paper. We noticed the difference when we implemented the manual traversal, which took only 195 ms. Tracing back the difference, we realized we were triggering an unwarranted garbage collection run right before executing the `adrt` scope, counting it towards the execution time. Now, by running a garbage collection cycle in the setup phase, we do not incur any GC cycle for the miniboxed `adrt` version (or the manual versions), and we only incur one GC cycle in the generic case.
- Note that, given a large enough heap (in this case, 8GB), the original benchmark runs in 700 ms instead of 8264 ms. This time corresponds to traversing the list and allocating the intermediate heap objects involved in the processing. On the HotSpot virtual machine, allocation is highly optimized, while garbage collection is more expensive. Giving the benchmark 8GB of heap essentially ignores the garbage collection cost, allowing it to keep allocating without incurring GC overhead. Still, when the heap space eventually runs out (which we do not take into account here), we expect the GC overhead to be substantial.
- see the next sample transformation: optimizing locality
- get back to the home page