Sample ~ Deforestation - miniboxing/ildl-plugin GitHub Wiki
If you haven't read the [[introduction|Tutorial--Introduction]] and the [[first example|Tutorial--Example-(Part-1)]] already, now is a good time to do so, as this benchmark description assumes familiarity with the `ildl-plugin`.
This section presents another `adrt` use case: collection deforestation. In the process, we will show how the `adrt` scope enables programmers to write transformations that match and surpass dedicated optimization tools, such as the `scala-blitz` optimizer.
The benchmark files are available in the `ildl-plugin` repository, in the `tests/benchmarks/src/ildl/benchmark/deforest` directory. If you have imported the `ildl-*` projects in the Scala IDE, this benchmark is available under the `ildl-benchmarks` project, in the `src` directory, in package `ildl.benchmark.deforest`.
In this section we will look at the least squares method for fitting a straight line that best approximates a given set of `(x, y)` points. The benchmarked methods are located in the `LeastSquares.scala` file and are called from `Benchmark.scala` using the ScalaMeter benchmarking framework.
Given a `List[(Double, Double)]`, we can compute the slope and the offset of a straight line that best approximates the data:
```scala
def leastSquaresDirect(data: List[(Double, Double)]): (Double, Double) = {
  val size  = data.length
  val sumx  = data.map(_._1).sum
  val sumy  = data.map(_._2).sum
  val sumxy = data.map(p => p._1 * p._2).sum
  val sumxx = data.map(p => p._1 * p._1).sum
  val slope  = (size * sumxy - sumx * sumy) / (size * sumxx - sumx * sumx)
  val offset = (sumy * sumxx - sumx * sumxy) / (size * sumxx - sumx * sumx)
  (slope, offset)
}
```
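Since the formulas are easy to get subtly wrong, here is a quick sanity check (ours, not part of the benchmark suite): points sampled exactly from a known line should yield that line's slope and offset. The snippet repeats the definition above so it is self-contained.

```scala
// Sanity check for leastSquaresDirect (definition repeated from above so the
// snippet stands alone). Points sampled from y = 2x + 1 should be fitted
// with slope 2 and offset 1.
def leastSquaresDirect(data: List[(Double, Double)]): (Double, Double) = {
  val size  = data.length
  val sumx  = data.map(_._1).sum
  val sumy  = data.map(_._2).sum
  val sumxy = data.map(p => p._1 * p._2).sum
  val sumxx = data.map(p => p._1 * p._1).sum
  val slope  = (size * sumxy - sumx * sumy) / (size * sumxx - sumx * sumx)
  val offset = (sumy * sumxx - sumx * sumxy) / (size * sumxx - sumx * sumx)
  (slope, offset)
}

val points = List.tabulate(100)(i => (i.toDouble, 2.0 * i + 1.0))
val (slope, offset) = leastSquaresDirect(points)
// slope ≈ 2.0, offset ≈ 1.0, up to floating-point error
```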
Of course, the ultimate goal is performance. Yet, with eager collections, performance is compromised:

```scala
scala> val list = (1 to 1000000).toList
val list: List[Int] = ...

scala> list.map(_ + 1).map(_ * 2).sum
val res0: Int = ...
```
In the above code, the first call to `map` creates a new collection, where each element has been incremented by one. This collection is used by the second `map` call, leading to the creation of a third list, where each element is multiplied by 2. Already at this point, the first and second collections can be safely discarded -- and will indeed be removed from the heap through garbage collection. Later, the call to `sum` produces a single value, summing up all the elements of the third collection and making it redundant as well. At this point, the code has created two intermediate collections (three, counting the original list), which took time and memory to create and will later take time to garbage collect.
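To make the cost concrete, the pipeline above can be collapsed by hand into a single `foldLeft` that allocates no intermediate lists. The following sketch (ours, for illustration only) checks that the two forms agree:

```scala
// Hand-fused equivalent of list.map(_ + 1).map(_ * 2).sum:
// one traversal, zero intermediate collections.
val list = (1 to 1000).toList

val chained = list.map(_ + 1).map(_ * 2).sum                   // builds two extra lists
val fused   = list.foldLeft(0)((acc, x) => acc + (x + 1) * 2)  // builds none

// both equal 1003000
```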
The natural answer is to eliminate the intermediate collections produced by successive comprehensions applied to a collection, an optimization known as deforestation. While we cannot afford a full explanation of deforestation here, it is well described in the literature: http://homepages.inf.ed.ac.uk/wadler/topics/deforestation.html
One of the ways this can be achieved is through the `adrt` scope:

```scala
adrt(erased.ListAsLazyList) {
  def leastSquaresADRTGeneric(data: List[(Double, Double)]): (Double, Double) = {
    // same code as before:
    val size  = data.length
    val sumx  = data.map(_._1).sum
    val sumy  = data.map(_._2).sum
    val sumxy = data.map(p => p._1 * p._2).sum
    val sumxx = data.map(p => p._1 * p._1).sum
    val slope  = (size * sumxy - sumx * sumy) / (size * sumxx - sumx * sumx)
    val offset = (sumy * sumxx - sumx * sumxy) / (size * sumxx - sumx * sumx)
    (slope, offset)
  }
}
```
Next, we will see how this is achieved.
We tried several approaches to optimizing the least squares regression:

- using the scala-blitz tool;
- using `adrt` scopes, in two variants: erased and miniboxed;
- manually rewriting the code by hand.

We will present each approach with its explanation.
The scala-blitz tool promises to auto-magically improve collection performance, just by wrapping the code in its `optimize` scope:

```scala
def leastSquaresBlitz(data: List[(Double, Double)]): (Double, Double) =
  optimize {
    // same code as before:
    val size  = data.length
    val sumx  = data.map(_._1).sum
    val sumy  = data.map(_._2).sum
    val sumxy = data.map(p => p._1 * p._2).sum
    val sumxx = data.map(p => p._1 * p._1).sum
    val slope  = (size * sumxy - sumx * sumy) / (size * sumxx - sumx * sumx)
    val offset = (sumy * sumxx - sumx * sumxy) / (size * sumxx - sumx * sumx)
    (slope, offset)
  }
```
As part of the benchmarking process, aside from the original benchmark, we have also tested `scala-blitz`. The code can be found here: https://github.com/miniboxing/ildl-plugin/blob/oopsla15/tests/benchmarks/src/ildl/benchmark/deforest/LeastSquares.scala#L171-L182. The `scala-blitz` tool was quickly able to cut the execution time by 2.4x. This raised two questions:

- Can our `adrt` scope do the same?
- Can it do even better?
A simple solution that avoids intermediate lists is to collect all the functions mapped over a collection and only compute the result when necessary. In our example, a result is only needed when calling `sum`. To do this, we create a `LazyList` interface:
```scala
/**
 * This is the lazy list we're planning to use instead of
 * scala.collection.immutable.List[T].
 *
 * The list can be in one of two states:
 *  - it is just a wrapper over a List[T], with no accumulated maps;
 *  - it has accumulated maps, so it is a List[T] plus a function that
 *    composes the accumulated maps.
 */
sealed trait LazyList[T] {
  /** Map */
  def map[U, That](f: T => U): LazyList[U]

  /** Fold */
  def foldLeft[U](z: U)(f: (U, T) => U): U

  /** Length */
  def length: Int

  /** Force: get a list */
  def force: List[T]
}
```
This interface has two possible implementations:

- a simple wrapper over a list, which we call `LazyListWrapper`, and
- a container holding a list plus the composition of all the functions that have to be applied, which we call `LazyListMapper`.
The `LazyListWrapper` implementation is shown below:

```scala
class LazyListWrapper[T](list: List[T]) extends LazyList[T] {
  def map[U, That](f: T => U) =
    new LazyListMapper(list, f)

  def foldLeft[U](z: U)(f: (U, T) => U): U = {
    var lst = list
    var acc = z
    while (lst != Nil) {
      acc = f(acc, lst.head)
      lst = lst.tail
    }
    acc
  }

  def length: Int =
    list.length // since we don't support filter yet

  def force: List[T] = list
}
```
The `LazyListWrapper` class corresponds to a freshly created `LazyList`, which has not seen any `map` calls so far. In contrast, the `LazyListMapper` carries an aggregate function, which composes all the functions mapped up to that point:
```scala
class LazyListMapper[T, To](list: List[To], fs: To => T) extends LazyList[T] {
  def map[U, That](f: T => U) =
    new LazyListMapper(list, fs andThen f)

  def foldLeft[U](z: U)(f: (U, T) => U): U = {
    var lst = list
    var acc = z
    while (lst != Nil) {
      acc = f(acc, fs(lst.head))
      lst = lst.tail
    }
    acc
  }

  def length: Int =
    list.length // since we don't support filter yet

  def force: List[T] = list.map(fs)
}
```
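To see the laziness in action, here is a small self-contained replica of the two classes above (renamed `Demo*`, since recent Scala versions ship their own `scala.collection.immutable.LazyList`, and simplified to use the standard `foldLeft` internally). A counter records how many times the underlying list is actually walked:

```scala
// Minimal replica of LazyList/LazyListWrapper/LazyListMapper, renamed to
// avoid clashing with the stdlib LazyList. The counter shows that map calls
// allocate no lists and the data is traversed exactly once, at foldLeft.
var traversals = 0

sealed trait DemoLazy[T] {
  def map[U](f: T => U): DemoLazy[U]
  def foldLeft[U](z: U)(f: (U, T) => U): U
}

class DemoWrapper[T](list: List[T]) extends DemoLazy[T] {
  def map[U](f: T => U) = new DemoMapper(list, f)
  def foldLeft[U](z: U)(f: (U, T) => U): U = {
    traversals += 1
    list.foldLeft(z)(f)
  }
}

class DemoMapper[T, From](list: List[From], fs: From => T) extends DemoLazy[T] {
  def map[U](f: T => U) = new DemoMapper(list, fs andThen f)
  def foldLeft[U](z: U)(f: (U, T) => U): U = {
    traversals += 1
    list.foldLeft(z)((acc, x) => f(acc, fs(x)))
  }
}

// The two map calls only compose functions; nothing is traversed yet:
val pipeline = new DemoWrapper(List(1, 2, 3)).map(_ + 1).map(_ * 2)
val result   = pipeline.foldLeft(0)(_ + _)  // the single traversal happens here
// result == 18, traversals == 1
```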
These classes form the basic building blocks of our transformation. As with the other benchmarks, we have split the benchmark into several steps:

```
List[T] ===> LazyList[T] ===> LazyList[@miniboxed T] -+-> manual traversal -+-> manual fusion
   \              ^                    ^              /                     /
    \___step1____/                    /              /                     /
     erased.ListAsLazyList           /           manual                manual
      \                             /        transformation        transformation
       \___step2___________________/
        miniboxed.ListAsLazyList
```
We will now go through each transformation.
Can we automate this transformation? Yes: we can use the `adrt` scope to automate the transformation from `List[T]` to `LazyList[T]`.
To start, we need to specify how a list is transformed:

```scala
object ListAsLazyList extends TransformationDescription {
  // conversions:
  def toRepr[T](list: List[T]): LazyList[T] @high = new LazyListWrapper(list)
  def fromRepr[T](lazylist: LazyList[T] @high): List[T] = lazylist.force
}
```
The generic `fromRepr` and `toRepr` methods allow the `adrt` scope to transform between `List[T]` and `LazyList[T]`. That is great, but there is a problem: the scope also needs to handle the `List[T]` operations, such as `map`:

```scala
def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That
```
To do so, we add the following operations to the transformation object:

```scala
object ListAsLazyList extends TransformationDescription {
  // conversions:
  def toRepr[T](list: List[T]): LazyList[T] @high = new LazyListWrapper(list)
  def fromRepr[T](lazylist: LazyList[T] @high): List[T] = lazylist.force

  // operations:
  // optimizing the length method:
  def extension_length[T](lazylist: LazyList[T] @high) =
    lazylist.length

  // optimizing the map method:
  def extension_map[T, U, That](lazylist: LazyList[T] @high)
      (f: T => U)(implicit bf: CanBuildFrom[List[T], U, That]): LazyList[U] @high =
    ...

  // optimizing the foldLeft method:
  def extension_foldLeft[T, U](lazylist: LazyList[T] @high)
      (z: U)(f: (U, T) => U): U =
    lazylist.foldLeft(z)(f)

  // optimizing the sum method:
  def extension_sum[T, U >: T](lazylist: LazyList[T] @high)
      (implicit num: Numeric[U]): U =
    lazylist.foldLeft(num.zero)(num.plus)
}
```
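The `extension_sum` redirection relies on the standard identity that summing is folding with a `Numeric` instance. As a plain-Scala check of that identity (outside the plugin, using only the standard library):

```scala
// sum expressed as foldLeft(num.zero)(num.plus), for any Numeric element type.
def sumViaFold[T](list: List[T])(implicit num: Numeric[T]): T =
  list.foldLeft(num.zero)(num.plus)

val intSum    = sumViaFold(List(1, 2, 3))       // same as List(1, 2, 3).sum
val doubleSum = sumViaFold(List(1.5, 2.5, 3.0)) // same as the .sum result
```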
This code describes exactly how to perform the `List[T]` operations directly on the `LazyList[T]` representation. The mechanism is further described on the [[Transformation Description page|Details-~-Transformation-Description]]. But let us focus on the body of the `extension_map` method:
```scala
def extension_map[T, U, That](lazylist: LazyList[T] @high)
    (f: T => U)(implicit bf: CanBuildFrom[List[T], U, That]): LazyList[U] @high = {
  // sanity check => we could accept arbitrary CanBuildFrom objects,
  // but that would make the transformation slightly more complex
  assert(bf == List.ReusableCBF, "The LazyList transformation only supports " +
    "using the default `CanBuildFrom` object " +
    "from the Scala collections library.")
  lazylist.map(f)
}
```
There are several things to notice here:

- the `map` method of `List[T]` takes more parameters than the `LazyList[T]` `map`, specifically the `CanBuildFrom` object;
- the `CanBuildFrom` object is matched against the standard collections object that would produce another `List[U]` -- so we cannot inject a custom `CanBuildFrom` object;
- at the expense of the principle of least surprise, we could enforce the previous point in the type system:
```scala
// optimizing the map method:
def extension_map[T, U, That](lazylist: LazyList[T] @high)
    (f: T => U)(implicit bf: List.ReusableCBF.type): LazyList[U] @high = {
```
This would prevent other `CanBuildFrom` objects from type-checking and would therefore produce the warning we have seen before whenever `map` is invoked with a custom `CanBuildFrom` object.
Another thing to notice is the definition of the `extension_foldLeft` method. This is the signature of `foldLeft` in `List[A]`:

```scala
def foldLeft[B](z: B)(f: (B, A) => B): B = ...
```
If we look at the `extension_foldLeft` method, we can see that it has two type parameters instead of one:

```scala
def extension_foldLeft[T, U](lazylist: LazyList[T] @high)
    (z: U)(f: (U, T) => U): U =
  lazylist.foldLeft(z)(f)
```
Although the arguments must match exactly, the type parameters will be re-inferred when redirecting methods. This allows programmers to add or remove type parameters when defining the transformation description object.
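As a plain-Scala analogy (our sketch; the actual redirection is performed by the plugin), a forwarding method may declare two type parameters even though `foldLeft` on `List` declares only one, because the compiler re-infers the parameters afresh at each call site:

```scala
// List[T].foldLeft has one type parameter; this forwarder declares two.
// Both are inferred at the call site, so the redirection type-checks.
def redirectFoldLeft[T, U](list: List[T])(z: U)(f: (U, T) => U): U =
  list.foldLeft(z)(f)

val total = redirectFoldLeft(List(1, 2, 3))(0)(_ + _)  // T = Int, U = Int inferred
// total == 6
```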
Using this `ildl` transformation improved the running time by almost 10x compared to the scala-blitz optimizer.
During a discussion with @DarkDimius, one of the authors of `scala-blitz`, he pointed out that, for the least squares example, `scala-blitz` only performs specialization, not deforestation. Therefore, a natural question is: can we specialize `LazyList` and its subclasses?
That is a good question. We tried doing just that:

- we created a miniboxed version of `LazyList`;
- we created a transformation description object that uses the `LazySpecializedList`.
Adding another version of the least squares method was very easy:

```scala
adrt(miniboxed.ListAsLazyList) {
  def leastSquaresADRTSpecialized(data: List[(Double, Double)]): (Double, Double) = {
    ... // same code as before
  }
}
```
This version further improved the running time by 50% compared to the erased version of the `LazyList`.
As a curiosity, the Scala compiler has 25 phases, but when both the `ildl-plugin` and the `miniboxing-plugin` are active, they add 21 extra phases: 15 for the miniboxing transformation and 6 for the `ildl-plugin`. Of course, including these plugins in the Scala compiler itself would reduce the number of phases considerably. :)
To understand the best-case scenario, we also wrote the list traversals by hand:
```scala
/** Least squares with manual traversal */
def leastSquaresManual1(data: List[(Double, Double)]): (Double, Double) = {
  val size = data.length
  var list = data
  var sumx = 0.0
  var sumy = 0.0
  var sumxy = 0.0
  var sumxx = 0.0

  // val sumx = data.map(_._1).sum
  list = data
  while (!list.isEmpty) {
    sumx = sumx + list.head._1
    list = list.tail
  }

  // val sumy = data.map(_._2).sum
  list = data
  while (!list.isEmpty) {
    sumy = sumy + list.head._2
    list = list.tail
  }

  // val sumxy = data.map(p => p._1 * p._2).sum
  list = data
  while (!list.isEmpty) {
    sumxy = sumxy + list.head._1 * list.head._2
    list = list.tail
  }

  // val sumxx = data.map(p => p._1 * p._1).sum
  list = data
  while (!list.isEmpty) {
    sumxx = sumxx + list.head._1 * list.head._1
    list = list.tail
  }

  val slope  = (size * sumxy - sumx * sumy) / (size * sumxx - sumx * sumx)
  val offset = (sumy * sumxx - sumx * sumxy) / (size * sumxx - sumx * sumx)
  (slope, offset)
}
```
This is the manual counterpart of the automated deforestation that we have seen in the previous examples. It reduces the running time further, since the automated version still pays some abstraction cost for applying the composed functions instead of manipulating the elements directly. This example shows the additional benefit of inlining the functions.
The code can be found here: https://github.com/miniboxing/ildl-plugin/blob/oopsla15/tests/benchmarks/src/ildl/benchmark/deforest/LeastSquares.scala#L185-L221
Finally, to understand the most that can be achieved, we fused all the operations into a single traversal:
```scala
/** Least squares with manual traversal + fusion */
def leastSquaresManual2(data: List[(Double, Double)]): (Double, Double) = {
  val size = data.length
  var list = data
  var sumx = 0.0
  var sumy = 0.0
  var sumxy = 0.0
  var sumxx = 0.0

  // val sumx  = data.map(_._1).sum
  // val sumy  = data.map(_._2).sum
  // val sumxy = data.map(p => p._1 * p._2).sum
  // val sumxx = data.map(p => p._1 * p._1).sum
  list = data
  while (!list.isEmpty) {
    sumx  = sumx  + list.head._1
    sumy  = sumy  + list.head._2
    sumxy = sumxy + list.head._1 * list.head._2
    sumxx = sumxx + list.head._1 * list.head._1
    list = list.tail
  }

  val slope  = (size * sumxy - sumx * sumy) / (size * sumxx - sumx * sumx)
  val offset = (sumy * sumxx - sumx * sumxy) / (size * sumxx - sumx * sumx)
  (slope, offset)
}
```
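As a quick correctness check of the fused loop (ours, not part of the benchmark suite; the definition is repeated so the snippet is self-contained), points sampled exactly from y = 3x - 4 should be recovered exactly:

```scala
// Self-contained copy of the fused loop, checked against a known line.
def leastSquaresManual2(data: List[(Double, Double)]): (Double, Double) = {
  val size = data.length
  var list = data
  var sumx = 0.0
  var sumy = 0.0
  var sumxy = 0.0
  var sumxx = 0.0
  while (!list.isEmpty) {
    sumx  = sumx  + list.head._1
    sumy  = sumy  + list.head._2
    sumxy = sumxy + list.head._1 * list.head._2
    sumxx = sumxx + list.head._1 * list.head._1
    list = list.tail
  }
  val slope  = (size * sumxy - sumx * sumy) / (size * sumxx - sumx * sumx)
  val offset = (sumy * sumxx - sumx * sumxy) / (size * sumxx - sumx * sumx)
  (slope, offset)
}

val pts = List.tabulate(50)(i => (i.toDouble, 3.0 * i - 4.0))
val (m, b) = leastSquaresManual2(pts)
// m ≈ 3.0, b ≈ -4.0
```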
This is the most effective transformation, since it performs both horizontal and vertical fusion. The list is traversed only once instead of four times, as a result of horizontal fusion. However, the conditions for which this technique is applicable are very restrictive (e.g. a closed world restriction). This example is meant to give a lower bound on the running time, assuming unlimited power to transform the program.
The code can be found here: https://github.com/miniboxing/ildl-plugin/blob/oopsla15/tests/benchmarks/src/ildl/benchmark/deforest/LeastSquares.scala#L223-L249
Okay, what are the results? The benchmark can be run from `sbt`:

```
$ cd ildl-plugin
$ sbt 'ildl-benchmarks/runMain ildl.benchmark.deforest.BenchmarkRunner'
```
The results obtained on our test machine for 5 million elements are:

| Benchmark              | Time (ms) | Speedup | Garbage* (MB) | GC time* (ms) | Garbage+ (MB) | GC time+ (ms) |
| ---------------------- | --------- | ------- | ------------- | ------------- | ------------- | ------------- |
| LSLR, original         | 8264      | none    | 1166          | 7547          | 809           | 5317          |
| LSLR, scala-blitz      | 3464      | 2.4x    | 468           | 2936          | 1165          | 5236          |
| LSLR, adrt generic     | 429       | 19.3x   | 701           | 3             | 933           | 5210          |
| LSLR, adrt miniboxed   | 280       | 29.5x   | 0             | 0             | 701           | 5193          |
| LSLR, manual traversal | 195       | 42.4x   | 0             | 0             | 702           | 5269          |
| LSLR, manual fusion    | 79        | 105.0x  | 0             | 0             | 702           | 5282          |

- `*` garbage and GC pause time during the benchmark run
- `+` garbage and GC pause time between the benchmark runs (mainly preparing the data)
Note that the benchmark runs for sizes between 1 million elements and 5 million elements, but here we report only numbers for 5M elements.
There are some observations / fine-print disclaimers:

- Our simple `LazyList[T]` only handles `map`, but it could handle operations such as `filter`, `groupBy` and `foldRight` as well, by following the examples in the literature.
- The deforestation example here should not be mistaken for fusion (which is only done in the last manual transformation). In a fusion transformation, the five values in the least squares method (`size`, `sumx`, `sumy`, `sumxx` and `sumxy`) are computed in a single list traversal, making the execution much more efficient. However, since the `sum` method needs to return a value, this is currently not possible with the `adrt` infrastructure.
- The numbers we obtained this time are significantly better than the ones reported in the paper. We noticed the difference when we implemented the manual traversal, which took only 195 ms. Tracing back the difference, we realized we were triggering an unwarranted garbage collection run right before executing the `adrt` scope, counting it towards the execution time. Now, by running a garbage collection cycle in the setup phase, we do not incur any GC cycle for the miniboxed `adrt` version (or the manual versions), and we only incur one GC cycle in the generic case.
- Note that, given a large enough heap (in this case, 8GB), the original benchmark runs in 700 ms instead of 8264 ms. This time corresponds to traversing the list and allocating the intermediate heap objects involved in the processing. On the HotSpot virtual machine, allocation is highly optimized, while garbage collection is more expensive. Giving the benchmark 8GB of heap essentially ignores the garbage collection cost, allowing it to keep allocating without incurring GC overhead. Still, when the heap space eventually runs out (which we do not take into account here), we expect the GC overhead to be substantial.
- see the next sample transformation: optimizing locality
- get back to the home page