Lecture08 - nus-cs2030/2324-s2 GitHub Wiki

Java Streams

In this lecture, we focus on the stream programming model, where we define Java Streams to perform iteration (or looping). Streams are lazily evaluated, which enables us to write infinite streams. Streams can also take advantage of multi-core processors by allowing computation to be done in parallel.

External vs Internal Iteration

Let us start with a simple iteration problem:

jshell> int sum = 0
sum ==> 0

jshell> for (int x = 1; x <= 10; x = x + 1) {
   ...>    sum = sum + x;
   ...> }

jshell> sum
sum ==> 55

There are many state changes involved in this iterative code; one can count the number of assignments.

Compare this with the stream version below. Here we make use of the primitive integer stream IntStream.

jshell> int sum = IntStream.rangeClosed(1, 10).
   ...> sum()
sum ==> 55

Notice that other than assigning the resulting sum, there are no further assignments, hence it is (side-)effect free. One may argue that the computations of rangeClosed and sum are abstracted away; however, we can equivalently write an effect-free solution using the more general iterate and reduce operations with no need for further assignments.

jshell> int sum = IntStream.iterate(1, x -> x <= 10, x -> x + 1).
   ...> reduce(0, (x, y) -> x + y)
sum ==> 55

A stream represents a sequence of elements generated via a generator (or data source). These elements go through some processing via intermediate pipeline operations, and end up at a terminator with a result.

It is also interesting to note that the iterate method suggests taking in a seed value, followed by a predicate as the second argument, and then a function. Since IntStream only processes integer elements, the predicate and the function are their primitive equivalents, i.e. IntPredicate of type (int -> boolean) and IntUnaryOperator of type (int -> int).

At this point, you are advised to familiarize yourself with some of the operations of IntStream in the Java API. Take note of the generators (static factory methods), the intermediate pipeline operations (methods that return IntStream) and the terminators (methods that do not return IntStream).
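
The three kinds of operations can be seen together in a single pipeline. The following is only a minimal sketch; the class name PipelineParts and the method sumOfEvenSquares are made up for illustration and are not part of the course material.

```java
import java.util.stream.IntStream;

class PipelineParts {
    // Sums the squares of the even numbers in the range [1, n].
    static int sumOfEvenSquares(int n) {
        return IntStream.rangeClosed(1, n)  // generator: static factory method
            .filter(x -> x % 2 == 0)        // intermediate: returns IntStream
            .map(x -> x * x)                // intermediate: returns IntStream
            .sum();                         // terminator: returns int
    }

    public static void main(String[] args) {
        // 4 + 16 + 36 + 64 + 100
        System.out.println(sumOfEvenSquares(10)); // prints 220
    }
}
```

A quick way to classify a method in the API is to look at its return type: if it returns IntStream it can be chained further, otherwise it ends the pipeline.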

Most methods do not have side-effects (i.e. no mutable state), other than IntStream.builder and collect. These will not be used in our course. If you are interested, IntStream.builder allows us to build a stream by adding elements imperatively, and collect allows us to define an imperative reduction and construct a mutable collection out of the stream.

Function Typing vs Java Typing

As Java is inherently OOP, each function type that a stream's higher order methods (e.g. map and filter) take in is associated with a name. You have already seen IntUnaryOperator which is a mapping from integer to integer, hence (int -> int). Indeed when using say, the map method of IntStream, one only needs to note the behaviour of the function, i.e. taking in an integer input argument and returning an integer result, so as to enable us to effectively write stream expressions:

jshell> IntStream.rangeClosed(1,10).map(x -> x * 2).filter(x -> x % 3 == 0).count() // 6, 12, 18
$.. ==> 3

You will, however, need to be more careful when associating types with the functions, e.g.

jshell> Function<Integer,Integer> f = x -> x * 2
f ==> $Lambda$..

jshell> Predicate<Integer> p = x -> x % 3 == 0
p ==> $Lambda$..

jshell> IntStream.rangeClosed(1,10).map(f).filter(p).count()
|  Error:
|  incompatible types: java.util.function.Function<java.lang.Integer,java.lang.Integer> cannot be converted to java.util.function.IntUnaryOperator
|  IntStream.rangeClosed(1,10).map(f).filter(p).count()
|

jshell> IntUnaryOperator f = x -> x * 2
f ==> $Lambda$..

jshell> IntPredicate p = x -> x % 3 == 0
p ==> $Lambda$..

jshell> IntStream.rangeClosed(1,10).map(f).filter(p).count()
$.. ==> 3
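
If the functions are already declared with the generic types Function<Integer,Integer> and Predicate<Integer>, there are at least two ways to still use them with an IntStream. The sketch below is illustrative only (the class TypingDemo and its method names are hypothetical): one option converts the stream with boxed(), the other wraps each function in a lambda so that the compiler can target the primitive functional interfaces.

```java
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.IntStream;

class TypingDemo {
    static final Function<Integer, Integer> F = x -> x * 2;
    static final Predicate<Integer> P = x -> x % 3 == 0;

    // Option 1: convert IntStream to Stream<Integer>, whose map and filter
    // accept Function and Predicate.
    static long countUsingBoxed() {
        return IntStream.rangeClosed(1, 10)
            .boxed()
            .map(F)
            .filter(P)
            .count();
    }

    // Option 2: stay in IntStream and wrap the generic functions in lambdas;
    // the lambdas are typed as IntUnaryOperator and IntPredicate.
    static long countUsingWrappers() {
        return IntStream.rangeClosed(1, 10)
            .map(x -> F.apply(x))
            .filter(x -> P.test(x))
            .count();
    }

    public static void main(String[] args) {
        System.out.println(countUsingBoxed());    // prints 3
        System.out.println(countUsingWrappers()); // prints 3
    }
}
```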

ImList vs Stream

It is also interesting to compare our ImList that we have extended as a collection pipeline during one of our recitation sessions and a Java stream. ImList is made up of a finite list of elements and strictly evaluated; while Stream is lazily evaluated. You can observe the difference in behaviour below:

jshell> new ImList<Integer>(List.of(1, 2, 3)).
   ...> map(x -> { System.out.println(x.toString()); return x * 2;})
1
2
3
$.. ==> [2, 4, 6]

jshell> new ImList<Integer>(List.of(1, 2, 3)).
   ...> map(x -> { System.out.println(x.toString()); return x * 2;}).
   ...> reduce(0, (x,y) -> x + y)
1
2
3
$.. ==> 12

Both pipelines above perform the mapping regardless of whether a reduce method terminates the pipeline. Compare this with a solution using IntStream:

jshell> IntStream.of(1, 2, 3).
   ...> map(x -> { System.out.println(x + ""); return x * 2;})
$.. ==> java.util.stream.IntPipeline$4@4bf558aa

jshell> IntStream.of(1, 2, 3).
   ...> map(x -> { System.out.println(x + ""); return x * 2;}).
   ...> reduce(0, (x,y) -> x + y)
1
2
3
$.. ==> 12

Notice that no operation is done on the elements in the absence of a terminal operation.

Another difference is that ImList can be processed multiple times, while streams can only be processed once.

jshell> ImList<Integer> list = new ImList<Integer>(List.of(1, 2, 3)).
   ...> map(x -> x * 2)
list ==> [2, 4, 6]

jshell> list.reduce(0, (x, y) -> x + y) // sum the elements
$.. ==> 12

jshell> list.reduce(0, (x, y) -> x + 1) // count the number of elements
$.. ==> 3

jshell> IntStream stream = IntStream.of(1, 2, 3).
   ...> map(x -> x * 2)
stream ==> java.util.stream.IntPipeline$4@497470ed

jshell> stream.reduce(0, (x, y) -> x + y)
$.. ==> 12

jshell> stream.reduce(0, (x, y) -> x + 1)
|  Exception java.lang.IllegalStateException: stream has already been operated upon or closed
|        at AbstractPipeline.evaluate (AbstractPipeline.java:229)
|        at IntPipeline.reduce (IntPipeline.java:515)
|        at (#26:1)
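
When the same pipeline has to be evaluated more than once, one common workaround is to keep a Supplier that builds a fresh stream on every call. This is a sketch under that assumption, not something prescribed by the course; the class ReuseDemo and its members are hypothetical names.

```java
import java.util.function.Supplier;
import java.util.stream.IntStream;

class ReuseDemo {
    // Each call to get() constructs a brand new pipeline.
    static final Supplier<IntStream> DOUBLED =
        () -> IntStream.of(1, 2, 3).map(x -> x * 2);

    static int sum() {
        return DOUBLED.get().reduce(0, (x, y) -> x + y);
    }

    static int count() {
        return DOUBLED.get().reduce(0, (x, y) -> x + 1);
    }

    // Returns true if a second terminal operation on the same stream fails.
    static boolean reuseThrows() {
        IntStream stream = IntStream.of(1, 2, 3).map(x -> x * 2);
        stream.reduce(0, (x, y) -> x + y);      // first terminal: fine
        try {
            stream.reduce(0, (x, y) -> x + 1);  // second terminal on the same stream
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(sum());         // prints 12
        System.out.println(count());       // prints 3
        System.out.println(reuseThrows()); // prints true
    }
}
```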

Example: isPrime method

Here is an example of using streams to determine if a given integer is prime or otherwise:

jshell> boolean isPrime(int n) {
   ...>     return n > 1 && IntStream.range(2, n).
   ...>     noneMatch(x -> n % x == 0);
   ...> }
|  created method isPrime(int)

jshell> isPrime(3)
$.. ==> true

jshell> isPrime(9)
$.. ==> false

jshell> isPrime(11)
$.. ==> true

An alternative stream implementation is given below that uses more general pipeline operations:

jshell> boolean isPrime(int n) { // alternative solution using filter and count
   ...>     return n > 1 && IntStream.range(2, n).
   ...>     filter(x -> n % x == 0).
   ...>     reduce(0, (x, y) -> x + 1) == 0; // or count() == 0;
   ...> }
|  modified method isPrime(int)

How do we make use of isPrime to generate the first five hundred primes? Doing this in an iterative way requires us to consciously count the number of valid primes as the while loop progresses.

jshell> int n = 2;
n ==> 2

jshell> int numOfPrimes = 0;
numOfPrimes ==> 0

jshell> while (numOfPrimes < 500) {
   ...>     if (isPrime(n)) {
   ...>         System.out.println(n);
   ...>         numOfPrimes = numOfPrimes + 1;
   ...>     }
   ...>     n = n + 1;
   ...> }
2
3
5
:
3571

With streams, one can just declare to iterate successive values starting from 2, filter those values that are prime, and limit these to 500 values.

jshell> IntStream.iterate(2, x -> x + 1).
   ...> filter(x -> isPrime(x)).
   ...> limit(500).
   ...> forEach(x -> System.out.println(x))
2
3
5
:
3571

Indeed, iterate suggests a limitless (or infinite) iteration of elements; compare this with the three-argument version that we used earlier. This is possible because streams are lazily evaluated. One can construct an infinite stream in the following way

jshell> IntStream.iterate(1, x -> x + 1).
   ...> filter(x -> isPrime(x))
$.. ==> java.util.stream.IntPipeline$..

and no evaluation is done since there are no terminal operations; you cannot do that using imperative control flow. However, if you do have a terminal (e.g. forEach) then you will need to limit the number of elements first.
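
The same pipeline can be packaged as a method that returns the primes instead of printing them. The sketch below reuses the isPrime method from above; the class name PrimeDemo and the method firstPrimes are hypothetical.

```java
import java.util.stream.IntStream;

class PrimeDemo {
    static boolean isPrime(int n) {
        return n > 1 && IntStream.range(2, n)
            .noneMatch(x -> n % x == 0);
    }

    // Returns the first n primes. The stream is infinite, so limit must
    // come before the terminal operation toArray.
    static int[] firstPrimes(int n) {
        return IntStream.iterate(2, x -> x + 1)
            .filter(x -> isPrime(x))
            .limit(n)
            .toArray();
    }

    public static void main(String[] args) {
        int[] primes = firstPrimes(500);
        System.out.println(primes[0]);   // prints 2
        System.out.println(primes[499]); // prints 3571
    }
}
```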

At times you may see that printing each element using forEach is written as forEach(System.out::println). This syntax uses a method reference. You are advised not to use any method references due to limitations of our grader.

The reduce method

As mentioned earlier, reduce results in a final outcome by aggregating the stream elements. There are two forms of reduce: the one-argument version and the two-argument version.

The two-argument version is the usual one that requires a starting integer value to be specified, followed by an IntBinaryOperator function ((int, int) -> int):

jshell> IntStream.rangeClosed(1, 10).
   ...> reduce(0, (x, y) -> x + y)
$.. ==> 55

On the other hand, the one-argument reduce begins reduction from the first element. If there is only one element, this value is returned wrapped in an OptionalInt (an Optional that operates only on integers; different from Optional<Integer>). If there are no elements in the stream, OptionalInt.empty is returned. Otherwise the reduced result wrapped in OptionalInt is returned.

jshell> IntStream.rangeClosed(1, 1).
   ...> reduce((x, y) -> x + y)
$.. ==> OptionalInt[1]

jshell> IntStream.rangeClosed(1, -1).
   ...> reduce((x, y) -> x + y)
$.. ==> OptionalInt.empty

jshell> IntStream.rangeClosed(1, 10).
   ...> reduce((x, y) -> x + y)
$.. ==> OptionalInt[55]
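
The one-argument reduce is convenient when there is no obvious starting value. The sketch below (with a hypothetical class ReduceDemo and method max) finds the largest element of a range and shows how the caller can deal with the possibly empty OptionalInt using isPresent, getAsInt and orElse.

```java
import java.util.OptionalInt;
import java.util.stream.IntStream;

class ReduceDemo {
    // Largest value in [lo, hi], or OptionalInt.empty if the range is empty.
    static OptionalInt max(int lo, int hi) {
        return IntStream.rangeClosed(lo, hi)
            .reduce((x, y) -> x > y ? x : y);
    }

    public static void main(String[] args) {
        System.out.println(max(1, 10));            // prints OptionalInt[10]
        System.out.println(max(1, -1));            // prints OptionalInt.empty
        System.out.println(max(1, -1).orElse(0));  // prints 0
    }
}
```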

Nested loops using flatMap

We have seen the use of stream for looping. What about nested loops? One can make use of flatMap.

jshell> IntStream.rangeClosed(1, 3).
   ...> flatMap(x -> IntStream.rangeClosed(x, 3).map(y -> x * y)).
   ...> forEach(x -> System.out.print(x + " "))
1 2 3 4 6 9

The flatMap operation takes in a function of the form (int -> IntStream). What if we use map instead? It will result in a compilation error since map expects the function of the form (int -> int).

jshell> IntStream.rangeClosed(1, 3).
   ...> map(x -> IntStream.rangeClosed(x, 3).map(y -> x * y)).
   ...> forEach(x -> System.out.print(x + " "))
|  Error:
|  incompatible types: bad return type in lambda expression
|      java.util.stream.IntStream cannot be converted to int
|  map(x -> IntStream.rangeClosed(x, 3).map(y -> x * y)).
|
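
To see that flatMap really plays the role of a nested loop, the following sketch places an imperative version next to the stream version and produces the same values from both. The class NestedLoopDemo and its method names are hypothetical.

```java
import java.util.stream.IntStream;

class NestedLoopDemo {
    // Imperative version: the inner loop starts from the current outer value.
    static int[] productsWithLoops(int n) {
        int[] out = new int[n * (n + 1) / 2]; // number of pairs with x <= y
        int i = 0;
        for (int x = 1; x <= n; x = x + 1) {
            for (int y = x; y <= n; y = y + 1) {
                out[i] = x * y;
                i = i + 1;
            }
        }
        return out;
    }

    // Stream version: flatMap is the outer loop, the inner stream the inner loop.
    static int[] productsWithFlatMap(int n) {
        return IntStream.rangeClosed(1, n)
            .flatMap(x -> IntStream.rangeClosed(x, n).map(y -> x * y))
            .toArray();
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(productsWithFlatMap(3)));
        // prints [1, 2, 3, 4, 6, 9]
    }
}
```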

Generic Stream

So far we have worked with the primitive stream IntStream. Java also provides a generic Stream<T> type. Here is the equivalent generic stream implementation for the code above:

jshell> Stream.<Integer>of(1,2,3).
   ...> flatMap(x -> Stream.<Integer>iterate(x, y -> y <= 3, y -> y + 1).map(y -> x * y)).
   ...> forEach(x -> System.out.print(x + " "))
1 2 3 4 6 9

If we use map instead, you will notice that there is no longer a compilation error.

jshell> Stream.<Integer>of(1,2,3).
   ...> map(x -> Stream.<Integer>iterate(x, y -> y <= 3, y -> y + 1).map(y -> x * y)).
   ...> forEach(x -> System.out.print(x + " "))
java.util.stream.ReferencePipeline$.. java.util.stream.ReferencePipeline$.. java.util.stream.ReferencePipeline$..

The above generates a stream of three streams! This is because map takes in a Function<T,R> where T is bound to Integer and R is bound to Stream<Integer>; ReferencePipeline is an implementation of Stream<T> that has been exposed!

Two noteworthy methods that allow you to convert from primitive to generic streams are

  • boxed() which maps each primitive element to its wrapper type;
  • mapToObj(..) which maps each primitive element to any reference type.
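
As a small illustration of these two conversions, the sketch below turns the same IntStream into a list of Integer and into a list of String. The class ConvertDemo and the "item" label are made up for the example.

```java
import java.util.List;
import java.util.stream.IntStream;

class ConvertDemo {
    // boxed(): each int becomes an Integer, giving a Stream<Integer>.
    static List<Integer> boxedList() {
        return IntStream.rangeClosed(1, 3)
            .boxed()
            .toList();
    }

    // mapToObj(..): each int is mapped to a reference type, here String.
    static List<String> labels() {
        return IntStream.rangeClosed(1, 3)
            .mapToObj(x -> "item" + x)
            .toList();
    }

    public static void main(String[] args) {
        System.out.println(boxedList()); // prints [1, 2, 3]
        System.out.println(labels());    // prints [item1, item2, item3]
    }
}
```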

Lazy evaluation and infinite streams

We have discussed lazy evaluation in streams and their relation to infinite streams. Let us study the following pipeline closely.

jshell> Stream.<Integer>iterate(1, x -> x + 1).
   ...> map(x -> { System.out.println("map1: " + x); return x;}).
   ...> map(x -> { System.out.println("map2: " + x); return x;}).
   ...> limit(5).
   ...> toList()
map1: 1
map2: 1
map1: 2
map2: 2
map1: 3
map2: 3
map1: 4
map2: 4
map1: 5
map2: 5
$14 ==> [1, 2, 3, 4, 5]

We know that iterate will generate an infinite stream and limit will process only the first five stream elements. Within the two map operations, notice that it is not the case that all five elements go through the first map operation before they all go through the second map. Rather, each element goes through the two map operations one after another.

Specifically, when a terminal is invoked (in this case toList()), a request for a value is initiated and passed upstream:

  • toList signals to the limit operation for a value;
  • limit signals to the upstream map operation for a value (provided it has not reached the limit);
  • map requests its upstream map operation for a value;
  • map requests a value from iterate.

The iterate generator then generates a value and passes the value downstream for processing:

  • iterate passes an element to downstream map;
  • map performs transformation and passes the result to its downstream map;
  • map performs transformation and passes the result to limit;
  • limit passes the value downstream, while taking note of the number of values that have passed through it;
  • toList adds the value to the resultant list.

This is the essence of lazy evaluation in streams.
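
The request-and-response order described above can also be checked programmatically by recording the events instead of printing them. This is only a sketch for observation: the class LazyTrace is hypothetical, and the mutable log list deliberately breaks our effect-free discipline so that the order can be inspected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

class LazyTrace {
    // Records which map operation handled which element, in order.
    static List<String> trace(int n) {
        List<String> log = new ArrayList<>(); // mutable, for observation only
        Stream.<Integer>iterate(1, x -> x + 1)
            .map(x -> { log.add("map1:" + x); return x; })
            .map(x -> { log.add("map2:" + x); return x; })
            .limit(n)
            .toList();
        return log;
    }

    public static void main(String[] args) {
        System.out.println(trace(2));
        // prints [map1:1, map2:1, map1:2, map2:2]
    }
}
```

Although iterate is infinite, the log only ever holds two entries per requested element, since the sequential pipeline stops pulling values once limit has seen n of them.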

List comprehension notation

Let us start our discussion with set comprehension from mathematics. Suppose we want to generate a set comprising pairs of integers where the first value of the pair ranges from 1 to 3, and the second value of the pair ranges from the first value to 3. The notation is

$$\{(x,y)~|~x \in \{1,..,3\};~y \in \{x,..,3\}\}$$

In Python, we can generate the pairs of values using list comprehension notation. For example,

>>> [ (x,y) for x in range(1, 4) for y in range(x, 4) ]
[(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]

>>> [ x * y for x in range(1, 4) for y in range(x, 4) ]
[1, 2, 3, 4, 6, 9]

How do we generate such a list in Java?

One can make use of Stream.Builder to construct a finite stream imperatively:

jshell> Stream.Builder<Pair<Integer,Integer>> builder = Stream.<Pair<Integer,Integer>>builder()
builder ==> java.util.stream.Streams$StreamBuilderImpl@..

jshell> for (int x = 1; x <= 3; x = x + 1) {
   ...>     for (int y = x; y <= 3; y = y + 1) {
   ...>         builder.accept(new Pair<Integer,Integer>(x, y)); // mutable!
   ...>     }
   ...> }

jshell> Stream<Pair<Integer,Integer>> stream = builder.build()
stream ==> java.util.stream.ReferencePipeline$..

jshell> stream.toList()
$.. ==> [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]

Alternatively, we can make use of map and flatMap

jshell> List.<Integer>of(1, 2, 3).stream().
   ...> flatMap(x -> Stream.<Integer>iterate(x, y -> y <= 3, y -> y + 1).map(y -> new Pair<Integer,Integer>(x, y))).
   ...> toList()
$.. ==> [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]

With a given list comprehension notation, one can make use of a translation scheme to generate the equivalent Java stream using map, flatMap and filter.

Here are some examples:

  • [ 2 * x | x <- [1,2,3] ] where [1,2,3] denotes a stream generator is equivalent to
Stream.of(1,2,3).map(x -> 2 * x)
  • [ x + y | x <- [1,2,3,4], y <- [1,2,3] ] comprising two stream generators is equivalent to
Stream.of(1,2,3,4).flatMap(x -> Stream.of(1,2,3).map(y -> x + y))

What about [ z + x + y | z <- [1,2], x <- [1,2,3,4], y <- [1,2,3] ]? Notice that this can be rewritten as

Stream.of(1,2).flatMap(z -> [ z + x + y | x <- [1,2,3,4], y <- [1,2,3] ])

and expanding the inner list comprehension gives

Stream.of(1,2).flatMap(z -> Stream.of(1,2,3,4).flatMap(x -> Stream.of(1,2,3).map(y -> z + x + y)))

Moreover, a list comprehension may also comprise a generator followed by a test. An example in Python would be

>>> [ (x,y) for x in range(1,4) if x % 2 == 1 for y in range(x,4) ]
[(1, 1), (1, 2), (1, 3), (3, 3)]

with the additional condition that x values generated must be odd.

Such a list comprehension, denoted [ (x, y) | x <- [1,2,3], odd(x), y <- [x,..,3] ], is equivalent to

jshell> Stream.of(1,2,3).filter(x -> x % 2 == 1).
   ...> flatMap(x -> Stream.iterate(x, y -> y <= 3, y -> y + 1).
   ...>    map(y -> new Pair<Integer,Integer>(x,y))).
   ...> toList()
$.. ==> [(1, 1), (1, 2), (1, 3), (3, 3)]

From the examples above, we can devise a translation scheme as follows:

  • [ e | i <- str] <-> str.map(i -> e) where str is a stream generator, and e is an expression over stream elements i

  • [ e | i <- str1, j <- str2, ..E..] <-> str1.flatMap(i -> [ e | j <- str2, ..E..]) where str1, str2, ..E.. are stream generators, and e is an expression over i, j and possibly elements from other generators ..E..

  • [ e | i <- str, test, ..E.. ] <-> [ e | i <- filter(str,test), ..E..] where filter(str,test) denotes a stream generator resulting from the application of test on elements generated from str; filter(str,test) will be equivalent to str.filter(i -> test)
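
Applying the three rules mechanically gives runnable Java. The sketch below translates two comprehensions; the class ComprehensionDemo and its method names are hypothetical, and plain integers are used in place of Pair so that the example has no dependency on course-provided classes.

```java
import java.util.List;
import java.util.stream.Stream;

class ComprehensionDemo {
    // [ z + x + y | z <- [1,2], x <- [1,2,3,4], y <- [1,2,3] ]
    // Rule 2 twice (flatMap for z, then for x), rule 1 once (map for y).
    static List<Integer> sums() {
        return Stream.of(1, 2)
            .flatMap(z -> Stream.of(1, 2, 3, 4)
                .flatMap(x -> Stream.of(1, 2, 3)
                    .map(y -> z + x + y)))
            .toList();
    }

    // [ x * y | x <- [1,2,3], odd(x), y <- [x,..,3] ]
    // Rule 3 (filter on x), rule 2 (flatMap for x), rule 1 (map for y).
    static List<Integer> oddProducts() {
        return Stream.of(1, 2, 3)
            .filter(x -> x % 2 == 1)
            .flatMap(x -> Stream.iterate(x, y -> y <= 3, y -> y + 1)
                .map(y -> x * y))
            .toList();
    }

    public static void main(String[] args) {
        System.out.println(sums().size());  // prints 24
        System.out.println(oddProducts());  // prints [1, 2, 3, 9]
    }
}
```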

Correctness of streams

To ensure the correct execution of streams, one must obey the following usage rules:

  • Stream operations must not interfere with stream data. As long as we keep to our discipline of effect-free coding, this will not be an issue.

  • Stream operations should preferably be stateless. In other words, how an element is processed should not be dependent on neighbouring elements.

The latter is especially important when streams are processed in parallel using the parallel() operator. Let us adapt the earlier prime-number example to generate the first ten primes.

jshell> IntStream.iterate(2, x -> x + 1).
   ...> filter(x -> isPrime(x)).
   ...> limit(10).
   ...> peek(x -> System.out.println(x)).
   ...> forEach(x -> {})
2
3
5
7
11
13
17
19
23
29

Rather than output at the forEach terminal, we make use of peek to output each element as soon as it reaches the operation. Since the stream here is sequential, we would expect the stream elements to be processed one by one starting from 2.

Now we make the stream parallel.

jshell> IntStream.iterate(2, x -> x + 1).
   ...> parallel().
   ...> filter(x -> isPrime(x)).
   ...> limit(10).
   ...> peek(x -> System.out.println(x)).
   ...> forEach(x -> {})
17
23
19
11
13
2
3
5
7
29

Notice now that the stream elements are not processed in order, as multiple processors are available to process different parts of the stream. It should be noted that the elements still fall between 2 and 29. This is because limit is stateful. Try to perform the peek before limit and observe the output.

In addition, one needs to be mindful that performing a reduction on a parallel stream requires that the reduction be associative. For example, summing values is associative, i.e. ((1 + 2) + 3) is the same as (1 + (2 + 3)). Here is summation using a sequential stream.

jshell> IntStream.iterate(2, x -> x + 1).
   ...> limit(10).
   ...> reduce(0, (x, y) -> { System.out.println("Adding " + x + " and " + y); return x + y;})
Adding 0 and 2
Adding 2 and 3
Adding 5 and 4
Adding 9 and 5
Adding 14 and 6
Adding 20 and 7
Adding 27 and 8
Adding 35 and 9
Adding 44 and 10
Adding 54 and 11
$.. ==> 65

Here is the output when summing a parallel stream

jshell> IntStream.iterate(2, x -> x + 1).
   ...> parallel().
   ...> limit(10).
   ...> reduce(0, (x, y) -> { System.out.println("Adding " + x + " and " + y); return x + y;})
Adding 0 and 10
Adding 0 and 5
Adding 0 and 9
Adding 0 and 6
Adding 0 and 8
Adding 0 and 3
Adding 0 and 7
Adding 7 and 8
Adding 0 and 4
Adding 5 and 6
Adding 4 and 11
Adding 0 and 2
Adding 2 and 3
Adding 5 and 15
Adding 0 and 11
Adding 10 and 11
Adding 9 and 21
Adding 15 and 30
Adding 20 and 45
Adding 65 and 0
$.. ==> 65

No matter how the summing proceeds, we are guaranteed that the final result will always be the same. It is also interesting to note that 0 is added to more than one stream element, which provides further evidence that several processors start reduction at the same time with the same starting value. One has to make sure that the starting value provided does not give a wrong result when reduced in parallel; it should be an identity of the reduction, such as 0 for addition.
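
To see why the starting value matters, consider summing with a starting value of 10 rather than 0. The sketch below is illustrative only (the class IdentityDemo and its methods are hypothetical). Sequentially the 10 is added once; in parallel it is added once per chunk of the stream, and the number of chunks depends on the machine, so no fixed parallel result is claimed here.

```java
import java.util.stream.IntStream;

class IdentityDemo {
    static int sequentialSum(int start) {
        return IntStream.rangeClosed(1, 10)
            .reduce(start, (x, y) -> x + y);
    }

    static int parallelSum(int start) {
        return IntStream.rangeClosed(1, 10)
            .parallel()
            .reduce(start, (x, y) -> x + y);
    }

    public static void main(String[] args) {
        System.out.println(sequentialSum(0));  // prints 55
        System.out.println(parallelSum(0));    // prints 55
        System.out.println(sequentialSum(10)); // prints 65
        // 55 plus 10 for every chunk that was reduced separately;
        // the exact value varies with how the stream is split.
        System.out.println(parallelSum(10));
    }
}
```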

Now contrast addition with division which is a non-associative operation. First the sequential version.

jshell> DoubleStream.iterate(1.0, x -> x + 1).
   ...> limit(4).
   ...> reduce(24.0, (x, y) -> { System.out.println("Dividing " + x + " by " + y); return x / y;})
Dividing 24.0 by 1.0
Dividing 24.0 by 2.0
Dividing 12.0 by 3.0
Dividing 4.0 by 4.0
$.. ==> 1.0

Note that `((((24.0/1.0)/2.0)/3.0)/4.0)` gives 1.0. What if we parallelize the stream?

jshell> DoubleStream.iterate(1.0, x -> x + 1).
   ...> limit(4).parallel().
   ...> reduce(24.0, (x, y) -> { System.out.println("Dividing " + x + " by " + y); return x / y;})
Dividing 24.0 by 3.0
Dividing 24.0 by 4.0
Dividing 8.0 by 6.0
Dividing 24.0 by 2.0
Dividing 24.0 by 1.0
Dividing 24.0 by 12.0
Dividing 2.0 by 1.3333333333333333
Dividing 1.5 by 24.0
$.. ==> 0.0625

The result is no longer correct!
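
One possible way around this, offered here as a sketch rather than as the recommended solution, is to restate the computation in terms of an associative operation: dividing by 1, 2, 3 and 4 in turn is the same as dividing once by their product, and multiplication with starting value 1.0 can be reduced in parallel. The class DivisionDemo and the method divideAll are hypothetical names.

```java
import java.util.stream.DoubleStream;

class DivisionDemo {
    // Computes start / 1 / 2 / ... / n by first reducing the divisors
    // with multiplication, then dividing once at the end.
    static double divideAll(double start, int n) {
        double product = DoubleStream.iterate(1.0, x -> x + 1)
            .limit(n)
            .parallel()
            .reduce(1.0, (x, y) -> x * y); // 1.0 is the identity for *
        return start / product;
    }

    public static void main(String[] args) {
        System.out.println(divideAll(24.0, 4)); // prints 1.0
    }
}
```

The small whole-number divisors used here multiply exactly in double arithmetic; for arbitrary floating-point values, the grouping of operations can still introduce small rounding differences.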

Even though parallelizing a stream using a multi-core processor would suggest a linear speedup in computation, always keep in mind that there is an overhead in managing parallel tasks. As such, do not parallelize trivial tasks.
