ExtendedIterators - erhard-lab/gedi GitHub Wiki

Table of Contents

ExtendedIterators

Introduction

Iterators are a basic concept in java and have been introduced with Java 1.2 in 1998. They are interfaces with the following methods

boolean hasNext()
T next()

(I skipped the infamous remove method as well as forEachRemaining, as they are default methods as of Java 8). They basically represent one-direction cursors on more special concepts such as collections or arrays.

With Java 8, Oracle also introduced Spliterators and Streams, which allow for a straight-forward parallelization and came along with a, to most Java developers, novel style of handling collections or arrays, called fluent API, e.g.

myList
    .stream()
    .filter(s -> s.startsWith("c"))
    .map(String::toUpperCase)
    .sorted()
    .forEach(System.out::println);

While Spliterators are a great concept, they are tedious to implement and do not allow for operations like merging or multiplexing (see below). This is why I introduced ExtendedIterators for Gedi, which

  • are just good old Iterators (and can therefore be used with all methods that require Iterators)
  • provide a fluent API
  • lazy evaluation (i.e. only a terminal operation such as forEach or a next will cause the original iterator to advance)
  • allow for straight-forward parallelization
  • and many more useful operations

Before continuing reading, make sure you understood the concept of FunctionalInterfaces.

Implementation

ExtendedIterator is just a new interface (in the package gedi.util.function) that extends the iterator interface (what a surprise):

public interface ExtendedIterator<T> extends Iterator<T> {

	// all the magic is hidden here

}

Within the new interface, there are various default methods, i.e. methods that are implemented in the interface itself and may not be implemented by implementing classes, i.e. every class implementing ExtendedIterator only have to implement hasNext and next, just as for normal Iterators. Each of the default methods return in turn an ExtendedIterator, allowing for fluent programming:

myExtendedIterator
	.filter(s -> s.startsWith("c"))
	.map(String::toUpperCase)
	.sort()
	.forEachRemaining(System.out::println)

(Btw. thanks Oracle for making the last line inconsistent between Spliterators and Iterators. This gives me permission to also deviate from their naming scheme, e.g. why did they call it sorted and not just sort, just as they called it map and not mapped?)

Creating ExtendedIterators

The most important point, where ExtendedIterators are created in Gedi are the GenomicRegionStorage.ei(...) methods, as GenomicRegionStorage is the central interface in Gedi, and the ei methods just allow to iterate over all entries, all entries on a specific reference sequence, or intersecting a specific GenomicRegion.

As ExtendedIterator is not part of the Java API, Gedi has to provide a way to convert standard Java API concepts to ExtendedIterators. This is done via the class EI (which is ExtendedIterator for the lazy) with a variety of different static methods. There are several overloaded methods named wrap that allow to, well, wrap Iterables (basically everything from the collections framework including Lists and Sets), normal Iterators, Enumerations (the Java concept to iterate over things prior to the collections framework), arrays, slices of arrays, Spliterators and Matchers. You can also obtain an ExtendedIterator repeating either a constant object n times, or objects returned by a Supplier or IntFunction. There are also methods to iterate over the lines from a text file, all overlapping substring of a given string, or the parts after splitting a string.

What can you do with ExtendedIterators

For the examples here, assume that ei is an object created via GenomicRegionStorage.ei from a storage containing Transcript objects (just as the one returned by Genomic.getTranscripts). I.e. this can be achieved by

ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei = Genomic.get("<id>").getTranscripts().ei();

As you might know, a GenomicRegionStorage contains GenomicRegion objects for several ReferenceSequences (usually chromosomes) along with a specific data object (Transcripts here). These three things are bundeled in ImmutableReferenceGenomicRegion objects, which are all returned by the ei() method.

Important: ExtendedIterator was not written from ground up, but grew. So several methods may be renamed or removed in the future, when I find time to structure it a better way. However, all methods described here will remain

Use it in a for loop

One of the ugliest things in java probably is the idiomatic while loop to handle iterators:

while (ei.hasNext()) {
	T element = ei.next();
	// do something

}

Personally, I find the following much nicer:

for (T element : ei.loop()) {
	// do something
}

Basic operations

map

Create a new iterator where each element is mapped by a Function from the original iterator:

ei.map(r->r.getData());

This is an iterator over all Transcript objects.

filter

Create a new iterator of all elements that satisfy a condition:

ei.filter(r->r.getRegion().getNumParts()>1);

This is an iterator over all spliced transcripts.

removeNulls

Create a new iterator with all null elements removed.

HashMap<String,String> symbols = readIdToSymbolMapping();
ei.map(r->symbols.get(r.getData().getGeneId())).removeNulls();

As the Map.get method returns null for all transcripts without any gene symbol, it is necessary to remove those.

head, skip

Return a new iterator over the first n elements or skipping the first n elements.

ei.head(100);
ei.skip(100);

sort, unique

Return a new iterator where all elements are sorted, or where elements occuring a second or third time are removed. As there may be extremely many elements, sort has the ability to serialize elements to disk (via ORM) to keep the memory footprint low. There are also various overloaded sort methods to use BinarySeralizer instead or to specify the order via a Comparator. unique either assumes that elements are sorted (by specifying the parameter true), in which case it only has to remember the last element, or that elements are not sorted (by false). In this case, it has to remember all previous elements. Use this with caution, as they are stored in memory.

ei.sort().unique(true);

Putting these together

ei.filter(r->r.getData().isCoding()).
  .map(r->symbols.get(r.getData().getGeneId()))
  .removeNulls()
  .sort()
  .unique(true)
  .head(10);

The first ten (lexicographically) gene symbols having an annotated coding transcript.

folding

Fold several subsequent elements into a single element (they must be sorted to do this; elements that are equal according to the comparator are folded), for instance to create an iterator of ArrayLists containing the transcripts for each gene:

Comparator<> comp = (a,b)->a.getData().getGeneId().compareTo(b.getData().getGeneId());
ei.sort(comp).fold(comp,()->new ArrayList<>(),(r,l)->{l.add(r); return r;});

There is also another way to do that, that is for instance handy to read in files containing specific blocks of lines (such as fasta files):

EI.lines("test.fasta).block(l->l.startsWith(">"));

This will return an iterator of ArrayLists, each containing the lines of a fasta entry. If elements are supposed to be handeled differently, there is an overloaded method, where you have to specify a Supplier for the elements of the resulting iterator (which is above ()->new ArrayList() by default) and a BiFunction to merge the elements from the original iterator in the resulting iterator (which above is (line,list)->{list.add(line); return list;}).

unfolding

It may come as a surprise, but unfolding is just the opposite of folding.

Comparator<> comp = (a,b)->a.getData().getGeneId().compareTo(b.getData().getGeneId());
ei.sort(comp).fold(comp,()->new ArrayList<>(),(r,l)->{l.add(r); return r;}).filter(l->l.size()==1).unfold(l->l.iterator());

This just the transcripts of genes with a single transcript. Note, that the function in unfold just has to return an iterator.

This is of course not only handy to unfold something that has been folded before:

ei.unfold(r->EI.wrap(r.getRegion().getParts()));

This iterates over all exons of all transcripts (getParts returns a List which is wrapped into an iterator).

Terminal operations

In addition to the obvious forEachRemaining-Java API method, which is IMHO just an uglier variant of the above mentioned loop method, if you do not have a single defined method for it (as System.out::println), there are several ways how to terminate iterators:

Output

Write everything to stdout, a file, a Writer, a LineWriter or an OutputStream

ei.print();
ei.print("filename");
ei.print(wr_or_stream);

To Java types

Concatenate everything into a string:

ei.concat(",");
ei.concat(",",r->r.toLocationString());

Fill an array:

ei.toArray(); // Object[]
ei.toArray(ImmutableReferenceGenomicRegion.class); // you get the idea

To an existing collection or an ArrayList:

ei.toCollection(set);
ei.list();

Count the elements

ei.count();

Indexing

Puts all elements into a HashMap, thereby creating an index on them. There are several overloaded methods, differing mainly in the way, how ambiguities are resolved.

The most basic version can be applied like this:

index = ei.index(r->r.getData().getGeneId());

This creates an index on the gene identifiers, e.g. to get the ReferenceGenomicRegion object by invoking index.get("ENST00000324221"). If you are interested not in the ReferenceGenomicRegion itself but in a different aspect, you can specify the values of the HashMap:

index = ei.index(r->r.getData().getGeneId(), r->r.getData().getTranscriptId());

This gives an index from gene identifiers to transcript identifiers.

All other overloading are available for both of these versions (with or without specifying the value), and handle cases where the key is not unique (i.e. when there are multiple transcripts with the same gene identifier in the example, which is the case for most human genes). The above versions would throw a RuntimeException in such cases.

// just overwrite, in the index, there will be the last transcript of all transcripts with the same identifier
index = ei.indexOverwrite(r->r.getData().getGeneId(), r->r.getData().getTranscriptId());

// put the smallest (according to the given `Comparator`) into the index (in this case the lexicographically smallest transcript identifier)
index = ei.indexSmallest(r->r.getData().getGeneId(), r->r.getData().getTranscriptId(), (a,b)->a.compareTo(b));

// create a new key if it is already present, here append the number of failed attempts to insert the key
index = ei.indexAdapt(r->r.getData().getGeneId(), r->r.getData().getTranscriptId(), (n, rgr, geneid)->geneid+"_"+n);

// create a mapping to `ArrayList`s of transcript identifiers
index = ei.indexMulti(r->r.getData().getGeneId(), r->r.getData().getTranscriptId());

// create a mapping to `HashSet`s of transcript identifiers
index = ei.indexMulti(r->r.getData().getGeneId(), r->r.getData().getTranscriptId(),geneid->new HashSet<>());

// create a mapping to concatenated transcript identifiers
index = ei.indexMultiCombine(r->r.getData().getGeneId(), r->r.getData().getTranscriptId(),(geneid,transcriptid,presentids)->presentids+","+transcriptid);

All these overloadings are also available without specifying the value to index (here the transcript id).

Reducing

Most of the above methods are just special cases of reducing the elements. The reduce methods can be used to simulate those by specifying an object and a BiFunction, or just a BinaryOperator (it is also very similar to folding, see above, with the difference that all elements are folded):

// equivalent to ei.list();
ei.reduce(new ArrayList<>(),(r,list)->{list.add(r); return list;});

// to something completely different:
ei.map(r->r.getRegion().getTotalLength()).reduce(0,(a,b)->a+b);

// this is equivalent:
ei.map(r->r.getRegion().getTotalLength()).reduce((a,b)->a+b);

Create a single Iterator from multiple Iterators

Chain

The most basic way to combine iterators is to chain them:

ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei1 = Genomic.get("<id1>").getTranscripts().ei();
ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei2 = Genomic.get("<id2>").getTranscripts().ei();
ei1.chain(ei2);

This creates a new iterator that first goes over all transcripts in genome id1 and then over all in genome id2.

Fuse

Fusing makes sense when you have two or more iterators where you exactly know that the have the same number of elements and you want to go over them in parallel, e.g. you have two fastq files for forward and reverse reads, have already blocked them (let's call those iterators fw and rv) and now want to have an iterator over the read pairs:

fw.fuse(rv).map(arr->arr[0].get(1)+arr[1].get(1));

The element from fw will be at slot 0 of the array (here called arr), the one from rv will be in slot 1. Thus, this piece of code will return an iterator that goes over the concatenated read pair sequences (as the sequence is in the second line of each fastq entry).

Alternating

Very similar to fusing iterators is to alternate them:

fw.alternating(rv);

This will return an iterator that goes over forward and backward reads in an alternating fashion.

Merge and Parallel

You can think of merge and parallel as the more intelligent twins of chain and fuse, respectively: They also receive a Comparator and put everything in order.

ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei1 = Genomic.get("<id>").getTranscripts().ei("1+");
ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei2 = Genomic.get("<id>").getTranscripts().ei("1-");
ei1.merge((a,b)->a.getRegion().compareTo(b.getRegion()),ei2);

This will return an iterator that goes over all transcripts from chromosome 1 ordered according to their location (which is only possible as ei1 and ei2 are already sorted according to location).

ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei1 = Genomic.get("h.ens75").getTranscripts().ei("1+");
ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei2 = Genomic.get("h.ens86").getTranscripts().ei("1+");
ei1.parallel((a,b)->a.getRegion().compareTo(b.getRegion()),ImmutableReferenceGenomicRegion.class,ei2);

This will return an iterator over pairs of ImmutableReferenceGenomicRegions (i.e. arrays of length 2). Each pair contains two transcripts, if the transcript is available in the genome annotations h.ens75 and h.ens86, and only one transcript (at slot 0 or 1), if it is only present in one of those. This is only possible because transcripts are already sorted according to the comparator, and the comparator never returns 0 (as transcript regions are by definition of GenomicRegionStorage unique).

SideEffects

All these methods return new iterator that basically does the same as the original one, but performs additional operations.

progress

Uses a Progress object to show the progress of the iteration (this is from the Gedi API, currently there is only a ConsoleProgress implementation that just writes the progress to stdout or stderr every 100ms).

// return a new iterator that writes the progress into a ConsoleProgress object
ei.progress();

// return a new iterator that writes the progress into a ConsoleProgress object
ei.progress();

// return a new iterator that writes the progress into a ConsoleProgress object that is configured for n the total number of elements
ei.progress(n);

// return a new iterator that writes the progress and current gene id into a ConsoleProgress object configured for output on stderr
ei.progress(new ConsoleProgress(System.err),n,r->r.getData().getGeneId());

initAction,endAction

Perform a specific action at the beginning (i.e. right before the first element is returned by next) or end of the iteration (i.e. when hasNext returns false for the first time).

ei.initAction(r->log.info("Started iteration with element "+r").endAction(()->log.info("Finished iteration"));

SideEffect

The basic sideEffect iterator is able to do anything with each element (using a Consumer) before going over it.

ei.sideEffect(r->log.log(Level.FINEST,"Encountered "+r));

This writes every element encountered to the log (which I strongly advice against doing it, as the string concatenation is executed everytime, independent on whether FINEST elements are configured for logging).

Parallelization

The coolest feature of ExtendedIterators is that they allow for straight-forward parallelization:

ei.parallelized(eit->eit.map(r->heavyOperation(r)));

This will basically create blocks of elements and process each block in a separate Thread. For each block, eit is an extended iterator over the elements of this block, and the function has to return a new extended iterator. After parallelized, the extended iterators of all blocks are unfolded. Of note, this is the only operation that already does some work when it is called (it fills an internal queue, creates the blocks, executes the threads and waits until its output queue is drained by next. Also, a proper ordering is not guaranteed!

Caveat: You have to know what parallelized does, especially, when you want to fold within the parallelization (as elements that you wanted to fold may be distributed to different threads).

There are also overloaded methods that allow to specify the number of threads (default: number of processors) and the block size (default: 1024). Generally, the less memory each element needs and the quicker each processing is, the greater the block size should be.

In addition, there is a parallelized side effect method:

ei.parallelizedSideEffect(r->heavyOperation(r),1024);

The side effect is executed on a separate thread. Naturally, if the side effect takes more time than the original iteration, elements will accumulate that have not been processed by the side effect. The original thread will block, when more than 1024 elements await processing by the heavyOperation.

⚠️ **GitHub.com Fallback** ⚠️