ExtendedIterators - erhard-lab/gedi GitHub Wiki
- ExtendedIterators
- Introduction
- Implementation
- Creating ExtendedIterators
- What can you do with ExtendedIterators
Iterators are a basic concept in java and have been introduced with Java 1.2 in 1998. They are interfaces with the following methods
boolean hasNext()
T next()
(I skipped the infamous remove
method as well as forEachRemaining
, as they are default methods as of Java 8). They basically represent one-direction cursors on more special concepts such as collections or arrays.
With Java 8, Oracle also introduced Spliterators and Streams, which allow for a straight-forward parallelization and came along with a, to most Java developers, novel style of handling collections or arrays, called fluent API, e.g.
myList
.stream()
.filter(s -> s.startsWith("c"))
.map(String::toUpperCase)
.sorted()
.forEach(System.out::println);
While Spliterators are a great concept, they are tedious to implement and do not allow for operations like merging or multiplexing (see below). This is why I introduced ExtendedIterators for Gedi, which
- are just good old Iterators (and can therefore be used with all methods that require Iterators)
- provide a fluent API
- lazy evaluation (i.e. only a terminal operation such as
forEach
or anext
will cause the original iterator to advance) - allow for straight-forward parallelization
- and many more useful operations
Before continuing reading, make sure you understood the concept of FunctionalInterfaces.
ExtendedIterator is just a new interface (in the package gedi.util.function
) that extends the iterator interface (what a surprise):
public interface ExtendedIterator<T> extends Iterator<T> {
// all the magic is hidden here
}
Within the new interface, there are various default
methods, i.e. methods that are implemented in the interface itself and may not be implemented by implementing classes, i.e. every class implementing ExtendedIterator
only have to implement hasNext
and next
, just as for normal Iterator
s. Each of the default
methods return in turn an ExtendedIterator
, allowing for fluent programming:
myExtendedIterator
.filter(s -> s.startsWith("c"))
.map(String::toUpperCase)
.sort()
.forEachRemaining(System.out::println)
(Btw. thanks Oracle for making the last line inconsistent between Spliterators and Iterators. This gives me permission to also deviate from their naming scheme, e.g. why did they call it sorted
and not just sort
, just as they called it map
and not mapped
?)
The most important point, where ExtendedIterators are created in Gedi are the GenomicRegionStorage.ei(...)
methods, as GenomicRegionStorage is the central interface in Gedi, and the ei methods just allow to iterate over all entries, all entries on a specific reference sequence, or intersecting a specific GenomicRegion.
As ExtendedIterator
is not part of the Java API, Gedi has to provide a way to convert standard Java API concepts to ExtendedIterator
s. This is done via the class EI
(which is ExtendedIterator for the lazy) with a variety of different static methods. There are several overloaded methods named wrap
that allow to, well, wrap Iterable
s (basically everything from the collections framework including List
s and Set
s), normal Iterator
s, Enumeration
s (the Java concept to iterate over things prior to the collections framework), arrays, slices of arrays, Spliterator
s and Matcher
s. You can also obtain an ExtendedIterator repeat
ing either a constant object n times, or objects returned by a Supplier
or IntFunction
. There are also methods to iterate over the lines from a text file, all overlapping substring of a given string, or the parts after splitting a string.
For the examples here, assume that ei
is an object created via GenomicRegionStorage.ei
from a storage containing Transcript
objects (just as the one returned by Genomic.getTranscripts
). I.e. this can be achieved by
ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei = Genomic.get("<id>").getTranscripts().ei();
As you might know, a GenomicRegionStorage contains GenomicRegion objects for several ReferenceSequences
(usually chromosomes) along with a specific data object (Transcript
s here). These three things are bundeled in ImmutableReferenceGenomicRegion objects, which are all returned by the ei()
method.
Important: ExtendedIterator was not written from ground up, but grew. So several methods may be renamed or removed in the future, when I find time to structure it a better way. However, all methods described here will remain
One of the ugliest things in java probably is the idiomatic while
loop to handle iterators:
while (ei.hasNext()) {
T element = ei.next();
// do something
}
Personally, I find the following much nicer:
for (T element : ei.loop()) {
// do something
}
Create a new iterator where each element is mapped by a Function
from the original iterator:
ei.map(r->r.getData());
This is an iterator over all Transcript
objects.
Create a new iterator of all elements that satisfy a condition:
ei.filter(r->r.getRegion().getNumParts()>1);
This is an iterator over all spliced transcripts.
Create a new iterator with all null elements removed.
HashMap<String,String> symbols = readIdToSymbolMapping();
ei.map(r->symbols.get(r.getData().getGeneId())).removeNulls();
As the Map.get
method returns null for all transcripts without any gene symbol, it is necessary to remove those.
Return a new iterator over the first n elements or skipping the first n elements.
ei.head(100);
ei.skip(100);
Return a new iterator where all elements are sorted, or where elements occuring a second or third time are removed. As there may be extremely many elements, sort
has the ability to serialize elements to disk (via ORM) to keep the memory footprint low. There are also various overloaded sort methods to use BinarySeralizer instead or to specify the order via a Comparator
. unique
either assumes that elements are sorted (by specifying the parameter true
), in which case it only has to remember the last element, or that elements are not sorted (by false
). In this case, it has to remember all previous elements. Use this with caution, as they are stored in memory.
ei.sort().unique(true);
ei.filter(r->r.getData().isCoding()).
.map(r->symbols.get(r.getData().getGeneId()))
.removeNulls()
.sort()
.unique(true)
.head(10);
The first ten (lexicographically) gene symbols having an annotated coding transcript.
Fold several subsequent elements into a single element (they must be sorted to do this; elements that are equal according to the comparator are folded), for instance to create an iterator of ArrayList
s containing the transcripts for each gene:
Comparator<> comp = (a,b)->a.getData().getGeneId().compareTo(b.getData().getGeneId());
ei.sort(comp).fold(comp,()->new ArrayList<>(),(r,l)->{l.add(r); return r;});
There is also another way to do that, that is for instance handy to read in files containing specific blocks of lines (such as fasta files):
EI.lines("test.fasta).block(l->l.startsWith(">"));
This will return an iterator of ArrayList
s, each containing the lines of a fasta entry. If elements are supposed to be handeled differently, there is an overloaded method, where you have to specify a Supplier
for the elements of the resulting iterator (which is above ()->new ArrayList()
by default) and a BiFunction
to merge the elements from the original iterator in the resulting iterator (which above is (line,list)->{list.add(line); return list;}
).
It may come as a surprise, but unfolding is just the opposite of folding.
Comparator<> comp = (a,b)->a.getData().getGeneId().compareTo(b.getData().getGeneId());
ei.sort(comp).fold(comp,()->new ArrayList<>(),(r,l)->{l.add(r); return r;}).filter(l->l.size()==1).unfold(l->l.iterator());
This just the transcripts of genes with a single transcript. Note, that the function in unfold just has to return an iterator.
This is of course not only handy to unfold something that has been folded before:
ei.unfold(r->EI.wrap(r.getRegion().getParts()));
This iterates over all exons of all transcripts (getParts
returns a List
which is wrapped into an iterator).
In addition to the obvious forEachRemaining
-Java API method, which is IMHO just an uglier variant of the above mentioned loop
method, if you do not have a single defined method for it (as System.out::println
), there are several ways how to terminate iterators:
Write everything to stdout, a file, a Writer
, a LineWriter or an OutputStream
ei.print();
ei.print("filename");
ei.print(wr_or_stream);
Concatenate everything into a string:
ei.concat(",");
ei.concat(",",r->r.toLocationString());
Fill an array:
ei.toArray(); // Object[]
ei.toArray(ImmutableReferenceGenomicRegion.class); // you get the idea
To an existing collection or an ArrayList
:
ei.toCollection(set);
ei.list();
Count the elements
ei.count();
Puts all elements into a HashMap
, thereby creating an index on them. There are several overloaded methods, differing mainly in the way, how ambiguities are resolved.
The most basic version can be applied like this:
index = ei.index(r->r.getData().getGeneId());
This creates an index on the gene identifiers, e.g. to get the ReferenceGenomicRegion
object by invoking index.get("ENST00000324221")
. If you are interested not in the ReferenceGenomicRegion
itself but in a different aspect, you can specify the values of the HashMap
:
index = ei.index(r->r.getData().getGeneId(), r->r.getData().getTranscriptId());
This gives an index from gene identifiers to transcript identifiers.
All other overloading are available for both of these versions (with or without specifying the value), and handle cases where the key is not unique (i.e. when there are multiple transcripts with the same gene identifier in the example, which is the case for most human genes). The above versions would throw a RuntimeException
in such cases.
// just overwrite, in the index, there will be the last transcript of all transcripts with the same identifier
index = ei.indexOverwrite(r->r.getData().getGeneId(), r->r.getData().getTranscriptId());
// put the smallest (according to the given `Comparator`) into the index (in this case the lexicographically smallest transcript identifier)
index = ei.indexSmallest(r->r.getData().getGeneId(), r->r.getData().getTranscriptId(), (a,b)->a.compareTo(b));
// create a new key if it is already present, here append the number of failed attempts to insert the key
index = ei.indexAdapt(r->r.getData().getGeneId(), r->r.getData().getTranscriptId(), (n, rgr, geneid)->geneid+"_"+n);
// create a mapping to `ArrayList`s of transcript identifiers
index = ei.indexMulti(r->r.getData().getGeneId(), r->r.getData().getTranscriptId());
// create a mapping to `HashSet`s of transcript identifiers
index = ei.indexMulti(r->r.getData().getGeneId(), r->r.getData().getTranscriptId(),geneid->new HashSet<>());
// create a mapping to concatenated transcript identifiers
index = ei.indexMultiCombine(r->r.getData().getGeneId(), r->r.getData().getTranscriptId(),(geneid,transcriptid,presentids)->presentids+","+transcriptid);
All these overloadings are also available without specifying the value to index (here the transcript id).
Most of the above methods are just special cases of reducing the elements. The reduce
methods can be used to simulate those by specifying an object and a BiFunction
, or just a BinaryOperator
(it is also very similar to folding, see above, with the difference that all elements are folded):
// equivalent to ei.list();
ei.reduce(new ArrayList<>(),(r,list)->{list.add(r); return list;});
// to something completely different:
ei.map(r->r.getRegion().getTotalLength()).reduce(0,(a,b)->a+b);
// this is equivalent:
ei.map(r->r.getRegion().getTotalLength()).reduce((a,b)->a+b);
The most basic way to combine iterators is to chain them:
ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei1 = Genomic.get("<id1>").getTranscripts().ei();
ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei2 = Genomic.get("<id2>").getTranscripts().ei();
ei1.chain(ei2);
This creates a new iterator that first goes over all transcripts in genome id1 and then over all in genome id2.
Fusing makes sense when you have two or more iterators where you exactly know that the have the same number of elements and you want to go over them in parallel, e.g. you have two fastq files for forward and reverse reads, have already block
ed them (let's call those iterators fw
and rv
) and now want to have an iterator over the read pairs:
fw.fuse(rv).map(arr->arr[0].get(1)+arr[1].get(1));
The element from fw
will be at slot 0 of the array (here called arr
), the one from rv
will be in slot 1. Thus, this piece of code will return an iterator that goes over the concatenated read pair sequences (as the sequence is in the second line of each fastq entry).
Very similar to fusing iterators is to alternate them:
fw.alternating(rv);
This will return an iterator that goes over forward and backward reads in an alternating fashion.
You can think of merge
and parallel
as the more intelligent twins of chain
and fuse
, respectively: They also receive a Comparator
and put everything in order.
ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei1 = Genomic.get("<id>").getTranscripts().ei("1+");
ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei2 = Genomic.get("<id>").getTranscripts().ei("1-");
ei1.merge((a,b)->a.getRegion().compareTo(b.getRegion()),ei2);
This will return an iterator that goes over all transcripts from chromosome 1 ordered according to their location (which is only possible as ei1
and ei2
are already sorted according to location).
ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei1 = Genomic.get("h.ens75").getTranscripts().ei("1+");
ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei2 = Genomic.get("h.ens86").getTranscripts().ei("1+");
ei1.parallel((a,b)->a.getRegion().compareTo(b.getRegion()),ImmutableReferenceGenomicRegion.class,ei2);
This will return an iterator over pairs of ImmutableReferenceGenomicRegion
s (i.e. arrays of length 2). Each pair contains two transcripts, if the transcript is available in the genome annotations h.ens75 and h.ens86, and only one transcript (at slot 0 or 1), if it is only present in one of those. This is only possible because transcripts are already sorted according to the comparator, and the comparator never returns 0 (as transcript regions are by definition of GenomicRegionStorage
unique).
All these methods return new iterator that basically does the same as the original one, but performs additional operations.
Uses a Progress
object to show the progress of the iteration (this is from the Gedi API, currently there is only a ConsoleProgress
implementation that just writes the progress to stdout or stderr every 100ms).
// return a new iterator that writes the progress into a ConsoleProgress object
ei.progress();
// return a new iterator that writes the progress into a ConsoleProgress object
ei.progress();
// return a new iterator that writes the progress into a ConsoleProgress object that is configured for n the total number of elements
ei.progress(n);
// return a new iterator that writes the progress and current gene id into a ConsoleProgress object configured for output on stderr
ei.progress(new ConsoleProgress(System.err),n,r->r.getData().getGeneId());
Perform a specific action at the beginning (i.e. right before the first element is returned by next
) or end of the iteration (i.e. when hasNext
returns false for the first time).
ei.initAction(r->log.info("Started iteration with element "+r").endAction(()->log.info("Finished iteration"));
The basic sideEffect
iterator is able to do anything with each element (using a Consumer
) before going over it.
ei.sideEffect(r->log.log(Level.FINEST,"Encountered "+r));
This writes every element encountered to the log (which I strongly advice against doing it, as the string concatenation is executed everytime, independent on whether FINEST elements are configured for logging).
The coolest feature of ExtendedIterators is that they allow for straight-forward parallelization:
ei.parallelized(eit->eit.map(r->heavyOperation(r)));
This will basically create blocks of elements and process each block in a separate Thread
. For each block, eit
is an extended iterator over the elements of this block, and the function has to return a new extended iterator. After parallelized, the extended iterators of all blocks are unfolded. Of note, this is the only operation that already does some work when it is called (it fills an internal queue, creates the blocks, executes the threads and waits until its output queue is drained by next
. Also, a proper ordering is not guaranteed!
Caveat: You have to know what parallelized
does, especially, when you want to fold
within the parallelization (as elements that you wanted to fold may be distributed to different threads).
There are also overloaded methods that allow to specify the number of threads (default: number of processors) and the block size (default: 1024). Generally, the less memory each element needs and the quicker each processing is, the greater the block size should be.
In addition, there is a parallelized side effect method:
ei.parallelizedSideEffect(r->heavyOperation(r),1024);
The side effect is executed on a separate thread. Naturally, if the side effect takes more time than the original iteration, elements will accumulate that have not been processed by the side effect. The original thread will block, when more than 1024 elements await processing by the heavyOperation
.