SPRUCE Backlog - peshkira/c3po GitHub Wiki

Spruce Backlog

This is a simple page summarising the work I have done for the SPRUCE Award. Note that it covers only the days on which I worked on this project.

April 4:

Talk with the SCAPE partners (TUW/KEEPS/SB) to align the upcoming work and avoid duplicating effort. Discussion about object identifiers, metadata consolidation, further integration with SCOUT, an HBase backend, an abstracted backend interface, and more. Idea: create a ROADMAP for the tool that will be publicly visible in this repository for other developers to see.

April 15:

Designing the new persistence layer interface. Thinking about the new filtering convention and how the potential changes will impact the framework and the codebase.

April 16:

Designing the new Controller. Creating some graphs (which might be useful for the Dev Guide). Thinking about later integration with the refactored WebApp and new REST API. More work has to be done here.

April 17:

Adding the new persistence layer interface. Deprecating a lot of code. Removing some unnecessary files. Cleaning up the API project and documenting the new classes. Refactoring the model classes where necessary. Updating the Configurator so that the persistence layer is chosen via configuration.
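Configuration-driven backend selection might look roughly like this. This is a minimal sketch, not the actual c3po code: the property name `c3po.persistence`, the `PersistenceLayer` interface, and the class names are all assumptions for illustration.

```java
import java.util.Map;

// Hypothetical persistence abstraction, standing in for c3po's interface.
interface PersistenceLayer {
    String name();
}

class MongoPersistenceLayer implements PersistenceLayer {
    public String name() { return "mongo"; }
}

class InMemoryPersistenceLayer implements PersistenceLayer {
    public String name() { return "memory"; }
}

// A Configurator that picks the backend from configuration instead of
// hard-coding it, so new backends can be plugged in without touching callers.
public class Configurator {
    public static PersistenceLayer create(Map<String, String> config) {
        String backend = config.getOrDefault("c3po.persistence", "mongo");
        switch (backend) {
            case "mongo":  return new MongoPersistenceLayer();
            case "memory": return new InMemoryPersistenceLayer();
            default: throw new IllegalArgumentException("Unknown backend: " + backend);
        }
    }

    public static void main(String[] args) {
        PersistenceLayer layer = create(Map.of("c3po.persistence", "memory"));
        System.out.println(layer.name());
    }
}
```

The point of routing this through one factory is that the rest of the framework only ever sees the interface.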

April 18:

Starting implementation of the default persistence layer, which involved a lot of refactoring and moving code around.

April 22:

First simple implementation of the interface with Mongo. Implementing the filter serializer and a new filtering convention that each backend provider has to follow: all filters are combined with a logical AND, except when they apply to the same property, in which case a logical OR is used. Also adding some filter caching.
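The convention above (AND across properties, OR within the same property) can be sketched as a serializer that groups conditions by property. This is an illustrative reconstruction, not the actual c3po serializer; the `Condition` type and the string output format are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// One filter condition: a property name and a required value.
class Condition {
    final String property, value;
    Condition(String property, String value) { this.property = property; this.value = value; }
}

public class FilterSerializer {
    // Groups conditions by property: values on the same property are OR-ed,
    // the per-property groups are then AND-ed together.
    public static String serialize(List<Condition> conditions) {
        Map<String, List<String>> byProperty = new LinkedHashMap<>();
        for (Condition c : conditions) {
            byProperty.computeIfAbsent(c.property, k -> new ArrayList<>()).add(c.value);
        }
        List<String> groups = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : byProperty.entrySet()) {
            List<String> terms = new ArrayList<>();
            for (String v : e.getValue()) {
                terms.add(e.getKey() + "=" + v);
            }
            groups.add("(" + String.join(" OR ", terms) + ")");
        }
        return String.join(" AND ", groups);
    }

    public static void main(String[] args) {
        System.out.println(serialize(List.of(
            new Condition("format", "PDF"),
            new Condition("format", "HTML"),
            new Condition("valid", "true"))));
        // (format=PDF OR format=HTML) AND (valid=true)
    }
}
```

A Mongo backend would emit `$and`/`$or` query documents instead of strings, but the grouping logic is the same.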

April 24:

Fixing some failing tests, removing some old, unnecessary tests, and adding new ones. Starting the refactoring of the map/reduce framework: as not all backends will be able to support map/reduce, the current framework is being refactored into a Mongo-specific implementation. The numeric aggregations work. Next up: the histograms...

April 26:

Optimized the Mongo persistence layer a bit. Refactored the histogram retrieval. Now the results are also cached, which will make the Web App much snappier when it is reimplemented. Deprecating a lot of stuff. Writing Mongo persistence layer tests. Adding Javadocs. Bonus: we created a logo :)

April 27:

Finished the persistence layer abstraction (only the deprecated methods remain to be removed). Fixed bugs in the Mongo update method. Added new tests. Made the serializers more robust. Started redesigning the Controller.

April 29:

Started changing the adaptor interface. It is now much easier for a third-party developer to extend an adaptor: the framework does all the thread management, and the only thing one has to care about is parsing and adaptation to the internal model. Fixed some Controller issues (thread management). Added a new Consolidator thread pool. Refactoring the structure of the API project a bit (to match the new changes) and removing some old boilerplate code.
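The split described above, where the framework owns the threading and a third-party adaptor only implements the parsing, could be sketched like this. The names (`Adaptor`, `AdaptorRunner`) are illustrative, not the real c3po classes, and the "internal model" is reduced to a string for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// What a third-party developer implements: turn one raw record from a tool
// into the internal model. No thread handling required.
interface Adaptor {
    String parse(String rawRecord);
}

// The framework side: owns the thread pool and feeds records to the adaptor.
public class AdaptorRunner {
    public static List<String> run(Adaptor adaptor, List<String> rawRecords, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String raw : rawRecords) {
                futures.add(pool.submit(() -> adaptor.parse(raw)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    results.add(f.get());  // preserves submission order
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Adaptor upper = raw -> raw.toUpperCase();  // a trivial stand-in "parser"
        System.out.println(run(upper, List.of("fits", "tika"), 2));
    }
}
```

The adaptor author writes one pure function; concurrency stays entirely in the framework's hands.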

April 30:

Cleaned up the FITS adaptor, removing some features that should be handled by the abstract adaptor. Started some work on pre- and post-processing rules. Started cleaning up the Controller and preparing it for the new changes. Adding Javadocs.

May 2:

Refactoring the application configuration to make it clearer what the Controller expects and what it can obtain by itself. Adding some new processing rules and allowing them to be configured. Started work on the consolidator implementation (some parts are tricky; needs more work).
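One way to model configurable processing rules is a chain of hooks that run before and after an element is stored. A rough sketch, assuming a hypothetical `ProcessingRule` interface (the real c3po rule API may look different):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hook: pre-processing may transform or veto an element,
// post-processing can react after it has been handled.
interface ProcessingRule {
    String beforeStore(String element);       // return null to drop the element
    default void afterStore(String element) {}
}

public class RuleChain {
    private final List<ProcessingRule> rules = new ArrayList<>();

    // Which rules are added here is exactly what the configuration controls.
    public RuleChain add(ProcessingRule rule) { rules.add(rule); return this; }

    // Runs every pre-hook in order; a null result drops the element early.
    public String process(String element) {
        for (ProcessingRule rule : rules) {
            element = rule.beforeStore(element);
            if (element == null) return null;
        }
        for (ProcessingRule rule : rules) {
            rule.afterStore(element);
        }
        return element;
    }

    public static void main(String[] args) {
        RuleChain chain = new RuleChain()
            .add(e -> e.trim())                 // normalize whitespace
            .add(e -> e.isEmpty() ? null : e);  // drop empty elements
        System.out.println(chain.process("  report.pdf "));
        System.out.println(chain.process("   "));  // dropped -> null
    }
}
```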

May 3:

Finished the initial consolidator. Currently it consolidates only if the uid (file path) is the same. Partners from the SCAPE project have committed to think of (and potentially implement) different strategies, so that the consolidator can select among them based on configuration. Also fixed some minor bugs. Did some preliminary tests of the new Controller/Gatherer/Adaptor/Consolidator components. The new implementation shows about a 3x speedup during processing on a 40K-file collection. Will have to investigate whether this holds for larger collections...
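Consolidation on a shared uid, as described above, amounts to merging records that carry the same file path. A simplified sketch, assuming records are flat property maps and using one possible merge policy (first value wins) -- the actual merge strategy is exactly the part left open for configuration:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Consolidator {
    // Each input record is a (uid, properties) pair. Records sharing a uid
    // are merged; later values fill gaps but do not overwrite earlier ones.
    public static Map<String, Map<String, String>> consolidate(
            List<Map.Entry<String, Map<String, String>>> records) {
        Map<String, Map<String, String>> byUid = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> rec : records) {
            Map<String, String> merged =
                byUid.computeIfAbsent(rec.getKey(), k -> new LinkedHashMap<>());
            for (Map.Entry<String, String> prop : rec.getValue().entrySet()) {
                merged.putIfAbsent(prop.getKey(), prop.getValue());
            }
        }
        return byUid;
    }

    public static void main(String[] args) {
        // Two tool outputs for the same file collapse into one record.
        var records = List.of(
            Map.entry("/data/a.pdf", Map.of("format", "PDF")),
            Map.entry("/data/a.pdf", Map.of("pages", "12")));
        System.out.println(consolidate(records));
    }
}
```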

May 4:

May the 4th be with you! ;)

May 6:

Refactoring the profile generation to use the new persistence layer. It seems to work; it might need some minor improvements, but the first results are good. Experimenting with multithreading to make gathering more performant.

May 9:

Refactoring the sampling algorithms and enhancing the profile generator.

May 10:

Refactoring the CSV generator and the entire command line interface (switching to a new library that is better than Apache Commons CLI). The deprecated methods are now gone for good.

May 12:

Adding a 'SamplesCommand' to the CLI that lets the user output the identifiers of sample records to the console or to a file. One can select the number of files and which algorithm to use. Also adding a remove command that lets the user remove a whole collection from the command line.

May 13:

Optimising the gatherer process. It is now I/O bound: parsing a file takes a couple of milliseconds on average, but reading the streams from disk is the bottleneck. Parallelising the reading seems to make things worse because of the thread-switching overhead. Currently the streams for larger collections (govdocs) need between 5 and 15 milliseconds to be read; you can do the math to see how long processing govdocs will take. On an SSD it should be faster.
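To make "do the math" concrete: the govdocs corpus is on the order of a million files (a rough figure, used here only as an assumption), so at 5-15 ms per sequential stream read the I/O alone lands in the hours range. A back-of-the-envelope calculation:

```java
public class GathererEstimate {
    // Sequential read time, in hours, for n files at a per-file latency in ms.
    public static double hours(long files, double msPerFile) {
        return files * msPerFile / 1000.0 / 3600.0;
    }

    public static void main(String[] args) {
        long govdocs = 1_000_000;  // assumed rough size of the govdocs corpus
        System.out.printf("best case:  %.1f h%n", hours(govdocs, 5));   // ~1.4 h
        System.out.printf("worst case: %.1f h%n", hours(govdocs, 15));  // ~4.2 h
    }
}
```

Which is exactly why the bottleneck is the disk, not the parser.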

May 14:

Adding support for archive files during gathering. You can now gather data from zip, jar, tar, tgz, tar.gz, tbz2, gz, and bz2 archives. Note that this is slower than reading the data directly.
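Reading records out of an archive instead of the file system can be sketched with the JDK's built-in zip support. This only covers the zip/jar case; c3po presumably uses a broader library for tar/gz/bz2, so treat this as an illustration of the idea, not the actual implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ArchiveGatherer {
    // Lists the regular (non-directory) entries inside a zip stream, the way
    // a gatherer would enumerate records before handing them to an adaptor.
    public static List<String> listEntries(InputStream in) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    names.add(entry.getName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Builds a tiny in-memory zip with one entry, to keep the demo self-contained.
    static byte[] sampleZip(String name, byte[] content) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(content);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] archive = sampleZip("meta/a.xml", "<fits/>".getBytes());
        System.out.println(listEntries(new ByteArrayInputStream(archive)));
    }
}
```

The extra decompression step is also where the slowdown mentioned above comes from.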

May 15:

Adding Javadocs for all classes. Adding the Dev Guide. Some more performance tests.

May 16:

Two performance tests: flat folders are much faster than nested folder hierarchies.

May 17:

Skype call with the SCAPE project partners TUW/KEEPS/SB. Sorted out what was done. Collected items for the ROADMAP and future work. Made them promise to contribute :P. TUW will perform some tests of the new version.

May 20:

Release of version 0.4.0. With that, the work for the SPRUCE Award is finished. I will spend some time over the next months updating the web app. Meanwhile, I urge you to take a look at the code base and contribute.

Thanks to the SPRUCE Project!