Home - janpolowinski/dswarm-documentation GitHub Wiki

d:swarm is a data management platform that can be used for the lossless transformation of data from heterogeneous sources into a flexible (elastic) data model. This data model can serve as a single source for providing Linked Open Data (LOD).

d:swarm is a middleware solution. It forms the basis of all data management processes in a library or any other (cultural) institution dedicated to the handling of data and metadata. Structurally, d:swarm sits between existing data management systems (e.g. Integrated Library Systems) and existing front end applications (e.g. the library catalogue or discovery system).

http://www.dswarm.org/wp-content/uploads/2015/04/dswarm-demo_2015-04-14.png

Finally, d:swarm is an ETL tool with a GUI for non-programmers. Librarians do not need to write scripts, but can create complex transformations by dragging and dropping functions from a function library and configuring them in the d:swarm BackOffice. Following the concept of community sharing, transformations, mappings and almost any other artifact in d:swarm that could be helpful to others are designed for reuse.

d:swarm is realized as a web application that runs in all modern web browsers. The current release of our web application is available at http://demo.dswarm.org. If you want to participate in the tests, drop us a note, and we will gladly add you to the group of testers. We are looking forward to your feedback, ideas, opinions and contributions on our mailing list or issue tracker.

What is the d:swarm idea?

Start by watching these two presentations (#1, #2), which summarize the motivation and goals on an abstract level.

What can I do with d:swarm today?

With the current d:swarm implementation you can ...

import, set up and configure data resources
create projects, define mappings, transformations and filters
transform data
export data in RDF or XML (e.g., for feeding Solr indices).

img/dswarm-workflow-abstract.png
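The four steps above can be pictured with a small, purely illustrative Python sketch. It does not use d:swarm or its API; the records, the `MAPPING` table and the helper names (`keep`, `transform`, `export_xml`) are assumptions made up for this example. It only shows the general shape of an import, map, filter, transform and export pipeline.

```python
import xml.etree.ElementTree as ET

# Hypothetical input records, standing in for an imported data resource.
records = [
    {"titel": "Faust", "autor": "Goethe, Johann Wolfgang von", "jahr": "1808"},
    {"titel": "Effi Briest", "autor": "Fontane, Theodor", "jahr": "1895"},
]

# A mapping: target field -> (source field, transformation function).
MAPPING = {
    "title": ("titel", str.strip),
    "creator": ("autor", lambda value: value.split(",")[0].strip()),
    "year": ("jahr", int),
}


def keep(record):
    """Filter: only keep records published before 1850."""
    return int(record["jahr"]) < 1850


def transform(record, mapping):
    """Apply every mapping entry to one source record."""
    return {
        target: function(record[source])
        for target, (source, function) in mapping.items()
    }


def export_xml(rows):
    """Serialize the transformed records as a simple XML document."""
    root = ET.Element("records")
    for row in rows:
        element = ET.SubElement(root, "record")
        for key, value in row.items():
            ET.SubElement(element, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


transformed = [transform(record, MAPPING) for record in records if keep(record)]
xml_out = export_xml(transformed)
print(xml_out)
```

In d:swarm itself, the mapping and the filter are configured graphically in the BackOffice rather than written as code.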

Configuring resources and creating mapping projects can be done with the d:swarm BackOffice web application. See our user guide for a brief manual on how to use the d:swarm BackOffice. While you can transform example data directly with the BackOffice, batch-processing large amounts of data can be done with the Task Processing Unit for d:swarm (TPU), initially developed by UB Dortmund. When processing data with d:swarm, you can choose between two options: the Streaming version and the DataHub version.

d:swarm Streaming Version

The d:swarm Streaming version offers fast processing of large amounts of data and is sufficient for many scenarios. You can already use it today while work on the full DataHub version of d:swarm continues. Unlike the DataHub version, it does not support versioning or archiving. See how SLUB Dresden employs d:swarm for transforming and integrating bibliographic data sources. (Currently, streaming export is implemented for XML; RDF will be added on demand.)

d:swarm DataHub Version

Archiving versions of the transformed data is only possible with the DataHub version of d:swarm, which is also the basis for upcoming functionality such as deduplication, FRBRization and other data quality improvements. While many steps in this direction have been taken, challenges remain with respect to scalability for very large datasets. See this blog post.

img/dswarm-usage-variants.png

... and behind the scenes

As shown below, the overall architecture consists of three major parts: the BackOffice web application, the TPU, and the back end. The back end, in turn, consists of three modules:

a controller module that controls the program flow and provides an HTTP API
a converter that encapsulates Metafacture to transform data
and a persistence layer to access the metadata repository (currently a relational database, MySQL) and the data hub (currently a graph database, Neo4j).

img/architecture.png
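To make the division of labour between the three back end modules more concrete, here is a minimal Python sketch. It is not taken from the d:swarm code base: the class names, method names and the in-memory dictionaries are hypothetical stand-ins, and it assumes (for illustration only) that mappings live in the metadata repository while records live in the data hub.

```python
class Converter:
    """Stand-in for the converter module (which wraps Metafacture in d:swarm)."""

    def transform(self, records, mapping):
        # Here a mapping is just a rename table: source field -> target field.
        return [
            {target: record[source] for source, target in mapping.items()}
            for record in records
        ]


class Persistence:
    """In-memory stand-in for the persistence layer."""

    def __init__(self):
        self.metadata_repository = {}  # a relational database in d:swarm
        self.data_hub = {}             # a graph database in d:swarm

    def save_mapping(self, project_id, mapping):
        self.metadata_repository[project_id] = mapping

    def load_mapping(self, project_id):
        return self.metadata_repository[project_id]

    def save_records(self, resource_id, records):
        self.data_hub[resource_id] = records

    def load_records(self, resource_id):
        return self.data_hub[resource_id]


class Controller:
    """Controls the program flow and delegates to the other two modules."""

    def __init__(self, converter, persistence):
        self.converter = converter
        self.persistence = persistence

    def run_task(self, project_id, resource_id, output_id):
        mapping = self.persistence.load_mapping(project_id)
        records = self.persistence.load_records(resource_id)
        result = self.converter.transform(records, mapping)
        self.persistence.save_records(output_id, result)
        return result


persistence = Persistence()
persistence.save_mapping("demo-project", {"titel": "title"})
persistence.save_records("input", [{"titel": "Faust"}])

controller = Controller(Converter(), persistence)
result = controller.run_task("demo-project", "input", "output")
print(result)
```

In the real system the controller is reached over HTTP rather than through direct method calls.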

Users, e.g., system librarians, usually interact with the BackOffice web application. Just like the TPU, which batch-processes ingest, transformation and export tasks, the BackOffice communicates with the d:swarm back end via the controller's HTTP API. The HTTP API is documented with Swagger and can therefore be explored via the Swagger UI, which is a convenient way to discover the back end's functionality.
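Besides the Swagger UI, a Swagger description can also be inspected programmatically. The following Python sketch lists the operations found in an already parsed description. It assumes the Swagger 2.0 layout with a top-level `paths` object (older Swagger 1.x descriptions are structured differently), and the paths and summaries in `SAMPLE_SPEC` are invented for illustration; they are not the actual d:swarm endpoints.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch"}

# Hypothetical excerpt of a Swagger 2.0 description (not the real d:swarm API).
SAMPLE_SPEC = {
    "swagger": "2.0",
    "paths": {
        "/resources": {
            "get": {"summary": "List resources"},
            "post": {"summary": "Upload a resource"},
        },
        "/projects": {
            "get": {"summary": "List projects"},
            "parameters": [],  # path-level key that is not an HTTP method
        },
    },
}


def list_operations(spec):
    """Return (METHOD, path, summary) tuples, ordered by path."""
    operations = []
    for path, item in sorted(spec.get("paths", {}).items()):
        for method, operation in item.items():
            if method.lower() in HTTP_METHODS:
                operations.append(
                    (method.upper(), path, operation.get("summary", ""))
                )
    return operations


for method, path, summary in list_operations(SAMPLE_SPEC):
    print(f"{method:6} {path:12} {summary}")
```

To use this against a running installation, load the JSON description served by the back end (its location depends on your setup) with `json.load` and pass the result to `list_operations`.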

How to get started?

Just go to http://demo.dswarm.org and try it out, or set up your own local installation and run it from there.

Installation and Configuration

It might be a good idea to run d:swarm locally to get full insight into the processes of our application. Installation instructions can be found in the Developer Install guide; d:swarm Configuration provides details on how to configure the system. (For production use of d:swarm, see Server Install.)

Running the System

Once installed, the BackOffice (usually) runs at http://localhost:9999. You may want to have a look at the MySQL Cheat Sheet for our metadata repository schema (see also our domain model) and use a tool of your choice to explore the database.
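A quick way to verify that a local BackOffice is up is to request its start page. The sketch below uses only the Python standard library; the helper name `is_reachable` is made up for this example, and the URL simply repeats the usual default mentioned above, so adjust it if your installation uses a different port.

```python
import urllib.error
import urllib.request

BACKOFFICE_URL = "http://localhost:9999"  # usual default; may differ locally

# Bypass any configured HTTP proxy, since the target is a local installation.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_reachable(url, timeout=3.0):
    """Return True if an HTTP server answers at `url`, False otherwise."""
    try:
        with _opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, albeit with an error status code.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar problems.
        return False


if __name__ == "__main__":
    state = "reachable" if is_reachable(BACKOFFICE_URL) else "not reachable"
    print(f"BackOffice at {BACKOFFICE_URL} is {state}")
```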

Contributing

Would you like to contribute? Awesome!

License

All code from the repositories that belong to our project (see here) is published under the APL2 license (except for the d:swarm Neo4j unmanaged extension, which is published under the GPL3 license).