Outlook
Based on the DataHub version of d:swarm, the following features are planned to support the improvement of data quality and structure:
img/dswarm-workflow-complete.png
Deduplication
Deduplication comprises two steps: (1) finding duplicates (only in the simplest case can this be done via a common ID) and (2) applying an appropriate strategy for merging them.
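The two steps can be illustrated with a minimal Python sketch. It is not d:swarm code: the record layout, the function names `find_duplicates` and `merge_records`, the choice of `title` and `year` as matching key, and the "first non-empty value wins" merge strategy are all assumptions made for illustration.

```python
from collections import defaultdict


def find_duplicates(records, key_fields=("title", "year")):
    """Step 1: group records whose normalized key fields match.

    Hypothetical matching rule: records count as duplicates when the
    trimmed, lower-cased values of all key fields are equal.
    """
    groups = defaultdict(list)
    for record in records:
        key = tuple(str(record.get(f, "")).strip().lower() for f in key_fields)
        groups[key].append(record)
    # Only groups with more than one member contain duplicates.
    return [group for group in groups.values() if len(group) > 1]


def merge_records(group):
    """Step 2: merge a duplicate group into one record.

    Hypothetical strategy: for each field, the first non-empty value wins.
    """
    merged = {}
    for record in group:
        for field, value in record.items():
            if value not in (None, "") and field not in merged:
                merged[field] = value
    return merged


# Made-up sample data; there is no common ID to match on.
records = [
    {"id": "a1", "title": "Faust", "year": 1808, "publisher": ""},
    {"id": "b7", "title": " faust ", "year": 1808, "publisher": "Example Press"},
    {"id": "c3", "title": "Hamlet", "year": 1603, "publisher": "Sample House"},
]

duplicate_groups = find_duplicates(records)
merged = [merge_records(group) for group in duplicate_groups]
```

Here the first two records fall into one group, and the merged record takes its `id` from the first record and its `publisher` from the second. A real merge strategy would likely also need to weigh provenance and resolve conflicting values.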
FRBRization
FRBRization is a process specific to the bibliographic context. It defines the conversion of flat bibliographic data structures to a model of related entities implementing the FRBR standard, for example, to ensure a work is properly related to its manifestations.
Deduplication and FRBRization need to happen on the DataHub, since we may want to refer to data from a specific version or with a specific provenance. The cleaned data will be written back to the DataHub.
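As a rough illustration of converting flat records into related entities, the following Python sketch groups records into works and attaches one manifestation per record. It is heavily simplified and assumes things the text does not state: only two of the FRBR entity types (work and manifestation) are modelled, and the field names, the class names, and the rule that identical author and title denote the same work are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Manifestation:
    identifier: str
    publisher: str
    year: int


@dataclass
class Work:
    title: str
    author: str
    manifestations: list = field(default_factory=list)


def frbrize(flat_records):
    """Group flat records into works and attach each record as a manifestation.

    Assumption: two records describe the same work when their normalized
    author and title are equal.
    """
    works = {}
    for rec in flat_records:
        author = rec["author"].strip()
        title = rec["title"].strip()
        key = (author.lower(), title.lower())
        if key not in works:
            works[key] = Work(title=title, author=author)
        works[key].manifestations.append(
            Manifestation(rec["id"], rec["publisher"], rec["year"])
        )
    return list(works.values())


# Made-up flat input: one row per published edition.
flat = [
    {"id": "m1", "author": "Goethe", "title": "Faust", "publisher": "Example Press", "year": 1808},
    {"id": "m2", "author": "goethe", "title": "Faust ", "publisher": "Sample House", "year": 1986},
    {"id": "m3", "author": "Shakespeare", "title": "Hamlet", "publisher": "Example Press", "year": 1603},
]

works = frbrize(flat)
```

The three flat rows become two works, and the first work is related to the manifestations `m1` and `m2`.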
Filtering Statements by Context
A prerequisite for the above-mentioned data quality procedures is the ability to filter statements by additional context, such as provenance, version or time. For example, a mapping used in a data quality procedure may need to select data based on the resource it was imported from. Versioning is also needed so that unintentionally created statements can be removed easily.
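A minimal sketch of such context filtering, in Python, might look as follows. The `Statement` structure, its `provenance` and `version` fields, and the `filter_statements` function are assumptions for illustration and do not reflect the actual DataHub model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str
    provenance: str  # hypothetical: the resource the statement was imported from
    version: int     # hypothetical: the version in which the statement was created


def filter_statements(statements, provenance=None, max_version=None):
    """Return statements matching the given context; None means 'do not filter'."""
    result = []
    for s in statements:
        if provenance is not None and s.provenance != provenance:
            continue
        if max_version is not None and s.version > max_version:
            continue
        result.append(s)
    return result


# Made-up statements from two resources and two versions.
statements = [
    Statement("rec:1", "title", "Faust", provenance="resource-a", version=1),
    Statement("rec:1", "title", "Fuast", provenance="resource-a", version=2),
    Statement("rec:2", "title", "Hamlet", provenance="resource-b", version=1),
]

# Select by source resource.
from_a = filter_statements(statements, provenance="resource-a")

# Ignore everything created after version 1, e.g. an accidental import.
up_to_v1 = filter_statements(statements, max_version=1)
```

In this sketch, restricting to `max_version=1` drops the misspelled statement that was created in version 2, which is the kind of cleanup versioning is meant to make easy.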
Community Sharing
While most artifacts in d:swarm are already modelled to support reuse and sharing, we are planning to make sharing a prominent feature, easily accessible from various views in the d:swarm BackOffice. Sharing and discussing projects, transformations and mappings with other users facing the same data management tasks should be possible.