C3PO Components - datascience/c3po GitHub Wiki

This document describes the C3PO architecture. The main logical blocks of its components are presented in the following figure:

[Figure: C3PO Components]

Core is the main module. It contains the logic, controllers and algorithms for content profiling, organised into the following packages: Controller, Metadata processor, Consolidator, Persistence layer and Data analysis.

Controller is responsible for executing any operation in C3PO, such as running parallel jobs to read metadata, working with the database or analysing data.
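As a rough illustration of how a controller might fan metadata-reading work out to parallel jobs, here is a minimal Python sketch. It is not C3PO's actual code: `parse_metadata`, `run_jobs` and the sample inputs are hypothetical names chosen for this example, and a thread pool is only one of several ways to run such jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_metadata(raw):
    """Hypothetical stand-in for a single metadata-reading job."""
    name = raw.strip()
    return {"name": name.lower(), "length": len(name)}


def run_jobs(inputs, worker, max_workers=4):
    """Run `worker` over `inputs` in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, inputs))


# Two toy inputs processed by the pool.
results = run_jobs([" A.pdf ", "B.png"], parse_metadata)
```

`pool.map` returns results in input order, which keeps the later processing steps deterministic even though the jobs run concurrently.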

Metadata processor is called by the Controller to read and parse the results produced by characterisation tools. The results are read through the Gatherer API, which has an implementation for the local file system. Once read, the data is further processed by the adaptor of the characterisation tool that produced it. There are adaptors for the File Information Tool Set (FITS) and Apache Tika.
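The gather-then-adapt flow described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `gather_local`, `key_value_adaptor` and `process` are hypothetical names, and the toy `key=value` format stands in for real tool output (FITS, for instance, produces XML that needs a proper parser).

```python
import tempfile
from pathlib import Path


def gather_local(root):
    """Gatherer stand-in: yield (file name, text) for every file under root."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            yield path.name, path.read_text(encoding="utf-8")


def key_value_adaptor(text):
    """Toy adaptor: turn 'key=value' lines into a property dictionary."""
    properties = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            properties[key.strip()] = value.strip()
    return properties


def process(root, adaptor):
    """Read every gathered file and hand its content to the adaptor."""
    return [
        {"source": name, "properties": adaptor(text)}
        for name, text in gather_local(root)
    ]


# Usage: one fake tool-output file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("format=PDF\nsize=1024\n", encoding="utf-8")
    records = process(tmp, key_value_adaptor)
```

Keeping the gatherer and the adaptor as separate callables mirrors the split in the text: a different storage location needs a new gatherer, a different tool needs a new adaptor.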

Consolidator prepares data for storage and checks whether there is new information about digital objects that have already been processed. For example, when new characterisation tools are run on the same collection, the Consolidator compares their results with the older data and merges them.
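One way to picture the merge step is the following Python sketch. It assumes a hypothetical record layout in which each property maps tool names to the value that tool reported; C3PO's real data model may differ, and `consolidate` is a name invented for this example.

```python
def consolidate(stored, incoming):
    """Merge newly read records into stored ones, keyed by object id.

    Assumed layout: {object_id: {property: {tool_name: value}}}.
    Values from different tools are kept side by side, not overwritten.
    """
    # Copy the stored data level by level so the input is left untouched.
    merged = {
        oid: {prop: dict(by_tool) for prop, by_tool in props.items()}
        for oid, props in stored.items()
    }
    for oid, props in incoming.items():
        target = merged.setdefault(oid, {})
        for prop, by_tool in props.items():
            target.setdefault(prop, {}).update(by_tool)
    return merged


old = {"obj1": {"format": {"FITS": "PDF"}}}
new = {
    "obj1": {"format": {"Tika": "PDF"}, "pages": {"Tika": "3"}},
    "obj2": {"format": {"Tika": "PNG"}},
}
merged = consolidate(old, new)
```

Because the values reported by each tool are preserved, disagreements between tools stay visible for a later conflict-resolution step.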

Persistence layer provides an interface to database systems. By default, C3PO works with MongoDB, a scalable NoSQL document store.
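The idea of hiding the database behind an interface can be sketched as follows. This is an assumption-laden illustration, not C3PO's persistence API: `PersistenceLayer`, `InMemoryStore`, `save` and `find` are hypothetical names, and the in-memory dictionary merely stands in for a real backend such as MongoDB.

```python
from abc import ABC, abstractmethod


class PersistenceLayer(ABC):
    """Hypothetical interface that the rest of the code would depend on."""

    @abstractmethod
    def save(self, object_id, document):
        """Store or replace the document for one digital object."""

    @abstractmethod
    def find(self, **criteria):
        """Return all documents whose fields equal the given criteria."""


class InMemoryStore(PersistenceLayer):
    """Toy backend; a document-store backend would implement the same methods."""

    def __init__(self):
        self._docs = {}

    def save(self, object_id, document):
        self._docs[object_id] = dict(document)

    def find(self, **criteria):
        return [
            doc for doc in self._docs.values()
            if all(doc.get(key) == value for key, value in criteria.items())
        ]


store = InMemoryStore()
store.save("obj1", {"format": "PDF", "size": 1024})
store.save("obj2", {"format": "PNG", "size": 2048})
pdfs = store.find(format="PDF")
```

Code written against the abstract class does not need to change when the backend is swapped.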

Data analysis contains algorithms for analysing the data. There are classes that calculate statistics and distributions over the properties of the content. Sampling selects representative objects from the collection using different constraints, e.g. random, size-based and property-based. The package also contains classes for conflict resolution, which is needed when the data quality of the collection is unsatisfactory.
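To make the distribution and sampling ideas concrete, here is a small Python sketch. The function names and record layout are hypothetical, and `size_based_sample` shows only one possible reading of "size-based" (picking objects spread evenly across the size-ordered collection); the algorithms C3PO actually uses may differ.

```python
import random
from collections import Counter


def distribution(records, prop):
    """Count how often each value of `prop` occurs in the collection."""
    return Counter(record.get(prop, "unknown") for record in records)


def random_sample(records, k, seed=0):
    """Pick up to k records uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def size_based_sample(records, k):
    """Pick k records spread evenly over the size-ordered collection."""
    ordered = sorted(records, key=lambda record: record["size"])
    if k >= len(ordered):
        return ordered
    if k == 1:
        return [ordered[len(ordered) // 2]]
    step = (len(ordered) - 1) / (k - 1)
    return [ordered[round(i * step)] for i in range(k)]


collection = [
    {"id": 1, "format": "PDF", "size": 10},
    {"id": 2, "format": "PDF", "size": 20},
    {"id": 3, "format": "PNG", "size": 30},
    {"id": 4, "format": "PDF", "size": 40},
    {"id": 5, "format": "PNG", "size": 50},
]
formats = distribution(collection, "format")
by_size = size_based_sample(collection, 3)
```

With this toy collection, the size-based sample contains the smallest, the median and the largest object, which is the sort of spread a representative sample is meant to capture.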

Two modules provide access to the Core. The command-line interface (CLI) application is a text-based interface for populating C3PO, and the database behind it, with metadata from characterisation tools. The web application provides a graphical user interface for interacting with the stored data and performing content profiling.