Setup Guide - datascience/c3po GitHub Wiki

Setup Guide

This is an installation guide for C3PO 0.6.0. It will help you set up and run the command line application as well as the web app on a server of your choice.

C3PO can also be installed using Docker. The instructions are at the bottom of the page.

Requirements

  • Java 1.8
  • MongoDB 3.2 or higher
  • sbt 1.0 or higher
  • FITS 0.6 or higher (optional)
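
Before building, it can help to confirm that the required tools are reachable from the shell. The sketch below is not part of C3PO: the function names are hypothetical, and it assumes the binaries are called java, sbt and mongod on your system. It only checks presence on the PATH, not versions.

```shell
# Hypothetical helper (not shipped with C3PO): report whether a command is on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Check the tools from the requirements list. The binary names are assumptions
# and may differ on your system (for example, when MongoDB runs on another host).
check_c3po_requirements() {
  status=0
  for tool in java sbt mongod; do
    require_cmd "$tool" || status=1
  done
  return "$status"
}
```

Calling check_c3po_requirements prints one line per tool and returns a non-zero status if any of them is missing.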

Setup

Install Java, sbt, MongoDB (http://www.mongodb.org) and, optionally, FITS 0.6 (http://projects.iq.harvard.edu/fits), if you haven't already. Take note of the port where the mongo daemon is running (27017 by default). Clone this repository to a location of your choice (this guide assumes ~/c3po), then run sbt:

cd ~/c3po
sbt clean compile assembly

The assembled command line jar is then located in ~/c3po/c3po-cmd/target/scala-2.11.

General

The c3po command line has several modes to choose from. To start c3po, run the following command:

java -jar c3po-cmd-assembly-0.1-SNAPSHOT.jar

Run without arguments, this outputs an error message listing the modes that you can use. All available modes and their options are described below. Options marked with '*' are mandatory.

The help mode prints all the available modes and options.

Usage: c3po help

The version mode prints version information.

Usage: c3po version

The gather mode is used to read meta data into the mongo database.

Usage: c3po gather [options]
  Options:
  * -c, --collection
       The name of the collection
  * -i, --inputdir
       The input directory where the meta data is stored
    -r, --recursive
       Whether or not to gather recursively
       Default: false
    -t, --type
       Optional parameter to define the meta data type. Use one of 'FITS' or
       'TIKA', to select the type of the input files. Default is FITS
       Default: FITS

The profile mode is used to generate a profile in xml format.

Usage: c3po profile [options]
  Options:
    -a, --algorithm
       The algorithm that will be used for selecting the samples records.
       Supported values are: 'sizesampling', 'syssampling', 'distsampling'
       Default: sizesampling
  * -c, --collection
       The name of the collection
    -ie, --includeelements
       If this flag is present, the profile will include a list of element
       identifiers. Note, that this might be a long list.
       Default: false
    -o, --outputdir
       The output directory where the profile will be stored
       Default: <empty string>
    -props, --properties
       The list of properties for the 'distsampling' algorithm
       Default: []
    -s, --size
       The size of the samples set.
       Default: 5

The samples mode is used to select representative samples based on different strategies.

Usage: c3po samples [options]
  Options:
    -a, --algorithm
       The algorithm that will be used for selecting the samples records. Use
       one of 'sizesampling', 'syssampling', 'distsampling'
       Default: sizesampling
  * -c, --collection
       The name of the collection
    -o, --outputdir
       The output directory where the samples will be output. If nothing is
       provided the output is written to the console
    -props, --properties
       The list of properties for the 'distsampling' algorithm
       Default: []
    -s, --size
       The size of the samples set.
       Default: 5

The export mode is used to export the data in a csv format.

Usage: c3po export [options]
  Options:
  * -c, --collection
       The name of the collection
    -o, --outputdir
       The output directory where the profile will be stored
       Default: <empty string>

The remove mode is used to remove a collection.

Usage: c3po remove [options]
  Options:
  * -c, --collection
       The name of the collection
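
Since remove deletes a collection, one cautious pattern is to export the data first and remove the collection only if the export succeeded. The sketch below shows this under the same assumptions as the earlier examples; the function name export_then_remove is hypothetical, and this guide does not state whether remove asks for confirmation.

```shell
# Hypothetical helper: export a collection to CSV and remove it only if the
# export command exited successfully.
# $1: collection name, $2: output directory for the CSV export.
export_then_remove() {
  jar=c3po-cmd-assembly-0.1-SNAPSHOT.jar
  java -jar "$jar" export -c "$1" -o "$2" &&
    java -jar "$jar" remove -c "$1"
}
```

The && ensures that the remove step is skipped when the export exits with a non-zero status.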

Advanced

C3PO relies on some simple configuration parameters, like the db name, db host, db port, etc. Defaults are supplied within the jar, so you don't have to do anything. However, if you want to override them, create a file called .c3poconfig in your home directory and set the properties you want to change. C3PO will use the defaults for all properties that you skip. Here are the defaults:

#Application default properties.
c3po.persistence=default                         # the class provider for the persistence layer (or default)
c3po.controller.adaptors.count=4                 # the count of the adaptors
c3po.controller.consolidators.count=2            # the count of the consolidators
c3po.rule.infer_date_from_file_name=false        # a rule that tries to infer a date from the file names
c3po.rule.html_info_processing=false             # a rule that cleans up special fits meta data
c3po.rule.format_version_resolution=true         # a rule that fixes some errors in format version parsing
c3po.rule.empty_value_processing=true            # a rule that does not allow empty values
c3po.rule.create_element_identifier=true         # a rule that creates element identifiers if none are provided by the adaptor
c3po.adaptor.tika.version="unknown"              # the tika version (if tika files were processed)

#DB default Properties
db.host=127.0.0.1                                # the host where mongo is running
db.port=27017                                    # the port where mongo is listening
db.name=c3po                                     # the name of the db
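
For example, to point C3PO at a MongoDB instance on another host, only the db.* properties need to be overridden. The sketch below writes such a file; the function name write_c3po_config and the example host are hypothetical. It keeps each value free of trailing comments, on the assumption that the file is read as a standard Java properties file, where only lines starting with # or ! are treated as comments.

```shell
# Hypothetical helper: write a minimal override file for C3PO.
# $1: target file (e.g. "$HOME/.c3poconfig"), $2: db host,
# $3: db port (default 27017), $4: db name (default c3po).
write_c3po_config() {
  cat > "$1" <<EOF
# Overrides for the C3PO defaults; properties not listed keep their defaults.
db.host=$2
db.port=${3:-27017}
db.name=${4:-c3po}
EOF
}
```

For example, write_c3po_config "$HOME/.c3poconfig" mongo.example.org 27018 would create a file that overrides the host and port and keeps the default db name. Note that the function overwrites the target file if it already exists.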

Web Application

The Web App provides a UI for the data and allows you to filter the data, select some sample records, export data (xml profile and csv), but also to integrate with tools like PLATO and SCOUT.

Build and Deploy

Note that version 0.6.0 uses Play 2.4, so make sure you install the correct version. To run the web application, execute the command sbt "project c3po-webapi" run from ~/c3po. Open a browser and navigate to localhost:9000/c3po. You should see the application running.

Additionally, executing sbt clean compile assembly will generate everything you need for the standalone version. Just run the generated jar ~/c3po/c3po-webapi/target/scala-2.11/c3po-webapi-assembly-0.1-SNAPSHOT.jar with java -jar. This will run the app in production mode.

Docker

Docker allows users to start a local instance of C3PO without manually installing sbt, Java, and MongoDB. Make sure Docker 17 (or higher) is installed. Replace /path/on/host with the location of a folder containing FITS files and execute:

cd ~/c3po/
docker build . -t c3pobundle
docker run -it -p 9000:9000 -v /path/on/host:/data/FITS c3pobundle

Alternatively, a prebuilt image of the bundle is available on Docker Hub. You can use it directly, replacing port with the host port of your choice and /path/on/host with the folder containing FITS files:

docker run -it -p port:9000 -v /path/on/host:/data/FITS artourkin/c3po:latest

Once the message (Server started, use Ctrl+D to stop and go back to the console...) is printed, C3PO is available at http://localhost:9000/c3po, or at the host port you chose if you mapped a different one.
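
The run command can be wrapped so that a mistyped host path fails early instead of starting a container with an empty data folder. The following is a minimal sketch using the published image named in this section; the function name run_c3po_container is hypothetical.

```shell
# Hypothetical wrapper around the docker run command from this guide.
# $1: host directory with FITS files, $2: host port to publish (default 9000).
run_c3po_container() {
  if [ ! -d "$1" ]; then
    echo "not a directory: $1" >&2
    return 1
  fi
  docker run -it -p "${2:-9000}:9000" -v "$1:/data/FITS" artourkin/c3po:latest
}
```

For example, run_c3po_container /path/on/host 8080 would publish the app on host port 8080. Pass an absolute path, since Docker interprets a non-absolute -v source as a volume name rather than a host directory.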

If you have any additional questions, please contact us.