Setup Guide - datascience/c3po GitHub Wiki
Setup Guide
This is an installation guide for C3PO 0.6.0. It will help you set up and run the command-line application as well as the web app on a server of your choice.
It is now also possible to install C3PO using Docker. The instructions are at the bottom of the page.
Requirements
- Java 1.8
- MongoDB 3.2 or higher
- sbt 1.0 or higher
- FITS 0.6 or higher (optional)
Setup
Install Java, sbt, MongoDB (http://www.mongodb.org) and, optionally, FITS 0.6 (http://projects.iq.harvard.edu/fits), if you haven't already. Take note of the port where the mongo daemon is running (27017 by default). Clone this repository to a location of your choice (this guide assumes ~/c3po). Then build the project with sbt:
cd ~/c3po
sbt clean compile assembly
You can find the command-line c3po jar in ~/c3po/c3po-cmd/target/scala-2.11.
General
The c3po command-line application has several modes you can choose from. To use c3po, run the following command:
java -jar c3po-cmd-assembly-0.1-SNAPSHOT.jar
Run without arguments, this prints an error message listing the modes that you can use. All available modes and their options are described below. Options marked with '*' are required.
The help mode prints all the available modes and options.
Usage: c3po help
The version mode prints version information.
Usage: c3po version
The gather mode is used to read metadata into the Mongo database.
Usage: c3po gather [options]
Options:
* -c, --collection
The name of the collection
* -i, --inputdir
The input directory where the meta data is stored
-r, --recursive
Whether or not to gather recursively
Default: false
-t, --type
Optional parameter to define the meta data type. Use one of 'FITS' or
'TIKA', to select the type of the input files. Default is FITS
Default: FITS
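As a rough illustration of the gather options above, the following sketch loads a directory of FITS output into a collection. The collection name mycollection, the input directory ~/fits-output, the C3PO_JAR variable and the c3po wrapper function are assumptions made for this example only. The jar path follows the build step described under Setup. If the jar has not been built yet, the wrapper only prints the command it would run.

```shell
#!/bin/sh
# Path of the jar produced by "sbt clean compile assembly" (see Setup).
# C3PO_JAR is a variable invented for this sketch, not something C3PO reads.
C3PO_JAR="${C3PO_JAR:-$HOME/c3po/c3po-cmd/target/scala-2.11/c3po-cmd-assembly-0.1-SNAPSHOT.jar}"

# Hypothetical helper: run the jar if it exists,
# otherwise print the command instead of executing it.
c3po() {
  if [ -f "$C3PO_JAR" ]; then
    java -jar "$C3PO_JAR" "$@"
  else
    echo "dry run: java -jar $C3PO_JAR $*"
  fi
}

# Example values (assumptions): adjust to your own collection and data.
COLLECTION="mycollection"
INPUT_DIR="$HOME/fits-output"

# -c and -i are required; -r gathers recursively; -t defaults to FITS.
gather_result=$(c3po gather -c "$COLLECTION" -i "$INPUT_DIR" -r -t FITS)
echo "$gather_result"
```

Passing -t FITS is redundant here because FITS is the default. It is spelled out only to show where 'TIKA' would go.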
The profile mode is used to generate a profile in xml format.
Usage: c3po profile [options]
Options:
-a, --algorithm
The algorithm that will be used for selecting the samples records.
Supported values are: 'sizesampling', 'syssampling', 'distsampling'
Default: sizesampling
* -c, --collection
The name of the collection
-ie, --includeelements
If this flag is present, the profile will include a list of element
identifiers. Note, that this might be a long list.
Default: false
-o, --outputdir
The output directory where the profile will be stored
Default: <empty string>
-props, --properties
The list of properties for the 'distsampling' algorithm
Default: []
-s, --size
The size of the samples set.
Default: 5
The samples mode is used to select representative samples based on different strategies.
Usage: c3po samples [options]
Options:
-a, --algorithm
The algorithm that will be used for selecting the samples records. Use
one of 'sizesampling', 'syssampling', 'distsampling'
Default: sizesampling
* -c, --collection
The name of the collection
-o, --outputdir
The output directory where the samples will be output. If nothing is
provided the output is written to the console
-props, --properties
The list of properties for the 'distsampling' algorithm
Default: []
-s, --size
The size of the samples set.
Default: 5
The export mode is used to export the data in CSV format.
Usage: c3po export [options]
Options:
* -c, --collection
The name of the collection
-o, --outputdir
The output directory where the profile will be stored
Default: <empty string>
The remove mode is used to remove a collection.
Usage: c3po remove [options]
Options:
* -c, --collection
The name of the collection
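Once a collection has been gathered, the remaining modes can be run one after another. The sketch below is one possible sequence, not a prescribed workflow. The collection name, the output directory ~/c3po-output, the C3PO_JAR variable and the c3po wrapper function are assumptions for illustration. Only flags listed in the option tables above are used.

```shell
#!/bin/sh
# Path of the jar produced by the build step (C3PO_JAR is invented for this sketch).
C3PO_JAR="${C3PO_JAR:-$HOME/c3po/c3po-cmd/target/scala-2.11/c3po-cmd-assembly-0.1-SNAPSHOT.jar}"

# Hypothetical helper: run the jar if present, otherwise only print the command.
c3po() {
  if [ -f "$C3PO_JAR" ]; then
    java -jar "$C3PO_JAR" "$@"
  else
    echo "dry run: java -jar $C3PO_JAR $*"
  fi
}

# Example values (assumptions).
COLLECTION="mycollection"
OUT_DIR="$HOME/c3po-output"
mkdir -p "$OUT_DIR"

# XML profile with 10 sample records picked by the sizesampling algorithm;
# -ie additionally lists the element identifiers in the profile.
c3po profile -c "$COLLECTION" -o "$OUT_DIR" -a sizesampling -s 10 -ie

# 10 sample records picked by the syssampling algorithm.
# No -o is given, so the output goes to the console.
c3po samples -c "$COLLECTION" -a syssampling -s 10

# CSV export of the collection.
c3po export -c "$COLLECTION" -o "$OUT_DIR"

# Destructive: removes the collection. Uncomment only when the data is no longer needed.
# c3po remove -c "$COLLECTION"
```

The remove call is left commented out on purpose, since it deletes the collection that the earlier steps read from.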
Advanced
C3PO relies on a few simple configuration parameters, such as the db name, db host and db port. Defaults are supplied within the jar, so you don't have to do anything. However, if you want to override them, create a file called .c3poconfig in your home directory and set the properties you want to change. C3PO will use the defaults for all properties that you skip. Here are the defaults:
#Application default properties.
c3po.persistence=default # the class provider for the persistence layer (or default)
c3po.controller.adaptors.count=4 # the count of the adaptors
c3po.controller.consolidators.count=2 # the count of the consolidators
c3po.rule.infer_date_from_file_name=false # a rule that tries to infer a date from the file names
c3po.rule.html_info_processing=false # a rule that cleans up special fits meta data
c3po.rule.format_version_resolution=true # a rule that fixes some errors in format version parsing
c3po.rule.empty_value_processing=true # a rule that does not allow empty values
c3po.rule.create_element_identifier=true # a rule that creates element identifiers if none are provided by the adaptor
c3po.adaptor.tika.version="unknown" # the tika version (if tika files were processed)
#DB default Properties
db.host=127.0.0.1 # the host where mongo is running
db.port=27017 # the port where mongo is listening
db.name=c3po # the name of the db
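To override only a few of these defaults, a small .c3poconfig is enough. The sketch below writes one that points C3PO at a MongoDB instance on another host and port. The host, port and database name are made-up example values, and CONFIG_FILE is a variable used only by this script. Comments are kept on their own lines, on the assumption that the file may be parsed as a standard Java properties file, where a '#' after a value would become part of the value.

```shell
#!/bin/sh
# CONFIG_FILE exists only in this sketch; C3PO itself looks for ~/.c3poconfig.
CONFIG_FILE="${CONFIG_FILE:-$HOME/.c3poconfig}"

if [ -e "$CONFIG_FILE" ]; then
  # Do not clobber an existing configuration.
  echo "$CONFIG_FILE already exists, leaving it untouched"
else
  cat > "$CONFIG_FILE" <<'EOF'
# Local overrides; every property not listed here keeps its default.
# Example values (assumptions): MongoDB on another machine and port.
db.host=192.168.1.50
db.port=27018
db.name=c3po_test
EOF
  echo "wrote $CONFIG_FILE"
fi

cat "$CONFIG_FILE"
```

All c3po.* properties are left out here, so they keep the defaults listed above.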
Web Application
The Web App provides a UI for the data. It allows you to filter the data, select sample records and export data (XML profile and CSV), and it also integrates with tools like PLATO and SCOUT.
Build and Deploy
Note that version 0.6.0 uses Play 2.4, so make sure you install the correct version.
To run the web API, execute the command sbt "project c3po-webapi" run from ~/c3po.
Fire up a browser and navigate to localhost:9000/c3po. You should see the application running.
Additionally, executing sbt clean compile assembly will generate everything you need for the standalone version. Just run the generated jar ~/c3po/c3po-webapi/target/scala-2.11/c3po-webapi-assembly-0.1-SNAPSHOT.jar. This will run the app in production mode.
Docker
Docker allows users to start a local instance of C3PO without manually installing sbt, Java, and MongoDB. Make sure Docker 17 (or higher) is installed. Replace /path/on/host with the location of the folder that contains your FITS files and execute:
cd ~/c3po/
docker build . -t c3pobundle
docker run -it -p 9000:9000 -v /path/on/host:/data/FITS c3pobundle
Alternatively, we have pushed a prebuilt image of the bundle to Docker Hub. You can use the image directly, replacing port with the host port of your choice and /path/on/host with your FITS folder:
docker run -it -p port:9000 -v /path/on/host:/data/FITS artourkin/c3po:latest
Once the message (Server started, use Ctrl+D to stop and go back to the console...) gets printed, C3PO is available at http://localhost:9000/c3po.
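If you start the container from a script and cannot watch for that message, one option is to poll the URL until it answers. The function below is a minimal sketch of that idea. wait_for_c3po is a hypothetical helper, not part of C3PO, and it assumes curl is installed on the host.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Usage: wait_for_c3po URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_c3po() {
  url="$1"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors, -sS hides progress but keeps error messages.
    if curl -fsS -o /dev/null "$url"; then
      echo "C3PO is answering at $url"
      return 0
    fi
    # Wait before the next attempt, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "No answer from $url after $attempts attempts" >&2
  return 1
}
```

For the commands above, it could be called as wait_for_c3po http://localhost:9000/c3po, adjusting the port if you mapped a different one.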
If you have any additional questions, please contact us.