Home - nih-cfde/published-documentation GitHub Wiki

This wiki is a companion to the detailed technical documentation for participating in the Common Fund Data Ecosystem Portal.

Use the sidebar to navigate wiki pages
Ask questions or search common errors in Discussions
Get started on your submission with our QuickStart
Need more help? Email the Helpdesk

What is the C2M2?

The Crosscut Metadata Model (C2M2) is a flexible metadata standard for describing experimental resources in biomedicine and related fields. At the Common Fund Data Ecosystem (CFDE) we use the C2M2 as our centralized model of participating datasets in a rich relational database accessible at https://app.nih-cfde.org/. This portal supports faceted search of metadata concepts, such as anatomical location, species, and assay type, across a wide variety of datasets using a controlled vocabulary (we do not currently support protected metadata). Researchers can therefore find, in one place, data that they would otherwise have to search for in each source individually, using varying nomenclatures.

Currently, the portal only accepts C2M2 datapackages from Common Fund Programs. If you represent the Data Coordination Center (DCC) of a Common Fund Program and would like to know more about joining the Common Fund Data Ecosystem, please contact us by emailing the helpdesk. Funding is available for Common Fund Programs that wish to participate: see Engagement Opportunities for Common Fund Programs for more information.

How does the CFDE Data Submission System work?

Graphic overview of the steps for data submission. White boxes are user steps; blue boxes are automated.

DCCs build a set of tab-separated value files (TSVs) that represent their available data, then generate controlled vocabulary (CV) term tables using the CFDE-provided C2M2 submission prep script, which also performs several pre-submission integrity checks on the data being prepared. Once all checks have passed, DCCs submit the prepared set of TSVs to the CFDE Data Submission System using the cfde-submit tool. This tool takes a directory as input, does some initial validation of its own, then builds the directory into a bdbag (a "datapackage") and submits the datapackage to an authenticated Globus endpoint. This process should take less than 30 seconds on your local computer, and the tool will report "Your dataset has been submitted" when it completes.
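Before running the prep script, it can help to confirm that the TSV directory is at least structurally sound. The following is a minimal sketch of that kind of check, not the CFDE prep script or any part of cfde-submit: the function name check_tsv_directory and the specific checks (header present, no duplicate column names, consistent field counts) are assumptions chosen for illustration. The actual C2M2 integrity checks are more extensive.

```python
import csv
from pathlib import Path


def check_tsv_directory(directory):
    """Return a list of human-readable problems found in the TSVs of `directory`.

    Hypothetical helper for illustration only: the real C2M2 checks are
    performed by the CFDE submission prep script and by cfde-submit.
    """
    tsv_paths = sorted(Path(directory).glob("*.tsv"))
    if not tsv_paths:
        return [f"no .tsv files found in {directory}"]

    problems = []
    for path in tsv_paths:
        # newline="" lets the csv module handle line endings itself.
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle, delimiter="\t"))

        if not rows:
            problems.append(f"{path.name}: file is empty (no header row)")
            continue

        header = rows[0]
        if len(set(header)) != len(header):
            problems.append(f"{path.name}: duplicate column names in header")

        # Every data row should have exactly as many fields as the header.
        for line_number, row in enumerate(rows[1:], start=2):
            if len(row) != len(header):
                problems.append(
                    f"{path.name}: line {line_number} has {len(row)} fields, "
                    f"expected {len(header)}"
                )
    return problems
```

An empty return value means only that these basic structural checks passed. The prep script and cfde-submit still need to be run, and their results take precedence.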

Once your data is in the Globus endpoint, our database (Deriva) will automatically begin ingesting the datapackage and performing further validation. This process takes several minutes but runs entirely on our servers, so you don't need to stay connected. You can check the status of your ingest at any time using cfde-submit status. When Deriva finishes your ingest, you will receive an email that contains information about the datapackage, including a link to view the data in the CFDE data portal. You can also navigate directly to the submission system by logging in at https://app.nih-cfde.org/ and clicking 'Data Review'.

DCCs can have any number of submitted datapackages in the system, and can use the portal to view each submission in multiple ways and ensure it is structured as intended. No DCC submissions will be viewable or searchable by the public until they are approved for inclusion in the public release. Although DCCs can have any number of reviewable submissions, only one approved submission is included in each public release. If multiple submissions are approved before the next public release, the most recently approved submission will be used. At each public release date, each DCC's approved datapackage will be rolled into the public catalog and will become searchable in the portal. If a DCC does not submit a new datapackage between releases, their current public datapackage will stay in the portal. If a DCC has submitted and approved a new datapackage, it will completely replace any previous datapackages from that DCC. We do not have the ability to accept updates to existing submissions at this time.
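The release rules above can be sketched in a few lines of code. This is an illustrative model only, not how the portal is implemented: the Submission class, its fields, and the build_release function are hypothetical names introduced here, and approval is modeled as a simple timestamp.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Submission:
    """Hypothetical record of one submitted datapackage."""
    dcc: str
    name: str
    approved_at: Optional[datetime] = None  # None means not approved


def build_release(current_public, new_submissions):
    """Return {dcc: Submission} describing the next public release.

    current_public: {dcc: Submission} currently in the public catalog.
    new_submissions: submissions made since the last release.
    """
    latest_approved = {}
    for sub in new_submissions:
        if sub.approved_at is None:
            continue  # unapproved submissions are never made public
        chosen = latest_approved.get(sub.dcc)
        # Of several approved submissions, the most recently approved wins.
        if chosen is None or sub.approved_at > chosen.approved_at:
            latest_approved[sub.dcc] = sub

    # A newly approved datapackage completely replaces the DCC's previous
    # one; DCCs with nothing newly approved keep their current datapackage.
    return {**current_public, **latest_approved}
```

For example, if a DCC approves two submissions before a release date, only the one approved later appears in the release, while a DCC with no newly approved submission keeps its existing public datapackage.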