JSON structure to store jobs - mettadatalabs1/oncoscape-datapipeline GitHub Wiki

What is this document about?

This document describes the structure of the job creation JSON. It builds upon the existing JSON structure used in the molecular manifest.

Design motivation

The current structure is extended for four reasons: to enable asynchronous processing of jobs, to provide job status updates, to create a provenance trail for job execution (by linking each job to the logs generated while it ran), and to make it possible to trace data in the DB back to the job that created it (further provenance).

What is getting added?

Here is the architecture overview for the job creation process.

The current structure for describing a job manifest is:

{
    "dataset": "acc",
    "source": "ucsc",
    "type": "mut01",		
    "process": "broadcurated",
    "directory": "../data/UCSC/ACC/",
    "file": "mutation_curated_broad_gene"
}

To this, we will add:

jobID: An auto generated jobID that will be linked to the user
jobStatus: Storing the job status
jobCreationTime: The creation time of the job
jobLogReference: A reference to the log file(s) that are generated during the execution.

The following examples illustrate these changes. The first example is the initial document when it is inserted:

{
    "dataset": "acc",
    "source": "ucsc",
    "type": "mut01",
    "process": "broadcurated",
    "directory": "../data/UCSC/ACC/",
    "file": "mutation_curated_broad_gene",
    jobID: "job001",
    jobStatus: {"state": "NOT PROCESSED",
                "details": []
               },
    jobCreationTime: ISODate("2017-03-07T01:00:00+01:00"),
    jobLogReference: ""
}
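The step from manifest to initial job document can be sketched in Python as below. This is only an illustration under stated assumptions: the helper name `new_job_document` is hypothetical, the job ID is passed in rather than auto-generated, and the creation time is taken as the current UTC time.

```python
from datetime import datetime, timezone


def new_job_document(manifest: dict, job_id: str) -> dict:
    """Return a copy of the manifest extended with the initial job fields.

    Hypothetical helper: the pipeline's real ID generation is not shown here.
    """
    job = dict(manifest)  # shallow copy so the caller's manifest is untouched
    job.update({
        "jobID": job_id,
        "jobStatus": {"state": "NOT PROCESSED", "details": []},
        "jobCreationTime": datetime.now(timezone.utc),
        "jobLogReference": "",  # empty until processing produces logs
    })
    return job


manifest = {
    "dataset": "acc",
    "source": "ucsc",
    "type": "mut01",
    "process": "broadcurated",
    "directory": "../data/UCSC/ACC/",
    "file": "mutation_curated_broad_gene",
}
job = new_job_document(manifest, "job001")
```

Copying the manifest keeps the original six fields unchanged, so the same manifest can be reused to create further jobs.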

The second example illustrates the updates made after Airflow has started processing:

{
    "dataset": "acc",
    "source": "ucsc",
    "type": "mut01",		
    "process": "broadcurated",
    "directory": "../data/UCSC/ACC/",
    "file": "mutation_curated_broad_gene",
    jobID: "job001",
    jobStatus: {"state": "IN-PROGRESS",
                "details": ["INGESTED"]
               },
    jobCreationTime: ISODate("2017-03-07T01:00:00+01:00"),
    jobLogReference: "/var/logs/jobs/jobID/"
}

The third example illustrates the updates made after successful completion:

{
    "dataset": "acc",
    "source": "ucsc",
    "type": "mut01",		
    "process": "broadcurated",
    "directory": "../data/UCSC/ACC/",
    "file": "mutation_curated_broad_gene",
    jobID: "job001",
    jobStatus: {"state": "COMPLETED",
                "details": ["INGESTED", "DATA-VALIDATED", "HUGO-VALIDATED", "DATABASE-UPDATED", "COMPLETED"]
               },
    jobCreationTime: ISODate("2017-03-07T01:00:00+01:00"),
    jobLogReference: "/var/logs/jobs/jobID/"
}

The fourth example illustrates the updates made after a failed run:

{
    "dataset": "acc",
    "source": "ucsc",
    "type": "mut01",		
    "process": "broadcurated",
    "directory": "../data/UCSC/ACC/",
    "file": "mutation_curated_broad_gene",
    jobID: "job001",
    jobStatus: {"state": "FAILED",
                "details": ["INGESTED", "DATA-VALIDATED", "FAILED", "COMPLETED"]
               },
    jobCreationTime: ISODate("2017-03-07T01:00:00+01:00"),
    jobLogReference: "/var/logs/jobs/jobID/"
}
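The way `jobStatus` evolves across the examples above can be sketched as a small update function. This is a minimal illustration, not the pipeline's actual code: the name `record_step` is hypothetical, the state strings follow the spelling used in the examples ("IN-PROGRESS"), and the rule that a FAILED state is never overwritten is an assumption read off the failed example.

```python
import copy


def record_step(job: dict, step: str) -> dict:
    """Return a copy of the job with `step` appended to jobStatus.details.

    Assumed rules (inferred from the examples, not confirmed elsewhere):
    - "FAILED" sets the top-level state to FAILED, and it stays FAILED.
    - "COMPLETED" sets the state to COMPLETED unless the job already failed.
    - any other step means the job is still IN-PROGRESS.
    """
    updated = copy.deepcopy(job)
    status = updated["jobStatus"]
    status["details"].append(step)
    if step == "FAILED":
        status["state"] = "FAILED"
    elif status["state"] == "FAILED":
        pass  # keep FAILED even when COMPLETED is recorded afterwards
    elif step == "COMPLETED":
        status["state"] = "COMPLETED"
    else:
        status["state"] = "IN-PROGRESS"
    return updated


# Replay the failed run from the example above.
failed_job = {"jobID": "job001",
              "jobStatus": {"state": "NOT PROCESSED", "details": []}}
for step in ["INGESTED", "DATA-VALIDATED", "FAILED", "COMPLETED"]:
    failed_job = record_step(failed_job, step)
```

Returning a copy rather than mutating in place makes each intermediate document easy to inspect or log, which suits the provenance goal described earlier.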

Valid job states

A molecular processing job can have these valid states:

IN-PROGRESS: Airflow has picked up the job and the processing has started
INGESTED: The file has successfully been loaded
DATA-VALIDATED: The file has been successfully data type validated.
DATA-VALIDATION-FAILED: Data type validation failed on the file.
HUGO-VALIDATED: The genes have been successfully looked up on HUGO
HUGO-ALIASED: The genes were found to be aliased.
HUGO-FAILED: The HUGO lookup failed.
DATABASE-UPDATED: The database has been updated with the processed file
COMPLETED: The job is no longer running. COMPLETED can mean either a successful termination or a system error that halted the job.
FAILED: A system error has caused the job to abort. Note that we do not mark file errors such as an invalid schema format or an invalid HUGO lookup as FAILED.
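Because COMPLETED alone does not say how a job ended, a reader of the document has to look at the other entries in `details`. The sketch below shows one way to do that. The function name `classify_outcome` and its four return labels are hypothetical, and it assumes that a job stopped by a file error still records COMPLETED alongside DATA-VALIDATION-FAILED or HUGO-FAILED.

```python
# States that signal a problem with the input file rather than the system.
FILE_ERROR_STATES = {"DATA-VALIDATION-FAILED", "HUGO-FAILED"}


def classify_outcome(details: list) -> str:
    """Summarize a jobStatus.details list with a hypothetical outcome label."""
    if "FAILED" in details:
        return "system-error"   # a system error aborted the job
    if "COMPLETED" not in details:
        return "running"        # no terminal marker recorded yet
    if FILE_ERROR_STATES & set(details):
        return "file-error"     # finished, but the file was rejected
    return "succeeded"


outcome = classify_outcome(
    ["INGESTED", "DATA-VALIDATED", "FAILED", "COMPLETED"]
)
```

Checking FAILED first mirrors the distinction drawn above: system errors are reported as FAILED, while file-level problems are reported through their own states.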