Data intake formats - stjude/proteinpaint GitHub Wiki

Initial steps to follow for crosschecking data formats:

Reference genomes - Check the species and version of the reference genome. For the human genome, check whether the reference is hg19 or hg38. For certain datasets, a liftover might be required if coordinates are available for only one of the two reference genomes.
Sample names - Check for discrepancies in sample IDs from one dataset to another, including databases.
Headers - Crosscheck the headers in data files to make sure they match the information in the text body, for example the chromosome lengths in VCF files.
Missing fields - Check for missing or mislabeled fields from one dataset to another. To build the db, all fields must match the matrix and sample files.
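
The sample name crosscheck can be scripted. The following is a minimal sketch in Python, not part of ProteinPaint itself; the function name compare_sample_ids and the sample IDs are hypothetical and only illustrate the idea of listing the IDs that appear in one dataset but not the other.

```python
def compare_sample_ids(ids_a, ids_b):
    """Return (only_in_a, only_in_b) as sorted lists of sample IDs."""
    # Strip surrounding whitespace so "SJ001 " and "SJ001" count as the same ID.
    set_a = {s.strip() for s in ids_a}
    set_b = {s.strip() for s in ids_b}
    return sorted(set_a - set_b), sorted(set_b - set_a)


# Hypothetical sample IDs taken from a mutation file and a clinical file.
mutation_samples = ["SJ001", "SJ002", "SJ003"]
clinical_samples = ["SJ001", "SJ003 ", "SJ004"]

only_in_mutation, only_in_clinical = compare_sample_ids(mutation_samples, clinical_samples)
print(only_in_mutation)   # ['SJ002']
print(only_in_clinical)   # ['SJ004']
```

Any ID reported here should be resolved (renamed, added, or dropped) before the files are loaded.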

Intake formats for typical file types of a typical cancer genomic study

Somatic mutation

Somatic mutations can be provided either in VCF format or in a tabular format with the following header line:

sample_name chr pos ref alt origin

sample_name
chr: chromosome
pos: position, same requirement as the POS field in VCF format
ref: reference allele
alt: alternative allele. Both alleles have the same requirements as in VCF format
origin: somatic/germline

When available, additionally provide:

DNA sequencing read count per allele
RNAseq read count per allele
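
A quick check of the tabular format could look like the sketch below. It assumes the file is tab-delimited, that the six required columns come first (so optional read count columns may follow), and that pos is written as a plain non-negative integer; the function name check_mutation_table and the example rows are hypothetical.

```python
import csv
import io

EXPECTED_HEADER = ["sample_name", "chr", "pos", "ref", "alt", "origin"]
ALLOWED_ORIGINS = {"somatic", "germline"}


def check_mutation_table(handle):
    """Return a list of problems found in a tab-delimited mutation table."""
    problems = []
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    # Only the first six columns are checked; extra columns are allowed.
    if header is None or header[:6] != EXPECTED_HEADER:
        return [f"unexpected header: {header}"]
    for line_no, row in enumerate(reader, start=2):
        if len(row) < 6:
            problems.append(f"line {line_no}: expected at least 6 fields, got {len(row)}")
            continue
        pos, origin = row[2], row[5]
        # Assumption: pos is a plain integer, as in the VCF POS field.
        if not (pos.isascii() and pos.isdigit()):
            problems.append(f"line {line_no}: pos is not an integer: {pos!r}")
        if origin not in ALLOWED_ORIGINS:
            problems.append(f"line {line_no}: origin must be somatic or germline, got {origin!r}")
    return problems


# Hypothetical file content; the second data row has two problems.
example = io.StringIO(
    "sample_name\tchr\tpos\tref\talt\torigin\n"
    "SJ001\tchr1\t12345\tA\tT\tsomatic\n"
    "SJ002\tchr2\t1e5\tG\tC\tunknown\n"
)
for problem in check_mutation_table(example):
    print(problem)
```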

SV and fusion: tabular format

An SV/fusion tabular file should have a header line and the following fields:

chromosome A
1-based position A
strand A, value is + or -
gene A, optional
chromosome B
1-based position B
strand B
gene B, optional
Sample name

NOTE: If a fusion partner is absent or its information is NA, the fields_a or fields_b information can be left empty.

sample_name and origin should be filled.
event_type should be 'fusion' (both fusion and SV are currently supported as event types, but fusion is used most often).
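
The per-side requirements (chromosome, 1-based position, strand) can be checked with a small helper. This is a sketch only: the function name check_breakpoint is hypothetical, and it assumes that an absent partner is represented by leaving all three fields of that side empty.

```python
def check_breakpoint(chrom, pos, strand):
    """Return a list of problems for one side (A or B) of an SV/fusion record."""
    # Assumption: an absent fusion partner has all three fields empty.
    if not (chrom or pos or strand):
        return []
    problems = []
    if not chrom:
        problems.append("missing chromosome")
    # Positions are 1-based, so 0 is rejected.
    if not (pos.isascii() and pos.isdigit() and int(pos) >= 1):
        problems.append(f"position must be a 1-based integer, got {pos!r}")
    if strand not in {"+", "-"}:
        problems.append(f"strand must be + or -, got {strand!r}")
    return problems


# Hypothetical values for side A and an absent side B.
print(check_breakpoint("chr9", "133729451", "+"))  # []
print(check_breakpoint("", "", ""))                # []
print(check_breakpoint("chr22", "0", "."))
```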

CNV: tabular format

The CNV file should be a tabular text file with the following fields:

Sample chromosome start end value

Chromosome
Start: 1-based start position
End: 1-based stop position
Value: either a numerical value for the magnitude of change (e.g. log2 ratio or seg.mean), or a gain/loss call. All events in the file should use the same value type; numeric and categorical values cannot be mixed.
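
The rule that a file must not mix numeric and categorical values can be tested before intake. The sketch below assumes the values have already been read from the Value column as strings and that categorical calls are spelled "gain" or "loss" (case-insensitive); both function names are hypothetical.

```python
def classify_cnv_value(value):
    """Return 'numeric', 'categorical', or None for an unrecognized value."""
    try:
        float(value)
        return "numeric"
    except ValueError:
        pass
    # Assumption: categorical calls are spelled "gain" or "loss".
    if value.strip().lower() in {"gain", "loss"}:
        return "categorical"
    return None


def has_consistent_value_type(values):
    """True if every value is numeric, or every value is a gain/loss call."""
    kinds = {classify_cnv_value(v) for v in values}
    # An empty column, a mixture, or any unrecognized value returns False.
    return len(kinds) == 1 and None not in kinds


print(has_consistent_value_type(["0.58", "-1.2"]))  # True
print(has_consistent_value_type(["gain", "Loss"]))  # True
print(has_consistent_value_type(["0.58", "gain"]))  # False
```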

Bulk RNA-seq Gene expression

Both FPKM/TPM and raw counts should be a tabular gene-by-sample value matrix:

Genes should be in rows and samples in columns.
Please use either the HGNC symbol or the GENCODE gene accession as gene names

Headers (tab-delimited): geneID geneSymbol sample1 sample2 ...
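
As an illustration of the matrix layout, the sketch below reads such a file and rejects rows whose length does not match the header. The function name read_expression_matrix and the gene and sample names in the example are hypothetical, and the code assumes every value cell holds a number.

```python
import csv
import io


def read_expression_matrix(handle):
    """Read a tab-delimited gene-by-sample matrix; return (samples, values)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header[:2] != ["geneID", "geneSymbol"]:
        raise ValueError(f"unexpected leading columns: {header[:2]}")
    samples = header[2:]
    if len(set(samples)) != len(samples):
        raise ValueError("duplicate sample names in header")
    values = {}
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            raise ValueError(f"line {line_no}: expected {len(header)} fields, got {len(row)}")
        # Keyed by geneID; float() raises ValueError on a non-numeric cell.
        values[row[0]] = [float(v) for v in row[2:]]
    return samples, values


# Hypothetical two-gene, two-sample matrix.
example = io.StringIO(
    "geneID\tgeneSymbol\tsample1\tsample2\n"
    "GENE_ID_1\tSYMBOL1\t1.5\t0.0\n"
    "GENE_ID_2\tSYMBOL2\t12\t7.25\n"
)
samples, values = read_expression_matrix(example)
print(samples)              # ['sample1', 'sample2']
print(values["GENE_ID_2"])  # [12.0, 7.25]
```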

tSNE results based on bulk transcriptome or methylome should be a tabular file:

has a header row
has columns for 'sample' and 'X/Y coordinates'

Clinical data

Clinical data dictionary

Clinical data dictionary should follow the "phenotree" format: https://github.com/stjude/proteinpaint/wiki/Data-Browser#phenotree

Clinical data annotation

A tabular matrix with samples in rows and variables in columns.

Survival data should be a tabular matrix with the following columns:

Sample name
Type (OS, EFS, PFS)
Time to event in decimal years
Exit code, 0=alive, 1=dead
(Additional columns for time to event and exit code could be added when more types of survival data are available)
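
A row-level check for the survival columns could be sketched as follows. It assumes the four fields arrive as strings and that time to event cannot be negative, which the format description does not state explicitly; the function name check_survival_row and the example values are hypothetical.

```python
import math

ALLOWED_TYPES = {"OS", "EFS", "PFS"}


def check_survival_row(sample, survival_type, time_years, exit_code):
    """Return a list of problems for one survival record (all inputs are strings)."""
    problems = []
    if not sample:
        problems.append("missing sample name")
    if survival_type not in ALLOWED_TYPES:
        problems.append(f"type must be OS, EFS or PFS, got {survival_type!r}")
    try:
        years = float(time_years)
    except ValueError:
        problems.append(f"time to event is not a number: {time_years!r}")
    else:
        # Assumption: time to event is a finite, non-negative number of years.
        if not math.isfinite(years) or years < 0:
            problems.append(f"time to event out of range: {time_years!r}")
    if exit_code not in {"0", "1"}:
        problems.append(f"exit code must be 0 or 1, got {exit_code!r}")
    return problems


print(check_survival_row("SJ001", "OS", "3.25", "1"))  # []
print(check_survival_row("SJ002", "DFS", "-1", "2"))
```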

Gene list for building tSNE and UMAP

example file:

geneID geneSymbol bioType length geneInfo chromosome start end
ENSG00000000XXX XYZ protein_coding 2359.67 complement factor H chr1 19AAAXXXX 19AAAXXXX
ENSG00000000XXX ABC protein_coding 2359.67 complement factor H chr1 19AAAXXXX 19AAAXXXX

Gene length is optional, but it is better if the collaborator provides it.

Single-cell RNAseq

To be added