Data intake formats - stjude/proteinpaint GitHub Wiki

Initial steps to follow for crosschecking data formats:

  • Reference genomes - Check the reference genome species and version. For human data, check whether the reference genome is hg19 or hg38. For some datasets, a liftover may be required if coordinates are available for only one of the two builds.
  • Sample IDs - Check for discrepancies in sample IDs from one dataset to another, including databases.
  • Headers - Crosscheck the headers in data files to make sure they match the information in the file body, for example the chromosome lengths declared in VCF files.
  • Missing fields - Check for missing or mislabeled fields from one dataset to another. To build the db, all fields must match the matrix and sample files.
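The sample ID check above can be sketched as a simple set comparison. This is a minimal illustration, not part of ProteinPaint: the function name `sample_id_discrepancies` and the sample IDs are hypothetical, and trimming whitespace before comparing is an assumption about what counts as the same ID.

```python
def sample_id_discrepancies(ids_a, ids_b):
    """Return (IDs only in A, IDs only in B) after trimming whitespace."""
    set_a = {s.strip() for s in ids_a}
    set_b = {s.strip() for s in ids_b}
    return sorted(set_a - set_b), sorted(set_b - set_a)


# Hypothetical sample IDs from a mutation file and a clinical file.
only_in_mutations, only_in_clinical = sample_id_discrepancies(
    ["S1", "S2", "S3 "], ["S2", "S3", "S4"]
)
print(only_in_mutations, only_in_clinical)
```

Any ID reported on either side is worth resolving before the datasets are loaded together.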

Intake formats for typical file types of a typical cancer genomic study

Somatic SNV/indel data should be in VCF or tab-delimited tabular format, with the following fields in the header:

Headers for SNV/indel tabular format:

gene refseq chromosome start aachange class sample origin REF ALT

Headers for VCF format:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

  • chromosome: the chromosome, given either as a number ('1' to '22') or as 'chr' followed by the number (e.g. 'chr1')
  • start or pos: the position, 1-based in VCF or 0-based in the tabular format
  • reference allele: the allele found at that position in the reference genome
  • mutant allele: the mutated (alternate) allele

NOTE: VCF headers should also include the INFO and FORMAT field definitions, and any others if applicable. They should also contain the contig (or chromosome) IDs and lengths. Check with the PP team.
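As a rough sketch of the header crosscheck, the snippet below reads `##contig` lines from a VCF header and flags records whose contig is undeclared or whose POS falls outside the declared length. The function names are hypothetical, the records are made up, and the header is reduced to the lines needed for the example.

```python
import re


def contig_lengths(vcf_header_lines):
    """Map contig ID -> length from '##contig=<ID=...,length=...>' lines."""
    lengths = {}
    for line in vcf_header_lines:
        if not line.startswith("##contig=<"):
            continue
        contig_id = re.search(r"[<,]ID=([^,>]+)", line)
        length = re.search(r"[<,]length=(\d+)", line)
        if contig_id and length:
            lengths[contig_id.group(1)] = int(length.group(1))
    return lengths


def positions_out_of_range(records, lengths):
    """Return (chrom, pos) pairs with an undeclared contig or a POS outside 1..length."""
    return [
        (chrom, pos)
        for chrom, pos in records
        if chrom not in lengths or not 1 <= pos <= lengths[chrom]
    ]


def vcf_pos_to_zero_based(pos):
    """Convert a 1-based VCF POS to the 0-based start used by the tabular format."""
    return pos - 1


header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=248956422>",  # chr1 length in hg38
]
lengths = contig_lengths(header)
# Made-up records: one valid, one past the end of chr1, one on an undeclared contig.
print(positions_out_of_range([("chr1", 100), ("chr1", 248956423), ("chr2", 5)], lengths))
```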

SV and fusion data should be in tabular format with the following fields:

Headers (tab-delimited):

gene_a refseq_a chr_a position_a strand_a gene_b refseq_b chr_b position_b strand_b origin sample_name fusion_gene event_type

  • chromosome A
  • 0-based position A
  • strand A, value is + or -
  • chromosome B
  • 0-based position B
  • strand B
  • Sample name

NOTE: If a fusion partner is absent or its information is NA, the corresponding _a or _b fields can be left empty.

  • sample_name and origin must be filled.
  • event_type should be 'fusion' (both 'fusion' and 'SV' are currently supported as event types, but 'fusion' is used in most cases)
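The rules above can be checked row by row with a short script. This is only a sketch under assumptions: `check_fusion_row` is a hypothetical helper, the exact spellings 'fusion' and 'SV' are taken from the text above, and the example row uses placeholder values (including the origin value) rather than real calls.

```python
import csv
import io

FUSION_COLUMNS = [
    "gene_a", "refseq_a", "chr_a", "position_a", "strand_a",
    "gene_b", "refseq_b", "chr_b", "position_b", "strand_b",
    "origin", "sample_name", "fusion_gene", "event_type",
]


def check_fusion_row(row):
    """Return a list of problems found in one fusion/SV row (empty list means OK)."""
    problems = []
    for side in ("a", "b"):
        strand = row.get(f"strand_{side}") or ""
        if strand not in ("+", "-", ""):
            problems.append(f"strand_{side} must be + or -, got {strand!r}")
        position = row.get(f"position_{side}") or ""
        if position and not position.isdigit():
            problems.append(f"position_{side} must be a non-negative integer")
    for field in ("sample_name", "origin"):
        if not row.get(field):
            problems.append(f"{field} is empty")
    if row.get("event_type") not in ("fusion", "SV"):
        problems.append("event_type must be 'fusion' or 'SV'")
    return problems


# Placeholder values only; partner B is left empty, as the NOTE allows.
values = ["GENE_A", "NM_A", "chr1", "1000", "+", "", "", "", "", "",
          "somatic", "S1", "GENE_A", "fusion"]
example = "\t".join(FUSION_COLUMNS) + "\n" + "\t".join(values) + "\n"
for row in csv.DictReader(io.StringIO(example), delimiter="\t"):
    print(row["sample_name"], check_fusion_row(row))
```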

CNV data should be in tabular format with the following fields:

We currently support two types of CNV data:

a) Numeric type: exact number based on seg.mean

Headers: Sample ID Chromosome Start End Value

  • Chromosome: chromosome
  • Start: 0-based start position
  • End: 0-based stop position
  • Value: Magnitude of change, which can be log2(tumor/germline), seg.mean, etc. The quantification must be consistent across all CNV calls.

b) Categorical type: gain or loss

Headers: sample_name chrom loc.start loc.end log2ratio origin
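For the numeric type (a), a line can be parsed and sanity-checked as sketched below. The helper names are hypothetical, the input line is made up, and the check that End is not smaller than Start is an assumption rather than a documented rule. `log2_ratio` only illustrates the log2(tumor/germline) quantification mentioned above.

```python
import math


def log2_ratio(tumor, germline):
    """log2(tumor/germline) for two positive signal values (e.g. coverage)."""
    return math.log2(tumor / germline)


def parse_cnv_line(line):
    """Parse one tab-delimited line: Sample ID, Chromosome, Start, End, Value."""
    sample, chrom, start, end, value = line.rstrip("\n").split("\t")
    start, end, value = int(start), int(end), float(value)
    if start < 0 or end < start:
        raise ValueError(f"bad interval {chrom}:{start}-{end}")
    return {"sample": sample, "chrom": chrom, "start": start, "end": end, "value": value}


# Made-up segment for illustration.
print(parse_cnv_line("S1\tchr1\t1000\t5000\t0.58"))
print(log2_ratio(4, 2))
```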

Gene expression data, for both FPKM/TPM and raw counts, should be a tabular gene-by-sample value matrix:

  • Genes should be rows and samples should be columns.
  • Please use either HGNC symbols or GENCODE gene accessions as gene names

Headers (tab-delimited): geneID geneSymbol bioType annotationLevel sampleNames...

NOTE: geneSymbol, bioType, annotationLevel are optional
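Because three of the header columns are optional, a reader of this matrix has to separate annotation columns from sample columns. One possible way, assuming the optional columns keep exactly the names listed above and that `geneID` is always first (`split_expression_header` is a hypothetical helper):

```python
OPTIONAL_COLUMNS = ("geneSymbol", "bioType", "annotationLevel")


def split_expression_header(header_line):
    """Return (annotation columns present, sample columns) from the header line."""
    fields = header_line.rstrip("\n").split("\t")
    if fields[0] != "geneID":
        raise ValueError("first column must be geneID")
    annotation = [f for f in fields[1:] if f in OPTIONAL_COLUMNS]
    samples = [f for f in fields[1:] if f not in OPTIONAL_COLUMNS]
    return annotation, samples


# Hypothetical header with one optional column and two samples.
print(split_expression_header("geneID\tgeneSymbol\tS1\tS2"))
```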

If possible, additionally provide:

  • DNA sequencing read count per allele
  • RNAseq read count per allele (it would be great to have this for cases with RNAseq)

tSNE results based on bulk transcriptome or methylome should be a tabular file:

  • has a header row
  • has columns for 'sample' and 'X/Y coordinates'

Clinical data can be a tabular matrix with samples on rows and variables on columns, or the reverse.

  • When variables are on columns, there should be a row called DATATYPE that identifies the data type of each column. Supported values are "categorical", "integer", and "float"
  • Likewise, when variables are on rows, DATATYPE should be a column
  • For variables identified as numeric, non-numeric values will be treated as exceptions
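To illustrate how the DATATYPE row can drive validation when variables are on columns, the sketch below lists values that do not parse as their declared numeric type. It assumes the DATATYPE row carries the literal label DATATYPE in its first column and that the first column of every other row is the sample name. The helper name and the table contents are hypothetical.

```python
CASTERS = {"integer": int, "float": float}


def find_exceptions(rows):
    """Return (sample, variable, raw value) for values that fail their declared type."""
    header = rows[0]
    datatype_row = next(r for r in rows if r[0] == "DATATYPE")
    exceptions = []
    for row in rows[1:]:
        if row is datatype_row:
            continue
        for name, datatype, raw in zip(header[1:], datatype_row[1:], row[1:]):
            if datatype == "categorical":
                continue
            if datatype not in CASTERS:
                raise ValueError(f"unsupported DATATYPE {datatype!r} for {name}")
            try:
                CASTERS[datatype](raw)
            except ValueError:
                exceptions.append((row[0], name, raw))
    return exceptions


# Made-up clinical table, already split on tabs.
table = [
    ["sample", "age", "sex"],
    ["DATATYPE", "integer", "categorical"],
    ["S1", "12", "F"],
    ["S2", "unknown", "M"],
]
print(find_exceptions(table))
```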

Survival data should be a tabular matrix with the following columns:

  • Sample name
  • Type (OS, EFS, PFS)
  • Time to event in decimal years
  • Exit code, 0=alive, 1=dead
  • (Additional columns for time to event and exit code could be added when more types of survival data are available)
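A minimal check of one survival row against the columns above might look like this. `check_survival_row` is a hypothetical helper, and requiring a non-negative time is an assumption.

```python
SURVIVAL_TYPES = {"OS", "EFS", "PFS"}


def check_survival_row(sample, surv_type, time_years, exit_code):
    """Return a list of problems for one survival row (empty list means OK)."""
    problems = []
    if not sample:
        problems.append("sample name is empty")
    if surv_type not in SURVIVAL_TYPES:
        problems.append(f"unknown survival type {surv_type!r}")
    try:
        if float(time_years) < 0:
            problems.append("time to event is negative")
    except ValueError:
        problems.append("time to event is not a number")
    if str(exit_code) not in ("0", "1"):
        problems.append("exit code must be 0 or 1")
    return problems


# Made-up row: sample S1, overall survival, 2.5 years, exit code 1.
print(check_survival_row("S1", "OS", "2.5", "1"))
```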

Single-cell RNAseq TBA