Data intake formats - stjude/proteinpaint GitHub Wiki
Initial steps to follow for crosschecking data formats:
- Reference genomes - Check the reference genome species and version. For human data, check whether the reference genome is hg19 or hg38. For some datasets, a liftover may be required if coordinates are available for only one of the two reference genomes.
- Sample names - Check for discrepancies in sample IDs between datasets, including databases.
- Headers - Cross-check the headers in data files to make sure they match the information in the file body, for example, the chromosome lengths declared in VCF files.
- Missing fields - Check for missing or mislabeled fields between datasets. To build the db, all fields must match the matrix and sample files.
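The sample-name check above can be sketched with a plain set comparison. This is a minimal illustration, not part of ProteinPaint: the function name `sample_discrepancies` and the example sample IDs are hypothetical.

```python
def sample_discrepancies(samples_a, samples_b):
    """Return the sample IDs that appear in only one of two collections."""
    a, b = set(samples_a), set(samples_b)
    return {"only_in_a": sorted(a - b), "only_in_b": sorted(b - a)}


# Hypothetical sample IDs taken from two different data files.
mutation_samples = ["S1", "S2", "S3"]
clinical_samples = ["S1", "S3", "S4"]

report = sample_discrepancies(mutation_samples, clinical_samples)
print(report)
```

Any ID listed under `only_in_a` or `only_in_b` is worth resolving before the files are loaded together.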
Intake formats for typical file types of a typical cancer genomic study
Somatic mutation
Somatic mutations can be provided either in VCF format or in a tabular format with the following header line:
sample_name chr pos ref alt origin
- sample_name
- chr: chromosome
- pos: position, same requirement as the POS field in VCF format
- ref: reference allele
- alt: alternative allele. Both alleles have the same requirements as in VCF format
- origin: somatic/germline
When available, additionally provide:
- DNA sequencing read count per allele
- RNAseq read count per allele
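As a rough sketch of how the tabular mutation format could be checked before intake, the snippet below validates the header and two of the fields. It assumes the file is tab-delimited, that extra columns (such as read counts) follow the six required ones, and that origin values are compared case-insensitively; the function name `check_mutation_table` is hypothetical.

```python
import csv
import io

# Header taken from the text above; extra columns (e.g. read counts) may follow.
EXPECTED_HEADER = ["sample_name", "chr", "pos", "ref", "alt", "origin"]
ALLOWED_ORIGINS = {"somatic", "germline"}


def check_mutation_table(handle):
    """Return a list of problems found in a tab-delimited mutation table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    if header[:6] != EXPECTED_HEADER:
        return [f"unexpected header: {header}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) < 6:
            problems.append(f"line {line_no}: expected at least 6 fields, got {len(row)}")
            continue
        pos, origin = row[2], row[5]
        if not pos.isdecimal():
            problems.append(f"line {line_no}: pos {pos!r} is not an integer")
        if origin.lower() not in ALLOWED_ORIGINS:
            problems.append(f"line {line_no}: origin {origin!r} is not somatic/germline")
    return problems


# Made-up example content: the second data row has two deliberate errors.
example = (
    "sample_name\tchr\tpos\tref\talt\torigin\n"
    "S1\tchr1\t100\tA\tT\tsomatic\n"
    "S2\tchr2\tabc\tG\tC\tunknown\n"
)
for problem in check_mutation_table(io.StringIO(example)):
    print(problem)
```

The sketch does not validate the ref and alt alleles against the reference genome; that step would need the genome sequence itself.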
SV and fusion: tabular format
An SV/fusion tabular file should have a header line and the following fields:
- chromosome A
- 1-based position A
- strand A, value is + or -
- gene A, optional
- chromosome B
- 1-based position B
- strand B
- gene B, optional
- Sample name
NOTE: If a fusion partner is absent or its information is NA, the fields_a or fields_b information can be left empty.
- sample_name and origin should be filled.
- event_type should be 'fusion' (both fusion and SV are currently supported as event types, but fusion is used most often)
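The field rules above could be checked per record roughly as follows. The text does not give exact column names, so the keys used here (`chr_a`, `pos_a`, `strand_a`, `chr_b`, `pos_b`, `strand_b`, `sample_name`) are hypothetical, as is the function name `check_sv_row`. The sketch also assumes strand B takes the same + or - values as strand A, and that an absent partner is represented by leaving that whole side empty.

```python
def check_sv_row(row):
    """Return a list of problems for one SV/fusion record given as a dict."""
    problems = []
    if not row.get("sample_name"):
        problems.append("sample_name is empty")
    for side in ("a", "b"):
        chrom = row.get(f"chr_{side}", "")
        pos = row.get(f"pos_{side}", "")
        strand = row.get(f"strand_{side}", "")
        if not (chrom or pos or strand):
            continue  # partner absent: the whole side may be left empty
        if not chrom:
            problems.append(f"chromosome {side.upper()} is empty")
        if not (pos.isdecimal() and int(pos) >= 1):
            problems.append(f"position {side.upper()} must be a 1-based integer, got {pos!r}")
        if strand not in ("+", "-"):
            problems.append(f"strand {side.upper()} must be + or -, got {strand!r}")
    return problems


# Made-up record with two deliberate errors (position 0 and strand "x").
bad_row = {
    "sample_name": "S1",
    "chr_a": "chr9", "pos_a": "0", "strand_a": "+",
    "chr_b": "chr22", "pos_b": "23290413", "strand_b": "x",
}
for problem in check_sv_row(bad_row):
    print(problem)
```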
CNV: tabular format
The CNV file should be a tabular text file with the following fields:
Sample chromosome start end value
- Chromosome
- Start: 1-based start position
- End: 1-based stop position
- Value: either a numeric magnitude of change (e.g. log2 ratio or seg.mean) or a gain/loss call. All events in the file should use the same value type; numeric and categorical values cannot be mixed.
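The requirement that a CNV file must not mix numeric and categorical values can be tested with a small helper like the one below. This is only a sketch: the function names are hypothetical, and anything that Python's `float()` can parse (including strings such as "nan" or "inf") is counted as numeric.

```python
def classify_value(value):
    """Label a CNV value string as 'numeric' or 'categorical'."""
    try:
        float(value)
        return "numeric"
    except ValueError:
        return "categorical"


def cnv_value_types(values):
    """Return the set of value types seen in the value column."""
    return {classify_value(v) for v in values}


# Made-up value column that mixes a log2 ratio with gain/loss calls.
mixed_column = ["0.58", "gain", "-1.2", "loss"]
types_seen = cnv_value_types(mixed_column)
if len(types_seen) > 1:
    print("value column mixes numeric and categorical entries")
```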
Bulk RNA-seq Gene expression
Both FPKM/TPM and raw counts should be a tabular gene-by-sample value matrix:
- Genes should be rows, and samples should be columns.
- Please use either HGNC symbols or GENCODE gene accessions as gene names
Headers (tab-delimited):
geneID geneSymbol sample1 sample2 ...
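A quick structural check of the expression matrix might look like the sketch below. It assumes a tab-delimited file whose header starts with `geneID` and `geneSymbol` as shown above, followed by one column per sample; the function name `check_expression_matrix` is hypothetical.

```python
import csv
import io


def check_expression_matrix(handle):
    """Return a list of problems found in a gene-by-sample matrix."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    problems = []
    if header[:2] != ["geneID", "geneSymbol"]:
        problems.append("header must start with geneID and geneSymbol")
    samples = header[2:]
    if len(set(samples)) != len(samples):
        problems.append("duplicate sample columns in header")
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: {len(row)} fields, header has {len(header)}")
            continue
        for value in row[2:]:
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_no}: non-numeric value {value!r}")
                break  # report at most one bad value per row
    return problems


# Made-up matrix: the second gene row is missing one sample value.
matrix = (
    "geneID\tgeneSymbol\tsample1\tsample2\n"
    "ENSG_X\tXYZ\t12.5\t0\n"
    "ENSG_Y\tABC\t3.1\n"
)
for problem in check_expression_matrix(io.StringIO(matrix)):
    print(problem)
```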
tSNE results based on bulk transcriptome or methylome should be a tabular file:
- has a header row
- has columns for 'sample' and 'X/Y coordinates'
Clinical data
Clinical data dictionary
Clinical data dictionary should follow the "phenotree" format: https://github.com/stjude/proteinpaint/wiki/Data-Browser#phenotree
Clinical data annotation
A tabular matrix with samples on rows, and variables on columns.
Survival data should be a tabular matrix with the following columns:
- Sample name
- Type (OS, EFS, PFS)
- Time to event in decimal years
- Exit code, 0=alive, 1=dead
- (Additional columns for time to event and exit code could be added when more types of survival data are available)
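Follow-up time is often recorded in days, while the survival table above asks for decimal years. The sketch below shows one way to build a row. It assumes 365.25 days per year and rounding to four decimals, neither of which is specified in the text; the function name `survival_row` is hypothetical.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years
SURVIVAL_TYPES = {"OS", "EFS", "PFS"}


def survival_row(sample, kind, days_to_event, is_dead):
    """Build one survival record: sample, type, decimal years, exit code."""
    if kind not in SURVIVAL_TYPES:
        raise ValueError(f"unknown survival type: {kind!r}")
    if days_to_event < 0:
        raise ValueError("time to event cannot be negative")
    years = round(days_to_event / DAYS_PER_YEAR, 4)
    exit_code = 1 if is_dead else 0  # 0=alive, 1=dead, as listed above
    return [sample, kind, years, exit_code]


# Made-up patient followed for 730.5 days.
print(survival_row("S1", "OS", 730.5, True))
```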
Gene list for building tSNE and UMAP
example file:
geneID geneSymbol bioType length geneInfo chromosome start end
ENSG00000000XXX XYZ protein_coding 2359.67 complement factor H chr1 19AAAXXXX 19AAAXXXX
ENSG00000000XXX ABC protein_coding 2359.67 complement factor H chr1 19AAAXXXX 19AAAXXXX
Gene length is optional, but it is preferable for the collaborator to provide it.
Single-cell RNAseq
To be added