Providing sample information - uic-ric/uic-ric.github.io GitHub Wiki

Many of the analysis services provided by the RRC cores, e.g. the Research Informatics Core (RIC) and Genome Research Core (GRC), require accurate sample information from the researcher in order to determine the proper grouping of samples for various analyses. To reduce the workload for RRC analysts and the resulting cost to the researcher, we ask that you follow these guidelines when preparing your sample information.

Basic Guidelines

  1. Sample information should be provided as an Excel spreadsheet, a comma-separated values (CSV) file, or a tab-separated values file. Excel is the preferred format.
  2. All spreadsheets should have clear headers on the first row.
  3. The first column of any sample information spreadsheet should be the sample ID provided when the sample was submitted for sequencing or any other data acquisition service. If you are unsure of your sample IDs, please contact the core to which you submitted your samples, or the Research Informatics Core, for a sample list for your project.
  4. Sample IDs and group or factor names should start with an alphabetic character (A-Z, a-z) and consist of alphanumeric characters (A-Z, a-z, 0-9) and underscores (_) only. Please avoid using spaces or special characters. If more information is needed to describe the differences between conditions, you can include additional columns as notes with this information.
  5. Do NOT merge or span cells in the spreadsheet.
  6. Do NOT use colors or special formatting, e.g. italics or bold characters, to indicate experimental groups. For most analyses, we will be converting the spreadsheet to a plain text format that will be used by our analysis tools. In that case, all formatting will be lost during conversion.
  7. If you are providing samples from multiple technologies (e.g., RNA-seq and ATAC-seq, or whole transcript RNA-seq and 3' RNA-seq), please indicate these differences in additional columns in the spreadsheet.
  8. Use consistent values, including letter case, when designating a level/group for a factor.
    • For example, if providing gender information, use either M/F or Male/Female throughout, as in the following example:
      SampleID Gender
      GC_001 M
      GC_002 F
      GC_003 M
    • Please do NOT mix values, such as this example:
      SampleID Gender
      GC_001 M
      GC_002 F
      GC_003 Male
  9. If providing numerical values, do NOT include non-numerical characters. For example, provide 100 and NOT ~100 or 100 mg.
    • Units, if applicable, should be indicated in column headers.
    • If you need to indicate number ranges, code your ranges, e.g., high, medium, low, and include a second tab in your Excel file with data definitions, e.g., high = 10-20 mg.
  10. Do NOT combine factors. If a combined comparison is needed, it is easier to combine separate factors than split existing factors.
    • For example, provide information like this:
      SampleID Gender Treatment
      GC_001 Male Control
      GC_002 Female Control
      GC_003 Male Drug
      GC_004 Female Drug
    • Instead of this:
      SampleID Group
      GC_001 Male-Control
      GC_002 Female-Control
      GC_003 Male-Drug
      GC_004 Female-Drug
  11. If subsets of the data should be analyzed separately, please provide a factor indicating the subset, if possible, rather than separate spreadsheets. For example:
    SampleID Set Treatment
    GC_001 1 Control
    GC_002 1 Drug
    GC_003 2 Control
    GC_004 2 Drug
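
Several of the guidelines above (rules 3, 4, 8, and 9) can be checked automatically before a sheet is submitted. The following is a minimal Python sketch under stated assumptions, not a tool provided by the cores. The function name `check_sample_sheet` and its parameters `factor_columns` and `numeric_columns` are hypothetical, and the sheet is assumed to have been exported as CSV or tab-separated text with one row per line. Note that it only detects values that differ in letter case; a mix such as M and Male still needs a visual check.

```python
import csv
import io
import re
from collections import defaultdict

# Rule 4: start with a letter, then letters, digits, or underscores only.
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def check_sample_sheet(text, factor_columns=(), numeric_columns=(), delimiter=","):
    """Return a list of problems found in a sample sheet given as text.

    factor_columns  -- hypothetical parameter: columns holding group/factor names
    numeric_columns -- hypothetical parameter: columns expected to hold plain numbers
    Columns not named in either list (e.g. free-text notes) are not checked.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    headers = reader.fieldnames or []
    if not headers:
        return ["no header row found"]
    id_column = headers[0]  # rule 3: the first column holds the sample ID
    problems = []
    # column -> lowercased value -> set of spellings actually used
    spellings = defaultdict(lambda: defaultdict(set))

    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        sample_id = (row.get(id_column) or "").strip()
        if not NAME_PATTERN.match(sample_id):
            problems.append(f"line {line_no}: bad sample ID {sample_id!r}")
        for column in factor_columns:
            value = (row.get(column) or "").strip()
            if not NAME_PATTERN.match(value):
                problems.append(f"line {line_no}: bad {column} value {value!r}")
            spellings[column][value.lower()].add(value)
        for column in numeric_columns:
            value = (row.get(column) or "").strip()
            try:
                float(value)  # rejects "~100" and "100 mg" (rule 9)
            except ValueError:
                problems.append(
                    f"line {line_no}: {column} value {value!r} is not a plain number"
                )

    # Rule 8: the same level written with different letter case.
    for column, groups in spellings.items():
        for variants in groups.values():
            if len(variants) > 1:
                problems.append(f"{column}: mixed letter case {sorted(variants)}")
    return problems


if __name__ == "__main__":
    sheet = "SampleID,Gender,Dose_mg\nGC_001,Male,100\nGC_002,male,~100\n"
    for problem in check_sample_sheet(sheet, ["Gender"], ["Dose_mg"]):
        print(problem)
```

An empty list means none of these particular checks failed; it does not guarantee that the sheet meets every guideline.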

Providing Genomic Information

Please use the following guidelines when providing genomic coordinates.

  1. Include separate columns for chromosome, start, and end positions.
  2. Include strand, if appropriate.
  3. Include important identifiers, such as gene or locus name, in additional columns.
  4. If you have a nucleotide sequence, include it as well in a separate column.
  5. Please clearly indicate the genome build or accession number for the coordinates, e.g. mm10, hg19 (genome builds), or NC_000913.3 (NCBI accession number).
    Chromosome (mm10) Start End Strand Gene
    chr1 1000101 1000150 + ABCD
    chr2 2001010 2002001 + EGFH
    chr3 5010430 5010600 - IJKL
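
A coordinate table laid out like the example above can be sanity-checked in a similar way. The sketch below is illustrative only. The function name `check_coordinates` is hypothetical, and it assumes a tab-separated file with columns named Start and End, a header beginning with "Chromosome" (optionally followed by the genome build), and an optional Strand column containing + or -. The requirement that Start not exceed End is also an assumption, not something stated in the guidelines.

```python
import csv
import io


def check_coordinates(text, delimiter="\t"):
    """Return a list of problems found in a genomic coordinate table."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    headers = reader.fieldnames or []
    problems = []

    # Guideline 1: separate columns for chromosome, start, and end.
    for required in ("Start", "End"):
        if required not in headers:
            problems.append(f"missing column {required!r}")
    # The chromosome header may carry the build, e.g. "Chromosome (mm10)".
    if not any(h.lower().startswith("chromosome") for h in headers):
        problems.append("missing chromosome column")
    if problems:
        return problems

    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        try:
            start, end = int(row["Start"]), int(row["End"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: Start and End must be whole numbers")
            continue
        if start > end:  # assumption: Start should not exceed End
            problems.append(f"line {line_no}: Start {start} is greater than End {end}")
        strand = row.get("Strand")  # None when there is no Strand column
        if strand is not None and strand not in ("+", "-"):
            problems.append(f"line {line_no}: unexpected strand {strand!r}")
    return problems


if __name__ == "__main__":
    table = (
        "Chromosome (mm10)\tStart\tEnd\tStrand\tGene\n"
        "chr1\t1000101\t1000150\t+\tABCD\n"
        "chr2\t2002001\t2001010\t+\tEGFH\n"
    )
    for problem in check_coordinates(table):
        print(problem)
```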