What do we have in the data? - gy315-K/REAL_FORKED_abT-Tact-cells-Team2 GitHub Wiki

Datasets

mmc1: Quality control (QC)

Sorted Populations Spreadsheet

Column names & explanation:

  • ImmGenLab -> name or ID of the lab/group from ImmGen that generated the data

  • SortingMarkers -> surface markers used to isolate the specific cell population via FACS or similar sorting technique

  • InputCellNumber -> Number of cells used as input for the ATAC-seq protocol

  • PF.reads (pass filter reads) -> number of sequencing reads that passed Illumina's quality filter (e.g. after removing adapters and low-quality reads)

  • %chrM.mapped -> percentage of reads that are mapped to mitochondrial DNA (chrM); high values can indicate low sample quality or cell stress

  • Paired.read.after.removing.PCR.duplication -> number of paired-end reads remaining after removing PCR duplicates (which can artificially inflate read counts)

  • %fragment.1Kb_TSS -> percentage of reads (or fragments) located within +/- 1 kb of a transcription start site (TSS); a metric for open chromatin near genes

  • Replicate.cor -> correlation coefficient between this sample and its biological replicate(s), used to assess reproducibility and overall sample quality
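The QC columns above can be combined into a simple per-sample check. The following is a minimal sketch only: the cutoff values are hypothetical and chosen purely for illustration, the rows are made-up toy data, and the sheet is assumed to be exported as CSV with a `SampleName` column.

```python
import csv
import io

# Toy rows with made-up values; only the column names come from the sheet.
SHEET = """SampleName,%chrM.mapped,%fragment.1Kb_TSS,Replicate.cor
sample_A,2.1,24.0,0.97
sample_B,35.0,6.5,0.71
"""

# Hypothetical cutoffs, chosen only for illustration (not ImmGen's criteria).
MAX_CHRM_PCT = 10.0
MIN_TSS_PCT = 10.0
MIN_REPLICATE_COR = 0.9


def qc_flags(row):
    """Return the list of QC checks that a sample fails."""
    flags = []
    if float(row["%chrM.mapped"]) > MAX_CHRM_PCT:
        flags.append("high_chrM")
    if float(row["%fragment.1Kb_TSS"]) < MIN_TSS_PCT:
        flags.append("low_TSS_enrichment")
    if float(row["Replicate.cor"]) < MIN_REPLICATE_COR:
        flags.append("low_replicate_cor")
    return flags


def flag_samples(text):
    """Map each sample name to the checks it fails (empty list = passes)."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["SampleName"]: qc_flags(row) for row in reader}


report = flag_samples(SHEET)
for name, flags in report.items():
    print(name, "OK" if not flags else ", ".join(flags))
```

Sensible thresholds depend on the cell type and protocol, so the constants at the top should be tuned against the actual distribution of values in the spreadsheet.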

Read Statistics Spreadsheet

  • ImmGen.lab.contributed: Which ImmGen laboratory contributed this sample’s data. (Often a string or code naming the lab/group that provided the library.)

  • sample.name: The library's unique identifier; it must match the SampleName used in the SortedPopulations sheet so that the two sheets can be joined or filtered by sample.

  • population.name: The cell‐population label (e.g. preT.DN1.Th, T.4.Sp.aCD3+CD40.18hr) indicating exactly which FACS‐sorted subpopulation that library derives from.

  • total.reads: The raw number of sequencing reads generated for this library (before any filtering).

  • overal_ alignment_rate%: The percentage of total reads that successfully aligned to the reference genome.

  • mapped_MAPQ5: Number of reads that mapped with mapping quality (MAPQ) ≥ 5, i.e. after discarding the most ambiguous alignments.

  • reads.after.removing.duplication: Number of reads remaining once PCR‐duplicates have been removed (so each fragment is only counted once).

  • properly_paired.reads: Count of reads where both mates aligned in the correct orientation and insert‐size (a further filter on mapped reads).

  • paired.count: Total number of paired‐end fragments (i.e. read-pairs) detected, regardless of whether they pass the “properly paired” flag.

  • total.Htseq-count.on.genes: Reads assigned to annotated gene features by HTSeq-count. This tells you how many of your mapped fragments overlap known genes (often used more for RNA-seq QC but sometimes tracked for ATAC if you quantify gene‐body accessibility).
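Because `sample.name` is meant to match `SampleName`, the two sheets can be joined and the read counts turned into retention fractions. This is a sketch under assumptions: both sheets are taken to be CSV exports, the numbers are invented, and each count is expressed as a fraction of `total.reads` so that no particular order of pipeline steps is assumed.

```python
import csv
import io

# Made-up numbers for illustration; column names follow the two sheets.
READ_STATS = """sample.name,total.reads,mapped_MAPQ5,reads.after.removing.duplication
sample_A,1000000,800000,600000
sample_B,2000000,1500000,750000
"""
SORTED_POPULATIONS = """SampleName,InputCellNumber
sample_A,10000
sample_B,5000
"""


def load(text, key):
    """Index the rows of a CSV text by the given key column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def retention(stats_text, populations_text):
    """Join both sheets on the sample name and compute read fractions."""
    stats = load(stats_text, "sample.name")
    pops = load(populations_text, "SampleName")
    out = {}
    for name in stats.keys() & pops.keys():  # inner join on sample name
        row = stats[name]
        total = int(row["total.reads"])
        out[name] = {
            "input_cells": int(pops[name]["InputCellNumber"]),
            "frac_mapped_mapq5": int(row["mapped_MAPQ5"]) / total,
            "frac_after_dedup": int(row["reads.after.removing.duplication"]) / total,
        }
    return out


summary = retention(READ_STATS, SORTED_POPULATIONS)
for name in sorted(summary):
    print(name, summary[name])
```

Samples present in only one sheet are silently dropped by the inner join; comparing the two key sets first is a cheap way to catch naming mismatches.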

refFlat: Annotations

The columns in this file are: gene name, transcript name, chromosome, strand, transcript start, transcript end, coding region start, coding region end, exon count, exon starts, exon ends

  • column names & explanation

    • gene name (Wdsub1, Rbm18, etc.) -> symbol of the gene that the transcript belongs to
    • transcript ID (NM_, etc.) -> RefSeq transcript ID for the gene; identifies the exact transcript isoform
    • chromosome (chr1, chr2) -> the chromosome on which the gene is located
    • strand (+ or -) -> direction of the gene: (+) is the forward strand, (-) is the reverse strand
    • transcript start -> the genomic coordinate where the transcript starts
    • transcript end -> the genomic coordinate where the transcript ends
    • coding region start / coding region end -> the genomic coordinates that bound the coding sequence (CDS) of the transcript
    • exon count -> number of exons in the transcript
    • exon starts / exon ends (long comma-separated lists) -> the start and end coordinate of each exon, one entry per exon
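A refFlat record can be parsed into named fields, and the TSS derived from the strand. The sketch below assumes a tab-separated file in the standard UCSC refFlat column order (the order listed at the top of this section); the example record is invented, and coordinate-convention details (0-based vs 1-based) are deliberately left aside.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    gene: str
    transcript_id: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_starts: tuple
    exon_ends: tuple


def _coords(field):
    """Parse a comma-separated coordinate list, tolerating a trailing comma."""
    return tuple(int(x) for x in field.split(",") if x)


def parse_refflat_line(line):
    """Turn one tab-separated refFlat line into a Transcript."""
    f = line.rstrip("\n").split("\t")
    exon_count = int(f[8])
    tx = Transcript(f[0], f[1], f[2], f[3], int(f[4]), int(f[5]),
                    int(f[6]), int(f[7]), _coords(f[9]), _coords(f[10]))
    if len(tx.exon_starts) != exon_count or len(tx.exon_ends) != exon_count:
        raise ValueError("exon count does not match exon lists")
    return tx


def tss(tx):
    """TSS is the transcript start on '+' and the transcript end on '-'."""
    return tx.tx_start if tx.strand == "+" else tx.tx_end


# Invented record, not a real annotation.
EXAMPLE = "GeneX\tNM_000000\tchr1\t-\t1000\t5000\t1200\t4800\t2\t1000,3000,\t2000,5000,"
tx = parse_refflat_line(EXAMPLE)
print(tx.gene, tx.chrom, tss(tx))
```

The strand check matters because the start coordinate is always the smaller number, so for genes on the reverse strand the TSS sits at the transcript end.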

ImmGenATAC: Processed ATAC-seq data

  • column names & explanation

    • ImmGenATAC1219.peakID -> unique ID for each ATAC-seq peak (e.g., peak_1, peak_2...), probably based on genomic location or order
    • chrom -> chromosome where the peak is located
    • summit -> the genomic position with the highest read density within the peak - i. e. the center of the signal
    • mm10.60way.phastCons_scores -> conservation score across 60 species; higher = more evolutionarily conserved
    • -log10_bestPvalue -> significance score of the peak; the higher, the more significant (based on P-value from peak caller MACS2)
    • included.in.systematica.analysis -> binary (0/1): indicates if this peak was included in the downstream comparative or clustering analysis
    • TSS -> likely the number of TSS overlapping with or near the peak
    • genes.within.100Kb -> lists the gene(s) located within 100 kb of the peak
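The significance and inclusion columns can be used together to select peaks. This is a minimal sketch with made-up rows: it assumes a CSV export, assumes the inclusion flag is stored as the string `1`, and uses a hypothetical p-value cutoff.

```python
import csv
import io

# Made-up rows; column names follow the list above.
PEAKS = """ImmGenATAC1219.peakID,chrom,summit,-log10_bestPvalue,included.in.systematica.analysis
peak_1,chr1,1000,0.56,0
peak_2,chr1,2000,12.4,1
peak_3,chr2,3000,3.0,1
"""


def best_pvalue(neg_log10_p):
    """Invert the -log10 transform: p = 10 ** (-x)."""
    return 10.0 ** (-neg_log10_p)


def select_peaks(text, max_p=1e-5):
    """Keep peaks flagged for analysis whose best p-value is at most max_p.

    max_p is a hypothetical cutoff; the flag is assumed to be "0" or "1".
    """
    kept = []
    for row in csv.DictReader(io.StringIO(text)):
        included = row["included.in.systematica.analysis"] == "1"
        p = best_pvalue(float(row["-log10_bestPvalue"]))
        if included and p <= max_p:
            kept.append(row["ImmGenATAC1219.peakID"])
    return kept


selected = select_peaks(PEAKS)
print(selected)
```

Filtering directly on the `-log10_bestPvalue` column (for example, keeping values of at least 5) is equivalent and avoids the conversion; the explicit inverse is shown only to make the scale clear.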

mmc2: RNA-seq data

  • column names & explanation

    • row names (0610005C13Rik, P14Rik, etc.) -> gene names (usually gene symbols or Ensembl/RefSeq IDs). Each row = one gene
    • column names (LTHSC.34-.BM, proB.Fra.BM, etc.) -> each column is a cell population from ImmGen. These are sorted immune cell types
    • values/cells (numbers like 1.096, 204.3, etc.) -> RNA-seq expression values of the gene (row) in the cell population (column)
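The gene-by-population layout described above can be loaded into a nested dictionary for quick lookups. The sketch assumes a CSV export whose first column holds the gene names; the header label `gene`, the row `GeneX`, and the pairing of values with populations are invented for illustration.

```python
import csv
import io
import math

# Toy matrix: population names and the first gene name appear on this page,
# but the values are placed arbitrarily and "GeneX" is invented.
MATRIX = """gene,LTHSC.34-.BM,proB.Fra.BM
0610005C13Rik,1.096,204.3
GeneX,50.0,7.0
"""


def load_matrix(text):
    """Return {gene: {population: expression value}}."""
    reader = csv.reader(io.StringIO(text))
    populations = next(reader)[1:]  # skip the gene-name column header
    return {row[0]: dict(zip(populations, map(float, row[1:])))
            for row in reader}


def log2p1(value):
    """log2(x + 1), a common transform for comparing expression values."""
    return math.log2(value + 1.0)


def top_population(expr, gene):
    """Name of the population with the highest expression of the gene."""
    return max(expr[gene], key=expr[gene].get)


expr = load_matrix(MATRIX)
for gene in expr:
    print(gene, top_population(expr, gene))
```

Whether a log transform is appropriate depends on how the values in mmc2 were normalized, which this page does not state.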