What do we have in the data? - gy315-K/REAL_FORKED_abT-Tact-cells-Team2 GitHub Wiki
Datasets
mmc1: Quality control (QC)
Sorted Populations Spreadsheet
Column names & explanation:
- ImmGenLab -> name or ID of the ImmGen lab/group that generated the data
- SortingMarkers -> surface markers used to isolate the specific cell population by FACS or a similar sorting technique
- InputCellNumber -> number of cells used as input for the ATAC-seq protocol
- PF.reads (pass-filter reads) -> number of sequencing reads that passed Illumina's quality filter (i.e. after low-quality reads were discarded)
- %chrM.mapped -> percentage of reads mapped to mitochondrial DNA (chrM); high values can indicate low sample quality or cell stress
- Paired.read.after.removing.PCR.duplication -> number of paired-end reads remaining after removing PCR duplicates (which can artificially inflate read counts)
- %fragment.1Kb_TSS -> percentage of reads (or fragments) located within +/- 1 kb of a transcription start site (TSS); a metric for open chromatin near genes
- Replicate.cor -> correlation coefficient between this sample and its biological replicate(s), used to assess reproducibility and overall sample quality
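As a minimal sketch of how these QC columns could be read together, the snippet below flags samples that fail a few simple checks. The cut-off values, the sample names, and the example numbers are assumptions made for illustration; they are not taken from the ImmGen data.

```python
import csv
import io

# Hypothetical excerpt of the Sorted Populations sheet (values are made up).
EXAMPLE_CSV = """SampleName,%chrM.mapped,%fragment.1Kb_TSS,Replicate.cor
sample_A,3.2,21.5,0.97
sample_B,41.0,6.1,0.82
"""

# Assumed cut-offs for illustration only; tune them to your own data.
MAX_CHRM_PCT = 20.0
MIN_TSS_PCT = 10.0
MIN_REPLICATE_COR = 0.9


def qc_failures(row):
    """Return the list of QC checks that a sample row fails."""
    failures = []
    if float(row["%chrM.mapped"]) > MAX_CHRM_PCT:
        failures.append("high chrM")
    if float(row["%fragment.1Kb_TSS"]) < MIN_TSS_PCT:
        failures.append("low TSS enrichment")
    if float(row["Replicate.cor"]) < MIN_REPLICATE_COR:
        failures.append("low replicate correlation")
    return failures


def flag_samples(csv_text):
    """Map each sample name to its failed checks (empty list = passed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["SampleName"]: qc_failures(row) for row in reader}


if __name__ == "__main__":
    for name, failures in flag_samples(EXAMPLE_CSV).items():
        print(name, "OK" if not failures else ", ".join(failures))
```

In practice the CSV text would come from an export of the spreadsheet rather than from an inline string.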
Read Statistics Spreadsheet
- ImmGen.lab.contributed: the ImmGen laboratory that contributed this sample's data (usually a string or code naming the lab/group that provided the library).
- sample.name: the library's unique identifier; it must match the SampleName used in the SortedPopulations sheet so that the two sheets can be joined or filtered by sample.
- population.name: the cell-population label (e.g. preT.DN1.Th, T.4.Sp.aCD3+CD40.18hr) indicating which FACS-sorted subpopulation the library derives from.
- total.reads (TotalReads): the raw number of sequencing reads generated for this library (before any filtering).
- overal_ alignment_rate% (%Mapped): the percentage of those reads that aligned to the reference genome.
- mapped_MAPQ5: number of reads that mapped with mapping quality (MAPQ) ≥ 5, i.e. alignments that pass a minimum-confidence filter.
- reads.after.removing.duplication: number of reads remaining once PCR duplicates have been removed (so each fragment is counted only once).
- properly_paired.reads: number of reads where both mates aligned in the correct orientation and within the expected insert size (a further filter on mapped reads).
- paired.count: total number of paired-end fragments (i.e. read pairs) detected, regardless of whether they pass the "properly paired" flag.
- total.Htseq-count.on.genes: reads assigned to annotated gene features by HTSeq-count. This tells you how many of the mapped fragments overlap known genes (more common as an RNA-seq QC metric, but sometimes tracked for ATAC-seq when gene-body accessibility is quantified).
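Because sample.name is meant to match SampleName in the Sorted Populations sheet, the two sheets can be combined per sample. The sketch below shows one way to do that with plain dictionaries; the rows, sample names, and numbers are hypothetical, and `fraction_retained` is an illustrative helper rather than a metric defined in the spreadsheets.

```python
# Hypothetical rows; sample names and numbers are made up for illustration.
SORTED_POPULATIONS = [
    {"SampleName": "sample_A", "SortingMarkers": "CD4+ CD8-"},
    {"SampleName": "sample_B", "SortingMarkers": "CD4- CD8+"},
]
READ_STATISTICS = [
    {
        "sample.name": "sample_A",
        "total.reads": 1_000_000,
        "reads.after.removing.duplication": 600_000,
    },
]


def join_by_sample(sorted_rows, read_stat_rows):
    """Inner-join the two sheets on SampleName == sample.name."""
    stats_by_name = {row["sample.name"]: row for row in read_stat_rows}
    joined = []
    for row in sorted_rows:
        stats = stats_by_name.get(row["SampleName"])
        if stats is not None:  # keep only samples present in both sheets
            joined.append({**row, **stats})
    return joined


def fraction_retained(row):
    """Share of raw reads that remain after duplicate removal."""
    return row["reads.after.removing.duplication"] / row["total.reads"]


if __name__ == "__main__":
    for row in join_by_sample(SORTED_POPULATIONS, READ_STATISTICS):
        print(row["SampleName"], row["SortingMarkers"], fraction_retained(row))
```

Samples that appear in only one sheet are dropped here; listing them instead can be a useful check that the identifiers really match.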
refFlat: Annotations
The columns in this file are: gene name, transcript name, chromosome, strand, transcript start, transcript end, coding region start, coding region end, exon count, exon starts, exon ends
Column names & explanation:
- gene name (Wdsub1, Rbm18, etc.) -> gene symbol; the gene that peaks/reads are later associated with
- transcript_ID (NM_, etc.) -> RefSeq transcript ID for the gene; identifies the exact transcript isoform
- chromosome (chr1, chr2) -> the chromosome on which the gene is located
- strand (+ or -) -> direction of the gene: (+) is the forward strand, (-) is the reverse strand
- start position -> the genomic coordinate where the transcript starts (this is the TSS for genes on the + strand)
- end position -> the genomic coordinate where the transcript ends (this is the TSS for genes on the - strand)
- coding region start and end -> the genomic coordinates where the coding sequence of the transcript starts and ends
- exon count, exon starts, exon ends (big lists) -> the number of exons, followed by comma-separated lists with the start and end coordinate of every exon
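To make the column layout concrete, here is a small sketch that parses one refFlat line and derives the TSS from the strand. It assumes the standard tab-separated, 11-column refFlat layout; the example record (gene name, transcript ID, coordinates) is hypothetical.

```python
from typing import List, NamedTuple


class RefFlatRecord(NamedTuple):
    gene_name: str
    transcript_id: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_starts: List[int]
    exon_ends: List[int]


def parse_refflat_line(line):
    """Parse one tab-separated refFlat line (assumes 11 columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 columns, got {len(fields)}")
    exon_count = int(fields[8])
    # Exon lists are comma-separated and usually end with a trailing comma.
    exon_starts = [int(x) for x in fields[9].split(",") if x]
    exon_ends = [int(x) for x in fields[10].split(",") if x]
    if not (len(exon_starts) == len(exon_ends) == exon_count):
        raise ValueError("exon count does not match the exon lists")
    return RefFlatRecord(
        fields[0], fields[1], fields[2], fields[3],
        int(fields[4]), int(fields[5]), int(fields[6]), int(fields[7]),
        exon_starts, exon_ends,
    )


def tss(record):
    """Transcript start on the + strand, transcript end on the - strand."""
    return record.tx_start if record.strand == "+" else record.tx_end


# Hypothetical record; the coordinates are made up for illustration.
EXAMPLE_LINE = "GeneX\tNM_000000\tchr1\t-\t1000\t5000\t1200\t4800\t2\t1000,4000,\t2000,5000,"

if __name__ == "__main__":
    record = parse_refflat_line(EXAMPLE_LINE)
    print(record.gene_name, record.chrom, tss(record))
```

The strand-aware `tss` helper matters because the start coordinate is always the smaller one, so for reverse-strand genes the TSS sits at the end position.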
ImmGenATAC: Processed ATAC-seq data
Column names & explanation:
- ImmGenATAC1219.peakID -> unique ID for each ATAC-seq peak (e.g., peak_1, peak_2, ...), probably based on genomic location or order
- chrom -> chromosome where the peak is located
- summit -> the genomic position with the highest read density within the peak, i.e. the center of the signal
- mm10.60way.phastCons_scores -> conservation score across 60 species; higher = more evolutionarily conserved
- -log10_bestPvalue -> significance score of the peak; the higher, the more significant (based on the P-value from the peak caller MACS2)
- included.in.systematic.analysis -> binary (0/1): indicates whether this peak was included in the downstream comparative or clustering analysis
- TSS -> likely the number of TSSs overlapping with or near the peak
- genes.within.100Kb -> lists the gene(s) located within 100 kb of the peak
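The sketch below illustrates two typical uses of these columns: keeping peaks above a significance threshold and turning the peak-to-gene column into a gene-to-peak lookup. The peak rows are hypothetical, the threshold is an arbitrary choice, and the code assumes that genes.within.100Kb holds a comma-separated list, which should be checked against the actual file.

```python
from collections import defaultdict

# Hypothetical peak rows; IDs, scores and gene names are made up.
EXAMPLE_PEAKS = [
    {"ImmGenATAC1219.peakID": "peak_1", "-log10_bestPvalue": 0.8,
     "genes.within.100Kb": "GeneA"},
    {"ImmGenATAC1219.peakID": "peak_2", "-log10_bestPvalue": 12.4,
     "genes.within.100Kb": "GeneA,GeneB"},
    {"ImmGenATAC1219.peakID": "peak_3", "-log10_bestPvalue": 5.0,
     "genes.within.100Kb": ""},
]


def significant_peaks(peaks, min_neg_log10_p=2.0):
    """Keep peaks with -log10(p) >= threshold (2.0 corresponds to p <= 0.01)."""
    return [p for p in peaks if p["-log10_bestPvalue"] >= min_neg_log10_p]


def peaks_by_gene(peaks):
    """Invert the peak -> genes column into a gene -> peak IDs lookup."""
    index = defaultdict(list)
    for peak in peaks:
        # Assumption: genes are separated by commas; empty cells yield nothing.
        for gene in peak["genes.within.100Kb"].split(","):
            gene = gene.strip()
            if gene:
                index[gene].append(peak["ImmGenATAC1219.peakID"])
    return dict(index)


if __name__ == "__main__":
    kept = significant_peaks(EXAMPLE_PEAKS)
    print([p["ImmGenATAC1219.peakID"] for p in kept])
    print(peaks_by_gene(kept))
```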
mmc2: RNA-seq data
Column names & explanation:
- row names (0610005C13Rik, P14Rik, etc.) -> gene names (usually gene symbols or Ensembl/RefSeq IDs). Each row = one gene
- column names (LTHSC.34-.BM, proB.Fra.BM, etc.) -> each column is a cell population from ImmGen. These are sorted immune cell types
- values/cells (numbers like 1.096, 204.3, etc.) -> RNA-seq expression values of the gene (row) in the cell population (column)
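A gene-by-population matrix like this is often log-transformed before populations are compared. The following sketch shows a log2(x + 1) transform and a lookup of the populations in which a gene is most highly expressed. The matrix is a tiny hypothetical stand-in (the second gene and its values are made up), and the pseudocount of 1 is a common convention, not something prescribed by the dataset.

```python
import math

# Tiny hypothetical stand-in for the expression matrix: gene -> population -> value.
EXPRESSION = {
    "0610005C13Rik": {"LTHSC.34-.BM": 1.096, "proB.Fra.BM": 204.3},
    "GeneX": {"LTHSC.34-.BM": 3.0, "proB.Fra.BM": 1.0},  # made-up gene
}


def log2_transform(matrix, pseudocount=1.0):
    """Return log2(value + pseudocount) for every gene/population cell."""
    return {
        gene: {pop: math.log2(value + pseudocount) for pop, value in row.items()}
        for gene, row in matrix.items()
    }


def top_populations(matrix, gene, n=1):
    """Names of the n populations with the highest value for one gene."""
    row = matrix[gene]
    return sorted(row, key=row.get, reverse=True)[:n]


if __name__ == "__main__":
    logged = log2_transform(EXPRESSION)
    print(logged["GeneX"])
    print(top_populations(EXPRESSION, "0610005C13Rik"))
```

The pseudocount keeps zero values finite after the transform, at the cost of compressing differences between very low values.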