Differential expression analysis - dogayalova/Genome-Analysis-1MB462-67615-VT2025 GitHub Wiki
HTSeq ANALYSIS
In the paper, RNA-seq libraries were prepared using the ScriptSeq Complete Kit (Bacteria). This kit employs a strand-specific protocol that retains the directionality of the original RNA transcripts. Thus, to accurately count reads with HTSeq while considering strand specificity, I tried setting the --stranded parameter to reverse. Also used flags in the htseq-count command to specify other parameters. Moreover, I converted pacbio_assembly.gff to .gtf format with AGAT. Also, I had problems with disc space, so I had to direct the out and err files to the OUTDIR. This approach resulted in low quality (low no-feature count) and I had to fix this code.
Next, I fixed the code and got it to work properly. I used the original .gff file from the prokka annotation and set the strand to yes. You can find the code and the resulting result below:
#!/bin/bash
#SBATCH -A uppmax2025-3-3
#SBATCH -M snowy
#SBATCH -p core
#SBATCH -n 4
#SBATCH -t 05:00:00
#SBATCH -J htseq_count_serum
#SBATCH -o htseq_count_serum.out
#SBATCH -e htseq_count_serum.err
#SBATCH [email protected]
#SBATCH --mail-type=ALL
# Load modules
module load bioinfo-tools
module load htseq
# Define paths
OUTDIR=/proj/uppmax2025-3-3/nobackup/work/doga
GFF=/proj/uppmax2025-3-3/nobackup/work/doga/pacbio_assembly_no_fasta.gff
BAMDIR=/proj/uppmax2025-3-3/nobackup/work/doga
# Serum samples
htseq-count -f bam -r pos -s yes -t CDS -i ID $BAMDIR/ERR1797969_mapped.sorted.bam $GFF > $OUTDIR/ERR1797969_counts_yes.txt
htseq-count -f bam -r pos -s yes -t CDS -i ID $BAMDIR/ERR1797970_mapped.sorted.bam $GFF > $OUTDIR/ERR1797970_counts_yes.txt
htseq-count -f bam -r pos -s yes -t CDS -i ID $BAMDIR/ERR1797971_mapped.sorted.bam $GFF > $OUTDIR/ERR1797971_counts_yes.txt
- I did the same for BH samples as well.
RESULTS
ERR1797969_counts_yes.txt
__no_feature 4198810
__ambiguous 554950
__too_low_aQual 237392
__not_aligned 234207
__alignment_not_unique 0
ERR1797970_counts_yes.txt
__no_feature 4526935
__ambiguous 658750
__too_low_aQual 280865
__not_aligned 247849
__alignment_not_unique 0
ERR1797971_counts_yes.txt
__ambiguous 588175
__too_low_aQual 262464
__not_aligned 246122
__alignment_not_unique 0
ERR1797972_counts_yes.txt
__no_feature 993164
__ambiguous 1025284
__too_low_aQual 159269
__not_aligned 197729
__alignment_not_unique 0
ERR1797973_counts_yes.txt
EJCDLKNP_03182 120
__no_feature 974518
__ambiguous 1124178
__too_low_aQual 149475
__not_aligned 187421
__alignment_not_unique 0
ERR1797974_counts_yes.txt
__no_feature 938370
__ambiguous 961970
__too_low_aQual 139659
__not_aligned 172093
__alignment_not_unique 0
DESeq2 ANALYSIS
Next, I did DESeq2 to the HTSeq count results. I used an R script for this, its as follows:
# Load DESeq2
library("DESeq2")
# Set working directory to where the count files are
setwd("/proj/uppmax2025-3-3/nobackup/work/doga")
# List of your HTSeq count files
count_files <- c(
"ERR1797969_counts_yes.txt",
"ERR1797970_counts_yes.txt",
"ERR1797971_counts_yes.txt",
"ERR1797972_counts_yes.txt",
"ERR1797973_counts_yes.txt",
"ERR1797974_counts_yes.txt"
)
# Sample condition metadata
sample_info <- data.frame(
sampleName = count_files,
fileName = count_files,
condition = c("Serum", "Serum", "Serum", "BH", "BH", "BH")
)
# Create DESeq2 dataset from HTSeq output
dds <- DESeqDataSetFromHTSeqCount(
sampleTable = sample_info,
design = ~ condition
)
# Filter out low count genes
dds <- dds[rowSums(counts(dds)) > 1, ]
# Run DESeq2
dds <- DESeq(dds)
# Results
res <- results(dds)
res <- lfcShrink(dds, coef=2, res=res) # For accurate log2 fold change
# Order by adjusted p-value
res_ordered <- res[order(res$padj), ]
# Save results
write.csv(as.data.frame(res_ordered), file = "deseq2_results.csv")
# Optional: Plot MA plot
pdf("MA_plot.pdf")
plotMA(res, main="DESeq2 MA-Plot", ylim=c(-5,5))
dev.off()
# Optional: Volcano plot
library(EnhancedVolcano)
pdf("volcano_plot.pdf")
EnhancedVolcano(res,
lab = rownames(res),
x = 'log2FoldChange',
y = 'pvalue',
pCutoff = 0.05,
FCcutoff = 1.5,
title = 'Differential Expression: BH vs Serum')
dev.off()
RESULTS
I obtained a csv file from the analysis and analyzed it on RStudio. By categorizing the genes as overexpressed and underexpressed with log2FoldChange > 1 and log2FoldChange < -1, I could finally obtained what genes were down and upregulated in comparison of BH and Serum. I listed the log2FoldChange values from highest to lowest for upregulation and lowest to highest for downregulation, and then found the real gene names by comparing the codename from the "pacbio_Assembly.tsv" file.
- UPREGULATED GENES
Locus Tag | Type | Length | Gene | EC Number | COG | Product Description |
---|---|---|---|---|---|---|
EJCDLKNP_00593 | CDS | 255 | purS | 6.3.5.3 | COG1828 | Phosphoribosylformylglycinamidine synthase subunit PurS |
EJCDLKNP_00592 | CDS | 672 | purQ | 6.3.5.3 | COG0047 | Phosphoribosylformylglycinamidine synthase subunit PurQ |
EJCDLKNP_00594 | CDS | 729 | purC | 6.3.2.6 | COG0152 | Phosphoribosylaminoimidazole-succinocarboxamide synthase |
EJCDLKNP_00591 | CDS | 2223 | purL | 6.3.5.3 | COG0046 | Phosphoribosylformylglycinamidine synthase subunit PurL |
EJCDLKNP_00590 | CDS | 1440 | purF | 2.4.2.14 | Amidophosphoribosyltransferase | |
EJCDLKNP_00589 | CDS | 1053 | purM | 6.3.3.1 | COG0150 | Phosphoribosylformylglycinamidine cyclo-ligase |
EJCDLKNP_00587 | CDS | 1536 | purH | Bifunctional purine biosynthesis protein PurH | ||
EJCDLKNP_00586 | CDS | 1275 | purD | 6.3.4.13 | COG0151 | Phosphoribosylamine--glycine ligase |
EJCDLKNP_00588 | CDS | 579 | purN | 2.1.2.2 | Phosphoribosylglycinamide formyltransferase | |
EJCDLKNP_00314 | CDS | 858 | gspA_1 | COG1442 | General stress protein A |
- DOWNREGULATED GENES
Locus Tag | Type | Length | Gene | EC Number | COG | Product Description |
---|---|---|---|---|---|---|
EJCDLKNP_01859 | CDS | 1467 | lacS | COG2190 | Lactose permease | |
EJCDLKNP_00196 | CDS | 915 | lacC_1 | 2.7.1.144 | COG1105 | Tagatose-6-phosphate kinase |
EJCDLKNP_00844 | CDS | 225 | acpA | COG0236 | Acyl carrier protein | |
EJCDLKNP_00843 | CDS | 918 | fabD | 2.3.1.39 | COG0331 | Malonyl CoA-acyl carrier protein transacylase |
EJCDLKNP_00197 | CDS | 1932 | fruA_1 | COG1299 | PTS system fructose-specific EIIABC component | |
EJCDLKNP_00842 | CDS | 738 | fabG_2 | 1.1.1.100 | 3-oxoacyl-[acyl-carrier-protein] reductase FabG | |
EJCDLKNP_00195 | CDS | 753 | glcR | COG1349 | HTH-type transcriptional repressor GlcR | |
EJCDLKNP_00839 | CDS | 420 | fabZ_2 | 4.2.1.59 | COG0764 | 3-hydroxyacyl-[acyl-carrier-protein] dehydratase FabZ |
EJCDLKNP_00841 | CDS | 1236 | fabF | 2.3.1.179 | COG0304 | 3-oxoacyl-[acyl-carrier-protein] synthase 2 |
EJCDLKNP_00840 | CDS | 477 | accB | Biotin carboxyl carrier protein of acetyl-CoA carboxylase |
In addition, I visualized the results with a volcano plot and a MA plot.