Differential expression analysis - dogayalova/Genome-Analysis-1MB462-67615-VT2025 GitHub Wiki

HTSeq ANALYSIS

In the paper, RNA-seq libraries were prepared using the ScriptSeq Complete Kit (Bacteria). This kit employs a strand-specific protocol that retains the directionality of the original RNA transcripts. Thus, to accurately count reads with HTSeq while considering strand specificity, I tried setting the --stranded parameter to reverse. Also used flags in the htseq-count command to specify other parameters. Moreover, I converted pacbio_assembly.gff to .gtf format with AGAT. Also, I had problems with disc space, so I had to direct the out and err files to the OUTDIR. This approach resulted in low quality (low no-feature count) and I had to fix this code.

Next, I fixed the code and got it to work properly. I used the original .gff file from the prokka annotation and set the strand to yes. You can find the code and the resulting result below:

#!/bin/bash
#SBATCH -A uppmax2025-3-3
#SBATCH -M snowy
#SBATCH -p core
#SBATCH -n 4
#SBATCH -t 05:00:00
#SBATCH -J htseq_count_serum
#SBATCH -o htseq_count_serum.out
#SBATCH -e htseq_count_serum.err
#SBATCH [email protected]
#SBATCH --mail-type=ALL


# Load modules
module load bioinfo-tools
module load htseq

# Define paths
OUTDIR=/proj/uppmax2025-3-3/nobackup/work/doga
GFF=/proj/uppmax2025-3-3/nobackup/work/doga/pacbio_assembly_no_fasta.gff
BAMDIR=/proj/uppmax2025-3-3/nobackup/work/doga


# Serum samples
htseq-count -f bam -r pos -s yes -t CDS -i ID $BAMDIR/ERR1797969_mapped.sorted.bam $GFF > $OUTDIR/ERR1797969_counts_yes.txt 
 
htseq-count -f bam -r pos -s yes -t CDS -i ID $BAMDIR/ERR1797970_mapped.sorted.bam $GFF > $OUTDIR/ERR1797970_counts_yes.txt 

htseq-count -f bam -r pos -s yes -t CDS -i ID $BAMDIR/ERR1797971_mapped.sorted.bam $GFF > $OUTDIR/ERR1797971_counts_yes.txt 

  • I did the same for BH samples as well.

RESULTS

ERR1797969_counts_yes.txt
__no_feature	4198810
__ambiguous	554950
__too_low_aQual	237392
__not_aligned	234207
__alignment_not_unique	0


ERR1797970_counts_yes.txt
__no_feature	4526935
__ambiguous	658750
__too_low_aQual	280865
__not_aligned	247849
__alignment_not_unique	0


ERR1797971_counts_yes.txt
__ambiguous	588175
__too_low_aQual	262464
__not_aligned	246122
__alignment_not_unique	0


ERR1797972_counts_yes.txt
__no_feature	993164
__ambiguous	1025284
__too_low_aQual	159269
__not_aligned	197729
__alignment_not_unique	0


ERR1797973_counts_yes.txt
EJCDLKNP_03182	120
__no_feature	974518
__ambiguous	1124178
__too_low_aQual	149475
__not_aligned	187421
__alignment_not_unique	0

ERR1797974_counts_yes.txt
__no_feature	938370
__ambiguous	961970
__too_low_aQual	139659
__not_aligned	172093
__alignment_not_unique	0

DESeq2 ANALYSIS

Next, I did DESeq2 to the HTSeq count results. I used an R script for this, its as follows:

# Load DESeq2
library("DESeq2")

# Set working directory to where the count files are
setwd("/proj/uppmax2025-3-3/nobackup/work/doga")

# List of your HTSeq count files
count_files <- c(
  "ERR1797969_counts_yes.txt",
  "ERR1797970_counts_yes.txt",
  "ERR1797971_counts_yes.txt",
  "ERR1797972_counts_yes.txt",
  "ERR1797973_counts_yes.txt",
  "ERR1797974_counts_yes.txt"
)

# Sample condition metadata
sample_info <- data.frame(
  sampleName = count_files,
  fileName = count_files,
  condition = c("Serum", "Serum", "Serum", "BH", "BH", "BH")
)

# Create DESeq2 dataset from HTSeq output
dds <- DESeqDataSetFromHTSeqCount(
  sampleTable = sample_info,
  design = ~ condition
)

# Filter out low count genes
dds <- dds[rowSums(counts(dds)) > 1, ]

# Run DESeq2
dds <- DESeq(dds)

# Results
res <- results(dds)
res <- lfcShrink(dds, coef=2, res=res) # For accurate log2 fold change

# Order by adjusted p-value
res_ordered <- res[order(res$padj), ]

# Save results
write.csv(as.data.frame(res_ordered), file = "deseq2_results.csv")

# Optional: Plot MA plot
pdf("MA_plot.pdf")
plotMA(res, main="DESeq2 MA-Plot", ylim=c(-5,5))
dev.off()

# Optional: Volcano plot
library(EnhancedVolcano)
pdf("volcano_plot.pdf")
EnhancedVolcano(res,
    lab = rownames(res),
    x = 'log2FoldChange',
    y = 'pvalue',
    pCutoff = 0.05,
    FCcutoff = 1.5,
    title = 'Differential Expression: BH vs Serum')
dev.off()

RESULTS

I obtained a csv file from the analysis and analyzed it on RStudio. By categorizing the genes as overexpressed and underexpressed with log2FoldChange > 1 and log2FoldChange < -1, I could finally obtained what genes were down and upregulated in comparison of BH and Serum. I listed the log2FoldChange values from highest to lowest for upregulation and lowest to highest for downregulation, and then found the real gene names by comparing the codename from the "pacbio_Assembly.tsv" file.

  • UPREGULATED GENES
Locus Tag Type Length Gene EC Number COG Product Description
EJCDLKNP_00593 CDS 255 purS 6.3.5.3 COG1828 Phosphoribosylformylglycinamidine synthase subunit PurS
EJCDLKNP_00592 CDS 672 purQ 6.3.5.3 COG0047 Phosphoribosylformylglycinamidine synthase subunit PurQ
EJCDLKNP_00594 CDS 729 purC 6.3.2.6 COG0152 Phosphoribosylaminoimidazole-succinocarboxamide synthase
EJCDLKNP_00591 CDS 2223 purL 6.3.5.3 COG0046 Phosphoribosylformylglycinamidine synthase subunit PurL
EJCDLKNP_00590 CDS 1440 purF 2.4.2.14 Amidophosphoribosyltransferase
EJCDLKNP_00589 CDS 1053 purM 6.3.3.1 COG0150 Phosphoribosylformylglycinamidine cyclo-ligase
EJCDLKNP_00587 CDS 1536 purH Bifunctional purine biosynthesis protein PurH
EJCDLKNP_00586 CDS 1275 purD 6.3.4.13 COG0151 Phosphoribosylamine--glycine ligase
EJCDLKNP_00588 CDS 579 purN 2.1.2.2 Phosphoribosylglycinamide formyltransferase
EJCDLKNP_00314 CDS 858 gspA_1 COG1442 General stress protein A
  • DOWNREGULATED GENES
Locus Tag Type Length Gene EC Number COG Product Description
EJCDLKNP_01859 CDS 1467 lacS COG2190 Lactose permease
EJCDLKNP_00196 CDS 915 lacC_1 2.7.1.144 COG1105 Tagatose-6-phosphate kinase
EJCDLKNP_00844 CDS 225 acpA COG0236 Acyl carrier protein
EJCDLKNP_00843 CDS 918 fabD 2.3.1.39 COG0331 Malonyl CoA-acyl carrier protein transacylase
EJCDLKNP_00197 CDS 1932 fruA_1 COG1299 PTS system fructose-specific EIIABC component
EJCDLKNP_00842 CDS 738 fabG_2 1.1.1.100 3-oxoacyl-[acyl-carrier-protein] reductase FabG
EJCDLKNP_00195 CDS 753 glcR COG1349 HTH-type transcriptional repressor GlcR
EJCDLKNP_00839 CDS 420 fabZ_2 4.2.1.59 COG0764 3-hydroxyacyl-[acyl-carrier-protein] dehydratase FabZ
EJCDLKNP_00841 CDS 1236 fabF 2.3.1.179 COG0304 3-oxoacyl-[acyl-carrier-protein] synthase 2
EJCDLKNP_00840 CDS 477 accB Biotin carboxyl carrier protein of acetyl-CoA carboxylase

In addition, I visualized the results with a volcano plot and a MA plot.

INTERPRETATION