case2_renew - nibb-gitc/gitc2025mar-rnaseq GitHub Wiki

Case study 2: Genome-based RNA-Seq pipeline

本演習では以下Nature Communicationに掲載された論文の一部データを用いて行う。

https://pubmed.ncbi.nlm.nih.gov/34893639

本論文はアラビドプシスを用いた葉緑体逆行性シグナル(retrograde signaling)伝達経路に関わる因子をGWAS(genome-wide association study)解析等の手法を駆使し明らかにしたものだが、その解析過程でFe欠乏処理における遺伝子発現変動を調べたRNA-Seqデータがある。

以下、論文から抜粋するがFe欠乏処理によってクロロシスを生じる。この状況でどのような遺伝子発現変動があるか解析する。

本論文に用いられたRNA-Seqデータの一部を以下の流れで用意した。

以下ncbiのGEO(Gene Expression Omnibus)のサイトより、

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE163190

SRA toolsのprefetchコマンドで関連データをまとめてダウンロード。

$ prefetch GSE163190

これによりSRA codeごとのフォルダーが作成され、その中にSRAファイルが確認できる。

個々のSRAファイルは以下、sratoolsのfastq-dumpコマンドにより、fastqに変換出来る。

ここではSRR13253052からSRR13253057までの6データをfastqにした。

$ fastq-dump SRR13253052.sra
$ fastq-dump SRR13253053.sra
$ fastq-dump SRR13253054.sra
$ fastq-dump SRR13253055.sra
$ fastq-dump SRR13253056.sra
$ fastq-dump SRR13253057.sra

SRR13253052-SRR13253054の3サンプルがcontrolとなる、PplusFeplus39hoursの条件 3反復。

SRR13253055-SRR13253057の3サンプルがFe欠乏処理したものとなる、PplusFeminus39hoursの条件 3反復。

seqkit statsで以下のstatisticsが確認出来る。

以下のfastqファイルが得られていること、76baseのsingle readのデータであることが分かる。

file               format  type    num_seqs        sum_len  min_len  avg_len  max_len  Q1  Q2  Q3  sum_gap  N50  Q20(%)  Q30(%)  GC(%)
SRR13253052.fastq  FASTQ   DNA   47,808,882  3,633,475,032       76       76       76  76  76  76        0   76   92.56   89.35  44.93
SRR13253053.fastq  FASTQ   DNA   45,549,148  3,461,735,248       76       76       76  76  76  76        0   76   92.33   89.06  44.81
SRR13253054.fastq  FASTQ   DNA   42,558,181  3,234,421,756       76       76       76  76  76  76        0   76    92.5   89.28  44.67
SRR13253055.fastq  FASTQ   DNA   43,572,223  3,311,488,948       76       76       76  76  76  76        0   76   92.41   89.17  44.63
SRR13253056.fastq  FASTQ   DNA   49,715,187  3,778,354,212       76       76       76  76  76  76        0   76    92.5   89.27  44.85
SRR13253057.fastq  FASTQ   DNA   45,508,973  3,458,681,948       76       76       76  76  76  76        0   76   92.57   89.38  44.46

このままだとdata量が多く解析時間がかかるので、今回は以下seqkitを用いたシェルスクリプトでフォルダー内にある各fastqファイルを10M readとなるようサンプリングした。

for file in *.fastq; do seqkit sample -n 10000000 "$file" -o "${file%.fastq}_10M.fastq"; done

以下が得られる。

SRR13253052_10M.fastq
SRR13253053_10M.fastq
SRR13253054_10M.fastq
SRR13253055_10M.fastq
SRR13253056_10M.fastq
SRR13253057_10M.fastq

これらのdataを用いて、cutadapt -> HISAT2 -> StringTie -> edgeR のパイプラインで新規トランスクリプトの発見も含めて、DEG (differential expressed genes)を明らかにせよ。

Data

Input reads

(ファイルは、~/gitc/data/KY/genome_base/case2_renew　にある)

condition control(Pplus_Feplus39_hours) rep1　SRR13253052_10M.fastq
condition control(Pplus_Feplus39_hours) rep2　SRR13253053_10M.fastq
condition control(Pplus_Feplus39_hours) rep3　SRR13253054_10M.fastq
condition Fe deficiency (Pplus_Feminus_39hours) rep1　SRR13253055_10M.fastq
condition Fe deficiency (Pplus_Feminus_39hours) rep2　SRR13253056_10M.fastq
condition Fe deficiency (Pplus_Feminus_39hours) rep3　SRR13253057_10M.fastq

Reference

genome sequence: genome.fa
gene annotation: genes.gtf

Software

cutadapt (installed)
hisat2 (installed)
stringtie (installed)
samtools (installed)
edgeR (installed)
multiqc (installed)

preprocessing

adapter配列の残存やシーケンスクオリティーの画一化のためにcutadaptにかける。

オプションの意味は以下、参考に。 -aでシーケンスリード3'側に位置するadapter配列を指定

-q low quality baseの閾値

-O 偶然一致によるトリミングを回避するためのオーバーラップ最小値

--minimum-lengthでトリミングにより指定長より短くなったリードは捨てる（paired end readの場合はpaired endごと捨てる)

-o 出力ファイル名

cutadapt -q 30 -O 7 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --minimum-length 50 -o SRR13253052_10M.fastq.clnq_30_7_50.fastq SRR13253052_10M.fastq
cutadapt -q 30 -O 7 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --minimum-length 50 -o SRR13253053_10M.fastq.clnq_30_7_50.fastq SRR13253053_10M.fastq
cutadapt -q 30 -O 7 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --minimum-length 50 -o SRR13253054_10M.fastq.clnq_30_7_50.fastq SRR13253054_10M.fastq
cutadapt -q 30 -O 7 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --minimum-length 50 -o SRR13253055_10M.fastq.clnq_30_7_50.fastq SRR13253055_10M.fastq
cutadapt -q 30 -O 7 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --minimum-length 50 -o SRR13253056_10M.fastq.clnq_30_7_50.fastq SRR13253056_10M.fastq
cutadapt -q 30 -O 7 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --minimum-length 50 -o SRR13253057_10M.fastq.clnq_30_7_50.fastq SRR13253057_10M.fastq

参考）以下のシェルスクリプトでフォルダー内にある全fastqファイルを対象にcutadaptにかけられる。

IDX_CONS=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
QV=30
O=7
MINCUT=50

for j in *.fastq

do

k=`echo $j|sed -e 's/.fastq//'`

INPUT1=$k.fastq

OUTPUT1=$INPUT1\.clnq_$QV\_$O\_$MINCUT.fastq

cutadapt \
-q $QV \
-O $O \
-a $IDX_CONS \
--minimum-length $MINCUT \
-o $OUTPUT1 \
$INPUT1

done

Mapping with hisat2

Build hisat2 index from reference genome

hisat2-build genome.fa genome

Run hisat2

$ hisat2 -p 4 --dta -x genome -U SRR13253052_10M.fastq.clnq_30_7_50.fastq -S SRR13253052.sam

他の５つも同様に。

$ hisat2 -p 4 --dta -x genome -U SRR13253053_10M.fastq.clnq_30_7_50.fastq -S SRR13253053.sam
$ hisat2 -p 4 --dta -x genome -U SRR13253054_10M.fastq.clnq_30_7_50.fastq -S SRR13253054.sam
$ hisat2 -p 4 --dta -x genome -U SRR13253055_10M.fastq.clnq_30_7_50.fastq -S SRR13253055.sam
$ hisat2 -p 4 --dta -x genome -U SRR13253056_10M.fastq.clnq_30_7_50.fastq -S SRR13253056.sam
$ hisat2 -p 4 --dta -x genome -U SRR13253057_10M.fastq.clnq_30_7_50.fastq -S SRR13253057.sam

*-p は使うCPU coreを指定するオプション。使用するコンピュータのスペックに合わせて。

Inspect Results

hisat2実行時にmappning stats情報がreportされる。これは記録しておくとよい。

あるいは以下、--new-summary --summary-fileのオプションを付けておくと、mappning stats情報が指定するファイルに出力される。

また後述のmultiqcで、さらにまとめたreportを得ることもできる。

$ hisat2 -p 4 --dta --new-summary --summary-file SRR13253052.summary -x genome -U SRR13253052_10M.fastq.clnq_30_7_50.fastq -S SRR13253052.sam

Hisat2 stats summarize (参考）

multiqcを利用すると以下が可能である。

$multiqc -d .

で、カレントディレクトリーにあるstats情報を認識し、集計したhtmlファイル等が得られる。 https://multiqc.info/docs/#hisat2 WSB000011

Convert SAM to BAM

samtools sort -@ 4 -o SRR13253052.sorted.bam  SRR13253052.sam
samtools index SRR13253052.sorted.bam

他の５つも同様。これもシェルスクリプトで６つ自動で処理するには以下のようにする。

for f in *sam
do 
  samtools sort -o `basename $f .sam`.sorted.bam $f
done

for f in *bam
do
  samtools index $f
done

Transcript assembly and quantification with StringTie

ここでは-eは付けず、-Gで指定したgtfファイルに記載の遺伝子モデルに加え、新規な遺伝子の発見もおこなう。

$ stringtie -p 4 -G genes.gtf -o stringtie_count_no_e/SRR13253052/SRR13253052.gtf SRR13253052.sorted.bam
$ stringtie -p 4 -G genes.gtf -o stringtie_count_no_e/SRR13253053/SRR13253053.gtf SRR13253053.sorted.bam
$ stringtie -p 4 -G genes.gtf -o stringtie_count_no_e/SRR13253054/SRR13253054.gtf SRR13253054.sorted.bam
$ stringtie -p 4 -G genes.gtf -o stringtie_count_no_e/SRR13253055/SRR13253055.gtf SRR13253055.sorted.bam
$ stringtie -p 4 -G genes.gtf -o stringtie_count_no_e/SRR13253056/SRR13253056.gtf SRR13253056.sorted.bam
$ stringtie -p 4 -G genes.gtf -o stringtie_count_no_e/SRR13253057/SRR13253057.gtf SRR13253057.sorted.bam

less コマンドでstringtie_count_no_e/SRR13253052/SRR13253052.gtf 等gtfの中身を確認しておく。

出力されたgtfファイルのリストを作成 sample.list

stringtie_count_no_e/SRR13253052/SRR13253052.gtf SRR13253052.sorted.bam
stringtie_count_no_e/SRR13253053/SRR13253053.gtf SRR13253053.sorted.bam
stringtie_count_no_e/SRR13253054/SRR13253054.gtf SRR13253054.sorted.bam
stringtie_count_no_e/SRR13253055/SRR13253055.gtf SRR13253055.sorted.bam
stringtie_count_no_e/SRR13253056/SRR13253056.gtf SRR13253056.sorted.bam
stringtie_count_no_e/SRR13253057/SRR13253057.gtf SRR13253057.sorted.bam

ここではcommand user interfaceで以下の流れで作製（環境に合わせディレクトリーの指定に間違いがないか注意)

emacs sample.list
stringtie_count_no_e/SRR13253052/SRR13253052.gtf SRR13253052.sorted.bam
stringtie_count_no_e/SRR13253053/SRR13253053.gtf SRR13253053.sorted.bam
stringtie_count_no_e/SRR13253054/SRR13253054.gtf SRR13253054.sorted.bam
stringtie_count_no_e/SRR13253055/SRR13253055.gtf SRR13253055.sorted.bam
stringtie_count_no_e/SRR13253056/SRR13253056.gtf SRR13253056.sorted.bam
stringtie_count_no_e/SRR13253057/SRR13253057.gtf SRR13253057.sorted.bam
C-x C-c　←ctrlキーを押しながらxおよびc
y ←で確認

元のgene.gtfファイルに加え、新規に見出されたgtfがmergeされたgtfファイルが出力される。

stringtie --merge -p 4 -G genes.gtf -o stringtie_merged_no_e.gtf sample.list

Inspect new annotation

gffcompareを使って、stringtie mergeによって追加されたgtf情報を確認

gffcompare -r genes.gtf -o 10M_merged_compare stringtie_merged_no_e.gtf

merged.statsを見る、以下新規に1790のexonが見出されている。

# gffcompare v0.12.2 | Command line was:
#gffcompare -r genes.gtf -o 10M_merged_compare stringtie_merged_no_e.gtf
#

#= Summary for dataset: stringtie_merged_no_e.gtf 
#     Query mRNAs :   54356 in   32992 loci  (42741 multi-exon transcripts)
#            (10303 multi-transcript loci, ~1.6 transcripts per locus)
# Reference mRNAs :   41611 in   33350 loci  (30130 multi-exon)
# Super-loci w/ reference transcripts:    32800
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    97.5    |
        Exon level:    99.0     |    89.5    |
      Intron level:   100.0     |    93.4    |
Intron chain level:   100.0     |    70.5    |
  Transcript level:    99.8     |    76.4    |
       Locus level:    99.9     |    99.3    |

     Matching intron chains:   30127
       Matching transcripts:   41536
              Matching loci:   33313

          Missed exons:       0/169268	(  0.0%)
           Novel exons:    1790/190560	(  0.9%)
        Missed introns:       0/127896	(  0.0%)
         Novel introns:    1990/136988	(  1.5%)
           Missed loci:       0/33350	(  0.0%)
            Novel loci:     192/32992	(  0.6%)

 Total union super-loci across all input datasets: 32992 
54356 out of 54356 consensus transcripts written in 10M_merged_compare.annotated.gtf (0 discarded as redundant)

Transcript quantification with StringTie

再度、この新しいgtfファイルをrefとして、今度は-eを付加してカウントさせる。

$ stringtie -p 4 -e -G stringtie_merged_no_e.gtf -o stringtie_count_e/SRR13253052/SRR13253052.gtf SRR13253052.sorted.bam
$ stringtie -p 4 -e -G stringtie_merged_no_e.gtf -o stringtie_count_e/SRR13253053/SRR13253053.gtf SRR13253053.sorted.bam
$ stringtie -p 4 -e -G stringtie_merged_no_e.gtf -o stringtie_count_e/SRR13253054/SRR13253054.gtf SRR13253054.sorted.bam
$ stringtie -p 4 -e -G stringtie_merged_no_e.gtf -o stringtie_count_e/SRR13253055/SRR13253055.gtf SRR13253055.sorted.bam
$ stringtie -p 4 -e -G stringtie_merged_no_e.gtf -o stringtie_count_e/SRR13253056/SRR13253056.gtf SRR13253056.sorted.bam
$ stringtie -p 4 -e -G stringtie_merged_no_e.gtf -o stringtie_count_e/SRR13253057/SRR13253057.gtf SRR13253057.sorted.bam

Build count matrix

prepDE.pyあるいはprepDE.py3 を使う。(stringtie開発者がedgeR / DEseq2用に提供)

$ prepDE.py3 -i stringtie_count_e

Results => transcript_count_matrix.csv, gene_count_matrix.csv

less コマンドで確認

DEG test using edgeR

> library(edgeR)
> dat <- read.csv("gene_count_matrix.csv",row.names=1)
> head(dat)
                  SRR13253052_genome SRR13253053_genome SRR13253054_genome SRR13253055_genome
MSTRG.2                            0                  0                  0                  0
MSTRG.2|ANAC001                   38                 37                 39                 29
MSTRG.1|NGA3                      65                 65                 81                 60
MSTRG.3|ARV1                      48                  0                  0                 70
MSTRG.3                           56                 98                 95                 51
MSTRG.5|AT1G01070                 64                 44                 45                 26
                  SRR13253056_genome SRR13253057_genome
MSTRG.2                            0                 17
MSTRG.2|ANAC001                   32                 26
MSTRG.1|NGA3                      67                 65
MSTRG.3|ARV1                      32                 50
MSTRG.3                           68                 51
MSTRG.5|AT1G01070                 39                 23

group <- c(rep("cont", 3), rep("Fe_deficiency", 3))
D <- DGEList(dat, group=group)
D <- calcNormFactors(D)       # 注:edgeRのバージョンが３系のケース、4系ならnormLibSizesを使う。
D <- estimateCommonDisp(D)
D <- estimateTagwiseDisp(D)

de.cont.vs.Fe_deficiency <- exactTest(D, pair=c("cont", "Fe_deficiency"))
topTags(de.cont.vs.Fe_deficiency)

Comparison of groups:  Fe_deficiency-cont 
                           logFC   logCPM       PValue          FDR
MSTRG.14833|FER1      -10.660088 6.644828 1.690845e-30 6.917078e-26
MSTRG.16046|ENH1       -3.930730 7.078697 1.224336e-21 2.504317e-17
MSTRG.17948|AT5G51720  -5.859652 5.305329 2.593611e-21 3.536734e-17
MSTRG.8979             11.286069 4.137391 9.247512e-20 9.457662e-16
MSTRG.15727           -11.094253 3.947351 2.528928e-18 2.069118e-14
MSTRG.4248|ATTAP1     -11.137358 3.989875 3.909497e-18 2.665560e-14
MSTRG.436             -11.073557 3.926938 1.990701e-17 1.163394e-13
MSTRG.14833            -4.146941 5.660047 8.751456e-17 4.475166e-13
MSTRG.3424|HEMA1       -3.498131 6.851276 2.111506e-16 9.597733e-13
MSTRG.8201|AT3G02480   10.868772 3.726127 6.311043e-16 2.581785e-12

tmp <- topTags(de.cont.vs.Fe_deficiency, n=nrow(de.cont.vs.Fe_deficiency))
write.table(tmp$table, "de.cont.vs.Fe_deficiency.txt", sep="\t", quote=F)

Links

cutadapt

https://cutadapt.readthedocs.io/en/stable/

hisat2

https://ccb.jhu.edu/software/hisat2/index.shtml

stringtie

http://www.ccb.jhu.edu/software/stringtie/

edgeR

http://bioconductor.org/packages/release/bioc/html/edgeR.html

multiqc

https://docs.seqera.io/multiqc

Notes

論文情報

Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown https://www.nature.com/articles/nprot.2016.095