ex1: Genome-based RNA-seq pipeline (Mapping and Counting)

この練習問題では、Genome-based RNA-seq pipelineの前半部分、つまりmapping and countingの過程を練習する。

シロイヌナズナののRNA-seqを行った。ライブラリは1種類のみで、paird-end 101bpシークエンスを行った。これらのリードをシロイヌナズナゲノムにマッピングし、遺伝子ごとにカウントしたい。

戦略：hisat2でgenomeリファレンスにsplice-awareなマッピング。counting はStringTie を使う。

Data

データファイルは、~/data/KY/genome-base 以下。

Input reads

4D_rep1_R1.fastq (Read1 (fwd))
4D_rep1_R2.fastq (Read2 (Rev))

Reference

genome: genome.fa
gene model: genes.gtf

Setup

Setup environment

ex1 ディレクトリをつくり、以下の解析はその下で作業しよう。

Sequence reads

''less'' などのコマンドで、シーケンスファイル（4D_rep1_R1.fastq）の内容を確認する。

注）本番の解析では、リード数の確認、フォーマットの確認、クオリティの確認などを行う。

Reference sequence and annotation files

''genome.fa'', ''genes.gtf'' の内容をless などで確認する。

Mapping

Create index of reference

$ hisat2-build reference.fasta output_basename

reference.fasta : referenceのfastaファイル。今回の場合は minimouse_mRNA.fa（のパス）
output_basename : 生成するインデックスファイル群のbase nameを指定する。好きなbase nameを指定して良い。

たとえば

hisat2-build genome.fa genome

を実行すると、

genome.1.ht2
genome.2.ht2
genome.3.ht2
genome.4.ht2
genome.5.ht2
genome.6.ht2
genome.7.ht2
genome.8.ht2

の8つのファイルができる。

Run hisat2

hisat2でマッピングしよう。

Usage: 
  hisat2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]

hisat2には様々なオプションがあるが今回は最低限のオプションだけを設定して実行する。どのようなオプションが利用可能かは、''hisat2 -h'' で確認できる。また開発者ホームページに詳細な解説がある。本番の解析では、適切なオプションを適切なパラメータで実行しなければいけない。実際は、いくつかパラメータを振って試行錯誤することになる。

hisat2 -p 4 --dta -x genome -1 4D_rep1_R1.fastq -2 4D_rep1_R2.fastq -S 4D_rep1.sam

4D_rep1.sam がマッピング結果 SAM format
-p は使うCPUコア数。使用するコンピュータにあわせて設定する。
--dta StringTieによるカウントを行う場合指定

コマンドを実行するとしばらくして、

391510 reads; of these:
  391510 (100.00%) were paired; of these:
    13678 (3.49%) aligned concordantly 0 times
    303880 (77.62%) aligned concordantly exactly 1 time
    73952 (18.89%) aligned concordantly >1 times
    ----
    13678 pairs aligned concordantly 0 times; of these:
      2810 (20.54%) aligned discordantly 1 time
    ----
    10868 pairs aligned 0 times concordantly or discordantly; of these:
      21736 mates make up the pairs; of these:
        12548 (57.73%) aligned 0 times
        6979 (32.11%) aligned exactly 1 time
        2209 (10.16%) aligned >1 times
98.40% overall alignment rate

のようなレポートが表示されて終了する。マッピング率など有用な情報なので、テキストファイルにコピー＆ペーストして保存しておくと良い。

Inspect mapping results

計算が終わったら、どのようなファイルが生成されたか確認する。 (''ls -l''など)

4D_rep1.sam の内容を確認しよう (''less, samtools''など).最初の約2万行はヘッダで、アライメントはそのあとに続く。

(example)

samtools view out.sam  |less -S

SAM to BAM

またmapping結果を可視化したりカウントしたり、様々な下流解析を行うために、SAMファイルをsort済のBAMに変換する。そしてインデクシングする。SAM <=> BAM の変換は、NGS解析ではよく行う作業なので必ず身に付けること。

$ samtools sort -@ 4 -o 4D_rep1.sorted.bam 4D_rep1.sam
# => 4D_rep1.sorted.bam が生成される
$ samtools index 4D_rep1.sorted.bam
# => 4D_rep1.sorted.bam.bai が生成される

Count estimation

StringTieにかけて発現のカウントを行う。ここでは用意したGTFに基づいてgene単位およびtranscript単位でのカウントを行う。

$ stringtie -e -B -p 4 -G genes.gtf -o count_4D_rep1.gtf 4D_rep1.sort.bam

-e -Bを付けることで-Gで指定したgtfに記載のgene modelのみカウントし、後継解析に用いるカウント情報を記載した-oで指定するgtfファイルが出力される。
-p は使うCPUコア数。使用するコンピュータにあわせて設定する。

count_4D_rep1.gtf が結果ファイル。lessなどを用いて中身を確認しよう。

ex1 - nibb-gitc/gitc2018july-rnaseq GitHub Wiki

ex1: Genome-based RNA-seq pipeline (Mapping and Counting)

Data

Setup

Setup environment

Sequence reads

Reference sequence and annotation files

Mapping

Create index of reference

Run hisat2

Inspect mapping results

SAM to BAM

Count estimation

⚠️ GitHub.com Fallback ⚠️

ex1 - nibb-gitc/gitc2018july-rnaseq GitHub Wiki

ex1: Genome-based RNA-seq pipeline (Mapping and Counting)

Data

Setup

Setup environment

Sequence reads

Reference sequence and annotation files

Mapping

Create index of reference

Run hisat2

Inspect mapping results

SAM to BAM

Count estimation

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️