3. Results - Kkkzq/Genome-Analysis-paper2 GitHub Wiki

3.1. Quality Control & Preprocessing

Result from FastQC

See all quality control results here. Overall, the results are good and the data can be used for more analysis.

  • Per base sequence quality

FastQC provides result figures that show the aggregated quality score statistics at each position along with all reads in the file. The blue line is the median quality score at that position while The red line in each yellow box represents the median quality score at that position.

The figures show the per-base quality of SRR1719266_1P increases after trimming, since this step removed low quality portions of the reads.

This figure shows the per-base quality of trimmed DNA data (SRR5819794.trim_1P). The distribution of the read quality is in the green area.

  • Per base sequence content

This section shows the percent of bases called for each of the four nucleotides at each position across all the reads in the file. The proportion of the four bases should stay constant along the length of the read with each base ~25% or at least %A = %T and %C = %G.

  • Warning - if the difference between A and T or C and G is greater than 10% at any position.
  • Failure - if the difference between A and T or C and G is greater than 20% at any position.

Most of the reads show a good per base sequence content.

  • Per base N content

Shows the percentage of bases at each position/window with no call.

  • Overrepresented sequences

It shows the list of sequences that appear more than expected in the file.

  • Adapter content

There is no significant adapter content.

  • Kmer content

Kmers should be evenly represented across the length of the read.

3.2. Genome Assembly

Use SOAPdenovo and Spades to finish genome assembly. GapCloser could be used to improve the quality of the assembly result made by SOAPdenovo since it is designed to close the gaps emerging during the scaffolding process by SOAPdenovo or other assemblers.

SOAPdenovo

Although the output.scafStatistics file contains some statistic results for the assembly, it is not enough. Use Quast to generate summary statistics (N50, maximum contig length, GC %, # genes found in a reference list or with built-in gene-finding tools, etc.) for the assembly.

/sw/bioinfo/quast/4.5.4/rackham/bin/quast.py output.scafSeq -o ./quast_output
/sw/bioinfo/quast/4.5.4/rackham/bin/quast.py closed_gaps.fasta -o ./quast_output

SOAPdenovo Quality Report

output.scafSeq
contigs >= 0bp 254571
contigs >= 1000bp 15076
contigs >= 5000bp 1477
contigs >= 10000bp 198
contigs >= 25000bp 1
contigs >= 50000bp 0
contigs 36838
Largest contig 27483
Total length 50020047
GC% 40.38
N50 1463
N's per 100 kbp 54477.96

Contigs are continuous segments containing only A, C, G, or T bases without gaps. Scaffolds are composed of contigs and gaps (by chaining contigs together using additional information about the relative position and orientation of the contigs in the genome). Contigs in a scaffold are separated by gaps (usually represented by "N"). The table provided by Quast only contains information of contigs, the scaffold information is provided in the output.scafStatistics.

Length Percentage
scaffolds>100 251027
scaffolds>500 36782
scaffolds>1K 15053
scaffolds>10K 198
scaffolds>100K 0
scaffolds>1M 0

N50 (sequence length of the shortest contig at 50% of the total genome length, which means that half of the genome sequence is in contigs larger than or equal to the N50 contig size) is useful in the evaluation of the quality of an assembly. A larger N50 value means having fewer and longer contigs, which is better.

N's per 100 kbp is the average number of uncalled bases (N's) per 100000 assembly bases.

SOAPdenovo-GapCloser Quality Report

closed_gaps
contigs >= 0bp 254571
contigs >= 1000bp 11629
contigs >= 5000bp 1353
contigs >= 10000bp 128
contigs >= 25000bp 1
contigs >= 50000bp 0
contigs 36504
Largest contig 25428
Total length 44655862
GC% 40.56
N50 1201
N's per 100 kbp 28958.53

Comparing those 2 tables, GapCloser slightly decreases the length of contigs and the total length, which is not surprising because it deletes gaps in the assembly. The contig total numbers also decreased, it might because GApCloser joined some contigs after deleting gaps. It obviously decreases the average number of uncalled bases, which illustrates the quality of assembly improved. But the smaller N50 value is not good.

Considering all the improvements and worse metrics, SOAPdenovo-GapCloser seems like an improvement from SOAPdenovo.

Spades

Use Quast to generate summary statistics (N50, maximum contig length, GC %, # genes found in a reference list or with built-in gene-finding tools, etc.) for the assembly.

/sw/bioinfo/quast/4.5.4/rackham/bin/quast.py scaffolds.fasta -o ./quast_output/output
scaffolds
contigs >= 0bp 1164864
contigs >= 1000bp 600
contigs >= 5000bp 178
contigs >= 10000bp 99
contigs >= 25000bp 25
contigs >= 50000bp 7
contigs 12069
Largest contig 66390
Total length 10455664
GC% 40.90
N50 682
N's per 100 kbp 417.12

Spades have longer contigs, a smaller number of contigs, and smaller N's per 100 kbp. But the total length is 10455664, which is obviously smaller than SOAPdenovo (44655862 after GapCloser). Also, the N50 value is half smaller than the result of SOAPdenovo.

Compared with reference genome assemble

The following table shows the quast result of the reference genome assemble.

sel3_NW_015504249
contigs >= 0bp 1
contigs >= 1000bp 1
contigs >= 5000bp 1
contigs >= 10000bp 1
contigs >= 25000bp 1
contigs >= 50000bp 1
contigs 1
Largest contig 3380749
Total length 3380749
GC% 45.32
N50 3380749
N's per 100 kbp 3455.54

It is obvious that the reference genome contains a large contig, and it's total length is 1/3 of my assemble. The N's per 100kbp value is in the middle of spades'result and soapdenovo's result, but the contig assemble is much better than my results.

MUMmerplot

MUMmer was used to comparing the genome assembled by SOAPdenovo and SPAdes with the reference genome. When compared assemblies agree, a line or dot is plotted. The forward matches are displayed in red, while the reverse matches are displayed in blue. Thus, a single red line represents perfectly identical sequences.

These 2 figures show the evaluation results of the SOAPdenovo assembly before and after GapCloser.

This figure shows the evaluation result of the Spades assembly.

All these plots are showing good results with nearly red lines and almost no dots. There is a prominent red dot in the evaluation figure of the Spades assembly. Judging from the figures alone, Gapcloser's improvement in the results does not seem to be obvious.

However, combining the statistical tables and images, the SOAPdenovo assembly result after Gapcloser processing is relatively better. Thus, the genome assembled by SOAPdenovo and improved by GapCloser was selected for subsequent analysis.

3.3. Genome Annotation

Soft-masked and hard-masked genome

Soft-masked genome is prepared for BRAKER annotation while hard-masked genome is prepared for STAR mapping. In this step hard-masked genome will be generated first.

RepeatScout only succeeded in the first step and got a genome.repseq.fa file. The filter of short or low complexity failed because of failure in calling filter-stage-1.prl. Thus, genome.repseq.fa was used directly as repeat library for RepeatMasker. Which cause RepeatMasker broke at around 800 branches and can't continue with genome.repseq.fa as repeat library. Then, -species mammal option was used to run RepeatMasker again. RepeatMasker with mammal repeat library successfully get a hard-masked genome file.

For the soft-masked file, change options with the following codes. The output soft-maksed file closed_gaps.fasta is both hard-masked and soft-masked, which is confusing but the genome file could be used for BRAKER annotation.

RepeatMasker -q -species mammal /home/ziqi/Ziqi_Genome_Analysis/result/6_repeatmasker/softmask/closed_gaps.fasta -xsmall

Following are parts of the result .tbl files of hard-masked and soft-masked genome:

# hard-masked
==================================================
file name: closed_gaps.fasta
sequences:        254571
total length:   86507394 bp  (73313141 bp excl N/X-runs)
GC level:         40.98 %
bases masked:   53008785 bp ( 61.28 %)
==================================================
# soft-masked
==================================================
file name: closed_gaps.fasta
sequences:        254571
total length:   86507394 bp  (73313141 bp excl N/X-runs)
GC level:         40.98 %
bases masked:   52394484 bp ( 60.57 %)
==================================================

According to the paper, they got "Repetitive regions comprised 33% of the genome and were soft-masked using RepeatMasker." The reason why our project got a higher value which is 60% repetitive regions of the genome might because of the repeat library. Since RepeatScout failed to build a specific one, a less precise repeat library "mammal" had to be chosen. A higher percentage of masked repeat regions in whole genome could decrease the difficulty of annotation but also cause loss of potential genes.

BRAKER

BRAKER was failed when running GeneMark step. A hint file could be generated but the GeneMark didn't work. After checking the .gm_key file location and reset all paths needed in BRAKER, the problem still exist. Then I suspect that the problem might be that the quality of my own assembled genome is too low, so the reference genome was also used to run the BRAKER but it failed at the same step. I also tried other ways such as re-run the STAR mapping step and use the new bam files for BRAKER, but still not worked.

About one week was spent on solving the braker problem, but in the end there was no progress. In the end, the teaching assistant suggested that I skip this step and use the reference genome and reference genome annotation files for subsequent analysis.

eggNOGmapper

Reference genome sel3_NW_015504249.fna and annotation file GCF_001595765.1_Mnat.v1_genomic.gff was used in this step. This method began with checking the reference genome scaffold name, filtering out the corresponding entries in the annotation file and saving those entries as a new one. Here is a example of annotatio file:

NW_015504249.1	Gnomon	gene	217208	219388	.	-	.	ID=gene3896;Dbxref=GeneID:107525408;Name=LOC107525408;gbkey=Gene;gene=LOC107525408;gene_biotype=protein_coding

According to the nucleotide locations in the entries, we could seperate gene sequences from the large scaffold in the reference genome file. Using emboss to convert gene sequences into protein sequences and output a .fa file. Uploading this file in eggNOGmapper online server for functional annotation. See the result in excel format here. 95 queries were found.

3.4. RNA mapping

STAR

Using following codes at first:

# build genome index
# use hard mask file
STAR --runThreadN 8 \
--runMode genomeGenerate \
--genomeChrBinNbits 9 \
--genomeSAindexNbases 12 \
--genomeDir /home/ziqi/Ziqi_Genome_Analysis/result/7_rna_mapping/index \
--genomeFastaFiles /home/ziqi/Ziqi_Genome_Analysis/result/6_repeatmasker/hardmask/closed_gaps.fasta.masked \

# read alignment
STAR --runThreadN 8 \
--genomeDir /home/ziqi/Ziqi_Genome_Analysis/result/7_rna_mapping/index \
--readFilesIn /home/ziqi/Ziqi_Genome_Analysis/2_Eckalbar_2016/sel3/rna_seq_data/trimmed/*.fastq.gz \
--outFileNamePrefix /home/ziqi/Ziqi_Genome_Analysis/result/7_rna_mapping/star/Mnat \
--outSAMtype BAM SortedByCoordinate \

But it continuously got failure report:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Searching for solutions on Google and tried to adjust multiple parameters such as --genomeChrBinNbits since my genome is not a small one. This adjustment of parameter helped me finish the index building step but the following mapping step had a segmentation fault. Tried 8 cores under the permission of my TA but still won't work. In the end, I manage to finish the RNA mapping by writing a loop and run one rna-seq file at a time. See script here. This will give me a series of .bam files separately without segmentation fault, which is a RAM fault.

STAR could be run successfully with my assembly and got a series of bam files for the annotation step. But as I mentioned before, the large number of small contigs might cause the broken of BRAKER, thus the annotation can't be finished successfully. So I decided to use the reference genome for the re-run of RNA mapping and gene annotation. RNA mapping with reference genome is finished with similar parameters and code structures, but run 2 pairs of paired and unpaired rna-seq files (total 4 files) a time, which could be seen here. Those bam files are also used in the following HTseq analysis. See the mapping quality results here. All Uniquely mapped reads % are over or around 60%, which is good.

Show one of the result files (outputP_SRR{213}Log.final.out) here:

Number of input reads |	299100
                      Average input read length |	196
                                    UNIQUE READS:
                   Uniquely mapped reads number |	188233
                        Uniquely mapped reads % |	62.93%
                          Average mapped length |	191.12
                       Number of splices: Total |	45903
            Number of splices: Annotated (sjdb) |	0
                       Number of splices: GT/AG |	32726
                       Number of splices: GC/AG |	723
                       Number of splices: AT/AC |	88
               Number of splices: Non-canonical |	12366
                      Mismatch rate per base, % |	1.21%
                         Deletion rate per base |	0.05%
                        Deletion average length |	2.01
                        Insertion rate per base |	0.04%
                       Insertion average length |	2.15
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |	20103
             % of reads mapped to multiple loci |	6.72%
        Number of reads mapped to too many loci |	413
             % of reads mapped to too many loci |	0.14%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |	0
       % of reads unmapped: too many mismatches |	0.00%
            Number of reads unmapped: too short |	43225
                 % of reads unmapped: too short |	14.45%
                Number of reads unmapped: other |	47126
                     % of reads unmapped: other |	15.76%
                                  CHIMERIC READS:
                       Number of chimeric reads |	0
                            % of chimeric reads |	0.00%

3.5. Differential Expression

HTseq

Using HTseq to count for each tissue and development stage. See scripts here and all results here. A total of 6 count files are obtained.

Stage Tissue
cs15 forelimb
cs15 hindlimb
cs16 forelimb
cs16 hindlimb
cs17 forelimb
cs17 forelimb

DEseq2

DESeq2 was used for the differential expression analysis of different tissue and development stages. Shrinkage of effect size (LFC estimates) was used for visualization and ranking of genes by calling function lfcShrink, which could remove the noise associated with log2 fold changes from low count genes without setting arbitrary filtering thresholds. The log2 fold change could reveal how much the gene's expression changed compared with other conditions. For example, in the condition of hindlimb vs forelimb, a log2 fold change of -5 means that that gene's expression is increased by a multiplicative factor of 2^(-5) ~ 0.031 compared with forelimb. See all scripts here.

Different tissues

The analysis result of different tissues is shown in the following figures. Two genes were found have adjusted p-value less than 0.1, which are PITX1 and LOC107525399. See result csv file here.

figure
Scatter plot of the log2 fold changes (y-axis) versus the mean of normalized counts (x-axis). Blue points have adjusted p-value less than 0.1

This figure illustrates that PITX1 has lower expression in forelimbs than in hindlimbs.

PCA plot using all the genes. It could be seen that different limbs are separated along the x-axis and this PC could explain 84% of variance The normalized counts in all samples for the gene with the lowest p-value, PITX1.

Different tissues and stages

The list of genes have adjusted p-value less than 0.1 of different tissues and stages could be found in here. Totally 8 genes are in this list.

lfc for 4 groups
Scatter plot of the log2 fold changes (y-axis) versus the mean of normalized counts (x-axis). Blue points have adjusted p-value less than 0.1

PITX1 still has the lowest p-value in this analysis. In different development stages, PITX1 shows low expression in the forelimb and relatively high expression in the hindlimb.

PCA plot using all the genes. It could be seen that different limbs are separated along the x-axis which could explain 84% of variance, while different time stages are along the y-axis which explains 12% of variance. The normalized counts in all samples for the gene with the lowest p-value, PITX1.

For the last part of the visualization, plotting all genes used in differential analysis and their expression levels as a heatmap. The biological interpretive analysis of the 8 genes with a p-value < 0.01 could be seen in the discussion part.

Heatmap of all genes and their expression in the six different samples. Heatmap of distance matrix
⚠️ **GitHub.com Fallback** ⚠️