03 Annotation - Linafina100/GenomeAnalysis GitHub Wiki

Annotation

1. Repeat Masking

According to the article, 34.74% of the assembled genome consisted of repeat sequences, with transbosable elements being the dominant component. Although we cannot be sure of the portion of repeats on this chromosome, because structural annotation software might incorrectly identify these repetitive regions as protein-coding genes, it is important to mask these sequences. The aim of masking is therefore to identify and soft-mask these repeats on Chromosome 3 to prepare for accurate gene prediction.

In this part, RepeatMasker was used to screen the polished Chromosome 3 assembly. Because the organism of interest is the Niphotrichum japonicum, the embryophyta repeat database was used in this process. Soft-masking (-xsmall) was applied to convert repetitive sequences into lowercase letters rather than replacing them with Ns, preserving the underlying sequence context for annotation tools. The execution script (8_repeat_masking.sh) is available in the code repository.

Results:

  • Total length processed: 17,208,405 bp
  • Bases masked: 193,709 bp (1.13%)
  • Repeat Types: 100% of the masked regions were identified as Simple repeats (1.13%). No complex interspersed repeats (0.00%) were identified.

Interpretation: The masking of only 1.13% of the genome, consisting entirely of simple repeats, is a direct consequence of using the generalized embryophyta database rather than a de novo library masker. Because the moss Niphotrichum japonicum is evolutionarily distant from the predominantly angiosperm genomes that make up the standard database, its specific transposable elements were probably not masked. However, the simple repeats have been successfully soft-masked, and the genome is prepared for structural annotation.

2. Mapping

RNA-seq reads were aligned to the repeat-masked chromosome 3 assembly to provide transcriptional evidence for gene prediction with BRAKER3. Read alignment was performed using STAR. First, a genome index was generated from the repeat-masked assembly. This indexing step enables efficient alignment by organizing the genome into searchable data structures. Paired-end RNA-seq reads from both control and heat-treated samples (six samples in total) were then mapped to the indexed genome. The input reads were provided in compressed FASTQ format and decompressed on-the-fly using zcat. The output was generated as a coordinate-sorted BAM file using the --outSAMtype BAM SortedByCoordinate option. The resulting alignment file (mapped_RNA_Aligned.sortedByCoord.out.bam) contains splice-aware mappings of RNA-seq reads and was used as input evidence for downstream gene prediction.

Table 1. Summary of RNA-seq mapping to the chromosome 3 assembly.

Metric Value
Number of input reads 132702385
Average input read length 300
Uniquely mapped reads number 10577067
Uniquely mapped reads % 7.97%
Average mapped length 297.88
Number of splices: Total 6977343
Number of splices: Annotated (sjdb) 0
Number of splices: GT/AG 6909035
Number of splices: GC/AG 42368
Number of splices: AT/AC 4946
Number of splices: Non-canonical 20994
Mismatch rate per base, % 0.32%
Deletion rate per base 0.01%
Deletion average length 1.53
Insertion rate per base 0.01%
Insertion average length 1.62
Number of reads mapped to multiple loci 899126
% of reads mapped to multiple loci 0.68%
Number of reads mapped to too many loci 24146
% of reads mapped to too many loci 0.02%
Number of reads unmapped: too many mismatches 0
% of reads unmapped: too many mismatches 0.00%
Number of reads unmapped: too short 121163750
% of reads unmapped: too short 91.30%
Number of reads unmapped: other 38296
% of reads unmapped: other 0.03%
Number of chimeric reads 0
% of chimeric reads 0.00%

Only 7.97% of reads mapped uniquely, which is expected since RNA-seq data from the whole genome was aligned to chromosome 3 only. Most reads (91.30%) were classified as unmapped by STAR, reflecting lack of matching regions rather than poor quality. The mapped reads show high quality, with a low mismatch rate (0.32%) and predominantly canonical splice junctions (GT/AG), supporting biologically meaningful alignments. The large number of detected splice sites provides useful evidence for gene structure prediction with BRAKER3. Overall, despite the low mapping rate, the alignment quality is high and suitable for annotation.

To inspect the quality of the alignments and the relationship between read coverage and gene structure, the CRAM files and BRAKER annotation (GTF) were loaded into the Integrative Genomics Viewer (IGV).

Coverage of gene g810.t1 Coverage of gene g822.t1
Figure 1A: Read coverage and splice junctions mapped across the exons of gene g810.t1. Figure 1B: Read coverage and splice junctions mapped across the exons of gene g822.t1.

Lack of coverage in intergenic regions
Figure 1C: Read coverage drops completely to zero in the intergenic regions.

In Figure 1, the read coverage across the genome is highly uneven. Since RNA-seq reads come from mature mRNA, they map to coding regions (exons) of the chromosome 3 sequence, which are shown as blue blocks in the annotation track. As a result, intergenic regions display little to no coverage (Figure 1C), and intronic regions are largely absent due to splicing (Figure 1A–B). The varying heights of the coverage peaks across genes reflect differences in gene expression levels, where highly expressed genes show higher read depth. The curved lines between exons show splice junctions, meaning the introns have been removed from the RNA. These patterns are consistent across the different replicate tracks in the figure, where similar coverage profiles and splice junctions are observed. This indicates that no major differences are present between replicates and that the mapping results are reproducible.

3. Structural Annotation

Structural annotation of the repeat-masked N. japonicum genome assembly was performed using the BRAKER3 pipeline via a Singularity container. Both transcriptomic and homologous protein data was integrated into the pipeline to ensure high accuracy. First, paired-end RNA-Seq reads from control and heat-treated samples were mapped to the masked genome to provide empirical evidence for splice sites. This mapping was performed using STAR, and the resulting alignments were coordinate-sorted and converted into a BAM file using SAMtools to optimize downstream processing. These transcriptomic reads were then used to automatically train GeneMark-ETP and AUGUSTUS. The pipeline was executed with the --softmasking parameter to ensure repetitive elements were properly handled during gene prediction, and a unique species identifier (--species=lisa2) was defined to store to the AUGUSTUS training parameters.

When using BRAKER3 with RNA-seq data only, the software predicts protein-coding genes, including features such as exons, introns, splice sites, start and stop codons, and complete gene structures. These predictions are guided by RNA-seq alignments, which provide evidence of transcription and exon–intron boundaries. Features that are directly supported by RNA-seq data, such as exon–intron boundaries and splice sites, are generally more reliable because they are based on actual transcription evidence. In contrast, predictions of full gene models, untranslated regions (UTRs), or lowly expressed genes may be less reliable, as they depend on coverage and expression levels in the RNA-seq data. Therefore, highly expressed genes with strong read support can be trusted more than genes with weak or no RNA evidence.

Table 2. Summary of predicted gene models.

Feature Count
Genes 3296
Transcripts 3792

A total of 3296 genes and 3792 transcripts were predicted on chromosome 3. The higher number of transcripts compared to genes indicates alternative splicing, where one gene can produce multiple transcript variants, which is expected in eukaryotes. Compared to the authors, fewer features were detected in this analysis. This is mainly because only chromosome 3 was analyzed here, while the authors annotated the entire genome, which naturally results in a much higher number of predicted genes and transcripts. The relatively low RNA-seq mapping rate (~8%) is expected since reads from the whole genome were aligned to a single chromosome. Therefore, this does not indicate poor data quality. However, gene prediction may still be limited by expression levels, as genes that are not expressed under the sampled conditions may lack RNA-seq support and could be missed or incompletely predicted.

4. Functional Annotation

Functional annotation of predicted protein sequences was performed using eggNOG-mapper. The input consisted of protein sequences (braker.aa) generated from the structural annotation step with BRAKER3. The eggNOG database was used as a reference for orthology-based functional annotation. The tool assigns presumed functions to proteins based on sequence similarity to orthologous groups, to predict gene ontology (GO) terms, functional categories, and metabolic pathways. The command emapper.py was run with protein input type. The --data_dir parameter was specified to point to the locally installed eggNOG database on the system. This step provides functional insights into the predicted genes on chromosome 3 to interpret biological processes, including potential roles in heat stress response.

Table 3. Summary of Functional Annotation (eggNOG-mapper)

Feature Count
Total Transcripts Input 3792
Transcripts with Functional Hits 2025
Transcripts with GO Terms 899
Transcripts with Descriptions 1861

The quality of the obtained functional annotation can be evaluated by examining the proportion of transcripts that successfully received functional assignments. Out of 3792 transcripts, 53.4% (2025) received functional hits, and nearly half received specific descriptions. This represents a solid annotation rate for a non-model plant species, indicating that the eggNOG database contained sufficient orthology data to classify the majority of the predicted proteins.

The 1931 transcripts that did not receive description are annotated as 'hypothetical proteins'. This subset consists of two distinct categories. The first category includes genes that did get a hit in the database, meaning their exact protein sequence exists in other organisms in nature. . However, because it didn't get a description, it has still not been established what they do. These are labeled by eggNOG as "DUF" (Domain of Unknown Function). The second category includes genes that received no database hits at all. This occurs because these predicted protein sequences lack significant homology to any experimentally characterized proteins in the reference databases. This high number of uncharacterized proteins is expected in non-model mosses, as they often contain novel, lineage-specific genes or sequences that have diverged so far from their ancestors that standard alignment tools can no longer recognize the similarity.

To tackle the problem of hypothetical proteins, several advanced approaches could be applied. Computationally, one could use deep-learning structural prediction tools (such as AlphaFold) combined with structural aligners (like Foldseek) to identify distant homologs based on 3D protein folding rather than raw sequence text. Additionally, running tools like InterProScan could identify isolated functional domains or motifs within the unknown proteins. Biologically, the function of these genes could be inferred through transcriptomics by observing if their expression significantly changes under specific conditions, such as the heat-stress treatments performed in this study.

⚠️ **GitHub.com Fallback** ⚠️