4. Discussion - Kkkzq/Genome-Analysis-paper2 GitHub Wiki

This discussion covers four topics: biological interpretation, validation of the completed analysis, possible improvements to the project, and final conclusions.

Interpretation of Biological results

The table below lists all differentially expressed genes (p-value < 0.01) together with their descriptions, which were produced by the gene ID conversion R script here. The full table can be seen here. The biological functions are based on the NCBI entries.

PITX1 is highly expressed in the hindlimb and weakly expressed in the forelimb, which is clearly visible in the heatmap of all genes. It is involved in organ development and left-right asymmetry. The gene encodes a protein that plays a critical role in hindlimb development and is found primarily in the developing legs and feet.
The expression of TGFBI increases gradually from stage cs15 to cs17. This gene is involved in endochondral bone formation and cell-collagen interactions, which suggests that cartilage formation becomes increasingly important for tissue structure as the embryo develops.
UBE2B expression is higher in the forelimb than in the hindlimb. Since this gene is involved in post-replicative DNA damage repair, one possible explanation is that forelimb development requires faster cell division, which would produce more DNA damage to repair.
PCBD2 has higher expression in the hindlimbs. This gene encodes an enzyme that has been described as a moonlighting protein because it can perform mechanistically distinct functions. The difference in expression may be related to the different motor functions of the forelimbs and hindlimbs.
SEC24A expression is higher in the hindlimbs than in the forelimbs, and in both limbs it increases with developmental stage. A possible explanation is that processes such as protein transport become more active as development progresses, and that the hindlimbs have a greater demand for protein transport than the forelimbs.

hgnc_symbol	description	Biological function
PITX1	paired like homeodomain 1	Involved in organ development and left-right asymmetry; a transcriptional regulator involved in the basal and hormone-regulated activity of prolactin
TGFBI	transforming growth factor beta induced	Involved in cell-collagen interactions; may be involved in endochondral bone formation in cartilage
UBE2B	ubiquitin-conjugating enzyme E2 B	Encodes a member of the E2 ubiquitin-conjugating enzyme family. This enzyme is required for post-replicative DNA damage repair
PCBD2	pterin-4-alpha-carbinolamine dehydratase 2	Encodes a member of the pterin-4-alpha-carbinolamine dehydratase family. The encoded protein has been identified as a moonlighting protein based on its ability to perform mechanistically distinct functions.
SEC24A	SEC24 homolog A, COPII coat complex component	Involved in mediating protein transport from the endoplasmic reticulum
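The selection of genes by p-value can be sketched as follows. This is only a minimal illustration in Python, not the R script used in the project. It assumes the DESeq2 results were exported to a CSV file with a `pvalue` column (the column name DESeq2 uses); the `gene` column name, the helper name `significant_genes`, and the example rows are hypothetical placeholders.

```python
import csv
import io


def significant_genes(handle, column="pvalue", cutoff=0.01):
    """Return the rows whose p-value is strictly below the cutoff.

    Rows with a missing p-value (empty or "NA", as DESeq2 writes for
    filtered genes) are skipped.
    """
    hits = []
    for row in csv.DictReader(handle):
        value = row.get(column) or "NA"
        if value == "NA":
            continue
        if float(value) < cutoff:
            hits.append(row)
    return hits


# Hypothetical placeholder rows, not results from this project.
example = io.StringIO(
    "gene,log2FoldChange,pvalue\n"
    "geneA,3.2,0.0004\n"
    "geneB,0.1,0.62\n"
    "geneC,-1.5,NA\n"
)
for row in significant_genes(example):
    print(row["gene"], row["pvalue"])
```

Passing `column="padj"` would filter on the adjusted p-value instead, which is usually the safer choice when many genes are tested at once.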

Validation of the completed analysis

My genome assemblies were not very successful. Both SOAPdenovo and SPAdes produced too many small contigs, which lowers N50/NG50, and the GC content was 5% lower than that of the reference genome. I chose a k-mer size of 49 following the paper, but it might also be worth trying a k-mer size of 27. I also noticed that in the paper, k-mers in a read with a frequency of 3 or lower were corrected to a more common k-mer; I did not perform this correction, which might have improved the assembly.

The gene annotation with BRAKER could not finish, and I was unable to solve this problem. I was told that many other students in my group faced the same problem, and I was advised to skip the structural annotation step and use the reference genome in order to continue with functional annotation. The functional annotation with eggNOG-mapper was successful and the result can be seen here. While reflecting on my results and scripts, I realized that I had not used all six reading frames when translating the DNA sequence into protein sequence. A new attempt gets 95 queries.
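To make the point about reading frames concrete, here is a minimal Python sketch of a six-frame translation: three frames on the forward strand and three on the reverse complement. It is not the script used in this project. The function names are hypothetical, the standard genetic code is assumed, and any codon containing a base other than A, C, G or T is written as `X`.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to protein strings."""
    frames = {}
    for strand, strand_seq in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```

Translating only frame +1 would miss any gene that starts at another offset or lies on the reverse strand, which is one reason a single-frame translation can give fewer queries for functional annotation.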

RNA mapping with STAR was successful and the quality can be checked here. All uniquely mapped read percentages are around or above 60%, and these BAM files were used for HTSeq counting. HTSeq and DESeq2 ran successfully and a set of differentially expressed genes was found. Compared with the results of the paper, there are some genes in common, such as PITX1, but I still missed many genes such as Msx1 and Msx2, both of which are key genes involved in apoptotic activity during interdigit tissue regression. This is probably because I only used a subset of the reference genome, and many genes are not contained in this subset. I checked the GFF file (the result of structural annotation) for this genome subset and found no genes named Msx1 or Msx2, so this explanation for the missing genes is probably right.
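The GFF check described above could be done along these lines. This is a minimal Python sketch, not the command actually used. It assumes a tab-separated GFF file with attributes in the ninth column, and it matches gene names case-insensitively against every attribute value, because the attribute key that holds the gene symbol differs between annotation sources. The function name and the example lines are hypothetical.

```python
import re


def genes_in_gff(gff_lines, wanted):
    """Report which of the wanted gene names occur in the GFF attribute column."""
    wanted_lower = {name.lower() for name in wanted}
    found = set()
    for line in gff_lines:
        # Skip comments, directives, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        # Split the attribute column on its usual separators and compare tokens.
        for token in re.split(r"[;=,\s\"]+", fields[8]):
            if token.lower() in wanted_lower:
                found.add(token.lower())
    return {name: name.lower() in found for name in wanted}


# Hypothetical example lines, not taken from the project's annotation.
example = [
    "##gff-version 3\n",
    "chr1\tsource\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=PITX1\n",
    "chr1\tsource\tgene\t2000\t2900\t.\t-\t.\tID=gene2;Name=TGFBI\n",
]
print(genes_in_gff(example, ["PITX1", "Msx1", "Msx2"]))
```

For a real file, the lines can be passed directly from `open(path)`. Whole-token matching avoids counting a name that only appears as part of a longer identifier.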

Possible improvement

I should consider changing the k-mer parameter for genome assembly and tuning other parameters to improve the assembly. With a higher-quality assembly, I could use it in the subsequent analysis and try to find more differentially expressed genes.
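When comparing assemblies built with different k-mer sizes, the contiguity and GC metrics mentioned in the validation section can be computed directly from the contigs. The Python sketch below uses the usual definitions: N50 is the length of the contig at which the cumulative length, summed from the longest contig down, first reaches half of the total assembly length, and NG50 uses half of the estimated genome size instead. The function names and the example lengths are hypothetical and do not come from my assemblies.

```python
def nx50(lengths, target_total):
    """Length of the contig at which the running sum reaches half of target_total."""
    half = target_total / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0  # the contigs never cover half of the target


def n50(lengths):
    """N50: the target is the total assembly length."""
    return nx50(lengths, sum(lengths))


def ng50(lengths, genome_size):
    """NG50: the target is the estimated genome size."""
    return nx50(lengths, genome_size)


def gc_fraction(sequences):
    """Fraction of G and C among the A, C, G, T bases of all sequences."""
    gc = at = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0


# Hypothetical contig lengths for illustration only.
contig_lengths = [2, 3, 4, 5, 6]
print(n50(contig_lengths), ng50(contig_lengths, 30))
print(gc_fraction(["GGCCAATT"]))
```

Many small contigs add to the total length without moving the N50 contig much, so a fragmented assembly shows up as a low N50 and an even lower NG50 when the assembly is shorter than the expected genome size.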

I was stuck on the BRAKER step for almost two weeks. Even with more time, I am not sure I could have solved this problem. If I could restart this project, I would try a different set of bioinformatics tools for structural genome annotation.

Before the project started, I was looking forward to trying the ChIP-seq analysis and the analysis of long non-coding RNAs, but because of the slow progress of the structural annotation step, I did not have enough time to reach this goal.

Final Conclusions

During this project, I became familiar with different bioinformatics tools and managed to go through the whole pipeline and get results. Although some analyses could be improved with more time and parameter evaluation, the goal of this course has been achieved.

Bioinformatics analysis is a complex topic, and I still need more experience and practice to become more familiar with the different tools. This course is a great start for me in understanding what an actual bioinformatics research project involves.

Special Thanks

I am very grateful to my teaching assistant Zeynep for her patient help during the course. I hope that all the bugs we solved together can provide material for the preparation of future lab manuals.