5. Daily Work Log - Kkkzq/Genome-Analysis-paper2 GitHub Wiki

2021/4/6: FastQC and trimming. Update the workflow figure on the wiki. Still struggling to use rsync on Windows; the problem is not solved yet.
2021/4/7: Solved the rsync problem. Sent the FastQC and trimming results to the local computer.
2021/4/8: Push the FastQC and trimming results to the GitHub master branch, along with the scripts used on 2021/4/6. Tried to merge the main and master branches but failed. Decided to work with the master branch.
2021/4/12: Finish the DNA assembly batch jobs using SPAdes and SOAPdenovo.
2021/4/13: Push the DNA assembly code to GitHub. Start writing scripts for assembly evaluation. Fix the time plan (add GapCloser after SOAPdenovo).
2021/4/14: Submit the GapCloser batch job. Finish the assembly evaluation for SPAdes. Update the wiki method page.
2021/4/15: Add evaluation of the SOAPdenovo assembly without GapCloser to the time plan table. Submit the related batch job, which finished. Write scripts for RNA assembly.
2021/4/16: Finish the assembly evaluations. Tried to solve the problems with running Trinity but failed. Update the wiki time plan.
2021/4/19: Write scripts for the soft mask. Discuss questions about future workflow with TA.
2021/4/20: Solve the questions about future workflow. Write method and result part of my wiki.
2021/4/21: Write a script for building a RepeatMasker library with RepeatScout. The filter step fails; after searching for a solution, the cause seems to be an incompletely installed Perl module. Decided to skip the filter step and run RepeatMasker directly on the original, unfiltered .repseq.fa file, using the -q parameter to speed RepeatMasker up. Result: the unfiltered .repseq.fa file makes RepeatMasker fail, so it can't be used.

TA reminder: NEVER use more than 2 cores to run scripts, because it might waste CPU hours. I shouldn't make that mistake again.

2021/4/22: Underestimated the running time of RepeatMasker: the 12 hours I set were not enough. Write the STAR scripts (need to be checked by the TAs).
2021/4/26: Finish setting the parameters in the STAR script; will run it in the next lab session. Write the method page of the GitHub wiki.

Here is a simple plan for future analysis:

1. Run STAR with all RNA-seq files (for structural annotation)
2. Run BRAKER to finish the structural annotation
3. eggNOG-mapper for functional annotation
4. Run STAR separately for each limb and time stage to prepare for differential expression analysis
5. HTSeq and DESeq2.

2021/4/26: Solved the segmentation fault in the RNA mapping with a workaround: I wrote a loop that runs one fq file at a time. This produced a lot of BAM files, and my TA told me BRAKER can take several BAM files as input, so that's fine.
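The loop itself is not recorded in this log. A minimal sketch of the idea, assuming gzipped reads named `*.fq.gz` and hypothetical directory names (`star_index`, `rna_reads`, `star_out`), could look like this. It prints the commands by default so they can be checked before any CPU hours are spent, and it keeps to 2 threads as the TA advised:

```shell
#!/usr/bin/env bash
# Sketch only: directory names and the *.fq.gz pattern are assumptions, not the original script.
GENOME_DIR="${GENOME_DIR:-star_index}"   # hypothetical STAR index directory
READ_DIR="${READ_DIR:-rna_reads}"        # hypothetical folder with the trimmed reads
OUT_DIR="${OUT_DIR:-star_out}"           # hypothetical output folder

# Print the command when DRY_RUN=1 (the default); execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Map a single read file in its own STAR run; output files are prefixed with the sample name.
map_one() {
    local fq="$1"
    local sample
    sample="$(basename "$fq" .fq.gz)"
    run STAR --runThreadN 2 --genomeDir "$GENOME_DIR" \
        --readFilesIn "$fq" --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix "$OUT_DIR/${sample}_"
}

run mkdir -p "$OUT_DIR"
for fq in "$READ_DIR"/*.fq.gz; do
    [ -e "$fq" ] || continue   # skip when the glob matches nothing
    map_one "$fq"
done
```

Running it with `DRY_RUN=0` executes the commands instead of printing them. Each run writes its own sorted BAM file, which matches the "lots of bam files" outcome described above.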
2021/5/3: Tried to run BRAKER and got this error: "The hints file is empty. Maybe the genome and the RNA-seq file do not belong together." Solved by using this command, which removes everything after the first space in the headers of the FASTA file:

cut -d ' ' -f1 closed_gaps.fasta.masked > new_genome.fa
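To see what this command does, here is a small self-contained check on a toy FASTA file. The file content and header annotations are made up for illustration; only the `cut` call is the one used above. A likely reason it helps is that sequence names containing spaces can end up differing between the genome file and the BAM files, but that is my assumption, not something I verified.

```shell
# Toy FASTA standing in for closed_gaps.fasta.masked; the header text after the space is invented.
tmp="$(mktemp -d)"
printf '>scaffold1 length=12 cov=3.5\nACGTACGTACGT\n>scaffold2 length=8\nacgtNNNN\n' > "$tmp/genome.fa"

# Same command as above: keep only the first space-delimited field of every line.
cut -d ' ' -f1 "$tmp/genome.fa" > "$tmp/new_genome.fa"

# Headers are now bare IDs; sequence lines contain no spaces, so they pass through unchanged.
grep '>' "$tmp/new_genome.fa"
```

The soft-masked (lowercase) bases in the sequence lines are untouched, since `cut` only drops text after a space.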

However, because of the low quality of my assembly, I decided to re-run STAR and BRAKER using the reference genome.

2021/5/4: It turns out that BRAKER gives the same error with my assembly and with the reference genome. Perhaps there is something wrong with the RNA mapping. I noticed that the U_fq.bam files are clearly smaller than the P_fq.bam files, so I set up a test folder that uses only the P files for the annotation; it still failed. During the discussion with my TA, I realized that I should probably map the paired RNA files pair by pair.
2021/5/7: I did the RNA mapping again with the reference genome and tried to run BRAKER again; it still failed at the same step. I have completely run out of solutions and I'm really confused. I checked all the paths and the .gm_key file in my home directory, and I ran BRAKER with both my genome and the reference genome, but every run failed at the same GeneMark step.
2021/5/11: According to my TA: "If you feel stuck, you can get the genes on the scaffold you’ve been working with from the original gff file, create a protein fasta file from that, and continue with the rest of the analysis." I will skip the BRAKER step and go on with eggNOG-mapper.
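Mapping pair by pair means giving STAR both mates of a pair in one `--readFilesIn` call. A minimal sketch of how the mates could be matched up by file name is below. The `_1P.fq.gz` / `_2P.fq.gz` suffixes and the directory names are assumptions for illustration; the real file names are not recorded in this log.

```shell
# Print "mate1<TAB>mate2" for every *_1P.fq.gz file that has a matching *_2P.fq.gz file.
# The suffixes are hypothetical; adjust them to the actual naming of the paired files.
list_pairs() {
    local dir="$1"
    local r1 r2
    for r1 in "$dir"/*_1P.fq.gz; do
        [ -e "$r1" ] || continue               # nothing matched the glob
        r2="${r1%_1P.fq.gz}_2P.fq.gz"          # swap the suffix to get the mate file
        if [ -e "$r2" ]; then
            printf '%s\t%s\n' "$r1" "$r2"
        else
            echo "warning: no mate found for $r1" >&2
        fi
    done
}

# Dry run: print one STAR command per pair (remove the leading "echo" to execute).
list_pairs "${READ_DIR:-rna_reads}" | while IFS="$(printf '\t')" read -r r1 r2; do
    prefix="$(basename "$r1" _1P.fq.gz)_"
    echo STAR --runThreadN 2 --genomeDir "${GENOME_DIR:-star_index}" \
        --readFilesIn "$r1" "$r2" --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate --outFileNamePrefix "$prefix"
done
```

Files without a mate are reported on stderr and skipped, so a missing or misnamed file shows up before any mapping job is submitted.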

sed -n '141191,142375p' GCF_001595765.1_Mnat.v1_genomic.gff > NW_015504249.1.gff
bedtools getfasta -fi sel3_NW_015504249.fna -bed NW_015504249.1.gff -fo sel3_genome.fasta # get genome fasta file
# or
module load cufflinks/2.2.1
gffread -w output_transcripts.fa -g sel3_NW_015504249.fna NW_015504249.1.gff
# get protein fasta file
module load emboss/6.6.0
transeq output_transcripts.fa protein.fa
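The `sed -n` line range above only works for this exact copy of the gff file, because it depends on line numbers. A possible alternative, sketched here with a hypothetical helper name (`extract_scaffold`), is to select features by the sequence ID in the first column:

```shell
# Print every GFF feature line whose first (tab-separated) column equals the given sequence ID.
# Comment lines starting with '#' never match, so they are dropped.
extract_scaffold() {
    local scaffold="$1"
    local gff="$2"
    awk -F '\t' -v id="$scaffold" '$1 == id' "$gff"
}

# Usage with the files from this log (not run here):
# extract_scaffold NW_015504249.1 GCF_001595765.1_Mnat.v1_genomic.gff > NW_015504249.1.gff
```

I have not checked that this gives exactly the same lines as the `sed` range, so the two outputs should be compared once before switching.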

The resulting protein FASTA file is only 64 KB, so I chose the online eggNOG-mapper for the functional annotation. The online annotation ran for about 2 hours. Downloaded the output files to my local computer and pushed the results to GitHub.

2021/5/19: Finish the HTSeq and DESeq2 analysis. Push all code and results to GitHub.