Daily Log - Linafina100/GenomeAnalysis GitHub Wiki

31/03-26: Finished project plan. Set up my directory structure. Linked the whole-genome sequence files from the course directory into the raw-data folder inside my data folder. Started pre-processing the raw data with FastQC, allocating 2 cores and 1 hour. Pre-processed the chromosome 3 gz-files, allocating 2 cores and 2 hours.

04/04-26: Reorganized the directory structure within the preprocessing folder to separate chromosome 3 and whole-genome data for better traceability. Updated the metadata file (sample_info.csv) with specific sample IDs and raw filenames. Inspected FastQC reports for chromosome 3 Illumina reads; confirmed 0% adapter contamination and high per-base quality. Created and executed a Trimmomatic script to trim low-quality "tails" using a sliding window (4:15) and a minimum length filter of 36bp, allocating 2 cores and 2 hours. Process interrupted by network timeout (Broken Pipe) post-execution. Verified output integrity via zcat and file size inspection (~1.4GB per paired file). High ratio of paired to unpaired data (1.4GB vs 12MB) indicates a high survival rate, estimated at >95%.
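
For reference, the sliding-window rule (4:15) and the 36 bp minimum-length filter described above can be sketched in Python. This is a simplified illustration, not Trimmomatic's actual implementation: the function names are hypothetical, quality scores are assumed to be already decoded to Phred integers, and reads shorter than the window are left untouched as a simplification.

```python
def sliding_window_keep(quals, window=4, min_avg=15):
    """Return how many leading bases to keep.

    Scans from the 5' end and cuts at the start of the first window
    whose average quality drops below min_avg (simplified rule).
    """
    n = len(quals)
    if n < window:
        return n  # simplification: too short to evaluate a full window
    for start in range(n - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return start
    return n


def passes_minlen(kept_length, minlen=36):
    """Minimum-length filter: keep the read only if enough bases remain."""
    return kept_length >= minlen


if __name__ == "__main__":
    # Hypothetical read: 40 good bases followed by a low-quality tail.
    example = [30] * 40 + [2] * 10
    kept = sliding_window_keep(example)
    print(kept, passes_minlen(kept))
```

A read with a short low-quality tail loses only the tail and still passes the length filter, which fits the high survival rate seen here.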

06/04-26: Re-ran Trimmomatic with output logging to capture trimming statistics. Successfully completed trimming of chromosome 3 Illumina reads.

Results showed:

  • Input read pairs: 20,979,851
  • Both surviving: 98.85%
  • Forward only: 0.70%
  • Reverse only: 0.42%
  • Dropped: 0.03%

These results confirm very high read quality and minimal data loss during trimming, consistent with the previous FastQC analysis. A second FastQC run was performed to check the quality of the trimmed sequences.
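
As a quick sanity check, the reported percentages can be verified to sum to 100% and converted back into approximate read-pair counts. This is a minimal sketch with hypothetical names. Because the percentages are rounded to two decimals, each count is only accurate to roughly ±1,000 pairs.

```python
# Values taken from the Trimmomatic summary reported above.
INPUT_PAIRS = 20_979_851
PERCENT = {
    "both_surviving": 98.85,
    "forward_only": 0.70,
    "reverse_only": 0.42,
    "dropped": 0.03,
}


def percent_total(percentages):
    """Sum the categories; should be 100.0 if nothing is missing."""
    return round(sum(percentages.values()), 2)


def approximate_counts(total_pairs, percentages):
    """Convert rounded percentages into approximate read-pair counts."""
    return {name: round(total_pairs * pct / 100)
            for name, pct in percentages.items()}


if __name__ == "__main__":
    print("Total percent:", percent_total(PERCENT))
    for name, count in approximate_counts(INPUT_PAIRS, PERCENT).items():
        print(f"{name}: ~{count:,} pairs")
```

The exact counts are in the Trimmomatic log itself; the values computed here are only estimates derived from the rounded percentages.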

07/04-26: Updated wiki and wrote pre-processing part.

13/04-26: Fixed the figures folder and gitignore. Wrote the assembly script and started the assembly run. The Flye run was successful; key statistics are summarised in 01 Assembly.

14/04-26: Created script for polishing. Ran polishing.

15/04-26: Evaluated the assembly with Quast and BUSCO and masked repetitive sequences with DNAMasker. Worked on the wiki. Realised I had been following the wrong manual on the Studium start page. Planned for a bigger bottleneck when running BRAKER3, since only one core is used. No jobs could be run that day because of resource limitations. Continued writing the wiki.

16/04-26: Wrote a BRAKER script combined with the mapping step. Ran the script over the weekend.

20/04-26: Mapping was successful, but BRAKER failed because of a path located outside the container. Need help in tomorrow's lab session.

21/04-26: BRAKER kept failing because the species was not correctly configured. With the species set it ran, but then failed because GeneMark was not set up correctly. After that, BRAKER failed again: gene prediction was limited by the reduced genomic scope (chromosome 3 only), resulting in insufficient gene models for reliable AUGUSTUS training. This required relaxing training parameters or skipping optimization, highlighting the dependency of ab initio gene prediction on sufficient genomic context and evidence density.

23/04-26: BRAKER was run again with min contigs and succeeded. Updated the wiki for annotation. Answered questions 1-14 for grade 4.

24/04-26: Created an eggNOG script and put it in the queue. Had trouble finding the database, but it was found on the Rackham server. Updated the wiki for annotation, mapping and BRAKER. Created a log script to view BRAKER results for the wiki analysis.

26/04-26: Inspected reads in IGV, wrote the analysis on the wiki and answered questions on mapping and annotation. Saw that I need per-sample mapping for DESeq2, so wrote a script that does not merge the mapped BAM files but keeps the samples separate.

27/04-26: Wrote a script for feature counting. Redid the last 2 samples of the per-sample STAR mapping because time ran out; simply changed the samples in the same script.