1. Overview - Kkkzq/Genome-Analysis-paper2 GitHub Wiki
1.1. Aim
The aim of this project is to detect genetic determinants related to the development of bat wings by characterizing transcriptomic and epigenomic data. Over the course of the project, a range of bioinformatics tools were used, which also served as practice in working with them. Note that the results of this project may differ from those in the paper, because only part of the original data is used.
1.2. Time plan
| Day | Hours | Process | Status | Data Type | Software | Estimated Time | Deadline |
|---|---|---|---|---|---|---|---|
| 26/3 | 2 | Seminar | Completed | -- | -- | 2h | Seminar |
| 1/4 | -- | Project Planning | Completed | -- | GitHub | 4h | -- |
| 6/4 | 4 | Reads quality control | Completed | Illumina | FastQC | 3min | Project planning |
| 6/4 | 4 | Reads preprocessing | Completed | Illumina | Trimmomatic | 1-10 min/file (2 cores) | Project planning |
| 9/4 | 4 | Genome Assembly | Completed | Illumina | SOAPdenovo | ~1.5h (2 cores) | -- |
| 9/4 | 4 | Assembly evaluation for SOAPdenovo (no GapCloser) | Completed | Fasta sequences | MUMmerplot | < 5min (1 core) | -- |
| 14/4 | 4 | Genome Assembly | Completed | Illumina | SPAdes | ~12 h (12 cores) | -- |
| 14/4 | 4 | Genome Assembly (after SOAPdenovo) | Completed | Illumina | GapCloser | -- | -- |
| 15/4 | 4 | Assembly evaluation for SPAdes | Completed | Fasta sequences | MUMmerplot | < 5min (1 core) | Genome Assembly |
| 15/4 | 4 | Assembly evaluation for SOAPdenovo (after GapCloser) | Completed | Fasta sequences | MUMmerplot | < 5min (1 core) | -- |
| 15/4 | 4 | Transcriptome assembly | Incomplete | Illumina RNA | Trinity | ~1.5h (2 cores) | Genome Assembly |
| 20/4 | 4 | Genome Annotation | Completed | Eukaryotes | BRAKER | ? | -- |
| 20/4 | 4 | Functional Annotation | Completed | Eukaryotes | eggNOG-mapper | ? | -- |
| 28/4 | 4 | Calculate heterozygosity of genome | Incomplete | -- | BWA | ~1.5h (WGS, 2 cores) | -- |
| 29/4 | 4 | RNA mapping | Completed | Eukaryotes RNA | STAR | ? | Annotation |
| 29/4 | 4 | Differential Expression | Completed | -- | HTSeq | < 5min (1 core) | -- |
| 29/4 | 4 | Differential Expression | Completed | -- | DESeq2 | ? | -- |
| 3/5 | 4 | Compare lncRNAs to databases | Incomplete | lncRNA | lncRNAdb, GENCODE lncRNA | 1 h | -- |
| 3/5 | 4 | Compare homologous lncRNAs | Incomplete | lncRNA | BLAST | 1 h | -- |
| 3/5 | 4 | Visualize ChIP-seq data | Incomplete | ChIP-seq data | DiffBind in R | ? | -- |
| 4/5 | 4 | Deeper analysis of differential expression | Partly Completed | -- | hclust, GO analysis | -- | RNA mapping |
| 17/5 | 4 | Check; finish wiki | Completed | -- | GitHub | -- | -- |
| 19/5 | 4 | Check; finish wiki | Completed | -- | GitHub | -- | -- |
| 24/5 | 4 | Final version of wiki | Incomplete | -- | GitHub | -- | Check all |
1.3. Data
DNA was extracted from the leg muscle tissue of a single male Miniopterus natalensis (M. natalensis). RNA was extracted from paired forelimbs and hindlimbs from three individuals (biological replicates) at three developmental stages (CS15, CS16, and CS17). A subset of the original data is provided through UPPMAX under /proj/g2021012/2_Eckalbar_2016.
This folder contains four datasets: sel1, sel2, sel3, and sel4. Each dataset contains data from whole-genome sequencing (WGS), ChIP-sequencing, and RNA-sequencing. An additional folder contains the reference genome assemblies for each subset of data.
| Data | Tissue | Stage | Sequencing platform |
|---|---|---|---|
| WGS | Leg muscle of a single M. natalensis male | Adult | Illumina |
| ChIP-seq | Bat forelimb and hindlimb | CS15, CS16, CS17 | Illumina |
| RNA-seq | Bat forelimb and hindlimb | CS15, CS16, CS17 | Illumina |
WGS Data
The data in the following table is used for DNA assembly (SOAPdenovo).
| Run | Average library insert size (bp) |
|---|---|
| SRR5819794 | 400 |
| SRR5819795 | 800 |
| SRR5819796 | 9000 |
| SRR5819797 | 175 |
| SRR5819798 | 3000 |
| SRR5819799 | 5500 |
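As an illustration of how these six libraries could be described to SOAPdenovo, the following is a minimal sketch that builds a configuration file from the insert sizes above. It is not the project's actual soapdenovo_config.txt. The read file names (`<run>_1.fastq`, `<run>_2.fastq`), the `reads` directory, the `max_rd_len` value of 100, and the 2000 bp threshold used to treat a library as long-insert (`reverse_seq=1`, scaffolding only) are all assumptions made for the example.

```python
# Average insert sizes (bp) copied from the table above.
INSERT_SIZES = {
    "SRR5819794": 400,
    "SRR5819795": 800,
    "SRR5819796": 9000,
    "SRR5819797": 175,
    "SRR5819798": 3000,
    "SRR5819799": 5500,
}


def soapdenovo_config(insert_sizes, read_dir="reads", max_rd_len=100):
    """Return SOAPdenovo config text with one [LIB] block per run.

    read_dir, max_rd_len and the file naming scheme are hypothetical.
    """
    lines = [f"max_rd_len={max_rd_len}"]
    # Libraries are ranked by insert size, smallest first.
    ordered = sorted(insert_sizes.items(), key=lambda item: item[1])
    for rank, (run, avg_ins) in enumerate(ordered, start=1):
        # Assumption: libraries of 2 kb or more are long-insert libraries.
        long_insert = avg_ins >= 2000
        lines += [
            "[LIB]",
            f"avg_ins={avg_ins}",
            f"reverse_seq={1 if long_insert else 0}",
            # 3 = contig building and scaffolding, 2 = scaffolding only.
            f"asm_flags={2 if long_insert else 3}",
            f"rank={rank}",
            f"q1={read_dir}/{run}_1.fastq",
            f"q2={read_dir}/{run}_2.fastq",
        ]
    return "\n".join(lines) + "\n"


config_text = soapdenovo_config(INSERT_SIZES)
print(config_text)
```

Ranking by insert size means the shortest libraries are used first during scaffolding. The thresholds and flags should be checked against the SOAPdenovo manual and the real library preparation before use.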
RNA Data
| ID | Tissue | Developmental stage |
|---|---|---|
| SRR1719013 | Forelimb | CS15 |
| SRR1719014 | Forelimb | CS15 |
| SRR1719015 | Forelimb | CS15 |
| SRR1719016 | Hindlimb | CS15 |
| SRR1719017 | Hindlimb | CS15 |
| SRR1719018 | Hindlimb | CS15 |
| SRR1719204 | Forelimb | CS16 |
| SRR1719206 | Forelimb | CS16 |
| SRR1719207 | Forelimb | CS16 |
| SRR1719212 | Hindlimb | CS16 |
| SRR1719214 | Hindlimb | CS16 |
| SRR1719242 | Hindlimb | CS16 |
| SRR1719208 | Forelimb | CS17 |
| SRR1719209 | Forelimb | CS17 |
| SRR1719211 | Forelimb | CS17 |
| SRR1719213 | Hindlimb | CS17 |
| SRR1719241 | Hindlimb | CS17 |
| SRR1719266 | Hindlimb | CS17 |
These datasets include both paired and unpaired reads.
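The RNA table can be turned into a small sample sheet that groups runs by tissue and developmental stage, which is the kind of grouping a differential expression design needs. The sketch below is only an illustration: the names `SAMPLES` and `group_replicates` are hypothetical and are not part of the project scripts.

```python
from collections import defaultdict

# (run, tissue, stage) triples copied from the RNA table above.
SAMPLES = [
    ("SRR1719013", "Forelimb", "CS15"),
    ("SRR1719014", "Forelimb", "CS15"),
    ("SRR1719015", "Forelimb", "CS15"),
    ("SRR1719016", "Hindlimb", "CS15"),
    ("SRR1719017", "Hindlimb", "CS15"),
    ("SRR1719018", "Hindlimb", "CS15"),
    ("SRR1719204", "Forelimb", "CS16"),
    ("SRR1719206", "Forelimb", "CS16"),
    ("SRR1719207", "Forelimb", "CS16"),
    ("SRR1719212", "Hindlimb", "CS16"),
    ("SRR1719214", "Hindlimb", "CS16"),
    ("SRR1719242", "Hindlimb", "CS16"),
    ("SRR1719208", "Forelimb", "CS17"),
    ("SRR1719209", "Forelimb", "CS17"),
    ("SRR1719211", "Forelimb", "CS17"),
    ("SRR1719213", "Hindlimb", "CS17"),
    ("SRR1719241", "Hindlimb", "CS17"),
    ("SRR1719266", "Hindlimb", "CS17"),
]


def group_replicates(samples):
    """Map each (tissue, stage) condition to its list of run accessions."""
    groups = defaultdict(list)
    for run, tissue, stage in samples:
        groups[(tissue, stage)].append(run)
    return dict(groups)


groups = group_replicates(SAMPLES)
for (tissue, stage), runs in sorted(groups.items()):
    print(f"{tissue}\t{stage}\t{','.join(runs)}")
```

Each of the six tissue and stage combinations should come out with three runs, matching the three biological replicates described above.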
Data Structure
The data and scripts of this project are stored in the following structure. The raw data is in the 2_Eckalbar_2016 folder and is kept unchanged during the project. The unziped_data folder stores data that has been copied and unzipped from the raw data folder. All scripts are in the script folder and all results are saved in the result folder.
.
├── 2_Eckalbar_2016 -> /proj/g2021012/2_Eckalbar_2016/
├── result
│ ├── 1_quality_control
│ ├── 2_trimming
│ ├── 3_dna_assembly
│ ├── 4_assembly_validation
│ ├── 5_transcriptome_assembly
│ ├── 6_repeatmasker
│ ├── 6_softmask_test
│ ├── 7_ref_mapping
│ ├── 7_rna_mapping
│ ├── 8_braker_annotation
│ ├── 8_ref_annotation
│ ├── 8_ref_eggnog
│ ├── 9_htseq
│ └── test
├── script
│ ├── 1_fastqc_chip.sh
│ ├── 1_fastqc_rna_raw_trimmed.sh
│ ├── 1_fastqc_rna.sh
│ ├── 1_fastqc_rna_trim.sh
│ ├── 1_fastqc_wgs.sh
│ ├── 2_trim_rna.sh
│ ├── 3_GapCloser_SOAP_wgs.sh
│ ├── 3_SOAPdenovo_wgs.sh
│ ├── 3_spades_assembly_wgs.sh
│ ├── 4_MUMmer_soapdenovo_nogapcloser_wgs.sh
│ ├── 4_MUMmer_soapdenovo_wgs.sh
│ ├── 4_MUMmer_spades_wgs.sh
│ ├── 5_trinity_rna_pair.sh
│ ├── 5_trinity_rna_single.sh
│ ├── 6_hard_repeatmasker.sh
│ ├── 6_repeatmasker_wgs.sh
│ ├── 6_repeatscout_wgs_filter.sh
│ ├── 6_repeatscout_wgs.sh
│ ├── 6_soft_repeatmasker.sh
│ ├── 7_ref_mapping.sh
│ ├── 7_rna_mapping.sh
│ ├── 7_test.sh
│ ├── 8_annotation_new.sh
│ ├── 8_BRAKER_annotation.sh
│ ├── 8_genemark.sh
│ ├── 8_ref_annotation.sh
│ ├── 8_ref_eggnog.sh
│ ├── 8_test.sh
│ ├── 9_htseq_count.sh
│ ├── slurm_log
│ ├── soapdenovo_config.txt
│ └── soapdenovo.log
└── unziped_data
  ├── GCF_001595765.1_Mnat.v1_genomic.gff
  ├── NW_015504249.1.gff
  ├── quast_output
  ├── sel2_NW_015504334.fna
  └── sel3_NW_015504249.fna
21 directories, 38 files
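A layout like the one above could be recreated with a short script. The following is a minimal sketch, not the way the project folders were actually created. The function name `create_layout` is hypothetical, `slurm_log` is assumed to be a directory, and the demonstration runs in a temporary directory so that it does not touch real project data.

```python
import tempfile
from pathlib import Path

# Result subfolders as listed in the tree above.
RESULT_DIRS = [
    "1_quality_control", "2_trimming", "3_dna_assembly",
    "4_assembly_validation", "5_transcriptome_assembly",
    "6_repeatmasker", "6_softmask_test", "7_ref_mapping",
    "7_rna_mapping", "8_braker_annotation", "8_ref_annotation",
    "8_ref_eggnog", "9_htseq", "test",
]


def create_layout(root, raw_data="/proj/g2021012/2_Eckalbar_2016"):
    """Create the project skeleton under root and link the raw data."""
    root = Path(root)
    for name in RESULT_DIRS:
        (root / "result" / name).mkdir(parents=True, exist_ok=True)
    # Assumption: slurm_log is a directory that holds job logs.
    (root / "script" / "slurm_log").mkdir(parents=True, exist_ok=True)
    (root / "unziped_data").mkdir(parents=True, exist_ok=True)
    # The raw data is linked rather than copied, so it stays unchanged.
    link = root / "2_Eckalbar_2016"
    if not link.is_symlink() and not link.exists():
        link.symlink_to(raw_data)
    return root


with tempfile.TemporaryDirectory() as tmp:
    project = create_layout(tmp)
    created = sorted(p.name for p in (project / "result").iterdir())
    print(created)
```

Using a symbolic link for the raw data matches the `2_Eckalbar_2016 -> /proj/g2021012/2_Eckalbar_2016/` entry in the tree and avoids duplicating large sequencing files.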
1.4. Workflow
