Cool codes - RichardShao17/Cool-Codes GitHub Wiki

Using Prokka

for k in aquimonas.FASTA blastomonas.FASTA
do prokka $k --outdir "$k".prokka.output
echo $k
done

This is an example for 2 fasta files, for more than 2 simply write the files afterwards (ex. blastamonas.FASTA ecoli.FASTA...; do prokka)

Checking for indels (Thanks Mick Watson)

conda activate snakemake
snakemake --cores 4

Make sure all dependencies (snakemake, prodigal, diamond) are downloaded in the snake-make environment

Making circular genomic plots with arrows linking common genes

conda install -c bioconda minimap2
cat *.fasta > all_genomes.fasta

This step combines many fasta files into one. Just make sure the headers for every genome are unique.

minimap2 -x ava-ont all_genomes.fasta all_genomes.fasta > all_vs_all_mapped.paf

This step allows you to compare every genome with every other genome to look for similarities.

cat all_genomes.fa | awk '$0 ~ ">" {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'

This one liner tells you the length of each fasta file in a concatenated file. Useful for the next step where you need to write in the end point in the data frame. The next steps are in R.

bed1<-all_vs_all_mapped[c(1,3,4,11)]
bed2<-all_vs_all_mapped[c(6,8,9,11)]

df = data.frame(name = c("contig_125", "contig_24", "contig_24$", "contig_255","contig_32", "contig_348","contig_371","contig_458", "contig_464","contig_564","contig_690","contig_788","contig_940","contig_98"),
                start = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0),
                end = c(10000000,10000000,10000000,10000000,10000000,10000000,10000000,10000000,10000000,10000000,10000000,10000000,10000000,10000000),
                stringsAsFactors = FALSE)
circos.genomicInitialize(df)
circos.genomicLink(bed1, bed2, col = rand_color(nrow(bed1)), border = NA)

The end points for each genome (10000000) is a placeholder currently, it should be exactly the length of the respective sequence, which you can find using the one-liner above.

Picking out the annotation of genes which are syntenic

df = NULL
for (i in 1:dim(d.sub)[1]){
  for (j in 1:dim(all_genomes_clean)[1]){
    if (!is.na(all_genomes_clean$V4[j]) && ((d.sub$V3[i] < all_genomes_clean$V4[j] && all_genomes_clean$V4[j] < d.sub$V4[i]) ||
          (d.sub$V3[i] < all_genomes_clean$V5[j] && all_genomes_clean$V5[j] < d.sub$V4[i])) && d.sub$V1[i] == all_genomes_clean$V1[j]){
      df = rbind(df, all_genomes_clean[j,])
    }
  }
}

``
If the start OR end point in all_genomes (PROKKA annotated genes) are contained in the start and end sites of d.sub (synteny output) AND it is the same genome name, only then do we want to keep it. d.sub$V3 and d.sub$4 are respectively the start and end. Same with all_genomes$V4 and all_genomes$V4.