Cool codes - RichardShao17/Cool-Codes GitHub Wiki
Using Prokka
for k in aquimonas.FASTA blastomonas.FASTA
do prokka $k --outdir "$k".prokka.output
echo $k
done
This is an example for 2 fasta files, for more than 2 simply write the files afterwards (ex. blastamonas.FASTA ecoli.FASTA...; do prokka)
Checking for indels (Thanks Mick Watson)
conda activate snakemake
snakemake --cores 4
Make sure all dependencies (snakemake, prodigal, diamond) are downloaded in the snake-make environment
Making circular genomic plots with arrows linking common genes
conda install -c bioconda minimap2
cat *.fasta > all_genomes.fasta
This step combines many fasta files into one. Just make sure the headers for every genome are unique.
minimap2 -x ava-ont all_genomes.fasta all_genomes.fasta > all_vs_all_mapped.paf
This step allows you to compare every genome with every other genome to look for similarities.
cat all_genomes.fa | awk '$0 ~ ">" {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'
This one liner tells you the length of each fasta file in a concatenated file. Useful for the next step where you need to write in the end point in the data frame. The next steps are in R.
bed1<-all_vs_all_mapped[c(1,3,4,11)]
bed2<-all_vs_all_mapped[c(6,8,9,11)]
df = data.frame(name = c("contig_125", "contig_24", "contig_24$", "contig_255","contig_32", "contig_348","contig_371","contig_458", "contig_464","contig_564","contig_690","contig_788","contig_940","contig_98"),
start = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0),
end = c(10000000,10000000,10000000,10000000,10000000,10000000,10000000,10000000,10000000,10000000,10000000,10000000,10000000,10000000),
stringsAsFactors = FALSE)
circos.genomicInitialize(df)
circos.genomicLink(bed1, bed2, col = rand_color(nrow(bed1)), border = NA)
The end points for each genome (10000000) is a placeholder currently, it should be exactly the length of the respective sequence, which you can find using the one-liner above.
Picking out the annotation of genes which are syntenic
df = NULL
for (i in 1:dim(d.sub)[1]){
for (j in 1:dim(all_genomes_clean)[1]){
if (!is.na(all_genomes_clean$V4[j]) && ((d.sub$V3[i] < all_genomes_clean$V4[j] && all_genomes_clean$V4[j] < d.sub$V4[i]) ||
(d.sub$V3[i] < all_genomes_clean$V5[j] && all_genomes_clean$V5[j] < d.sub$V4[i])) && d.sub$V1[i] == all_genomes_clean$V1[j]){
df = rbind(df, all_genomes_clean[j,])
}
}
}
``
If the start OR end point in all_genomes (PROKKA annotated genes) are contained in the start and end sites of d.sub (synteny output) AND it is the same genome name, only then do we want to keep it. d.sub$V3 and d.sub$4 are respectively the start and end. Same with all_genomes$V4 and all_genomes$V4.