Archive1 - ASBioinfo/Utils-hub GitHub Wiki
1. Biser
- Before proceedings run the hardmask the sequence using trffinder
biser -T TEMP -t 40 -o brahman.bed --keep-contigs --no-decomposition --max-error=20 --max-edit-error=10 ./Brahman_trf_RM_bed.fasta
- Giving output at .bedpe format
2. Survivour
SURVIVOR merge list.txt 1000 1 1 0 0 50 final.vcf
Here,
-
1000--minimum length to be merged
-
1(field after 1000)--support by one caller
-
50--length greater than 50
Other fields can be ignored,or go for strict filter as per requirement
- SV type other than NA and extract file in matrix format:
awk '!/SVTYPE=NA/{print $0}' final.vcf|perl -ne 'print "$1\n" if /SUPP_VEC=([^,;]+)/' -|sed -e 's/\(.\)/\1 /g' >sample_merged_overlapp.txt
Input command in R (after loading the dataset)
venn.diagram(list(Gir = which(a[, 1] == 1), Kankrej = which(a[, 2] == 1),Tharparkar = which(a[, 3] == 1), Sahiwal = which(a[, 4] == 1), Red_sindhi = which(a[, 5] == 1)),fill = c ("gray", "orange", "blue", "red", "green"), alpha =c(0.5, 0.5, 0.5, 0.5, 0.5), cex = 1, lty =1, filename ="my_sample_overlapp.tiff");
3. SyRI
- Ragtag for assembly correction/mismatch of scaffold
ragtag.py scaffold ../GCF_003369695.1_UOA_Brahman_1_genomic.fna ../LGP01_2_assembly_sequence.fasta
- BUSCO analysis to check completeness of assemblies(same for all assemblies) *Here lineages database downloaded and path was given
busco -i LGP01_2_ragtag_scaffold.fasta -l /home/sarwar/abhisek/busco/busco_downloads/lineages/mammalia_odb10/ -o out_busco --out_path /home/sarwar/abhisek/ragtag/busco_out/out_busco_1 -m geno -c 30
or
busco -i LGP02_ragtag.fasta -l /home/niab/app/busco/downloads/mammalia_odb10/ -o LGP02_out_busco -m genome -c 80 --offline
- BUSCO plot generation
- Place all BUSCO output file i.e(short_summary.specific.mammalia_odb10.LGP02_Kankrej.txt) in a single directory of all assembly files and run below command
python /home/sarwar/app/busco/scripts/generate_plot.py -wd ./
- RepeatMasker used for masking
RepeatMasker -pa 38 -q -species cow LGP01_2_ragtag_chr.fasta
- nucmer for alignment (same for all assemblies)
nucmer --maxmatch -c 500 -b 500 -l 100 -t 60 /home/niab/abhisek/syri/ref/reference_chr.fna.masked /home/niab/abhisek/syri/LGP01_2/LGP01_2_ragtag_chr.fasta.masked
- mummer for delta filtering(same for all assemblies)
delta-filter -m -i 90 -l 100 out.delta > out_m_i90_l100.delta
- mummer for getting coordinates (same for all assemblies)
show-coords -THrd out_m_i90_l100.delta > out_m_i90_l100.coords
- Running syri(Choose ref & query assemblies in correct order for plot generation)
syri -c out_m_i90_l100.coords -r /home/niab/abhisek/syri/ref/reference_chr.fna.masked -q /home/niab/abhisek/syri/LGP01_2/LGP01_2_ragtag_chr.fasta.masked -d out_m_i90_l100.delta --nc 60 --prefix ref_LGP01_2 --all
- Running plotsr for syri plot generation (For all assemblies in a single plot)
link all assemblies fasta file in a single directories
put all syri.out files in the same folder
Make a genomes.txt file with a specific format that provided in website
then run below command
plotsr --sr ref_LGP01_2syri.out --sr LGP01_2_LGP02syri.out --sr LGP02_LGP03syri.out --sr LGP03_LGP04syri.out --sr LGP04_LGP05syri.out --genomes genomes.txt -H 40 -W 50 -o plot.png
4. popins
- Run sample wise
/home/niab/app/PopIns2/popins2 assemble -t 12 --sample L5_P5843_LGP4 /home/niab/naveen/cattle_analysis/popins_with_bam/2.Alignment_log/L5_P5843_LGP4_trim/L5_P5843_LGP4.bam
/home/niab/app/PopIns2/popins2 contigmap -t 12 11_S13
/home/niab/app/PopIns2/popins2 place-refalign
/home/niab/app/PopIns2/popins2 place-splitalign 11_S13
/home/niab/app/PopIns2/popins2 place-finish
/home/niab/app/PopIns2/popins2 genotype 11_S13
5. Augustus
augustus --species=human --singlestrand=true --genemodel=partial nui_final_set.fa >nui_final_set_partial
augustus --gff3=on --singlestrand=true --genemodel=complete cd_hit_out_contig_name_changed.fa --species=human > cd_hit_out_contig_name_changed.gff (For GFF)
6. cdhit
cd-hit-est -i bovidae_protein.faa -o bovidae_protein_CD_hit.fa -T 50 -aS 0.9 -g 1 -c 0.9 -M 0 (Nucleotide/DNA)
cd-hit -i bovidae_protein.faa -o bovidae_protein_CD_hit.fa -T 50 -aS 0.9 -g 1 -c 0.9 -M 0 (protein)
-- aS depends accordingley
check quality of pacbio hifi reads
python /home/niab/app/LongQC/longQC.py sampleqc -x pb-rs2 -o ./Chilika-2_output/ Chilika-2_HC9784.hifi_reads.bam
MAFFT alignment and phastcons
mafft --thread -7 --maxiterate 1000 --globalpair NC_040104.1_30600001_30606000.fasta >NC_040104.1_30600001_30606000
# Convert aligned FASTA to SS
msa_view --in-format FASTA NC_040105.1_99948001_99953000_mafft_output_t1 --out-format SS >NC_040105.1_99948001_99953000_mafft_output_t1.ss
# Estimate neutral model
phyloFit --msa-format SS --out-root neutral_model aligned.ss
# Calculate conservation
phastCons --target-coverage 0.3 --expected-length 12 --rho 0.31 aligned.ss neutral_model.mod > phastcons.bed
# Base-wise scores (conservation/acceleration)
phyloP --method LRT --mode CON neutral_model.mod aligned.ss > phylop.bed
VEP installation & annotation
git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep/
perl INSTALL.pl
git clone --recursive https://github.com/konradjk/loftee.git
cd loftee/
perl INSTALL.pl --AUTO cf --SPECIES homo_sapiens --CACHEDIR /home/livestock/.vep --MERGED
perl INSTALL.pl --AUTO cf --SPECIES homo_sapiens --CACHEDIR /home/livestock/.vep --ASSEMBLY GRCh38
/home/livestock/app/ensembl-vep/vep -i SCB-4_S7.dbsnp_annotated.vcf.gz --cache --plugin LoF,loftee_path:/home/livestock/app/loftee --dir_plugins /home/livestock/app/loftee -o check.txt
/home/livestock/app/ensembl-vep/vep -i final_tribal_touse.vcf.gz --cache --plugin LoF,loftee_path:/home/livestock/app/loftee,human_ancestor_fa:homo_sapiens_ancestor_GRCh38.fa.gz --dir_plugins /home/livestock/app/loftee --vcf -o dj.vcf (For vcf file format)