TAMA GO: Sequence Cleanup - GenomeRIK/tama GitHub Wiki
This set of tools in TAMA GO is used to clean up sequences. Currently there is only one tool, but more will be added later.
tama_flnc_polya_cleanup.py
To remove poly-A tail sequences from FLNC fasta files, use tama_flnc_polya_cleanup.py. This tool removes the poly-A tails left in the FLNC fasta files after running IsoSeq3 Refine without the "--require-polya" parameter. If you have Iso-Seq data generated from cDNA libraries prepared with the Teloprime kit, you should not use the "--require-polya" parameter: it will remove many reads due to an issue with the Teloprime 3' primer sequence and the way lima works. Instead, run Refine with default settings and then clean up the remaining poly-A tails using this tool.
See twitter thread for more info: https://twitter.com/GenomeRIK/status/1179788262187110401
Instructions for Teloprime Iso-Seq data:
Primer sequences (i.e. primers.fasta; you may need to change the headers depending on the software version):
>primer_5p
TGGATTGATATGTAATACGACTCACTATAG
>primer_3p
AAAAAAAAAAAAAAAAAACGCCTGAGA
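If you prefer to generate the primer file from a script, a minimal Python sketch could look like the following. The function name `write_primers_fasta` is hypothetical and not part of TAMA; the sequences are copied from this page, and the headers may need adjusting for your software version.

```python
from pathlib import Path

# Primer names and sequences copied from this page.
TELOPRIME_PRIMERS = {
    "primer_5p": "TGGATTGATATGTAATACGACTCACTATAG",
    "primer_3p": "AAAAAAAAAAAAAAAAAACGCCTGAGA",
}


def write_primers_fasta(path, primers=TELOPRIME_PRIMERS):
    """Write one FASTA record per primer: a header line, then a sequence line."""
    lines = []
    for name, seq in primers.items():
        lines.append(f">{name}")
        lines.append(seq)
    text = "\n".join(lines) + "\n"
    Path(path).write_text(text)
    return text
```

Calling `write_primers_fasta("primers.fasta")` writes the file into the current directory and returns the text that was written.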
Run LIMA depending on the version you are using (for IsoSeq3 3.2):
lima --isoseq --dump-clips --no-pbi --peek-guess -j 24 ccs.bam primers.fasta demux.bam
Run refine without the "--require-polya" argument (for IsoSeq3 3.2):
isoseq3 refine demux.primer_5p--primer_3p.bam primers.fasta flnc.bam
Convert flnc.bam file into a fasta file:
bamtools convert -format fasta -in flnc.bam > flnc.fa
Run tama_flnc_polya_cleanup.py to remove remaining 3' poly-A tails:
python tama_flnc_polya_cleanup.py -f flnc.fa -p prefix
The resulting fasta file is now ready for genome mapping.
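The steps above can be chained from a small driver script. The following is only a sketch under stated assumptions: `build_pipeline` and `run_pipeline` are hypothetical helper names, the command-line arguments are copied from this page (IsoSeq3 3.2), and the name of the lima output for the 5p/3p primer pair is passed in as a parameter because it depends on the lima version and the primer headers.

```python
import subprocess


def build_pipeline(ccs_bam, primers, lima_pair_bam, prefix, threads=24):
    """Return (argv, stdout_path) steps mirroring the commands on this page.

    lima_pair_bam is the lima output file for the 5p/3p primer pair; its
    exact name is an assumption you must supply for your own run.
    """
    return [
        (["lima", "--isoseq", "--dump-clips", "--no-pbi", "--peek-guess",
          "-j", str(threads), ccs_bam, primers, "demux.bam"], None),
        # Refine is deliberately run without --require-polya.
        (["isoseq3", "refine", lima_pair_bam, primers, "flnc.bam"], None),
        # bamtools writes FASTA to stdout, so stdout is redirected to flnc.fa.
        (["bamtools", "convert", "-format", "fasta", "-in", "flnc.bam"],
         "flnc.fa"),
        (["python", "tama_flnc_polya_cleanup.py", "-f", "flnc.fa",
          "-p", prefix], None),
    ]


def run_pipeline(steps):
    """Run each step in order, stopping at the first non-zero exit code."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)
```

Keeping command construction separate from execution makes it possible to print and inspect the commands before anything is run.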
To convert an FLNC BAM file into a fasta file, you can use this command: bamtools convert -format fasta -in bam_file > fasta_file
Note: bamtools is a separate program and is not part of TAMA.
usage: tama_flnc_polya_cleanup.py [-h] [-f F] [-p P] [-m M]
optional arguments:
-h, --help  show this help message and exit
-f F        FLNC fasta file
-p P        Prefix for output file
-m M        Minimum read length to keep (default is 200)
Default command would look like this:
python tama_flnc_polya_cleanup.py -f flnc.fa -p prefix
Detailed explanation of arguments:
-f F
The FLNC fasta file is the output from running IsoSeq3 Refine and then the BAM to Fasta conversion.
-p P
This is the prefix used for the file naming of all the output files.
-m M
This is the minimum read length to keep after poly-A trimming. Default is 200bp.
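To illustrate how trimming and the minimum-length filter interact, here is a deliberately simplified Python sketch. It is not the algorithm used by tama_flnc_polya_cleanup.py: the function `trim_polya` is hypothetical, it only strips an uninterrupted run of trailing A's, and it assumes that reads shorter than the minimum (rather than shorter than or equal to it) are discarded.

```python
def trim_polya(seq, min_len=200):
    """Strip trailing A's; return None if the result is shorter than min_len.

    Simplified illustration only: the real tool may handle mismatches
    inside the tail or apply different cutoffs.
    """
    trimmed = seq.rstrip("Aa")
    if len(trimmed) < min_len:
        return None  # read would be discarded
    return trimmed
```

For example, a 240 bp read followed by a 30 bp poly-A tail is returned as the 240 bp read, while a 40 bp read followed by a long tail is dropped.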
Outputs:
prefix.fa
prefix_polya_flnc_report.txt
prefix_discarded_reads.txt
prefix_summary.txt
Detailed explanation:
prefix.fa
This is the cleaned up FLNC fasta file.
prefix_polya_flnc_report.txt
This is a report file with a table giving, for each poly-A count (polya_num), the number of sequences with that count (polya_num_count).
polya_num  polya_num_count
0          40676
1          46986
2          63718
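If you want to work with this report in a script, a small parser is enough. The sketch below assumes the file is a whitespace-delimited (for example tab-separated) two-column table with a single header row, as in the excerpt above; `parse_polya_report` is a hypothetical helper, not part of TAMA.

```python
def parse_polya_report(text):
    """Map polya_num -> polya_num_count from the two-column report table."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) != 2 or not fields[0].isdigit():
            continue
        counts[int(fields[0])] = int(fields[1])
    return counts


# The three rows shown on this page, assuming tab separation.
example = "polya_num\tpolya_num_count\n0\t40676\n1\t46986\n2\t63718\n"
counts = parse_polya_report(example)
rows_total = sum(counts.values())  # sum over the rows shown only
```

For a real run, read the file contents with `open("prefix_polya_flnc_report.txt").read()` and pass them to the same function.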
prefix_discarded_reads.txt
This is a fasta-formatted report of the reads that were discarded, along with the reason each read was discarded.
prefix_summary.txt
This is a report file showing a summary of some of the characteristics of the reads.