extract reads (new) - smangul1/UMI-Reducer-new GitHub Wiki

The tool extractTag.py performs several tasks. First, it splits aggregated reads into separate samples according to barcode. In this step, the barcode and unique molecular identifier (UMI) sequence are written to the first line of each FASTQ record as follows:

>UMI_barcode
tag sequence
-
quality score
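
As a rough illustration of this step, the sketch below groups reads by barcode and renders each one in the four-line layout shown above. It is not the code in extractTag.py: the function names `format_record` and `split_by_barcode`, the parameters `umi_len` and `barcode_len`, and the assumption that every read starts with the UMI followed directly by the barcode are all hypothetical.

```python
from collections import defaultdict


def format_record(umi, barcode, tag_seq, quality):
    """Render one read in the four-line layout shown above."""
    return f">{umi}_{barcode}\n{tag_seq}\n-\n{quality}\n"


def split_by_barcode(reads, umi_len, barcode_len):
    """Group (sequence, quality) pairs into per-barcode lists of records.

    Assumption (not from the wiki): each read begins with the UMI,
    immediately followed by the barcode, both of fixed length.
    """
    samples = defaultdict(list)
    prefix = umi_len + barcode_len
    for seq, qual in reads:
        if len(seq) <= prefix:
            continue  # too short to hold UMI, barcode, and a tag
        umi = seq[:umi_len]
        barcode = seq[umi_len:prefix]
        record = format_record(umi, barcode, seq[prefix:], qual[prefix:])
        samples[barcode].append(record)
    return dict(samples)
```

Each key of the returned dictionary is one barcode, so writing its list of records to a separate file would give one output per sample.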

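The linker filter described in the paragraph that follows can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the logic of extractTag.py itself: the wiki does not say how quality is measured or where the poor/intermediate/high cutoffs lie, so the mean Phred score over the linker bases, the Phred+33 encoding, and the `low_q` and `high_q` thresholds are all assumptions, as are the function names. The linker is assumed to occupy the last bases of the read, and its sequence is passed in as a parameter rather than hard-coded.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def trim_linker(seq, qual, linker, low_q=20, high_q=30):
    """Return (seq, qual) with the 3' linker removed, or None if discarded.

    low_q and high_q are hypothetical cutoffs separating poor,
    intermediate, and high quality; the wiki does not give the real ones.
    """
    n = len(linker)
    if len(seq) < n:
        return None  # the full-length linker cannot be present
    tail_seq, tail_qual = seq[-n:], qual[-n:]
    q = mean_phred(tail_qual)
    if q < low_q:
        return None  # poor-quality linker: discard
    mismatches = sum(a != b for a, b in zip(tail_seq, linker))
    # High quality tolerates one mismatch; intermediate requires a perfect match.
    allowed = 1 if q >= high_q else 0
    if mismatches > allowed:
        return None
    return seq[:-n], qual[:-n]
```

A caller would pass the real 20 nt L32 sequence as `linker` and keep only the reads for which the function returns a trimmed pair.
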
Second, the tool removes the L32 linker sequence from the 3' end of each read. Reads that do not contain the full-length L32 sequence, or in which the L32 sequence is of poor quality, are discarded. Reads that contain a full-length L32 sequence of intermediate quality are kept only if all 20 nt match the L32 sequence perfectly. Reads that contain a full-length L32 sequence of high quality are kept even with a one-base mismatch to the L32 sequence. Requiring the full-length L32 sequence filters out background reads that did not derive from a linker-ligated product. In tests, at least 80% of raw reads pass these filters.

The usage for this tool is as follows: