Input and Output files - vikshiv/mumemto GitHub Wiki
The most common inputs to mumemto are genome assemblies, typically comprising a pangenome. These should be in FASTA format and are passed as positional arguments to mumemto:
mumemto /path/to/fastas/*.fa
Alternatively, a file containing a list of paths to input FASTAs (one per line) can be supplied with -i:
mumemto -i filelist.txt
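As a minimal sketch of how such a file list could be prepared, the Python snippet below writes one absolute FASTA path per line. The helper name write_filelist and the *.fa glob pattern are illustrative assumptions, not part of mumemto. Paths are sorted so that the sequence order, which the outputs follow, is reproducible.

```python
from pathlib import Path


def write_filelist(fasta_dir, out_path, pattern="*.fa"):
    """Write one absolute FASTA path per line, for use with `mumemto -i`.

    Hypothetical helper: the name and the default glob pattern are
    assumptions made for this example.
    """
    # Sort the paths so the sequence order is the same on every run.
    paths = sorted(Path(fasta_dir).glob(pattern))
    Path(out_path).write_text("".join(f"{p.resolve()}\n" for p in paths))
    return paths
```

The resulting file can then be passed as shown above, for example after calling write_filelist("/path/to/fastas", "filelist.txt").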
We recommend that each FASTA file contain a single sequence. If multiple sequences are present, they will be concatenated. If each assembly contains multiple chromosomes, we recommend splitting each chromosome into a separate FASTA and running mumemto on each chromosome separately.
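One way to split a multi-chromosome assembly into per-sequence FASTA files is sketched below in plain Python. The function name split_fasta and the choice to name each output file after the first word of its header are assumptions for illustration; any FASTA-splitting tool would serve equally well.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-sequence FASTA to its own file.

    Hypothetical helper. Output files are named after the first word of
    each header line; records sharing that word would overwrite each other.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A new record starts: close the previous output file.
                if out is not None:
                    out.close()
                fields = line[1:].split()
                name = fields[0] if fields else f"seq{len(written) + 1}"
                path = out_dir / f"{name.replace('/', '_')}.fa"
                out = open(path, "w")
                written.append(path)
            # Lines before the first header are skipped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

Each chromosome file from every assembly can then be grouped with its counterparts and passed to a separate mumemto run.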
Note
We highly recommend removing Ns from the input sequences. In certain cases, runs of Ns can be reported as multi-MUMs. This is unlikely to affect results, but it may appear in visualizations as an unintended synteny block.
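A rough sketch of locating and removing runs of Ns from a sequence string is shown below. Both function names are hypothetical, and mumemto does not prescribe a particular method. Keep in mind that deleting Ns shifts the coordinates of everything downstream relative to the original assembly.

```python
import re


def find_n_runs(sequence):
    """Return half-open (start, end) intervals for each run of N or n."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", sequence)]


def remove_ns(sequence):
    """Return the sequence with every N or n deleted."""
    return re.sub(r"[Nn]+", "", sequence)
```

Recording the intervals from find_n_runs before calling remove_ns makes it possible to map offsets back to the original coordinates later.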
The main output of mumemto is the *.mums (or *.mems) file. A *.lengths file is also produced, which defines the order of sequences in the outputs and records the length of each input sequence.
If the maximum number of occurrences per sequence (-f) is set to 1 (indicating MUMs), a *.mums file is generated [default]. The format is:
[MUM length] [comma-delimited list of offsets in each sequence, in order of filelist] [comma-delimited strand indicators (one of +/-)]
Each line in the *.mums file represents a multi-MUM, which appears exactly once in each sequence (or not at all in some sequences for partial MUMs, when -k is set). The offsets and strand information are listed in the order of the sequences in the *.lengths file. If a MUM is not present in a sequence, the field is left blank. NOTE: multi-MUMs are sorted in the output file lexicographically by match sequence.
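The record layout described above can be read with a few lines of Python. This is a minimal sketch, not mumemto's own parser: it assumes the three fields are separated by whitespace (tab or space) and that a blank entry is left in both the offset and strand lists for a sequence lacking the MUM. The name parse_mums_line is hypothetical.

```python
def parse_mums_line(line):
    """Parse one *.mums record into (length, offsets, strands).

    Assumptions: fields are whitespace-separated, and a sequence that
    lacks the MUM has an empty entry in both comma-delimited lists.
    """
    length, offsets, strands = line.rstrip("\n").split()
    # Blank entries become None so list positions still match the
    # sequence order given in the *.lengths file.
    offsets = [int(o) if o else None for o in offsets.split(",")]
    strands = [s if s else None for s in strands.split(",")]
    return int(length), offsets, strands
```

For example, under these assumptions the record "25 100,,340 +,,-" describes a 25 bp partial MUM found in the first and third sequences but not the second.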
A *.bumbl file is a binary version of a *.mums file. It contains the same information but is generally smaller and significantly faster to work with. For instance, for large datasets with millions of multi-MUMs, it can be orders of magnitude faster to load into the visualization module.
It can be converted to a human-readable format using mumemto convert -b <prefix>.bumbl > out.mums, or piped into less for quick inspection. This is similar to converting between sam and bam files.
Note
Using the convert module will result in multi-MUMs sorted by position in the first sequence. This is the default, and is often useful for collinear blocking as well.
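To illustrate what a position-based ordering means for already parsed records, the sketch below sorts (length, offsets, strands) tuples by their offset in the first sequence. It is not the convert module's implementation. Placing partial MUMs that are absent from the first sequence at the end is an assumption of this example.

```python
def sort_by_first_sequence(mums):
    """Sort (length, offsets, strands) records by offset in sequence 0.

    Records with no offset in the first sequence (None) are placed last;
    that placement is an assumption of this sketch.
    """
    return sorted(mums, key=lambda m: (m[1][0] is None, m[1][0] or 0))
```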
If more than one occurrence of a match is allowed per sequence (-f > 1), then a *.mems file is generated. It has a similar format:
[MEM length] [comma-delimited list of offsets for each occurrence] [comma-delimited list of sequence IDs, as defined in the filelist] [comma-delimited strand indicators (one of +/-)]
The order of offsets is no longer defined, but an extra list field indicates the input sequence ID of origin for each offset (again, index order defined by the *.lengths file). Like the multi-MUMs, the multi-MEMs are ordered lexicographically.
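Because the offsets in a *.mems record are unordered, it is often convenient to group them by sequence ID. The sketch below does this under two assumptions: fields are whitespace-separated, and sequence IDs are integer indices into the *.lengths file. The name parse_mems_line is hypothetical.

```python
from collections import defaultdict


def parse_mems_line(line):
    """Parse one *.mems record into (length, {seq_id: [(offset, strand)]}).

    Assumptions: fields are whitespace-separated and sequence IDs are
    integer indices into the *.lengths file.
    """
    length, offsets, seq_ids, strands = line.rstrip("\n").split()
    offsets = [int(o) for o in offsets.split(",")]
    seq_ids = [int(i) for i in seq_ids.split(",")]
    strands = strands.split(",")
    # The three lists are parallel: entry i describes one occurrence.
    by_seq = defaultdict(list)
    for offset, seq_id, strand in zip(offsets, seq_ids, strands):
        by_seq[seq_id].append((offset, strand))
    return int(length), dict(by_seq)
```

Grouping this way makes it easy to see how many times a multi-MEM occurs in each input sequence.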