Input and Output files - vikshiv/mumemto GitHub Wiki

Input files

The most common inputs to mumemto are genome assemblies, typically comprising a pangenome. These should be in FASTA format and are passed as positional arguments to mumemto:

mumemto /path/to/fastas/*.fa

Alternatively, a file listing the paths to the input FASTAs (one per line) can be supplied with -i:

mumemto -i filelist.txt
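As a minimal sketch, a file list like the one above could be generated with a few lines of Python. The function name write_filelist and the *.fa pattern are illustrative assumptions, not part of mumemto. Paths are sorted so that the sequence order, which the output columns follow, is reproducible.

```python
from pathlib import Path


def write_filelist(fasta_dir, out_path, pattern="*.fa"):
    """Write one absolute FASTA path per line, in sorted order.

    Hypothetical helper: the name and the default pattern are assumptions.
    """
    paths = sorted(Path(fasta_dir).glob(pattern))
    with open(out_path, "w") as out:
        for path in paths:
            out.write(f"{path.resolve()}\n")
    return paths
```

The resulting file can then be passed to mumemto with -i as shown above.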

We recommend that each FASTA file contain a single sequence. If multiple sequences are present, they will be concatenated. If each assembly contains multiple chromosomes, we recommend splitting each chromosome into a separate FASTA and running mumemto on each chromosome separately.
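One way to follow this recommendation is to split a multi-sequence FASTA into one file per record before running mumemto. The sketch below is not part of mumemto. The helper name split_fasta and the choice to name each output file after the first word of its header are assumptions.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-sequence FASTA to its own file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # Name the file after the first word of the header (an assumption;
                # headers containing "/" or similar would need sanitizing).
                parts = line[1:].split()
                name = parts[0] if parts else f"seq{len(written) + 1}"
                path = out_dir / f"{name}.fa"
                out = open(path, "w")
                written.append(path)
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

The per-chromosome files from each assembly can then be grouped by chromosome and passed to separate mumemto runs.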

Note

We highly recommend removing Ns from the input sequences. In certain cases, runs of Ns could be reported as multi-MUMs. This is unlikely to affect results, but it may appear in visualizations as an unintended synteny block.
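A simple way to remove Ns is to filter them out of the sequence lines before running mumemto. The following is a minimal sketch under stated assumptions: strip_ns is a hypothetical helper, it treats both N and n as unknown bases, and it holds one sequence in memory at a time.

```python
def strip_ns(fasta_in, fasta_out, width=60):
    """Copy a FASTA file, dropping every N/n base from the sequences."""

    def flush(out, chunks):
        # Re-wrap the cleaned sequence at a fixed line width.
        seq = "".join(chunks)
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")

    with open(fasta_in) as src, open(fasta_out, "w") as out:
        chunks = []
        for line in src:
            if line.startswith(">"):
                flush(out, chunks)  # finish the previous record, if any
                chunks = []
                out.write(line)
            else:
                chunks.append(line.strip().replace("N", "").replace("n", ""))
        flush(out, chunks)
```

Note that deleting bases shifts every downstream coordinate, so offsets in the outputs refer to the N-free sequences rather than the original assemblies.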

Output files

The main output of mumemto is the *.mums (or *.mems) file. A *.lengths file is also produced, which defines the order of sequences in the outputs and records the length of each input sequence.

*.mums file

If the maximum number of occurrences per sequence (-f) is set to 1 (indicating MUMs), a *.mums file is generated. This is the default.

[MUM length] [comma-delimited list of offsets in each sequence, in order of filelist] [comma-delimited strand indicators (one of +/-)]

Each line in the *.mums file represents a multi-MUM. It appears exactly once in each sequence (or not at all in some sequences for partial MUMs, when -k is set). The offsets and strand information are listed in the order of the sequences in the *.lengths file. If a MUM is not present in a sequence, the corresponding field is left blank. NOTE: multi-MUMs are sorted in the output file lexicographically by match sequence.
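To make the line format concrete, here is a minimal parsing sketch. It assumes the three fields are separated by whitespace and that a missing sequence shows up as an empty entry between commas. The Mum type and parse_mums_line are illustrative names, not part of mumemto.

```python
from typing import NamedTuple


class Mum(NamedTuple):
    length: int
    offsets: list  # one entry per sequence; None where the MUM is absent
    strands: list  # "+" or "-", or None where the MUM is absent


def parse_mums_line(line):
    """Parse one *.mums line into a Mum record (assumed whitespace-delimited)."""
    length, offsets, strands = line.split()
    return Mum(
        length=int(length),
        offsets=[int(o) if o else None for o in offsets.split(",")],
        strands=[s if s else None for s in strands.split(",")],
    )
```

For example, parse_mums_line("25\t100,,340\t+,,-") would describe a 25 bp partial MUM that is found in the first and third sequences and absent from the second.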

*.bumbl file (new in v1.2)

A *.bumbl file is a binary version of a *.mums file. It contains the same information but is generally smaller and significantly faster to work with. For instance, for large datasets with millions of multi-MUMs, it can be orders of magnitude faster to load into the visualization module.

It can be converted to a human-readable format using mumemto convert -b <prefix>.bumbl > out.mums, or piped into less for quick inspection. This is analogous to converting between SAM and BAM files.

Note

The convert module outputs multi-MUMs sorted by position in the first sequence. This is the default, and it is often useful for collinear blocking as well.

*.mems file

If more than one occurrence of a match is allowed per sequence (-f > 1), a *.mems file is generated. It has a similar format:

[MEM length] [comma-delimited list of offsets for each occurrence] [comma-delimited list of sequence IDs, as defined in the filelist] [comma-delimited strand indicators (one of +/-)]

The order of offsets is no longer defined, but an extra list field indicates the input sequence ID of origin for each offset (again, index order defined by the *.lengths file). Similarly, the multi-MEMs are ordered lexicographically.
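Because the offsets are unordered, it is often convenient to group them by sequence ID when reading a *.mems file. The sketch below assumes whitespace-delimited fields and integer sequence IDs (indices into the *.lengths order). parse_mems_line is a hypothetical helper, not part of mumemto.

```python
from collections import defaultdict


def parse_mems_line(line):
    """Parse one *.mems line into (length, {sequence ID: [(offset, strand), ...]})."""
    length, offsets, seq_ids, strands = line.split()
    hits = defaultdict(list)
    # The three comma-delimited lists are parallel: entry i of each list
    # describes the same occurrence.
    for off, sid, strand in zip(
        offsets.split(","), seq_ids.split(","), strands.split(",")
    ):
        hits[int(sid)].append((int(off), strand))  # assumes integer IDs
    return int(length), dict(hits)
```

With this grouping, the number of occurrences of a MEM in a given sequence is simply the length of the list stored under that sequence ID.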
