Command Line Parameters - vikshiv/mumemto GitHub Wiki

Defining match parameters

Mumemto is capable of computing a wide variety of match types, using three primary flags:

-k determines the minimum number of sequences a match must occur in (e.g. for finding MUMs across smaller subsets)
-f controls the maximum number of occurrences in each sequence (e.g. finding duplication regions)
-F controls the total number of occurrences in the collection (e.g. filtering out matches that occur frequently due to low complexity)
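
To make the interaction of the three flags concrete, here is a minimal sketch of the filter they describe, applied to the per-sequence occurrence counts of a single candidate match. The function match_passes and its argument order are hypothetical and are not part of mumemto. The sketch assumes that -k has already been resolved to a positive count and that 0 means "no limit" for -f and -F, which is inferred from the examples on this page rather than stated outright.

```shell
# Hypothetical filter, not mumemto's implementation.
# Usage: match_passes MIN_SEQS PER_SEQ_MAX TOTAL_MAX COUNT...
# where each COUNT is the number of occurrences in one sequence.
match_passes() {
    min_seqs=$1; per_seq_max=$2; total_max=$3
    shift 3
    present=0
    total=0
    for c in "$@"; do
        if [ "$c" -gt 0 ]; then
            present=$(( present + 1 ))
        fi
        total=$(( total + c ))
        # -f: too many occurrences within a single sequence (0 = no limit, assumed)
        if [ "$per_seq_max" -gt 0 ] && [ "$c" -gt "$per_seq_max" ]; then
            return 1
        fi
    done
    # -k: the match must occur in at least min_seqs sequences
    if [ "$present" -lt "$min_seqs" ]; then
        return 1
    fi
    # -F: too many occurrences across the collection (0 = no limit, assumed)
    if [ "$total_max" -gt 0 ] && [ "$total" -gt "$total_max" ]; then
        return 1
    fi
    return 0
}

# Once in each of 3 sequences: passes the strict multi-MUM settings
match_passes 3 1 0 1 1 1 && echo "kept"
# Twice in the second sequence: rejected by the per-sequence limit
match_passes 3 1 0 1 2 1 || echo "rejected"
```

The counts are passed last so that the same call shape works for any number of sequences.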

The following table summarizes the types of exact matches and how to compute them using these parameters:

The most common task is finding multi-MUMs. This is the default set of parameters, so the following two commands produce identical results:

# default
mumemto /path/to/inputs/*.fa -o output
# with explicit parameters
mumemto /path/to/inputs/*.fa -o output -f 1 -F 0 -k 0

Common scenarios

Here are some example scenarios for match types, and the corresponding parameters:

# Find all strict multi-MUMs across a collection (equivalently -k 0 -f 1 -F 0)
mumemto [OPTIONS] [input_fasta [...]]

# Find partial multi-MUMs in all sequences but one
mumemto -k -1 [OPTIONS] [input_fasta [...]]

# Find multi-MEMs that appear at most 3 times in each sequence
mumemto -f 3 [OPTIONS] [input_fasta [...]]

# Find all MEMs that appear at most 100 times within a collection
mumemto -f 0 -k 2 -F 100 [OPTIONS] [input_fasta [...]]

Misc flag details

-k and -F can be set relative to the number of sequences: a negative value is interpreted as an offset from |N|, the number of sequences. For example, -k -1 finds partial multi-MUMs that appear in all sequences but one.
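
As an illustration of the relative form, the sketch below resolves a -k value against the number of sequences. The helper resolve_k is hypothetical and is not part of mumemto. It assumes that values of 0 or below are offsets from |N|, so that 0 means "all sequences", consistent with -k 0 being the strict multi-MUM default.

```shell
# Hypothetical helper, not part of mumemto.
# Usage: resolve_k K_VALUE NUM_SEQUENCES
resolve_k() {
    if [ "$1" -le 0 ]; then
        # Zero or negative: offset from the number of sequences (assumed)
        echo $(( $2 + $1 ))
    else
        # Positive: used as an absolute count
        echo "$1"
    fi
}

resolve_k -1 10   # prints 9: all sequences but one
resolve_k 0 10    # prints 10: all sequences
resolve_k 2 10    # prints 2: at least two sequences
```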

If -F and -f are both set, the per-sequence occurrence limit takes precedence over the total occurrence limit whenever it is stricter. For example, if -f 2 -F 30 is passed for a dataset with |N| = 10 sequences, then at most 2 occurrences are allowed in each sequence, for a total of at most 20 occurrences. This is a stricter threshold than 30, and thus overrides the -F parameter.
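
The arithmetic in this example can be sketched as follows. The function effective_total_limit is hypothetical and is not part of mumemto. It assumes both limits are positive and that the binding total is simply the smaller of -F and -f times |N|.

```shell
# Hypothetical helper, not part of mumemto.
# Usage: effective_total_limit PER_SEQ_MAX TOTAL_MAX NUM_SEQUENCES
effective_total_limit() {
    # Largest total that the per-sequence limit alone would allow
    per_seq_cap=$(( $1 * $3 ))
    if [ "$per_seq_cap" -lt "$2" ]; then
        echo "$per_seq_cap"
    else
        echo "$2"
    fi
}

effective_total_limit 2 30 10   # prints 20: -f 2 over 10 sequences is stricter than -F 30
```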

Other flags

-l defines a minimum match length. By default, this is 20bp.
By default, matches are found across both strands. -r disables the reverse complement, finding only forward strand matches.
-A will write the full enhanced suffix array to disk. This includes the Burrows-Wheeler Transform (BWT), Suffix Array (SA), and Longest Common Prefix (LCP) arrays. The SA and LCP arrays are written with 5 bytes per input character. This allows "rescans", finding different match types without recomputing these arrays. They are available under the same output prefix.
- To "rescan" an existing set of arrays (with example prefix), run:
```
mumemto -i prefix.lengths -a prefix [match options] 
```
-w and -m control the prefix-free parse parameters and can affect how well the input compresses, and therefore the memory footprint of the algorithm.
To save intermediate files, use -K. To re-use intermediate files, pass the same input files (or pass the *.lengths file as a file list with -i), and use -p.