rdmcl - biologyguy/RD-MCL GitHub Wiki

Main program in the suite. Got orthogroups?

Before running RD-MCL for the first time, please execute the setup script as follows:

$: rdmcl -setup

This will ensure that certain dependencies are properly installed on your system.

Generalized usage

$: rdmcl input_file <args>

input_file: A sequence file in any of the supported formats.

args: All flagged arguments are explained in detail below.

Arguments

-o, --outdir ( path )

Specify where results should be written to. By default, a folder will be made in the current working directory with the name rdmcl-dd-mm-yyyy.

NOTE: Folder contents will be overwritten!
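
Because existing contents are overwritten, it can be worth checking for the default folder before launching a run. The following is a minimal Python sketch, assuming the `dd-mm-yyyy` part of the name is today's date with zero-padded day and month. The helper name `default_outdir_name` is hypothetical and not part of RD-MCL.

```python
from datetime import date
from pathlib import Path


def default_outdir_name(day=None):
    """Build the default folder name described above, e.g. rdmcl-07-03-2024.

    Assumption: day and month are zero-padded; RD-MCL's own formatting may differ.
    """
    day = day or date.today()
    return day.strftime("rdmcl-%d-%m-%Y")


if __name__ == "__main__":
    # Look for the default folder in the current working directory.
    outdir = Path.cwd() / default_outdir_name()
    if outdir.exists():
        print(f"{outdir} already exists; its contents would be overwritten")
    else:
        print(f"{outdir} does not exist yet")
```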

-wdb, --workdb ( path )

Specify the directory where worker nodes are monitoring a queue. If any workers are detected, RD-MCL will pass off its 'hard' work to the queue.

$: rdmcl input_seqs.fa -wdb "/home/worker_dir/"

-rs, --r_seed ( int )

RD-MCL is a heuristic approach, which means the results may change slightly from run to run. For reproducibility, you can specify the random seed; runs that start from the same data, parameters, and seed will give the same results.

$: rdmcl input_seqs.fa -rs 12321

-sql, --sqlite_db ( path )

RD-MCL stores multiple sequence alignments and all-by-all similarity graphs in a SQLite database. If you want to reuse a database, or keep all of the data from multiple runs in the same place, specify its path with this flag.

$: rdmcl input_seqs.fa -sql "/home/rdmcl.sqlite"

-psi, --psipred_dir ( path )

Secondary structure predictions are included in the similarity scores between sequences. These predictions are saved as .ss2 files in the psipred directory and can be reused by subsequent runs if you point to that folder.

$: rdmcl input_seqs.fa -psi "/home/psipred_files/"

-ts, --taxa_sep ( char )

All sequence IDs must be prefixed with a species identifier and a separation character. By default, RD-MCL will look for the hyphen (-) character, but this can be overridden if necessary.

$: rdmcl input_seqs.fa -ts '~'
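
To check an input file against this naming rule before a run, something like the following Python sketch may help. It assumes FASTA input and treats the first whitespace-delimited token of each header as the sequence ID. The function name `find_unprefixed_ids` is hypothetical, and RD-MCL's own validation may be stricter or looser.

```python
def find_unprefixed_ids(fasta_text, sep="-"):
    """Return sequence IDs that lack a '<taxon><sep><name>' structure.

    Assumption: the ID is the first token on each '>' header line.
    """
    bad_ids = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].split()
        seq_id = fields[0] if fields else ""
        # Split on the first separator only, so names may contain it too.
        taxon, found, name = seq_id.partition(sep)
        if not (found and taxon and name):
            bad_ids.append(seq_id)
    return bad_ids


if __name__ == "__main__":
    example = ">Hsa-PanxA\nMKT\n>orphan\nMAA\n"
    print(find_unprefixed_ids(example))
```

An empty list means every header carries a taxon prefix for the chosen separator; pass `sep="~"` to mirror `-ts '~'`.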

-ch, --chains ( int )

Specify how many MCMC chains are included in the run (default=3).

$: rdmcl input_seqs.fa -ch 4

-wlk, --walkers ( int )

Specify how many Metropolis-Hastings walkers are in each chain (default=3).

$: rdmcl input_seqs.fa -wlk 4

-cpu, --max_cpus ( int )

Specify the maximum number of cores RD-MCL can use (default=21). Note that this limit is not strictly enforced, and RD-MCL may use more cores than the value given.

$: rdmcl input_seqs.fa -cpu 16

-cnv, --converge ( float )

Set the Gelman-Rubin potential scale reduction factor (PSRF) value at which the run is considered converged. The closer to 1.0, the longer it will take to reach convergence (default=1.1).

$: rdmcl input_seqs.fa -cnv 1.05
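
For intuition about what this threshold measures, here is a minimal Python sketch of the textbook Gelman-Rubin statistic for a single scalar parameter. It is an illustration under stated assumptions, not RD-MCL's code: the function name `gelman_rubin` is hypothetical, all chains are assumed to have equal length, and RD-MCL's own calculation may differ in detail.

```python
from math import sqrt
from statistics import mean, variance


def gelman_rubin(chains):
    """Textbook potential scale reduction factor for equal-length chains."""
    m = len(chains)
    if m < 2:
        raise ValueError("need at least two chains")
    n = len(chains[0])
    if n < 2 or any(len(chain) != n for chain in chains):
        raise ValueError("chains must share a length of at least two")

    # W: mean of the within-chain sample variances.
    within = mean([variance(chain) for chain in chains])
    if within == 0:
        raise ValueError("within-chain variance is zero")
    # B: n times the sample variance of the chain means.
    between = n * variance([mean(chain) for chain in chains])

    # Pooled variance estimate, then the ratio against W.
    pooled = (n - 1) / n * within + between / n
    return sqrt(pooled / within)


if __name__ == "__main__":
    agreeing = [[1, 2, 3, 4], [2, 1, 4, 3], [3, 4, 1, 2]]
    diverging = [[1, 2, 3, 4], [11, 12, 13, 14]]
    print(gelman_rubin(agreeing) < 1.1)
    print(gelman_rubin(diverging) < 1.1)
```

Chains that sample the same distribution give a value near 1.0, while chains stuck in different regions give a much larger value, which is why a threshold closer to 1.0 takes longer to satisfy.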

-mcs, --mcmc_steps ( int )

Specify a maximum number of MCMC steps allowed (default=auto-detect, note that the minimum is 100). If not set, convergence is detected using the Gelman-Rubin method.

$: rdmcl input_seqs.fa -mcs 1000

-algn_m, --align_method ( str or path )

Specify a supported alignment algorithm to perform the multiple sequence alignments used by RD-MCL to create all-by-all similarity matrices. If the program is in your $PATH, supply its name; otherwise, supply the full path to the executable (default=clustalo).

$: rdmcl input_seqs.fa -algn_m mafft

$: rdmcl input_seqs.fa -algn_m "/path/to/clustalomega"

-algn_p, --align_params ( str )

If you want to modify the parameters of the multiple sequence alignment, pass them in between double quotes ("--params etc"). Note that ClustalOmega is the default aligner, so make sure to combine this with --align_method if switching to a different method.

$: rdmcl input_seqs.fa -algn_p "--max-hmm-iterations=4 --full-iter"

-op, --open_penalty ( float )

Penalty for opening a gap in pairwise alignment scoring (default=-5).

$: rdmcl input_seqs.fa -op -3

-ep, --ext_penalty ( float )

Penalty for extending a gap in pairwise alignment scoring (default=0).

$: rdmcl input_seqs.fa -ep -0.2
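
To see how the two gap penalties interact, here is a minimal Python sketch of one common affine gap convention, in which the first gapped position is charged the open penalty and each further position the extension penalty. This convention is an assumption for illustration; the page does not state exactly how RD-MCL applies the two values, and the function name `affine_gap_score` is hypothetical.

```python
def affine_gap_score(gap_length, open_penalty=-5.0, ext_penalty=0.0):
    """Score contribution of one gap under an assumed affine convention.

    Assumption: open_penalty covers the first gapped position and
    ext_penalty is added for each additional position.
    """
    if gap_length < 0:
        raise ValueError("gap_length must be non-negative")
    if gap_length == 0:
        return 0.0
    return open_penalty + (gap_length - 1) * ext_penalty


if __name__ == "__main__":
    # With the documented defaults, gap length does not change the cost.
    print(affine_gap_score(1), affine_gap_score(10))
    # With -op -3 -ep -0.2, longer gaps cost progressively more.
    print(affine_gap_score(1, -3, -0.2), affine_gap_score(10, -3, -0.2))
```

Under this convention, the default extension penalty of 0 makes one long gap as cheap as a single-position gap, while a negative extension penalty penalizes long gaps in proportion to their length.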

-lwt, --lock_wait_time ( int )

Specify how long a process should wait on the SQLite database before crashing out (default=1200). This is largely a development argument and you should not need to change it. If you do, please leave a note in the GitHub issue tracker.

$: rdmcl input_seqs.fa -lwt 600

-r, --resume

RD-MCL dumps its MCMCMC chains as it progresses and can pick a run back up later in the event of a crash. If you want it to try to resume a broken run, pass in this flag (note that it overrides and breaks --r_seed).

$: rdmcl input_seqs.fa --resume

-f, --force

RD-MCL will warn you if you try to do something it thinks might be wrong (e.g., passing in very large datasets). If you are certain you know what you are doing, you can suppress these warnings with --force.

$: rdmcl very_large_input_seqs.fa --force

-q, --quiet

Reduce the verbosity of the output. This does not currently eliminate output completely.

$: rdmcl input_seqs.fa --quiet
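
The flags above can be combined in a single call. When runs are launched from a script, a small wrapper that assembles the command can help keep parameters consistent. The following is a minimal Python sketch that uses only short flags documented on this page. The function name `build_rdmcl_command` and its keyword names are hypothetical, and the sketch only builds the argument list without running anything.

```python
import shlex


def build_rdmcl_command(input_file, outdir=None, r_seed=None, chains=None,
                        walkers=None, max_cpus=None, quiet=False):
    """Assemble an rdmcl argument list from optional settings."""
    cmd = ["rdmcl", input_file]
    options = [
        ("-o", outdir),
        ("-rs", r_seed),
        ("-ch", chains),
        ("-wlk", walkers),
        ("-cpu", max_cpus),
    ]
    for flag, value in options:
        if value is not None:  # leave unset options at RD-MCL's defaults
            cmd += [flag, str(value)]
    if quiet:
        cmd.append("-q")
    return cmd


if __name__ == "__main__":
    cmd = build_rdmcl_command("input_seqs.fa", r_seed=12321, chains=4, max_cpus=16)
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the printed command looks right.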