Partanalyzer Help page - MASantos/Partanalyzer GitHub Wiki

Full help (as of version alpha 1.0)

partanalyzer (Partition Analyzer)

Usage: partanalyzer [-h|--help]    (Use --help for more details)
       partanalyzer --version
       partanalyzer [OPTIONS] COMMAND ARGS

OPTIONS
        --debug
        --verbose
        -q , --quiet
        -z, --pid-normalization [s|p|r|l]
        -t , --format partition_format
        --tab tab_file
        --DIST_SUBSPROJECT
        --beta beta_value
        --mu mu_value
COMMANDS
Defining the algebra of partitions
    (-i|-u) partition1  partition2  [partition1_offset (=2) ] [partition2_offset (=partition1_offset) ]
    (-I|-U) [-ofs partition_offset (=2)] [-f partition_list | partition1 [ partition2 [ ... ]]]
For analyzing partitions
    (-v|-e|-p) partition1  partition2  [partition1_offset (=2) ] [partition2_offset (=partition1_offset) ]
    -c matrix-of-values partition1 [threshold (=-1.0)] [partition_offset (=2)]
    -d matrix-of-values partition1 [partition_offset (=2)]
    (-Q|-R|-T) [-ext extensivity] [-ofs partition_offset (=2)] [-f partition_list | partition1 [ partition2 [ ... ]]]
    (-V|-E|-P) [-ofs partition_offset (=2)] [-f partition_list | partition1 [ partition2 [ ... ]]]
    --pstat-sym [-ofs partition_offset (=2)] [-f partition_list | partition1 [ partition2 [ ... ]]]
    (--ipot|--cpot|--jpot|--v-measure-h) entropy [-ext extensivity] [-ofs partition_offset (=2)] [-f partition_list | partition1 [ partition2 [ ... ]]]
    (--mpot|--cmpot|--SSSA) entropy [-ext extensivity] [-ofs partition_offset (=2)] [-f partition_list | partition1 [ partition2 [ ... ]]]
    (-C|-H) [-cons] [-ofs partition_offset (=2)] [-f partition_list | partition1 [ partition2 [ ... ]]]
    (-A|-S|--Info) [-ofs partition_offset (=2)] [-f partition_list | partition1 [ partition2 [ ... ]]]
For creating partitions ( Clustering )
    --cluster graph [-below|-above] [ threshold ]
    --cluster-robust graph [-s #samples] [-n #neighbors] [-below|-above] [-V|-E|-R|-T|-J] [-ext extensivity]
    --cluster-robust-self-consistently graph [-s #samples] [-n #neighbors] [-below|-above] [-V|-E|-R|-T|-J] [-ext extensivity]
For editing partitions
    --part-extract-elements elements_file [-tab mcl_tab_file] partition [partition1_offset (=2) ]
    --part-sort partition
    --part-sort-rename partition [prefix]
    --part-swap-names partition (requires use of --tab)
For converting between different partition formats
    --toMCL [-tab mcl_tab_file] partition [partition1_offset (=2) ]
    --toFREE partition [partition1_offset (=2) ]
    --MCLtoPART [-tab mcl_tab_file] partition [partition1_offset (=2) ]
    --MSAtoPART msa_file
For dealing with (fasta) sequence files
    --seq-noclone-sequences fasta_sequence_file [reference_sequence_file]
For analyzing Multiple Sequence Alignments
    --msa-seqid-stat [--positions positions_file] multiple_seq_alignment.fasta [multiple_seq_alignment.fasta2]
    --msa-seqid-avg [-thr threshold=50] multiple_seq_alignment.fasta [multiple_seq_alignment.fasta2]
    --msa-extract-positions positions_file multiple_seq_alignment.fasta
    --msa-extract-sequences sequences_file multiple_seq_alignment.fasta
    --msa-drop-sequences sequences_file multiple_seq_alignment.fasta
    --msa-extract-sequences-by-id  msa_file1 msa_file2 [minId maxId]
    --msa-drop-sequences-by-id sequences_file msa_file [minId maxId]
    --msa-extract-sequences-by-topid msa_file1 msa_file2 [count]
    --msa-drop-sequences-by-topid msa_file1 msa_file [count]
    --msa-map-partition partition multiple_seq_alignment.fasta [MSAformat]
    --msa-print [-sort|-nosort] multiple_seq_alignment.fasta
    --msa-redundant [-nsam nsam] [-nseq nseq] [-seed seed] multiple_seq_alignment.fasta
For dealing with -interaction- matrices (aka, undirected graph)
    --edge-dist matrix-of-values [partition [partition_offset (=2)] ]
    -m matrix-of-values1 matrix-of-values2
    -r matrix-of-values1 matrix-of-values2 partition [partition_offset (=2)]
    -l matrix-of-values1 matrix-of-values2
    --prune-edges-above float graphfile
    --prune-edges-below float graphfile
    --print-matrix matrix-of-values
    --graph-nodes matrix-of-values

partanalyzer aims to be a general program for analyzing (sets of) partitions. Here a partition is defined as in mathematical set theory (see http://en.wikipedia.org/wiki/Partition_of_a_set). It also supports rudimentary editing, as well as generation, of partitions.

Whenever many input files are expected, one can either list them as
command line arguments, or list them in a file and use option -f to
specify that file.
For calculating distances between partitions with different numbers of
elements, use option --DIST_SUBSPROJECT right before any *stat command.
This works only with *stat distance commands, i.e., not with purity scores.
OPTIONS:
      -z , --pid-normalization norm
         Determines the normalization used for calculating percent
         sequence identities. The possible string values for norm are:
                s , shorter-sequence
                p , aligned-positions
                r , aligned-residues
                l , average-length
         Default normalization is the average sequence length, l.
 COMMANDS:
Defining algebra of partitions
-i , --intersection , -m , --meet
   Calculate the intersection of  partition1 & partition2
-u , --union , -j , --join
   Calculate the union of  partition1 & partition2. This can be
   seen as the algebraic optimal consensus partition covering
   partition1 and partition2. Optimal means the most refined
   partition that covers both.
-I , --Intersection , -M , --Meet
   Calculate the intersection of  all partitions provided
-U , --Union , -J , --Join
   Calculate the union of all partitions provided. This can be
   seen as the algebraic optimal consensus partition covering
   each and every input partition. Here, optimal means the most
   refined partition that covers any of the input partitions.
   Remark: This algebraic consensus is very sensitive to outlier
   partitions.
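
As a rough illustration of the two operations described above, the following Python sketch computes the intersection (meet) and union (join) of two partitions given as lists of sets. The function names and the in-memory representation are assumptions made for this example; they are not taken from partanalyzer's source.

```python
def meet(p1, p2):
    """Intersection: every non-empty overlap of a cluster of p1 with one of p2."""
    return [a & b for a in p1 for b in p2 if a & b]


def join(p1, p2):
    """Union: the most refined partition that covers both p1 and p2.

    Clusters that share at least one element are merged transitively,
    using a small union-find structure (an implementation choice of
    this sketch).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for cluster in list(p1) + list(p2):
        items = sorted(cluster)
        for item in items:
            find(item)  # register every element
        for item in items[1:]:
            parent[find(item)] = find(items[0])

    merged = {}
    for x in list(parent):
        merged.setdefault(find(x), set()).add(x)
    return list(merged.values())


p1 = [{"a", "b"}, {"c"}, {"d", "e"}]
p2 = [{"a"}, {"b", "c"}, {"d", "e"}]
print(meet(p1, p2))
print(join(p1, p2))
```

With these inputs the meet splits {a, b} into singletons, while the join merges a, b and c into one cluster because {a, b} and {b, c} overlap.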

For analyzing partitions

-v , --vi-distance
   Calculate VI distances between partition1 & partition2
-e , --edit-distance
   Calculate the edit score distance between partition1 and
        partition2
-p , --purity-scores
   Calculates the purity scores of partition2 (the target)
        against partition1 (the reference).
-c , --check-consistency-of-partition , --ccop [-tab tab_file]
   Check cluster consistency according to the given matrix. If
   partition and graph matrix label items differently, use the
   option -tab to provide a tab file specifying the conversion.
   (See below for the syntax of the matrix and tab file)
-d , --intra-inter-edge-dist
   Calculate intra and inter cluster distribution of weights
   according to the given matrix
-Q , --qstat [-ext extensivity] [-ref]
   Calculates the Tarantola distance for each pair of partitions.
   For that it uses Jeffrey's Qnorm based on Shannon entropy.
   With option -ref, the first partition is taken as a reference
   and it calculates the distances of all against that one.
   Default extensivity coefficient is 2.
-R , --rstat [-ext extensivity] [-ref]
   Calculates Renyi distances for each pair of partitions.
   With option -ref, the first partition is taken as a reference
   and it calculates the distances of all against that one.
   Default extensivity coefficient is 2.
-T , --tstat [-ext extensivity] [-ref]
   Calculates Tsallis distances for each pair of partitions.
   With option -ref, the first partition is taken as a reference
   and it calculates the distances of all against that one.
   Default extensivity coefficient is 2.
-B , --bstat [-ref]
   Calculates the Boltzmann distance for each pair of partitions.
   With option -ref, the first partition is taken as a reference
   and it calculates the distances of all against that one.
-V , --vstat [-ref]
   Calculates the VI distance for each pair of partitions.
   With option -ref, the first partition is taken as a reference
   and it calculates the distances of all against that one.
-E , --estat [-ref]
   Calculates the Edit Score distance for each pair of partitions
   With option -ref, the first partition is taken as a reference
   and it calculates the distances of all against that one.
-P , --pstat [-ref | -target]
   Calculates the purity scores (strict and lax) for each pair
   of partitions. With option -ref, it calculates the purity
   scores of all against the first one, which is taken as a
   reference. With option -target, the first one is considered
   the target and it calculates the scores of that one against
   all others taken as reference.
--pstat-sym , --pstat-symmetric
   Calculates arithmetic averages of purity strict and purity lax
   scores for each pair of partitions.
-n , --ipot  entropy [(-e|-ext) extensivity] [-ref]
   Calculates (information theoretic) potential (entropy) of each
   partition. The possible values for entropy are (short|long):
   v | s | vonneumann | shannon
   b     | boltzmann
   c | e | cardinality
   r     | renyi
   t     | tsallis
   q     | tarantola/jeffrey/tjqn
   Both long and short option names are valid.
   Default extensivity coefficient is 2.
   For cardinality potential, this coefficient will be used as a
   gauge determining the card(1)=1+extensivity.
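
To make the entropy choices more concrete, here is a minimal Python sketch of three of the listed potentials computed from the cluster-size distribution of a partition. It assumes that the extensivity coefficient plays the role of the order parameter q in the standard Renyi and Tsallis formulas, and it uses natural logarithms; the help text confirms neither choice, and the function names are hypothetical.

```python
from math import log


def cluster_probabilities(partition):
    """Relative cluster sizes n_k / N."""
    n = sum(len(c) for c in partition)
    return [len(c) / n for c in partition]


def shannon(partition):
    return -sum(p * log(p) for p in cluster_probabilities(partition))


def renyi(partition, q=2.0):
    # Assumption: q corresponds to the "extensivity" coefficient.
    if q == 1:
        return shannon(partition)  # limiting case
    probs = cluster_probabilities(partition)
    return log(sum(p ** q for p in probs)) / (1 - q)


def tsallis(partition, q=2.0):
    # Assumption: q corresponds to the "extensivity" coefficient.
    if q == 1:
        return shannon(partition)  # limiting case
    probs = cluster_probabilities(partition)
    return (1 - sum(p ** q for p in probs)) / (q - 1)


halves = [{1, 2}, {3, 4}]
print(shannon(halves), renyi(halves), tsallis(halves))
```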
--cpot, --conditional-potential entropy [-ext extensivity] [-ref]
   Calculates conditional entropy for each pair of partitions.
   The possible values for entropy and extensivity are the same
   as for option --ipot.
--jpot, --joint-potential entropy [-ext extensivity] [-ref]
   Calculates joint entropy for each pair of partitions.
   The possible values for entropy and extensivity are the same
   as for option --ipot.
--mpot, --mutual-potential entropy [-ext extensivity] [-ref]
--SA, --subadditivity entropy [-ext extensivity]
   Calculates the mutual potential (mutual information)
   for each pair of partitions. If positive, subadditivity holds.
   The possible values for entropy and extensivity are the same
   as for option --ipot.
--cmpot, --conditional-mutual-potential entropy [-ext extensivity]
--SSA, --strong-subadditivity entropy [-ext extensivity] [-ref]
   Calculates the conditional mutual potential (conditional
   mutual information) for each pair of partitions. If positive,
   for all three partitions, then strong subadditivity holds.
   The possible values for entropy and extensivity are the same
   as for option --ipot.
--SSSA, --soft-strong-subadditivity entropy [-ext extensivity] [-ref]
   Calculates a softer version of the strong subadditivity
   condition for all three partitions. If positive, then the
   potential acts as a norm and defines a metric, which thus
   satisfies the triangular inequality.
   The possible values for entropy and extensivity are the same
   as for option --ipot.
--v-measure-h , --v-measure-harmonic entropy [-ext extensivity] [-ref]
   Calculates the V-measure between each pair of partitions. This
   measure is as defined by Rosenberg, A. and Hirschberg, J.
   in http://acl.ldc.upenn.edu/D/D07/D07-1043.pdf. Use global
   option --beta for specifying relative weight of homogeneity
   versus completeness. Default is equal weight, i.e., beta=1,
   and thus the average between both is strictly a harmonic one.
   The possible values for entropy and extensivity are the same
   as for option --ipot.
--v-measure-a , --v-measure-arithmetic entropy [-ext extensivity] [-ref]
   Analogous to --v-measure-h but using arithmetic mean between
   homogeneity and completeness.
--v-measure-g , --v-measure-geometric entropy [-ext extensivity] [-ref]
   Analogous to --v-measure-h but using geometric mean between
   homogeneity and completeness.
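
The three V-measure variants can be sketched in Python as follows, using Shannon entropy and the homogeneity/completeness definitions from the Rosenberg and Hirschberg paper. The function names are hypothetical, and how partanalyzer applies beta in the arithmetic and geometric variants is not documented here, so this sketch applies beta only to the harmonic form.

```python
from math import log, sqrt


def entropy(partition, n):
    """Shannon entropy of the cluster-size distribution."""
    return -sum(len(c) / n * log(len(c) / n) for c in partition)


def conditional_entropy(a, b, n):
    """H(A|B) = -sum over overlaps of n_ab/n * log(n_ab / n_b)."""
    total = 0.0
    for cb in b:
        for ca in a:
            n_ab = len(ca & cb)
            if n_ab:
                total -= n_ab / n * log(n_ab / len(cb))
    return total


def v_measure(reference, clusters, beta=1.0, mean="harmonic"):
    n = sum(len(c) for c in reference)
    h_ref, h_clu = entropy(reference, n), entropy(clusters, n)
    hom = 1.0 if h_ref == 0 else 1.0 - conditional_entropy(reference, clusters, n) / h_ref
    com = 1.0 if h_clu == 0 else 1.0 - conditional_entropy(clusters, reference, n) / h_clu
    if mean == "arithmetic":
        return (hom + com) / 2
    if mean == "geometric":
        return sqrt(max(hom * com, 0.0))  # clamp tiny negative rounding errors
    denom = beta * hom + com
    return (1 + beta) * hom * com / denom if denom else 0.0


reference = [{1, 2}, {3, 4}]
singletons = [{1}, {2}, {3}, {4}]
print(v_measure(reference, singletons))
```

In this example the singleton clustering is perfectly homogeneous but only half complete, so the harmonic V-measure with beta=1 comes out at 2/3.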
-C , --cluster-stat [-ofs ofs] [-norm gaug] [-cons|-consensus]
   For each item, determines the most frequent cluster where
   it appears among all the clusters of all the given partitions.
   It also prints its size and observed frequency (both, raw
   count and %).
   Option -ofs (see below) allows specifying a partition offset.
   Option -norm gaug gauges the normalization used for
          determining the %frequencies. By default these are
          calculated by counting how many times the mode cluster
          is found at each of the different partitions and then
          dividing by the number of partitions N. With this
          option, that count gets divided by N+gaug, where gaug
          can be negative or positive.
   Option -cons or -consensus will print the consensus partition
-A , --adjacency-stat , {--adjstat}
   Determines the average adjacency matrix from the provided
   partitions. The adjacency matrix of a partition is the graph
   where edges (0 or 1) represent two elements belonging to the
   same subfamily. The average adjacency matrix has edges with
   continuous values in [0,1]. The output consists of a matrix of
   values and a gray-scale image of it in PGM format.
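
The averaging step can be pictured with the short Python sketch below. It assumes all partitions are defined over the same set of elements and that the diagonal counts each element as sharing a cluster with itself; both are assumptions of this example, as is the function name. The PGM image output is not reproduced.

```python
def average_adjacency(partitions):
    """Return (items, matrix), where matrix[i][j] is the fraction of
    partitions in which items[i] and items[j] share a cluster."""
    items = sorted(set().union(*partitions[0]))
    index = {item: i for i, item in enumerate(items)}
    size = len(items)
    matrix = [[0.0] * size for _ in range(size)]
    weight = 1.0 / len(partitions)
    for partition in partitions:
        for cluster in partition:
            for a in cluster:
                for b in cluster:
                    matrix[index[a]][index[b]] += weight
    return items, matrix


partitions = [
    [{"a", "b"}, {"c"}],
    [{"a"}, {"b", "c"}],
]
items, matrix = average_adjacency(partitions)
for item, row in zip(items, matrix):
    print(item, row)
```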
-S , --split-merge-analysis , {--splitstat}
   (Split-Merge plot)
   Determines the overlap of each cluster to those of the
   reference partition (the first). Possible values for
   the overlap are:
   -over fraction of elements in common relative to the target cluster.
   -cos  cosine normalized similarity
   It outputs:
     -Confusion matrix (in % of the target clusters) taking
       the first partition as reference and the second as target.
     -number of overlaps for each target cluster
     -Split-Merge image showing the CT. In addition it shows two
       reference color bars: a bottom color bar representing the
       perfect split transformations (black), the merge-only
       (white) ones and those cases in between (different grey
       levels); a right-most column shows whether these are perfect
       matches (black) or not (white).
--Info , {--isPart , --isaPart , --is-partition} [-ofs ofs]
    For each partition checks whether it is a sound partition
    or not, i.e., whether all of its clusters are pair-wise
    disjoint. With option -q, only error message will be printed
    in case partition is not sound, otherwise it'll keep silent.
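
The soundness condition described above (pair-wise disjoint clusters) amounts to the following check, shown here as a small Python sketch with a hypothetical function name:

```python
def is_sound_partition(clusters):
    """True if no element appears in more than one cluster."""
    seen = set()
    for cluster in clusters:
        members = set(cluster)
        if seen & members:
            return False  # some element was already used by an earlier cluster
        seen |= members
    return True


print(is_sound_partition([["a", "b"], ["c"]]))
print(is_sound_partition([["a", "b"], ["b", "c"]]))
```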
-H, --hasse-diagram
    prints the local Hasse Diagram (graph) spanned by the
    given partitions.

For creating partitions ( clustering )

--cluster graph [ [-below|-above] threshold ]
    Defines clusters from the transitivity relation given by the
    graph's edges. If a threshold is provided, it first prunes the
    edges below the threshold. Example:
        partanalyzer --cluster gf -below 0.7
        partanalyzer --cluster gf 0.7
    Both cases will first prune the edges below 0.7 and then obtain
    the clusters generated that way. For pruning above we must use
    the explicit form
        partanalyzer --cluster gf -above 0.7
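
Clustering by the transitivity relation is equivalent to finding connected components of the pruned graph. The Python sketch below illustrates the idea under some assumptions that the help text does not settle: edges exactly equal to the threshold are kept, and nodes whose edges are all pruned are reported as singleton clusters. The function name and edge representation are hypothetical.

```python
def cluster_graph(edges, threshold=None, prune="below"):
    """edges: iterable of (node_a, node_b, weight) tuples."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, weight in edges:
        find(a)
        find(b)  # register both nodes even if the edge is pruned
        if threshold is not None:
            if prune == "below" and weight < threshold:
                continue
            if prune == "above" and weight > threshold:
                continue
        parent[find(a)] = find(b)

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


edges = [("a", "b", 0.9), ("b", "c", 0.5), ("c", "d", 0.8)]
print(cluster_graph(edges, 0.7))            # prune edges below 0.7
print(cluster_graph(edges, 0.7, "above"))   # prune edges above 0.7
```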

--cluster-robust graph [-s #samples] [-n #neighbors] [-below|-above] [-V|-E|-R|-T|-J] [-ext extensivity]
    Gives the most robust clustering with respect to edge pruning.
    This is defined as the partition showing the smallest average
    variability _after_ the phase transition. The average
    variability is calculated as the average distance against those
    partitions at its #neighbors nearest pruning thresholds
    (#neighbors above; #neighbors below).
    It repeatedly clusters the graph starting with a pruning
    threshold equal to the lowest edge and increasing it by a fixed
    amount until reaching the highest edge value. The total
    number of samples determines each step increase of threshold.
    We may be pruning the edges above the threshold (as if the
    latter were a temperature T) or below the threshold (1/T).
    Defaults: #samples=10 ; Pruning=below ; Metric=shannon (-V)
    #neighbors=2.
--RDC
--cluster-robust-self-consistently graph [-s #samples] [-n #neighbors] [-below|-above] [-V|-E|-R|-T|-J] [-ext extensivity]
    As --cluster-robust, but it determines self-consistently the
    largest possible number of samples. The latter is defined as
    the largest for which each pruning interval removes at least
    one edge. The method used is bisection and the provided
    #samples is used as the seed for the search. All defaults as
    for --cluster-robust.

For editing partitions

--part-extract-elements {--extract-elements} elements_file
    elements_file lists the names of the elements to cull from
    the given partition.

--part-sort
    Sorts the clusters by size, the larger on top. Ties are
    sorted alphabetically by their first item. Within each
    cluster, items are sorted alphabetically.
--part-sort-rename partition [prefix]
    As --part-sort, but also renames each cluster consecutively
    as C1, C2, etc. If a prefix string is supplied, uses that
    instead of C.
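
The ordering rules of --part-sort and the renaming of --part-sort-rename can be sketched in Python as below. The function names are hypothetical, and "alphabetically" is approximated by Python's default string ordering, which may differ from the program's collation.

```python
def sort_partition(clusters):
    """Sort items within each cluster, then clusters by size (largest
    first), breaking ties by the first item of each cluster."""
    ordered = [sorted(cluster) for cluster in clusters]
    return sorted(ordered, key=lambda c: (-len(c), c[:1]))


def sort_rename(clusters, prefix="C"):
    """As sort_partition, but label clusters prefix1, prefix2, ..."""
    return [
        (f"{prefix}{i}", cluster)
        for i, cluster in enumerate(sort_partition(clusters), start=1)
    ]


clusters = [{"z", "y"}, {"b"}, {"a", "c"}, {"m", "n", "o"}]
for name, items in sort_rename(clusters):
    print(len(items), name, *items)
```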
--part-swap-names partition
--part-swap-labels partition
    Swaps elements' names present in partition by their new
    names as found in the provided tab file. An element's name in
    the partition will be changed if and only if there is a translation for
    it found in the tab file; otherwise it will be left as it is.
    Thus, it is not mandatory to provide a translation for all
    elements. Requires the use of --tab to specify a tab file
    providing the mapping between new and old names. See general
    options.

For converting between different partition formats

--toMCL [-tab mcl_tab_file] converts partition from PART format
    to MCL's format. If additional tab file is provided, output
    will contain the specific label index given in the tab file.
--toFREE converts partition from PART format to FREE format.
--MCLtoPART [-tab mcl_tab_file]
    converts partition from MCL format to PART format.
    If additional tab file is provided, output will contain
    items' labels, instead of simply their MCL index number.
--MSAtoPART
    converts a MSA file in FASTA format containing the
    labeling of clusters into a partition in PART format. Cluster
    labels are expected in a separate line before the actual
    set of sequences, i.e.,
    %Group_A
    >Sequence_A1
    ...
    or
    ==Group_A
    >Sequence_A1
    ...
    Escape characters indicating cluster labels can be mixed in
    the same file, although it's not recommended.

For dealing with (fasta) sequence files

--drop-clone-sequences
--msa-noclone-sequences
--seq-noclone-sequences sequence_file (fasta)
    Given a (fasta) sequence file or a fasta MSA file, removes all
    duplicate sequences. Here duplicate means literally that,
    namely, exactly the same string of characters. Therefore, it
    is not the same as having a pid=100%, but more stringent.
    If a second sequence file is provided, also drops sequences
    that are clones of any sequence in the second file.
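
The clone-removal rule (exact string equality) can be illustrated with the Python sketch below. The minimal FASTA reader, the function names, and the choice to keep the first occurrence of each duplicated sequence are assumptions of this example rather than documented behavior.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (label, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line  # sequences may span several lines
    return [(label, seq) for label, seq in records]


def drop_clones(records, reference=()):
    """Keep only sequences not seen before, neither earlier in records
    nor in the optional reference records."""
    seen = {seq for _, seq in reference}
    kept = []
    for label, seq in records:
        if seq in seen:
            continue
        seen.add(seq)
        kept.append((label, seq))
    return kept


text = ">s1\nACDE\n>s2\nAC\nDE\n>s3\nACDF\n"
print(drop_clones(read_fasta(text)))
```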

For analyzing Multiple Sequence Alignments

--msa-seqid-stat
--msa-seqid-stat [--positions file]
   Given a multiple sequence alignment in fasta format, it
   prints all pair-wise sequence identities. By default, it
   calculates identities over the full sequence length. The
   second version allows specifying the (reduced) set of positions
   we want to consider in comparing sequences. These should be
   specified in a file, each separated by spaces, tabs, new lines,
   etc. The positions are understood as columns of the MSA.
   If two MSA are provided, it prints the sequence Id of the
   first set against the second.
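
As an illustration of a pair-wise identity restricted to selected columns, here is a minimal Python sketch. It covers only the `l` (average-length) and `s` (shorter-sequence) normalizations, and it assumes that `-` is the gap character, that lengths count non-gap residues within the considered columns, and that positions are 0-based column indices. None of these details are confirmed by the help text, and the function name is hypothetical.

```python
GAP = "-"  # assumed gap character


def percent_identity(seq1, seq2, norm="l", positions=None):
    """Percent identity between two aligned sequences of equal length."""
    columns = list(range(len(seq1))) if positions is None else list(positions)
    matches = sum(1 for i in columns if seq1[i] == seq2[i] and seq1[i] != GAP)
    len1 = sum(1 for i in columns if seq1[i] != GAP)
    len2 = sum(1 for i in columns if seq2[i] != GAP)
    if norm == "l":      # average-length (the documented default)
        denom = (len1 + len2) / 2
    elif norm == "s":    # shorter-sequence
        denom = min(len1, len2)
    else:
        raise ValueError(f"normalization not covered by this sketch: {norm!r}")
    return 100.0 * matches / denom if denom else 0.0


print(percent_identity("ACDE", "AC-E"))
print(percent_identity("ACDE", "AC-E", norm="s"))
print(percent_identity("ACDE", "AC-E", positions=[0, 1]))
```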
--msa-seqid-avg [-thr threshold ]
   Similar to option --msa-seqid-stat, but prints for each sequence
   statistics of its pair-wise sequence identity to all other
   sequences. This consists of average Seq.Id, standard
   deviation, variance, minimum Seq.Id, maximum Seq.Id, number
   of pairs with Seq.Id > threshold, fraction of pairs with Seq.
   Id. > threshold and total number of pairs.
   Option -thr allows providing a specific threshold to use.
               Default value is 50%. Values are floating-point numbers
               within [0,100].
   If two MSA are provided, it prints the sequence Id of the
   first set against the second.
--msa-extract-positions positions_file msa_file
   From the given MSA, extract only columns specified in file
   positions_file.
--msa-extract-sequences sequences_file msa_file
--msa-drop-sequences sequences_file msa_file
   From the given MSA, extract only sequences specified in file
   sequences_file. This file contains a list of sequence names.
   The second form drops those sequences instead.
   If a positions file is given, sequence Id's are calculated
   considering only those columns of the MSA.
--msa-extract-sequences-by-id msa_file1 msa_file2 [minId maxId]
--msa-drop-sequences-by-id sequences_file msa_file [minId maxId]
   From MSA msa_file1, extract sequences with an ID above minId
   and at most maxId against any sequence of MSA msa_file2.
   The second form drops those sequences instead. Default
   values are minId=30 and maxId=100, i.e., homologous sequences.
   If a positions file is given, sequence Id's are calculated
   considering only those columns of the MSA. In this case minId
   and maxId are mandatory and must come before positions_file.
--msa-extract-sequences-by-topid msa_file1 msa_file2 [count]
--msa-drop-sequences-by-topid sequences_file msa_file [count]
   From MSA msa_file1, extract at most count most similar sequences
   (seq.ID) to any sequence of MSA msa_file2.
   The second form drops those sequences instead.
   If a positions file is given, sequence Id's are calculated
   considering only those columns of the MSA. In this case count
   is mandatory and must come before positions_file.
--msa-redundant [-nsam nsam] [-nseq nseq] [-seed seed]
   Duplicates sequences chosen at random in the given multiple
   sequence alignment. Without options, only one is chosen.
   Option -nsam  Generate nsam samples of MSAs with nseq
                 duplicated sequences. Each sample is written in
                 its own directory.
          -nseq  Specify the number of sequences to duplicate.
          -seed  Specify the seed of the random number generator
   All options are expected to be integer values. The value of
   the seed is written within .seed_used allowing for repeated
   experiments.
--msa-map-partition
   Given a Partition and the original MSA, output the MSA
   with the cluster annotation format of the SDPpred server.
   MSAformat allows to specify the format of the output alignment
   Possible formats are: FASTA[23]*, SPEER[23]*, GDE[23]* and
   GSIM[23]*. Example: FASTA prints cluster information as a
   line heading the sequence label line starting with `%'; using
   FASTA2 prints the same but only clusters with 2 or more
   elements are printed (3 or more if format is FASTA3). Idem
   for the additional formats. SPEER prints the MSA appending the
   clusters' sizes as a last line; GDE is analogous to FASTA
   but uses `==' instead of `%'. Finally, GSIM adds cluster name
   as the last string of the fasta label separated from it by `|'.
--msa-print , --print-msa [-sort|-nosort]
   Prints the given multiple sequence alignment. Useful for
   debugging. With -sort, sequences are sorted alphabetically;
   -nosort leaves them sorted as in the input file (default).

For dealing with -interaction- matrices

--edge-dist
   For each node, prints the distribution of edge weights.
   Information printed is: Node, average edge weight, standard
   deviation, standard error, skewness, minimum edge value,
   max edge value and sample size (number of edges).
   If a partition is provided, it also prints the cluster size
   and cluster name each node belongs to.
-m , --merge-graphs
   Merge two graph matrices into one that contains both values
   for each pair of items, i.e., the resulting graph looks like
              stringA stringB  float1 float2
                 ...    ...      ...    ...
   where float1, float2 are the matrix values of matrix1 and
   matrix2, respectively. Both matrices are expected to contain
   the same set of pairs of items, i.e., the same set of edges.
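
The merge can be pictured with the Python sketch below. Since the matrices describe undirected graphs, the example treats (A, B) and (B, A) as the same edge and raises an error when the two edge sets differ; both behaviors, like the function name, are assumptions of this sketch.

```python
def merge_graphs(edges1, edges2):
    """Merge two edge lists of (item_a, item_b, value) tuples into
    rows of (item_a, item_b, value1, value2)."""

    def index(edges):
        # Sort each pair so that edge direction does not matter.
        return {tuple(sorted((a, b))): value for a, b, value in edges}

    g1, g2 = index(edges1), index(edges2)
    if g1.keys() != g2.keys():
        raise ValueError("both graphs must contain the same set of edges")
    return [(a, b, g1[(a, b)], g2[(a, b)]) for a, b in sorted(g1)]


graph1 = [("A", "B", 0.9), ("B", "C", 0.4)]
graph2 = [("B", "A", 0.8), ("C", "B", 0.1)]
for row in merge_graphs(graph1, graph2):
    print(*row)
```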
-r , --merge-graphs-color
   As option -m, but in addition includes the name of the
   cluster each pair of values belongs to. If they belong to
   different clusters the label is "x". The label is NAN
   if either item does not belong to any of the clusters
   defined in the given partition. The format of the output is
          float1 float2 clustername_AB stringA stringB
            ...    ...      ...          ...    ...
-l , --cull-edges
   Culls from matrix of values edges specified in second file.
--prune-edges-below float graphfile
--prune-edges-above float graphfile
   Removes all edges below or above the given threshold.
--graph-nodes graphfile
--matrix-nodes graphfile
   Prints the list of nodes of the given interaction matrix.
--graph-print [-c col] , --matrix-print [-c col]
   Print the given interaction matrix. For debugging. Integer
   col specifies the column containing the edge values. Default:
   col=3.

General options

--verbose
   For debugging.
-q , --quiet
   quiet mode. Do not print out comment lines (that start with
   `#').
-t , --format {--fmt} [pfmt=input_partition_file_format]
   Specify the default format expected for the input partitions.
   Possible format values are: PART,MCL and FREE. See below.
   As MCL is automatically recognized from the file content,
   this option will be useful in two cases:
   (1) to distinguish between PART and FREE input partitions,
   (2) in combination with --tab, if the output (specified with
       --oformat or the different format conversion options) is
       different from the (input) format specified with -t, the
       tabfile will be used for translating the labels of the
       elements; however, if the _specified_ input and output
       formats coincide, the original labels will be preserved.
       Example: p.mcl is in MCL format; p.lst, in PART format.
           partanalyzer -t PART --tab tbf -V p.mcl p.lst
       this gives the distance between the two by using the tab
       file on p.mcl, but NOT on p.lst.
   Default input format is PART.
--oformat [pfmt=input_partition_file_format]
   Specify the output format when printing partitions.
   Default output format is PART.

File formats:

matrix-of-values (an undirected graph):
                 stringA stringB  float
                 stringA stringC  float
                   ...     ...     ...
                 stringZ stringV  float
tab file:
                 integer1  string1
                 integer2  string2
                   ...       ...
partition:
  PART: (default, i.e., partition_offset=2)
        sizeA clusterA_name  item_1 item_2 ... item_sizeA
        sizeB clusterB_name  item_1 item_2 ... item_sizeB
         ...      ...       ...    ...         ...
     or (partition_offset=1):
        sizeA  item_1 item_2 ... item_sizeA
         ...    ...    ...         ...
  FREE: (not yet implemented) (partition_offset=0)
        item_1 item_2 ... item_sizeA
         ...    ...         ...
  MCL : MCL's own matrix format for partitions. See MCL manual.
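
To show how the partition_offset relates to the layouts above, here is a minimal Python reader for the PART and FREE layouts. It is a sketch only: the function name is hypothetical, lines starting with `#` are skipped as comments, and the check of the size field against the number of items is an assumption of this example.

```python
def parse_part(text, offset=2):
    """Parse partition text into a list of (cluster_name, items).

    offset=2: size name item_1 ...   (PART, default)
    offset=1: size item_1 ...
    offset=0: item_1 ...             (FREE)
    """
    clusters = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank and comment lines
        items = fields[offset:]
        if offset >= 1 and int(fields[0]) != len(items):
            raise ValueError(f"size field does not match item count: {line!r}")
        name = fields[1] if offset == 2 else None
        clusters.append((name, items))
    return clusters


text = "# example partition\n2 famA x y\n1 famB z\n"
print(parse_part(text))
print(parse_part("2 x y\n1 z\n", offset=1))
```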

License

# partanalyzer Version alpha 1.0.
# Copyright (c) Miguel A. Santos, May. 2008-2010. Build Feb 19 2010
# Licensed under the GNU GPL version 3 or later.
# (see http://www.gnu.org/copyleft/gpl.html )
#

Examples:

For the latest options check the help from the program
./partanalyze -h

Check consistency of a given partition test.subfam.lst based on a matrix of interactions given by test-blast_pairwise_id, i.e., how large the intra-cluster values are compared to the inter-cluster ones.
./partanalyze -c test-blast_pairwise_id test.subfam.lst
or
./partanalyze --check-consistency-of-partition test-blast_pairwise_id test.subfam.lst
which also accepts an abbreviated form as
./partanalyze --ccop test-blast_pairwise_id test.subfam.lst

Calculate VI distance between two partitions and between each of them and their intersection

Definition of VI distance: Given two partitions P1 and P2, with cluster size distributions {n_k} and {n_k'} respectively, where k and k' are indexes to each of their corresponding clusters, and such that Sum_k n_k = Sum_k' n_k' = N, the VI distance is defined as

    VI (P1,P2)  = Sum_k n_k/N * log( n_k/N) + Sum_k' n_k'/N * log( n_k'/N) - 2 * Sum_k Sum_k' n_kk'/N * log( n_kk'/N)
 where  n_kk' is the number of items common to cluster k of P1 and cluster k' of P2.
 This definition satisfies the triangular inequality, i.e., for any three partitions P1, P2 and P, it is
    VI (P1, P) + VI (P,P2) >= VI (P1,P2)
./partanalyze --vi-distance test.subfam.lst test.subfam.lst2
or simply
./partanalyze -v test.subfam.lst test.subfam.lst2
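
The VI definition given above translates almost term by term into the following Python sketch. The function name is hypothetical and natural logarithms are assumed, since the logarithm base is not stated here.

```python
from math import log


def vi_distance(p1, p2):
    """VI distance between two partitions given as lists of sets."""
    n = sum(len(c) for c in p1)
    if n != sum(len(c) for c in p2):
        raise ValueError("partitions must contain the same number of elements")
    total = 0.0
    for cluster in p1:                      # Sum_k n_k/N * log(n_k/N)
        total += len(cluster) / n * log(len(cluster) / n)
    for cluster in p2:                      # Sum_k' n_k'/N * log(n_k'/N)
        total += len(cluster) / n * log(len(cluster) / n)
    for a in p1:                            # -2 * Sum_kk' n_kk'/N * log(n_kk'/N)
        for b in p2:
            common = len(a & b)
            if common:
                total -= 2 * common / n * log(common / n)
    return total


p1 = [{1, 2}, {3, 4}]
p2 = [{1, 2, 3, 4}]
print(vi_distance(p1, p1))  # identical partitions
print(vi_distance(p1, p2))
```

Identical partitions are at distance zero; splitting a single cluster of four elements into two halves gives a distance of log 2.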

Print the intersection of 2 partitions test.subfam.lst and test.subfam.lst2
./partanalyze -i test.subfam.lst test.subfam.lst2
Performs the intersection of P1 and P2 as induced by the intersection operation on the underlying set (the one that contains all elements). This gives a new partition I such that each cluster of I is obtained as an intersection of one cluster of P1 and one of P2 (all against all).

Print the purity scores for partition1 (target) against partition2 (reference)
./partanalyze --purity-scores test.subfam.lst test.subfam.lst2
or simply
./partanalyze -p test.subfam.lst test.subfam.lst2

It outputs the purity strict and purity lax values.
Purity strict of P1 against P2 := the number of non-singleton clusters of P1 that are exactly identical to one of P2, divided by the number of non-singleton clusters of P2 (the reference).
Purity lax of P1 against P2 := the number of non-singleton clusters of P1 that are subsets of a cluster of P2, divided by the number of non-singleton clusters of P1 (the target).
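
The two purity definitions can be written down directly, as in this Python sketch. The function name is hypothetical, and returning 0.0 when a partition has no non-singleton clusters is an assumption made here to avoid dividing by zero.

```python
def purity_scores(target, reference):
    """Return (strict, lax) purity of target against reference."""
    t = [frozenset(c) for c in target if len(c) > 1]     # non-singletons of P1
    r = [frozenset(c) for c in reference if len(c) > 1]  # non-singletons of P2
    reference_set = set(r)
    exact = sum(1 for c in t if c in reference_set)
    contained = sum(1 for c in t if any(c <= rc for rc in r))
    strict = exact / len(r) if r else 0.0     # normalized by the reference
    lax = contained / len(t) if t else 0.0    # normalized by the target
    return strict, lax


target = [{"a", "b"}, {"c", "d"}, {"e"}]
reference = [{"a", "b"}, {"c", "d", "e"}]
print(purity_scores(target, reference))
```

Here one of the two non-singleton reference clusters is reproduced exactly (strict = 0.5), while both non-singleton target clusters fit inside a reference cluster (lax = 1.0).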

For debugging: print the interaction matrix read by the program
./partanalyze --print-matrix test-blast_pairwise_id
