Home - MASantos/Partanalyzer GitHub Wiki

Welcome to the Partanalyzer wiki!

The MAN file contains the full output of the --help option. It explains every option and command and provides a few examples.

Full help (as of version alpha 1.0.)
partanalyzer (Partition Analyzer)

Usage:
partanalyzer [-h|--help] (Use --help for more details)
partanalyzer --version
partanalyzer [OPTIONS] COMMAND ARGS

    OPTIONS   
            --debug   
            --verbose   
            -q , --quiet                          
            -z, --pid-normalization [s|p|r|l]   
            -t , --format partition_format   
            --tab tab_file      
            --DIST_SUBSPROJECT   
            --beta beta_value   
            --mu mu_value   

    COMMANDS

Defining the algebra of partitions
(-i|-u) partition1 partition2 [partition1_offset (=2) ] [partition2_offset (=partition1_offset) ]
(-I|-U) [-ofs partition_offset (=2)] [-f partition_list | partition1 [ partition2 [ ... ]]]

For analyzing partitions
(-v|-e|-p) partition1 partition2 [partition1_offset (=2) ] [partition2_offset (=partition1_offset) ]
-c matrix-of-values partition1 [threshold (=-1.0)] [partition_offset (=2)]
-d matrix-of-values partition1 [partition_offset (=2)]
(-Q|-R|-T) [-ext extensivity] [-ofs partition_offset (=2)] [-f partition_list | partition1 [ partition2 [ ... ]]]
(-V|-E|-P) [-ofs partition_offset (=2)] [-f partition_list | partition1 [ partition2 [ ... ]]]
--pstat-sym [-ofs partition_offset (=2)] [-f partition_list | partition1 [ partition2 [ ... ]]]
(--ipot|--cpot|--jpot|--v-measure-h) entropy [-ext extensivity] [-ofs partition_offset (=2)] [-f partition_list | partition1 [ partition2 [ ... ]]]
(--mpot|--cmpot|--SSSA) entropy [-ext extensivity] [-ofs partition_offset (=2)] [-f partition_list | partition1 [ partition2 [ ... ]]]
(-C|-H) [-cons] [-ofs partition_offset (=2)] [-f partition_list | partition1 [ partition2 [ ... ]]]
(-A|-S|--Info) [-ofs partition_offset (=2)] [-f partition_list | partition1 [ partition2 [ ... ]]]

For creating partitions ( Clustering )
--cluster graph [-below|-above] [ threshold ]
--cluster-robust graph [-s #samples] [-n #neighbors] [-below|-above] [-V|-E|-R|-T|-J] [-ext extensivity]
--cluster-robust-self-consistently graph [-s #samples] [-n #neighbors] [-below|-above] [-V|-E|-R|-T|-J] [-ext extensivity]

For editing partitions
--part-extract-elements elements_file [-tab mcl_tab_file] partition [partition1_offset (=2) ]
--part-sort partition
--part-sort-rename partition [prefix]
--part-swap-names partition (requires use of --tab)

For converting between different partition formats
--toMCL [-tab mcl_tab_file] partition [partition1_offset (=2) ]
--toFREE partition [partition1_offset (=2) ]
--MCLtoPART [-tab mcl_tab_file] partition [partition1_offset (=2) ]
--MSAtoPART msa_file

For dealing with (fasta) sequence files
--seq-noclone-sequences fasta_sequence_file [reference_sequence_file]

For analyzing Multiple Sequence Alignments
--msa-seqid-stat [--positions positions_file] multiple_seq_alignment.fasta [multiple_seq_alignment.fasta2]
--msa-seqid-avg [-thr threshold=50] multiple_seq_alignment.fasta [multiple_seq_alignment.fasta2]
--msa-extract-positions positions_file multiple_seq_alignment.fasta
--msa-extract-sequences sequences_file multiple_seq_alignment.fasta
--msa-drop-sequences sequences_file multiple_seq_alignment.fasta
--msa-extract-sequences-by-id msa_file1 msa_file2 [minId maxId]
--msa-drop-sequences-by-id sequences_file msa_file [minId maxId]
--msa-extract-sequences-by-topid msa_file1 msa_file2 [count]
--msa-drop-sequences-by-topid msa_file1 msa_file [count]
--msa-map-partition partition multiple_seq_alignment.fasta [MSAformat]
--msa-print [-sort|-nosort] multiple_seq_alignment.fasta
--msa-redundant [-nsam nsam] [-nseq nseq] [-seed seed] multiple_seq_alignment.fasta

For dealing with -interaction- matrices (aka, undirected graph)
--edge-dist matrix-of-values [partition [partition_offset (=2)] ]
-m matrix-of-values1 matrix-of-values2
-r matrix-of-values1 matrix-of-values2 partition [partition_offset (=2)]
-l matrix-of-values1 matrix-of-values2
--prune-edges-above float graphfile
--prune-edges-below float graphfile
--print-matrix matrix-of-values
--graph-nodes matrix-of-values

partanalyzer aims to be a general program for analyzing (sets of) partitions. Here a partition is defined as in the set theory of mathematics (see http://en.wikipedia.org/wiki/Partition_of_a_set). It can also generate partitions and, in a rudimentary way, edit them.

Whenever many input files are expected, one can either list them as command line arguments, or list them in a file and use option -f to specify that file.

For calculating distances between partitions with different numbers of elements, use option --DIST_SUBSPROJECT right before any *stat command. It works only with a *stat distance command, i.e., not with purity scores.

OPTIONS:

   -z , --pid-normalization norm
      Determines the normalization used for calculating percent
      sequence identities. The possible string values for norm are:
      s , shorter-sequence
      p , aligned-positions
      r , aligned-residues
      l , average-length
      Default normalization is the average sequence length, l.

COMMANDS:

Defining the algebra of partitions

   -i , --intersection , -m , --meet
      Calculate the intersection of  partition1 & partition2

   -u , --union , -j , --join
      Calculate the union of  partition1 & partition2. This can be
      seen as the algebraic optimal consensus partition covering
      partition1 and partition2. Optimal means the most refined
      partition that covers both.

   -I , --Intersection , -M , --Meet
      Calculate the intersection of  all partitions provided

   -U , --Union , -J , --Join
      Calculate the union of all partitions provided. This can be
      seen as the algebraic optimal consensus partition covering
      each and every input partition. Here, optimal means the most
      refined partition that covers any of the input partitions.
      Remark: This algebraic consensus is very sensitive to outlier
      partitions.
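The two operations above can be sketched in a few lines of Python. This is only an illustration of the set-theoretic definitions, not partanalyzer's own code; the function names `meet` and `join` and the representation of a partition as a list of sets are assumptions made for the example.

```python
from itertools import product

def meet(p1, p2):
    """Intersection (meet): every non-empty pairwise cluster intersection."""
    return [a & b for a, b in product(p1, p2) if a & b]

def join(p1, p2):
    """Union (join): merge clusters that overlap, directly or through a chain."""
    merged = []
    for cluster in list(p1) + list(p2):
        cluster = set(cluster)
        rest = []
        for block in merged:
            if block & cluster:
                cluster |= block      # absorb every block that overlaps
            else:
                rest.append(block)
        merged = rest + [cluster]
    return merged

p1 = [{"a", "b"}, {"c"}, {"d", "e"}]
p2 = [{"a"}, {"b", "c"}, {"d", "e"}]
print(sorted(map(sorted, meet(p1, p2))))  # [['a'], ['b'], ['c'], ['d', 'e']]
print(sorted(map(sorted, join(p1, p2))))  # [['a', 'b', 'c'], ['d', 'e']]
```

The join is the most refined partition that both inputs refine, which is why a single outlier partition that bridges two clusters is enough to merge them in the result.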

For analyzing partitions

   -v , --vi-distance
      Calculate VI distances between partition1 & partition2

   -e , --edit-distance
      Calculate the edit score distance between partition1 and
           partition2

   -p , --purity-scores
      Calculates the purity scores of partition2 (the target)
           against partition1 (the reference).

   -c , --check-consistency-of-partition , --ccop [-tab tab_file]
      Check cluster consistency according to the given matrix. If
      partition and graph matrix label items differently, use the
      option -tab to provide a tab file specifying the conversion.
      (See below for the syntax of the matrix and tab file)

   -d , --intra-inter-edge-dist
      Calculate intra and inter cluster distribution of weights
      according to the given matrix

   -Q , --qstat [-ext extensivity] [-ref]
      Calculates the Tarantola distance for each pair of partitions.
      For that it uses the Jeffrey's Qnorm based on Shannon Entropy.
      With option -ref, the first partition is taken as a reference
      and it calculates the distances of all against that one.
      Default extensivity coefficient is 2.

   -R , --rstat [-ext extensivity] [-ref]
      Calculates Renyi distances for each pair of partitions.
      With option -ref, the first partition is taken as a reference
      and it calculates the distances of all against that one.
      Default extensivity coefficient is 2.

   -T , --tstat [-ext extensivity] [-ref]
      Calculates Tsallis distances for each pair of partitions.
      With option -ref, the first partition is taken as a reference
      and it calculates the distances of all against that one.
      Default extensivity coefficient is 2.

   -B , --bstat [-ref]
      Calculates the Boltzmann distance for each pair of partitions.
      With option -ref, the first partition is taken as a reference
      and it calculates the distances of all against that one.

   -V , --vstat [-ref]
      Calculates the VI distance for each pair of partitions.
      With option -ref, the first partition is taken as a reference
      and it calculates the distances of all against that one.

   -E , --estat [-ref]
      Calculates the Edit Score distance for each pair of partitions.
      With option -ref, the first partition is taken as a reference
      and it calculates the distances of all against that one.

   -P , --pstat [-ref | -target]
      Calculates the purity scores (strict and lax) for each pair
      of partitions. With option -ref, it calculates the purity
      scores of all against the first one, which is taken as a
      reference. With option -target, the first one is considered
      the target and it calculates the scores of that one against
      all others taken as reference.

   --pstat-sym , --pstat-symmetric
      Calculates arithmetic averages of purity strict and purity lax
      scores for each pair of partitions.

   -n , --ipot  entropy [(-e|-ext) extensivity] [-ref]
      Calculates (information theoretic) potential (entropy) of each
      partition. The possible values for entropy are (short|long):
      v | s | vonneumann | shannon
      b     | boltzmann
      c | e | cardinality
      r     | renyi
      t     | tsallis
      q     | tarantola/jeffrey/tjqn
      Both long and short option names are valid.
      Default extensivity coefficient is 2.
      For cardinality potential, this coefficient will be used as a
      gauge determining the card(1)=1+extensivity.
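As a rough illustration of three of the potentials listed above, the textbook Shannon, Renyi and Tsallis entropies can be evaluated on the cluster-size fractions n_k/N of a partition. This sketch rests on two assumptions that the help text does not confirm: that the potentials are computed from cluster-size fractions, and that the extensivity coefficient plays the role of the order parameter q. The function names are hypothetical.

```python
import math

def shannon(sizes):
    """Shannon entropy of the cluster-size distribution (natural log)."""
    n = sum(sizes)
    return -sum(s / n * math.log(s / n) for s in sizes)

def renyi(sizes, q=2.0):
    """Renyi entropy of order q (q != 1); q stands in for the extensivity."""
    n = sum(sizes)
    return math.log(sum((s / n) ** q for s in sizes)) / (1.0 - q)

def tsallis(sizes, q=2.0):
    """Tsallis entropy of order q (q != 1); q stands in for the extensivity."""
    n = sum(sizes)
    return (1.0 - sum((s / n) ** q for s in sizes)) / (q - 1.0)

sizes = [2, 2]  # a partition of 4 items into two clusters of size 2
print(shannon(sizes), renyi(sizes), tsallis(sizes))
```

For two equal clusters the Shannon and Renyi values coincide (log 2), while the Tsallis value with q=2 is 0.5.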

   --cpot, --conditional-potential entropy [-ext extensivity] [-ref]
      Calculates conditional entropy for each pair of partitions.
      The possible values for entropy and extensivity are the same
      as for option --ipot.

   --jpot, --joint-potential entropy [-ext extensivity] [-ref]
      Calculates joint entropy for each pair of partitions.
      The possible values for entropy and extensivity are the same
      as for option --ipot.

   --mpot, --mutual-potential entropy [-ext extensivity] [-ref]
   --SA, --subadditivity entropy [-ext extensivity]
      Calculates the mutual potential (mutual information)
      for each pair of partitions. If positive, subadditivity holds.
      The possible values for entropy and extensivity are the same
      as for option --ipot.

   --cmpot, --conditional-mutual-potential entropy [-ext extensivity]
   --SSA, --strong-subadditivity entropy [-ext extensivity] [-ref]
      Calculates the conditional mutual potential (conditional
      mutual information) for each pair of partitions. If positive,
      for all three partitions, then strong subadditivity holds.
      The possible values for entropy and extensivity are the same
      as for option --ipot.

   --SSSA, --soft-strong-subadditivity entropy [-ext extensivity] [-ref]
      Calculates a softer version of the strong subadditivity
      condition for all three partitions. If positive, then the
      potential acts as a norm and defines a metric, which thus
      satisfies the triangular inequality.
      The possible values for entropy and extensivity are the same
      as for option --ipot.

   --v-measure-h , --v-measure-harmonic entropy [-ext extensivity] [-ref]
      Calculates the Vmeasure between each pair of partitions. This
      measure is the one defined by Rosenberg, A. and Hirschberg, J.
      in http://acl.ldc.upenn.edu/D/D07/D07-1043.pdf. Use global
      option --beta for specifying relative weight of homogeneity
      versus completeness. Default is equal weight, i.e., beta=1,
      and thus the average between both is strictly a harmonic one.
      The possible values for entropy and extensivity are the same
      as for option --ipot.

   --v-measure-a , --v-measure-arithmetic entropy [-ext extensivity] [-ref]
      Analogous to --v-measure-h but using arithmetic mean between
      homogeneity and completeness.

   --v-measure-g , --v-measure-geometric entropy [-ext extensivity] [-ref]
      Analogous to --v-measure-h but using geometric mean between
      homogeneity and completeness.
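A minimal sketch of the harmonic V-measure follows, using the Shannon-entropy definitions from the Rosenberg and Hirschberg paper: homogeneity h = 1 - H(ref|target)/H(ref), completeness c = 1 - H(target|ref)/H(target), and V = (1+beta)*h*c / (beta*h + c). It is not partanalyzer's implementation; the function names, the list-of-sets representation and the use of natural logarithms are assumptions, and both partitions are assumed to cover the same items.

```python
import math

def entropy(p):
    n = sum(len(c) for c in p)
    return -sum(len(c) / n * math.log(len(c) / n) for c in p)

def cond_entropy(p, given):
    """H(p | given), computed from the overlaps between clusters."""
    n = sum(len(c) for c in p)
    h = 0.0
    for g in given:
        for c in p:
            k = len(c & g)
            if k:
                h -= k / n * math.log(k / len(g))
    return h

def v_measure(ref, target, beta=1.0):
    h_ref, h_tgt = entropy(ref), entropy(target)
    hom = 1.0 if h_ref == 0 else 1.0 - cond_entropy(ref, target) / h_ref
    com = 1.0 if h_tgt == 0 else 1.0 - cond_entropy(target, ref) / h_tgt
    denom = beta * hom + com
    return 0.0 if denom == 0 else (1 + beta) * hom * com / denom

ref = [{"a", "b"}, {"c", "d"}]
singletons = [{"a"}, {"b"}, {"c"}, {"d"}]
print(v_measure(ref, ref), v_measure(ref, singletons))
```

Splitting every reference cluster into singletons keeps homogeneity at 1 but halves completeness, so the score drops from 1 to 2/3 with beta=1.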

   -C , --cluster-stat [-ofs ofs] [-norm gaug] [-cons|-consensus]
      For each item, determines the most frequent cluster where
      it appears among all the clusters of all the given partitions.
      It also prints its size and observed frequency (both, raw
      count and %).
      Option -ofs, see below, allows specifying a partition offset.
      Option -norm gaug gauges the normalization used for
             determining the %frequencies. By default these are
             calculated by counting how many times the mode cluster
             is found at each of the different partitions and then
             dividing by the number of partitions N. With this
             option, that count gets divided by N+gaug, where gaug
             can be negative or positive.
      Option -cons or -consensus will print the consensus partition.

   -A , --adjacency-stat , {--adjstat}
      Determines the average adjacency matrix from the provided
      partitions. The adjacency matrix of a partition is the graph
      where edges (0 or 1) represent two elements belonging to the
      same subfamily. The average adjacency matrix has edges with
      continuous values in [0,1]. The output consists of a matrix of
      values and a gray-scale image of it in PGM format.
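The averaging step can be pictured with the short Python sketch below. It assumes that all partitions cover the same set of items and that an item counts as adjacent to itself; both are assumptions for the illustration, as is the function name. PGM output is left out.

```python
def average_adjacency(partitions):
    """Return (items, matrix) where matrix[i][j] is the fraction of
    partitions in which items i and j share a cluster."""
    items = sorted(set().union(*partitions[0]))
    index = {x: i for i, x in enumerate(items)}
    n = len(items)
    avg = [[0.0] * n for _ in range(n)]
    weight = 1.0 / len(partitions)
    for p in partitions:
        for cluster in p:
            for a in cluster:
                for b in cluster:
                    avg[index[a]][index[b]] += weight
    return items, avg

parts = [[{"a", "b"}, {"c"}], [{"a"}, {"b", "c"}]]
items, avg = average_adjacency(parts)
for name, row in zip(items, avg):
    print(name, row)
```

Here `a` and `b` share a cluster in one of the two partitions, so their averaged edge is 0.5, while `a` and `c` never do and get 0.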

   -S , --split-merge-analysis , {--splitstat}
     (Split-Merge plot)
      Determines the overlap of each cluster to those of the
      reference partition (the first). Possible values for
      the overlap are:
      -over fraction of elements in common relative to the target cluster.
      -cos  cosine normalized similarity

      It outputs:
      -Confusion matrix (in % of the target clusters) taking
        the first partition as reference and the second as target.
      -number of overlaps for each target cluster
      -Split-Merge image showing the CT. In addition it shows two
        reference color bars: a bottom color bar representing the
        perfect split transformations (black), the merge-only
        (white) ones and those cases in between (different grey
        levels); a right-most column shows whether these are perfect
        matches (black) or not (white).

   --Info , {--isPart , --isaPart , --is-partition} [-ofs ofs]
       For each partition checks whether it is a sound partition
       or not, i.e., whether all of its clusters are pair-wise
       disjoint. With option -q, only an error message is printed
       if a partition is not sound; otherwise it stays silent.
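The soundness condition (pair-wise disjoint clusters) amounts to checking that no item appears in more than one cluster. A minimal sketch, with a hypothetical function name and clusters given as lists of item names:

```python
def overlapping_items(clusters):
    """Return the items found in more than one cluster.
    An empty result means the clusters are pair-wise disjoint."""
    seen, repeated = set(), set()
    for cluster in clusters:
        for item in set(cluster):   # ignore repeats inside one cluster
            if item in seen:
                repeated.add(item)
            else:
                seen.add(item)
    return repeated

print(overlapping_items([["a", "b"], ["c"]]))        # set()
print(overlapping_items([["a", "b"], ["b", "c"]]))   # {'b'}
```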

   -H, --hasse-diagram
       Prints the local Hasse Diagram (graph) spanned by the
       given partitions.

For creating partitions ( clustering )

   --cluster graph [ [-below|-above] threshold ]
       Defines clusters from the transitivity relation given by the
       graph's edges. If a threshold is provided, it first prunes the
       edges below the threshold. Example:
              partanalyzer --cluster gf -below 0.7
              partanalyzer --cluster gf 0.7
       Both cases first prune the edges below 0.7 and then obtain the
       clusters generated that way. For pruning above we must use the
       explicit form
              partanalyzer --cluster gf -above 0.7
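Clustering by the transitivity relation of the edges is the same as finding the connected components of the pruned graph. The Python sketch below shows the idea with a small union-find; it is not the program's code. The function name, the choice to keep edges exactly equal to the threshold, and the choice to keep nodes that lose all their edges as singleton clusters are assumptions.

```python
def cluster(edges, threshold=None, prune="below"):
    """edges: iterable of (node_a, node_b, weight) triples."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b, w in edges:
        find(a)
        find(b)                             # register nodes even if the edge is pruned
        pruned = threshold is not None and (
            w < threshold if prune == "below" else w > threshold)
        if not pruned:
            parent[find(a)] = find(b)       # link the two components

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

edges = [("a", "b", 0.9), ("b", "c", 0.5), ("c", "d", 0.8)]
print(sorted(map(sorted, cluster(edges, 0.7))))           # [['a', 'b'], ['c', 'd']]
print(sorted(map(sorted, cluster(edges, 0.7, "above"))))  # [['a'], ['b', 'c'], ['d']]
```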

   --cluster-robust graph [-s #samples] [-n #neighbors] [-below|-above] [-V|-E|-R|-T|-J] [-ext extensivity]
       Gives the most robust clustering with respect to edge pruning.
       This is defined as the partition showing the smallest average
       variability _after_ the phase transition. The average
       variability is calculated as the average distance against those
       partitions at its #neighbors nearest pruning thresholds
       (#neighbors above; #neighbors below).
       It repeatedly clusters the graph starting with a pruning
       threshold equal to the lowest edge and increasing it by a fixed
       amount until reaching the highest edge value. The total
       number of samples determines each step increase of threshold.
       We may be pruning the edges above the threshold (as if the
       latter were a temperature T) or below the threshold (1/T).
       Defaults: #samples=10 ; Pruning=below ; Metric=shannon (-V)
       #neighbors=2.

   --RDC
   --cluster-robust-self-consistently graph [-s #samples] [-n #neighbors] [-below|-above] [-V|-E|-R|-T|-J] [-ext extensivity]
       As --cluster-robust, but it determines self-consistently the
       largest possible number of samples. The latter is defined as
       the largest for which each pruning interval removes at least
       one edge. The method used is bisection and the provided
       #samples is used as the seed for the search. All defaults as
       for --cluster-robust.

For editing partitions

   --part-extract-elements {--extract-elements} elements_file
       elements_file lists the names of the elements to cull from
       the given partition.

   --part-sort
       Sorts the clusters by size, the larger on top. Ties are
       sorted alphabetically by their first item. Within each
       cluster, items are sorted alphabetically.
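The ordering rules described for --part-sort can be written as a single sort key. A minimal sketch, assuming non-empty clusters given as lists of item names (the function name is hypothetical):

```python
def part_sort(clusters):
    """Largest clusters first; ties broken alphabetically by first item;
    items inside each cluster sorted alphabetically."""
    ordered = [sorted(c) for c in clusters]
    return sorted(ordered, key=lambda c: (-len(c), c[0]))

print(part_sort([["z", "a"], ["m"], ["c", "b"], ["k"]]))
# [['a', 'z'], ['b', 'c'], ['k'], ['m']]
```

Sorting the items first matters: the tie-break uses each cluster's first item after its own items have been put in alphabetical order.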

   --part-sort-rename partition [prefix]
       As --part-sort, but also renames the clusters consecutively
       as C1, C2, etc. If a prefix string is supplied, it is used
       instead of C.

   --part-swap-names partition
   --part-swap-labels partition
       Swaps elements' names present in partition by their new
       names as found in the provided tab file. An element's name in
       the partition will be changed iff there is a translation for
       it found in the tab file; otherwise it will be left as it is.
       Thus, it is not mandatory to provide a translation for all
       elements. Requires the use of --tab to specify a tab file
       providing the mapping between new and old names. See general
       options.

For converting between different partition formats

   --toMCL [-tab mcl_tab_file] converts partition from PART format
       to MCL's format. If additional tab file is provided, output
       will contain the specific label index given in the tab file.

   --toFREE converts partition from PART format to FREE format.

   --MCLtoPART [-tab mcl_tab_file]
       converts partition from MCL format to PART format.
       If additional tab file is provided, output will contain
       items' labels, instead of simply their MCL index number.

   --MSAtoPART
       converts an MSA file in FASTA format containing the
       labeling of clusters into a partition in PART format. Cluster
       labels are expected in a separate line before the actual
       set of sequences, i.e.,
       %Group_A
       >Sequence_A1
       ...
       or
       ==Group_A
       >Sequence_A1
       ...
       Escape characters indicating cluster labels can be mixed in
       the same file, although it's not recommended.

For dealing with (fasta) sequence files

   --drop-clone-sequences
   --msa-noclone-sequences
   --seq-noclone-sequences sequence_file (fasta)
       Given a (fasta) sequence file or a fasta MSA file, removes all
       duplicate sequences. Here duplicate means literally that,
       namely, exactly the same string of characters. Therefore, it
       is not the same as having a pid=100%, but more stringent.
       If a second sequence file is provided, it also drops sequences
       that are clones of any sequence in the second file.
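Since a clone is defined as an exactly identical string, the filtering can be sketched with a set of already-seen sequences. The function name and the (name, sequence) tuple representation are assumptions; which copy of a duplicate the program keeps is not stated in the help text, so keeping the first occurrence is also an assumption.

```python
def drop_clones(records, reference=()):
    """records, reference: iterables of (name, sequence) pairs.
    Keeps the first occurrence of each distinct sequence string and
    drops anything identical to a sequence in the reference set."""
    seen = {seq for _, seq in reference}
    kept = []
    for name, seq in records:
        if seq not in seen:
            seen.add(seq)
            kept.append((name, seq))
    return kept

records = [("s1", "ACD-E"), ("s2", "ACD-E"), ("s3", "ACDE"), ("s4", "GGH")]
print(drop_clones(records))
print(drop_clones(records, reference=[("r1", "GGH")]))
```

Note that `ACD-E` and `ACDE` are kept as distinct entries: they would align with 100% identity but are not the same string.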

For analyzing Multiple Sequence Alignments

   --msa-seqid-stat
   --msa-seqid-stat [--positions file]
      Given a multiple sequence alignment in fasta format, it
      prints all pair-wise sequence identities. By default, it
      calculates identities over the full sequence length. The
      second version allows specifying the (reduced) set of positions
      we want to consider in comparing sequences. These should be
      specified in a file, each separated by spaces, tabs, new lines,
      etc. The positions are understood as columns of the MSA.
      If two MSAs are provided, it prints the sequence Id of the
      first set against the second.

   --msa-seqid-avg [-thr threshold ]
      Similar to option --msa-seqid-stat, but prints for each sequence
      statistics of its pair-wise sequence identity to all other
      sequences. These consist of average Seq.Id, standard
      deviation, variance, minimum Seq.Id, maximum Seq.Id, number
      of pairs with Seq.Id > threshold, fraction of pairs with
      Seq.Id > threshold and total number of pairs.
      Option -thr allows providing a specific threshold to use.
                  Default value is 50%. Values are floating-point
                  numbers within [0,100].
      If two MSAs are provided, it prints the sequence Id of the
      first set against the second.

   --msa-extract-positions positions_file msa_file
      From the given MSA, extract only columns specified in file
      positions_file.

   --msa-extract-sequences sequences_file msa_file
   --msa-drop-sequences sequences_file msa_file
      From the given MSA, extract only sequences specified in file
      sequences_file. This file contains a list of sequence names.
      The second form drops those sequences instead.
      If a positions file is given, sequence Id's are calculated
      considering only those columns of the MSA.

   --msa-extract-sequences-by-id msa_file1 msa_file2 [minId maxId]
   --msa-drop-sequences-by-id sequences_file msa_file [minId maxId]
      From MSA msa_file1, extract sequences with an ID above minId
      and at most maxId against any sequence of MSA msa_file2.
      The second form drops those sequences instead. Default
      values are minId=30 and maxId=100, i.e., homologous sequences.
      If a positions file is given, sequence Id's are calculated
      considering only those columns of the MSA. In this case minId
      and maxId are mandatory and must come before positions_file.

   --msa-extract-sequences-by-topid msa_file1 msa_file2 [count]
   --msa-drop-sequences-by-topid sequences_file msa_file [count]
      From MSA msa_file1, extract at most count most similar sequences
      (seq.ID) to any sequence of MSA msa_file2.
      The second form drops those sequences instead.
      If a positions file is given, sequence Id's are calculated
      considering only those columns of the MSA. In this case count
      is mandatory and must come before positions_file.

   --msa-redundant [-nsam nsam] [-nseq nseq] [-seed seed]
      Duplicates sequences chosen at random in the given multiple
      sequence alignment. Without options, only one is chosen.
      Option -nsam  Generate nsam samples of MSAs with nseq
                    duplicated sequences. Each sample is written in
                    its own directory.
             -nseq  Specify the number of sequences to duplicate.
             -seed  Specify the seed of the random number generator
      All options are expected to be integer values. The value of
      the seed is written within .seed_used allowing for repeated
      experiments.

   --msa-map-partition
      Given a Partition and the original MSA, output the MSA
      with the cluster annotation format of the SDPpred server.
      MSAformat specifies the format of the output alignment.
      Possible formats are: FASTA[23]*, SPEER[23]*, GDE[23]* and
      GSIM[23]*. Example: FASTA prints cluster information as a
      line heading the sequence label line starting with `%'; using
      FASTA2 prints the same but only clusters with 2 or more
      elements are printed (3 or more if format is FASTA3). Idem
      for the additional formats. SPEER prints the MSA appending the
      clusters' sizes as a last line; GDE is analogous to FASTA
      but uses `==' instead of `%'. Finally, GSIM adds cluster name
      as the last string of the fasta label separated from it by `|'.

   --msa-print , --print-msa [-sort|-nosort]
      Prints the given multiple sequence alignment. Useful for
      debugging. With -sort, sequences are sorted alphabetically;
      -nosort leaves them sorted as in the input file (default).

For dealing with -interaction- matrices

   --edge-dist
      For each node, prints the distribution of edge weights.
      Information printed is: Node, average edge weight, standard
      deviation, standard error, skewness, minimum edge value,
      max edge value and sample size (number of edges).
      If a partition is provided, it also prints the cluster size
      and cluster name each node belongs to.

   -m , --merge-graphs
      Merge two graph matrices into one that contains both values
      for each pair of items, i.e., the resulting graph looks like

                    stringA stringB  float1 float2
                       ...    ...      ...    ...
      where float1, float2 are the matrix values of matrix1 and
      matrix2, respectively. Both matrices are expected to contain
      the same set of pair of items, i.e., the same set of edges.

   -r , --merge-graphs-color
      As option -m, but in addition includes the name of the
      cluster each pair of items belongs to. If they belong to
      different clusters the label is "x". The label is NAN
      if either item does not belong to any of the clusters
      defined in the given partition. The format of the output is
             float1 float2 clustername_AB stringA stringB
               ...     ...      ...          ...    ...

   -l , --cull-edges
      Culls from matrix of values edges specified in second file.

   --prune-edges-below float graphfile
   --prune-edges-above float graphfile
      Removes all edges below or above the given threshold.

   --graph-nodes graphfile
   --matrix-nodes graphfile
      Prints the list of nodes of the given interaction matrix.

   --graph-print [-c col] , --matrix-print [-c col]
      Print the given interaction matrix. For debugging. Integer
      col specifies the column containing the edge values. Default:
      col=3.

General options

   --verbose
      For debugging.

   -q , --quiet
      Quiet mode. Do not print out comment lines (that start with
      `#').

   -t , --format {--fmt} [pfmt=input_partition_file_format]
      Specify the default format expected for the input partitions.
      Possible format values are: PART,MCL and FREE. See below.
      As MCL is automatically recognized from the file content,
      this option will be useful in two cases:
      (1) to distinguish between PART and FREE input partitions,
      (2) in combination with --tab, if the output (specified with
          --oformat or the different format conversion options) is
          different from the (input) format specified with -t, the
          tabfile will be used for translating the labels of the
          elements; however, if the _specified_ input and output
          formats coincide, the original labels will be preserved.
          Example: p.mcl is in MCL format; p.lst, in PART format.
              partanalyzer -t PART --tab tbf -V p.mcl p.lst
          this gives the distance between the two by using the tab
          file on p.mcl, but NOT on p.lst.
      Default input format is PART.

   --oformat [pfmt=input_partition_file_format]
      Specify the output format when printing partitions.
      Default output format is PART.

File formats:

   matrix-of-values (an undirected graph):
                    stringA stringB  float
                    stringA stringC  float
                      ...     ...     ...
                    stringZ stringV  float

   tab file:
                    integer1  string1
                    integer2  string2
                      ...       ...

   partition:
     PART: (default, i.e., partition_offset=2)
           sizeA clusterA_name  item_1 item_2 ... item_sizeA
           sizeB clusterB_name  item_1 item_2 ... item_sizeB
            ...      ...       ...    ...         ...
        or (partition_offset=1):
           sizeA  item_1 item_2 ... item_sizeA
            ...    ...    ...         ...
     FREE: (not yet implemented) (partition_offset=0)
           item_1 item_2 ... item_sizeA
            ...    ...         ...
     MCL : MCL's own matrix format for partitions. See MCL manual.
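To make the PART layout concrete, here is a minimal Python sketch of a reader for it. It is not the program's parser: the function name, the skipping of lines that start with `#`, and the placeholder names `C1`, `C2`, ... used when the file carries no cluster name (partition_offset below 2) are assumptions for the illustration.

```python
def parse_part(lines, offset=2):
    """Parse PART-style lines into {cluster_name: [items]}.
    offset=2: size, name, items...   offset=1: size, items..."""
    clusters = {}
    for i, line in enumerate(lines, 1):
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue                      # blank or comment line (assumed)
        name = fields[1] if offset == 2 else f"C{i}"   # placeholder name
        clusters[name] = fields[offset:]
    return clusters

print(parse_part(["2 famA x y", "1 famB z"]))
# {'famA': ['x', 'y'], 'famB': ['z']}
print(parse_part(["2 x y"], offset=1))
# {'C1': ['x', 'y']}
```

The offset is simply the number of leading columns to skip before the items start, which is why the same reader covers both PART variants.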

License

partanalyzer Version alpha 1.0.

Licensed under the GNU GPL version 3 or later.

(see http://www.gnu.org/copyleft/gpl.html )

Examples:

For the latest options, check the help from the program:

    ./partanalyze -h

Check consistency of a given partition test.subfam.lst based on a matrix of interactions given by test-blast_pairwise_id. How large are the intra-cluster values compared to the inter-cluster ones?

    ./partanalyze -c test-blast_pairwise_id test.subfam.lst

or

    ./partanalyze --check-consistency-of-partition test-blast_pairwise_id test.subfam.lst

which also accepts an abbreviated form as

    ./partanalyze --ccop test-blast_pairwise_id test.subfam.lst

Calculate VI distance between two partitions and between each of them and their intersection

Definition of VI distance: Given two partitions P1 and P2, with cluster size distributions {n_k} and {n_k'} respectively, where k and k' are indexes to each of their corresponding clusters, and such that Sum_k n_k = Sum_k' n_k' = N, the VI distance is defined as

VI (P1,P2)  = Sum_k n_k/N * log( n_k/N) + Sum_k' n_k'/N * log( n_k'/N) - 2 * Sum_k Sum_k' n_kk'/N log(n_kk'/N)

where n_kk' is the number of items common to cluster k of P1 and cluster k' of P2. This definition satisfies the triangular inequality, i.e., for any three partitions P1, P2 and P, it is VI (P1, P) + VI (P,P2) >= VI (P1,P2)

    ./partanalyze --vi-distance test.subfam.lst test.subfam.lst2

or simply

    ./partanalyze -v test.subfam.lst test.subfam.lst2
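The VI definition above translates almost term by term into Python. This sketch is an illustration only, not the program's code: it represents each partition as a list of sets over the same N items and uses natural logarithms, since the base is not specified here; the function name is hypothetical.

```python
import math

def vi_distance(p1, p2):
    """VI(P1,P2) = Sum_k p_k log p_k + Sum_k' p_k' log p_k'
                   - 2 Sum_kk' p_kk' log p_kk',  with p = n/N."""
    n = sum(len(c) for c in p1)

    def plogp(count):
        return count / n * math.log(count / n) if count else 0.0

    return (sum(plogp(len(a)) for a in p1)
            + sum(plogp(len(b)) for b in p2)
            - 2 * sum(plogp(len(a & b)) for a in p1 for b in p2))

p1 = [{"a", "b"}, {"c", "d"}]
p2 = [{"a", "b", "c", "d"}]
print(vi_distance(p1, p1))   # 0.0 for identical partitions
print(vi_distance(p1, p2))   # log(2): one merge of two equal halves
```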

Print the intersection of 2 partitions test.subfam.lst and test.subfam.lst2

    ./partanalyze -i test.subfam.lst test.subfam.lst2

Performs the intersection of P1 and P2 as induced by the intersection operation on the underlying set (the one that contains all elements). This gives a new partition I such that each cluster of I is obtained as an intersection of one cluster of P1 and one of P2 (all against all).

Print the purity scores for partition1 (target) against partition2 (reference)

    ./partanalyze --purity-scores test.subfam.lst test.subfam.lst2

or simply

    ./partanalyze -p test.subfam.lst test.subfam.lst2

It outputs the purity strict and purity lax values.

Purity strict of P1 against P2 := the number of non-singleton clusters of P1 that are exactly identical to one of P2, divided by the number of non-singleton clusters of P2 (the reference).

Purity lax of P1 against P2 := the number of non-singleton clusters of P1 that are subsets of a cluster of P2, divided by the number of non-singleton clusters of P1 (the target).
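The two purity definitions can be sketched as follows. This follows the wording above literally and is not taken from the program's source; the function name is hypothetical, and returning 0.0 when a partition has no non-singleton clusters is an assumption.

```python
def purity_scores(target, reference):
    """Return (strict, lax) purity of target against reference."""
    tgt = [set(c) for c in target if len(c) > 1]      # non-singletons of P1
    ref = [set(c) for c in reference if len(c) > 1]   # non-singletons of P2
    exact = sum(1 for c in tgt if c in ref)
    subset = sum(1 for c in tgt if any(c <= r for r in ref))
    strict = exact / len(ref) if ref else 0.0
    lax = subset / len(tgt) if tgt else 0.0
    return strict, lax

target = [{"a", "b"}, {"c", "d"}, {"e"}]
reference = [{"a", "b"}, {"c", "d", "e"}]
print(purity_scores(target, reference))   # (0.5, 1.0)
```

In this example only `{a, b}` is reproduced exactly (strict = 1/2), but both non-singleton target clusters fit inside a reference cluster (lax = 2/2).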

For debugging: print the interaction matrix read by the program

    ./partanalyze --print-matrix test-blast_pairwise_id

partanalyzer Version alpha 1.0.

Copyright (c) Miguel A. Santos, May. 2008-2010 .Build Feb 19 2010