# Quick Start Guide (dgravesa/HPARHMAC GitHub Wiki)
This is a quick start guide to using the `HMACTool` binary.
For systems with the MPI and Boost libraries available, the tool can typically be built with the Makefile as is; just run `make`. Boost is used for output directory checks: if a specified output directory does not exist, the tool attempts to create it when Boost is available. MPI is needed for multicore or distributed processing. However, even when MPI is not available or only a single CPU core is used, the runtime can still be improved over the original implementation by specifying a minimum partition size for recursive partitioning.
To build HMACTool without Boost or MPI, comment out the corresponding line at the top of the Makefile (e.g. `# with_boost = true`). Note: setting the value to `false` does not work, because the Makefile only checks whether `with_boost` is defined; its value, `true` or `false`, does not matter.
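For example, the top of the Makefile might then look like the following. The `with_boost` line is quoted from the text above; a corresponding `with_mpi` variable is an assumption based on the same pattern:

```makefile
# Commented out: build without Boost output-directory checks
# with_boost = true
# Commented out: build without MPI (variable name assumed)
# with_mpi = true
```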
In this example, I am building without Boost or MPI. Your build should look similar:
```
ubuntu@hostname:~/HPARHMAC$ make
c++ -c src/HMAC.cpp -I./inc -O2 --std=c++0x -o lib/HMAC.opp
c++ -c src/HMACTool.cpp -I./inc -O2 --std=c++0x -o lib/HMACTool.opp
c++ lib/HMAC.opp lib/HMACTool.opp -o bin/HMACTool
```
Currently, the required data input format is limited to either a text file or CSV with one coordinate per line. Some examples are shown in the `tests` directory.
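For instance, a minimal three-dimensional input might look like the following (the values are hypothetical; the plain-text form appears to use whitespace-separated fields, while the CSV form uses commas):

```
112.6 109.5 109.3
152.3 185.8 200.3
195.6 106.7 55.0
```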
```
ubuntu@hostname:~/HPARHMAC$ bin/HMACTool tests/image.txt -s 0.3,0.6
5 modes on level 1 (sigma = 0.3):
112.63 109.55 109.256 (1423)
113.084 136.002 148.99 (1)
152.31 185.841 200.345 (484)
195.643 106.72 55.0408 (2031)
217.747 235.802 240.644 (157)
2 modes on level 2 (sigma = 0.6):
102.846 101.343 104.688 (1424)
195.913 112.932 66.0024 (2672)
total run time = 102.826 seconds
```
In this example, we use bandwidths of 0.3 and 0.6. By default, the data is shuffled and normalized prior to computation. Although the shuffling does not affect the result here (and takes trivial time compared to clustering), normalization is useful when multiple dimensions have different magnitudes. In the case of RGB image data, where each channel ranges from 0 to 255, normalization may not be desired. It may be disabled by adding the option `--normalize=no`.
```
ubuntu@hostname:~/HPARHMAC$ bin/HMACTool tests/image.txt -s 0.3,0.6 -r 256
4 modes on level 1 (sigma = 0.3):
112.644 109.558 109.259 (1440)
152.294 185.825 200.331 (461)
195.633 106.707 55.026 (2032)
217.804 235.856 240.688 (163)
2 modes on level 2 (sigma = 0.6):
102.846 101.343 104.688 (1440)
195.913 112.932 66.0023 (2656)
total run time = 4.2504 seconds
```
Notice how the runtime improves from 103 seconds to about 4 seconds. Also notice that the results are similar but not exact. Generally, the misclustered coordinates are points that fall between two clusters and therefore have weak cluster membership. In contrast, points that fall close to modes have strong membership to one cluster, and with sufficient sampling they generally conform to their original result. For spatially ordered data, such as the pixels of an image, shuffling improves sampling and yields a more accurate result. As mentioned above, shuffling is enabled by default and is typically a fast operation.
To run HMACTool in parallel, make sure to build with MPI enabled, and prefix the run command with a call to `mpirun`:

```
mpirun -n 4 bin/HMACTool tests/image.txt -s 0.3,0.6 -r 256
```
Running HMACTool with MPI does not affect the result. However, every process currently receives a full copy of the data set, which may cause memory pressure when the data set is large. Each process receives a subset of coordinates to cluster up to and including the top-level merge. Recursive partitioning is independent of MPI partitioning.
Output files are named by their level in the hierarchy, starting with `level-1.txt`. Each output file consists of a one-line header, followed by the modes (one per line), followed by the cluster membership indices (one per line). The header has the following format:

```
<num_modes> <num_coords> <num_fields>
```

The modes are listed in index order, where the top mode line is mode 0. The membership indices of the coordinates index into this list of modes. Coordinates that map to the same mode are considered to be in the same cluster.
Additionally, some useful information is printed during runtime by default: the number of modes at each level, and the mode values and cluster sizes for levels with 10 clusters or fewer. To silence this output, use the `--quiet` option. To print additional information at startup, use the `--verbose` option.
The remainder of this section gets into technical details in case you want to use the function calls directly.
In the function call itself, the clustering result is returned as an STL map with modes as keys and membership lists as values. A tree-based map is used instead of a hash table because the iterative algorithm only guarantees approximate mode values, not exact ones. On lookup, a lower- and upper-bound range is found in the tree, and that range is then searched linearly for a proximity match. This structure may be converted to modes and indices, matching the output file format, using the `Clusters2ModesAndIndices` function.