Bit compressed integer vectors - simongog/sdsl GitHub Wiki

Introduction

This tutorial shows how to construct and use bit compressed integer vectors. Bit compressed integer vectors, unlike the default stl vectors support storing integers with the minimum amount of bits required to represent the maximum number in the vector. The bit-width of integers can be adjusted at runtime for all vectors of class int_vector<> (=int_vector<0>) and int_vector<X> with X not in {1,8,16,32,64}. We call vectors of these types bitcompressed. For X in {8,16,32,64} int_vector<X> is specialized and corresponds to an array of integers of type uint8_t, uint16_t, uint32_t, and uint64_t. The read and write access of those non-bitcompressed is faster compared to the bitcompressed versions. The case X=1 corresponds to a bit vector (int_vector<1> is typedefed to bit_vector) and the specialization also accelerates the read and write operation compared to a bitcompressed version.

Code

#include <stdint.h>
#include <sdsl/int_vector.hpp>
#include <sdsl/util.hpp>
#include <algorithm> // for sort

using namespace sdsl;

int main(int argc,char** argv) {
    uint64_t size = 10000;

    // create a int_vector
    int_vector<> vec(size);

    //  fill with random numbers that use <=12 bits
    for(int_vector<32>::size_type i=0;i<vec.size();i++) {
        vec[i] = rand() % 4096;
    }
    std::cout << "vec uses " << util::get_size_in_mega_bytes(vec) << " MB." << std::endl;

    // bit compress
    util::bit_compress(vec);
    std::cout << "vec uses " << util::get_size_in_mega_bytes(vec) << " MB." << std::endl;

    // create a second int vector with specified int width and default value 5
    int_vector<17> vec_17(size,5);
    std::cout << "vec_17 uses " << util::get_size_in_mega_bytes(vec_17) << " MB." << std::endl;

    // create default vector and resize to 17 bits 
    int_vector<> vec_res(size);
    std::cout << "vec_res uses " << util::get_size_in_mega_bytes(vec_res) << " MB." << std::endl;
    // data is lost here
    vec_res.set_int_width(17);
    vec_res.resize(size);
    std::cout << "vec_res uses " << util::get_size_in_mega_bytes(vec_res) << " MB." << std::endl;

    // use the stl sort function to sort an int_vector
    std::sort(vec.begin(), vec.end());

    // print out content:
    int_vector<>::iterator it;
    std::cout << "vec contains:";
    for (it=vec.begin(); it!=vec.end(); ++it) std::cout << " " << *it;

    // clear up memory used by bv
    sdsl::util::clear(vec);
    sdsl::util::clear(vec_17);
    sdsl::util::clear(vec_res);
    
    return EXIT_SUCCESS;
}

Explanation

The int_vector class available in sdsl/int_vector.hpp. The size of each integer can be spezified as a template parameter. Note that a default value (5) for each element in the vector can also be specified.

int_vector<17> vec(size,5);

or manually using the set_int_width function

vec_res.set_int_width(17);
vec_res.resize(size);

or automatically using the util::bit_compress function:

util::bit_compress(vec);

The individual elements can easiliy be accessed through the [] operator:

vec[5] = 1234;
if(vec[5] == 132) count++;

The integer vector can be inserted into any stl container and used by stl algorithms. The std::sort function can be used to sort a bit compressed int_vector:

std::sort(vec.begin(), vec.end());

Iterators can be used to traverse the elements:

int_vector<>::const_iterator it;
std::cout << "vec contains:";
for (it=vec.begin(); it!=vec.end(); ++it) std::cout << " " << *it;

To output the space used by any sdsl data structure you can use the following two function in the util name space:

size_t size_in_bytes = util::get_size_in_mega_bytes(vec)
size_t size_in_mb = util::get_size_in_bytes(vec);

You can free the space used by any sdsl data structure using:

sdsl::util::clear(vec);

Compiling

You can compile the above example using the following command:

g++ -O3 -funroll-loops -I${HOME}/include -L${HOME}/lib intvec.cpp -o intvec -lsdsl

where -I${HOME}/include is the include directory you installed the sdsl header files and -L${HOME}/lib is the path of the sdsl library.

Performance

If we do not use a fixed integer width for int_vector than the access is usually not aligned and some shifting and masking has to be done. The following figure shows the runtime for random and sequential accesses on vectors that contain 50 millions items. runtime performance int_vector

⚠️ **GitHub.com Fallback** ⚠️