Additional Notes
Optimised indexing
Since the whole Telegraaf collection is over 12 GB, indexing is quite slow when the documents have to be inserted one by one. To speed this process up, we developed a program that takes a list of XML files to index. Since each file contains a massive number of documents, the program splits it up into bulks of 10,000 documents. It then goes through each document and extracts the necessary information from it. This collection of bulks is saved as .es files, in which every document is formatted as JSON and preceded by a line describing how to insert it. These bulk files are roughly 15 MB each and are therefore well suited for bulk insertion into Elasticsearch. The bulk format looks as follows; a sketch of the conversion step is given after the example.
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
After generating these bulk files from the original file, the program uses Curl to bulk insert the documents they contain into the database. Should the user have multiple servers connected to the same cluster, this bulk indexing operation can even be done in parallel. When a bulk file is completely indexed, it is removed to avoid cluttering the file system.
if (fileName.endsWith(".es")) {
if (PARALLEL) {
while (coresUsed == NUM_CORES) {
try {Thread.sleep(100);} catch (InterruptedException e) {}
}
IndexUnit unit = new IndexUnit(files[i]);
unit.start();
} else {
System.out.println("Indexing: " + fileName);
Curl.post(fileName);
files[i].delete();
System.out.println("Indexed: " + fileName);
}
}
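The IndexUnit class itself is not shown on this page; the sketch below is one way it could look, assuming it does the same work as the sequential branch and updates the shared coresUsed counter that the loop above waits on. The Indexer.coresUsed field is an assumption (a real implementation would preferably use an AtomicInteger).

import java.io.File;

public class IndexUnit extends Thread {
    private final File file;

    public IndexUnit(File file) {
        this.file = file;
    }

    @Override
    public void run() {
        Indexer.coresUsed++;                      // claim a core (assumed shared counter)
        System.out.println("Indexing: " + file.getName());
        Curl.post(file.getName());                // bulk insert via the Curl helper
        file.delete();                            // remove the bulk file afterwards
        System.out.println("Indexed: " + file.getName());
        Indexer.coresUsed--;                      // release the core
    }
}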
This bulk insertion speeds up the indexing operation considerably: we can index 2.2 GB in about 25 minutes (on a machine with 2 CPUs, a 40 GB SSD and 1 GB of memory).
In addition, the program runs several Curl commands before the indexing starts. These commands temporarily trade search performance in Elasticsearch for indexing speed. Firstly, we create the index and set the mapping for our documents up front, so that ES does not have to derive it from the data. Secondly, we set the store throttling type to "none", which lifts the default I/O limit on segment merging so that merges can keep up and do not stall our indexing. Thirdly, we raise the refresh interval of the shards in our cluster from 1 second to 30 seconds; this stops the inverted index from constantly refreshing with the new data flowing in and again frees resources for indexing. Turning refreshing off entirely during the indexing would increase performance further, but overflows memory quickly (we found 30 s to be a good compromise on our 1 GB machine). Lastly, we set the number of replicas of our index to 0, so that the indexed documents do not have to be replicated across other nodes for reliability, since that essentially copies our whole index.
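As a rough illustration of these pre-indexing requests, the sketch below sends them with plain HttpURLConnection rather than the project's Curl helper. The index name "telegraaf", the type "article" and the mapping fields are placeholders; the setting names are the Elasticsearch 1.x ones described above.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class IndexSettings {
    private static final String ES = "http://localhost:9200";

    static void putJson(String path, String json) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(ES + path).openConnection();
        con.setRequestMethod("PUT");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = con.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(path + " -> HTTP " + con.getResponseCode());
    }

    public static void main(String[] args) throws Exception {
        // 1) Create the index with an explicit mapping (placeholder fields).
        putJson("/telegraaf", "{ \"mappings\": { \"article\": { \"properties\": {"
                + " \"title\": { \"type\": \"string\" },"
                + " \"text\":  { \"type\": \"string\" } } } } }");
        // 2) Disable store (merge) throttling for the duration of the indexing.
        putJson("/_cluster/settings",
                "{ \"transient\": { \"indices.store.throttle.type\": \"none\" } }");
        // 3) Raise the refresh interval and drop replicas while indexing.
        putJson("/telegraaf/_settings",
                "{ \"index\": { \"refresh_interval\": \"30s\", \"number_of_replicas\": 0 } }");
        // After indexing, the same calls can restore the original values
        // (e.g. refresh_interval "1s", number_of_replicas 1, throttle type "merge").
    }
}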
With these optimisations we reach the point of being able to index 2.2 GB in about 19 minutes. Of course, we set all the options back to their original values after the indexing, so that they do not harm the search functionality.