Explanation of the algorithm to build threads - lmichel/vo-grimoire GitHub Wiki

How I build the threads

What data to use:

Mails stored in mbox files carry Message-Id, In-Reply-To and References header fields. What they mean:

  • Message-Id: a unique identifier for each message
  • In-Reply-To: the id of the mail that the current mail replies to
  • References: the ids of the mails in the thread that the current mail belongs to
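As an illustration, the three fields can be read with Python's standard `email` module. This is only a minimal sketch: the helper name `thread_headers`, the returned dictionary layout and the sample message are assumptions made for the example, not code from the project.

```python
import email
import re

# Message ids are written between angle brackets, e.g. <abc@example.org>
MSG_ID_RE = re.compile(r"<[^<>\s]+>")

def thread_headers(raw_message: str) -> dict:
    """Return the threading fields of one raw mail (hypothetical helper)."""
    msg = email.message_from_string(raw_message)
    return {
        "message_id": (msg.get("Message-Id") or "").strip(),
        # A missing header yields an empty list
        "in_reply_to": MSG_ID_RE.findall(msg.get("In-Reply-To") or ""),
        "references": MSG_ID_RE.findall(msg.get("References") or ""),
    }

# Made-up sample mail, used only to exercise the helper
RAW = """\
Message-Id: <b@example.org>
In-Reply-To: <a@example.org>
References: <root@example.org> <a@example.org>
Subject: Re: hello

body
"""
headers = thread_headers(RAW)
```

When reading a real mbox file, the standard `mailbox.mbox` class yields message objects that expose the same `get` method for headers.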

The output of the algorithm:

This algorithm builds a Python dictionary that maps each id to a thread number. For example: " " : "1"

The algorithm:

First, I needed to index all the mails of the mbox (taking care to ignore duplicates).

While indexing the mails, I built a list of all the ids that were indexed.
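The indexing step could look roughly like the sketch below. It assumes each mail is a dictionary with a `message_id` key and uses a plain dictionary as a stand-in for the Elasticsearch index; both choices, and the name `index_mails`, are hypothetical.

```python
def index_mails(mails):
    """Index mails once each and return (index, ids in first-seen order)."""
    indexed = {}  # stand-in for the Elasticsearch index: id -> document
    ids = []      # list of all ids that were indexed
    for mail in mails:
        mid = mail["message_id"]
        if mid in indexed:
            continue  # duplicate: this id is already indexed
        indexed[mid] = mail
        ids.append(mid)
    return indexed, ids

# Made-up input containing one duplicate
sample = [
    {"message_id": "<a@example.org>"},
    {"message_id": "<b@example.org>"},
    {"message_id": "<a@example.org>"},
]
indexed, ids = index_mails(sample)
```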

Then I initialized a Python dictionary to store the references of each thread, and an integer counter for the thread number.

Then I went through the list of ids and, for each id, checked whether it was already in the dictionary. If it was, I updated the "numThread" field of the Elasticsearch document for this id. If not, I built a string containing all the related ids:

  • The ids in "References"
  • The id in "In-Reply-To"
  • The ids of all the mails that contain the id of the current mail, followed down N levels (current mail + N levels of mails)

Once I had built this string, I put it in the dictionary with a new thread number and updated the document in Elasticsearch.

Finally, I added 1 to the integer counter.
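The loop described above can be sketched in memory as follows. This is an approximation under stated assumptions, not the project's code: mails are given as a dictionary mapping each id to its `references` and `in_reply_to` lists, a set of ids replaces the string, thread numbers are integers, and writing `numThread` into the mail dictionary stands in for the Elasticsearch update. The name `build_threads` is hypothetical.

```python
from collections import defaultdict, deque

def build_threads(mails):
    """mails: id -> {"references": [...], "in_reply_to": [...]} (assumed shape)."""
    # Reverse index: id -> ids of the mails that cite it
    cited_by = defaultdict(set)
    for mid, mail in mails.items():
        for parent in mail["references"] + mail["in_reply_to"]:
            cited_by[parent].add(mid)

    thread_of = {}  # id -> thread number (the output dictionary)
    num_thread = 1  # integer counter for the next thread number

    for mid, mail in mails.items():
        if mid not in thread_of:
            # Follow the mails that cite the current one, level by level
            seen = {mid}
            queue = deque([mid])
            while queue:
                for child in cited_by[queue.popleft()]:
                    if child not in seen:
                        seen.add(child)
                        queue.append(child)
            related = seen | set(mail["references"]) | set(mail["in_reply_to"])
            # Ids that already have a thread number keep it
            for rid in related:
                thread_of.setdefault(rid, num_thread)
            num_thread += 1
        # Stand-in for updating "numThread" in Elasticsearch
        mail["numThread"] = thread_of[mid]
    return thread_of

# Made-up data: one three-mail thread and one standalone mail
mails = {
    "<a>": {"references": [], "in_reply_to": []},
    "<b>": {"references": ["<a>"], "in_reply_to": ["<a>"]},
    "<c>": {"references": ["<a>", "<b>"], "in_reply_to": ["<b>"]},
    "<x>": {"references": [], "in_reply_to": []},
}
threads = build_threads(mails)
```

In this sketch the grouping works best when the first mail of a thread is visited before its replies, which is usually the case when an mbox is read in order.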
