Explanation of the algorithm to build threads - lmichel/vo-grimoire GitHub Wiki
Every mail stored in an mbox file has Message-Id, In-Reply-To and References fields. Here is what they mean:
- Message-Id: a unique identifier for each message
- In-Reply-To: the id of the mail that the current mail replies to
- References: the ids of all the mails in the thread of the current mail
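To make the three fields concrete, here is a minimal sketch that reads them from a single mail with Python's standard `email` module. The sample message and the helper name `threading_headers` are made up for illustration and are not taken from the project code.

```python
from email import message_from_string

# Hypothetical mail, written inline so the example runs without an mbox file.
RAW = """\
Message-Id: <b@example.org>
In-Reply-To: <a@example.org>
References: <root@example.org> <a@example.org>
Subject: Re: hello

body
"""

def threading_headers(msg):
    """Return (message_id, in_reply_to, references) for one parsed mail."""
    # Header lookup is case-insensitive; a missing header yields None.
    message_id = (msg.get("Message-Id") or "").strip()
    in_reply_to = (msg.get("In-Reply-To") or "").strip()
    # References is a whitespace-separated list of ids.
    references = (msg.get("References") or "").split()
    return message_id, in_reply_to, references

print(threading_headers(message_from_string(RAW)))
```

A mail that starts a thread usually has no In-Reply-To or References header; in that case the sketch returns an empty string and an empty list.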
This algorithm builds a Python dictionary that maps each id to a thread number. For example : " " : "1"
First, I needed to index all the mails of the mbox (taking care to ignore duplicates).
While indexing the mails, I built a list of all the ids that were indexed.
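The duplicate check while indexing could look like the sketch below. It assumes that a duplicate is a mail whose Message-Id was already seen, and it only collects the ids; the actual call that sends each mail to Elasticsearch is left out. The names `collect_ids` and `make_mail` are hypothetical.

```python
from email.message import Message

def collect_ids(messages):
    """Return the Message-Ids in first-seen order, skipping duplicates."""
    seen = set()
    ids = []
    for msg in messages:
        mid = (msg.get("Message-Id") or "").strip()
        if not mid or mid in seen:
            continue  # no id, or already indexed: ignore this mail
        seen.add(mid)
        ids.append(mid)
        # The real code would index the mail in Elasticsearch at this point.
    return ids

def make_mail(mid):
    """Build a tiny in-memory mail carrying only a Message-Id (for the demo)."""
    msg = Message()
    msg["Message-Id"] = mid
    return msg

sample = [make_mail("<a@x>"), make_mail("<b@x>"), make_mail("<a@x>")]
print(collect_ids(sample))  # ['<a@x>', '<b@x>']
```

With a real archive, the same function should accept `mailbox.mbox(path)` from the standard library, since iterating over an mbox yields its messages.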
Then I initialized a Python dictionary to store the references of each thread, and an integer counter for the thread number.
Then I went through the list of ids. For each id, I checked whether it was already in the dictionary. If it was, I updated the Elasticsearch document for this id to set its "numThread". If it was not, I built a string containing all the related ids:
- The ids in "References"
- The id in "In-Reply-To"
- The ids of all the mails that contain the id of the current mail (the current mail plus N levels of replies)
Once this string was built, I put it in the dictionary with a new thread number and updated the document in Elasticsearch.
Finally, I added 1 to the counter.
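The steps above can be sketched in plain Python as follows. This is not the project's code: it works on an in-memory dictionary instead of Elasticsearch, uses a set instead of a string of ids, and stores thread numbers as integers. The function name `assign_threads` and the input layout are assumptions made for the example.

```python
def assign_threads(mails):
    """Map every id to a thread number.

    mails: dict of id -> {"in_reply_to": str or None, "references": list of str},
    in the order the mails were indexed.
    """
    # Reverse lookup: for each id, the mails that mention it in
    # In-Reply-To or References.
    cited_by = {}
    for mid, mail in mails.items():
        parents = set(mail["references"])
        if mail["in_reply_to"]:
            parents.add(mail["in_reply_to"])
        for parent in parents:
            cited_by.setdefault(parent, []).append(mid)

    thread_of = {}
    next_thread = 1
    for mid, mail in mails.items():
        if mid in thread_of:
            # Already placed; the real code updates "numThread" in Elasticsearch here.
            continue
        # Collect the ids related to this mail: itself, References, In-Reply-To.
        related = {mid}
        related.update(mail["references"])
        if mail["in_reply_to"]:
            related.add(mail["in_reply_to"])
        # Walk down the replies, level by level (current + N levels).
        stack = [mid]
        while stack:
            current = stack.pop()
            for child in cited_by.get(current, []):
                if child not in related:
                    related.add(child)
                    stack.append(child)
        # Give the whole group the new thread number, without overwriting
        # ids that were already placed.
        for rid in related:
            thread_of.setdefault(rid, next_thread)
        next_thread += 1
    return thread_of

mails = {
    "<a>": {"in_reply_to": None, "references": []},
    "<b>": {"in_reply_to": "<a>", "references": ["<a>"]},
    "<c>": {"in_reply_to": "<b>", "references": ["<a>", "<b>"]},
    "<d>": {"in_reply_to": None, "references": []},
}
print(assign_threads(mails))
```

In this example `<a>`, `<b>` and `<c>` end up in thread 1 and `<d>` in thread 2. Because replies are only followed downward from the current mail, the sketch works best when the first mail of a thread is visited before its replies, which is normally the case in an mbox archive.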