Clustering - abrt/faf GitHub Wiki

Overview

Because of the way most programs work, slightly different backtraces (function call paths) can lead to the same end result: a crash caused by the same bug. Naturally, it makes sense to identify these and group them together. The purpose of the create-problems action is exactly that: to create these groups, called Problems, from the backtraces associated with Reports.

Data model

uReports sent by the ABRT client to FAF contain backtraces from multiple threads; the only one we are concerned with here is the crash thread. A Report object should contain exactly one backtrace (ReportBacktrace) associated with a crash thread (the ReportBtThread where crashthread == True). Further, each ReportBacktrace has multiple ReportBtFrames, which constitute the backtrace itself. The frames are ordered by ReportBtFrame.order.

The Report objects have a foreign key to a Problem object. This foreign key should point to the same Problem if and only if the underlying cause (the bug) of the Reports is the same. Unfortunately, this characteristic cannot be objectively verified, so we need to take a best-effort approach (dendrogram clusters with a backtrace edit distance metric).
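To make the relationships concrete, here is a simplified, hypothetical sketch of the data model as plain dataclasses. The real objects are SQLAlchemy models in FAF's storage layer and carry many more fields; the names below mirror the ones mentioned above, but the shapes are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ReportBtFrame:
    order: int            # position of the frame within the thread
    function_name: str

@dataclass
class ReportBtThread:
    crashthread: bool     # True for the one thread that crashed
    frames: List[ReportBtFrame] = field(default_factory=list)

@dataclass
class ReportBacktrace:
    threads: List[ReportBtThread] = field(default_factory=list)

    def crash_thread(self) -> Optional[ReportBtThread]:
        # Exactly one thread should have crashthread == True.
        return next((t for t in self.threads if t.crashthread), None)

@dataclass
class Report:
    backtrace: ReportBacktrace
    problem_id: Optional[int] = None   # FK to a Problem, set by clustering
```

The `crash_thread` helper is purely illustrative; it just captures the invariant that one thread per backtrace is marked as the crash thread.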

Implementation

The core of the clustering is implemented in the satyr library. The create-problems action goes through the following steps:

  1. Load backtraces of the specified problem type (core, kerneloops etc.) from the database.
  2. Convert the backtraces to satyr backtraces.
  3. Quickly create buckets of at most 2000 backtraces.
  4. Call satyr clustering routines on each bucket separately.
  5. Retrieve the resulting clusters.
  6. Save clusters to the database, keeping the changes to a minimum.

Bucketing

See the pyfaf.actions.create_problems.CreateProblems._create_clusters method for details. Note that the terminology there is a little confusing: buckets are called clusters, and backtraces are called threads.
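The essential point of bucketing is the size cap. The real _create_clusters is smarter about which backtraces end up together, but the core idea can be sketched as a simple chunking step (the function name and signature below are illustrative, not FAF's actual code):

```python
def make_buckets(backtraces, max_size=2000):
    """Split backtraces into buckets of at most max_size, so that the
    O(n^2) pairwise distance computation stays tractable per bucket."""
    return [backtraces[i:i + max_size]
            for i in range(0, len(backtraces), max_size)]
```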

Clustering

The clustering itself happens separately on each bucket of at most 2000 backtraces.

First, satyr is used to compute the distance between every pair of backtraces in the bucket. The metric used is the Levenshtein edit distance. This process has O(n^2) complexity, which is why the bucketing is necessary.
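Satyr computes this distance in C; the following pure-Python version is only to illustrate what the metric measures when a backtrace is viewed as a sequence of function names. (Satyr's real comparison also accounts for normalization and frame details; this sketch does not.)

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn frame sequence a into frame sequence b."""
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, fa in enumerate(a, 1):
        curr = [i]
        for j, fb in enumerate(b, 1):
            cost = 0 if fa == fb else 1
            curr.append(min(prev[j] + 1,          # delete fa
                            curr[j - 1] + 1,      # insert fb
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]
```

For example, two backtraces that differ in a single frame have distance 1, so crashes whose call paths diverge only slightly end up close together.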

Then satyr is used to build a dendrogram from the distances. The resulting clusters are formed by cutting the dendrogram at the level 0.3.
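The cut can be sketched without building an explicit dendrogram: under single linkage (an assumption here; satyr's actual linkage criterion may differ), cutting at a level is equivalent to taking connected components of the graph whose edges are pairs closer than the cut level. The function below is illustrative only:

```python
def cut_clusters(items, dist, cut_level=0.3):
    """Group items whose pairwise distance is below cut_level,
    using union-find to track connected components."""
    parent = list(range(len(items)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if dist(items[i], items[j]) < cut_level:
                parent[find(i)] = find(j)  # merge the two components

    clusters = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), []).append(item)
    return list(clusters.values())
```

The 0.3 level implies the distances fed to the dendrogram are on a normalized scale rather than raw edit-distance counts.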

Saving to the DB

The result of clustering is a list of lists of backtraces, e.g. [bt1, bt2], [bt3], [bt4, bt5], forming three problems. Imagine this is saved in the DB. Now a new backtrace bt6 appears that is similar to bt4 and bt5. The result of clustering should then be [bt1, bt2], [bt3], [bt4, bt5, bt6]. We need to save this result to the DB while changing the data as little as possible (and the changes should really be very small; I'd say 95% of problems should stay exactly the same). We need to keep the first two Problems intact and add the new backtrace to the third Problem. (Note that there is no direct connection between the clusters resulting from the clustering and the Problems in the DB, i.e. the lists do not carry the DB Problem's id.)

The _iter_problems method takes care of this task. For it to work efficiently, we need to keep the state of the Problems before the clustering in the reuse_problems dict. The key of the dict is a tuple of sorted backtrace ids; the value is the DB Problem id. This way, perfect matches (the first two problems in our example) can be taken care of in constant time by looking up the key. If the key is not found, the best match is used, but to find it, all Problems from the previous DB state must be iterated over, which is slower. Refer to the _iter_problems method code and comments for details.
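The lookup idea can be sketched as follows. The function and variable names are illustrative, not FAF's actual code, and the fallback here uses a simple largest-overlap heuristic as a stand-in for whatever "best match" logic _iter_problems actually applies:

```python
def build_reuse_problems(db_problems):
    """db_problems: {problem_id: iterable of backtrace ids}.
    Returns the reuse_problems dict keyed by sorted backtrace-id tuples."""
    return {tuple(sorted(bt_ids)): pid for pid, bt_ids in db_problems.items()}

def match_cluster(cluster_bt_ids, reuse_problems):
    """Find the DB Problem to reuse for a freshly computed cluster."""
    # Perfect match: the cluster is exactly an existing Problem -> O(1).
    key = tuple(sorted(cluster_bt_ids))
    if key in reuse_problems:
        return reuse_problems[key]
    # Otherwise scan all previous Problems for the largest overlap (slow path).
    best_pid, best_overlap = None, 0
    for bt_ids, pid in reuse_problems.items():
        overlap = len(set(bt_ids) & set(cluster_bt_ids))
        if overlap > best_overlap:
            best_pid, best_overlap = pid, overlap
    return best_pid  # None means a brand new Problem must be created
```

In the example above, [bt1, bt2] and [bt3] hit the constant-time path, while [bt4, bt5, bt6] falls through to the scan and matches the Problem that previously held bt4 and bt5.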