Heuristic: Fake Promoter - ParticipaPY/politic-bots GitHub Wiki

Heuristic that identifies accounts that promote other accounts exhibiting bot-like behaviour.

Preparation

This heuristic relies on a db structure that stores, for each user, its bot_detector_pbb(1) and, for each user it has interacted with, that user's screen name and bot_detector_pbb. Before doing any computation, it fetches a sample of one(2) user document from the database. If that document already has the attribute 'bot_detector_pbb', it is assumed that it and all the other user documents have the described structure. If it doesn't, its bot_detector_pbb and those of its interacted users are calculated and stored. Since preparing a user may require an extra computation for every user that she started an interaction with, an upper limit is set on the number of interacted users that are taken into account in the update process (NUM_INTERACTED_USERS_UPDATE(3)). The number of documents to be updated can also be set (NUM_USERS_UPDATE), in case it is desired to update the db in multiple passes.
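The preparation step can be sketched as follows. This is a minimal illustration, not the repository's code: plain dictionaries stand in for database documents, the 'interacted_users' and 'screen_name' keys and the compute_pbb callable are assumptions made for the example, and only the 'bot_detector_pbb' attribute and the two limits come from the description above.

```python
def needs_preparation(sample_doc):
    """Return True when the sampled user document lacks 'bot_detector_pbb'."""
    return "bot_detector_pbb" not in sample_doc


def prepare_users(user_docs, compute_pbb, num_users_update,
                  num_interacted_users_update):
    """Fill in bot_detector_pbb for at most num_users_update documents.

    user_docs: list of dicts standing in for db documents (assumption).
    compute_pbb: callable mapping a screen name to a probability; a
        stand-in for the real bot probability computation (assumption).
    Returns the number of documents updated in this pass.
    """
    updated = 0
    for doc in user_docs:
        if updated >= num_users_update:
            break
        if "bot_detector_pbb" in doc:
            continue  # already prepared in an earlier pass
        doc["bot_detector_pbb"] = compute_pbb(doc["screen_name"])
        # Only the first num_interacted_users_update interacted users are scored.
        for other in doc["interacted_users"][:num_interacted_users_update]:
            other["bot_detector_pbb"] = compute_pbb(other["screen_name"])
        updated += 1
    return updated
```

Calling prepare_users repeatedly with a small num_users_update corresponds to updating the db in multiple passes, since documents that already carry the attribute are skipped.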

Implementation

The heuristic evaluates the interactions (RTs, Quotes, Mentions, etc.) generated by the user being analyzed and performs some computations on them. For the same reason as in the preparation, and since not all interacted users may be stored in each user's document, an upper limit is set on the number of interacted users that are taken into account in the computations (NUM_INTERACTED_USERS_HEUR). Note that this is not the same constant as NUM_INTERACTED_USERS_UPDATE, and that it cannot be greater than the maximum number of interacted users stored in the db for each user. The description of the heuristic below ignores this detail for simplicity, but it must be taken into account when executing the script.

To illustrate how the heuristic works, take the user FAKE.PROMOTER as an example.

Assume that she's the user being analyzed. First, a synthesis of her interactions (the interactions started by her and the users she interacted with) is fetched, using the 'NetworkAnalyzer' class from the 'network_analysis' module(4). The results are better if this synthesis is ordered decreasingly by the number of interactions started with each user, so it is iterated in that order. The synthesis of the interacted users, with the bot_detector_pbb of each one of them, is fetched as well. The exact next steps depend on the specific method of the heuristic and are described below. For now, suffice it to say that the heuristic calculates a score and returns 1 if that score is greater than a defined threshold(5), and 0 otherwise.
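The ordering and the final thresholding can be sketched like this. It is only an illustration: the function names are hypothetical, and a plain dictionary of interaction counts stands in for the synthesis returned by 'NetworkAnalyzer', whose actual return format is not described here.

```python
def top_interacted(interaction_counts, limit):
    """Order interacted users by number of interactions, largest first.

    interaction_counts: dict mapping screen name -> number of interactions
        started with that user (assumed shape, for illustration only).
    limit: cap on the users kept, in the role of NUM_INTERACTED_USERS_HEUR.
    """
    ordered = sorted(interaction_counts.items(),
                     key=lambda item: item[1], reverse=True)
    return ordered[:limit]


def to_flag(score, threshold):
    """Return 1 when the score is strictly greater than the threshold, else 0."""
    return 1 if score > threshold else 0
```

For example, top_interacted({"a": 2, "b": 5, "c": 1}, 2) keeps "b" and "a", in that order.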

There are two approaches and four different methods for computing the value of the heuristic (the score). All of them are implemented, and one can switch between them by setting the constant FAKE_PROMOTER_HEUR to the number corresponding to the desired method.

Approach Number 1

This approach computes the number of interactions of an account with users who have a bot_detector_pbb greater than some minimum value (BOT_DET_PBB_THRS), and the fraction of the total number of interactions that this number represents.

Method 0

1. Select the interacted users that have a bot_detector_pbb greater than BOT_DET_PBB_THRS.
2. Compute the total number of interactions started by the analyzed user (in the example, FAKE.PROMOTER) with the selected users.
3. Check if the number of interactions computed in the previous step is greater than SCORE_TOP_INTRCTNS_THRESHOLD. Return 1 if the condition applies, 0 otherwise.
4. Divide the number of interactions computed in step 2 by the total number of interactions started by the user. This is the fraction of the total that those interactions represent.
5. Check if the fraction computed in the previous step is greater than SCORE_TOP_INTRCTNS_PRCNTG_THRESHOLD. Return 1 if the condition applies, 0 otherwise.
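A minimal sketch of Method 0 follows. The steps list the absolute and the relative check separately without saying how they are combined, so this sketch assumes that passing either check is enough to return 1; the actual script may combine them differently. The dictionary keys and parameter names are hypothetical.

```python
def method_0(interacted, total_interactions, bot_det_pbb_thrs,
             count_threshold, fraction_threshold):
    """Flag a user by her interactions with likely bots.

    interacted: list of dicts with 'interactions' and 'bot_detector_pbb'
        keys (assumed shape, for illustration only).
    total_interactions: all interactions started by the analyzed user.
    """
    # Steps 1 and 2: interactions with users above the probability threshold.
    suspicious = sum(u["interactions"] for u in interacted
                     if u["bot_detector_pbb"] > bot_det_pbb_thrs)
    # Step 3: absolute check (role of SCORE_TOP_INTRCTNS_THRESHOLD).
    if suspicious > count_threshold:
        return 1
    # Steps 4 and 5: relative check (role of SCORE_TOP_INTRCTNS_PRCNTG_THRESHOLD).
    # Assumption: reached only when the absolute check did not fire.
    if total_interactions == 0:
        return 0
    fraction = suspicious / total_interactions
    return 1 if fraction > fraction_threshold else 0
```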

Approach Number 2

This approach differs from Approach Number 1 in that it computes the average bot_detector_pbb of the users that the analyzed user interacted with (the interacted users), rather than the absolute number of interactions. The methods in this approach differ from each other in the weight they use for computing the average.

Method 1

1. Compute the average bot_detector_pbb of the interacted users.
2. If the average is greater than AVG_PBB_THRESHOLD, return 1, otherwise 0.
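As a sketch, Method 1 reduces to an unweighted mean followed by a comparison. The function name is hypothetical, and returning 0 for a user with no interacted users is an assumption of this example, not something stated above.

```python
def method_1(pbbs, avg_pbb_threshold):
    """Return 1 when the plain average of the probabilities exceeds the threshold.

    pbbs: list with the bot_detector_pbb of each interacted user.
    avg_pbb_threshold: plays the role of AVG_PBB_THRESHOLD.
    """
    if not pbbs:
        return 0  # assumption: nothing to average means no flag
    average = sum(pbbs) / len(pbbs)
    return 1 if average > avg_pbb_threshold else 0
```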

Method 2

1. Compute the average bot_detector_pbb of the interacted users, but using their total-relative number of interactions(6) as a weight. This aims at giving a more representative average bot_detector_pbb, because the greater the number of interactions with a user, the more representative that user's bot_detector_pbb is.
2. For each interacted user, compute the product of the weight and the bot_detector_pbb, and accumulate it into the weighted average.
3. If the weighted average computed in the previous step is greater than AVG_ALL_INTRCTNS_WGHTD_PBB_THRESHOLD, return 1, otherwise 0.
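Method 2 can be sketched as below, with each weight being the interactions started with a user divided by all interactions started by the analyzed user. The dictionary keys and function names are hypothetical, and the handling of a user with no interactions is an assumption of this example.

```python
def total_weighted_pbb(interacted, total_interactions):
    """Weighted average of bot_detector_pbb using total-relative weights.

    interacted: list of dicts with 'interactions' and 'bot_detector_pbb'
        keys (assumed shape, for illustration only).
    total_interactions: all interactions started by the analyzed user.
    """
    if total_interactions == 0:
        return 0.0  # assumption: no interactions means a zero score
    weighted_avg = 0.0
    for user in interacted:
        weight = user["interactions"] / total_interactions
        weighted_avg += weight * user["bot_detector_pbb"]
    return weighted_avg


def method_2(interacted, total_interactions, threshold):
    """threshold plays the role of AVG_ALL_INTRCTNS_WGHTD_PBB_THRESHOLD."""
    score = total_weighted_pbb(interacted, total_interactions)
    return 1 if score > threshold else 0
```

If only a capped subset of the interacted users is passed in, the weights add up to less than 1, so the score is a lower bound on the full weighted average.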

Method 3

1. Compute the average bot_detector_pbb of the interacted users, but using their top-relative number of interactions(7) as a weight. Like Method 2, this method aims at giving a more representative average bot_detector_pbb by weighting the individual values. The difference is the denominator used in the computation of the weight.
2. For each interacted user, compute the product of the weight and the bot_detector_pbb, and accumulate it into the weighted average.
3. If the weighted average computed in the previous step is greater than AVG_TOP_INTRCTNS_WGHTD_PBB_THRESHOLD, return 1, otherwise 0.
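A sketch of Method 3 is given below. It differs from Method 2 only in the denominator, which here is the number of interactions started with the top users of the ordered list. Averaging over those same top users is an assumption of this example, as are the dictionary keys and the function names.

```python
def top_weighted_pbb(interacted, num_interacted_users):
    """Weighted average of bot_detector_pbb using top-relative weights.

    interacted: list of dicts with 'interactions' and 'bot_detector_pbb'
        keys (assumed shape, for illustration only).
    num_interacted_users: how many of the most-interacted users define
        the denominator and, in this sketch, enter the average.
    """
    ordered = sorted(interacted, key=lambda u: u["interactions"], reverse=True)
    top = ordered[:num_interacted_users]
    top_total = sum(u["interactions"] for u in top)
    if top_total == 0:
        return 0.0  # assumption: no interactions means a zero score
    return sum(u["interactions"] / top_total * u["bot_detector_pbb"]
               for u in top)


def method_3(interacted, num_interacted_users, threshold):
    """threshold plays the role of AVG_TOP_INTRCTNS_WGHTD_PBB_THRESHOLD."""
    score = top_weighted_pbb(interacted, num_interacted_users)
    return 1 if score > threshold else 0
```

Because the weights of the top users add up to 1 in this variant, the score is a proper weighted mean over them, whereas users outside the top list do not contribute at all.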

Footnotes:

(1) The result of executing the 'compute_bot_probability' method on the desired user, but without considering the present heuristic. This is controlled with the 'promotion_heur_flag' flag.
(2) Provisionally. Perhaps a larger sample could be fetched.
(3) This constant and all the other constants used in the heuristics can be set in the 'heuristics_config.json' file.
(4) Hence, the interaction types considered in the heuristic (RTs, Quotes, Mentions, etc.) are the same as those considered in the 'network_analysis' module.
(5) The determination of this value (and of the other thresholds) was done somewhat arbitrarily. It indicates how many interactions started with users with a bot_detector_pbb greater than BOT_DET_PBB_THRS are considered normal.
(6) The total-relative number of interactions started by user A with user B is computed as follows:
    1. Compute the number of interactions started by user A with user B.
    2. Compute the total number of interactions started by user A with any user.
    3. Compute the fraction of the total computed in step 2 that the number computed in step 1 represents.
(7) The top-relative number of interactions started by user A with user B is computed as follows:
    1. Compute the number of interactions started by user A with user B.
    2. Make a list of all the users that A has interacted with, together with the number of interactions that A started with each of them. Order that list decreasingly by that number of interactions.
    3. Compute the number of interactions started by user A with the first NUM_INTERACTED_USERS users of the list made in the previous step.
    4. Compute the fraction of the total computed in step 3 that the number computed in step 1 represents.