Large event networks (tutorial) - juergenlerner/eventnet GitHub Wiki
In this tutorial we illustrate how large networks of relational events, comprising millions of nodes and hundreds of millions of relational events, can be analyzed with eventnet. Analyzing such large networks requires two sampling techniques: (1) case-control sampling, where non-events (controls) are sampled from the risk set (already described in the basic tutorial), and (2) sampling from the observed events, that is, from the sequence of input events. This tutorial assumes that you are familiar with the first-steps tutorial and with the basic tutorial.
Details on this study are in the paper
Lerner and Lomi (2020). Reliability of relational event model estimates under sampling: how to fit a relational event model to 360 million dyadic events. Network Science, 8(1):97-135. (DOI: https://doi.org/10.1017/nws.2019.57)
In this study we analyze network effects that drive the attention of contributing Wikipedia users to Wikipedia articles. In this respect, the study is similar to the one described in the basic tutorial. However, on this page we describe the analysis of all edit events in the English-language edition of Wikipedia, giving rise to an event network comprising more than 6 million Wikipedia users and more than 5 million Wikipedia articles, connected by more than 360 million relational events. We further describe how the reliability of estimates under sampling can be assessed experimentally by repeated sampling. The results also yield guidelines for choosing the sample size.
All edit events are extracted from the public database dumps provided by the Wikimedia foundation. We extracted all events in which any registered user uploaded a new revision of any article in the English-language edition of Wikipedia in the time frame from January 15th, 2001 to January 1st, 2018. Users are identified by their user names, articles by their titles, and edit times are given in milliseconds.
This preprocessed data is available at Zenodo as Wikipedia Edit Event Data 2018 (WikiEvent.2018) (DOI: 10.5281/zenodo.1626323). If you are interested in replicating or modifying the analysis described on this page, download the ZIP file WikiEvent.2018.csv.zip provided under this link and unzip it on your computer.
A similar, newer data file containing Wikipedia edit events up to January 2021 can be found here: WikiEvent.2021 (DOI: 10.5281/zenodo.4522066).
An eventnet configuration that generates just one single output table (for fixed sample parameters) is simple.edit.events.config.xml. The CSV filename in this configuration is for the 2021 data; note that you may also have to adapt the input and/or output directory names. The computation of explanatory variables can be started with the command java -Xmx120g -jar eventnet-x.y.jar simple.edit.events.config.xml (where x.y is to be replaced by the version number). This will produce a CSV file containing REM statistics for sampled events and associated controls. Model parameters of a REM can be estimated as illustrated in this R code file: simple_edit_events_model.R
The eventnet configuration for replicating the reliability analysis proposed in Lerner and Lomi 2020 (see above) is provided in the file config.test.sampling.reliability.xml. Note that this configuration does not analyze the event network just once. Rather, it repeats the analysis 380 times, varying the sample parameters, to experimentally assess the variability of parameter estimates caused by sampling. Thus, the configuration contains 380 observations (see the basic tutorial) that are almost identical but vary in the sample parameters as described below.
The computation of explanatory variables can be started with the command java -Xmx52g -jar eventnet-x.y.jar config.test.sampling.reliability.xml (where x.y is to be replaced by the version number). This will work only if the eventnet JAR file and the event input file WikiEvent.2018.csv are in the same directory from which you execute this command. (If not, update the input directory or simply move these files to the current directory.) Execution will create a directory output in the current directory (you might change this) and 380 CSV files (one for each observation) containing the computed statistics for all sampled events and controls. Each of these output files can be analyzed, for instance, with the coxph function of the R package survival, as described in the basic tutorial.
Updating directories or filenames might also be done by editing the configuration XML file directly, without starting the eventnet GUI, as indicated in the XML code below.
...
<input.files accept=".csv" has.header="true" delimiter="SEMICOLON" quote.char="DOUBLEQUOTE">
<input.directory name="."/>
<file name="WikiEvent.2018.csv"/>
</input.files>
<output.directory name="./output"/>
...
Many of the settings provided in the configuration file are similar to those described in the basic tutorial. Differences include that in this study we have no event types (all events are edit events) and that we apply a decay with a halflife of 30 days to all attributes. The network effects are given by the statistics repetition, popularity, activity, four-cycle, and the interaction effect of popularity with activity (assortativity). The last effect does not have to be computed explicitly by eventnet; it can be specified as an interaction term in the model formula in R.
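The 30-day halflife means that the contribution of a past event to an attribute is discounted exponentially over time. As a minimal sketch (not eventnet code; the function name is hypothetical), the weight of an event that happened a given number of days ago is:

```python
def decay_weight(days_elapsed, halflife=30.0):
    """Exponential decay: an event's contribution is halved every `halflife` days."""
    return 0.5 ** (days_elapsed / halflife)

# An edit made 30 days ago counts half as much as one made just now,
# and an edit made 60 days ago a quarter as much.
```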
A crucial difference is in the sampling strategy, defined in all observations. In this study, where we analyze hundreds of millions of events, we also sample from the observed events, in addition to case-control sampling. This is described in more detail below.
Explanatory variables (the REM statistics repetition, popularity, activity, and four-cycle) can be computed with eventnet using the configuration file simple.edit.events.config.xml, as described above. Model parameters of a REM can then be estimated with the following R code (also available in the file simple_edit_events_model.R):
# install.packages("survival") # uncomment if needed
# attach the library
library(survival)
# set the working directory to the output directory of eventnet; adjust the path if necessary
setwd(".")
# read explanatory variables from the eventnet output file
edit.events <- read.csv("WikiEvent.2021_EDIT.csv", sep = ";")
sum(edit.events$IS_OBSERVED) # number of events
summary(edit.events)
# check summary statistics separately for events and non-events
# especially note the big difference in the repetition statistic
summary(edit.events[edit.events$IS_OBSERVED == 1,])
summary(edit.events[edit.events$IS_OBSERVED == 0,])
# specify and estimate a Cox proportional hazard model
edit.model <- coxph(Surv(time = rep(1,nrow(edit.events)), event = edit.events$IS_OBSERVED) ~ repetition
+ article_popularity
* user_activity
+ four_cycle
+ strata(EVENT)
, data = edit.events)
# print model parameters
print(summary(edit.model))
The relevant part of the output is as follows:
n= 9033488, number of events= 4516744
coef exp(coef) se(coef) z Pr(>|z|)
repetition 4.046e+01 3.711e+17 2.126e+00 19.03 <2e-16 ***
article_popularity 1.350e+00 3.856e+00 6.469e-03 208.62 <2e-16 ***
user_activity 1.666e+00 5.292e+00 6.464e-03 257.75 <2e-16 ***
four_cycle 9.130e-01 2.492e+00 2.439e-02 37.43 <2e-16 ***
article_popularity:user_activity -1.858e-01 8.304e-01 5.518e-03 -33.67 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the parameter table we find a very strong repetition effect and positive effects of article popularity, user activity, and four-cycle closure. The interaction of popularity with activity is negative, pointing to disassortative mixing: more active users are less drawn to popular articles than less active users are. Qualitatively, these effects coincide with those reported in Lerner and Lomi 2020 (see above). Note that the parameters above have been estimated from more than 9 million sampled instances (half of which are events), which is far more than any of the sample sizes considered in Lerner and Lomi 2020. The objective of that paper was to determine which sample sizes are just enough to obtain reliable estimates for the different REM effects. This is discussed in the following.
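As an interpretation aid: the exp(coef) column reports hazard ratios, i.e., the multiplicative change in the event rate per unit increase of a statistic. A quick sanity check on the rounded coefficients printed above (plain Python, not part of the analysis pipeline):

```python
import math

# exp(coef) turns an additive effect on the log hazard into a rate multiplier.
popularity_hr = math.exp(1.350)    # rounded article_popularity coefficient
interaction_hr = math.exp(-0.1858) # rounded popularity:activity coefficient

# popularity_hr is about 3.86: a unit increase in article_popularity roughly
# quadruples the edit rate; interaction_hr is below 1, reflecting the
# negative (disassortative) interaction effect.
```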
The configuration file config.test.sampling.reliability.xml contains 380 different observations (see an example below). This allows us to assess variation in estimated parameters as a function of the chosen sample parameters.
...
<observation name="EDIT.FIX.00" type="DEFAULT_DYADIC_OBSERVATION" description="edit events"
apply.case.control.sampling="true" number.of.non.events="5"
apply.sampling.from.observed.events="true" prob.to.sample.observed.events="1.0E-4"
source.node.set="users" target.node.set="articles"/>
...
Sampling from the observed events means that statistics are not computed for all input events but only for a random sample of them. In the observation above, the probability of including any given input event in the sample is p=0.0001; that is, on average we include one out of 10,000 input events in each of the resulting CSV files in the output directory. Case-control sampling means that for each sampled observed event we include a fixed number of randomly selected controls, that is, dyads from the risk set that do not experience the event at that time. In the example above we select m=5 controls per event. The risk set is the full Cartesian product of all user-article pairs. (The size of this risk set is larger than 30 trillion at the end of the observation period.)
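The two sampling steps can be sketched as follows. This is a conceptual illustration in Python, not eventnet code; the function and the toy dyad representation are hypothetical, and for simplicity a control is only required to differ from the current event's dyad:

```python
import random

def sample_with_controls(events, users, articles, p, m, rng):
    """For each input event (user, article, time): keep it with probability p,
    and pair each kept event with m randomly drawn non-event dyads (controls)."""
    rows = []  # (user, article, time, IS_OBSERVED)
    for (u, a, t) in events:
        if rng.random() >= p:
            continue                   # event not sampled
        rows.append((u, a, t, 1))      # sampled observed event
        drawn = 0
        while drawn < m:               # case-control sampling from the risk set
            cu, ca = rng.choice(users), rng.choice(articles)
            if (cu, ca) != (u, a):     # control must differ from the event dyad
                rows.append((cu, ca, t, 0))
                drawn += 1
    return rows
```

Each returned row would then get its explanatory statistics computed at time t, with IS_OBSERVED serving as the response in the stratified Cox estimation.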
The given configuration file defines observations for four different types of experiments.
- (fixed p and m) We define 100 identical observations with the sample parameters p=0.0001 and m=5. This allows us to assess the variability of model parameter estimates caused by sampling for these given sample parameters.
- (varying p for fixed m) For m=5 and ten different values of p, starting from p=0.00005 and being divided by 2 in each step, we define ten observations for each of these combinations of sample parameters. This allows us to assess which sample size is just sufficient to reliably estimate the model parameters.
- (varying m for fixed p) For p=0.00001 and eight different values of m, starting from m=1 and doubling in each step, we define ten observations for each of these combinations of sample parameters. This allows us to assess the benefit of sampling more controls per event.
- (varying m and p for a given total budget of dyads) Keeping the number of sampled dyads (events plus controls) constant at the value determined by p=0.0001 and m=5, we let m vary from 2 to 256 and decrease p accordingly, defining ten observations for each of these combinations of values. This allows us to assess whether sampling more events at the expense of fewer controls, or the other way round, gives more reliable estimates.
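For orientation, the expected sample sizes implied by p and m are easy to compute. A small helper (hypothetical, assuming roughly 360 million input events):

```python
def expected_sample_size(n_input_events, p, m):
    """Expected number of sampled events and of total sampled dyads
    (each sampled event contributes itself plus m controls)."""
    sampled_events = n_input_events * p
    return sampled_events, sampled_events * (1 + m)

# With p = 0.0001 and m = 5 over ~360 million input events, this yields
# about 36,000 sampled events and 216,000 sampled dyads per output file.
```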
In general, some tens of thousands of events (sampled from the 360 million input events) and a small number of controls per event seem to be enough to reliably estimate model parameters. Some of the effects, most notably the degree effects popularity and activity, can even be estimated with much smaller numbers of events, such as a few hundred. The parameter of the repetition effect turned out to be the one with the highest variability under sampling. This is probably due to the very skewed distribution of the repetition statistic among the controls: most controls have a value of zero and only very few have non-zero values, so controls with non-zero repetition are rarely sampled.
In general, if the total budget of sampled dyads is limited, it seems preferable to sample more events and fewer controls. Thus, the number of controls per event (that is, the parameter number.of.non.events in the observation definitions) should be set to a small number, for instance, in the interval from two to ten.
For more details see the paper: Lerner and Lomi (2020). Reliability of relational event model estimates under sampling: how to fit a relational event model to 360 million dyadic events. Network Science, 8(1):97-135. (DOI: https://doi.org/10.1017/nws.2019.57)