Directed RHEM for multicast interaction (tutorial)

This tutorial illustrates how to specify and estimate relational hyperevent models (RHEM) for directed multicast interaction networks, such as email communication, citation networks, or virus transmission networks. In general, a directed hyperevent has a set of source nodes and a set of target nodes. (This marks a difference to undirected hyperevents that have a single set of participants without any distinction of source and target.) In many empirical applications - though not necessarily - directed hyperevents have a single source node and a set of target nodes of any size. For instance, an email message has a single sender (the source of the hyperevent) and any number of receivers (the set of targets of the hyperevent). This tutorial illustrates directed RHEM mostly for the single-source, multiple-target case. We assume familiarity with basic RHEM (see, for instance, the RHEM first steps tutorial) and indirectly with the use of eventnet in general (see, for instance, the first-steps tutorial or basic tutorial).

A general reference for RHEM is Lerner, Lomi, Mowbray, Rollings, and Tranmer (2021). Dynamic network analysis of contact diaries. Social Networks, 66:224-236. (DOI: 10.1016/j.socnet.2021.04.001), which defines and applies undirected RHEM.

RHEM for directed hyperevents, as discussed in this tutorial, are treated in: Lerner and Lomi (2023). Relational hyperevent models for polyadic interaction networks. Journal of the Royal Statistical Society: Series A https://doi.org/10.1093/jrsssa/qnac012.

RHEM (undirected and directed) were first defined in the preprint (not peer-reviewed): Lerner, Tranmer, Mowbray, and Hancean (2019). REM beyond dyads: relational hyperevent models for multi-actor interaction networks. arXiv preprint arXiv:1912.07403.

Replication data

As illustrative data we use the Enron email data, a collection of email messages among employees of Enron Corporation that was published after the company filed for bankruptcy. More precisely, we use the part of the data compiled by Zhou et al. (2007), which has also been used by Perry and Wolfe (2013). The replication data, along with further pointers to its sources, is available at https://github.com/juergenlerner/eventnet/tree/master/data/enron.

The input data for eventnet is in the single CSV file enron_events.csv. The data comprises 21,635 email messages among 156 employees, along with the actor-level attributes gender, senior, and department. An illustrating configuration for the analysis described in this tutorial is in the file enron_config.txt, which can be processed with the eventnet extension for hyperevents, available in the JAR files eventnet-x.y.jar, see the wiki for further information. To process this configuration with eventnet call, for instance, the command

java -Xmx8g -jar eventnet-1.1.jar enron_config.txt

This requires that the data file enron_events.csv is in the same directory from which you execute the command (otherwise update the input directory in the configuration file). While the example configuration enron_config.txt is sufficient for the analysis described below, some variations are illustrated in the larger configuration file enron_config_large.txt.

In the following we describe how to specify the configuration from scratch via the eventnet GUI (which is possible since eventnet version 1.0). Recall that there is a troubleshooting help page, which lists some common problems and tries to come up with solutions.

Structure of the input data for directed RHEM

The format to specify directed hyperevents is similar to that for undirected hyperevents; see here for the respective explanation. The difference is that directed hyperevents have sources and targets. Since events can have varying numbers of sources and/or targets, a single event has to be coded in several rows in the input table, one for each source/target. For instance, consider the following excerpt from the file enron_events.csv

"message.id","sender.id","receiver.id","time","type","weight"
...
16,138,58,914298240,"email",1
16,138,26,914298240,"email",1
16,138,45,914298240,"email",1
16,138,53,914298240,"email",1
17,45,138,914997420,"email",1
17,45,113,914997420,"email",1
17,45,53,914997420,"email",1
18,138,53,915416460,"email",1
19,138,59,915423060,"email",1
...

From that data snippet we see that email message number 16 has been sent by employee 138 to the four receivers 58, 26, 45, and 53. This is followed by email message number 17, sent by 45 to the three receivers 138, 113, and 53. The next email in the list has the message id 18 and has been sent by 138 to the single receiver 53; followed by email number 19 sent by 138 to the single receiver 59. Thus, for representing a hyperevent with one source and k target nodes, we need k different rows in the table. An "event id" (in our example given in the column message.id) clarifies which rows belong together to form one single hyperevent.
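The grouping of rows into hyperevents can be sketched in a few lines of Python. This is only an illustration of the input format, not part of eventnet; the function name group_hyperevents and the dictionary layout are assumptions made for this example.

```python
import csv
import io
from collections import defaultdict

# Email rows copied from the excerpt of enron_events.csv shown above.
SNIPPET = '''"message.id","sender.id","receiver.id","time","type","weight"
16,138,58,914298240,"email",1
16,138,26,914298240,"email",1
16,138,45,914298240,"email",1
16,138,53,914298240,"email",1
17,45,138,914997420,"email",1
17,45,113,914997420,"email",1
17,45,53,914997420,"email",1
18,138,53,915416460,"email",1
19,138,59,915423060,"email",1
'''

def group_hyperevents(text):
    """Collect all rows sharing a message.id into one hyperevent (hypothetical helper)."""
    events = defaultdict(lambda: {"senders": set(), "receivers": set(), "time": None})
    for row in csv.DictReader(io.StringIO(text)):
        if row["type"] != "email":
            continue  # dummy events are not hyperevents
        event = events[row["message.id"]]
        event["senders"].add(row["sender.id"])
        event["receivers"].add(row["receiver.id"])
        event["time"] = int(row["time"])
    return dict(events)

hyperevents = group_hyperevents(SNIPPET)
for message_id, event in hyperevents.items():
    print(message_id, sorted(event["senders"]), sorted(event["receivers"], key=int))
```

Each hyperevent with one source and k targets is rebuilt from its k rows; the event id is the only information that ties the rows together.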

The fourth column in the table gives the event time (the numbers represent seconds, but the time granularity is nevertheless given by the minute). This is followed by the column type, which does not vary for the email events (the type of email events is always equal to email; however, the type column is used to distinguish email events from dummy events used to set actor attributes, as explained below) and by the column weight (the weight does not vary for emails in our data; however, the weight column is used to set the values of some actor-level covariates).

Besides the rows of the input data table that define the actual events, we create a number of "dummy events" to define the set of actors (who could potentially send or receive emails) and to define various actor-level attributes. Since in our data neither the set of actors, nor their attributes, change over time, all dummy events are listed before the first email event and have as time stamp a value smaller than all the time stamps of the emails (for instance, the minimum time minus one).

The add.actor events to define the set of actors have the same structure as for undirected RHEM, see here. To provide an example, consider the following snippet from the file enron_events.csv.

"message.id","sender.id","receiver.id","time","type","weight"
...
0,154,154,910930019,"add.actor",1
0,155,155,910930019,"add.actor",1
0,156,156,910930019,"add.actor",1
1,138,59,910930020,"email",1
2,138,15,911459940,"email",1
...

In that example, we see three add.actor events, adding the actors number 154, 155, and 156. (Note that, in general, node ids can be arbitrary text, not necessarily numbers.) The respective actors are given as the source and the target of the dummy event. The event id given in the first column of all dummy events has been set to 0 (this id won't be used; any value different from that of the email events will do) and the time of the dummy events is smaller than that of all the email events. The three dummy events are followed by two email events, each of which has a single receiver.
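As a rough sketch of how such dummy rows could be generated (the replication data uses the R script enron_preprocess.R instead; the helper add_actor_rows below is hypothetical), the time stamp is simply the first email time minus one and the actor id is written to both the source and the target column:

```python
def add_actor_rows(actor_ids, first_email_time, event_id=0):
    """Build add.actor dummy rows dated just before the first email (hypothetical helper)."""
    dummy_time = first_email_time - 1  # smaller than every email time stamp
    # Column order: message.id, sender.id, receiver.id, time, type, weight
    return [(event_id, actor, actor, dummy_time, "add.actor", 1) for actor in actor_ids]

# The first email in the data has time stamp 910930020.
rows = add_actor_rows([154, 155, 156], first_email_time=910930020)
for row in rows:
    print(",".join(str(field) for field in row))
```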

Besides the add.actor events we list several dummy events that set the values of actor-level attributes. These attribute-setting dummy events have a similar structure as the add.actor events. In our data we define four binary attributes by the following types of dummy events: is.female, is.senior, is.in.Trading, and is.in.Legal (the latter two define whether the respective employee works in the trading department, or in the legal department, respectively). In general, we only have to specify those actors that have a non-zero value in a given attribute - zero is implied by not specifying any value. Moreover, we (redundantly) code the department in which the employee works by dummy events of type department taking integer values: 2 for the trading department, 1 for the legal department, and 0 for employees working in any other department. The values in the weight column always have to be numbers (integers or decimal). However, the department attribute will be understood as a categorical variable: we just use the information whether two employees work in the same or in a different department. The following snippet from enron_events.csv illustrates some of the attribute-setting dummy events.

"message.id","sender.id","receiver.id","time","type","weight"
...
0,149,149,910930019,"is.senior",1
0,153,153,910930019,"is.senior",1
0,154,154,910930019,"is.senior",1
0,155,155,910930019,"is.senior",1
0,156,156,910930019,"is.senior",1
0,1,1,910930019,"department",2
0,2,2,910930019,"department",2
0,3,3,910930019,"department",2
0,4,4,910930019,"department",1
...

We see that actors number 149, 153, 154, 155, and 156 are senior employees (it is implied that 150, 151, and 152 are junior employees), that employees 1, 2, and 3 work in the trading department (department coded by the value 2), and that employee 4 works in the legal department (coded by the value 1).
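The convention that unspecified values are zero can be illustrated with a small Python sketch. It only mimics the semantics described above; it is not how eventnet stores attributes, and the names read_attributes and attribute_value are assumptions for this example.

```python
import csv
import io
from collections import defaultdict

# A few attribute-setting dummy events taken from the snippet above.
SNIPPET = '''"message.id","sender.id","receiver.id","time","type","weight"
0,149,149,910930019,"is.senior",1
0,153,153,910930019,"is.senior",1
0,1,1,910930019,"department",2
0,4,4,910930019,"department",1
'''

def read_attributes(text):
    """Map attribute name -> {actor id -> value}, using the weight column as the value."""
    attributes = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        if row["type"] in ("email", "add.actor"):
            continue  # only attribute-setting dummy events are of interest here
        attributes[row["type"]][row["sender.id"]] = float(row["weight"])
    return attributes

def attribute_value(attributes, name, actor):
    """Actors without a dummy event implicitly have the value zero."""
    return attributes.get(name, {}).get(actor, 0.0)

attributes = read_attributes(SNIPPET)
print(attribute_value(attributes, "is.senior", "149"))   # listed, hence senior
print(attribute_value(attributes, "is.senior", "150"))   # not listed, hence junior
print(attribute_value(attributes, "department", "4"))    # legal department
```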

The R code file enron_preprocess.R illustrates how to create the input file for RHEM (enron_events.csv) from differently structured data.

Differences to configurations for undirected RHEM

In the first part of a configuration for directed RHEM there are few differences to configurations for undirected RHEM, discussed in the RHEM first steps tutorial. One notable difference is that for directed RHEM the event components SOURCE and TARGET, specified in the events tab, map to different columns in the input CSV file since there is a distinction between the sources of events and the targets of events. (Recall that when specifying configurations for undirected RHEM, the components SOURCE and TARGET typically map to the same column giving the participant of the undirected hyperevent.) Other major differences are that attributes, statistics, and observations for directed hyperevents are different from their counterparts for undirected hyperevents. In fact, there are many more possibilities for directed hyperevents.

In the following we provide details on the configuration for the Enron email data, serving as an example for directed RHEM.

RHEM configurations (files, events, and time)

The parts in the configuration giving the location and format of the input file(s) and the output directory can be set as usual for dyadic REM or for undirected RHEM.

In the events tab, we map the event component SOURCE to the column sender.id (giving the id of the employee that sends the message) and we map TARGET to the column receiver.id in which the ids of the receivers are listed. The EVENT_INTERVAL_ID is mapped to the column giving the message ids, which serve to identify which rows in the input file belong together to define one hyperevent (see the discussion on the input format given above). The event components TIME, TYPE, and WEIGHT map to the columns of the respective name in lower case. See the screenshot below.

[Screenshot: events tab]

The email communication network is a one-mode network (senders and receivers belong to the same set of nodes), which also means that node set names do not have to be specified.

Clicking on the learn event types from file button fills the first column of the table in the lower part of the events tab. Source node sets and target node sets do not have to be specified (since it is a one-mode network). In the last column (allow loops) we check the box of all event types except email (see the screenshot below). Indeed, events of type email indicate which sender sends a message to which receivers, and in the Enron data a sender never sends an email to herself. All other event types are dummy events that set the values of node-level attributes; the id of the node whose value is to be set is given in both the SOURCE and the TARGET field, which implies that these event types have to admit loops.

[Screenshot: events tab]

In the Enron email data, time is given as integers (representing seconds). However, since time resolution is by the minute, we define a time unit to be 60 (seconds). The event interval type is set to EVENT_INTERVAL, which implies that each hyperevent is assumed to happen on its own, even if another hyperevent has the same time stamp. (This happens in only a few cases in the Enron data, so that simultaneous events can be ignored.) See a screenshot of the time tab below.

[Screenshot: time tab]

RHEM configurations (attributes)

In the configuration for the Enron email data, there are different types of attributes. First there is a series of five node-level attributes recording the given actor attributes (see above): is.female, is.senior, is.in.Trading, is.in.Legal, and department. Attributes can be specified in the attributes tab in the eventnet GUI. The screenshots below demonstrate how to define the node-level attribute is.female.

Click on the button create attribute. In the next dialog, set attribute class to NODE_LEVEL and the type name to DEFAULT_NODE_LEVEL_ATTRIBUTE; click on ok. In the next dialog enter the name is.female and choose SET_VALUE_TO as the update type. The latter setting declares that values encountered in the CSV file overwrite previous values of the same attribute (if any) and do not increment previous values (contrast this to the settings for the attributes recording past email events, explained below). At the bottom of this dialog, under updates by events, click on the button add event type and in the next dialog that opens select is.female for the event type and click on the button set. The event type is.female then appears in the former dialog and clicking on the ok button creates the attribute.

[Screenshots: attributes tab (three dialogs)]

Creating the other node-level attributes proceeds in a similar way. The difference is that these attributes respond to other event types. To create these attributes more efficiently you can click on the edit copy button in the attributes tab (and change the settings accordingly), or you could open the current configuration file in a text editor and copy, paste, and adapt the settings of the node-level attributes, as shown in the snippet below. (Then read the modified configuration into the eventnet GUI via file --> merge into current configuration.)

  <attribute name="is.female" class="NODE_LEVEL" type="DEFAULT_NODE_LEVEL_ATTRIBUTE" description="">
    <attr.update type="SET_VALUE_TO"/>
    <event.response event.type="is.female" direction="OUT"/>
  </attribute>
  <attribute name="is.senior" class="NODE_LEVEL" type="DEFAULT_NODE_LEVEL_ATTRIBUTE" description="">
    <attr.update type="SET_VALUE_TO"/>
    <event.response event.type="is.senior" direction="OUT"/>
  </attribute>
  <attribute name="is.in.trading" class="NODE_LEVEL" type="DEFAULT_NODE_LEVEL_ATTRIBUTE" description="">
    <attr.update type="SET_VALUE_TO"/>
    <event.response event.type="is.in.Trading" direction="OUT"/>
  </attribute>
  <attribute name="is.in.legal" class="NODE_LEVEL" type="DEFAULT_NODE_LEVEL_ATTRIBUTE" description="">
    <attr.update type="SET_VALUE_TO"/>
    <event.response event.type="is.in.Legal" direction="OUT"/>
  </attribute>
  <attribute name="department" class="NODE_LEVEL" type="DEFAULT_NODE_LEVEL_ATTRIBUTE" description="">
    <attr.update type="SET_VALUE_TO"/>
    <event.response event.type="department" direction="OUT"/>
  </attribute>

Then we have one node-level attribute at.risk indicating which actors can send or receive. (This attribute is not really needed for this data since the set of actors does not change over time. But in general, this attribute demonstrates the mechanism by which varying actor sets could be specified.) The snippet from the configuration file below shows the settings of the at.risk attribute. The specification via the GUI works similarly to the other node-level attributes.

  <attribute name="at.risk" class="NODE_LEVEL" type="DEFAULT_NODE_LEVEL_ATTRIBUTE" description="">
    <attr.update type="SET_VALUE_TO"/>
    <event.response event.type="add.actor" direction="OUT"/>
  </attribute>

Finally, there are three attributes storing information on past events in different ways. The directed hyperedge attribute emails records for each directed hyperedge all past events on that hyperedge. For instance, for a hyperedge h=(A,{B,C,D}) the value of the attribute emails on h at a given time t is a function of all emails sent by A to the receiver set {B,C,D} before t. Note that the receiver set has to be exactly the set {B,C,D}; if an email has been sent by A to, say, {B,C,D,E}, then this past email has no influence on the value on the hyperedge h. We let the influence of past events decay over time with a half life period of one week (note that 10,080 minutes are one week). See this tutorial for general information on specifying decay in eventnet.
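To make the exact-match requirement and the half life concrete, the following Python sketch computes a decayed sum over past events on one hyperedge. It assumes plain exponential decay in which an event's contribution halves every 10,080 time units; eventnet's precise decay formula (for instance, any normalization) may differ, so see the decay tutorial for the authoritative definition. The function emails_value is hypothetical.

```python
HALFLIFE = 10080.0  # half life in time units (minutes), i.e. one week
TIME_UNIT = 60      # seconds per time unit, as set in the time tab

def emails_value(past_events, sender, receivers, t_seconds):
    """Decayed sum of past events on exactly the hyperedge (sender, receivers)."""
    target = frozenset(receivers)
    total = 0.0
    for past_sender, past_receivers, t_event, weight in past_events:
        if past_sender != sender or frozenset(past_receivers) != target:
            continue  # the receiver set has to match exactly
        if t_event >= t_seconds:
            continue  # only events strictly before t contribute
        elapsed_units = (t_seconds - t_event) / TIME_UNIT
        total += weight * 0.5 ** (elapsed_units / HALFLIFE)  # assumed decay form
    return total

ONE_WEEK_SECONDS = 7 * 24 * 3600
past = [
    ("A", {"B", "C", "D"}, 0, 1.0),
    ("A", {"B", "C", "D", "E"}, 0, 1.0),  # different receiver set: no influence
]
value_after_one_week = emails_value(past, "A", {"B", "C", "D"}, ONE_WEEK_SECONDS)
print(value_after_one_week)
```

Under these assumptions, one week after a single matching email of weight 1 the attribute value on h has dropped to one half, while the email to {B,C,D,E} contributes nothing.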

To specify the directed hyperedge attribute emails via the eventnet GUI, click on the create attribute button in the attributes tab; then select the attribute class DIR_HYPER_LEVEL and the type name DEFAULT_DHE_ATTRIBUTE; click on ok. In the create attribute dialog, give the name emails, select the update type INCREMENT_VALUE_BY (meaning that update values are added to the current attribute values, rather than set as the new attribute values), check the box right before decays, and give a halflife of 10080.0 time units. See the screenshot below.

[Screenshot: attributes tab]

Further down in the dialog, click on the add event type button and in the dialog that opens, select the event type email and click on the set button. In the former dialog click on ok to create the attribute. The corresponding snippet in the configuration file looks like this:

  <attribute name="emails" class="DIR_HYPER_LEVEL" type="DEFAULT_DHE_ATTRIBUTE" description="">
    <attr.update type="INCREMENT_VALUE_BY" decay.time.scale="TIME_UNIT" halflife="10080.0"/>
    <event.response event.type="email"/>
  </attribute>

Strictly speaking, the attribute emails captures all information about past events that is needed to define hyperedge statistics (i.e., explanatory variables in RHEM). However, for defining some statistics (see below) we specify two more attributes, emails.dyadic and emails.undirected, that redundantly store information about past email events in different ways. Specifically, emails.dyadic is a dyad-level attribute, dependent on past events. For instance, a past email sent by actor A to the receiver set {B,C,D} updates the values of the attribute emails.dyadic on three dyads: (A,B), (A,C), and (A,D). This attribute is later used to compute the various triadic closure statistics (see below). To create the attribute emails.dyadic via the GUI, click on create attribute and in the next dialog, choose the class DYAD_LEVEL and the type name DYAD_LEVEL_ATTRIBUTE_FROM_DHE; click on ok, and in the following dialogs type the name emails.dyadic and otherwise choose the same settings as for the directed hyperedge attribute emails.

[Screenshot: attributes tab]

Finally, the attribute emails.undirected stores information of past email events on undirected hyperedges. For instance, a past email sent by actor A to the receiver set {B,C,D} updates the value of the attribute emails.undirected on the undirected hyperedge {A,B,C,D} (disregarding who is the sender of the email). This attribute is later used to compute the statistic undirected.repetition, which recognizes patterns of repeated communication within a fixed group with turn-taking among the senders. This pattern arises for instance from the reply-to-all functionality typically offered by email clients: a receiver of a past message replies to the sender and to all other receivers. Creating the attribute emails.undirected is nearly identical to the creation of emails or emails.dyadic, taking into account that the attribute class is UNDIR_HYPER_LEVEL and the type name is DEFAULT_UHE_ATTRIBUTE.
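The three ways of storing one and the same past email can be summarized in a short Python sketch. The function attribute_keys is purely illustrative and says nothing about eventnet's internal data structures; it only shows on which "keys" a single email event updates each attribute.

```python
def attribute_keys(sender, receivers):
    """Keys updated by one email event in each of the three attributes (illustrative)."""
    receivers = frozenset(receivers)
    return {
        # one directed hyperedge: sender plus the exact receiver set
        "emails": (sender, receivers),
        # one dyad per receiver
        "emails.dyadic": sorted((sender, receiver) for receiver in receivers),
        # one undirected hyperedge: all participants, sender role disregarded
        "emails.undirected": frozenset({sender}) | receivers,
    }

original = attribute_keys("A", {"B", "C", "D"})
reply_all = attribute_keys("B", {"A", "C", "D"})

print(original["emails.dyadic"])
print(original["emails"] == reply_all["emails"])
print(original["emails.undirected"] == reply_all["emails.undirected"])
```

A reply-to-all message from B lands on a different directed hyperedge than the original email from A, but on the same undirected hyperedge {A,B,C,D}.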

Exhaustive information about all classes and types of attributes relevant for specifying RHEM is given in the reference guide on RHEM effects; see the section event network attributes.

RHEM configurations (statistics)

We compute a series of hyperedge statistics which later serve as the explanatory variables in RHEM. We have statistics of different types: statistics dependent on the size of the receiver set, statistics that are functions of the actor-level attributes gender, seniority, and department, and statistics dependent on past events, capturing "network effects". Among the latter there are again several different types representing different structural patterns in email communication.

Exhaustive information about all classes and types of statistics relevant for specifying RHEM is given in the reference guide on RHEM effects; see the section RHEM statistics.

Statistics: hyperedge size

The statistic num.receivers gives the size of the target set of the directed hyperedge, that is, the number of receivers. To specify this statistic in the eventnet GUI, go to the statistics tab and click on the create statistic button. Then, in the dialog that opens, select the class DHE, type name DHE_SIZE_STAT, give the name (num.receivers), check the box endpoint and select TARGET. (A respective statistic counting the number of source nodes is not needed for the email data since each email has exactly one sender, so that the source size does not vary.) Click on ok to create the statistic.

[Screenshot: statistics tab]

Statistics: actor-level attributes

For each of the two actor-level attributes is.female and is.senior, we define a series of four hyperedge statistics: giving the covariate value of the sender, the average over the receivers, the average absolute difference between the sender and the receivers, and the average absolute difference among all pairs of receivers. Below we illustrate the definition of these statistics for the attribute is.senior; the respective definitions for the is.female attribute are done by changing the node attribute name.

To create the statistic sender.seniority, click on the create statistic button. Then select the class DHE, type name DHE_NODE_STAT, give the name of the statistic, select the node attribute is.senior, endpoint SOURCE, and missing value -1.0 (or any negative number). At the bottom of the dialog, check the box below set function to aggregate values and choose AVERAGE. Click on ok to create the statistic.

[Screenshot: statistics tab]

To create the statistic receiver.avg.seniority proceed very similarly. The only difference is to select TARGET as the endpoint. The most efficient way is to click on the edit copy button right of sender.seniority, adapt the name, and make the single change; then click on ok. To define receiver.pair.abs.diff.seniority, set endpoint equal to TARGET and set the function to aggregate values to ABSDIFF. Finally, for sender.receiver.abs.diff.seniority, uncheck the box before endpoint and set the function to aggregate values to ABSDIFF. The snippet from the configuration file defining these four statistics looks like this.

  <statistic name="sender.seniority" type="DHE_NODE_STAT" node.attr.name.1="is.senior" na.value="-1.0" endpoint="SOURCE">
    <aggregation.function type="AVERAGE"/>
  </statistic>
  <statistic name="receiver.avg.seniority" type="DHE_NODE_STAT" node.attr.name.1="is.senior" na.value="-1.0" endpoint="TARGET">
    <aggregation.function type="AVERAGE"/>
  </statistic>
  <statistic name="receiver.pair.abs.diff.seniority" type="DHE_NODE_STAT" node.attr.name.1="is.senior" na.value="-1.0" endpoint="TARGET">
    <aggregation.function type="ABSDIFF"/>
  </statistic>
  <statistic name="sender.receiver.abs.diff.seniority" type="DHE_NODE_STAT" node.attr.name.1="is.senior" na.value="-1.0">
    <aggregation.function type="ABSDIFF"/>
  </statistic>

The effect of the statistic sender.seniority can reveal whether senior employees are more or less likely to send emails, while receiver.avg.seniority measures whether senior employees are more or less likely to receive emails. The effect of the statistic sender.receiver.abs.diff.seniority can reveal whether employees have a tendency to send emails to other employees of the same or different seniority. Note that the statistic captures the difference in seniority; that is, a negative parameter would reveal homophily with respect to seniority. Finally, the statistic receiver.pair.abs.diff.seniority can assess whether employees have a tendency to send emails to homogeneous receiver sets (mostly composed of senior employees or mostly composed of junior employees), regardless of the sender's own seniority. Again, receiver-set homophily would be revealed by a negative parameter.
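The four statistics can be written out as a small Python sketch for a hyperedge with one sender. This follows the verbal definitions above (averages and average absolute differences) and is not eventnet's implementation; the function names are hypothetical, and the value 0.0 returned for a receiver set with fewer than two members is an assumption.

```python
from itertools import combinations
from statistics import mean

def sender_value(attr, sender):
    return attr.get(sender, 0.0)

def receiver_avg(attr, receivers):
    return mean(attr.get(r, 0.0) for r in receivers)

def sender_receiver_abs_diff(attr, sender, receivers):
    """Average absolute difference between the sender and each receiver."""
    return mean(abs(attr.get(sender, 0.0) - attr.get(r, 0.0)) for r in receivers)

def receiver_pair_abs_diff(attr, receivers):
    """Average absolute difference over all unordered pairs of receivers."""
    pairs = list(combinations(sorted(receivers), 2))
    if not pairs:
        return 0.0  # assumption: no pairs, no difference
    return mean(abs(attr.get(a, 0.0) - attr.get(b, 0.0)) for a, b in pairs)

# Hypothetical example: A and B are senior, C and D are junior (value zero implied).
is_senior = {"A": 1.0, "B": 1.0}
sender, receivers = "A", {"B", "C", "D"}

print(sender_value(is_senior, sender))
print(receiver_avg(is_senior, receivers))
print(sender_receiver_abs_diff(is_senior, sender, receivers))
print(receiver_pair_abs_diff(is_senior, receivers))
```

In this example the senior sender A writes to one senior and two junior receivers, so two of the three sender-receiver pairs and two of the three receiver pairs differ in seniority.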

The respective statistics for the node attribute is.female are defined accordingly, changing is.senior to is.female. It might be most efficient to do this in the configuration file, using copy, paste, search, and replace in a text editor; the modified configuration file then has to be merged into the current configuration in the eventnet GUI.

The parameter na.value in the definitions above has no importance in our empirical example since none of the actor-level attributes has any missing value (so we just set na.value to an arbitrary value that does not appear in our data). However, given data with missing values (represented by -1 in the example above), the statistics would compute averages only over nodes (senders or receivers) with non-missing values.

The two statistics giving the attribute value of the sender and the average value over the receivers are also specified for the two binary attributes is.in.legal and is.in.trading (see the example configuration). However, the respective homophily statistics (measuring differences between sender and receiver, or among the receivers) are defined on the categorical attribute department. Recall that department takes integer values: 2 for the trading department, 1 for the legal department, and 0 for employees working in any other department (see above). However, these values are not interpreted numerically but as categorical values that just reveal whether employees work in the same or in different departments. This is specified by setting CATDIFF (instead of ABSDIFF) for the function to aggregate values. These statistics compute the fraction of sender-receiver pairs in which the sender works in a different department than the receiver, and the fraction of receiver pairs working in different departments, respectively. Again, negative parameter values would point to homophily with respect to the employees' department.
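The difference between a numerical and a categorical reading of the department codes can be seen in the following Python sketch, which computes the two fractions described above. The function names are hypothetical and the code is only meant to illustrate the definition, not eventnet's internals.

```python
from itertools import combinations

def sender_receiver_cat_diff(dept, sender, receivers):
    """Fraction of receivers working in a different department than the sender."""
    different = sum(dept.get(sender, 0) != dept.get(r, 0) for r in receivers)
    return different / len(receivers)

def receiver_pair_cat_diff(dept, receivers):
    """Fraction of unordered receiver pairs working in different departments."""
    pairs = list(combinations(sorted(receivers), 2))
    if not pairs:
        return 0.0  # assumption: a single receiver has no differing pair
    return sum(dept.get(a, 0) != dept.get(b, 0) for a, b in pairs) / len(pairs)

# Department codes as in the data: 2 = trading, 1 = legal, 0 (implied) = other.
department = {"1": 2, "2": 2, "3": 2, "4": 1}
sender, receivers = "1", {"2", "4", "5"}

sr_diff = sender_receiver_cat_diff(department, sender, receivers)
rr_diff = receiver_pair_cat_diff(department, receivers)
print(sr_diff, rr_diff)
```

Note that the pair (trading, other) counts exactly as much as the pair (trading, legal), whereas an absolute difference of the codes would weight the former twice as heavily.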

Statistics for network effects: exact repetition

The tendency to repeat email communication with exactly the same sender and the same set of receivers is captured by the statistic repetition, defined below. The tendency to repeat email communication within a fixed group of actors, but with turn-taking among the senders, is captured by the statistic undirected.repetition. To illustrate the difference, if an email from actor A to receivers {B,C,D} is eventually repeated by another email from the same sender (A) to the same set of receivers ({B,C,D}), then this pattern is captured by repetition. In contrast, if an email from actor A to receivers {B,C,D} is eventually followed by another email from the sender B to the set of receivers {A,C,D}, then this pattern is captured by undirected.repetition. In the second example, a receiver of the first email (B) sends a message to the previous sender (A) and to all other receivers (C and D). This pattern is frequent in email communication due to the reply-to-all functionality.

To specify the repetition statistic via the eventnet GUI, click on create statistic, select the class DHE and the type name DHE_REPETITION_STAT, type the name of the statistic, select direction equal to OUT, do not check the box "endpoint", and select the hyperedge attribute emails; then click on ok. Settings for undirected.repetition are similar. The only differences (apart from the name of the statistic) are the direction, which is set to SYM, and the hyperedge attribute, which is emails.undirected. See the screenshots below.

[Screenshots: statistics tab (two dialogs)]

Statistics for network effects: partial receiver set repetition

Besides exact repetition, multicast communication often gives rise to interaction events that partially, but not exactly, repeat receiver lists. Such patterns can point to a clustering in the set of actors. This effect is related with subset repetition in undirected RHEM: sets of actors that jointly participate in the same event are often more likely to co-participate in future events - possibly together with varying other participants. In the case of directed single-source, multiple-receiver events, we have two versions of subset repetition: partial receiver set repetition that is independent of the sender and sender-specific partial receiver set repetition.

We note that the statistics type DHE_SUB_REPETITION_STAT is much more versatile and can also be used to define (subset) reciprocation, generalized reciprocation, interaction among receivers, and more (discussed further below in this tutorial). See the reference guide on RHEM effects for an exhaustive treatment.

In the example configuration for the Enron email data we define a list of statistics capturing receiver subset repetition of varying order. The argument source.size=0 implies that this form of subset repetition is independent of the sender. For instance, if an email sent by actor A to receiver set {B,C,D} is eventually followed by an email sent by actor E to receiver set {B,C,F,G}, then this could point to receiver subset repetition of order one and two (note that the two actors B and C jointly receive both messages). The argument target.size may range from 1 up to the maximal number of receivers of any email in the data. However it is usually the case that for high values of the target size the statistic is too sparse in the Enron email data to be included in a model. Models in general can be specified by a subset of the statistics computed by eventnet.

The screenshot below illustrates how to define partial receiver set repetition of order one. Among others, direction is set to OUT, endpoint is unchecked, the source size is set to 0, and the target size is set to 1. Further down in the dialog the function to aggregate values can be set to AVERAGE; however, this is not necessary since this setting is the default. To define partial receiver set repetition of higher order, increase the values of the target size.

[Screenshot: statistics tab]

The statistics for sender-specific partial receiver set repetition (e.g., s.r.sub.rep.1 in the example configuration) are rather similar. The only difference is that we set the argument source size to 1, requiring that the sender of events is repeated together with subsets of receivers of varying size. For instance, if an email sent by actor A to receiver set {B,C,D} is eventually followed by an email by the same sender (A) to receiver set {B,C,F,G}, then this could point to sender-specific receiver subset repetition of order one and two (note that the two actors B and C jointly receive both messages and both messages have been sent by A). If the second message has been sent by another actor E, instead of A, the two emails would not point to sender-specific receiver set repetition. In general, the source size could take any value up to the maximum number of senders of a directed hyperevent. Since in our given example, emails always have just one sender, the source size cannot be set higher than one.
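The two variants of partial receiver set repetition can be sketched in Python as follows. This is a simplified reading that ignores decay and event weights and averages over all receiver subsets of the given size (mirroring the default AVERAGE aggregation); the function sub_repetition and its arguments are hypothetical, and the exact definition is given in the reference guide on RHEM effects.

```python
from itertools import combinations

def sub_repetition(past_events, sender, receivers, target_size, source_size=0):
    """Average, over receiver subsets of size target_size, of the number of past
    events that include the subset; with source_size=1 the sender must match too."""
    subsets = list(combinations(sorted(receivers), target_size))
    if not subsets:
        return 0.0  # fewer receivers than target_size
    total = 0
    for subset in subsets:
        for past_sender, past_receivers in past_events:
            if source_size == 1 and past_sender != sender:
                continue  # sender-specific variant requires the same sender
            if set(subset) <= set(past_receivers):
                total += 1
    return total / len(subsets)

past = [("A", {"B", "C", "D"})]
receivers = {"B", "C", "F", "G"}

# Sender-independent: a new email by E shares receivers B and C with the past email.
print(sub_repetition(past, "E", receivers, target_size=1))
print(sub_repetition(past, "E", receivers, target_size=2))
# Sender-specific: only a repeated sender (A) yields a non-zero value.
print(sub_repetition(past, "E", receivers, target_size=1, source_size=1))
print(sub_repetition(past, "A", receivers, target_size=1, source_size=1))
```

In the example, two of the four single receivers and one of the six receiver pairs of the new email were contained in the receiver set of the past email.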

Statistics for network effects: (generalized) reciprocation

To include reciprocation and generalized reciprocation (indegree of the sender and outdegree of the receiver) in RHEM for multicast interaction networks we add the three statistics that are also based on DHE_SUB_REPETITION_STAT - but set the direction to IN (instead of OUT) to declare that hyperedges of past events have the reverse direction than the current hyperedge. The statistic reciprocation captures patterns in which a receiver of a previous email sends an email to the sender of that same previous email. For instance, if an email send by actor A to receivers {B,C} is eventually followed by an email from B to {A,D,E}, then this could point to reciprocation. Note that B, a receiver of the previous emails sends a message among others to the previous sender A. The statistic reciprocation has source size and target size equal to 1 (and direction IN). Generalized reciprocation is included by the two statistics measuring the indegree of the sender and the outdegree of the receiver. The statistic sender.indeg is a function of previous emails received by the sender of the current hyperedge. For instance, if an email sent by actor A to receivers {B,C} is eventually followed by an email from B to {D,E,F}, then this could point to an effect of the sender's indegree: actor B receives a message and then becomes the sender of another message which, however, is not directed to the original sender but to other actors in the network. The statistic receiver.outdeg is a function of previous emails sent by the receiver of the current hyperedge. For instance, if an email sent by actor A to receivers {B,C} is eventually followed by an email from actor D to {A,E,F}, then this could point to an effect of the receiver's outdegree: actor A sends a message and then receives another message which, however, does not originate from a receiver of the first message but from another actor (D) in the network. 
The statistic sender.indeg has source size 1 and target size 0, and these two values are reversed for receiver.outdeg (in both cases, the direction is set to IN).

Note that for email data, neither the source.size nor the target.size argument of the subset reciprocation statistics can be set to a value larger than one. The current hyperedge, as well as the hyperedges of past events, has exactly one sender, so any such statistic would necessarily be zero.
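
The three reverse-direction patterns from the examples above can be told apart with a toy sketch in R. The function name and the list representation of hyperedges are hypothetical; the sketch only tests whether a single past event matches each pattern and is not meant to reproduce eventnet's statistics.

```r
# Toy classification (assumption: illustrative only, not eventnet's computation).
reverse_patterns <- function(past, current) {
  c(
    # a receiver of the past event now sends to the sender of the past event
    reciprocation   = current$sender %in% past$receivers &&
                      past$sender %in% current$receivers,
    # the current sender received the past event
    sender.indeg    = current$sender %in% past$receivers,
    # one of the current receivers sent the past event
    receiver.outdeg = past$sender %in% current$receivers
  )
}

past <- list(sender = "A", receivers = c("B", "C"))

reverse_patterns(past, list(sender = "B", receivers = c("A", "D", "E")))
reverse_patterns(past, list(sender = "B", receivers = c("D", "E", "F")))
reverse_patterns(past, list(sender = "D", receivers = c("A", "E", "F")))
```

In this toy version, the first hyperedge matches all three patterns, since reciprocation implies that the sender has received and a receiver has sent a past event. The second hyperedge matches only sender.indeg and the third only receiver.outdeg.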

Statistics for network effects: addressing senders and their receivers

Yet another structural pattern in directed hyperevent networks arises if events are sent to the sender of a previous event, jointly with some of the receivers of the same previous event. For instance, the pattern arises in citation networks if a paper cites another paper and some of the references of the latter. This effect can again be included as a statistic of type DHE_SUB_REPETITION_STAT, setting endpoint to TARGET and direction to OUT to declare that the sender and receiver(s) of past events have to be members of the receivers of the current hyperedge. These statistics are dubbed "interaction among receivers" in the following. For instance, if an email sent by actor A to the receiver set {B,C,D} is eventually followed by an email sent by actor E to receivers {A,B,C,F}, then this could point to an effect of interaction among receivers of order one and two. Note that the sender of a past event (A) receives another event jointly with two of the receivers of the same past event (B and C). The source size of these statistics is always set to 1 and the target size may range from 1 up to any number.
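
As a rough check of the example above, the following toy sketch in R counts, for a single past event, the receiver subsets that are addressed jointly with the past sender. The function name and the list representation are hypothetical, and the sketch does not reproduce how eventnet aggregates over past events.

```r
# Toy illustration (assumption: not eventnet's actual computation).
# The sender of the past event has to be among the current receivers, jointly
# with 'target.size' receivers of the same past event.
count_interaction_among_receivers <- function(past, current, target.size) {
  if (!(past$sender %in% current$receivers)) return(0)
  shared <- intersect(past$receivers, current$receivers)
  choose(length(shared), target.size)
}

past    <- list(sender = "A", receivers = c("B", "C", "D"))
current <- list(sender = "E", receivers = c("A", "B", "C", "F"))

count_interaction_among_receivers(past, current, target.size = 1)  # A with {B}, A with {C}
count_interaction_among_receivers(past, current, target.size = 2)  # A with {B,C}
```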

Statistics for network effects: triadic closure

Finally, we add a list of statistics capturing four variants of triadic closure in directed hyperevent networks: transitive closure, cyclic closure, shared sender, and shared receiver. The definitions of these statistics are nearly identical; they differ only in the arguments direction.1 and direction.2. For instance, if we have two past events on the hyperedges (A,{B,C}) and (C,{D,E,F}), then the two-path from A over C to D would be transitively closed by a future event on the hyperedge h=(A,{D,G}). In the definition of transitive.closure we set direction.1=OUT (meaning that past events "out" of the source A of h are considered) and we set direction.2=IN (meaning that past events "into" the targets D and G of h are considered).
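
The role of the two direction arguments can be illustrated with a toy sketch in R that searches for third actors connecting the source and a target of a candidate hyperedge. The function names and the dyadic edge-list representation are hypothetical. Only the setting for transitive closure (direction.1=OUT, direction.2=IN) is taken from the text; how the other three variants map to combinations of OUT and IN is not spelled out here and should be read from the example configuration.

```r
# Past events (A,{B,C}) and (C,{D,E,F}), flattened to sender-receiver pairs.
past.edges <- rbind(
  data.frame(from = "A", to = c("B", "C"), stringsAsFactors = FALSE),
  data.frame(from = "C", to = c("D", "E", "F"), stringsAsFactors = FALSE)
)

# Actors linked to 'actor' by past events going OUT of it or coming IN to it.
neighbors <- function(actor, direction) {
  if (direction == "OUT") {
    past.edges$to[past.edges$from == actor]
  } else {
    past.edges$from[past.edges$to == actor]
  }
}

# Third actors connecting source s and target t of a candidate hyperedge.
intermediaries <- function(s, t, direction.1, direction.2) {
  intersect(neighbors(s, direction.1), neighbors(t, direction.2))
}

# Candidate hyperedge h = (A,{D,G}), transitive closure setting:
intermediaries("A", "D", direction.1 = "OUT", direction.2 = "IN")  # two-path over C
intermediaries("A", "G", direction.1 = "OUT", direction.2 = "IN")  # no two-path
```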

[Screenshot: statistics tab]

Observation generators for directed hyperevents

We define two observation generators for directed hyperevents. Both specify conditional-size RHEM, that is, an observed hyperevent is only compared with non-event hyperedges that have the same number of senders and the same number of receivers as the observed event. (Note that the number of senders is always one in this application.) Both observations apply case-control sampling where we sample 100 non-event hyperedges associated with each observed event. (It would be possible to choose much lower numbers of sampled non-events, at the expense of increased standard errors and lower reliability of parameter estimates.) The difference between the two observations is that one (EMAILS_COND_SENDER) conditions on the sender of events, while the other (EMAILS) also lets the sender vary. For example, in the conditional-sender observation, an observed email event (A,{B,C,D}) is compared only with non-event hyperedges of the form (A,{E,F,G}). That is, the sender of all associated non-event hyperedges matches the sender A of the observed event. Thus, the conditional-sender observation conditions on the observation that A is the sender of the next email and models the probability of the different possible receiver sets. In the other observation (not conditioning on the sender), an observed email event (A,{B,C,D}) is also compared with non-event hyperedges of the form (H,{E,F,G}). That is, the sender of the alternatives can be different from the sender of the observed event.
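
The difference between the two observations can be sketched with a toy sampler in R. This is not how eventnet draws non-events internally: the function name, the actor set, and the choice to exclude the sender from its own receiver set are assumptions made only for illustration.

```r
set.seed(1)
actors <- LETTERS[1:10]  # hypothetical set of actors

# Draw one non-event hyperedge with the same number of receivers as 'observed'.
sample_non_event <- function(observed, actors, condition.on.source) {
  sender <- if (condition.on.source) observed$sender else sample(actors, 1)
  # assumption: receivers are drawn without replacement among the other actors
  receivers <- sample(setdiff(actors, sender), length(observed$receivers))
  list(sender = sender, receivers = receivers)
}

observed <- list(sender = "A", receivers = c("B", "C", "D"))

# 100 sampled non-events per observed event, as in the configuration
controls.cond.sender <- replicate(
  100, sample_non_event(observed, actors, condition.on.source = TRUE),
  simplify = FALSE)
controls.free.sender <- replicate(
  100, sample_non_event(observed, actors, condition.on.source = FALSE),
  simplify = FALSE)

# senders of the sampled alternatives under the two observations
table(sapply(controls.cond.sender, function(h) h$sender))
table(sapply(controls.free.sender, function(h) h$sender))
```

In the first table all sampled alternatives have sender A; in the second, the sender varies over the actor set.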

Observations can be created in the observations tab. The screenshot below illustrates how to define settings for the conditional-sender observation; for the other uncheck the box condition on source.

[Screenshot: observations tab]

This configuration can be executed from within the eventnet GUI or from a command line; see for instance the tutorial for undirected RHEM. Executing the configuration will produce two output files, one for each observation. These output files contain all statistics for the observed events and the sampled non-event hyperedges. These tables serve as input for subsequent estimation of RHEM parameters.

Recall that there is a troubleshooting help page, which lists some common problems and tries to come up with solutions.

Fitting RHEM in R

Once the statistics of all observed events and sampled non-events have been computed, estimating parameters of directed RHEM is nearly identical to estimating undirected RHEM; see the tutorial for undirected RHEM.

The given example configuration computes two tables containing statistics of events and sampled non-events: enron_events_EMAILS_COND_SENDER.csv for RHEM conditioning on the sender of observed events and enron_events_EMAILS.csv for models with unconstrained sender. We first demonstrate how to fit the conditional-sender models.

Import of the survival package, import of statistics tables, and transformation of variables can be done by the following code.

library(survival)

## set the working directory
setwd("<output directory of eventnet>")

## read the output table of eventnet (observation that conditions on the sender of events)
events.cond.sender <- read.csv("enron_events_EMAILS_COND_SENDER.csv")

## the code below refers to this table as 'events'
events <- events.cond.sender

## apply square-root transformation to statistics of network effects, but not those dependent on actor-level attributes
## (network effects are consecutive at the end of the list of variables, starting with "repetition")
first.network.var <- which(names(events) == "repetition")
events[, first.network.var:ncol(events)] <- sqrt(events[, first.network.var:ncol(events)])

The RHEM parameters are estimated, and printed, by the following code.

RHEM.cond.sender <- coxph(Surv(time = rep(1,nrow(events)), event = events$IS_OBSERVED) ~ 
                            receiver.avg.female + sender.receiver.abs.diff.female
                          + receiver.pair.abs.diff.female + receiver.avg.seniority
                          + sender.receiver.abs.diff.seniority + receiver.pair.abs.diff.seniority
                          + receiver.avg.in.legal + receiver.avg.in.trading
                          + sender.receiver.frac.diff.department 
                          + repetition + undirected.repetition 
                          + r.sub.rep.1 + r.sub.rep.2 + r.sub.rep.3 + s.r.sub.rep.1 + s.r.sub.rep.2
                          + reciprocation + receiver.outdeg + interact.receivers.1 + interact.receivers.2
                          + shared.sender + shared.receiver + transitive.closure + cyclic.closure
                          + strata(EVENT_INTERVAL)
                          , data = events
)
summary(RHEM.cond.sender)

The summary function produces the following output.

  n= 2185135, number of events= 21635 

                                           coef  exp(coef)   se(coef)       z Pr(>|z|)    
receiver.avg.female                   2.289e-01  1.257e+00  2.450e-02   9.343  < 2e-16 ***
sender.receiver.abs.diff.female      -1.709e-01  8.429e-01  2.312e-02  -7.391 1.46e-13 ***
receiver.pair.abs.diff.female        -2.435e-01  7.839e-01  7.036e-02  -3.461 0.000538 ***
receiver.avg.seniority                3.242e-01  1.383e+00  2.390e-02  13.564  < 2e-16 ***
sender.receiver.abs.diff.seniority   -3.664e-01  6.932e-01  2.234e-02 -16.399  < 2e-16 ***
receiver.pair.abs.diff.seniority     -8.042e-01  4.475e-01  7.303e-02 -11.012  < 2e-16 ***
receiver.avg.in.legal                 9.675e-02  1.102e+00  3.347e-02   2.891 0.003845 ** 
receiver.avg.in.trading              -1.218e-01  8.853e-01  2.807e-02  -4.339 1.43e-05 ***
sender.receiver.frac.diff.department -6.563e-01  5.188e-01  2.334e-02 -28.124  < 2e-16 ***
repetition                           -4.978e-01  6.079e-01  5.827e-02  -8.543  < 2e-16 ***
undirected.repetition                 1.233e+00  3.432e+00  5.123e-02  24.069  < 2e-16 ***
r.sub.rep.1                          -1.031e-01  9.020e-01  1.456e-02  -7.085 1.39e-12 ***
r.sub.rep.2                           4.670e-01  1.595e+00  6.314e-02   7.396 1.41e-13 ***
r.sub.rep.3                           3.193e+00  2.436e+01  2.206e-01  14.476  < 2e-16 ***
s.r.sub.rep.1                         1.635e+00  5.129e+00  3.604e-02  45.357  < 2e-16 ***
s.r.sub.rep.2                         5.986e+00  3.979e+02  2.302e-01  26.002  < 2e-16 ***
reciprocation                         3.231e-01  1.381e+00  3.214e-02  10.054  < 2e-16 ***
receiver.outdeg                       1.241e-02  1.012e+00  1.521e-02   0.816 0.414584    
interact.receivers.1                  2.695e+00  1.480e+01  9.436e-02  28.559  < 2e-16 ***
interact.receivers.2                  9.915e+00  2.024e+04  8.349e-01  11.875  < 2e-16 ***
shared.sender                         6.706e-01  1.955e+00  2.990e-02  22.430  < 2e-16 ***
shared.receiver                      -1.457e-01  8.644e-01  2.797e-02  -5.208 1.90e-07 ***
transitive.closure                    2.653e-01  1.304e+00  3.351e-02   7.916 2.46e-15 ***
cyclic.closure                        4.454e-03  1.004e+00  3.637e-02   0.122 0.902531    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
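
The first line of the output is consistent with the sampling design: 21635 observed events, each accompanied by 100 sampled non-events, give 21635 * 101 = 2185135 rows. A minimal sketch of such a check is given below on a hypothetical toy table; it assumes that each stratum defined by EVENT_INTERVAL contains exactly one observed event, which the row count suggests but which should be verified on the actual table.

```r
# Hypothetical toy table with the two bookkeeping columns used in the model formula:
# 3 strata, each with 1 observed event and 3 sampled non-events.
toy <- data.frame(
  EVENT_INTERVAL = rep(1:3, each = 4),
  IS_OBSERVED    = rep(c(1, 0, 0, 0), times = 3)
)

observed.per.stratum <- tapply(toy$IS_OBSERVED, toy$EVENT_INTERVAL, sum)
rows.per.stratum     <- table(toy$EVENT_INTERVAL)

all(observed.per.stratum == 1)  # one observed event per stratum
all(rows.per.stratum == 4)      # 1 event plus 3 sampled non-events

# the same arithmetic for the reported output (100 non-events per event)
21635 * (1 + 100) == 2185135
```

Applying the same two tapply/table checks to the events table read above would reveal strata with missing or additional rows.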

For instance, we find that female employees and senior employees receive emails at a higher rate. We find homophily for the attributes gender, seniority, and department (negative parameters associated with the diff statistics).
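
To read effect sizes off the table, recall that exp(coef) is the factor by which the event rate is multiplied when the corresponding variable increases by one unit, and that the network statistics entered the model after a square-root transformation. The helper below is a hypothetical convenience function, not part of eventnet or the survival package; the coefficients are copied from the output above.

```r
# Coefficients copied from the summary output above.
coefs <- c(receiver.avg.female = 0.2289, reciprocation = 0.3231)

# Multiplicative change in the event rate when a statistic changes from 'from' to 'to'.
# For square-root transformed statistics the change is measured on the sqrt scale.
rate_ratio <- function(coef, from, to, sqrt.transformed) {
  delta <- if (sqrt.transformed) sqrt(to) - sqrt(from) else to - from
  exp(coef * delta)
}

# attribute effect (not transformed): reproduces the exp(coef) column, about 1.257
rate_ratio(coefs[["receiver.avg.female"]], from = 0, to = 1, sqrt.transformed = FALSE)

# network effect (sqrt-transformed): raw statistic going from 0 to 1, about 1.381
rate_ratio(coefs[["reciprocation"]], from = 0, to = 1, sqrt.transformed = TRUE)

# raw statistic going from 0 to 4 corresponds to 2 units on the sqrt scale
rate_ratio(coefs[["reciprocation"]], from = 0, to = 4, sqrt.transformed = TRUE)
```

Because of the square-root transformation, quadrupling a raw network statistic only doubles the transformed value, so the effect of additional past events levels off.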

Among the network effects we get a negative parameter for repetition and a positive parameter for undirected.repetition. These two effects have to be considered together. Both effects consider repeated interaction within a fixed set of actors. However, while repetition requires that the sender of repeated events has to be identical to the sender of past events, undirected repetition allows turn taking among the given set of actors. Together the findings on these two effects suggest that actors have a tendency to interact within given groups - but that the sender of the last event within that group is less likely to be the sender of the next event than another member of the group.

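Most effects of (sender-specific) partial repetition of receiver sets, interaction among receivers, and reciprocation are positive; the exception is r.sub.rep.1, which has a negative parameter, and the parameter of receiver.outdeg is not significant. Among the triadic effects, having received emails from a common third actor (shared.sender) and being transitively connected (transitive.closure) increase interaction rates, whereas there is a tendency against interaction among actors who have sent messages to the same third actor (shared.receiver). The parameter of cyclic.closure is not significantly different from zero.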

Estimation of the RHEM that does not condition on the sender of observed events can be done by very similar code. However, since in that RHEM the sender is also varying, we can include additional effects that are only functions of the sender. In the given example configuration, these are the effects sender.female, sender.seniority, sender.in.legal, sender.in.trading, and sender.indeg.
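
A possible sketch of this second model is given below. The variable names are those used in this tutorial; the object names (events.all, RHEM.all, rhem.formula) are hypothetical. Applying the square-root transformation to sender.indeg is an assumption based on the rule stated above that network effects, but not attribute effects, are transformed. The model is only fitted if the output file and the survival package are available.

```r
# sender-level effects that can be added when the sender is not conditioned on
sender.terms <- c("sender.female", "sender.seniority", "sender.in.legal",
                  "sender.in.trading", "sender.indeg")

# effects of the conditional-sender model shown above
attribute.terms <- c(
  "receiver.avg.female", "sender.receiver.abs.diff.female",
  "receiver.pair.abs.diff.female", "receiver.avg.seniority",
  "sender.receiver.abs.diff.seniority", "receiver.pair.abs.diff.seniority",
  "receiver.avg.in.legal", "receiver.avg.in.trading",
  "sender.receiver.frac.diff.department")
network.terms <- c(
  "repetition", "undirected.repetition",
  "r.sub.rep.1", "r.sub.rep.2", "r.sub.rep.3", "s.r.sub.rep.1", "s.r.sub.rep.2",
  "reciprocation", "receiver.outdeg", "interact.receivers.1", "interact.receivers.2",
  "shared.sender", "shared.receiver", "transitive.closure", "cyclic.closure")

# assemble the model formula
rhs <- paste(c(sender.terms, attribute.terms, network.terms,
               "strata(EVENT_INTERVAL)"), collapse = " + ")
rhem.formula <- as.formula(paste(
  "Surv(time = rep(1, nrow(events.all)), event = IS_OBSERVED) ~", rhs))

csv.file <- "enron_events_EMAILS.csv"
if (file.exists(csv.file) && requireNamespace("survival", quietly = TRUE)) {
  library(survival)
  events.all <- read.csv(csv.file)
  # assumption: sender.indeg is treated like the other network effects
  sqrt.vars <- c("sender.indeg", network.terms)
  events.all[, sqrt.vars] <- sqrt(events.all[, sqrt.vars])
  RHEM.all <- coxph(rhem.formula, data = events.all)
  print(summary(RHEM.all))
}
```

Selecting the transformed columns by name, rather than by position, avoids relying on the column order of the second output table.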

References

โš ๏ธ **GitHub.com Fallback** โš ๏ธ