RHEM effects (reference guide)

This page provides an exhaustive list and discussion of the RHEM effects available in eventnet starting from Version 1.0 ("eventnet one"). We recommend that new users first work through at least some of the more basic tutorials linked in the eventnet wiki and only then read this more formal reference guide.

Eventnet one (Version 1.0 or later) comes with three important changes.

  • The functionality for dyadic relational event models (REM) and for relational hyperevent models (RHEM) is now provided in a single JAR file (eventnet-1.0.jar or later; see the JAR files listed in https://github.com/juergenlerner/eventnet/tree/master/jars).
  • RHEM can now also be specified purely in the graphical user interface (GUI).
  • RHEM effects have been completely reorganized. The number of different core types of RHEM statistics has been reduced, but a more efficient use of the arguments of statistics provides a much larger variety of possible RHEM effects than in prior versions. Note that because of this reorganization, configuration files from versions prior to 1.0 will most likely no longer work with eventnet one. (The JAR files of prior versions are, however, still available at https://github.com/juergenlerner/eventnet/tree/master/jars/old_versions-0.x.)

We recall that RHEM effects are specified via a combination of event network attributes (representing the state of the event network by explicitly storing specific information about past events and/or exogenous variables) and event network statistics (which are the explanatory variables of RHEM and are defined as functions of attributes). There is only a relatively small number of different core types of attributes and statistics. These core types can give rise to a larger variety of RHEM effects by specifying the arguments of attributes and statistics. We start by giving an informal overview of the different effects and provide a more exhaustive discussion further below.

Overview of effects

The core types of RHEM effects for directed and undirected hyperevents are similar, although the undirected versions typically have fewer variations. In the type names introduced below DHE stands for "directed hyperedge" and UHE stands for "undirected hyperedge".

  1. Hyperedge size (typenames: DHE_SIZE_STAT and UHE_SIZE_STAT) gives the number of participating nodes. The directed version has an argument endpoint (which can be SOURCE, TARGET, or unspecified) to give the number of source nodes, target nodes, or the sum of the two, respectively.
  2. Node attribute statistics (typenames: DHE_NODE_STAT and UHE_NODE_STAT) give a summary function of a specified node attribute over the participating nodes. Node attributes may be exogenous (such as "age") or endogenous (such as the number of prior events sent or received) and a specific value na.value indicates a missing value. There is a variety of summary functions ("aggregation functions") representing central tendencies (e.g., averages) or dispersion/heterogeneity (e.g., pairwise absolute differences or standard deviation) over the event participants. The directed version can specify an endpoint to aggregate values only over sources or targets.
  3. Repetition and reciprocation statistics (exact) (typenames: DHE_REPETITION_STAT and UHE_REPETITION_STAT) can test for the tendency to exactly repeat the participating nodes. For instance, a directed hyperevent could be explained by exact repetition if there is a prior event on exactly the same hyperedge. In contrast to the more lenient "subset repetition" discussed below, this requires that the prior and current event participants do not differ in a single source node or target node. The directed version has an argument direction, which can be used to specify repetition, reciprocation, or undirected repetition effects (see the extended discussion below) and an argument endpoint to test for repetition/reciprocation only among the sources or targets. Information on prior events is stored in a specified hyperedge attribute (see the extended discussion below).
  4. Subset repetition and subset reciprocation statistics (typenames: DHE_SUB_REPETITION_STAT and UHE_SUB_REPETITION_STAT) can test for the tendency to partially repeat the participating nodes. For instance, if a new event has five participating nodes and three of them have co-participated in an event before (possibly together with yet other participants), then the new event could be explained by subset repetition of order three. Subset repetition is a very versatile family of effects which often, empirically, have strong explanatory power in RHEM. For instance, it can represent effects dependent on node attributes, dyad attributes, or attributes on larger subsets of nodes; it can explain repetition, reciprocation, as well as degree effects (activity or popularity). The many variations of subset repetition are explained in the extended discussion below.
  5. Geometrically-weighted subset repetition statistics (typenames: DHE_GW_SUB_REP_STAT and UHE_GW_SUB_REP_STAT; since version 1.2) are very similar to subset repetition discussed above, where the difference is in how the size of the overlap between past events and future events is weighted. GW subset repetition is expected to yield more robust effects and more parsimonious models than the more traditional subset repetition. The weighting in GW subset repetition is inspired by respective statistics defined for exponential random graph models; see, e.g., Hunter and Handcock (2006).
  6. Triadic closure statistics (typenames: DHE_CLOSURE_STAT and UHE_CLOSURE_STAT) can test whether two actors, who have previously co-participated in events with the same "third" actor, are likely to jointly participate in future events themselves (thereby "closing" a triangle). The many variations of triadic closure are explained in the extended discussion below.
  7. Four-cycle statistics (typenames: DHE_4CYCLE_STAT and UHE_4CYCLE_STAT; since Version 1.1) can test for tendencies to close "three-paths" of the form $i-i'-i''-j$ to four-cycles by events in which nodes $i$ and $j$ co-participate.
  8. Neighbor statistics (typenames: DHE_NEIGHBOR_STAT and UHE_NEIGHBOR_STAT; since Version 1.1) can aggregate node-level attributes over the neighbors $i'$ of event participants $i$. The neighbors can be weighted by specified dyad attributes on the dyad $(i,i')$ connecting the participant $i$ with the neighbor $i'$. These statistics are, thus, variants of weighted degrees that can take into account weights on the dyads connecting participants with neighbors and/or weights (numeric properties specified as node-level attributes) of the neighbors.
  9. Network (or "global") statistics (typenames: DHE_NETWORK_STAT and UHE_NETWORK_STAT) give global variables, such as indicators of specific time periods or the number of transpired events in the entire network.

Eventnet configurations, specifying RHEM effects through network attributes and network statistics, can be defined either in the eventnet GUI (graphical user interface), by editing the XML configuration files, or by a combination of the two (also compare the more basic tutorials linked from the eventnet wiki). Working with the GUI might be easier at the beginning since users don't have to care about the correct XML syntax. Directly working with the configuration files (e.g., by extending an initial configuration) might be more efficient for complex models containing many effects that are systematic variations of some core effects.

Event network attributes

Event network attributes represent the state of the network by recording selected, specified information on past events. RHEM effects are then added to the model through hyperedge statistics that are functions of these attributes. Even though this page only discusses RHEM effects, the corresponding hyperedge statistics can be functions of node-level attributes, dyad-level attributes, network-level attributes, and/or hyperedge attributes. For instance, the probability that a group of actors constitutes the participant list of an event (that is, the nodes of a hyperedge) may depend on exogenous actor-level attributes, such as age or gender, which can be represented in eventnet through node-level attributes.

In general, attributes of all classes can be exogenous (i.e., properties of actors, dyads, or hyperedges that are defined irrespective of any interaction events) or they can be endogenous (attributes depending on the history of past interaction events). Values for both types of attributes are provided to eventnet in a single input file - a CSV file typically having the columns "source", "target", "time", "type", "weight", and "event id" (see the introductory tutorials in the eventnet wiki).

Common arguments for all attribute classes

Irrespective of their class (node, dyad, network, or hyperedge) and their specific type (see below), attribute specifications in eventnet have the following arguments.

  • A list of event responses indicating which types of events in the input file cause an update of the attribute. Each event response has an event type, a specification of the initial value of the update, and an optional list of elementary functions which are applied successively to the initial value before updating. The initial value is usually the event weight (or 1.0 if no column for event weights is given), but it can also be a counter in one of five different time scales; choosing a time scale makes it possible, for instance, to record the last event time of a node, dyad, network, or hyperedge. (Event responses for node-level attributes can have a few more arguments, discussed below.)
  • An update type which is either INCREMENT_VALUE_BY or SET_VALUE_TO. The first option (increment by) specifies that update values are added to the previous value on the same element (node, dyad, network, or hyperedge) and the second option (set to) specifies that the new value of the attribute is set to the update value, irrespective of the previous value on the same element. (Note that all attributes assume by default the value 0.0 on elements that have never been updated.) Setting values to the given update value is often chosen for exogenously defined attributes (for instance, a node-level attribute "smokes" taking the value 1.0 for smokers and 0.0 for non-smokers; another common example is an exogenous "at risk" indicator revealing which nodes could potentially participate in an event at a given point in time). Incrementing values makes it possible, for instance, to count the cumulative number of prior events (of certain types) on nodes, dyads, networks, or hyperedges - or to record the cumulative weight of such events.
  • An optional decay which is specified by a given halflife period, together with a given time scale. A decay means that the value of an attribute on a given element gets halved, whenever time advances by one halflife period (note however, that the value might get updated in-between as a response to a new event). No decay means that attribute values remain constant until they get updated due to events.
  • Optional threshold values specifying that attribute values are rounded to 0.0 if they drop below (or rise above) the given threshold. These settings can be relevant for attributes with a decay, in two scenarios. First, to save memory when processing very large event networks. Second (a rare and more sophisticated use), to introduce attributes with sharp decays, for instance, attributes assigning the value one only to the single dyad that experienced the last event in the entire network (at any given moment in time) and the value zero to any other dyad.
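The interplay of update type and decay can be illustrated with a minimal Python sketch. It is not eventnet's implementation: the class name `ToyAttribute` and its methods are hypothetical, event-response functions and thresholds are left out, and the decay is assumed to be a smooth exponential decay that halves the value once per halflife period.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Optional, Tuple


@dataclass
class ToyAttribute:
    """Illustrative attribute store (hypothetical; not eventnet's code)."""
    update_type: str = "INCREMENT_VALUE_BY"  # or "SET_VALUE_TO"
    halflife: Optional[float] = None         # None means no decay
    # element -> (value at last update, time of last update)
    state: Dict[Hashable, Tuple[float, float]] = field(default_factory=dict)

    def value(self, element: Hashable, now: float) -> float:
        """Current value; elements that were never updated have value 0.0."""
        if element not in self.state:
            return 0.0
        stored, last_time = self.state[element]
        if self.halflife is None:
            return stored
        # Assumption: smooth exponential decay, halving once per halflife.
        return stored * 0.5 ** ((now - last_time) / self.halflife)

    def update(self, element: Hashable, now: float,
               update_value: float = 1.0) -> None:
        """Apply one event response at time `now`."""
        if self.update_type == "INCREMENT_VALUE_BY":
            new_value = self.value(element, now) + update_value
        elif self.update_type == "SET_VALUE_TO":
            new_value = update_value
        else:
            raise ValueError(f"unknown update type: {self.update_type}")
        self.state[element] = (new_value, now)


activity = ToyAttribute(update_type="INCREMENT_VALUE_BY", halflife=10.0)
activity.update("a", now=0.0)         # value of "a" becomes 1.0
activity.update("a", now=10.0)        # decayed to 0.5, then incremented to 1.5
print(activity.value("a", now=20.0))  # 0.75
```

With `update_type="SET_VALUE_TO"` and no halflife, the same class mimics an exogenous indicator such as "smokes", whose value stays constant between updates.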

Node-level attributes

A node-level attribute defines values on individual nodes $i$, where the values may vary by time. Node-level attributes can be used in the specification of RHEM in four different ways. (1) In hyperedge node statistics which are defined as summary statistics of attribute values over the nodes participating in a hyperevent. (2) In hyperedge neighbor statistics which are defined as summary statistics of attribute values over the neighbors of event participants. (3) To let node properties moderate the effect of "common neighbors" in hyperedge closure statistics. (4) To specify which nodes are at risk of participating in an event at a given point in time. The last usage applies when defining observations; the other three apply in the specification of hyperedge statistics (see below). The first usage scenario could even be replaced by specifying hyperedge subset repetition statistics on appropriate hyperedge attributes (note that a hyperedge may also contain just one node so that a hyperedge attribute could mimic a node-level attribute; see further explanation below).

There are two types of node-level attributes in eventnet.

  • DEFAULT_NODE_LEVEL_ATTRIBUTE can update values on the source $i$, the target $j$, or both due to an event on the dyad $(i,j)$. Apart from the arguments common to all attributes, node-level attributes have to specify for each event response a direction which is either OUT (only the value for the source $i$ gets updated), IN (only the value for the target $j$ gets updated), or SYM (both get updated). Other choices of the direction are not supported. A node-level attribute can specify an additional list of functions for a given endpoint (SOURCE or TARGET), which are applied only to the update values of that endpoint (not to the other if SYM is the selected direction). For instance, if a node-level attribute should represent the number of past wins minus the number of past losses, an event on $(i,j)$, representing that $i$ wins against $j$, could thus increment the attribute for the winner $i$ and decrement the attribute for the loser $j$.
  • NODE_LEVEL_EVENT_RISK_ATTRIBUTE (rarely used) is a type of attribute that can record how often a node has been at risk of experiencing an event (as a sender, receiver, or both). It is thus one of the few attributes that also update the values of elements that did not participate in an event (but could have participated). In addition to the arguments of the default node-level attribute, this attribute type has arguments for the time scale on which updates should be triggered and for a node-level attribute that gives the initial update values (if the latter is not specified, the initial value is 1.0).
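The "wins minus losses" example for DEFAULT_NODE_LEVEL_ATTRIBUTE can be sketched in Python as follows. The function `update_node_attribute` and its parameters `source_fn` and `target_fn` are hypothetical names chosen for illustration; they stand for the direction argument and the endpoint-specific function lists described above and are not part of eventnet.

```python
from collections import defaultdict


def update_node_attribute(values, source, target, direction, weight=1.0,
                          source_fn=lambda x: x, target_fn=lambda x: x):
    """Increment node values in response to one dyadic event (source, target).

    direction: "OUT" updates the source only, "IN" the target only,
    "SYM" both. source_fn / target_fn play the role of the additional
    endpoint-specific functions applied to the update value.
    """
    if direction not in ("OUT", "IN", "SYM"):
        raise ValueError(f"unsupported direction: {direction}")
    if direction in ("OUT", "SYM"):
        values[source] += source_fn(weight)
    if direction in ("IN", "SYM"):
        values[target] += target_fn(weight)


# An event (i, j) means "i wins against j": +1 for the winner, -1 for the loser.
score = defaultdict(float)
for winner, loser in [("a", "b"), ("a", "c"), ("b", "a")]:
    update_node_attribute(score, winner, loser, "SYM",
                          target_fn=lambda x: -x)

print(dict(score))  # {'a': 1.0, 'b': 0.0, 'c': -1.0}
```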

Dyad-level attributes

A dyad-level attribute defines values on directed pairs of nodes $(i,j)$, where the values may vary by time. Dyad-level attributes are used in the specification of RHEM for the definition of hyperedge closure statistics, hyperedge four-cycle statistics, and hyperedge neighbor statistics. Another conceivable scenario would be to represent dyadic exogenous attributes (such as being in a kinship relation or being co-workers) and let these attributes influence the probability of jointly participating in hyperevents. Such effects, however, can be specified by the use of hyperedge attributes (recall that a hyperedge may also contain exactly two nodes), together with hyperedge subset repetition statistics (see below).

Eventnet implements three types of dyad-level attributes.

  • DEFAULT_DYAD_LEVEL_ATTRIBUTE can update the value on a dyad $(i,j)$ due to an event on the same dyad $(i,j)$. Note that dyads in eventnet are always directed and that the value is always updated on the "forward dyad" (from the source to the target). Effects dependent on reverse dyads (such as reciprocation) specify this reversal in the definition of the respective statistic, by setting the direction to IN. Similarly, if an effect is "undirected" (for instance, if interactions on a dyad $(i,j)$ are assumed to depend on past interactions from $i$ to $j$ or from $j$ to $i$), statistics can specify this by setting the direction to SYM. The default dyad-level attribute has no additional arguments (apart from those common to all attributes).
  • DYAD_LEVEL_ATTRIBUTE_FROM_UHE can update values on all pairs of different nodes $(i,i'),{i\neq i'\in[i_1,\dots,i_k]}$ due to an undirected hyperevent with participants $[i_1,\dots,i_k]$. Note that values are updated in both directions $(i,i')$ and $(i',i)$; indeed, since the participants of an undirected hyperevent are not ordered, there is no "first" and "second" node. Further note that an undirected hyperevent with $k$ participants results in the update of $k(k-1)$ dyads. The "dyad-level attribute from undirected hyperedges" has no additional arguments (apart from those common to all attributes).
  • DYAD_LEVEL_ATTRIBUTE_FROM_DHE can update values on pairs of different nodes participating in a directed hyperevent $(t,[i_1,\dots,i_k],[j_1,\dots,j_{\ell}])$. Apart from the arguments common to all attributes, this attribute has an additional argument endpoint, which can be SOURCE, TARGET, or unspecified. If the endpoint is SOURCE, values are updated on all pairs among the sources $(i,i'),{i\neq i'\in[i_1,\dots,i_k]}$. Thus, the behavior is equivalent to that of DYAD_LEVEL_ATTRIBUTE_FROM_UHE if the undirected hyperedge is $[i_1,\dots,i_k]$. Similarly, if the endpoint is TARGET, values are updated on all pairs among the targets $(j,j'),{j\neq j'\in[j_1,\dots,j_{\ell}]}$. Thus, the behavior is equivalent to that of DYAD_LEVEL_ATTRIBUTE_FROM_UHE if the undirected hyperedge is $[j_1,\dots,j_{\ell}]$. Finally, if no endpoint is specified, values are updated on all source-target pairs $(i,j),{i\in[i_1,\dots,i_k],j\in[j_1,\dots,j_{\ell}]}$. In the latter case, values are only updated in the forward direction (from source to target). If values on reverse dyads are needed, this can be specified in the statistics by setting the appropriate direction.
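Which dyads are touched by a single hyperevent can be enumerated with a short Python sketch. The function names `dyads_from_uhe` and `dyads_from_dhe` are hypothetical and only mirror the behavior described in the list above; for the unspecified endpoint the sketch skips pairs $(i,i)$, in line with the restriction to pairs of different nodes.

```python
from itertools import permutations, product


def dyads_from_uhe(participants):
    """All ordered pairs (i, i') of different participants: k*(k-1) dyads."""
    return list(permutations(participants, 2))


def dyads_from_dhe(sources, targets, endpoint=None):
    """Dyads updated by a directed hyperevent, depending on the endpoint."""
    if endpoint == "SOURCE":
        return dyads_from_uhe(sources)
    if endpoint == "TARGET":
        return dyads_from_uhe(targets)
    if endpoint is None:
        # Forward direction only: from each source to each target.
        return [(i, j) for i, j in product(sources, targets) if i != j]
    raise ValueError(f"unsupported endpoint: {endpoint}")


print(len(dyads_from_uhe(["a", "b", "c", "d"])))  # 12, that is 4 * 3
print(dyads_from_dhe(["a", "b"], ["c"]))          # [('a', 'c'), ('b', 'c')]
```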

Network-level attributes

A network-level attribute defines one single value for the entire network, where the value may vary by time. Network-level attributes can be used to define network-wide, or "global", statistics that take the same value for all nodes, dyads, or hyperedges. There is one type of network-level attribute, DEFAULT_NETWORK_LEVEL_ATTRIBUTE, which updates its only value due to dyadic events $(i,j)$. The attribute can, for instance, record the cumulative number of events (of a certain type) transpired in the entire network, their aggregated weight, but can also be updated due to "dummy events", for instance, setting the onset and termination of specific, exogenously defined, time periods.

General considerations on hyperedge attributes

Hyperedge attributes define values for undirected hyperedges (i.e., unordered sets of nodes $[i_1,\dots,i_k]$) or directed hyperedges (i.e., pairs $([i_1,\dots,i_k],[j_1,\dots,j_{\ell}])$ comprising a set of senders $[i_1,\dots,i_k]$ and a set of receivers $[j_1,\dots,j_{\ell}]$). Hyperedge attributes provide more information than the derived dyad-level attributes (see above). For instance, if a dyad-level attribute from undirected hyperedges reveals that three nodes $i_1,i_2,i_3$ have pairwise co-participated in prior events (that is, if the three pairs $(i_1,i_2)$, $(i_1,i_3)$, and $(i_2,i_3)$ take non-zero values), then this does not reveal whether there is any prior event comprising all three nodes $i_1,i_2,i_3$ as participants. Hyperedge attributes do reveal this information and are necessary for the specification of the family of effects repetition and subset repetition.

We point out that effects in a RHEM for directed hyperevents $(t,[i_1,\dots,i_k],[j_1,\dots,j_{\ell}])$ may still depend on undirected hyperedge attributes. For instance, a frequent and strong effect in email communication networks, see Lerner and Lomi (2023), is the "reply-to-all" pattern in which a receiver of a previous email sends a message to the previous sender and to all the other receivers of the first email. This pattern is neither exact repetition (since some nodes switch the roles of senders and receivers), nor is it exact reciprocation (since some nodes do not switch the role from receiver to sender). The pattern is characterized by the property that the union of senders and receivers of the first event is equal to that of the second event. This pattern can be expressed by a repetition statistic (see below) defined on an undirected hyperedge attribute storing previous interaction among nodes participating either as senders or as receivers. In general, a directed hyperevent $(t,[i_1,\dots,i_k],[j_1,\dots,j_{\ell}])$ may change the values of hyperedge attributes in four different ways. (1) It may change the value of a directed hyperedge attribute on the directed hyperedge $([i_1,\dots,i_k],[j_1,\dots,j_{\ell}])$; (2) it may change the value of an undirected hyperedge attribute on the undirected hyperedge comprising the source nodes $[i_1,\dots,i_k]$, or (3) on the undirected hyperedge comprising the target nodes $[j_1,\dots,j_{\ell}]$; or (4) it may change the value of an undirected hyperedge attribute on the undirected hyperedge comprising the union of the source nodes with the target nodes $[i_1,\dots,i_k,j_1,\dots,j_{\ell}]$. In a given specification of a RHEM any or all of these variants may be included.
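The four hyperedges on which a directed hyperevent can update values, and the way the union variant captures "reply-to-all", can be made concrete with a small Python sketch. Representing hyperedges as frozensets and the helper name `hyperedge_keys` are assumptions made for illustration only.

```python
def hyperedge_keys(sources, targets):
    """The four hyperedges a directed hyperevent may update values on."""
    s, t = frozenset(sources), frozenset(targets)
    return {
        "directed": (s, t),  # (1) directed hyperedge (sources, targets)
        "source": s,         # (2) undirected hyperedge of the sources
        "target": t,         # (3) undirected hyperedge of the targets
        "union": s | t,      # (4) undirected hyperedge of sources and targets
    }


# "a" mails "b", "c", "d"; then "b" replies to all.
first = hyperedge_keys(["a"], ["b", "c", "d"])
reply = hyperedge_keys(["b"], ["a", "c", "d"])

print(reply["directed"] == first["directed"])                   # False
print(reply["directed"] == (first["target"], first["source"]))  # False
print(reply["union"] == first["union"])                         # True
```

The reply is neither an exact repetition nor an exact reciprocation of the first email, but both events map to the same union hyperedge.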

Hyperedge attributes are used in the RHEM effect families repetition and subset repetition. Hyperedge attributes may store information on prior interaction events (e.g., the number of prior events with participant list $[i_1,\dots,i_k]$) but may also represent exogenous properties of nodes, dyads, or larger subsets. For instance, a hyperedge attribute may provide the information which actors are members of the same family, organization, or department. Subset repetition statistics defined on such attributes can then test if members of the same family (organization, department) are likely to co-attend the same events.

Below we explain how to specify such undirected and directed hyperedge attributes.

Undirected hyperedge attributes

Undirected hyperedge attributes define values on undirected hyperedges, that is, unordered sets of nodes $[i_1,\dots,i_k]$, where the values may change over time. There are two types of undirected hyperedge attributes.

  • DEFAULT_UHE_ATTRIBUTE can update values on the hyperedge of participants $[i_1,\dots,i_k]$ of an undirected hyperevent $(t,[i_1,\dots,i_k])$ or it can update values on the hyperedge of source nodes $[i_1,\dots,i_k]$, on the hyperedge of target nodes $[j_1,\dots,j_{\ell}]$, or on the hyperedge $[i_1,\dots,i_k,j_1,\dots,j_{\ell}]$ comprising the union of source nodes and target nodes of a directed hyperevent $(t,[i_1,\dots,i_k],[j_1,\dots,j_{\ell}])$. Besides the arguments common to all event network attributes, DEFAULT_UHE_ATTRIBUTE has the argument endpoint, which can take the value SOURCE (meaning that values are updated on the hyperedge comprising the source nodes) or TARGET (meaning that values are updated on the hyperedge comprising the target nodes), or can be left unspecified (meaning that values are updated on the hyperedge comprising the union of the source nodes and target nodes). The latter option (unspecified) is also used if the update is due to an undirected hyperevent.
  • UHE_P_DEGREE_ATTRIBUTE (rarely used) can update values on all hyperedges $[i'_1,\dots,i'_p]$ of order $p$ that are subsets of the hyperedge of participants $[i_1,\dots,i_k]$ of an undirected hyperevent $(t,[i_1,\dots,i_k])$. The argument hyperedge size is a positive integer, giving the value of $p$. By setting the argument endpoint, it is possible to update values on all $p$-element sub-hyperedges of the sources, targets, or the union of sources and targets of a directed hyperevent (similar to the semantics of the endpoint argument of DEFAULT_UHE_ATTRIBUTE).

We note that a subset repetition statistic of order $p$ (see below) can be based on a hyperedge attribute specified via DEFAULT_UHE_ATTRIBUTE or via UHE_P_DEGREE_ATTRIBUTE (where the $p$ of the statistic must match exactly the $p$ of the attribute). In most cases, using DEFAULT_UHE_ATTRIBUTE is the better option for two reasons. First, the same hyperedge attribute specified via DEFAULT_UHE_ATTRIBUTE can be used to define subset repetition of different orders (that is, varying $p$), while doing this via UHE_P_DEGREE_ATTRIBUTE would require a different attribute for each value of $p$. Second, using UHE_P_DEGREE_ATTRIBUTE is much less efficient in terms of runtime and memory use. Note that a single hyperevent $(t,[i_1,\dots,i_k])$ will result in the update of as many values as there are $p$-element subsets of a set with $k$ elements. This number can be prohibitively large even for moderate values of $k$ and $p$. In practical applications UHE_P_DEGREE_ATTRIBUTE should only be used for very small values of $p$, such as one, two, or perhaps three if the maximum hyperedge size is rather limited. In fact, the only reason to ever use UHE_P_DEGREE_ATTRIBUTE at all (rather than the more efficient DEFAULT_UHE_ATTRIBUTE) is if subset repetition should be defined with an aggregation function different from SUM and AVERAGE. This is still a rather uncommon use, which, however, makes it possible to specify new types of effects that depend not only on the average past co-attendance (or familiarity) of a hyperevent's participants but also on the heterogeneity in their familiarity over subsets of the hyperedge. Another potentially useful scenario is to define higher-order heterogeneity effects based on exogenous attributes (such as kinship or affiliation). For more explanation, see the discussion of the subset repetition statistic below.
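The growth in the number of updates can be checked with a few lines of Python. The helper `sub_hyperedges` is a hypothetical name; the count itself is the binomial coefficient "k choose p", computed here with `math.comb`.

```python
from itertools import combinations
from math import comb


def sub_hyperedges(participants, p):
    """All p-element subsets of the participants of one hyperevent."""
    return [frozenset(c) for c in combinations(sorted(participants), p)]


# A hyperevent with 4 participants touches 6 sub-hyperedges of order 2.
print(len(sub_hyperedges(["a", "b", "c", "d"], 2)))  # 6

# Number of values updated by one hyperevent of size k for order p.
for k, p in [(10, 2), (20, 5), (50, 5)]:
    print(k, p, comb(k, p))
```

A single event with 50 participants already triggers more than two million updates for $p=5$.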

Directed hyperedge attributes

Directed hyperedge attributes define values on directed hyperedges, that is, pairs $([i_1,\dots,i_k],[j_1,\dots,j_{\ell}])$ comprising a set of source nodes $[i_1,\dots,i_k]$ and a set of target nodes $[j_1,\dots,j_{\ell}]$, where the values may change over time. There are two types of directed hyperedge attributes.

  • DEFAULT_DHE_ATTRIBUTE can update values on the directed hyperedge $([i_1,\dots,i_k],[j_1,\dots,j_{\ell}])$ of a directed hyperevent $(t,[i_1,\dots,i_k],[j_1,\dots,j_{\ell}])$. DEFAULT_DHE_ATTRIBUTE has no further arguments besides the arguments common to all event network attributes.
  • DHE_PQ_DEGREE_ATTRIBUTE (rarely used) can update values on all directed hyperedges $([i'_1,\dots,i'_p],[j'_1,\dots,j'_q])$ of order $(p,q)$, such that $[i'_1,\dots,i'_p]$ is a subset of the sources $[i_1,\dots,i_k]$ and $[j'_1,\dots,j'_q]$ is a subset of the targets $[j_1,\dots,j_{\ell}]$ of a directed hyperevent $(t,[i_1,\dots,i_k],[j_1,\dots,j_{\ell}])$. Besides the arguments common to all event network attributes, DHE_PQ_DEGREE_ATTRIBUTE has two arguments: source size, giving the value of $p$, and target size, giving the value of $q$.

Similar comments apply as for the UHE_P_DEGREE_ATTRIBUTE. In particular, using DHE_PQ_DEGREE_ATTRIBUTE is very inefficient for even moderate values of $p$ and/or $q$ and is only needed for directed subset repetition with an aggregation function different from SUM or AVERAGE.
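For the directed case, the sub-hyperedges of order $(p,q)$ are all combinations of a $p$-element subset of the sources with a $q$-element subset of the targets. The following Python sketch enumerates them for a tiny event; the function name `sub_dhe` is hypothetical.

```python
from itertools import combinations, product


def sub_dhe(sources, targets, p, q):
    """All directed sub-hyperedges of order (p, q) of one directed hyperevent."""
    source_subsets = combinations(sorted(sources), p)
    target_subsets = combinations(sorted(targets), q)
    return [(frozenset(s), frozenset(t))
            for s, t in product(source_subsets, target_subsets)]


# 3 sources and 2 targets give (3 choose 2) * (2 choose 1) = 6 sub-hyperedges.
print(len(sub_dhe(["a", "b", "c"], ["x", "y"], p=2, q=1)))  # 6
```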

RHEM statistics

RHEM statistics are the explanatory variables of RHEM, that is, the variables seeking to explain by which factor the interaction frequency on one hyperedge is higher or lower than the frequency on another hyperedge. RHEM statistics are typically functions of event network attributes - which might be attributes on the node-level, dyad-level, network-level, or hyperedge-level. The core types of RHEM effects for directed and undirected hyperevents are similar, although the undirected versions typically have fewer variations. Effects in directed RHEM are specified via statistics assigning values to directed hyperevents $(t,[i_1,\dots,i_k],[j_1,\dots,j_{\ell}])$, where $t$ is the time of the event, $[i_1,\dots,i_k]$ is the set of $k$ source nodes, and $[j_1,\dots,j_{\ell}]$ is the set of $\ell$ target nodes. Effects in undirected RHEM are specified via statistics assigning values to undirected hyperevents $(t,[i_1,\dots,i_k])$, where $t$ is the time of the event and $[i_1,\dots,i_k]$ is the set of $k$ participating nodes. There is a relatively small number of core types of RHEM statistics, listed below, which can give rise to many variations through type-specific arguments. The argument function is the only one that is common to all statistics; it gives an elementary function transforming the value of the statistic before it is written to the output file. In the type names introduced below DHE stands for "directed hyperedge" and UHE stands for "undirected hyperedge".

Hyperedge size

Typenames UHE_SIZE_STAT and DHE_SIZE_STAT.

The hyperedge size statistic for undirected hyperedges gives the number of participating nodes, that is the value $k$ for the hyperevent $(t,[i_1,\dots,i_k])$. Hyperedge size for directed hyperedges gives the number of source nodes ($k$), target nodes ($\ell$), or the sum of the two ($k+\ell$) for the hyperevent $(t,[i_1,\dots,i_k],[j_1,\dots,j_{\ell}])$. DHE_SIZE_STAT has one argument.

  • endpoint indicates whether the number of sources (if endpoint is SOURCE), targets (if endpoint is TARGET), or the sum of the two (if endpoint is unspecified) should be returned.
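As a minimal illustration of the endpoint argument, the following Python sketch computes the three possible values for one directed hyperevent. The function `dhe_size` is a hypothetical stand-in for DHE_SIZE_STAT, not eventnet code.

```python
def dhe_size(sources, targets, endpoint=None):
    """Size of a directed hyperedge: k, l, or k + l depending on the endpoint."""
    if endpoint == "SOURCE":
        return len(sources)
    if endpoint == "TARGET":
        return len(targets)
    if endpoint is None:
        return len(sources) + len(targets)
    raise ValueError(f"unsupported endpoint: {endpoint}")


sources, targets = ["a"], ["b", "c", "d"]
print(dhe_size(sources, targets, "SOURCE"))  # 1
print(dhe_size(sources, targets, "TARGET"))  # 3
print(dhe_size(sources, targets))            # 4
```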

Node attribute statistics

Typenames UHE_NODE_STAT and DHE_NODE_STAT.

Node attribute statistics aggregate the value of a node-level attribute over the participants, sources, targets, or sources and targets of hyperedges. An aggregation function can be specified to indicate how values are aggregated. A specified na value is used to indicate missing values. (Nodes with missing values are ignored in the aggregation.) Node attribute statistics have the following arguments, where the argument endpoint is only for the directed version (all other arguments are for the directed and undirected version).

  • node attribute gives the name of the node-level attribute that should be aggregated.
  • endpoint (only for DHE_NODE_STAT) takes the values SOURCE (aggregation is over source nodes), TARGET (aggregation is over target nodes), or is unspecified (aggregation is over source and target nodes). Note that the precise meaning of the latter depends on the specified aggregation function (see below).
  • na value is a decimal number specifying which value should indicate a "missing value". Missing values are ignored in the aggregation. If no node has a non-missing value, the statistic takes a default value, dependent on the specified aggregation function.
  • aggregation function specifies the type of function used to aggregate values. The default (if no aggregation function is specified) is AVERAGE: the arithmetic mean of the values over the nodes is taken. The other possible aggregation functions are SUM (sum of values), MAX (maximum value), MIN (minimum value), PRODUCT (product of values), SDEV (standard deviation of values, that is, the square root of the average of the squared differences of values from the mean value; note: division is by the number of values), SAMPLESDEV (sample standard deviation of values, sometimes called the "unbiased" estimator of the standard deviation; in contrast to SDEV, the division is by the number of values minus one), ABSDIFF (average of absolute differences over the pairs of values; see further details below), and CATDIFF (ratio of pairs of different values; typically applied for categorical attributes, such as "affiliation" or "department"; see further details below).
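The dispersion functions can be written out as formulas. The notation is introduced here for illustration only: $x_1,\dots,x_n$ are the non-missing attribute values of the nodes considered, $\bar{x}$ is their arithmetic mean, and $P$ is the set of node pairs over which differences are taken (SAMPLESDEV requires $n\geq 2$).

```latex
$$
\mathrm{SDEV} = \sqrt{\frac{1}{n}\sum_{m=1}^{n}\left(x_m-\bar{x}\right)^2},
\qquad
\mathrm{SAMPLESDEV} = \sqrt{\frac{1}{n-1}\sum_{m=1}^{n}\left(x_m-\bar{x}\right)^2}
$$

$$
\mathrm{ABSDIFF} = \frac{1}{|P|}\sum_{(u,v)\in P}\left|x_u-x_v\right|,
\qquad
\mathrm{CATDIFF} = \frac{\left|\{(u,v)\in P \,:\, x_u\neq x_v\}\right|}{|P|}
$$
```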

Details

The aggregation functions AVERAGE, SUM, MAX, MIN, and PRODUCT are indicators of the central tendency of values over the nodes of hyperedges. The resulting node attribute statistics can be used to test "first order effects" of node attributes: whether hyperedges whose participants, sources, and/or targets take higher values in the respective attribute typically experience events at a higher or lower rate. For instance, if the attribute gives the age of actors, the aggregation function takes the average, and the endpoint specifies the SOURCE, then a positive parameter of the statistic indicates that older actors are typically more active in sending events; if the endpoint specifies TARGET, a positive parameter indicates that older actors are typically more popular in receiving events; if the endpoint is unspecified, then a positive parameter indicates that older actors are typically more active and/or popular. For a binary node-level attribute (e.g., "female" taking the value one for females and zero else), the aggregation function AVERAGE gives the ratio of nodes (in the sources, targets, or both) that are females.

The aggregation functions SDEV, SAMPLESDEV, ABSDIFF, and CATDIFF are indicators of the dispersion, or the heterogeneity, of values over the nodes of hyperedges. The resulting node attribute statistics can be used to test whether nodes typically interact with (that is, co-participate in the same hyperevents with) dissimilar or similar other nodes. A positive parameter value would reveal heterophily (preference for dissimilar others) and a negative parameter would reveal homophily (preference for similar others).

The two aggregation functions ABSDIFF and CATDIFF behave in a special way in relation to the specified endpoint when applied to a directed hyperedge $([i_1,\dots,i_k],[j_1,\dots,j_{\ell}])$. If the endpoint is SOURCE, differences among all pairs of source nodes $i\neq i'\in [i_1,\dots,i_k]$ are aggregated. If the endpoint is TARGET, differences among all pairs of target nodes $j\neq j'\in [j_1,\dots,j_{\ell}]$ are aggregated. If the endpoint is unspecified, differences among source-target pairs $(i,j)$ with $i\in [i_1,\dots,i_k]$ and $j\in [j_1,\dots,j_{\ell}]$ are aggregated. Thus, for these two aggregation functions, one can test whether nodes tend to co-participate in events with (dis-)similar others as sources, as targets, or (if the endpoint is unspecified) whether source nodes tend to send events to (dis-)similar target nodes. See Lerner and Lomi (2023) for further details.
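The aggregation functions described above can be illustrated with a small Python sketch. This is not eventnet code: the function names `aggregate`, `node_pairs`, and `pair_difference`, as well as the example attributes `age` and `dept`, are hypothetical and only serve to make the definitions concrete. The sketch does not handle hyperedges with too few nodes to form a pair (or a single value for SAMPLESDEV), since the behavior in those cases is not stated here.

```python
from itertools import combinations, product
from math import prod, sqrt


def aggregate(values, function="AVERAGE"):
    """Aggregate a non-empty list of numeric attribute values."""
    n = len(values)
    mean = sum(values) / n
    if function == "AVERAGE":
        return mean
    if function == "SUM":
        return sum(values)
    if function == "MAX":
        return max(values)
    if function == "MIN":
        return min(values)
    if function == "PRODUCT":
        return prod(values)
    if function == "SDEV":  # divide by the number of values
        return sqrt(sum((v - mean) ** 2 for v in values) / n)
    if function == "SAMPLESDEV":  # divide by the number of values minus one
        return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    raise ValueError(f"unknown aggregation function: {function}")


def node_pairs(sources, targets, endpoint=None):
    """Pairs of nodes compared by ABSDIFF and CATDIFF for a directed hyperedge."""
    if endpoint == "SOURCE":
        return list(combinations(sources, 2))
    if endpoint == "TARGET":
        return list(combinations(targets, 2))
    return list(product(sources, targets))  # unspecified: source-target pairs


def pair_difference(pairs, attribute, function="ABSDIFF"):
    """Average absolute difference (ABSDIFF) or share of differing pairs (CATDIFF)."""
    if function == "ABSDIFF":
        diffs = [abs(attribute[u] - attribute[v]) for u, v in pairs]
    else:  # CATDIFF
        diffs = [float(attribute[u] != attribute[v]) for u, v in pairs]
    return sum(diffs) / len(diffs)


# Hypothetical directed hyperedge and node attributes.
sources, targets = ["a", "b"], ["c", "d"]
age = {"a": 30, "b": 40, "c": 50, "d": 20}
dept = {"a": "X", "b": "Y", "c": "X", "d": "X"}

print(aggregate([age[v] for v in sources]))                 # average age of the sources
print(pair_difference(node_pairs(sources, targets), age))   # ABSDIFF over source-target pairs
print(pair_difference(node_pairs(sources, targets), dept, "CATDIFF"))
```

With the endpoint unspecified, the two dispersion functions compare every source with every target, so they measure how (dis-)similar senders are to receivers rather than how (dis-)similar senders are to each other.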

Exact repetition (and exact reciprocation)

Typenames: UHE_REPETITION_STAT and DHE_REPETITION_STAT.

An exact repetition statistic applied to an undirected hyperedge $[i_1,\dots,i_k]$ returns the value of a hyperedge attribute on exactly the same hyperedge $[i_1,\dots,i_k]$. This marks a difference to the much more general and lenient subset repetition statistics discussed below which return (summaries of) values of hyperedge attributes on overlapping, rather than identical, hyperedges. Similarly, exact repetition applied to a directed hyperedge $([i_1,\dots,i_k],[j_1,\dots,j_{\ell}])$ returns, depending on the arguments endpoint and direction, the value of a hyperedge attribute on the exact same directed hyperedge $([i_1,\dots,i_k],[j_1,\dots,j_{\ell}])$, on the reverse hyperedge $([j_1,\dots,j_{\ell}],[i_1,\dots,i_k])$, on the exact source hyperedge $[i_1,\dots,i_k]$, on the exact target hyperedge $[j_1,\dots,j_{\ell}]$, or on the exact union of sources and targets $[i_1,\dots,i_k,j_1,\dots,j_{\ell}]$.

DHE_REPETITION_STAT has the following three arguments; UHE_REPETITION_STAT only has the argument hyperedge attribute but not the other two.

  • direction (only for DHE_REPETITION_STAT and only if the endpoint is unspecified) can take the values OUT (the default, yielding an exact repetition statistic), IN (yielding an exact reciprocation statistic in which all sources and targets are swapped), or SYM (yielding an undirected repetition statistic). All other directions are set to the default OUT with a warning.
  • endpoint (only for DHE_REPETITION_STAT) can take the values SOURCE (the statistic returns the value of the given attribute on the undirected source hyperedge $[i_1,\dots,i_k]$), TARGET (the statistic returns the value of the given attribute on the undirected target hyperedge $[j_1,\dots,j_{\ell}]$), or endpoint can be unspecified. In the latter case, the hyperedge on which the value of the given attribute is taken depends on the direction argument. If direction=OUT, it takes the directed hyperedge $([i_1,\dots,i_k],[j_1,\dots,j_{\ell}])$. If direction=IN, it takes the reverse directed hyperedge $([j_1,\dots,j_{\ell}],[i_1,\dots,i_k])$. If direction=SYM, it takes the undirected hyperedge comprising the union of the sources and the targets $[i_1,\dots,i_k,j_1,\dots,j_{\ell}]$.
  • hyperedge attribute gives the name of a hyperedge attribute. It has to be the name of a directed hyperedge attribute if the directed DHE_REPETITION_STAT is specified, endpoint is unspecified, and direction is OUT or IN. It has to be the name of an undirected hyperedge attribute if the undirected UHE_REPETITION_STAT is specified, or if endpoint is SOURCE or TARGET, or if direction is SYM.
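The interplay of endpoint and direction can be summarized in a short Python sketch. It is an illustration only, not eventnet's implementation: the function `repetition_key` and the two dictionaries standing in for a directed and an undirected hyperedge attribute are hypothetical, and hyperedges without a recorded value are assumed to have value zero. The example uses an email sent by `s` to `r1` and `r2`, followed by a message from `r1` to `s` and `r2`.

```python
def repetition_key(sources, targets, endpoint=None, direction="OUT"):
    """Return the hyperedge on which an exact repetition statistic looks up a value."""
    src, tgt = frozenset(sources), frozenset(targets)
    if endpoint == "SOURCE":
        return src                 # undirected source hyperedge
    if endpoint == "TARGET":
        return tgt                 # undirected target hyperedge
    if direction == "IN":
        return (tgt, src)          # reverse directed hyperedge
    if direction == "SYM":
        return src | tgt           # undirected union of sources and targets
    return (src, tgt)              # OUT (also the fallback for unsupported directions)


# Hypothetical attribute values after one prior email from s to r1 and r2.
directed_attr = {(frozenset({"s"}), frozenset({"r1", "r2"})): 1.0}
undirected_attr = {frozenset({"s", "r1", "r2"}): 1.0}

# Candidate future event: r1 sends a message to s and r2.
reply_all = (["r1"], ["s", "r2"])

exact_repetition = directed_attr.get(repetition_key(*reply_all), 0.0)
exact_reciprocation = directed_attr.get(repetition_key(*reply_all, direction="IN"), 0.0)
undirected_repetition = undirected_attr.get(repetition_key(*reply_all, direction="SYM"), 0.0)

print(exact_repetition, exact_reciprocation, undirected_repetition)
```

In this example only the setting direction=SYM finds the prior event, because the set of all participants is repeated while one node has switched from receiver to sender.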

Remarks

Exact repetition and exact reciprocation are very restrictive RHEM effects as they require that all nodes participating in a prior event, and only the nodes participating in this prior event, are repeated as participants of a future event. A single node leaving the group of participants, or a single new participant, would already imply that the future event cannot be explained by exact repetition. Exact reciprocation is even rarer and is in some cases not even theoretically possible. For instance, email communication gives rise to multicast events in which exactly one source node sends a message to an arbitrary number of receivers. If the number of receivers of an email is larger than one, then this email cannot be exactly reciprocated as the group of receivers cannot jointly send a future email (as there can only be one sender). Email communication, however, often gives rise to relaxations of exact reciprocation. First, email communication often gives rise to a "reply-to-all" pattern in which one receiver of a previous email sends a future email to the sender of the first message and to all the other receivers. This pattern can be captured by "undirected exact reciprocation", compare Lerner and Lomi (2023), specified via the setting direction=SYM, in which the union of source and target nodes is exactly repeated but where some nodes (one in this case) may switch the roles of being sender or receiver. The second relaxation of exact reciprocation is "subset reciprocation" (or "partial reciprocation") in which a target of a previous event sends a future event to the sender of the same previous event, irrespective of whether any or all of the other receivers are repeated or not. This family of subset repetition (and subset reciprocation) effects is discussed next.

Undirected subset repetition

Typename: UHE_SUB_REPETITION_STAT.

Undirected subset repetition (or undirected partial repetition) can test for patterns in which some, but not necessarily all, participants of a prior event are repeated in a future event. For example, a first event with participants $[a,b,c,d]$ followed by a second event with participants $[a,b,c,e,f]$ would give rise to subset repetition of order one, two, and three, since three participants of the first event ($a,b,c$) jointly participate in the second event. Subset repetition allows that some participant of the first event (namely $d$) does not participate in the second event and it allows that some participants of the second event (namely $e$ and $f$) have not participated in the first event. Thus, the hyperedges of the two events overlap but neither of the two hyperedges is contained in the other one. It is allowed that some participants drop out and other nodes join the set of participants of a future event. We discuss undirected subset repetition here and turn to directed subset repetition in the next subsection.

Empirically, subset repetition effects (undirected or directed) are often among the strongest effects in many application settings. Therefore, subset repetition effects should usually be included in the specification of most RHEM - irrespective of whether there is substantive interest in these effects or whether they are included to control for strong regularities (or dependence on prior events) in event network dynamics.

UHE_SUB_REPETITION_STAT returns an aggregation of the values of a specified undirected hyperedge attribute over all subsets of size $p$, $[i'_1,\dots,i'_p]\subseteq[i_1,\dots,i_k]$, of an undirected hyperedge $[i_1,\dots,i_k]$. UHE_SUB_REPETITION_STAT has three type-specific arguments.

  • hyperedge attribute gives the name of an undirected hyperedge attribute whose values are to be aggregated.
  • hyperedge size is a positive integer giving the value of $p$, that is, the size of the subsets $[i'_1,\dots,i'_p]\subseteq[i_1,\dots,i_k]$ over which the attribute values are to be aggregated.
  • aggregation function gives the function to aggregate values. The list of available aggregation functions and a discussion of their behavior is given in the documentation of node attribute statistics above. The default for subset repetition is AVERAGE and this is the aggregation function that is applied in the vast majority of cases. Some effects, such as the prior shared success statistics in Lerner and Hâncean (2023), can be specified by selecting the aggregation function SUM (and normalizing the values in a way that is explained in that paper). Using any of the other aggregation functions (MAX, MIN, PRODUCT, SDEV, SAMPLESDEV, ABSDIFF, and CATDIFF) is rare and requires special care when defining the hyperedge attribute whose values are to be aggregated. In particular, while the hyperedge attribute for subset repetition with aggregation function AVERAGE or SUM is typically specified via DEFAULT_UHE_ATTRIBUTE (see above), using any of the other aggregation functions requires specifying the hyperedge attribute via UHE_P_DEGREE_ATTRIBUTE, where the values of $p$ for the attribute and the subset repetition statistic must be identical. The cautionary remarks regarding this type of attribute (see above) apply. In particular, its use can result in a prohibitively large computational runtime and/or use of computer memory, even for moderate values of $p$. UHE_P_DEGREE_ATTRIBUTE is typically only used for very small values of $p$, such as one or two (or three if the maximum hyperedge size is not too large).
Notwithstanding these cautionary remarks, the unusual aggregation functions (in particular, functions giving the dispersion of values, such as SDEV or ABSDIFF) make it possible to specify entirely new kinds of RHEM effects in which the probability that a group of actors $[i_1,\dots,i_k]$ co-participates in an event depends not only on their average familiarity (for instance, the average number of prior joint events over all pairs $i\neq i'\in [i_1,\dots,i_k]$) but also on their heterogeneity in this aspect. Indeed, a group of actors with moderate average pairwise familiarity might be characterized by a subgroup with very high pairwise familiarity and another subgroup with no joint participation at prior events. This would constitute a very different situation from a group of actors in which all pairs have approximately the same familiarity.
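The following Python sketch illustrates undirected subset repetition on the example with events $[a,b,c,d]$ and $[a,b,c,e,f]$ given above. It is a simplified illustration, not eventnet's implementation. As an assumption, the hyperedge attribute is replaced by a hypothetical stand-in, `subset_count_attribute`, that counts for each subset of size $p$ the number of prior events containing it; only the aggregation functions AVERAGE and SUM are covered, and a hyperedge with fewer than $p$ nodes is assumed to yield zero.

```python
from itertools import combinations


def subset_count_attribute(prior_events, p):
    """Hypothetical attribute: number of prior events containing each subset of size p."""
    counts = {}
    for event in prior_events:
        for subset in combinations(sorted(event), p):
            key = frozenset(subset)
            counts[key] = counts.get(key, 0) + 1
    return counts


def uhe_sub_repetition(hyperedge, attribute, p, function="AVERAGE"):
    """Aggregate attribute values over all subsets of size p of the hyperedge."""
    values = [attribute.get(frozenset(s), 0)
              for s in combinations(sorted(hyperedge), p)]
    if not values:  # hyperedge has fewer than p nodes (assumption: statistic is zero)
        return 0.0
    if function == "AVERAGE":
        return sum(values) / len(values)
    if function == "SUM":
        return float(sum(values))
    raise ValueError("this sketch only covers AVERAGE and SUM")


prior_events = [{"a", "b", "c", "d"}]
second_event = {"a", "b", "c", "e", "f"}

for p in (1, 2, 3, 4):
    attribute = subset_count_attribute(prior_events, p)
    print(p, uhe_sub_repetition(second_event, attribute, p))
```

Under these assumptions the statistic is positive for $p=1,2,3$ and zero for $p=4$, since no four participants of the second event have jointly participated in the first one.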

Directed subset repetition (and subset reciprocation)

Typename: DHE_SUB_REPETITION_STAT.

Directed subset repetition, including subset reciprocation, yields a family of effects for directed hyperevents in which some of the sources, targets, or sources and targets of a prior event are repeated in a future event. Directed subset repetition comes with more variations than the undirected counterpart; these are selected by setting the arguments endpoint and/or direction. Examples include the following variants. Directed subset repetition requires that some of the sources of the prior event are sources of the future event and some of the targets of the prior event are targets of the future event. Subset reciprocation requires that roles are switched so that prior sources become future targets and prior targets become future sources. Undirected subset reciprocation (applied to directed hyperevents) allows that some, but not necessarily all, of the repeated nodes switch their roles from sources to targets or the other way round (see the related discussion for undirected exact repetition above). Another variant requires that the target nodes of a future hyperevent include some sources and some targets of a prior event. This variant makes it possible, for instance, to specify the effect of "citing a paper and some of its references" introduced in Lerner et al. (2024), which, empirically in that study, turns out to be the strongest effect explaining citation networks.

Formally, DHE_SUB_REPETITION_STAT returns an aggregation of the values of a specified directed or undirected hyperedge attribute over all sub-hyperedges of size $(p,q)$, $([i'_1,\dots,i'_p],[j'_1,\dots,j'_q])$, where $[i'_1,\dots,i'_p]$ is a $p$-element subset and $[j'_1,\dots,j'_q]$ is a $q$-element subset of the set of participating nodes $[i_1,\dots,i_k,j_1,\dots,j_l]$ of a directed hyperedge $([i_1,\dots,i_k],[j_1,\dots,j_l])$, which is the hyperedge for which the value of the subset repetition statistic has to be computed (also denoted as "this hyperedge" below). Specific variants of directed subset repetition may require more specifically that $[i'_1,\dots,i'_p]$ and $[j'_1,\dots,j'_q]$ are subsets of the sources, subsets of the targets, or subsets of the union of sources and targets (see the discussion of the arguments below).

DHE_SUB_REPETITION_STAT has the following type-specific arguments.

  • hyperedge attribute gives the name of an undirected hyperedge attribute (if direction=SYM) or the name of a directed hyperedge attribute (if direction=OUT or direction=IN) whose values are to be aggregated.
  • source size gives the number of source nodes $p\geq 0$ of the sub-hyperedges over which values are to be aggregated. While $p=0$ is allowed, it must not be the case that both $p$ and $q$ are zero. If direction=SYM, then the sum $p+q$ gives the size of undirected sub-hyperedges over which values are to be aggregated.
  • target size gives the number of target nodes $q\geq 0$ of the sub-hyperedges over which values are to be aggregated. While $q=0$ is allowed, it must not be the case that both $p$ and $q$ are zero. If direction=SYM, then the sum $p+q$ gives the size of undirected sub-hyperedges over which values are to be aggregated.
  • endpoint indicates from which set (sources, targets, sources and targets, or the union of sources and targets) the elements of the sub-hyperedges have to be taken. If endpoint=SOURCE, then both $[i'_1,\dots,i'_p]$ and $[j'_1,\dots,j'_q]$ have to be subsets of the source nodes $[i_1,\dots,i_k]$ of this hyperedge. If endpoint=TARGET, then both $[i'_1,\dots,i'_p]$ and $[j'_1,\dots,j'_q]$ have to be subsets of the target nodes $[j_1,\dots,j_l]$ of this hyperedge. If endpoint is unspecified (the default setting) and direction=OUT or direction=IN, then $[i'_1,\dots,i'_p]$ has to be a subset of the source nodes $[i_1,\dots,i_k]$ of this hyperedge and $[j'_1,\dots,j'_q]$ has to be a subset of the target nodes $[j_1,\dots,j_l]$ of this hyperedge. If endpoint is unspecified (the default setting) and direction=SYM, then both $[i'_1,\dots,i'_p]$ and $[j'_1,\dots,j'_q]$ have to be subsets of the union $[i_1,\dots,i_k,j_1,\dots,j_l]$ of the sources and targets of this hyperedge.
  • direction distinguishes between subset repetition (OUT, the default), subset reciprocation (IN), or undirected subset repetition (SYM). Other values for the direction are not supported and are set to the default OUT with a warning. If the endpoint is specified (either SOURCE or TARGET), then the direction can only be OUT or SYM; setting direction to IN in this case would actually result in the identical statistic as direction=OUT and therefore IN is changed to OUT in this case. In any case, if direction=OUT, then values of the specified directed hyperedge attribute are aggregated over the sub-hyperedges $([i'_1,\dots,i'_p],[j'_1,\dots,j'_q])$; if direction=IN, then values of the specified directed hyperedge attribute are aggregated over the reversed sub-hyperedges $([j'_1,\dots,j'_q],[i'_1,\dots,i'_p])$; if direction=SYM, then values of the specified undirected hyperedge attribute are aggregated over the undirected sub-hyperedges $[i'_1,\dots,i'_p,j'_1,\dots,j'_q]$.
  • aggregation function specifies the function to aggregate values. The same comments as for undirected subset repetition (see above) apply accordingly. In particular, for the vast majority of use cases the aggregation function is AVERAGE (the default). Specifying SUM results in statistic values that are not divided by the number of sub-hyperedges of the given sizes. If the aggregation function is AVERAGE or SUM, the hyperedge attribute is usually defined via DEFAULT_UHE_ATTRIBUTE (if undirected) or via DEFAULT_DHE_ATTRIBUTE (if directed). Using any aggregation function other than AVERAGE and SUM is rare and requires that the hyperedge attribute is specified via UHE_P_DEGREE_ATTRIBUTE (if direction=SYM), where the $p$ of the attribute is identical to the sum $p+q$ of the subset repetition statistic, or it requires that the hyperedge attribute is specified via DHE_PQ_DEGREE_ATTRIBUTE (if direction=OUT or direction=IN), where the two values $(p,q)$ of the attribute are identical to the values $(p,q)$ of the subset repetition statistic. We recall that using UHE_P_DEGREE_ATTRIBUTE or DHE_PQ_DEGREE_ATTRIBUTE is only possible for very small values of $p$ and $q$ for reasons of computational efficiency.
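How the arguments endpoint and direction select the sub-hyperedges of "this hyperedge" can be sketched in Python as follows. This is an illustration under stated assumptions, not eventnet's implementation: the function `sub_hyperedge_keys` is hypothetical, and the sketch assumes that the $p$-element and the $q$-element subset must not share nodes when both are drawn from the same set (the text above does not state how overlapping subsets are treated).

```python
from itertools import combinations, product


def sub_hyperedge_keys(sources, targets, p, q, endpoint=None, direction="OUT"):
    """Enumerate the sub-hyperedges over which attribute values would be aggregated."""
    if direction not in ("OUT", "IN", "SYM"):
        direction = "OUT"   # unsupported directions fall back to the default
    if endpoint in ("SOURCE", "TARGET") and direction == "IN":
        direction = "OUT"   # IN would yield the identical statistic in this case

    # Sets from which the p-element and the q-element subsets are drawn.
    if endpoint == "SOURCE":
        pool_p = pool_q = sorted(sources)
    elif endpoint == "TARGET":
        pool_p = pool_q = sorted(targets)
    elif direction == "SYM":
        pool_p = pool_q = sorted(set(sources) | set(targets))
    else:
        pool_p, pool_q = sorted(sources), sorted(targets)

    if direction == "SYM":
        # Undirected sub-hyperedges of size p + q.
        return [frozenset(c) for c in combinations(pool_p, p + q)]

    keys = []
    for sub_p, sub_q in product(combinations(pool_p, p), combinations(pool_q, q)):
        if not set(sub_p).isdisjoint(sub_q):
            continue        # assumption: the two subsets do not overlap
        key = (frozenset(sub_p), frozenset(sub_q))
        keys.append(key[::-1] if direction == "IN" else key)
    return keys


sources, targets = ["a", "b"], ["x", "y", "z"]
print(len(sub_hyperedge_keys(sources, targets, 1, 2)))                    # repetition
print(len(sub_hyperedge_keys(sources, targets, 1, 2, direction="SYM")))   # undirected
print(len(sub_hyperedge_keys(sources, targets, 1, 1, endpoint="TARGET")))
```

The last call corresponds to the variant in which the targets of this hyperedge are matched against one source and one target of prior events, as in the "citing a paper and some of its references" effect.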

Remark: using subset repetition to replace node attribute statistics

We point out that (undirected and directed) subset repetition statistics can replace almost all of the functionality of node attribute statistics, since sub-hyperedges can also have size one (or source size one or target size one, respectively) and therefore can represent single nodes. There are two differences. First, node attribute statistics, but not subset repetition statistics, can have a missing value. Second, if the aggregation function is ABSDIFF or CATDIFF and if the endpoint is unspecified, then node attribute statistics for directed hyperevents cannot be replaced by any specification of a subset repetition statistic, as the former compute average differences of node attribute values over all source-target pairs (see above). A purely technical difference is that node attribute statistics compute values based on node-level attributes and subset repetition statistics compute values based on hyperedge attributes. This does not restrict the use in any way but only requires that attributes of the appropriate class have to be specified. Since subset repetition can be specified for any sub-hyperedge size (not only one), subset repetition statistics can considerably extend the functionality of node attribute statistics. It is not only possible to specify statistics based on (exogenous or endogenous) attributes of single nodes but also to define statistics based on (exogenous or endogenous) attributes on node pairs, triples, or, in general, hyperedges. Examples of effects based on exogenous higher-order attributes would be given by kinship relations, co-worker relations, affiliations of actors to organizations, etc.

Geometrically-weighted subset repetition (and GW sub reciprocation)

Typenames: DHE_GW_SUB_REP_STAT and UHE_GW_SUB_REP_STAT.

Geometrically-weighted subset repetition statistics (since version 1.2) are very similar to subset repetition discussed above, where the difference is in how the size of the overlap between past events and future events is weighted. GW subset repetition is expected to yield more robust effects and more parsimonious models than the more traditional subset repetition. The weighting in GW subset repetition is inspired by respective statistics defined for exponential random graph models; see, e.g., Hunter and Handcock (2006). Preliminary experimental evidence suggests that geometrically-weighted subset repetition is especially beneficial in sparse networks (e.g., having many nodes and relatively few events). On the other hand, in dense networks, the geometrically-weighted versions may give no advantage over the more traditional subset repetition. Note that these findings are preliminary.

Formally, GW subset repetition returns the sum of specifically weighted values of a directed or undirected hyperedge attribute over all hyperedges $([i'_1,\dots,i'_p],[j'_1,\dots,j'_q])$ that have a non-empty overlap with the set of participating nodes $[i_1,\dots,i_k,j_1,\dots,j_l]$ of this hyperedge, that is, with the hyperedge for which the value of the GW subset repetition statistic has to be computed. The values of the hyperedge attribute are weighted dependent on the size of the overlap with this hyperedge and dependent on the given scaling parameters kappa and lambda, which are real values larger than or equal to 0.0. If the scaling parameters are equal to 0.0, then GW subset repetition is identical to (the above-described) subset repetition of order one. If the values of the scaling parameters increase, then the statistic puts more and more weight on hyperedges with larger overlap with this hyperedge, that is, there is an increasing emphasis on the repetition of larger subsets of nodes.

The type-specific arguments of (directed and undirected) GW subset repetition are very similar to those of subset repetition discussed above. The differences are the following. The geometrically-weighted versions do not have the arguments hyperedge size, source size, or target size but instead have the scaling parameters kappa (for the undirected version and for the scaling of source nodes in the directed version) and lambda (for the scaling of target nodes in the directed version). Moreover, GW subset repetition has no aggregation function.

DHE_GW_SUB_REP_STAT has the following type-specific arguments; the undirected counterpart UHE_GW_SUB_REP_STAT has only the arguments hyperedge attribute and kappa (but not lambda).

  • hyperedge attribute gives the name of an undirected hyperedge attribute (if direction=SYM) or the name of a directed hyperedge attribute (if direction=OUT or direction=IN) whose values are to be taken.
  • kappa defines the scaling dependent on the size of the overlap among the source nodes (or nodes for the undirected version). In general, kappa can be any real value larger than or equal to 0.0. If kappa=0.0, then the behavior of GW subset repetition with respect to the source nodes is identical to the behavior of subset repetition of order p=1. For kappa=0.0 one can therefore test whether there is a tendency that individual nodes with a larger number of past events are more likely to participate in future events - irrespective of whether these future events are attended with the same co-participants or not. That is, one can test activity effects of individual nodes. If the value of kappa increases, then the statistic puts more weight on hyperedges with larger overlap with this hyperedge. For larger kappa one can therefore test whether there is a tendency for repeated co-participation in events. In a typical use of GW subset repetition one can specify models with one subset repetition statistic of order p=1, to control for activity effects of individual nodes, and one geometrically-weighted subset repetition statistic with kappa>0.0, to test for a tendency for repeated co-participation in events, on top of any activity effects. In the directed version and if direction is not equal to SYM, it is allowed to set kappa to a negative value. This is to be understood symbolically to specify a directed GW subset repetition statistic that ignores any overlap among the source nodes and only considers overlap among the target nodes. It is not allowed to set both kappa and lambda to negative values. If direction=SYM, then kappa specifies how to weight the size of the overlap among induced undirected hyperedges; in this case kappa must not be negative.
  • lambda defines the scaling dependent on the size of the overlap among the target nodes for the directed version of GW subset repetition. In general, lambda can be any real value larger than or equal to 0.0. If lambda=0.0, then the behavior of GW subset repetition with respect to the target nodes is identical to the behavior of subset repetition of order q=1. The discussion of the behavior of kappa, given above, applies to the behavior of lambda accordingly. If direction is not equal to SYM, it is allowed to set lambda to a negative value. This is to be understood symbolically to specify a GW subset repetition statistic that ignores any overlap among the target nodes and only considers overlap among the source nodes. It is not allowed to set both kappa and lambda to negative values.
  • endpoint indicates which set of this hyperedge (sources, targets, sources and targets, or the union of sources and targets) are intersected with hyperedges that assume non-zero values in the given hyperedge attribute. The behavior is essentially the same as discussed for subset repetition - apart from the differences in scaling.
  • direction distinguishes between GW subset repetition (OUT, the default), GW subset reciprocation (IN), or undirected GW subset repetition (SYM). Other values for the direction are not supported and are set to the default OUT with a warning. If the endpoint is specified (either SOURCE or TARGET), then the direction can only be OUT or SYM; setting direction to IN in this case would actually result in the identical statistic as direction=OUT and therefore IN is changed to OUT in this case. The behavior is essentially the same as discussed for subset repetition - apart from the differences in scaling.
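The rules for the scaling parameters of the directed version can be collected in a small Python sketch. It only covers the admissible combinations of kappa, lambda, and direction stated above (with the endpoint unspecified) and deliberately does not reproduce the weighting formula itself, which is not given in this section. The function name `gw_overlap_roles` and its return values are hypothetical; that lambda is not examined for direction=SYM is an assumption of this sketch.

```python
def gw_overlap_roles(kappa, lambda_, direction="OUT"):
    """Return which overlaps a directed GW subset repetition statistic considers.

    Raises ValueError for combinations of arguments that are not allowed.
    """
    if direction == "SYM":
        if kappa < 0:
            raise ValueError("kappa must not be negative if direction=SYM")
        return ("union",)   # overlap among induced undirected hyperedges
    if kappa < 0 and lambda_ < 0:
        raise ValueError("kappa and lambda must not both be negative")
    roles = []
    if kappa >= 0:
        roles.append("sources")   # negative kappa: ignore overlap among sources
    if lambda_ >= 0:
        roles.append("targets")   # negative lambda: ignore overlap among targets
    return tuple(roles)


print(gw_overlap_roles(0.0, 1.5))    # overlap among sources and among targets
print(gw_overlap_roles(-1.0, 0.5))   # overlap among targets only
print(gw_overlap_roles(0.5, -1.0))   # overlap among sources only
```

A negative value is thus only a switch that turns off one side of the overlap; its magnitude carries no meaning in this reading.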

Triadic closure

Typenames: UHE_CLOSURE_STAT and DHE_CLOSURE_STAT.

Triadic closure denotes a large family of RHEM effects around the following pattern. If nodes $i$ and $i'$ have jointly participated in an event and nodes $j$ and $i'$ have also jointly participated in a potentially different event, then $i$ and $j$ are indirectly related over the "third" node $i'$. Given this precondition, if $i$ and $j$ co-participate in a future event, then they "close" the triad comprising $i$, $i'$, and $j$ - an effect that is often denoted as triadic closure. Triadic closure statistics for hyperevent networks aggregate the value of two-paths $i-i'-j$ over all pairs of nodes $(i,j)$ participating in some way (depending on the setting of the statistic's arguments) in an undirected or directed hyperevent (see below).

Triadic closure statistics in hyperevent networks are more lenient (that is, they apply to a larger set of dynamic patterns) than subset repetition of order three or higher. This is because subset repetition of order three requires that all three actors, $i$, $j$, and $i'$, repeatedly co-participate in the same events. In contrast, triadic closure allows that $i$ and $i'$ co-participate in a different prior event than $j$ and $i'$ and that the third node $i'$ does not need to participate in the future event linking $i$ and $j$.

Formally, triadic closure statistics aggregate values of indirect two-paths $i-i'-j$, where the two nodes $i$ and $j$ iterate over all participants, sources, targets, or sources and targets of undirected or directed hyperevents. Setting the arguments of triadic closure statistics can specialize this general pattern to several hyperevent effects. Directed closure statistics have an endpoint, which specifies whether $i$ and $j$ iterate over the sources or targets of a directed hyperevent or whether $i$ iterates over the sources and $j$ iterates over the targets. Two possibly different dyad-level attributes (and associated directions) specify the value of the dyads $(i,i')$ and $(j,i')$. A dyadic function (such as product or minimum) specifies how to combine these two values to a value for the concatenated two-path $i-i'-j$. An optional node-level attribute may specify how properties of the third actor $i'$ moderate the value of a two-path $i-i'-j$. Another dyadic function (such as sum or maximum) specifies how to combine the values of several "parallel" two-paths $i-i'-j$, $i-i''-j$, when iterating over different "third" actors $i'$, $i''$, etc. Finally, an aggregation function (such as average) specifies how to aggregate the values over all pairs of nodes $(i,j)$.

UHE_CLOSURE_STAT and DHE_CLOSURE_STAT have the following type-specific arguments, where endpoint only applies to the directed variant and all other arguments apply to directed and undirected triadic closure. The undirected closure statistic computes a value for a given undirected hyperedge $[i_1,\dots,i_k]$ and the directed closure statistic computes a value for a directed hyperedge $([i_1,\dots,i_k],[j_1,\dots,j_l])$.

  • endpoint (only for DHE_CLOSURE_STAT) can be SOURCE, TARGET, or unspecified. If endpoint=SOURCE, the two outer nodes $i$ and $j$ are obtained by iterating over all pairs of different source nodes $i\neq j\in [i_1,\dots,i_k]$. If endpoint=TARGET, the two outer nodes $i$ and $j$ are obtained by iterating over all pairs of different target nodes $i\neq j\in [j_1,\dots,j_l]$. If the endpoint is unspecified, the two outer nodes $i$ and $j$ are obtained by iterating $i$ over all source nodes $i\in [i_1,\dots,i_k]$ and iterating $j$ over all target nodes $j\in [j_1,\dots,j_l]$. For UHE_CLOSURE_STAT (where no endpoint needs to be specified), the two outer nodes $i$ and $j$ are obtained by iterating over all pairs of different participants $i\neq j\in [i_1,\dots,i_k]$. For undirected and directed closure, the third node $i'$ iterates over all nodes (not only participants, sources, or targets of the given hyperedge).
  • dyad-level attribute.1 and associated direction.1 specify the name and direction of a dyad-level attribute to obtain values for the dyad $(i,i')$, that is, the dyad connecting the node $i$ with the third node $i'$. If direction is OUT (pointing away from $i$), the value on the dyad $(i,i')$ is taken. If direction is IN (pointing towards $i$), the value on the reverse dyad $(i',i)$ is taken. If direction is SYM (pointing in both ways), the sum of the values on the two dyads $(i,i')$ and $(i',i)$ is taken. If direction is SYM_MIN (or SYM_MAX), the minimum (or maximum) of the values on the two dyads $(i,i')$ and $(i',i)$ is taken. If direction is DIFF_OUT, the difference of the values on $(i,i')$ minus the value on $(i',i)$ is taken. If direction is DIFF_IN, the difference of the values on $(i',i)$ minus the value on $(i,i')$ is taken. Varying directions OUT and/or IN can be used to specify variants such as "transitive closure", "cyclic closure", "shared senders", or "shared receivers". SYM specifies an undirected variant that takes into account previous interaction in any direction. SYM_MIN can specify variants where previous interaction on $(i,i')$ is judged by the amount of reciprocated interaction; for instance, a strong previous interaction from $i$ to $i'$ does not count if there is no previous interaction in the other direction. DIFF_OUT or DIFF_IN may be appropriate for interaction that establishes status differences between the source and the target node.
  • dyad-level attribute.2 and associated direction.2 specify the name and direction of a dyad-level attribute to obtain values for the dyad $(j,i')$, that is, the dyad connecting the node $j$ with the third node $i'$. The semantics is identical to that of the first dyad, considering that OUT is the direction pointing away from $j$.
  • function to combine concatenated / serial paths is a function that takes as input the values on the two dyads $(i,i')$ and $(j,i')$, compare the discussion of the two dyad-level attributes and directions above, and returns a value for the indirect two-path $i-i'-j$. There are two usual choices for this function: PRODUCT (the default), which returns the product of the two values, and MIN, which returns the minimum of the two values (choosing MIN would be uncommon if dyad values can also be negative). Note that if the two dyad values are binary (that is, can take only the values zero or one), then the two functions, PRODUCT and MIN will actually give the same values. Other choices for the function to combine concatenated paths are uncommon and have to be handled with care.
  • node-level attribute (optional) gives the name of a node-level attribute whose value on the "third" node $i'$ is taken to moderate the value of the indirect two-path $i-i'-j$ by multiplying the node-level attribute value with the value of this two-path. This functionality can be used to test whether nodes with certain properties have a stronger or weaker tendency to connect their neighbors.
  • function to combine parallel paths is a function that takes the values on all indirect paths $i-i'-j$, $i-i''-j$, $\dots$ (see above), iterating over all "third" nodes $i',i'',\dots$, and returns a combined value indicating how strongly $i$ and $j$ are indirectly connected via two-paths. There are two typical choices for this function: SUM (the default), which returns the sum of the values over all two-paths, and MAX, which returns the maximum of these values (choosing MAX would be uncommon if dyad values can also be negative). Other choices for the function to combine parallel paths are uncommon and have to be handled with care.
  • aggregation function gives a function aggregating the values for the strength of indirect connections via two-paths over all node pairs $(i,j)$, where $i$ and $j$ iterate over the sets of nodes determined by other arguments, especially the endpoint (see above). The default is AVERAGE. However, all other choices for the aggregation function (see the discussion given for the node attribute statistics above) are also supported (irrespective of whether they make sense in a given application).
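
The arguments above can be read as a three-stage computation: combine the two dyad values along each two-path, combine the parallel two-paths for each pair $(i,j)$, and aggregate over all pairs. The following Python sketch only illustrates this reading and is not eventnet code. The function names `dyad_value` and `closure_stat`, the dictionary representation of attributes, and the value 0.0 for missing dyads or empty sets are assumptions made for the example. The sketch also assumes that the third node differs from both $i$ and $j$, and it covers only the case of an unspecified endpoint with AVERAGE aggregation.

```python
from statistics import mean


def dyad_value(attr, a, b, direction="OUT"):
    """Value of a dyad-level attribute for node a and node b under a direction.

    attr is a dict mapping ordered pairs (a, b) to numbers (hypothetical
    representation; missing dyads are assumed to have value 0.0).
    """
    out_val = attr.get((a, b), 0.0)  # value on (a, b), pointing away from a
    in_val = attr.get((b, a), 0.0)   # value on the reverse dyad (b, a)
    if direction == "OUT":
        return out_val
    if direction == "IN":
        return in_val
    if direction == "SYM":
        return out_val + in_val
    if direction == "SYM_MIN":
        return min(out_val, in_val)
    if direction == "SYM_MAX":
        return max(out_val, in_val)
    if direction == "DIFF_OUT":
        return out_val - in_val
    if direction == "DIFF_IN":
        return in_val - out_val
    raise ValueError(f"unknown direction: {direction}")


def closure_stat(sources, targets, nodes, attr1, dir1, attr2, dir2,
                 serial="PRODUCT", parallel="SUM", node_attr=None):
    """Closure value of a directed hyperedge (endpoint unspecified, AVERAGE)."""
    pair_values = []
    for i in sources:
        for j in targets:
            if i == j:
                continue
            path_values = []
            for third in nodes:
                if third == i or third == j:  # assumption: third node is distinct
                    continue
                v1 = dyad_value(attr1, i, third, dir1)  # dyad (i, i')
                v2 = dyad_value(attr2, j, third, dir2)  # dyad (j, i')
                path = v1 * v2 if serial == "PRODUCT" else min(v1, v2)
                if node_attr is not None:  # optional moderation by the third node
                    path *= node_attr.get(third, 0.0)
                path_values.append(path)
            if not path_values:
                combined = 0.0
            elif parallel == "SUM":
                combined = sum(path_values)
            else:  # MAX
                combined = max(path_values)
            pair_values.append(combined)
    return mean(pair_values) if pair_values else 0.0


# Past interaction a -> c -> b (values 2 and 3) and a -> d -> b (values 1 and 1).
past = {("a", "c"): 2.0, ("c", "b"): 3.0, ("a", "d"): 1.0, ("d", "b"): 1.0}
everyone = ["a", "b", "c", "d"]
# direction.1 = OUT reads (i, i'); direction.2 = IN reads (i', j).
print(closure_stat(["a"], ["b"], everyone, past, "OUT", past, "IN"))
```

With PRODUCT and SUM the two two-paths contribute 6 and 1. Switching to MIN and MAX keeps only the weaker dyad of the strongest path.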

Four-cycle statistics

Typenames: UHE_4CYCLE_STAT and DHE_4CYCLE_STAT.

Four-cycle statistics are related to triadic closure but apply to situations in which two nodes $i$ and $j$ are indirectly related via two (rather than one) intermediate nodes, that is, where there is a three-path $i-i'-i''-j$, and where this three-path gets closed to a four-cycle by an event in which $i$ and $j$ jointly participate. The arguments of the four-cycle statistics are very similar to those of triadic closure. A difference is that four-cycle statistics get as arguments three (rather than two) dyad-level attributes, and associated directions, to get the values of the three dyads on the three-path indirectly connecting $i$ and $j$. The first dyad is $(i,i')$, the one pointing out of the source, the second dyad is $(j,i'')$, the one pointing out of the target, and the third dyad is $(i',i'')$, the one pointing from the neighbor of the source, $i'$, to the neighbor of the target, $i''$. Note that these dyad directions are the ones set by OUT; setting the direction to IN reverses the direction and SYM considers both directions. Another difference from triadic closure is that there is no optional node-level attribute to moderate the effects of three-paths.

Four-cycle statistics are particularly relevant for directed two-mode hyperevents. For instance, if hyperevents represent scientific papers whose authors $i_1,\dots,i_k$ cite references $j_1,\dots,j_l$, a four-cycle statistic might test the pattern that if $i$ and $i'$ have both cited reference $j'$ and if $i'$ has cited reference $j$, then $i$ has an increased tendency to also cite $j$ in the future. It is important to note that four-cycle statistics allow that the "edges" of the intermediate three-path originate from different events. A single event in which two authors $i$ and $i'$ jointly cite the two references $j$ and $j'$ may look like a four-cycle, and if the two authors repeatedly jointly cite the same two references, it may seem to point to a four-cycle effect. However, such a pattern is actually captured by subset repetition of order $(2,2)$. This implies that, when testing four-cycle effects, one should always control for subset repetition, which often turns out to be the simpler (and much stronger) explanation.

UHE_4CYCLE_STAT and DHE_4CYCLE_STAT have the following type-specific arguments, where endpoint only applies to the directed variant and all other arguments apply to directed and undirected four-cycle statistics. The undirected four-cycle statistic computes a value for a given undirected hyperedge $[i_1,\dots,i_k]$ and the directed four-cycle statistic computes a value for a directed hyperedge $([i_1,\dots,i_k],[j_1,\dots,j_l])$.

  • endpoint (only for DHE_4CYCLE_STAT) can be SOURCE, TARGET, or unspecified. If endpoint=SOURCE, the two outer nodes $i$ and $j$ are obtained by iterating over all pairs of different source nodes $i\neq j\in [i_1,\dots,i_k]$. If endpoint=TARGET, the two outer nodes $i$ and $j$ are obtained by iterating over all pairs of different target nodes $i\neq j\in [j_1,\dots,j_l]$. If the endpoint is unspecified, the two outer nodes $i$ and $j$ are obtained by iterating $i$ over all source nodes $i\in [i_1,\dots,i_k]$ and iterating $j$ over all target nodes $j\in [j_1,\dots,j_l]$. For UHE_4CYCLE_STAT (where no endpoint needs to be specified), the two outer nodes $i$ and $j$ are obtained by iterating over all pairs of different participants $i\neq j\in [i_1,\dots,i_k]$. For both the undirected and the directed four-cycle statistic, the first intermediate node $i'$, the neighbor of the source $i$, iterates over all nodes (not only participants, sources, or targets of the given hyperedge) that are different from the target $j$, and the second intermediate node $i''$, the neighbor of the target $j$, iterates over all nodes that are different from the source $i$.
  • dyad-level attribute.1 and associated direction.1 specify the name and direction of a dyad-level attribute to obtain values for the dyad $(i,i')$, that is, the dyad connecting the node $i$ with the first intermediate node $i'$. If direction is OUT (pointing away from $i$), the value on the dyad $(i,i')$ is taken. If direction is IN (pointing towards $i$), the value on the reverse dyad $(i',i)$ is taken. If direction is SYM (pointing in both ways), the sum of the values on the two dyads $(i,i')$ and $(i',i)$ is taken. If direction is SYM_MIN (or SYM_MAX), the minimum (or maximum) of the values on the two dyads $(i,i')$ and $(i',i)$ is taken. If direction is DIFF_OUT, the difference of the values on $(i,i')$ minus the value on $(i',i)$ is taken. If direction is DIFF_IN, the difference of the values on $(i',i)$ minus the value on $(i,i')$ is taken. (Moreover, see the comments given in the discussion of triadic closure above.)
  • dyad-level attribute.2 and associated direction.2 specify the name and direction of a dyad-level attribute to obtain values for the dyad $(j,i'')$, that is, the dyad connecting the node $j$ with the second intermediate node $i''$. The semantics is identical to that of the first dyad, considering that OUT is the direction pointing away from $j$.
  • dyad-level attribute.3 and associated direction.3 specify the name and direction of a dyad-level attribute to obtain values for the dyad $(i',i'')$, that is, the dyad connecting the first intermediate node $i'$ with the second intermediate node $i''$. The semantics is identical to that of the first and second dyad, considering that OUT is the direction pointing from $i'$ to $i''$.
  • function to combine concatenated / serial paths is a function that takes as input the values on the three dyads $(i,i')$, $(i',i'')$, and $(j,i'')$ (compare the discussion of the dyad-level attributes and directions above) and returns a value for the indirect three-path $i-i'-i''-j$. There are two typical choices for this function: PRODUCT (the default), which returns the product of the three values, and MIN, which returns the minimum of the three values. See further comments in the discussion of the closure statistic above.
  • function to combine parallel paths is a function that takes the values on all indirect paths $i-i'-i''-j$ (see above), iterating over all pairs of intermediate nodes $(i',i'')$, and returns a combined value indicating how strongly $i$ and $j$ are indirectly connected via three-paths. There are two typical choices for this function: SUM (the default), which returns the sum of the values over all three-paths, and MAX, which returns the maximum of these values. See further comments in the discussion of the closure statistic above.
  • aggregation function gives a function aggregating the values for the strength of indirect connections via three-paths over all node pairs $(i,j)$, where $i$ and $j$ iterate over the sets of nodes determined by other arguments, especially the endpoint (see above). The default is AVERAGE. However, all other choices for the aggregation function (see the discussion given for the node attribute statistics above) are also supported (irrespective of whether they make sense in a given application).
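
The citation example given above can be worked through in a small Python sketch. It is an illustration of the described semantics, not eventnet code. The names `three_path_strength` and `four_cycle_stat`, the dictionary `cites`, and the node labels are hypothetical. For brevity the sketch uses one dyad-level attribute for all three dyads (the statistic itself takes three), supports only the directions OUT, IN, and SYM, and assumes an unspecified endpoint with AVERAGE aggregation. The intermediate nodes are restricted only as stated above ($i'\neq j$ and $i''\neq i$); other degenerate combinations contribute zero here because the data contain no self-loops.

```python
from statistics import mean


def dyad_value(attr, a, b, direction="OUT"):
    """Value for node a and node b; missing dyads are assumed to be 0.0."""
    out_val = attr.get((a, b), 0.0)
    in_val = attr.get((b, a), 0.0)
    return {"OUT": out_val, "IN": in_val, "SYM": out_val + in_val}[direction]


def three_path_strength(i, j, nodes, attr, dir1, dir2, dir3,
                        serial="PRODUCT", parallel="SUM"):
    """Combined value of all three-paths i - i' - i'' - j."""
    values = []
    for n1 in nodes:          # first intermediate node i', neighbor of i
        if n1 == j:
            continue
        for n2 in nodes:      # second intermediate node i'', neighbor of j
            if n2 == i:
                continue
            v1 = dyad_value(attr, i, n1, dir1)    # dyad (i, i')
            v2 = dyad_value(attr, j, n2, dir2)    # dyad (j, i'')
            v3 = dyad_value(attr, n1, n2, dir3)   # dyad (i', i'')
            values.append(v1 * v2 * v3 if serial == "PRODUCT" else min(v1, v2, v3))
    if not values:
        return 0.0
    return sum(values) if parallel == "SUM" else max(values)


def four_cycle_stat(sources, targets, nodes, attr, dir1, dir2, dir3, **kwargs):
    """AVERAGE over all (source, target) pairs, endpoint unspecified."""
    pairs = [(i, j) for i in sources for j in targets if i != j]
    if not pairs:
        return 0.0
    return mean(three_path_strength(i, j, nodes, attr, dir1, dir2, dir3, **kwargs)
                for i, j in pairs)


# Past citations (author, reference): alice and bob both cited K, bob cited J;
# alice and carol both cited L, carol cited J.
cites = {("alice", "K"): 1.0, ("bob", "K"): 1.0, ("bob", "J"): 1.0,
         ("alice", "L"): 1.0, ("carol", "L"): 1.0, ("carol", "J"): 1.0}
everyone = ["alice", "bob", "carol", "dave", "J", "K", "L"]

# OUT reads (alice, K); IN reads (bob, J) for the dyad (J, bob);
# IN reads (bob, K) for the dyad (K, bob).
print(four_cycle_stat(["alice"], ["J"], everyone, cites, "OUT", "IN", "IN"))
```

In this data there are two three-paths from alice to J, one via K and bob and one via L and carol, so SUM and MAX give different values. Adding an author without past citations to the sources lowers the AVERAGE over pairs.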

Neighbor statistics (degree statistics)

Typenames: UHE_NEIGHBOR_STAT and DHE_NEIGHBOR_STAT.

Neighbor statistics aggregate the value of a node-level attribute over the neighbors of participants, sources, targets, or sources and targets of hyperedges. This type of statistics is similar to node attribute statistics, discussed above, with the difference that node statistics aggregate over the participants and neighbor statistics aggregate over their neighbors. An aggregation function can be specified to indicate how values are aggregated. Neighbor statistics have the following arguments, where the argument endpoint is only for the directed version (all other arguments are for the directed and undirected version).

  • endpoint (only for DHE_NEIGHBOR_STAT) takes the values SOURCE (aggregation is over the neighbors of source nodes), TARGET (aggregation is over the neighbors of target nodes), or is unspecified (aggregation is over the neighbors of source and target nodes). Note that the precise meaning of the latter depends on the specified aggregation function (see the discussion given for node attribute statistics above).
  • node attribute (optional) gives the name of the node-level attribute whose value should be aggregated over the neighbors. If this argument is unspecified, the value 1.0 is taken for every neighbor.
  • dyad attribute.1 and associated direction.1 specify the name and direction of a dyad-level attribute to weight the neighbors of the nodes (and to define which nodes are neighbors in the first place). If $i$ is a node participating in the hyperedge and $i'$ is one of its neighbors, then the neighbor's value is the product of the dyad-level attribute on the dyad $(i,i')$, possibly varying the direction as specified, times the value of the node-level attribute on $i'$. These values are then combined over all the neighbors of $i$ (see the "function to combine parallel edges", discussed below) and then the resulting values are aggregated over all the participating nodes $i$ of the hyperedge (see the "aggregation function", discussed below).
  • function to combine parallel edges is a function specifying how to aggregate the values over the neighbors of every node. The default is SUM. If all dyad and node attribute values are non-negative, an alternative is to take MAX. All other options are uncommon and have to be handled with care; see the discussion for the closure statistic above.
  • aggregation function specifies the type of function used to aggregate values. The default (if no aggregation function is specified) is AVERAGE: the arithmetic mean of the values over the nodes is taken. The other possible aggregation functions are SUM (sum of values), MAX (maximum value), MIN (minimum value), PRODUCT (product of values), SDEV (standard deviation of values, that is, the square root of the average of the squared differences of values from the mean value; note: division is by the number of values), SAMPLESDEV (sample standard deviation of values, sometimes called the "unbiased" estimator of the standard deviation; in contrast to SDEV, the division is by the number of values minus one), ABSDIFF (average of absolute differences over the pairs of values; see further details in the discussion of the node attribute statistic above), and CATDIFF (ratio of pairs of different values; typically applied for categorical attributes, such as "affiliation" or "department"; see further details in the discussion of the node attribute statistic above).
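
The two-stage aggregation of neighbor statistics, first over the neighbors of each node and then over the nodes of the hyperedge, can be sketched in Python as follows. This is an illustration under stated assumptions, not eventnet code. The name `neighbor_stat` and the dictionary representation are hypothetical, only the direction OUT is handled, and a node counts as a neighbor only if the dyad value is nonzero (an assumption about how the dyad-level attribute "defines" neighbors). ABSDIFF and CATDIFF are left out because their details are given with the node attribute statistic.

```python
import math
from statistics import mean, pstdev, stdev

# Aggregation functions over the per-node values. SDEV divides by the number
# of values (pstdev); SAMPLESDEV divides by the number of values minus one
# (stdev, which needs at least two values).
AGGREGATORS = {
    "AVERAGE": mean,
    "SUM": sum,
    "MAX": max,
    "MIN": min,
    "PRODUCT": math.prod,
    "SDEV": pstdev,
    "SAMPLESDEV": stdev,
}


def neighbor_stat(participants, nodes, dyad_attr, node_attr=None,
                  parallel="SUM", aggregation="AVERAGE"):
    """Neighbor statistic for a non-empty list of participants (direction OUT)."""
    per_node = []
    for i in participants:
        weighted = []
        for n in nodes:
            w = dyad_attr.get((i, n), 0.0)   # value on the dyad (i, i')
            if n == i or w == 0.0:           # assumption: zero weight, no neighbor
                continue
            x = 1.0 if node_attr is None else node_attr.get(n, 0.0)
            weighted.append(w * x)           # dyad value times node value
        if not weighted:
            per_node.append(0.0)             # assumption: no neighbors gives 0.0
        elif parallel == "SUM":
            per_node.append(sum(weighted))
        else:  # MAX
            per_node.append(max(weighted))
    return AGGREGATORS[aggregation](per_node)


ties = {("a", "x"): 2.0, ("a", "y"): 1.0, ("b", "x"): 4.0}
everyone = ["a", "b", "x", "y"]

# Without a node attribute every neighbor has value 1.0, so this is a
# weighted out-degree: 3.0 for a and 4.0 for b.
print(neighbor_stat(["a", "b"], everyone, ties))                       # AVERAGE
print(neighbor_stat(["a", "b"], everyone, ties, aggregation="SDEV"))
print(neighbor_stat(["a", "b"], everyone, ties, aggregation="SAMPLESDEV"))
```

The last two calls show the difference between SDEV and SAMPLESDEV on the same two per-node values: the former divides the sum of squared deviations by 2, the latter by 1.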

Network (global) statistics

Typenames: UHE_NETWORK_STAT and DHE_NETWORK_STAT.

The network statistics return the value of a specified network-level attribute (see above) at the given point in time. The only type-specific argument is

  • network-level attribute, giving the name of the attribute providing the values.

Since values are identical for all undirected or directed hyperedges (and may only change over time), values of this statistic cannot be used to explain whether some hyperedges are likely to experience events at a higher or lower rate. However, these statistics could be interacted with other statistics to test whether some RHEM effects are stronger or weaker in certain periods of time (testing "time heterogeneity of effects"). In addition, in a RHEM with a fully specified hazard rate this statistic could explain changes in the baseline (or average) rate over time.