Coevolution of collaboration and references to prior work (tutorial)

This tutorial explains how the coevolution of collaboration and citation networks can be modeled with relational hyperevent models (RHEM) using eventnet. An illustrative empirical setting is given by scientific networks, which comprise two types of nodes, scientists (authors) and papers, connected by two types of relations: authors are connected to the papers they write, and papers are connected to the papers they cite as references (prior work). Scientific networks can be viewed as mixed two-mode networks, where the author-writes-paper relation constitutes a two-mode network and the paper-cites-paper relation constitutes a one-mode network that is interrelated with the two-mode network. Observations in scientific networks are sequences of time-stamped hyperevents, where each event is given by the publication of a scientific paper, involving a varying number of authors and references.

Modeling the coevolution of collaboration and citation networks allows researchers to test hypotheses explaining scientific collaboration (team formation in coauthoring networks) and scientific citation networks (which, depending on the point of view, may explain scientific orientation, impact, or inheritance), where the explanatory variables may be constructed from combinations of prior (co-)authoring and citation relations. Exemplary effects that may explain the formation of coauthor teams include familiarity (repeated coauthorship), triadic closure, or the tendency to start collaborations with those who cited one's own work or with those who cited the same references in the past. Hypothetical effects explaining the references of published papers include citation popularity (the tendency to cite papers that have already received many citations), repeated cocitations, and the tendency of authors to cite their own prior work, to cite the work of their past coauthors, or to repeatedly cite the same papers in several publications.

Besides scientific networks, there are several other social settings giving rise to dynamic interdependent networks representing collaboration and references to prior work. Common examples include patent networks where teams of inventors file patent applications that reference other patents and networks of artistic or cultural production, where teams produce, e.g., songs, movies, or other artistic products which implicitly or explicitly make references to prior work.

Technically, hypotheses about collaboration and citation networks can be tested with the eventnet software by specifying appropriate hyperedge statistics quantifying how instances (that is, actually observed or possible publications by given teams of coauthors that cite a given list of references) are embedded into the network of past publication events - similar to models explaining multicast communication described in the tutorial on directed RHEM. The difference is that many more structurally different RHEM effects are possible in mixed two-mode networks. This tutorial explains such effects - and how they can be specified with eventnet - using as an illustrative example a public dataset on more than a million scientific papers written by more than a million unique authors.

References

Models presented in this tutorial have been proposed and applied to the same data in the following paper.

Jürgen Lerner, Marian-Gabriel Hâncean, and Alessandro Lomi: Relational hyperevent models for the coevolution of coauthoring and citation networks. Journal of the Royal Statistical Society Series A: Statistics in Society, qnae068, 2024. https://doi.org/10.1093/jrsssa/qnae068

Moreover, the same model family has been applied in a study of cultural production to analyze collaboration and use of shared stylistic references of filmmakers by:

Katharina Burgdorf, Mark Wittek, and Jürgen Lerner: Communities of Style: Artistic Transformation and Social Cohesion in Hollywood, 1930 to 1999. Socius, 10, 23780231241257334, 2024. https://doi.org/10.1177/23780231241257334

Reproducibility (data, preprocessing, modeling)

This tutorial illustrates modeling with data from the Aminer citation network data set (https://www.aminer.org/citation, version V14). A reference for the data is the following (also see the Aminer webpage for further references).

Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su: ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'2008), pp. 990-998.

Auxiliary software for preprocessing (AminerJSON2CSVDocType.java), an eventnet configuration file (aminer.config.joint.txt), and an R script (aminer_model.R) to reproduce the analysis described below are linked from the README file of this directory, where we also give the commands to start the processing.

The following sections focus on a high-level description of the steps to be done.

Structure of the input data

One unit of analysis in scientific networks is given by a tuple (t, j, {a1, a2, ...}, {j1, j2, ...}) indicating that at time t authors a1, a2, ... published paper j citing the prior papers j1, j2, ... as their references. The number of authors and references per paper varies and is theoretically unbounded. The RHEM described in this tutorial seek to estimate relative publication rates: factors by which a publication with the team of coauthors a1, a2, ... citing the references j1, j2, ... is more or less likely than a possible publication with a different coauthor team a1', a2', ... citing different references j1', j2', .... The eventnet software can compute hyperedge statistics (explanatory variables) for observed publication events and for "non-events", that is, pairs consisting of an alternative coauthor team and an alternative set of references that theoretically could have been the authors and references of a paper published at the same time, but were not.
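As a minimal sketch, one such unit of analysis can be held in a small record type. The class name Publication and its field names are illustrative assumptions and are not part of eventnet or of the preprocessing software; the ids are those of the example paper used in the snippets of this tutorial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Publication:
    """One unit of analysis (t, j, {a1, a2, ...}, {j1, j2, ...}).

    Hypothetical helper type for illustration only.
    """
    year: int                # publication time t
    paper_id: str            # published paper j
    authors: frozenset       # set of author ids {a1, a2, ...}
    references: frozenset    # set of cited paper ids {j1, j2, ...}


example = Publication(
    year=1948,
    paper_id="53e9aeb1b7602d97038d305e",
    authors=frozenset({
        "53f43dc9dabfaeecd69967bb",
        "53f45ed9dabfaee2a1d9289c",
    }),
    references=frozenset({
        "53e9a472b7602d9702d92e92",
        "53e9b188b7602d9703c0eeda",
        "53e9adc2b7602d97037c8a5f",
    }),
)

print(len(example.authors), "authors,", len(example.references), "references")
```

Sets are used for both components because a hyperevent is defined by which authors and which references are involved, not by their order or multiplicity.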

The focal event type (labeling events whose rate we are going to estimate) is denoted as author.ref.paper in the illustrative eventnet configuration. It is the type for directed, two-mode hyperevents whose first component is the set of authors of a newly published paper and whose second component is the set of references of that paper. Since the number of authors and references is varying, such a single hyperevent is specified in several rows in the CSV input files (compare the tutorial on directed RHEM). An example is given by the following snippet, which can be found relatively early in the input file (around Line 173).

Year,EventID,Source,Target,Type
...
1948,53e9aeb1b7602d97038d305e:author.ref.paper,53f43dc9dabfaeecd69967bb,53e9a472b7602d9702d92e92,author.ref.paper
1948,53e9aeb1b7602d97038d305e:author.ref.paper,53f43dc9dabfaeecd69967bb,53e9b188b7602d9703c0eeda,author.ref.paper
1948,53e9aeb1b7602d97038d305e:author.ref.paper,53f43dc9dabfaeecd69967bb,53e9adc2b7602d97037c8a5f,author.ref.paper
1948,53e9aeb1b7602d97038d305e:author.ref.paper,53f43dc9dabfaeecd69967bb,53e9a472b7602d9702d92e92,author.ref.paper
1948,53e9aeb1b7602d97038d305e:author.ref.paper,53f45ed9dabfaee2a1d9289c,53e9a472b7602d9702d92e92,author.ref.paper
...

These rows result from a paper whose id ends with 05e (the ids of papers and authors are taken without change from the Aminer data), which was published in the year 1948, has two authors whose ids end with 7bb and 89c, and cites three papers whose ids end with e92, eda, and a5f. (The authors appear in the column Source and the references in the column Target.) This publication results in 5 rows for this event type since we first fix the first author and iterate over all the references, then we fix the first reference and iterate over all the authors. The number of rows from one publication is therefore equal to the number of authors plus the number of references. (There would be a slightly more space-efficient way to specify the sets of authors and references, generating a number of rows equal to the maximum of these two numbers. This would reduce the size of the input file but not the size of the main memory needed during execution of eventnet.) It does not matter that some authors (or references) appear more than once, since duplicate authors (or references) are removed when generating the hyperedges. These five rows have the event type author.ref.paper, and their "event interval id" (which gives the information which rows belong together to define one event) is the id of the published paper (ending with 05e), concatenated with a colon (:) and the event type author.ref.paper.
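The row scheme described above can be sketched in a few lines of Python. This is an illustration only, not the code of AminerJSON2CSVDocType.java; the function name author_ref_paper_rows is hypothetical, and the sketch assumes that a paper has at least one author and at least one reference.

```python
def author_ref_paper_rows(year, paper_id, authors, references):
    """Return the CSV rows (as tuples) of one author.ref.paper hyperevent.

    Scheme: fix the first author and iterate over all references,
    then fix the first reference and iterate over all authors.
    Assumes non-empty lists of authors and references.
    """
    event_type = "author.ref.paper"
    event_id = f"{paper_id}:{event_type}"  # paper id, colon, event type
    pairs = [(authors[0], ref) for ref in references]
    pairs += [(author, references[0]) for author in authors]
    return [(year, event_id, src, tgt, event_type) for src, tgt in pairs]


authors = ["53f43dc9dabfaeecd69967bb", "53f45ed9dabfaee2a1d9289c"]
references = [
    "53e9a472b7602d9702d92e92",
    "53e9b188b7602d9703c0eeda",
    "53e9adc2b7602d97037c8a5f",
]
rows = author_ref_paper_rows(1948, "53e9aeb1b7602d97038d305e", authors, references)
for row in rows:
    print(",".join(str(field) for field in row))

# Collapsing the rows into sets removes the duplicated (author, reference)
# pair and recovers the two components of the hyperevent.
sources = {row[2] for row in rows}
targets = {row[3] for row in rows}
```

For the example paper this yields the five rows shown in the snippet, with the pair (first author, first reference) appearing twice.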

Events of type author.ref.paper are the ones whose rate will be estimated by the models described below. However, to define explanatory variables we represent events of four other types in the input file. In a nutshell, events of these four types give the information (1) which set of authors publishes which paper, (2) which paper cites which set of other papers, (3) which set of authors cites papers of which set of authors while publishing a paper, and (4) which paper cites which set of authors. These event types are illustrated in the following.

Events of type author encode authors-publish-paper hyperedges. These are directed hyperedges with a variable number of source nodes ("authors") and exactly one target node (the published paper). An example illustrating these events (using the same published paper as above) is given as follows.

Year,EventID,Source,Target,Type
...
1948,53e9aeb1b7602d97038d305e:author,53f43dc9dabfaeecd69967bb,53e9aeb1b7602d97038d305e,author
1948,53e9aeb1b7602d97038d305e:author,53f45ed9dabfaee2a1d9289c,53e9aeb1b7602d97038d305e,author
...

We see that two authors (whose ids end with 7bb or 89c, respectively) publish the paper whose id ends with 05e. The event type (last column) is author and the event interval ids are constructed by the same logic as before, with the difference that now the suffix of the event interval id is the event type author instead of author.ref.paper. Therefore, eventnet generates a different hyperevent from these author events than from the author.ref.paper events. In the illustrative configuration we model only hyperevents of type author.ref.paper - those of type author are used to construct some of the explanatory variables (hyperedge statistics).
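The role of the event interval id can be illustrated by grouping rows on the EventID column. The following sketch uses only the rows shown in this tutorial; the function name group_hyperevents is hypothetical, and the code mimics the grouping logic described here rather than eventnet's actual implementation.

```python
import csv
import io
from collections import defaultdict

SAMPLE = """Year,EventID,Source,Target,Type
1948,53e9aeb1b7602d97038d305e:author.ref.paper,53f43dc9dabfaeecd69967bb,53e9a472b7602d9702d92e92,author.ref.paper
1948,53e9aeb1b7602d97038d305e:author.ref.paper,53f43dc9dabfaeecd69967bb,53e9b188b7602d9703c0eeda,author.ref.paper
1948,53e9aeb1b7602d97038d305e:author.ref.paper,53f43dc9dabfaeecd69967bb,53e9adc2b7602d97037c8a5f,author.ref.paper
1948,53e9aeb1b7602d97038d305e:author.ref.paper,53f43dc9dabfaeecd69967bb,53e9a472b7602d9702d92e92,author.ref.paper
1948,53e9aeb1b7602d97038d305e:author.ref.paper,53f45ed9dabfaee2a1d9289c,53e9a472b7602d9702d92e92,author.ref.paper
1948,53e9aeb1b7602d97038d305e:author,53f43dc9dabfaeecd69967bb,53e9aeb1b7602d97038d305e,author
1948,53e9aeb1b7602d97038d305e:author,53f45ed9dabfaee2a1d9289c,53e9aeb1b7602d97038d305e,author
"""


def group_hyperevents(text):
    """Group CSV rows by EventID into (sources, targets) sets.

    Rows sharing an EventID form one hyperevent; using sets drops
    duplicate sources and targets.
    """
    events = defaultdict(lambda: {"sources": set(), "targets": set()})
    for row in csv.DictReader(io.StringIO(text)):
        event = events[row["EventID"]]
        event["type"] = row["Type"]
        event["sources"].add(row["Source"])
        event["targets"].add(row["Target"])
    return dict(events)


events = group_hyperevents(SAMPLE)
for event_id, event in events.items():
    print(event_id, len(event["sources"]), "sources,", len(event["targets"]), "targets")
```

Although all seven rows stem from the same paper, they fall into two hyperevents because the event interval ids differ: one with two sources and three targets (author.ref.paper) and one with two sources and a single target (author).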

Events of type paper.ref.paper encode paper-cites-papers hyperedges. These are directed hyperedges with exactly one source node (the published paper) and a variable number of target nodes (the references of the published paper). An example illustrating these events (using the same published paper as above) is given as follows.

Year,EventID,Source,Target,Type
...
1948,53e9aeb1b7602d97038d305e:paper.ref.paper,53e9aeb1b7602d97038d305e,53e9a472b7602d9702d92e92,paper.ref.paper
1948,53e9aeb1b7602d97038d305e:paper.ref.paper,53e9aeb1b7602d97038d305e,53e9b188b7602d9703c0eeda,paper.ref.paper
1948,53e9aeb1b7602d97038d305e:paper.ref.paper,53e9aeb1b7602d97038d305e,53e9adc2b7602d97037c8a5f,paper.ref.paper
...

We see that the paper whose id ends with 05e, published in 1948, cites the three papers whose ids end with e92, eda, and a5f, respectively. The event type and the event interval ids are constructed by the same logic as before.

Events of type author.ref.author encode authors-cite-papers-of-authors hyperedges. These are directed hyperedges with a varying number of source nodes (the authors of the published paper) and a variable number of target nodes (the union of the authors of the references of the published paper). An example illustrating these events (using the same published paper as above) is given as follows.

Year,EventID,Source,Target,Type
...
1948,53e9aeb1b7602d97038d305e:author.ref.author,53f43dc9dabfaeecd69967bb,53f477c4dabfaedf43689e53,author.ref.author
1948,53e9aeb1b7602d97038d305e:author.ref.author,53f43dc9dabfaeecd69967bb,53f45ed9dabfaee2a1d9289c,author.ref.author
1948,53e9aeb1b7602d97038d305e:author.ref.author,53f43dc9dabfaeecd69967bb,53f477c4dabfaedf43689e53,author.ref.author
1948,53e9aeb1b7602d97038d305e:author.ref.author,53f45ed9dabfaee2a1d9289c,53f477c4dabfaedf43689e53,author.ref.author
...

We see that the two authors (...7bb and ...89c) of the paper ...05e, published in 1948, cite papers that jointly have two different authors, ...e53 and ...89c. (Implicitly, we know that the three cited papers of ...05e together have these two authors. The respective publication events are given in other rows in the preprocessed file.) We see that while publishing ...05e, the author ...89c cites at least one of her own prior papers. The event type and the event interval ids are constructed by the same logic as before.
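How the target set of an author.ref.author event arises can be sketched as follows. The tutorial only states that the three cited papers jointly have the authors ...e53 and ...89c; the assignment of these authors to individual cited papers in the dictionary authors_of below is made up for illustration, and the function name cited_authors is hypothetical.

```python
def cited_authors(references, authors_of):
    """Union of the authors of all cited papers.

    authors_of maps a paper id to the set of its author ids; papers
    without known authors contribute nothing.
    """
    cited = set()
    for ref in references:
        cited |= authors_of.get(ref, set())
    return cited


# Authors and references of the example paper ...05e (abbreviated ids).
paper_authors = {"...7bb", "...89c"}
references = ["...e92", "...eda", "...a5f"]

# HYPOTHETICAL author lists of the cited papers; only their union
# (...e53 and ...89c) is given in the tutorial.
authors_of = {
    "...e92": {"...e53"},
    "...eda": {"...e53", "...89c"},
    "...a5f": {"...89c"},
}

targets = cited_authors(references, authors_of)
self_citing = paper_authors & targets  # authors citing their own prior work

print(sorted(targets))
print(sorted(self_citing))
```

An author who appears both among the sources and among the targets, here ...89c, cites at least one of her own prior papers.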

Finally, events of type paper.ref.author encode paper-cites-papers-of-authors hyperedges. These are directed hyperedges with a single source node (the published paper) and a variable number of target nodes (the union of the authors of the references of the published paper). An example illustrating these events (using the same published paper as above) is given as follows.

Year,EventID,Source,Target,Type
...
1948,53e9aeb1b7602d97038d305e:paper.ref.author,53e9aeb1b7602d97038d305e,53f477c4dabfaedf43689e53,paper.ref.author
1948,53e9aeb1b7602d97038d305e:paper.ref.author,53e9aeb1b7602d97038d305e,53f45ed9dabfaee2a1d9289c,paper.ref.author
...

The paper ...05e cites papers that jointly have two different authors, ...e53 and ...89c.

We recall that the entire preprocessed file for the Aminer citation network data, including all of the event types described above, can be generated by the auxiliary software for preprocessing (AminerJSON2CSVDocType.java). More information is given in the README file of this directory.

Specification of RHEM

The eventnet configuration file (aminer.config.joint.txt) illustrates how hyperedge statistics for various effects can be specified. The R script (aminer_model.R) shows how the resulting RHEM can be estimated. An in-depth explanation of these steps will follow in this tutorial.