Methods - softwaresaved/rse-repo-analysis GitHub Wiki

Our research was exploratory, and as such the collected data has some limitations, which we have laid out in the respective sections of the wiki. The scripts were developed to be modular and reusable so that they can be useful in further research. We have made every effort to keep the documentation up to date.

We first parsed publications available through ePrints to look for links to repositories. Then, we validated those links. As we were interested only in repositories created for research, we had to manually review the links and the respective publications to distinguish between repositories cited as related work, repositories used as tools, and repositories created for the publication.
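
The link search described above can be sketched as follows. This is a minimal illustration, assuming the publication text has already been extracted to a string and that the links of interest point to GitHub; the regular expression and the function name `extract_repo_ids` are hypothetical and are not taken from the project's scripts.

```python
import re

# Hypothetical pattern for github.com/<owner>/<repo> links (an assumption,
# not the expression used by the project's scripts).
GITHUB_LINK = re.compile(
    r"(?:https?://)?(?:www\.)?github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)",
    re.IGNORECASE,
)

def extract_repo_ids(text):
    """Return unique 'owner/repo' IDs in order of first appearance."""
    seen = []
    for owner, repo in GITHUB_LINK.findall(text):
        repo = repo.rstrip(".")      # drop sentence-ending full stops
        if repo.endswith(".git"):    # drop a trailing clone suffix
            repo = repo[:-4]
        if not repo:
            continue
        repo_id = f"{owner}/{repo}"
        if repo_id not in seen:
            seen.append(repo_id)
    return seen
```

A pattern match only shows that a string looks like a repository link; validating that the repository exists, and reviewing why it was cited, remain separate steps.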

We only reviewed links found on the first two pages of each publication and disregarded any that were found on later pages. This is because we found empirically that repositories created as part of a publication are often linked in the abstract or introduction, which places them at the front of a paper. The selection we have analysed is thus by no means exhaustive. We then only mined the GitHub repositories found on those two pages.
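
The page cut-off can be illustrated with a short sketch. It assumes each link has been recorded together with the 1-based page number on which it was found; the function name `links_on_first_pages` and the tuple layout are hypothetical.

```python
MAX_PAGE = 2  # only the first two pages were reviewed

def links_on_first_pages(links, max_page=MAX_PAGE):
    """Keep repository IDs found on pages 1..max_page.

    `links` is an iterable of (page_number, repo_id) tuples with
    1-based page numbers (an assumed layout, for illustration only).
    """
    return [repo_id for page, repo_id in links if page <= max_page]
```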

We manually examined all publications with links in the first two pages and produced two lists of repositories: false_positives.txt and true_positives.txt, where true_positives.txt lists the repositories that were indeed created for the publication. This resulted in a separate dataset, eprints_w_intent.csv, which lists the repository ID, citation intent and publication details for each of the repositories from the first two pages.

We produced timelined datasets and various plots, which we used to separate repositories into three categories: one person, high interest and inbetween. The wording and the thresholds should be reviewed, but the following definitions were used for our exploratory analysis:

  • One-person repositories are those where only one user interacts with the repository through issues and/or commits to the main branch. Other users might star or fork the repository, but they do not open issues or commit to the main branch, which would make them contributors. These repositories could be considered to have low engagement with the community, even though some of them have an impressive number of stars.
  • High-interest repositories are those where at least five distinct users interact through issues and/or commits to the main branch over the lifetime of the repository. These might be repositories that start out with a larger team, that have community members opening issues, or that grow their contributor team over time.
  • Inbetween repositories are all those that don't fall into either category.
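
The three definitions above can be expressed as a small classification rule. This is a sketch under stated assumptions, not the project's implementation: it assumes the users who interacted through issues and the users who committed to the main branch are available as two collections of usernames, and the function name `categorise` and the returned labels are hypothetical.

```python
HIGH_INTEREST_MIN_USERS = 5  # threshold used in the exploratory analysis

def categorise(issue_users, commit_users):
    """Classify a repository by its distinct interacting users.

    `issue_users` and `commit_users` are iterables of usernames
    (an assumed input format, for illustration only).
    """
    users = set(issue_users) | set(commit_users)
    if len(users) >= HIGH_INTEREST_MIN_USERS:
        return "high_interest"
    if len(users) == 1:
        return "one_person"
    # Everything else, including repositories with no recorded
    # interaction at all, falls into the remaining category.
    return "inbetween"
```

For example, `categorise(["alice"], ["alice"])` returns `"one_person"`, because the same user appears in both collections and is counted once.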

The mapping of repositories to those three categories is reported in one_person_repos.txt, high_interest_repos.txt and inbetween_repos.txt. We only considered repositories previously marked as true positives.