Dataset "SourceForge Research Data Archive" - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

Dataset "SourceForge Research Data Archive"

  1. @aserhany
  2. @chaoran-chen
  3. @ashwarypande

Summary

SourceForge.net was the first provider of a centralized location where free and open-source software developers can control and manage software development, and it offers this service free of charge. The platform is comparable to GitHub or Bitbucket. For research purposes, it provides a data archive that contains the tables and attributes described in this ER-diagram. The data can be queried through a web-based form that accepts SQL queries. A database schema is also available.

Prediction Goals

  • Predict which projects are likely to become inactive in the next year
  • Predict the download/view count for a project
  • Predict consumer satisfaction from its correlation with other attributes
  • Find the most urgent bug to fix
  • Compute a value/user indicator that describes how much value a user brought to the project (this can be used to identify and remove unproductive contributors or contributors who uploaded bad code)
  • Suggest the best-fitting person for a job based on skills and past project experience
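
The first goal needs a target label before any model can be trained. The following is a minimal sketch of one possible labelling rule, assuming that "inactive" means no recorded activity in the `horizon` monthly snapshots after a cut-off month. The function name, the input layout (one activity count per monthly snapshot) and the toy numbers are all hypothetical and do not come from the archive.

```python
def label_inactive(monthly_activity, cutoff, horizon=12):
    """Label each project as inactive (True) or active (False).

    monthly_activity maps a project id to a list of activity counts,
    one entry per monthly snapshot (hypothetical layout).
    A project is labelled inactive if every count in the `horizon`
    snapshots starting at index `cutoff` is zero.
    """
    labels = {}
    for project, counts in monthly_activity.items():
        future = counts[cutoff:cutoff + horizon]
        if len(future) < horizon:
            # Not enough later snapshots to decide; leave unlabelled.
            continue
        labels[project] = sum(future) == 0
    return labels


# Toy data, invented for illustration; a short horizon keeps it readable.
activity = {
    "a": [5, 2, 0, 0, 0],
    "b": [1, 0, 0, 3, 0],
    "c": [4, 1, 0],
}
labels = label_inactive(activity, cutoff=2, horizon=3)
print(labels)
```

Projects without enough later snapshots are skipped rather than guessed, so the label set only contains cases where the outcome is actually known.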

Long Description

###SourceForge.net

SourceForge.net is the world's largest Open Source software development web site, with the largest repository of Open Source code and applications available on the Internet. Owned and operated by Dice, SourceForge.net provides free services to Open Source developers. The SourceForge.net web site is database driven and the supporting database includes historic and status statistics on over 320,000 projects, over 850,000 developers' activities, and over 3.4 million registered users' activities at the project management web site. Dice has shared certain SourceForge.net data with the University of Notre Dame for the sole purpose of supporting academic and scholarly research on the Free/Open Source Software phenomenon. Dice has given Notre Dame permission to in turn share this data with other academic researchers studying the Free/Open Source Software phenomenon. More information about SourceForge can be found here.

Description of Data available

SourceForge.net uses relational databases to store project management activity and statistics. There are over 100 relations (tables) in the data dumps provided to Notre Dame. Some of the data have been removed for security and privacy reasons: SourceForge.net cleanses the data of personal information and strips out all Dice-specific and site-functionality-specific information. On a monthly basis, a complete dump of the databases (minus the data dropped for privacy and security reasons) is shared with Notre Dame. The Notre Dame researchers have built a data warehouse composed of these monthly dumps, with each dump stored in a separate schema. Thus, each monthly dump is a snapshot of the status of all the SourceForge.net projects at that point in time. As of November 2005, the data warehouse was almost 300 GBytes in size and was growing at about 25 GBytes per month. Much of the data is duplicated among the monthly dumps, but trends or changes in project activity and structure can be discovered by comparing data from different dumps. Queries across the monthly schemas may be used to discover when changes took place, to estimate trends in project activity and participation, or even to establish that no activity, events or changes have taken place. To help researchers determine what data is available, an ER-diagram and the definitions of tables and views in the data warehouse are provided.

For each month, the data warehouse includes three major parts.

  • The tables supporting the SourceForge.net web site, for example the tables user, group, etc.
  • The tables used to store the statistics of the whole community, including daily page accesses, downloads, etc.
  • The tables holding history information about the other tables.
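
Because each monthly dump lives in its own schema, a change can be detected by joining the same table across two schemas. The sketch below emulates this locally with SQLite, where attached in-memory databases stand in for the monthly schemas. The schema names `sf1005` and `sf1105`, the table `user_group` with columns `user_id` and `group_id`, and all rows are assumptions made for illustration; the real names should be taken from the ER-diagram and table definitions.

```python
import sqlite3

# Hypothetical schema names, one per monthly dump.
OLD_SCHEMA, NEW_SCHEMA = "sf1005", "sf1105"

conn = sqlite3.connect(":memory:")
for schema in (OLD_SCHEMA, NEW_SCHEMA):
    # Attached in-memory databases stand in for the warehouse schemas.
    conn.execute(f"ATTACH DATABASE ':memory:' AS {schema}")
    conn.execute(
        f"CREATE TABLE {schema}.user_group (user_id INTEGER, group_id INTEGER)"
    )

# Toy (developer, project) membership rows, invented for illustration.
conn.executemany(
    "INSERT INTO sf1005.user_group VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 20)],
)
conn.executemany(
    "INSERT INTO sf1105.user_group VALUES (?, ?)",
    [(1, 10), (2, 10), (4, 10), (3, 20)],
)

# Projects whose developer count differs between the two snapshots.
# Only projects present in the newer snapshot are reported.
QUERY = f"""
SELECT n.group_id, COALESCE(o.n_devs, 0) AS devs_old, n.n_devs AS devs_new
FROM (SELECT group_id, COUNT(DISTINCT user_id) AS n_devs
      FROM {NEW_SCHEMA}.user_group GROUP BY group_id) AS n
LEFT JOIN (SELECT group_id, COUNT(DISTINCT user_id) AS n_devs
           FROM {OLD_SCHEMA}.user_group GROUP BY group_id) AS o
  ON o.group_id = n.group_id
WHERE COALESCE(o.n_devs, 0) <> n.n_devs
ORDER BY n.group_id
"""
changes = conn.execute(QUERY).fetchall()
print(changes)  # rows of (group_id, devs_old, devs_new)
```

An empty result is informative as well: it indicates that no membership change took place between the two snapshots for any project in the newer one.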

Types of data that can be extracted

The following are types of data that have been extracted from the SourceForge.net Research Data Archive:

  • Project sizes over time (number of developers as a function of time presented as a frequency distribution)
  • Development participation on projects (number of projects individual developers participate on presented as a frequency distribution)
  • The above two items are used to create a "collaboration social-network"
  • The above two items were used to discover scale-free distributions among developer activity and project activity
  • The extended-community size around each project including project developers plus registered members who participated in any way on a project (discussion forum posting, bug report, patch submission, etc.)
  • Date of project creation (at SourceForge.net)
  • Date of first software release for a project
  • SourceForge.net ranking of projects at various times
  • Activity statistics on projects at various times
  • Number of projects in various software categories, e.g., games, communications, database, security, etc.

Since all of the archived data is stored in a relational database, data to support F/OSS investigations will be extracted using SQL queries against the data warehouse.
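
As an illustration of such a query, the first two items in the list above (project sizes and developer participation, each as a frequency distribution) can be computed with a nested `GROUP BY`. This is a minimal sketch run against a local SQLite stand-in. The table `user_group` and its columns `user_id` and `group_id` are assumed names, and the rows are invented, so the query would need adapting to the actual table definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed membership table: one row per (developer, project) pair.
conn.execute("CREATE TABLE user_group (user_id INTEGER, group_id INTEGER)")
conn.executemany(
    "INSERT INTO user_group VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 10), (1, 20), (4, 30), (2, 30), (5, 40)],
)


def size_distribution(conn, key):
    """Return {size: frequency} when rows are grouped by `key`.

    key="group_id" gives developers per project;
    key="user_id" gives projects per developer.
    """
    if key not in ("group_id", "user_id"):
        raise ValueError("key must be 'group_id' or 'user_id'")
    counted = "user_id" if key == "group_id" else "group_id"
    rows = conn.execute(
        f"""
        SELECT size, COUNT(*) AS frequency
        FROM (SELECT COUNT(DISTINCT {counted}) AS size
              FROM user_group GROUP BY {key}) AS t
        GROUP BY size
        ORDER BY size
        """
    ).fetchall()
    return dict(rows)


project_sizes = size_distribution(conn, "group_id")
projects_per_developer = size_distribution(conn, "user_id")
print(project_sizes)
print(projects_per_developer)
```

Running the same function against each monthly schema in turn would give project sizes as a function of time.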

More information about the schema can be found here.

How to get the research data:

  1. Submit the questionnaire and signed agreement
  2. Study the "live" SourceForge.net site to understand the context of the data collection.
  3. Review the ER-diagram and the Table Definitions to identify research data of interest
  4. The researcher will be sent a userid and password that will provide access to a web-based form that will permit direct SQL queries against the data archive
  5. Submit email requests for support to [email protected]

##Links