How to run - ALPFA/RepositorySearch GitHub Wiki

This wiki page explains the sequence in which the code has to be executed.

The first step in fetching Java repositories is to build a list of projects that are possible subjects for the ALPFA algorithm to test. This step is done as follows.

1. Fetch initial project list - using bat/sh files

What to do:

Execute the script file located in the /resources directory

Description:

  1. Decide what type of projects you are looking for -> in our case we are interested in projects that have testing, syntax errors, or stand-alone applications as keywords, and whose primary language is Java.
  2. Decide what projects you don't want to look at -> in our case we don't want projects related to JavaScript, web-app development, Android, iOS apps, etc.
  3. Using the GitHub Search API, write a script that fetches the names of the repositories we are interested in.
  4. The Unix and Windows versions of the script can be found in the /resources directory.
  5. The GitHub API limits us to fetching only the first 1000 results. GitHub support replied that they can't provide a further increase in the rate limit for the Search API.
  6. Further details of the Search API can be found in GitHub's Search API documentation.
  7. Please edit these script files to provide a directory location that is meaningful on your local machine. On my machine these search results are placed in /home/theja/ALPFA/SearchResults.
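The steps above can be sketched as a small shell script. This is a hypothetical outline, not the actual /resources script: the exact query terms, the output path, and the use of `echo` to print the curl commands (rather than run them) are all assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of the /resources search script: build the
# Search API URLs for our query and print the curl commands that
# would fetch each page of results.
OUTDIR="$HOME/ALPFA/SearchResults"   # edit this path for your machine

# Keywords we want plus language:java; unwanted terms are excluded
# with NOT qualifiers (the exact query terms here are assumptions).
QUERY="testing+NOT+javascript+NOT+android+NOT+ios+language:java"

# The Search API returns at most 1000 results: 10 pages of 100 items.
PAGE=1
while [ "$PAGE" -le 10 ]; do
  echo "curl -s 'https://api.github.com/search/repositories?q=${QUERY}&per_page=100&page=${PAGE}' -o '${OUTDIR}/page_${PAGE}.json'"
  PAGE=$((PAGE + 1))
done
```

Piping the printed commands through `sh` would perform the actual download into the chosen results directory.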

2. Generate Project Metadata

What to do:

Execute GetProjectMetadata.java

Description:

The goal of this step is to collect project metadata. The previous step gave us the names of projects that are possible test subjects. Using those files we extract the following metadata:

  1. Get release details of projects which have more than 2 versions
  2. Get IssueURLs of these projects and store them in an independent file

The above functionality is achieved by running the GetProjectMetadata.java file from the src directory.

Details for each project version are stored in a separate directory named after the project, as shown below.

*(image: version details)*

In addition to this, 2 JSON files are generated:

  1. IssueURLs.JSON: This file stores the URLs of the bug trackers of the selected subjects
  2. NameAndIds.JSON: This file stores a map of project names to their IDs, which will help us in the next steps

By the end of this step we have the release details of the selected projects and the bug-tracker URLs of the selected subjects.
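A NameAndIds.JSON entry might look like the following. This is a hypothetical fragment to show the shape of the name-to-ID map; the field layout and the sample project are illustrative, not taken from the real output file.

```json
{
  "AxonFramework/AxonFramework": "2899580"
}
```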

3. Generate bug metadata for all the subjects

What to do:

Execute GetIssueDetailsForProjects.java

Description:

This step is not run as part of the previous step because of the API rate limit. In this step we generate bug-related metadata for the projects. The issue metadata is in the following format:

```json
{
  "Project Name": "AxonFramework",
  "Closed At": "2012-10-11T11:38:24Z",
  "Comments_content": ["Hi Andy,\r\n\r\nthanks for this one!\r\n\r\nAllard"],
  "Issue Number": "73",
  "Comments": "1",
  "Project ID": "2899580",
  "Title": "Better index use in MongoSagaRepository",
  "Body": "Hi.I was having a problem with slow-down when using the MongoSagaRepository",
  "Created At": "2012-10-11T08:37:09Z"
}
```

Each metadata object for an issue is in the above format. If there are 100 closed bugs in a project, there will be 100 such objects in that project's directory. Each issue object is saved as bugNumber.json.

*(image: Issues_metadata location)*
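The per-issue fetch behind GetIssueDetailsForProjects.java can be sketched as below. The repository and issue number are illustrative (taken from the sample object above), and printing the curl command instead of running it is a simplification.

```shell
#!/bin/sh
# Hypothetical sketch of fetching one issue object; the real program
# iterates over every closed issue of every selected project.
REPO="AxonFramework/AxonFramework"
ISSUE=73
URL="https://api.github.com/repos/${REPO}/issues/${ISSUE}"
OUTFILE="${ISSUE}.json"   # each issue object is saved as bugNumber.json
echo "curl -s '${URL}' -o '${OUTFILE}'"
```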

API Issues:

This program will use the entire API rate limit assigned to us. It therefore checks the remaining capacity and, depending on the remaining quota, sleeps for an hour before continuing once the full quota of 5000 requests is available again. In a few trials on my machine, the average run time to fetch all these objects was 3 hours 30 minutes, of which 3 hours was sleep time.
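The quota check can be sketched as follows. The function name and the 100-request safety threshold are assumptions for illustration; the real check lives inside the Java program and uses GitHub's /rate_limit endpoint.

```shell
#!/bin/sh
# Hypothetical sketch of the rate-limit check: decide whether to sleep
# based on how many requests remain in the current quota window.
should_sleep() {
  # $1 = "remaining" field from GET https://api.github.com/rate_limit
  [ "$1" -lt 100 ]
}

# In the real program the remaining count comes from the API, e.g.
#   REMAINING=$(curl -s https://api.github.com/rate_limit | ...)
REMAINING=42
if should_sleep "$REMAINING"; then
  WAIT=3600   # sleep for an hour until the 5000-request quota resets
else
  WAIT=0
fi
```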

Update: For 3 days from 6 April 2015 we have an increased rate limit of about 12500 requests/hour.

Executing this program in parallel from multiple machines with multiple client_ids is one option that will be looked into in the future.

4. Download zip/tar files of source code

What to do:

Execute DownloadTarBalls.java (for linux) or DownloadZipBalls.java (for windows)

Description:

This program downloads compressed archives of the source code for our subjects. It downloads the source code of all versions of each project.

*(image: downloaded source code)*

This program took around 1 hour 30 minutes to download all the compressed archives on my machine. The compressed files for all projects take up around 9 GB.
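One download performed by DownloadTarBalls.java can be sketched as below. The repository and tag names are illustrative, not taken from the real subject list, and the curl command is printed rather than executed.

```shell
#!/bin/sh
# Hypothetical sketch of downloading one version's tarball via the
# standard GitHub tarball endpoint; -L follows the redirect GitHub
# returns for archive downloads.
REPO="AxonFramework/AxonFramework"
TAG="axon-2.0.1"   # one version taken from the release details of step 2
URL="https://api.github.com/repos/${REPO}/tarball/${TAG}"
echo "curl -L -s '${URL}' -o '${TAG}.tar.gz'"
```

DownloadZipBalls.java would use the corresponding zipball endpoint for Windows.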

With these initial search criteria we have around 143 projects as our test subjects. To further improve the results at this level, modify the search query in the shell script used in step 1.