Github Research - socrateslab/zh GitHub Wiki
Data download. Based on this naming scheme, one can construct the full set of time-indexed download links and fetch the corresponding data. The githubarchive site provides a Ruby-based way to download the data. Google BigQuery can also compute over this data quickly, but it limits both the amount of computation and the style of analysis, so it suits descriptive analysis rather than deeper data mining. Researchers familiar with Python are encouraged to download the data with a Python script; see the code by Mazieres here: https://github.com/mazieres/github_archive . Once the data is downloaded, it can be split and processed fairly freely.
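As a sketch of that link construction (the base URL and the hourly `YYYY-MM-DD-H.json.gz` naming scheme, with an unpadded hour, are assumptions based on the archive's documented layout):

```python
from datetime import datetime, timedelta

def archive_urls(start, end, base='http://data.githubarchive.org/'):
    """Yield one hourly GitHub Archive URL per hour in [start, end)."""
    t = start
    while t < end:
        # the hour field is not zero-padded in the archive's file names
        yield '%s%s-%d.json.gz' % (base, t.strftime('%Y-%m-%d'), t.hour)
        t += timedelta(hours=1)

urls = list(archive_urls(datetime(2012, 7, 1), datetime(2012, 7, 1, 3)))
# each URL can then be fetched, e.g. with urllib.request.urlretrieve
```

Each generated URL corresponds to one hourly gzipped file, which matches the per-file processing in the code below.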
Data storytelling using the github archive https://www.oreilly.com/learning/data-storytelling-using-the-github-archive
Githut http://githut.info/
Split the data by event type (types): for each type, extract the actor, the corresponding repo, and the timestamp. Events of each type are stored in their own folder, with one data file per day.
import gzip, json, os, glob

path = 'D:/chengjun/githubArchive/'

def saveData(ad):
    # read one hourly archive file; events are one JSON object per line
    with gzip.open(ad, 'rt') as f:
        lines = f.read().split('\n')
    for line in lines:
        try:
            line = json.loads(line)
            types = line['type']
            if 'repo' in line:  # new-format events
                repo = line['repo']['name']
                actor = line['actor']['login']
            else:               # old-format events
                repo = line['repository']
                repo = repo['owner'] + '/' + repo['name']
                actor = line['actor']
            time = line['created_at']
            date = time[0:10]                  # YYYY-MM-DD
            ts = time[11:19].replace(':', '')  # HHMMSS
            record = actor + "\t" + repo + "\t" + ts
            newpath = path + 'days/' + types + '/'
            if not os.path.exists(newpath):
                os.makedirs(newpath)
            with open(newpath + date, 'a') as p:
                p.write(record + "\n")
        except:
            pass

ads = glob.glob(path + "*")
ads = [f for f in ads if f[-2:] == 'gz']
# the slice positions depend on the length of `path`: they pick out the
# year and month from names like .../2012-07-01-0.json.gz
ads = [f for f in ads if f[41:45] == '2012' and int(f[46:48]) > 6]
for n, ad in enumerate(ads):
    print(n, ad)
    saveData(ad)
import gzip, json, re

def readData(gz_path):
    with gzip.open(gz_path, 'rt') as f:
        files = f.readlines()
    if len(files) > 1:
        # one JSON object per line (the 2011-style files)
        acts = [json.loads(subs) for subs in files]
    else:
        # the whole file is a single line of concatenated JSON objects
        f2 = files[0]
        r = re.split(r'(\{.*?\})(?= *\{)', f2)
        r = [i for i in r if i]  # drop the empty strings left by re.split
        accumulator = ''
        acts = []
        for subs in r:
            # nested braces can make the regex cut an object short, so keep
            # accumulating fragments until they parse as valid JSON
            accumulator += subs
            try:
                acts.append(json.loads(accumulator))
                accumulator = ''
            except Exception:
                pass
    return acts
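The regex trick above, splitting a single line of concatenated JSON objects, can be checked in isolation on a toy string (the sample objects are invented):

```python
import json, re

blob = '{"a": 1} {"b": 2}{"c": 3}'
parts = re.split(r'(\{.*?\})(?= *\{)', blob)
parts = [p for p in parts if p.strip()]  # drop empty/whitespace-only pieces
objs = [json.loads(p) for p in parts]
# objs == [{'a': 1}, {'b': 2}, {'c': 3}]
```

For objects with nested braces, a single fragment may not be valid JSON on its own, which is why readData re-joins fragments in an accumulator before parsing.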
Code for the preliminary data cleaning and extraction: File:github_lingfei_chengjun.pdf
def get_member(act):
    if 'repo' in act:  # new-format data
        date = str(act['created_at'].split('T')[0])
        try:
            author = act['payload']['member']['login']
        except:
            author = act['payload']['member']
        repo_id = int(act['repo']['id'])
        repo_name = act['repo']['name']
    elif 'repository' in act:  # old-format data
        date = str(act['created_at'].split('T')[0])
        author = act['payload']['member']['login']
        repo_id = int(act['repository']['id'])
        repo_name = act['repository']['owner'] + '/' + act['repository']['name']
    return date, author, repo_id, repo_name
After re-extraction, the largest team has 570 members.
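A minimal sketch of how a maximum team size can be recomputed from the extracted (date, author, repo_id, repo_name) tuples (the sample records are invented, and counting the founder as one extra member is an assumption):

```python
from collections import defaultdict

def max_team_size(records):
    """records: (date, author, repo_id, repo_name) tuples from MemberEvent."""
    members = defaultdict(set)
    for date, author, repo_id, repo_name in records:
        members[repo_id].add(author)
    # +1 for the founder, who never appears as an added member
    return max(len(m) + 1 for m in members.values())

records = [('2012-07-01', 'bob',   1, 'alice/demo'),
           ('2012-07-02', 'carol', 1, 'alice/demo'),
           ('2012-07-03', 'dave',  2, 'erin/tool')]
# max_team_size(records) == 3
```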
C. Cattuto, V. Loreto and V. D. P. Servedio. A Yule-Simon process with memory. Europhysics Letters, preprint. [1]
Brown, J. H., Gillooly, J. F., Allen, A. P., Savage, V. M., & West, G. B. (2004). Toward a metabolic theory of ecology. Ecology, 85(7), 1771-1789.[2]
In short, the metabolic rate was limited by the efficiency with which the organism could distribute resources to the cells. (John Holland, Complexity: A Very Short Introduction, p. 17)
The data processed first here is GitHub's WatchEvent data; because the 2012 JSON files lack newline separators, only the 2011 data is used here.
[Figure] On GitHub's Programming Languages
With the GitHub data, examine various universal patterns, following "Toward a metabolic theory of ecology". It would be best to start from a breakthrough point for a model: ideally, the various patterns, such as the growth-cycle curves of open-source projects, could all be derived from a preferential-return-style microscopic model.
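As a toy illustration of such a microscopic model, here is a plain preferential-attachment simulation (a simple Yule-style process, not the memory variant of Cattuto et al.; all parameters are invented):

```python
import random

def simulate(n_events, p_new=0.1, seed=42):
    """With probability p_new an event founds a new project; otherwise it
    goes to an existing project chosen proportionally to current size."""
    random.seed(seed)
    history = []   # one entry per past event: sampling uniformly from this
    sizes = []     # list is equivalent to size-proportional sampling
    for _ in range(n_events):
        if not sizes or random.random() < p_new:
            sizes.append(1)
            history.append(len(sizes) - 1)
        else:
            g = random.choice(history)
            sizes[g] += 1
            history.append(g)
    return sizes

sizes = simulate(10000)
# the resulting project-size distribution is heavy-tailed
```

The heavy tail emerges because larger projects attract events at a rate proportional to their size, the same rich-get-richer mechanism behind the Yule-Simon process.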
Clean the MemberEvent data to extract the records of members being added to a repo. Note that many teams never add a member and have only a founder; these teams are not covered by this statistic, which is why the head of the distribution is flat.
The team-size distribution in the figure below is wrong!
Klug M, Bagrow JP. 2016 Understanding the group dynamics and success of teams. R. Soc. open sci. 3: 160007. http://dx.doi.org/10.1098/rsos.160007[3]
- Science of science & team science -- why can small teams make breakthroughs?
Page editors form a group, and pages can cite each other.