Github Research - socrateslab/zh GitHub Wiki
Data download. Based on this naming scheme, one can construct the full set of time-indexed download links and fetch the corresponding data. The githubarchive site provides a Ruby-based way to download the data. Google BigQuery can also compute over this data quickly, but it limits both the amount of computation and the style of analysis, so it suits descriptive analysis rather than deeper data mining. Researchers familiar with Python are encouraged to download the data with a Python script; see the code by Mazieres here: https://github.com/mazieres/github_archive . Once the data is downloaded, it can be split and processed fairly freely.
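As a sketch of that link construction (the base URL and the hourly `YYYY-MM-DD-H.json.gz` naming scheme, with an unpadded hour, are assumptions based on the archive's documented layout):

```python
from datetime import datetime, timedelta

def archive_urls(start, end, base='http://data.githubarchive.org/'):
    """Yield one hourly GitHub Archive URL per hour in [start, end)."""
    t = start
    while t < end:
        # the hour field is not zero-padded in the archive's file names
        yield '%s%s-%d.json.gz' % (base, t.strftime('%Y-%m-%d'), t.hour)
        t += timedelta(hours=1)

urls = list(archive_urls(datetime(2012, 7, 1), datetime(2012, 7, 1, 3)))
# each URL can then be fetched, e.g. with urllib.request.urlretrieve
```

Each generated URL corresponds to one hourly gzipped file, which matches the per-file processing in the code below.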
Data storytelling using the github archive https://www.oreilly.com/learning/data-storytelling-using-the-github-archive
Githut http://githut.info/
Split the data by event type (types): for each type, extract the actor, the corresponding repo, and the timestamp. Events of each type are stored in their own folder, with one data file per day.
import gzip, json, os, glob

path = 'D:/chengjun/githubArchive/'

def saveData(ad):
    # read one hourly archive file; events are one JSON object per line
    with gzip.open(ad, 'rt') as f:
        lines = f.read().split('\n')
    for line in lines:
        try:
            line = json.loads(line)
            types = line['type']
            if 'repo' in line:  # new-format events
                repo = line['repo']['name']
                actor = line['actor']['login']
            else:               # old-format events
                repo = line['repository']
                repo = repo['owner'] + '/' + repo['name']
                actor = line['actor']
            time = line['created_at']
            date = time[0:10]                  # YYYY-MM-DD
            ts = time[11:19].replace(':', '')  # HHMMSS
            record = actor + "\t" + repo + "\t" + ts
            newpath = path + 'days/' + types + '/'
            if not os.path.exists(newpath):
                os.makedirs(newpath)
            with open(newpath + date, 'a') as p:
                p.write(record + "\n")
        except:
            pass

ads = glob.glob(path + "*")
ads = [f for f in ads if f[-2:] == 'gz']
# the slice positions depend on the length of `path`: they pick out the
# year and month from names like .../2012-07-01-0.json.gz
ads = [f for f in ads if f[41:45] == '2012' and int(f[46:48]) > 6]
for n, ad in enumerate(ads):
    print(n, ad)
    saveData(ad)
import gzip, json, re

def readData(gz_path):
    with gzip.open(gz_path, 'rt') as f:
        files = f.readlines()
    if len(files) > 1:
        # one JSON object per line (the 2011-style files)
        acts = [json.loads(subs) for subs in files]
    else:
        # the whole file is a single line of concatenated JSON objects
        f2 = files[0]
        r = re.split(r'(\{.*?\})(?= *\{)', f2)
        r = [i for i in r if i]  # drop the empty strings left by re.split
        accumulator = ''
        acts = []
        for subs in r:
            # nested braces can make the regex cut an object short, so keep
            # accumulating fragments until they parse as valid JSON
            accumulator += subs
            try:
                acts.append(json.loads(accumulator))
                accumulator = ''
            except Exception:
                pass
    return acts
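The regex trick above, splitting a single line of concatenated JSON objects, can be checked in isolation on a toy string (the sample objects are invented):

```python
import json, re

blob = '{"a": 1} {"b": 2}{"c": 3}'
parts = re.split(r'(\{.*?\})(?= *\{)', blob)
parts = [p for p in parts if p.strip()]  # drop empty/whitespace-only pieces
objs = [json.loads(p) for p in parts]
# objs == [{'a': 1}, {'b': 2}, {'c': 3}]
```

For objects with nested braces, a single fragment may not be valid JSON on its own, which is why readData re-joins fragments in an accumulator before parsing.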
Code for the preliminary data cleaning and extraction: File:github_lingfei_chengjun.pdf
def get_member(act):
    if 'repo' in act:  # new-format data
        date = str(act['created_at'].split('T')[0])
        try:
            author = act['payload']['member']['login']
        except:
            author = act['payload']['member']
        repo_id = int(act['repo']['id'])
        repo_name = act['repo']['name']
    elif 'repository' in act:  # old-format data
        date = str(act['created_at'].split('T')[0])
        author = act['payload']['member']['login']
        repo_id = int(act['repository']['id'])
        repo_name = act['repository']['owner'] + '/' + act['repository']['name']
    return date, author, repo_id, repo_name
After re-extraction, the largest team has 570 members.
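A minimal sketch of how a maximum team size can be recomputed from the extracted (date, author, repo_id, repo_name) tuples (the sample records are invented, and counting the founder as one extra member is an assumption):

```python
from collections import defaultdict

def max_team_size(records):
    """records: (date, author, repo_id, repo_name) tuples from MemberEvent."""
    members = defaultdict(set)
    for date, author, repo_id, repo_name in records:
        members[repo_id].add(author)
    # +1 for the founder, who never appears as an added member
    return max(len(m) + 1 for m in members.values())

records = [('2012-07-01', 'bob',   1, 'alice/demo'),
           ('2012-07-02', 'carol', 1, 'alice/demo'),
           ('2012-07-03', 'dave',  2, 'erin/tool')]
# max_team_size(records) == 3
```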
C. Cattuto, V. Loreto and V. D. P. Servedio. A Yule-Simon process with memory. Europhysics Letters, preprint. [1]
Brown, J. H., Gillooly, J. F., Allen, A. P., Savage, V. M., & West, G. B. (2004). Toward a metabolic theory of ecology. Ecology, 85(7), 1771-1789.[2]
In short, the metabolic rate was limited by the efficiency with which the organism could distribute resources to the cells. (John Holland, Complexity: A Very Short Introduction, p. 17)
The data processed first here is GitHub's WatchEvent data; because the 2012 JSON files lack newline separators, only the 2011 data is used here.
[Figure] On GitHub's Programming Languages
With the GitHub data, examine various universal patterns, following "Toward a metabolic theory of ecology". It would be best to start from a breakthrough point for a model: ideally, the various patterns, such as the growth-cycle curves of open-source projects, could all be derived from a preferential-return-style microscopic model.
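As a toy illustration of such a microscopic model, here is a plain preferential-attachment simulation (a simple Yule-style process, not the memory variant of Cattuto et al.; all parameters are invented):

```python
import random

def simulate(n_events, p_new=0.1, seed=42):
    """With probability p_new an event founds a new project; otherwise it
    goes to an existing project chosen proportionally to current size."""
    random.seed(seed)
    history = []   # one entry per past event: sampling uniformly from this
    sizes = []     # list is equivalent to size-proportional sampling
    for _ in range(n_events):
        if not sizes or random.random() < p_new:
            sizes.append(1)
            history.append(len(sizes) - 1)
        else:
            g = random.choice(history)
            sizes[g] += 1
            history.append(g)
    return sizes

sizes = simulate(10000)
# the resulting project-size distribution is heavy-tailed
```

The heavy tail emerges because larger projects attract events at a rate proportional to their size, the same rich-get-richer mechanism behind the Yule-Simon process.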
Clean the MemberEvent data to extract the records of members being added to a repo. Note that many teams never add a member and have only a founder; these teams are not covered by this statistic, which is why the head of the distribution is flat.
The team-size distribution in the figure below is wrong!
Klug M, Bagrow JP. 2016 Understanding the group dynamics and success of teams. R. Soc. open sci. 3: 160007. http://dx.doi.org/10.1098/rsos.160007[3]
- Science of science & team science -- why can small teams make breakthroughs?
Page editors form a group, and pages can cite each other.