Springer journal paper: Leveraging modern big data stack for swift development of insights into social developments
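The paper argues that tracking social developments quickly and effectively has become a concern for many public administration bodies and businesses. The forms in which social developments are presented, such as metrics, charts, and text summaries, are increasingly generated in a data-driven way with the help of big data technology. The paper describes a flexible, scalable big data system framework designed by Zilian Tech. For two concrete scenarios of computing social development insights, it explains which evaluation criteria the developers adopted, how they selected suitable big data technologies, and how they built targeted data applications within this framework to meet the needs of each scenario. The two big data computing scenarios covered in the paper are:
- Multi-dimensional data analysis over 5.2 million global paper records from the past 15 years. Zilian Tech's big data system reduced the time needed for the vast majority of these analyses to the order of minutes.
- A domain-specific interactive data query engine. The engine builds a cross-search over 36 million companies registered in China and 28 million patent records, and serves users through a web interface with millisecond-level response times.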
Insights into social development, presented in forms such as metrics, figures, and text summaries, aim to summarize, explain, and predict the situations and trends of society. They are extremely useful in guiding organizations and individuals to better realize their own objectives in accordance with society as a whole. Deriving these insights accurately and swiftly has become of interest to a range of organizations. Agencies governing districts, cities, or even a whole country use these insights to inform policy-making. Business investors peek into statistical numbers to estimate current economic situations and future trends. Even individuals can look at some of these insights to better align themselves with macroscopic social trends.

There are many challenges in developing these insights with a data-driven approach. First, the required data come from a large number of heterogeneous sources in a variety of formats. A single source's data can range from hundreds of gigabytes to several terabytes, so ingesting and governing such a huge amount of data is no small challenge. Second, many complex insights are derived by human domain experts in a trial-and-error fashion, interacting with data with the aid of computer algorithms. Quickly experimenting with various algorithms calls for software capabilities that bring human experts and machine intelligence together, which is challenging but critical for success.

At Zilian Tech [20], we address some of the challenges of bringing data, computer algorithms, and humans together by designing and implementing a flexible big data stack that can bring in a variety of data components. In this paper we present the architecture of our data stack and articulate some of the important technical choices made when building such a stack.
The stack is designed with scalable storage that can scale up to petabytes, as well as an elastic distributed compute engine with parallel computing algorithms. With these features, the data stack enables a) swift data analysis, in which human analysts interact with data and machine algorithms via software support, with on-demand question answering time reduced from days to minutes; and b) agile building of data products for end users to interact with, in weeks if not days, rather than months.
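
To make the idea of parallel, multi-dimensional analysis more concrete, here is a minimal sketch of a partitioned group-and-count over paper records. It is not the paper's implementation: the record fields (`year`, `country`, `field`), the toy data, and the function names are all hypothetical, and a local thread pool stands in for the distributed compute engine.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy records; the field names are assumptions, not the paper's schema.
PAPERS = [
    {"year": 2015, "country": "CN", "field": "AI"},
    {"year": 2015, "country": "US", "field": "AI"},
    {"year": 2016, "country": "CN", "field": "Biology"},
    {"year": 2016, "country": "CN", "field": "AI"},
    {"year": 2016, "country": "US", "field": "AI"},
]


def partition(records, n_parts):
    """Split records into n_parts slices in round-robin order."""
    return [records[i::n_parts] for i in range(n_parts)]


def count_partition(records, dims):
    """Map step: count the records of one partition, grouped by the chosen dimensions."""
    return Counter(tuple(r[d] for d in dims) for r in records)


def parallel_group_count(records, dims, n_parts=2):
    """Run the map step on each partition in parallel, then merge the partial counts."""
    parts = partition(records, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(lambda p: count_partition(p, dims), parts))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Reduce step: add counts key by key.
    return total


# The same routine answers questions along different dimensions.
by_year_country = parallel_group_count(PAPERS, ("year", "country"))
by_field = parallel_group_count(PAPERS, ("field",))
```

Because each partition is counted independently and the partial results are merged by simple addition, the same pattern could in principle be spread across many machines, which is the property a distributed engine relies on to bring analysis times down.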