Darwinizing biodiversity data in R - rstats-gsoc/gsoc2018 GitHub Wiki

Background

“Darwin Core (DwC) is a standard maintained by the Darwin Core maintenance group. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing identifiers, labels, and definitions. Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information.” (Darwin Core website)

DwC is an evolving, community-developed biodiversity data standard (Wieczorek et al. 2012). In simple terms, it is a list of roughly 200 common biodiversity data terms and their definitions; for more details, see the DwC quick reference guide. Today, hundreds of millions of biodiversity records from around the world are published in DwC format and aggregated into various portals (e.g. GBIF, VertNet, iDigBio). Nonetheless, data publishers still struggle with the essential step of mapping the fields in their data to Darwin Core terms (Wieczorek et al. 2017a). Doing so requires a good understanding of both the data set and Darwin Core (Wieczorek et al. 2017b).

Related work

The remarkable Kurator project creates biodiversity data quality workflows, and its web interface (Kurator-Web) makes data quality control highly accessible. Thanks to this invaluable project, we have easy access to several lookup tables that aggregate rare and highly valuable data on DwC vocabularies. For example, the Darwin Cloud table accumulates knowledge about variations in DwC field names. Fully utilizing these data in the R environment can significantly enhance our ability to address biodiversity data quality issues.

Details of your coding project

Darwinizer workflow in R:

While DwC has been adopted by most biodiversity data publishers, its implementation is somewhat incomplete. Imposing a controlled vocabulary on millions of records is a complex and daunting task. For example, field names are inconsistent between data publishers. The CSV File Darwinizer Kurator workflow standardizes field names to the DwC standard names using the Darwin Cloud lookup file. By reproducing this workflow in R, we can easily ingest a wider range of data from different publishers. This module needs to handle the various data files downloaded from different biodiversity portals.
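
As a rough illustration of the lookup-based renaming described above, the sketch below maps publisher field names to DwC terms. It is written in Python with the standard library only, to keep the example self-contained; the project itself targets R. The lookup entries, the `normalize` helper, and the function name `darwinize_names` are hypothetical and merely stand in for the Darwin Cloud table, whose actual contents and format are not reproduced here.

```python
def normalize(name):
    """Lower-case a field name and drop separators so variants compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


# Hypothetical excerpt of a lookup table: variant field name -> DwC term.
# These rows are illustrative only and are not taken from Darwin Cloud.
LOOKUP = {
    "lat": "decimalLatitude",
    "latitude": "decimalLatitude",
    "long": "decimalLongitude",
    "longitude": "decimalLongitude",
    "species": "scientificName",
    "collector": "recordedBy",
}

# Index the variants by their normalized form; DwC terms map to themselves,
# so names that are already standard (up to case/separators) are preserved.
INDEX = {normalize(variant): term for variant, term in LOOKUP.items()}
INDEX.update({normalize(term): term for term in set(LOOKUP.values())})


def darwinize_names(field_names):
    """Return (renamed, unmatched): new names in order, plus names left as-is."""
    renamed, unmatched = [], []
    for name in field_names:
        term = INDEX.get(normalize(name))
        if term is None:
            unmatched.append(name)  # report it instead of guessing a mapping
            renamed.append(name)
        else:
            renamed.append(term)
    return renamed, unmatched


renamed, unmatched = darwinize_names(
    ["Latitude", "LONG", "scientific_name", "habitat_notes"]
)
print(renamed)
print(unmatched)
```

Returning the unmatched names separately is a design choice for this sketch: fields the lookup does not cover are left untouched and reported, so a user can review them rather than have them silently renamed. A fuller version would also need to handle two source fields that map to the same DwC term.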

Data checks:

Data checks must be specifically tailored to the structure of the data, in our case the DwC standard. Under this module we will address three major collections of data checks:

  • TDWG core suite of tests and assertions: Tests and rules that generate assertions at the record level are more fundamental than the tools or workflows built on them. Ideally, this core suite of data quality checks should be embraced by all data publishers as a standard, and hopefully, in the long term, it will be. In the short term, since implementing many of these tests in R is feasible, we plan to do so. Furthermore, embracing this standard will improve our ability to manage data checks properly. Various data checks were developed in last year's GSoC projects by Ashwin and Thiloshon, but some adjustment and further development are still required.
  • Imposing controlled vocabulary on key data fields: Using Kurator’s vocabulary data, different DwC standardization procedures can be addressed. The challenge will be to assess and prioritize the development of these procedures.
  • New frontiers: Enriching DwC data (i.e. accurately joining external data) can greatly boost the capacity and diversity of data checks. For example, joining species trait data, or retrieving climatic data for each record, opens up a variety of new checks. Here we need to screen for robust data enrichment procedures in R and design exciting data checks around them.
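
To make the idea of record-level tests that generate assertions more concrete, here is a minimal Python sketch (standard library only; the project itself targets R). The check names, the small `basisOfRecord` vocabulary, and the registry structure are all assumptions chosen for illustration; they are not the TDWG test definitions or Kurator's vocabulary data.

```python
# Hypothetical controlled vocabulary for basisOfRecord (illustrative subset).
BASIS_OF_RECORD = {"HumanObservation", "PreservedSpecimen", "MachineObservation"}

# Registry: check name -> function(record) -> bool (True means the record passes).
CHECKS = {}


def check(name):
    """Decorator that registers a record-level check under a name."""
    def register(func):
        CHECKS[name] = func
        return func
    return register


def to_float(value):
    """Parse a number, returning None for missing or malformed values."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


@check("latitude_in_range")
def latitude_in_range(record):
    lat = to_float(record.get("decimalLatitude"))
    return lat is not None and -90 <= lat <= 90


@check("basis_of_record_in_vocabulary")
def basis_in_vocabulary(record):
    return record.get("basisOfRecord") in BASIS_OF_RECORD


def run_checks(record):
    """Return one assertion per registered check: name -> 'pass' or 'fail'."""
    return {
        name: ("pass" if func(record) else "fail")
        for name, func in CHECKS.items()
    }


record = {"decimalLatitude": "95.2", "basisOfRecord": "PreservedSpecimen"}
assertions = run_checks(record)
print(assertions)
```

Each check only inspects a single record and reports an outcome without modifying the data, so the decision about what to do with a failing record (flag, fix, or drop) stays with the user.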

R based Kurator actors:

Following communication with the Kurator development team, we will explore the development of R actors (functions) that, hopefully, can be seamlessly integrated into the Kurator infrastructure. Handling some Java and Python code will be required.

Getting to know your data:

When answering any research question using biodiversity occurrence data, the first step is to download the data, usually from an aggregator like GBIF. Once the data is downloaded, developing a good understanding of its strengths and gaps is crucial before planning and executing any data analysis. Summarizing key data fields and identifying biases in the data is a good foundation. The development of this module would harness the ability of R users to graphically visualize and analyze complex biodiversity datasets. The primary challenges are (i) developing an impressive and useful, yet general, template for summarizing biodiversity data (e.g. 1, 2, 3); and (ii) identifying and implementing key techniques for detecting common biases in species occurrence data (e.g. 1, 2, 3).
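
As one small example of summarizing key data fields, the Python sketch below (standard library only; the project itself targets R) computes per-field completeness and per-taxon record counts. The sample table is made up, and the helper names `field_completeness` and `value_counts` are hypothetical; a real summary template would cover many more fields and add graphical output.

```python
import csv
import io
from collections import Counter

# Tiny made-up occurrence table standing in for a downloaded file.
SAMPLE = """scientificName,decimalLatitude,decimalLongitude,year
Panthera leo,-2.33,34.83,2015
Panthera leo,,,2015
Loxodonta africana,-1.95,35.10,
Panthera leo,-2.10,34.90,2016
"""


def has_value(row, field):
    """True if the field is present and not blank in this record."""
    return bool((row.get(field) or "").strip())


def field_completeness(rows, fields):
    """Fraction of records with a non-empty value, per field."""
    total = len(rows)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for row in rows if has_value(row, field)) / total
        for field in fields
    }


def value_counts(rows, field):
    """Count records per non-empty value of one field (e.g. taxon or year)."""
    return Counter(row[field] for row in rows if has_value(row, field))


reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
completeness = field_completeness(rows, reader.fieldnames)
per_species = value_counts(rows, "scientificName")
print(completeness)
print(per_species.most_common())
```

Low completeness in coordinate or date fields, or record counts dominated by a few taxa, are simple first hints of gaps and taxonomic bias that are worth inspecting before any analysis.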

Expected impact

Improving the quality of biodiversity research depends, in some measure, on improving user-level data cleaning tools and skills. Adopting a more comprehensive approach that incorporates data cleaning as part of data analysis will not only improve the quality of biodiversity data, but will also promote more appropriate usage of such data.

Mentors

Students, please contact mentors below after completing at least one of the tests below.

  • Tomer Gueta tomer.gu@gmail.com is the author of the R package bdclean and has been working with large biodiversity datasets for several years. Part of his research deals with integrating data cleaning with data analysis to enhance the usability of biodiversity big data.
  • Vijay Barve vijay.barve@gmail.com is a biodiversity data scientist and has contributed to several R packages related to biodiversity, e.g. rgbif, rvertnet, rinat, and bdvis.
  • Yohay Carmel yohay@technion.ac.il is an Associate Professor at the Faculty of Civil and Environmental Engineering, Technion. Yohay is an ecologist dealing with a wide range of biodiversity research.

Tests

  • Easy: Explain the meaning of these two DwC fields and their interaction: coordinatePrecision & coordinateUncertaintyInMeters
  • Medium: GBIF’s occurrence issues data is quite useful (for more details read this vignette); however, interpreting the exact meaning of a flag is sometimes challenging. Use this GBIF file to create a CSV file that pairs each flag with its description.
  • Medium: Write a function to standardise the field names of a data frame to the DwC standard.
  • Hard: Propose a scheme for managing hundreds of data checks in R.

Solutions of tests

Students, please post a link to your test results here
Povilas Gibas - Test solutions