bdclean: User-friendly biodiversity data cleaning pipeline - rstats-gsoc/gsoc2018 GitHub Wiki
Background
The bdclean package was initiated in GSoC 2017 and is available on GitHub. Initial user response has been positive, and user feedback indicates that the package needs more functionality to be usable by a wider user base.
Related work
Several R packages have great data checking and cleaning functions, but bdclean plans to provide features to manage the complete pipeline for biodiversity data cleaning, from data quality exploration and cleaning procedures to reporting.
Details of your coding project
- Input mechanism and support functions: The current bdclean package has a questionnaire-based mechanism to build a user-specific data cleaning pipeline. More work is needed to make the process of building the questionnaire simple but powerful. The questionnaire needs to handle nested and dependent questions effectively. A facility to store and retrieve responses is desirable.
- Managing data check modules: A lot of research and exploration is happening in the field of biodiversity data cleaning protocols, and the protocols may differ vastly depending on end-user needs. New data checks are therefore being developed all the time, and bdclean needs a management module that can quickly plug in these checks as they become available. This module must work seamlessly with the report generation module and with the uniform flagging system followed by the other modules.
- Storage and management: As the data is cleaned and worked on, several intermediate data snapshots need to be maintained. There are multiple ways to store these on disk, e.g. .csv, .rda or SQLite. We need to compare these options, decide which one works best, and implement it for the package.
- Report generation: A basic report generation facility exists but needs improvement. This module is closely connected to the data check management module, since newly incorporated checks should be reflected in the reports.
- Vignettes: The package needs good vignettes that showcase all the functionality, with screenshots or even a video tutorial of the basic workings of the package.
- User interface: Improving the user interface of the package according to user feedback would help user adoption. Tools like shiny and Rcmdr need to be explored.
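The data check management idea above can be made concrete with a small sketch. bdclean itself is an R package; Python is used here only as a compact, language-neutral illustration, and every name in it (`CheckRegistry`, `Flag`, `summarise`, and the two example checks) is hypothetical rather than part of bdclean. The sketch assumes that each check is a function taking one record and returning a message when the record should be flagged, and that records use the Darwin Core field names `decimalLatitude` and `decimalLongitude`. Under those assumptions, new checks plug in by registration, and every check emits the same flag shape that a report module could consume.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List, Optional

Record = Dict[str, Any]
# A check inspects one record and returns a message if it fails, else None.
CheckFn = Callable[[Record], Optional[str]]


@dataclass(frozen=True)
class Flag:
    """Uniform flag emitted by every check (hypothetical structure)."""
    check: str    # name of the check that raised the flag
    row: int      # index of the flagged record
    message: str  # human-readable reason


class CheckRegistry:
    """Holds named data checks so that new ones can be plugged in."""

    def __init__(self) -> None:
        self._checks: Dict[str, CheckFn] = {}

    def register(self, name: str) -> Callable[[CheckFn], CheckFn]:
        """Decorator that adds a check under a unique name."""
        def decorator(fn: CheckFn) -> CheckFn:
            if name in self._checks:
                raise ValueError(f"check {name!r} is already registered")
            self._checks[name] = fn
            return fn
        return decorator

    def names(self) -> List[str]:
        return sorted(self._checks)

    def run(self, records: List[Record],
            enabled: Optional[Iterable[str]] = None) -> List[Flag]:
        """Run the enabled checks (all by default) and collect flags."""
        selected = self.names() if enabled is None else list(enabled)
        flags: List[Flag] = []
        for name in selected:
            check = self._checks[name]
            for row, record in enumerate(records):
                message = check(record)
                if message is not None:
                    flags.append(Flag(name, row, message))
        return flags


def summarise(flags: Iterable[Flag]) -> Dict[str, int]:
    """Count flags per check, the kind of table a report could show."""
    counts: Dict[str, int] = {}
    for flag in flags:
        counts[flag.check] = counts.get(flag.check, 0) + 1
    return counts


registry = CheckRegistry()


@registry.register("missing_coordinates")
def missing_coordinates(record: Record) -> Optional[str]:
    if (record.get("decimalLatitude") is None
            or record.get("decimalLongitude") is None):
        return "latitude or longitude is missing"
    return None


@registry.register("latitude_out_of_range")
def latitude_out_of_range(record: Record) -> Optional[str]:
    lat = record.get("decimalLatitude")
    if lat is not None and not -90 <= lat <= 90:
        return f"latitude {lat} is outside [-90, 90]"
    return None
```

Because the reporting step only sees `Flag` objects, a newly registered check would show up in `summarise` output without any change to the reporting code, which is the coupling between the data check and report modules that the project description asks for. The `enabled` argument stands in for the user's questionnaire responses selecting which checks to run.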
Expected impact
The bdclean package is already under development, and initial user response has been good. The modular approach to plugging in data checking and cleaning functions would make this package a one-stop shop for many biodiversity data cleaning needs.
Mentors
Students, please contact mentors below after completing at least one of the tests below.
- Vijay Barve is a biodiversity data scientist who has contributed to several biodiversity-related R packages, e.g. rgbif, rvertnet, rinat and bdvis, and has been involved with GSoC and R since 2012.
- Tomer Gueta is the author of the R package bdclean and has been working with large biodiversity datasets for several years. Part of his research deals with integrating data cleaning with data analysis to enhance the usability of biodiversity big data.
- Narayani Barve is a biodiversity informatics scientist who was a GSoC student (2015) and mentor (2016-2017), developed the package ENMGadgets, and has extensive experience working with spatial information.
- Yohay Carmel is an Associate Professor at the Faculty of Civil and Environmental Engineering, Technion. Yohay is an ecologist dealing with a wide range of biodiversity research.
Tests
Students, please do one or more of the following tests before contacting the mentors above.
- Easy: Test the package with various data sources such as VertNet, iDigBio, iNaturalist and GBIF.
- Medium: Write a function to standardise the field names of a data frame to the Darwin Core (DwC) standard.
- Hard: Propose a scheme for the Data Check Management Module.
- Hard: Explore different data storage mechanisms and propose a scheme. Get the scheme vetted by the community.