Home - HobnobMancer/cazy_webscraper GitHub Wiki

Welcome to the cazy_webscraper wiki!

Introduction

cazy_webscraper is an application and Python3 package for the automated retrieval of protein data from the CAZy database. The code is distributed under the MIT license.

cazy_webscraper retrieves protein data from the CAZy database into a local SQLite3 database. This enables users to integrate the dataset into analytical pipelines, and interrogate the data in a manner unachievable through the CAZy website.
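Because the output is an ordinary SQLite3 file, it can be inspected with Python's standard `sqlite3` module. The following is a minimal sketch, not part of cazy_webscraper itself: the helper names `list_tables` and `count_rows` are hypothetical. It makes no assumption about the table names in the database and instead discovers them from SQLite's built-in `sqlite_master` catalogue.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in the SQLite database at db_path."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def count_rows(db_path, table):
    """Return the number of rows in one table of the database.

    The table name is checked against the tables that actually exist,
    because SQL identifiers cannot be passed as bound parameters.
    """
    if table not in list_tables(db_path):
        raise ValueError(f"no such table: {table}")
    with closing(sqlite3.connect(db_path)) as conn:
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return count
```

For example, `list_tables("cazy_webscraper.db")` (the file name here is an assumption; use the path of your own local database) returns the table names, and each can then be passed to `count_rows` to see how many records a scrape produced.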

Please do not perform a complete scrape of the CAZy database unless you specifically need to reproduce the entire CAZy dataset. A complete scrape will take several hours and may unintentionally deny the service to others.

Using the expand subcommand, a user can retrieve CAZyme protein sequence data from GenBank, and protein structure files from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB).

cazy_webscraper can retrieve specified CAZy classes and/or CAZy families, and these queries can be filtered by taxonomy at the kingdom, genus, species or strain level. Successive CAZy queries can be collated into a single local database. A log of each query is recorded in the database for transparency, reproducibility and shareability.

Citation

If you use `cazy_webscraper`, please cite the following publication:

Hobbs, Emma E. M.; Pritchard, Leighton; Chapman, Sean; Gloster, Tracey M. (2021): cazy_webscraper Microbiology Society Annual Conference 2021 poster. FigShare. Poster. [https://doi.org/10.6084/m9.figshare.14370860.v7](https://doi.org/10.6084/m9.figshare.14370860.v7)

Documentation

Please see the full documentation at ReadTheDocs.