Basic Use - HobnobMancer/cazy_webscraper GitHub Wiki

cazy_webscraper downloads specific records from CAZy. The records to retrieve can be specified either with command-line arguments or in a YAML configuration file.

To get help in the terminal for cazy_webscraper use:

cazy_webscraper -h

Documentation

Please see the full documentation at ReadTheDocs.

Command line arguments and operation

  • -c --config: path to a YAML configuration file.
  • --classes: comma-separated list of CAZyme classes to filter download from CAZy.
  • -d --database: path to cazy_webscraper database. If one does not already exist, it will be created.
  • --cazy_synonyms: path to JSON file containing accepted CAZy class name synonyms.
  • -f --force: force writing of plaintext files to existing output directory.
  • --families: comma-separated list of CAZy families to filter download from CAZy.
  • --genera: comma-separated list of genus names to filter download from CAZy.
  • --kingdoms: comma-separated list of Kingdom names to filter download from CAZy.
  • -h --help: show help message in terminal.
  • -l --log: path to log file.
  • -n --nodelete: do not delete content in plaintext output directory.
  • -o --output: path to output directory for plaintext output. If it does not exist, the directory will be created.
  • -r --retries: number of retry attempts for CAZy HTTP queries, if an error is raised.
  • -s --subfamilies: comma-separated list of CAZy subfamilies to filter download from CAZy.
  • --species: comma-separated list of species names to filter download from CAZy.
  • --strains: comma-separated list of strain names to filter download from CAZy.
  • -v --verbose: report verbose messages.
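
As a minimal sketch of how the filter options above combine on one command line, the Python snippet below assembles an argument list using only flags from the list. The helper `build_command` and the output directory name `cazy_output` are hypothetical, and the sketch assumes `cazy_webscraper` is installed and available on the PATH. It only builds and prints the command; it does not run it.

```python
import shlex


def build_command(families=None, genera=None, kingdoms=None, output=None):
    """Assemble a cazy_webscraper argument list (hypothetical helper)."""
    cmd = ["cazy_webscraper"]
    # Each filter takes a single comma-separated value, as described above.
    for flag, values in (
        ("--families", families),
        ("--genera", genera),
        ("--kingdoms", kingdoms),
    ):
        if values:
            cmd += [flag, ",".join(values)]
    if output:
        cmd += ["--output", str(output)]
    return cmd


cmd = build_command(
    families=["GH1", "PL28"],
    genera=["Trichoderma"],
    output="cazy_output",  # illustrative directory name
)
print(shlex.join(cmd))
# cazy_webscraper --families GH1,PL28 --genera Trichoderma --output cazy_output
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` to start the scrape.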

To specify a CAZy family, use the standard CAZy notation for the family, not only its number. For example, GH1 is understood by cazy_webscraper, but 1 is not.

If a parent family, e.g. GH3, is specified and --subfamilies is enabled, all proteins catalogued under GH3 and its subfamilies will be retrieved.
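
To check family names before passing them to cazy_webscraper, something like the following sketch could be used. It is not part of cazy_webscraper: `parse_family` is a hypothetical helper, and the underscore notation for subfamilies (for example `GH3_1`) is an assumption about CAZy naming rather than something stated on this page.

```python
import re

# Class abbreviation, family number, optional "_<subfamily number>".
# The underscore subfamily form is an assumption, not taken from this page.
FAMILY_PATTERN = re.compile(r"(GH|GT|PL|CE|AA|CBM)(\d+)(?:_(\d+))?")


def parse_family(name):
    """Return (parent_family, subfamily_or_None), or None if name is invalid."""
    match = FAMILY_PATTERN.fullmatch(name)
    if match is None:
        return None
    parent = match.group(1) + match.group(2)
    subfamily = name if match.group(3) else None
    return parent, subfamily


for candidate in ("GH1", "1", "GH3_1"):
    print(candidate, parse_family(candidate))
```

A bare number such as `1` is rejected, while `GH3_1` is reported together with its parent family `GH3`.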

Configuration files

A YAML configuration file can also be used to specify cazy_webscraper arguments, to support transparency and reproducibility of analyses. An example is shown below.

# All members of named CAZy classes will be recovered, unless the taxon-
# specific filters are active.
classes:  # Only members of named classes will be recovered
  Glycoside Hydrolases (GHs):
    - "GH1"
    - "GH2"
  GlycosylTransferases (GTs):
  Polysaccharide Lyases (PLs):
    - "PL28"
  Carbohydrate Esterases (CEs):
  Auxiliary Activities (AAs):
  Carbohydrate-Binding Modules (CBMs):

# Taxon-specific filters
genera:  # If specified, only members of named genera will be recovered
  - "Trichoderma"
species:  # If specified, only members of named species will be recovered
strains:  # If specified, only members of named strains will be recovered
kingdoms:  # If specified, only members of named Kingdoms will be recovered
  - "Bacteria"

Each requested family must be listed on a separate line, with its name enclosed in double or single quotation marks.

All proteins catalogued under any of the named classes will be retrieved, unless the taxon-specific filters are active. If taxon-specific filters are active, then only sequences corresponding to those filters will be retrieved.
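
For orientation, the sketch below shows the example configuration as a Python mapping, written by hand under the assumption of standard YAML semantics, where a key with no value loads as `None`. The helpers `listed_families` and `active_taxon_filters` are hypothetical and only inspect the mapping. They do not reproduce how cazy_webscraper itself interprets the file.

```python
# Hand-written equivalent of the example configuration above.
# A YAML loader would give None for keys that have no value.
config = {
    "classes": {
        "Glycoside Hydrolases (GHs)": ["GH1", "GH2"],
        "GlycosylTransferases (GTs)": None,
        "Polysaccharide Lyases (PLs)": ["PL28"],
        "Carbohydrate Esterases (CEs)": None,
        "Auxiliary Activities (AAs)": None,
        "Carbohydrate-Binding Modules (CBMs)": None,
    },
    "genera": ["Trichoderma"],
    "species": None,
    "strains": None,
    "kingdoms": ["Bacteria"],
}


def listed_families(cfg):
    """Collect every family named under any class (hypothetical helper)."""
    families = []
    for members in (cfg.get("classes") or {}).values():
        families.extend(members or [])
    return families


def active_taxon_filters(cfg):
    """Return only the taxon filters that have at least one entry."""
    keys = ("genera", "species", "strains", "kingdoms")
    return {key: cfg[key] for key in keys if cfg.get(key)}


print(listed_families(config))
print(active_taxon_filters(config))
```

In this example the genera and kingdoms filters are active, while species and strains are left empty and so impose no restriction.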

Synonyms

cazy_webscraper understands synonyms for the CAZy class names:

  • "Glycoside Hydrolases (GHs)"
    • "Glycoside-Hydrolases", "Glycoside_Hydrolases", "GlycosideHydrolases", "GLYCOSIDE-HYDROLASES", "GLYCOSIDE_HYDROLASES", "GLYCOSIDEHYDROLASES", "glycoside-hydrolases", "glycoside_hydrolases", "glycosidehydrolases", "GH", "gh"
  • "GlycosylTransferases (GTs)"
    • "Glycosyl-Transferases", "GlycosylTransferases", "Glycosyl_Transferases", "Glycosyl Transferases", "GLYCOSYL-TRANSFERASES", "GLYCOSYLTRANSFERASES", "GLYCOSYL_TRANSFERASES", "GLYCOSYL TRANSFERASES", "glycosyl-transferases", "glycosyltransferases", "glycosyl_transferases", "glycosyl transferases", "GT", "gt"
  • "Polysaccharide Lyases (PLs)"
    • "Polysaccharide Lyases", "Polysaccharide-Lyases", "Polysaccharide_Lyases", "PolysaccharideLyases", "POLYSACCHARIDE LYASES", "POLYSACCHARIDE-LYASES", "POLYSACCHARIDE_LYASES", "POLYSACCHARIDELYASES", "polysaccharide lyases", "polysaccharide-lyases", "polysaccharide_lyases", "polysaccharidelyases", "PL", "pl"
  • "Carbohydrate Esterases (CEs)"
    • "Carbohydrate Esterases", "Carbohydrate-Esterases", "Carbohydrate_Esterases", "CarbohydrateEsterases", "CARBOHYDRATE ESTERASES", "CARBOHYDRATE-ESTERASES", "CARBOHYDRATE_ESTERASES", "CARBOHYDRATEESTERASES", "carbohydrate esterases", "carbohydrate-esterases", "carbohydrate_esterases", "carbohydrateesterases", "CE", "ce"
  • "Auxiliary Activities (AAs)"
    • "Auxiliary Activities", "Auxiliary-Activities", "Auxiliary_Activities", "AuxiliaryActivities", "AUXILIARY ACTIVITIES", "AUXILIARY-ACTIVITIES", "AUXILIARY_ACTIVITIES", "AUXILIARYACTIVITIES", "auxiliary activities", "auxiliary-activities", "auxiliary_activities", "auxiliaryactivities", "AA", "aa"
  • "Carbohydrate-Binding Modules (CBMs)"
    • "Carbohydrate-Binding-Modules", "Carbohydrate_Binding_Modules", "Carbohydrate_Binding Modules", "CarbohydrateBindingModules", "CARBOHYDRATE-BINDING-MODULES", "CARBOHYDRATE_BINDING_MODULES", "CARBOHYDRATE_BINDING MODULES", "CARBOHYDRATEBINDINGMODULES", "carbohydrate-binding-modules", "carbohydrate_binding_modules", "carbohydrate_binding modules", "carbohydratebindingmodules", "CBMs", "CBM", "cbms", "cbm"
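
The synonym lists above can be read as a many-to-one lookup from accepted spellings to the canonical class name. The sketch below illustrates that idea with a short excerpt of the synonyms. The dictionary layout (canonical name mapped to a list of synonyms) and the helper names are assumptions made for illustration, not the documented format of the `--cazy_synonyms` JSON file.

```python
# Excerpt of the synonyms listed above; the layout is an assumption.
SYNONYMS = {
    "Glycoside Hydrolases (GHs)": ["Glycoside-Hydrolases", "GH", "gh"],
    "GlycosylTransferases (GTs)": ["Glycosyl Transferases", "GT", "gt"],
    "Polysaccharide Lyases (PLs)": ["Polysaccharide Lyases", "PL", "pl"],
    "Carbohydrate Esterases (CEs)": ["Carbohydrate Esterases", "CE", "ce"],
    "Auxiliary Activities (AAs)": ["Auxiliary Activities", "AA", "aa"],
    "Carbohydrate-Binding Modules (CBMs)": ["CBMs", "CBM", "cbms", "cbm"],
}


def build_lookup(synonyms):
    """Invert the mapping so every accepted name points at its class."""
    lookup = {}
    for canonical, names in synonyms.items():
        lookup[canonical] = canonical
        for name in names:
            lookup[name] = canonical
    return lookup


LOOKUP = build_lookup(SYNONYMS)


def canonical_class(name):
    """Return the canonical class name, or None if the name is unknown."""
    return LOOKUP.get(name.strip())


print(canonical_class("gh"))
print(canonical_class("CBMs"))
print(canonical_class("not-a-class"))
```

Inverting the mapping once keeps each lookup to a single dictionary access, and unknown names return `None` rather than raising an error.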