| Dataset | Description | Number of proteins | URL |
| --- | --- | --- | --- |
| Swiss-Prot | Human-reviewed protein sequence database | 500K | https://www.uniprot.org/uniprotkb?query=reviewed:true |
| UniRef50 | Generated by clustering UniProt proteins at 50% sequence identity | 50M | https://www.uniprot.org/help/uniref |
| Uncharacterized | All proteins labeled as "Uncharacterized" on the UniProt website | 30M | https://www.uniprot.org/uniprotkb?query=Uncharacterized |
| OMG_prot50 | Created by clustering the Open MetaGenomic dataset (OMG) at 50% sequence identity | 200M | https://huggingface.co/datasets/tattabio/OMG_prot50 |
| PDB | A database of three-dimensional structural data for proteins | 700K (every chain in a structure was extracted and counted as one protein) | https://www.rcsb.org/ |
| GOPC | Global ocean microbiome protein catalog sequences | 2B | https://db.cngb.org/maya/datasets/MDB0000002 |
| NCBI | NCBI protein database | 700M | https://www.ncbi.nlm.nih.gov/protein |