A list of Data Sources
Data Sources

Source: scispaCy

scispaCy models are trained on data from a variety of sources. In particular, we use:
- The GENIA 1.0 Treebank, converted to basic Universal Dependencies using the Stanford Dependency Converter. We have made this dataset available along with the original raw data.
- word2vec word vectors trained on the PubMed Central Open Access Subset.
- The MedMentions Entity Linking dataset, used for training a mention detector.
- OntoNotes 5.0, used to make the parser and tagger more robust to non-biomedical text.
- The Pile: an 825 GiB open-source language modeling dataset that combines 22 smaller datasets. Its main strength is the diversity of its sources, which improves the general cross-domain knowledge of models trained on it as well as their performance on downstream NLP tasks.