A list of Data Sources

Data Sources

scispaCy models are trained on data from a variety of sources. In particular, we use:

The GENIA 1.0 Treebank, converted to basic Universal Dependencies using the Stanford Dependency Converter. We have made this dataset available along with the original raw data.
word2vec word vectors trained on the PubMed Central Open Access Subset.
The MedMentions Entity Linking dataset, used for training a mention detector.
OntoNotes 5.0, used to make the parser and tagger more robust to non-biomedical text.
Pile Dataset - The Pile is an 825 GiB open-source language modeling dataset that combines 22 smaller datasets. Its value lies in the diversity of its data sources, which is intended to improve a model's general cross-domain knowledge as well as its performance on downstream NLP tasks.
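
To make the "22 smaller datasets" point concrete, here is a minimal sketch that tallies records by component dataset. It assumes each record is a JSON Lines object of the form {"text": ..., "meta": {"pile_set_name": ...}}; check that layout against the copy you actually download. The function name count_by_component and the three sample records are hypothetical, made up for illustration, and are not real Pile data.

```python
import json
from collections import Counter


def count_by_component(lines):
    """Tally records per component dataset from an iterable of JSON Lines strings."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumed layout: {"text": ..., "meta": {"pile_set_name": ...}}
        name = record.get("meta", {}).get("pile_set_name", "unknown")
        counts[name] += 1
    return counts


# Tiny made-up sample standing in for a real shard (illustrative only).
sample = [
    '{"text": "first doc", "meta": {"pile_set_name": "PubMed Central"}}',
    '{"text": "second doc", "meta": {"pile_set_name": "ArXiv"}}',
    '{"text": "third doc", "meta": {"pile_set_name": "PubMed Central"}}',
]

component_counts = count_by_component(sample)
for name, n in component_counts.most_common():
    print(name, n)
```

Because the function accepts any iterable of strings, the same code could be pointed at an open file handle over a decompressed shard instead of the in-memory sample.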