Common Crawl - AshokBhat/ml GitHub Wiki
Organization
- A nonprofit organization
- Has been crawling the web since 2011
Data set
- The web archive consists of petabytes of data collected since 2011
- The archives are freely available to the public.
Usage
- A filtered version of the Common Crawl data was used as part of the pre-training data for GPT-3.
See also