Common Crawl - AshokBhat/ml GitHub Wiki

Organization

  • A nonprofit organization
  • Has crawled the web since 2008

Data set

  • The web archive consists of petabytes of data collected since 2008.
  • The archives are freely available to the public.
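
Because the archives are public, one common way to find a page is to query the URL index and then fetch only the matching record's byte range. The sketch below is illustrative and makes no network calls. The helper names are hypothetical, the crawl ID `CC-MAIN-2023-50` is just an example, and the `filename`, `offset`, and `length` fields are assumed from the index's JSON output. Check the Common Crawl documentation before relying on any of it.

```python
import json
from urllib.parse import urlencode

# Assumed public endpoints; verify against the Common Crawl documentation.
INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"


def build_index_query(crawl_id: str, url_pattern: str) -> str:
    """Build an index query URL for one crawl (hypothetical helper)."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_index_lines(body: str) -> list[dict]:
    """Parse a response made of one JSON object per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def record_location(entry: dict) -> tuple[str, str]:
    """Return (archive file URL, HTTP Range header value) for one index entry.

    Assumes the entry carries 'filename', 'offset' and 'length' fields.
    """
    start = int(entry["offset"])
    end = start + int(entry["length"]) - 1  # HTTP byte ranges are inclusive
    return f"{DATA_HOST}/{entry['filename']}", f"bytes={start}-{end}"
```

A caller would request the URL from `build_index_query`, pass the response text to `parse_index_lines`, and send the `Range` value from `record_location` in a GET request to retrieve a single compressed record instead of downloading a whole archive file.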

Usage

  • A filtered version of the Common Crawl data was the largest component of GPT-3's pre-training corpus.
