What's in a name?

So why name the project Wgit? Why call it a crawler rather than a scraper? How important is getting this terminology correct? And who decides what's correct?

Naming things in Computing Science is hard. Fact.

But in a nutshell:

  • Wgit can be pronounced double-you-get or wid-jit (which I once heard someone say and liked). How you say it matters far less than what this gem can do for you.
  • All Ruby gems must be uniquely named, and obviously common terms like crawler or scraper are long gone. The name Wgit is a play on the wget Unix tool, which Wgit resembled in functionality back in its early days. The wgit gem name was available, so here we are.
  • Crawler, scraper, spider, indexer... What's the difference? There's probably a lot of overlap between them, but this article describes web crawling as what the big search engines do, e.g. saving a webpage's textual content (for searching against), while web scraping is more bespoke to a particular site or purpose, e.g. price comparisons. In truth, Wgit is capable of both but was primarily designed for building (custom) search engines, so the term crawler better fits its true nature. Indexing in Wgit means first crawling a page and then saving its content to a database (to be searched etc.). Wgit could even be described as an ETL framework, since it Extracts from the web, Transforms the HTML into useful content and then Loads it into a database.
  • Is the terminology used in Wgit 100% correct? Who knows. But maybe a better question is "Who cares?" I figure the terminology used is close enough, which is good enough for me.
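
To make the crawl, index and ETL terms a little more concrete, here is a minimal sketch in plain Ruby. It does not use Wgit's API: every method name (`extract_html`, `transform_html`, `load_record`, `index_page`, `search`) is hypothetical, the "web" is a hard-coded string and the "database" is just a Hash. It only illustrates the Extract, Transform, Load shape described above.

```ruby
# Hypothetical ETL sketch using only core Ruby. This is NOT Wgit's API.

# Extract: a real crawler would fetch the URL over HTTP. A fixed string
# keeps the example self-contained (assumption for illustration only).
def extract_html(_url)
  "<html><head><title>Example</title></head>" \
  "<body><h1>Hello</h1><p>Search   me.</p></body></html>"
end

# Transform: pull out the title and reduce the body markup to plain text.
# A regex is enough for this toy input; real HTML needs a proper parser.
def transform_html(url, html)
  title = html[%r{<title>(.*?)</title>}m, 1].to_s.strip
  body  = html[%r{<body.*?>(.*?)</body>}m, 1].to_s
  text  = body.gsub(/<[^>]+>/, " ").gsub(/\s+/, " ").strip
  { url: url, title: title, text: text }
end

# Load: save the record in a Hash standing in for a database.
def load_record(db, record)
  db[record[:url]] = record
end

# Indexing = crawl (extract + transform), then save (load).
def index_page(db, url)
  load_record(db, transform_html(url, extract_html(url)))
end

# A naive case-insensitive search against the saved text.
def search(db, term)
  db.values.select { |doc| doc[:text].downcase.include?(term.downcase) }
end

db = {}
index_page(db, "http://example.com")
search(db, "search").map { |doc| doc[:url] } # => ["http://example.com"]
```

Under these assumptions, a "scraper" would differ mainly in the transform step, keeping only the site-specific bits it cares about (say, a price) rather than all of the page's text.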