What's in a name? - michaeltelford/wgit GitHub Wiki
So why name the project Wgit? Why call it a crawler rather than a scraper? How important is getting this terminology correct? And who decides what's correct?
Naming things in Computing Science is hard. Fact.
But in a nutshell:
- Wgit can be pronounced as double-you-get or wid-jit (which I once heard someone say and liked). How you say it matters less than what this gem can do for you.
- All Ruby gems must be uniquely named. And obviously common terms like `crawler` or `scraper` are long gone. The name Wgit is a play on the `wget` Unix tool which Wgit had similar functionality to back in the early days. The `wgit` gem name was available, so here we are.
- Crawler, scraper, spider, indexer... What's the difference? There's probably a lot of overlap between them but this article describes web crawling as what the big search engines do e.g. saving a webpage's textual content (for searching against); while web scraping is more bespoke to a particular site/purpose e.g. price comparisons. In truth, Wgit is capable of both scenarios but was primarily designed for building (custom) search engines with; and as a result, the term `crawler` better fits Wgit's true nature. Indexing in Wgit means to first crawl and then save the content to a database (to be searched etc). Wgit could even be described as an ETL framework since it Extracts from the web, Transforms the html into useful content and then Loads it into a database.
- Is the terminology used in Wgit 100% correct? Who knows. But maybe a better question is "Who cares?". I figure the terminology used is close enough, which is good enough for me.
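
To make the ETL description above a little more concrete, here is a minimal plain-Ruby sketch of the extract, transform, load flow behind "indexing". It deliberately does not use Wgit's API: every method name (`extract_html`, `transform_html`, `load_record`, `index_page`, `search_index`) is hypothetical, a Hash of canned HTML stands in for the web, another Hash stands in for the database, and the regex-based tag stripping is a crude stand-in for real HTML parsing.

```ruby
# Illustrative only: these names are hypothetical and are NOT part of Wgit.

# Extract: get the raw HTML for a URL.
# Assumption: a Hash lookup stands in for a real HTTP request.
def extract_html(pages, url)
  pages.fetch(url)
end

# Transform: reduce the HTML to a title and plain, searchable text.
# Assumption: regexes are good enough for this toy page; real HTML needs a parser.
def transform_html(html)
  title = html[%r{<title>(.*?)</title>}m, 1].to_s.strip
  body  = html[%r{<body[^>]*>(.*?)</body>}m, 1].to_s
  text  = body.gsub(/<[^>]+>/, ' ').split.join(' ')
  { title: title, text: text }
end

# Load: save the record to the "database" (an in-memory Hash here).
def load_record(db, url, record)
  db[url] = record
end

# Indexing: crawl (extract + transform), then save (load).
def index_page(db, pages, url)
  record = transform_html(extract_html(pages, url))
  load_record(db, url, record)
end

# Searching: return the URLs whose saved text contains the term.
def search_index(db, term)
  db.select { |_url, record| record[:text].downcase.include?(term.downcase) }.keys
end

pages = {
  'http://example.com/' =>
    '<html><head><title>Home</title></head>' \
    '<body><h1>Welcome</h1><p>Ruby  crawlers are fun.</p></body></html>'
}
db = {}

index_page(db, pages, 'http://example.com/')
puts search_index(db, 'crawlers').inspect
```

The point of the sketch is the shape of the pipeline rather than the details: a scraper would typically stop after a site-specific transform step, whereas a crawler built for search keeps the generic text and loads it somewhere it can be queried later.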