Tech Stack Research - 52North/ecmwf-dataset-crawl GitHub Wiki

Translation APIs

Search APIs

This is important to get right, as the initial search defines the result set of the crawl. Search could become costly; the expected search volume is 20 queries per crawl request and language.
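To get a feel for the cost, a back-of-the-envelope estimate of API volume. The 20 queries per crawl request and language is from the note above; the language count and request rate are hypothetical assumptions for illustration:

```java
// Rough search API volume estimate. Only the 20 queries per crawl request
// and language is from the notes; the other numbers are assumptions.
public class SearchVolume {
    static int queriesPerCrawl(int queriesPerLanguage, int languages) {
        return queriesPerLanguage * languages;
    }

    public static void main(String[] args) {
        int perCrawl = queriesPerCrawl(20, 5); // 5 target languages: assumption
        int perDay = perCrawl * 10;            // 10 crawl requests/day: assumption
        System.out.println(perCrawl + " queries per crawl, " + perDay + " per day");
    }
}
```

With these assumptions a paid search API's per-query pricing dominates quickly, which is why the choice matters.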

Crawling

  • Apache Nutch

    • :+1: feature-complete web crawling application
    • :-1: operates in batch mode, slow
    • :-1: old; community rather inactive
  • Storm Crawler

    • web crawling SDK based on Apache Storm
    • :+1: stream-based, very efficient, results available as they come in
    • :-1: not as feature-complete, so more work is required, but probably friendlier to work with
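The batch-vs-stream distinction above is the core trade-off. A minimal sketch of the stream-based model, using a plain queue handoff (this is a simplification to show the idea, not StormCrawler's actual API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrates the stream-based model: each fetched page is handed downstream
// immediately, instead of waiting for a whole fetch batch to finish
// (the Nutch-style batch mode). Queue handoff and URL strings are toy stand-ins.
public class StreamCrawlSketch {
    static final String EOF = "<eof>"; // poison pill ending the stream

    static List<String> streamCrawl(List<String> urls) throws InterruptedException {
        BlockingQueue<String> fetched = new LinkedBlockingQueue<>();

        // "fetcher" thread: emits each result as soon as the page is fetched
        Thread fetcher = new Thread(() -> {
            for (String url : urls) fetched.add(url + " -> content");
            fetched.add(EOF);
        });
        fetcher.start();

        // "indexer": consumes each result immediately, no batch barrier
        List<String> indexed = new ArrayList<>();
        for (String page = fetched.take(); !page.equals(EOF); page = fetched.take()) {
            indexed.add(page);
        }
        fetcher.join();
        return indexed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(streamCrawl(List.of("http://a.example", "http://b.example")));
    }
}
```

In the batch model the indexer would only see results after the last URL of a segment is fetched; here the first result is available while later fetches are still running.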

Indexing

  • Elasticsearch
  • Solr

Both work well with both crawlers; we have more know-how with Elasticsearch.
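Either way, crawl results would land in the index via the bulk endpoint. A sketch of the NDJSON payload Elasticsearch's `_bulk` API expects; the index name `crawl-results` and the field names are assumptions, and a real setup would use a client library with proper JSON escaping:

```java
// Builds an NDJSON body for the Elasticsearch _bulk endpoint: one action line
// followed by one document line per result. Index and field names are assumed.
public class BulkPayload {
    static String bulkBody(String index, String[][] docs) {
        StringBuilder sb = new StringBuilder();
        for (String[] doc : docs) {
            // action line: index this document into the given index
            sb.append("{\"index\":{\"_index\":\"").append(index).append("\"}}\n");
            // source line: the crawl result itself (url, title)
            sb.append("{\"url\":\"").append(doc[0])
              .append("\",\"title\":\"").append(doc[1]).append("\"}\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(bulkBody("crawl-results",
            new String[][] {{"http://example.org/data", "Example dataset"}}));
    }
}
```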

Content Analysis

The exact approach - and thus the tooling - is TBD.

  • Apache Tika: content-type identification and extraction tool + SDK
  • MALLET: Java package for statistical NLP, document classification, clustering, topic modeling, and information extraction
  • Apache OpenNLP: NLP toolkit
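Since the approach is TBD, here is only a toy stand-in for the classification step: scoring a page's relevance to dataset content by weighted keyword counts. A real pipeline would train a proper classifier (e.g. with MALLET); the keywords and weights below are made up:

```java
import java.util.Map;

// Toy relevance scorer, a placeholder for the future document-classification
// step. Keywords and weights are invented for illustration only.
public class RelevanceScore {
    static final Map<String, Double> WEIGHTS = Map.of(
        "dataset", 2.0, "download", 1.0, "csv", 1.5, "netcdf", 2.0);

    static double score(String text) {
        double s = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            s += WEIGHTS.getOrDefault(token, 0.0);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(score("Download this dataset as CSV or NetCDF"));
    }
}
```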

UI / Result Presentation

  • Vue.js 2?
  • Views:
    • "Launch Crawl"
    • "View Crawls (completed/in progress)"
    • "Search Results"

Deployment

  • all dockerized
  • orchestration with Docker Compose for now?
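A hypothetical compose sketch of how the services above could fit together; the service names, build paths, and the Elasticsearch image tag are assumptions, not the project's actual configuration:

```yaml
# Hypothetical docker-compose layout for the stack above (assumed names/paths).
version: '3'
services:
  crawler:            # Storm Crawler topology
    build: ./crawler
    depends_on:
      - elasticsearch
  elasticsearch:      # index for crawl results
    image: docker.elastic.co/elasticsearch/elasticsearch:6.2.4
    environment:
      - discovery.type=single-node
  frontend:           # Vue.js UI
    build: ./frontend
    ports:
      - "80:80"
```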