Tech Stack Research - 52North/ecmwf-dataset-crawl GitHub Wiki

Translation APIs

Search APIs

This is important to get right, as the initial search defines the result set of the crawl. Search could become costly; the expected search volume is 20 queries per crawl request and language.
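To get a feel for the cost, a back-of-the-envelope estimate of API volume. The 20 queries per crawl request and language is from the note above; the language count and request rate are hypothetical assumptions for illustration:

```java
// Rough search API volume estimate. Only the 20 queries per crawl request
// and language is from the notes; the other numbers are assumptions.
public class SearchVolume {
    static int queriesPerCrawl(int queriesPerLanguage, int languages) {
        return queriesPerLanguage * languages;
    }

    public static void main(String[] args) {
        int perCrawl = queriesPerCrawl(20, 5); // 5 target languages: assumption
        int perDay = perCrawl * 10;            // 10 crawl requests/day: assumption
        System.out.println(perCrawl + " queries per crawl, " + perDay + " per day");
    }
}
```

With these assumptions a paid search API's per-query pricing dominates quickly, which is why the choice matters.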

Crawling

  • Apache Nutch

    • :+1: feature-complete web crawling application
    • :-1: operates in batch mode, slow
    • :-1: old; community rather inactive
  • Storm Crawler

    • web crawling SDK based on Apache Storm
    • :+1: stream-based, very efficient, results available as they come in
    • :-1: not as feature-complete, so more work is required, but probably friendlier to work with
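The batch-vs-stream distinction above is the core trade-off. A minimal sketch of the stream-based model, using a plain queue handoff (this is a simplification to show the idea, not StormCrawler's actual API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrates the stream-based model: each fetched page is handed downstream
// immediately, instead of waiting for a whole fetch batch to finish
// (the Nutch-style batch mode). Queue handoff and URL strings are toy stand-ins.
public class StreamCrawlSketch {
    static final String EOF = "<eof>"; // poison pill ending the stream

    static List<String> streamCrawl(List<String> urls) throws InterruptedException {
        BlockingQueue<String> fetched = new LinkedBlockingQueue<>();

        // "fetcher" thread: emits each result as soon as the page is fetched
        Thread fetcher = new Thread(() -> {
            for (String url : urls) fetched.add(url + " -> content");
            fetched.add(EOF);
        });
        fetcher.start();

        // "indexer": consumes each result immediately, no batch barrier
        List<String> indexed = new ArrayList<>();
        for (String page = fetched.take(); !page.equals(EOF); page = fetched.take()) {
            indexed.add(page);
        }
        fetcher.join();
        return indexed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(streamCrawl(List.of("http://a.example", "http://b.example")));
    }
}
```

In the batch model the indexer would only see results after the last URL of a segment is fetched; here the first result is available while later fetches are still running.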

Indexing

  • Elasticsearch
  • Solr

Both work well with both crawlers; we have more know-how with Elasticsearch.
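Either way, crawl results would land in the index via the bulk endpoint. A sketch of the NDJSON payload Elasticsearch's `_bulk` API expects; the index name `crawl-results` and the field names are assumptions, and a real setup would use a client library with proper JSON escaping:

```java
// Builds an NDJSON body for the Elasticsearch _bulk endpoint: one action line
// followed by one document line per result. Index and field names are assumed.
public class BulkPayload {
    static String bulkBody(String index, String[][] docs) {
        StringBuilder sb = new StringBuilder();
        for (String[] doc : docs) {
            // action line: index this document into the given index
            sb.append("{\"index\":{\"_index\":\"").append(index).append("\"}}\n");
            // source line: the crawl result itself (url, title)
            sb.append("{\"url\":\"").append(doc[0])
              .append("\",\"title\":\"").append(doc[1]).append("\"}\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(bulkBody("crawl-results",
            new String[][] {{"http://example.org/data", "Example dataset"}}));
    }
}
```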

Content Analysis

The exact approach - and thus the tooling - is TBD.

  • Apache Tika: content-type identification and extraction tool + SDK
  • MALLET: Java package for statistical NLP, document classification, clustering, topic modeling, and information extraction
  • Apache OpenNLP: NLP toolkit
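Since the approach is TBD, here is only a toy stand-in for the classification step: scoring a page's relevance to dataset content by weighted keyword counts. A real pipeline would train a proper classifier (e.g. with MALLET); the keywords and weights below are made up:

```java
import java.util.Map;

// Toy relevance scorer, a placeholder for the future document-classification
// step. Keywords and weights are invented for illustration only.
public class RelevanceScore {
    static final Map<String, Double> WEIGHTS = Map.of(
        "dataset", 2.0, "download", 1.0, "csv", 1.5, "netcdf", 2.0);

    static double score(String text) {
        double s = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            s += WEIGHTS.getOrDefault(token, 0.0);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(score("Download this dataset as CSV or NetCDF"));
    }
}
```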

UI / Result Presentation

  • Vue.js 2?
  • Views:
    • "Launch Crawl"
    • "View Crawls (completed/in progress)"
    • "Search Results"

Deployment

  • all dockerized
  • orchestration with Docker Compose for now?
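A hypothetical compose sketch of how the services above could fit together; the service names, build paths, and the Elasticsearch image tag are assumptions, not the project's actual configuration:

```yaml
# Hypothetical docker-compose layout for the stack above (assumed names/paths).
version: '3'
services:
  crawler:            # Storm Crawler topology
    build: ./crawler
    depends_on:
      - elasticsearch
  elasticsearch:      # index for crawl results
    image: docker.elastic.co/elasticsearch/elasticsearch:6.2.4
    environment:
      - discovery.type=single-node
  frontend:           # Vue.js UI
    build: ./frontend
    ports:
      - "80:80"
```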