Nov 9, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Action Items from last day

  • develop unit tests for foxnews postprocessed results, for example, on text alias - Ar
  • Wa/Po twitter data set: look for lines producing errors - Fr
  • look for converter for CSV/JSON - Ar
  • add debugging to IA crawler like total crawled - Ra
  • add documentation about filtering out irrelevant URLs for IA crawler - Ra
  • start crawling electronicintifada and nytimes - Ra
  • send email to Gy asking about running multiple crawlers at the same time - Ra
  • send email to Nat about the difference between URLs and new URLs in archive.org data - Ra
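
For the CSV/JSON converter item, Python's standard library may already be enough. The sketch below is only an illustration, assuming flat records that all share the same keys; the function names `csv_to_json` and `json_to_csv` are hypothetical and not part of the existing mediacat code.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, ensure_ascii=False, indent=2)


def json_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text.

    Assumption: every object has the same keys as the first one;
    csv.DictWriter raises ValueError on unexpected extra keys.
    """
    records = json.loads(json_text)
    if not records:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()
```

Nested fields (lists or dicts inside a record) would need to be flattened or serialized before writing to CSV, since this sketch does not handle them.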

crawler

  • download remaining history of Arbutus - hit a "too many files" error - probably need to use rsync
  • trying to shift small domain crawl over to Graham to see if it will work better
  • timesofisrael, israelnationalnews, jpost all still crawling

IA crawler

  • setting up for electronicintifada.net; crawl settings were too fast:
    • previously used a 200 msec delay; after switching to no delay we got blocked
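
One possible way to avoid getting blocked again is to adapt the delay to the responses instead of fixing it at zero. The sketch below is not the IA crawler's actual logic: the function name `next_delay`, the status codes treated as "blocked", and the doubling factor are all assumptions for illustration. Only the 200 msec base comes from the notes above.

```python
# Assumption: the server signals throttling with one of these status codes.
BLOCK_STATUSES = {429, 503}


def next_delay(
    current: float,
    status: int,
    base: float = 0.2,      # 200 msec, the previous delay from the notes
    factor: float = 2.0,    # hypothetical backoff multiplier
    ceiling: float = 60.0,  # hypothetical upper bound in seconds
) -> float:
    """Return the delay (seconds) to wait before the next request."""
    if status in BLOCK_STATUSES:
        # Back off: grow the delay, never starting below the base.
        return min(max(current, base) * factor, ceiling)
    # Recover slowly: shrink toward the base, never below it.
    return max(base, current / factor)
```

A crawl loop would call `time.sleep(delay)` before each request and then update `delay = next_delay(delay, response_status)`, so the delay never drops below the 200 msec that worked before.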

Action Items

  • use rsync to transfer data from Arbutus to Graham: apify storage folder inside small domain folder, to prevent re-crawling the same URLs - Gy
  • nytimes archive crawl with keyword "Middle East" - try to run on Graham cloud - Gy
  • review 2 pull requests before merging - Gy
  • if we don't hear from Nat today, write to IA email address and ask them about error and limits to speed for crawls, and if they can refer us to where documentation exists for crawling - Ra
  • (1) figure out how many unique urls without landing pages; (2) how many are /world/ and /world/middleeast/; (3) see if there is a time-bound characteristic to 2, and start crawl for those urls - Ra
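
The three-part URL count in the last item could be sketched roughly as follows. This is a minimal illustration under several assumptions: the set of "landing page" paths in `LANDING_PATHS` is a guess, URLs are deduplicated by host and path only (query strings and fragments are ignored), and the year is read from a `/YYYY/MM/DD/` path segment if one is present. The name `summarize` is hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: paths treated as landing pages and excluded; adjust as needed.
LANDING_PATHS = {
    "",
    "/world",
    "/world/middleeast",
    "/section/world",
    "/section/world/middleeast",
}
# Assumption: article URLs carry a /YYYY/MM/DD/ segment in the path.
DATE_RE = re.compile(r"/(\d{4})/\d{2}/\d{2}/")


def summarize(urls):
    """Count unique non-landing URLs, /world/ and /world/middleeast/ hits,
    and tally /world/middleeast/ URLs by year."""
    seen = set()
    counts = Counter()
    years = Counter()
    for raw in urls:
        parts = urlsplit(raw.strip())
        path = parts.path.rstrip("/")
        key = (parts.netloc.lower(), path)
        if key in seen or path in LANDING_PATHS:
            continue
        seen.add(key)
        counts["unique"] += 1
        if "/world/" in path:
            counts["world"] += 1
        if "/world/middleeast/" in path:
            counts["world_middleeast"] += 1
            match = DATE_RE.search(path)
            if match:
                years[match.group(1)] += 1
    return counts, years
```

The per-year tally is one simple way to look for the time-bound characteristic mentioned in (3); a finer breakdown by month would only need a wider capture group.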