Nov 30, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • IA NYT crawl: (1) figure out how many unique URLs there are, excluding landing pages; (2) count how many are under /world/ and /world/middleeast/; (3) see whether the URLs from (2) have a time-bound characteristic, and start the crawl for those URLs - Ra
  • continue dialogue with IA: (1) bulk download, (2) ask about rate of calls - Ra
  • update documentation on JSON/CSV conversion - Ar
  • develop unit testing for foxnews postprocessed results, for example, on text alias & hyperlinks - Ar
  • send small Wa/Po twitter files into postprocessor to troubleshoot error - Fr
  • do test on postprocessor with sample IA results (about 36,000) - Ar
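
The URL counts in the first agenda item could be sketched as below. This is only an illustration under stated assumptions: it treats any URL without a /YYYY/MM/DD/ date in its path as a landing page, it ignores query strings when deciding whether two URLs are the same, and the function name `summarize` is hypothetical rather than part of the MediaCAT code.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: article URLs carry a /YYYY/MM/DD/ date in the path;
# anything without one is counted as a landing page.
DATE_RE = re.compile(r"/(\d{4})/\d{2}/\d{2}/")


def summarize(urls):
    """Count unique URLs, split by landing page vs. article and by section."""
    seen = set()
    counts = Counter()
    by_year = Counter()  # /world/ articles per year, to look for time patterns
    for raw in urls:
        parts = urlsplit(raw.strip())
        path = parts.path.rstrip("/") + "/"
        key = (parts.netloc.lower(), path)  # query string is ignored
        if key in seen:
            continue
        seen.add(key)
        match = DATE_RE.search(path)
        if match is None:
            counts["landing"] += 1
            continue
        counts["articles"] += 1
        if "/world/" in path:
            counts["world"] += 1
            by_year[match.group(1)] += 1
        if "/world/middleeast/" in path:
            counts["world_middleeast"] += 1
    return counts, by_year
```

The per-year counter is one simple way to check whether the /world/ URLs cluster in particular periods before starting a crawl for them.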

Crawls

  • Graham: error messages once again, followed by a connection timeout -- we will stop using Graham
  • Arbutus: strange error suddenly -- checking with Raazia

Internet Archive

  • getting a new error; could it be that we're blocked on some servers and not others?

Action Items

  • contact Nat about new kind of error with Arbutus cloud - Gy
  • check whether any Arbutus settings might have changed - Ra
  • compare old results of timesofisrael to new results and eliminate duplicates - Gy
  • continue checking on the issue with getting electronicintifada from IA - Ra
  • try to set up crawl of jadaliyya.com on IA - Ra
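
Comparing old and new timesofisrael results and eliminating duplicates could look roughly like the sketch below. It assumes each result is a dict with a `url` field and that two records are duplicates when their URLs match after trimming whitespace and a trailing slash; both the field name and the function name `merge_without_duplicates` are hypothetical, not taken from the actual crawler output.

```python
def merge_without_duplicates(old_records, new_records, key="url"):
    """Merge two result sets, keeping the newer record when URLs collide."""
    merged = {}
    duplicates = []
    for record in list(old_records) + list(new_records):
        # Assumption: URLs differing only by whitespace or a trailing
        # slash refer to the same article.
        normalized = record[key].strip().rstrip("/")
        if normalized in merged:
            duplicates.append(normalized)
        merged[normalized] = record  # later (newer) record wins
    return list(merged.values()), duplicates
```

Returning the list of duplicated URLs alongside the merged records makes it easy to spot-check what was dropped.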