March 2, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • check logs to see which domains are giving us the rejections
  • slow down restarted crawl
  • Postprocessor
  • look into email function
  • add Irfan's and Alejandro's key pairs
  • Alejandro: change password
  • add more logging of errors to postprocessor
  • Alejandro: come up with domain crawl scope
  • Alejandro: write to security person about adding Shawn to Graham

Server Access

  • did changing password do anything?
    • can't see dashboard but can ssh into Graham
  • Irfan and Alejandro added to Arbutus

small domain crawl

  • issues: tabletmag is causing problems (403); occasionally electronicintifada as well (but it doesn't give a 403)
    • aljazeera.com: timeouts (page not loading)
    • slowed down the crawl to 1 URL per minute on average; seems to be better
  • email issue: authentication tokens needed to send email
    • seems to know the reason: tokens get invalidated somehow
    • more complicated than realized
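
To check which domains are rejecting the crawler, a tally along these lines could help. This is only a rough sketch: the crawler's real log format is not recorded in these notes, so it assumes the log can be reduced to (url, status) pairs. The function name `count_rejections` and the sample URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def count_rejections(records, rejected_statuses=(403,)):
    """Count rejected responses per domain.

    `records` is an iterable of (url, status) pairs. This shape is an
    assumption; the actual crawl logs may need parsing to get there.
    """
    counts = Counter()
    for url, status in records:
        if status in rejected_statuses:
            domain = urlparse(url).netloc.lower()
            # Treat www.example.com and example.com as the same domain.
            if domain.startswith("www."):
                domain = domain[4:]
            counts[domain] += 1
    return counts


# Illustrative records only, not real crawl output.
sample = [
    ("https://www.tabletmag.com/a", 403),
    ("https://www.tabletmag.com/b", 403),
    ("https://electronicintifada.net/c", 200),
    ("https://www.aljazeera.com/d", 200),
]

# Domains with the most rejections come first.
print(count_rejections(sample).most_common())
```

Timeouts (as with aljazeera.com) would not show up as a status code, so they would need to be logged and counted separately.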

postprocessor

  • tried a different way of combining crawl output; got a new error
    • issue seems to be the bigger file
  • more logging of errors: revisit once we get it working
  • will try to contact Shengsong

Action Items:

  • add Irfan and Alejandro key pairs to Graham
  • Delete older key pairs from earlier devs
  • Alejandro: Excel sheet that Shawn sent
  • remove tabletmag from small domain crawl, update the index to reflect this, and make a separate pathway in storage
  • contact Shengsong about preparing Twitter crawl for postprocessing
  • set up Israeli newspaper crawl

backburner

  • email issue for domain crawl break.