June 1, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • logistics
    • communications
    • hours
  • server and installation
    • able to access current crawls?
    • reading code?
    • postprocessor
    • other questions for Shawn
  • nytimes archive crawl issues
  • visualization environment - soon

Logistics

Server & Basic operations

  • access to Arbutus has been granted; Gy is able to log in, Francisco will confirm that he can too
  • Shawn did not show how to set up an instance; he walked through the different repositories and their READMEs instead

nytimes

  • the nytimes archive crawl (Mid E/Israel/Palestinians) has two ranges of years (1979-1981, 2006-2011) with fewer results than the surrounding years; unclear whether this is a crawl error or a postprocessor error
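
One way to check which years are unusually low is to count results per year and compare each year against the median. The snippet below is only a sketch: it assumes the publication year of each result has already been extracted into a list, and the function name `flag_low_years` and the 0.5 cutoff are illustrative choices, not part of mediacat.

```python
from collections import Counter
from statistics import median


def flag_low_years(years, start, end, threshold=0.5):
    """Return the years in [start, end] whose result count falls below
    threshold * (median yearly count). Years with no results count as 0.

    `years` is a list with one entry per result (hypothetical input;
    how the year is read from the crawl output is not shown here).
    """
    counts = Counter(years)
    per_year = {y: counts.get(y, 0) for y in range(start, end + 1)}
    typical = median(per_year.values())
    return [y for y, n in per_year.items() if n < threshold * typical]


if __name__ == "__main__":
    # Toy data only, not real crawl counts.
    sample = [1978] * 10 + [1979] * 2 + [1980] * 10 + [1982] * 9
    print(flag_low_years(sample, 1978, 1982))
```

Running the same count on the raw crawl output and on the postprocessor output would show at which stage the results drop.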

Action Items

  • complete the two mandatory trainings
  • set up an instance of the mediacat domain crawler
    • Gy to email Shawn to ask for a demo on setting up a crawl (domain and Twitter) and on running the postprocessor, on Monday at 5:30pm
  • check on running crawls every 2-3 days - Gy
    • figure out how to count total URLs crawled
  • look into the NYT archive crawl noted above to see whether the low-result years come from crawl errors - Gy
  • start looking at Shengsong's (Charles Xu) JupyterLab environment - Francisco
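
For the "count total URLs crawled" item, a possible starting point is sketched below. It assumes the crawler writes one JSON file per crawled page into a single output directory and that each file has a `url` field; both are assumptions to check against the actual crawl output, and `count_crawled_urls` is a hypothetical helper name.

```python
import json
from pathlib import Path


def count_crawled_urls(output_dir, url_field="url"):
    """Count distinct URLs across the *.json files in output_dir.

    Assumes one JSON object per file with a `url_field` key
    (an assumption about the output format, not confirmed).
    """
    urls = set()
    for path in Path(output_dir).glob("*.json"):
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
        except (ValueError, OSError):
            # Skip files that are unreadable or only partially written.
            continue
        if isinstance(record, dict) and record.get(url_field):
            urls.add(record[url_field])
    return len(urls)
```

Because a running crawl may still be writing files, the count is a snapshot; repeating it at each 2-3 day check gives a rough view of progress.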