June 9, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda:

  • do the two mandatory trainings
  • set up instance of mediacat domain crawler
    • Gy to email Shawn to ask for a demo on setting up crawls (domain and twitter) and on running the postprocessor, Monday at 5:30pm
  • check on running crawls every 2-3 days - Gy
    • figure out how to count total URLs crawled
  • look into above NYT archive crawl to see if there were crawl errors - Gy
  • start looking at Shengsong (Charles Xu) jupyterlab environment - Francisco
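
For the "count total URLs crawled" item, a minimal sketch of one way to do it. It assumes the crawl output can first be reduced to a plain list of URL strings; the notes do not record the crawler's actual output format, so `count_urls_by_domain` and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_urls_by_domain(urls):
    """Return (number of unique URLs, Counter of unique URLs per domain).

    `urls` is any iterable of URL strings. How to extract those strings
    from the crawler's output files is an assumption left to the caller.
    """
    # Drop blanks and exact duplicates so re-crawled pages are counted once.
    unique = {u.strip() for u in urls if u and u.strip()}
    per_domain = Counter()
    for u in unique:
        host = urlsplit(u).hostname or "(no host)"
        # Treat www.example.com and example.com as the same domain.
        if host.startswith("www."):
            host = host[4:]
        per_domain[host] += 1
    return len(unique), per_domain


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        "https://www.example.com/a",
        "https://example.com/b",
        "https://example.com/b",
        "https://news.example.org/x",
    ]
    total, per_domain = count_urls_by_domain(sample)
    print("total unique URLs:", total)
    for domain, n in per_domain.most_common():
        print(domain, n)
```

The per-domain counter also covers the "check numbers by crawled domain" item under action items.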

Logistics:

  • mandatory training
  • second training: haven't received the email yet

Server/set up

  • error connecting to Graham through vscode - will try to meet again and work through arbutus
    • much easier for starting crawls
  • for restarting crawls it's easy with terminal
  • Francisco: tried on laptop but couldn't due to issues with libraries

running crawls

  • we see the small domain and Israeli domains are running

NYT archive crawl?

  • compare full crawl with problem years (1979-1981 and 2006-2011) to see if low numbers after postprocessor seem correct
  • look at problem with keyword politics output

Visualization/jupyterlab

action items

  • check numbers on small domains 2022 and Israeli domains 2023 crawls, by crawled domain - gy
  • write to Shawn to check in about the video and alternatives to VScode, and ask for another meeting on Monday to look especially at the postprocessor - Gy
  • try to use Terminal to set up an instance/crawl - Gy
  • search server storage for string "/media/data/Post_processor" to see if data are housed elsewhere - gy
  • compare full crawl for NYT archive (mid east/palestinian/israel keyword) with problem years (1979-1981 and 2006-2011) to see if low numbers after postprocessor seem correct - gy
  • continue reading through jupyterlab and Shengsong code for visualization environment - fr
  • start trying to use existing jupyterlab environments - fr
  • create a new repository for visualization environment and add documentation - fr
  • with Shawn's help or without, try setting up a new instance - gy & fr
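
For the "search server storage for string" item, a minimal sketch of a recursive text search in Python (roughly what a recursive `grep` would do). The function name `find_string`, the size cap, and the root directory in the usage line are assumptions; only the search string comes from these notes.

```python
import os


def find_string(root, needle, max_bytes=5_000_000):
    """Yield (path, line_number, line) for each line under `root` containing `needle`.

    Files larger than `max_bytes` are skipped so that large data dumps
    do not slow the search down (the cap is an arbitrary assumption).
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) > max_bytes:
                    continue
                # errors="ignore" lets the scan pass over binary files.
                with open(path, "r", encoding="utf-8", errors="ignore") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        if needle in line:
                            yield path, lineno, line.rstrip("\n")
            except OSError:
                # Unreadable file or broken symlink: skip it.
                continue


if __name__ == "__main__":
    # "." is a placeholder root; point it at the directory to search.
    for path, lineno, line in find_string(".", "/media/data/Post_processor"):
        print(f"{path}:{lineno}: {line}")
```

Hits in config files or scripts would show which code still points at that path, which may help locate where the data are housed.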