June 16, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • logistics: timesheets and hours worked?
  • check numbers for the small domains 2022 and Israeli domains 2023 crawls, broken down by crawled domain - Gy
  • write to Shawn to check in about the video and alternatives to VS Code, and ask for another meeting on Monday, especially to look at the postprocessor - Gy
  • try to use Terminal to set up an instance/crawl - Gy
  • search server storage for the string "/media/data/Post_processor" to see if data are housed elsewhere - Gy
  • compare full crawl for NYT archive (Middle East/Palestinian/Israel keyword) with problem years (1979-1981 and 2006-2011) to see if the low numbers after the postprocessor seem correct - Gy
  • continue reading through JupyterLab and Shengsong's code for the visualization environment - Fr
  • start trying to use existing JupyterLab environments - Fr
  • create a new repository for the visualization environment and add documentation - Fr
  • with Shawn's help or without, try setting up a new instance - Gy & Fr
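
For the storage search item above, a minimal sketch of one way to look for the string in file contents. The function name `find_files_containing`, the size cap, and the choice to scan file contents rather than path names are all assumptions for illustration; the notes do not say how the search should be run.

```python
import os

# String named in the agenda item.
NEEDLE = "/media/data/Post_processor"

def find_files_containing(root, needle=NEEDLE, max_bytes=5_000_000):
    """Return sorted paths of files under `root` whose contents include `needle`.

    Hypothetical helper: files larger than `max_bytes` are skipped so that
    large crawl outputs do not slow the scan down.
    """
    target = needle.encode("utf-8")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) > max_bytes:
                    continue  # raise max_bytes to include large files
                with open(path, "rb") as fh:
                    if target in fh.read():
                        hits.append(path)
            except OSError:
                continue  # unreadable file (permissions, broken symlink)
    return sorted(hits)
```

Calling `find_files_containing("/some/start/dir")` returns the matching paths; the start directory is left to whoever runs it, since the notes do not name one.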

Crawl

Server

  • numbers on existing crawls?
    • possible problem with the URL count -- error appears only with jewishjournal -- data corruption?
  • use terminal
    • using both VS Code and terminal interchangeably
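
To cross-check the per-domain numbers discussed above, a rough sketch of counting URLs by domain. It assumes the crawled URLs are available as a plain list of strings; the notes do not describe the crawl output format, and `count_by_domain` is a hypothetical helper name.

```python
from collections import Counter
from urllib.parse import urlsplit

def count_by_domain(urls):
    """Count URLs per host, treating 'www.example.com' and 'example.com' as one."""
    counts = Counter()
    for url in urls:
        try:
            host = urlsplit(url.strip()).hostname
        except ValueError:
            host = None  # malformed URL, e.g. a broken IPv6 literal
        if not host:
            counts["(no host)"] += 1  # possible sign of corrupted rows
            continue
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    return counts
```

A large `"(no host)"` bucket, or a domain whose count differs sharply from the reported number, would be a place to start looking for corrupted data.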

Visualization environment

  • able to access existing visualization JupyterLab environments?
    • Colin's instructions useful, but preferred Shengsong's
    • Shengsong's was very easily modifiable
    • problem -- need results
  • create a repository and documentation? give us a tour?

Ongoing task:

  • check crawl every 2 days - Gy

Action Items:

  • set up NYT archive (Middle East - Israel - Palestinians) crawl - Gy
  • see if NYT archive (politics) crawl results are stored somewhere - Gy
  • write to Shengsong to see if he downloaded full results for the NYT Middle East archive crawl - Al
  • set up domain crawler on server - Fr
  • check jewishjournal data to see if corrupted - Gy
  • work on visualizations with sample data and prepare demo - Fr
  • attempt postprocessor on a sample from Fox News Twitter data, check Jerusalem Post or Times of Israel - Gy
    • if working:
      • run postprocessor on found NYT results, March 25, 2023