June 23, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- set up NYT archive (Middle East - Israel - Palestinians) crawl - Gy
- see if the NYT archive (politics) crawl results are stored somewhere - Gy
- write to Shengsong to see if he downloaded full results for NYT Mid E archive crawl - Al
- set up domain crawler on server - Fr
- check jewishjournal data to see if corrupted - Gy
- work on visualizations with sample data and prepare demo - Fr
- attempt postprocessor on sample from Foxnews Twitter data, check Jerusalem Post or Times of Israel - Gy
- if working: run postprocessor on found NYT results, March 25, 2023
Crawls & Postprocessing
- set up NYT archive (Mid E) crawl; about 40,000 results so far
- couldn't find NYT archive (politics) results; will ask Shawn
- jewishjournal count: gives different numbers depending on when the count is run; the data may be partially corrupted
- Postprocessor: keeps returning 0 results even though the input data has results, for both the Twitter data and the NYT data
Visualizations
- make sure columns are correct
Ongoing tasks:
- check crawl every 2 days - Gy
- update the MVP, especially the format of data going into and coming out of the postprocessor, and then as input to the visualization environment - Gy/Fr
Action Items:
- ask Shawn about NYT archive (politics) - Gy
- set up domain crawler on server - Fr
- make a new tab in the crawl index spreadsheet to track the numbers for the small domain crawl - Al
- add numbers for each domain to new tab - Gy
- postprocessor: continue to troubleshoot, especially looking at formatting, and ask Shawn - Gy
- postprocessor: document, with instructions, the order of utilities and the steps to use the postprocessor - Gy
- assess whether MatPlot has features that would make building the UI faster - Fr
- use existing data to test on the JupyterLab platform - Fr
- research whether there are existing libraries with a friendlier UI than JupyterLab - Fr
- write to Kirsta and Nat about the use of JupyterLab - Al
- meet on Monday, June 26th, 11am Toronto time to talk about postprocessor - Gy/Fr