July 14, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda:
- write to cloud support at Digital Alliance - Gy
- need to troubleshoot on-going crawls - Gy
- troubleshoot crawler problems to set up NYT archive crawl - Gy
- re-run NYT archive Mid E & Israel crawl on Arbutus at a very slow rate - Gy
- if time, download sets of crawled data in both JSON and CSV for FoxNews & WaPo twitter crawls and send link to Alejandro - Gy
- meet Monday at 1pm to talk about postprocessor - Gy/Fr
- continue sorting the different aspects of the button in plotly, and then try with different datasets - Fr
- research d3 graph for possible use for network graph - Fr
- if time, consider KPP dataset from sharepoint - Fr
Postprocessor
- problem is probably the CSV: some tweets contain a newline character, so some records span multiple lines and can't be read properly
- next step: test: "-n" for new line? or remove all extra spaces?
- may not be the only problem: still getting pandas error messages and not getting the expected values
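
One way to test the newline idea is sketched below. It uses Python's standard `csv` module, which reads a quoted field containing a newline as a single record, and then collapses any run of whitespace inside each field to one space. This assumes the multi-line tweet fields are quoted in the CSV; if they are not, the file cannot be repaired this way. The function name `flatten_newlines` and the sample data are hypothetical, not taken from the postprocessor.

```python
import csv
import io


def flatten_newlines(src, dst):
    """Copy CSV rows from src to dst, collapsing whitespace (including
    newlines) inside every field to a single space. Returns the row count."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    count = 0
    for row in reader:
        # str.split() with no arguments splits on any whitespace run,
        # so embedded "\n" and repeated spaces are both removed.
        writer.writerow([" ".join(field.split()) for field in row])
        count += 1
    return count


# Hypothetical sample: the first data record has a newline inside its text field.
sample = 'id,text\n1,"first line\nsecond line"\n2,plain tweet\n'
out = io.StringIO()
rows_written = flatten_newlines(io.StringIO(sample), out)
cleaned = out.getvalue()
print(cleaned)
```

For real files, open both with `newline=""` as the `csv` documentation recommends, so that embedded newlines are passed through to the parser unchanged.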
Crawls & Servers
- Gy wrote to Digital Alliance support & will make the Graham server a backup
- NYT archive crawl suddenly stopped working
- "israel" keyword at 200,000 results - finished
- "Mid E" keyword - need to use different syntax (like quotation marks) to ensure it gets only articles related to Middle East
- Israel news site crawl: working fine
- small domain: data is being corrupted
Visualization
- recreated and backed up plotly modules in jupyterlab
- next step: troubleshoot error where stacked area graph isn't filled in
- next step: add buttons to simplify using different types of charts and graphs
- D3 graph: started to look at documentation, need JSON
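
A minimal sketch of a CSV to JSON conversion for the D3 network graph follows. It assumes an edge-list CSV with `source` and `target` columns (hypothetical column names, since the actual output columns are not listed here) and produces the `nodes`/`links` structure that many D3 force-directed graph examples expect. The function name `edges_csv_to_graph` is also hypothetical.

```python
import csv
import io
import json


def edges_csv_to_graph(src, source_col="source", target_col="target"):
    """Build a {"nodes": [...], "links": [...]} dict from an edge-list CSV."""
    nodes = {}  # dicts keep insertion order, so nodes appear in first-seen order
    links = []
    for row in csv.DictReader(src):
        s, t = row[source_col], row[target_col]
        nodes.setdefault(s, {"id": s})
        nodes.setdefault(t, {"id": t})
        links.append({"source": s, "target": t})
    return {"nodes": list(nodes.values()), "links": links}


# Hypothetical sample data; real column names may differ.
sample = "source,target\nsiteA,siteB\nsiteA,siteC\n"
graph = edges_csv_to_graph(io.StringIO(sample))
graph_json = json.dumps(graph, indent=2)
print(graph_json)
```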
Ongoing tasks:
- check crawl every 2 days - Gy
- update the MVP esp wrt format of data going into postprocessor and coming out, and then as input to the visualization environment - Gy/Fr
- push corrected postprocessor code to master - Gy/Fr
- postprocessor: document with instructions the order of utilities and steps to use the postprocessor - Gy
Action Items:
- postprocessor: 2 issues: (1) newline characters need to be deleted; (2) pandas errors - Gy/Fr
- revisit having a meeting with Nat to go over pandas error - everyone
- rebuild Graham instance - Gy
- check whether storage is being filled with tmp file or actual results - Gy
- look at current NYT archive "Israel" keyword results for date range, and if possible, re-start using only uncrawled date range - Gy
- potentially use a new IP and monitor for 400 errors, slowing down or pausing for a day if getting too many
- re-structure query for "Middle East" to ensure only relevant results are obtained - Gy
- look at small domain crawls to check for corruption - Gy
- visualization: troubleshoot error where stacked area graph isn't filled in - Fr
- visualization: add buttons to simplify using different types of charts and graphs and simplify jupyterlab files for users - Fr
- D3: looking at how to convert csv to JSON - Fr
- look at Colin's and Shengsong's instructions about how Alejandro (user) connects to server and decide whether still best way - Fr
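
The action item about monitoring for 400 errors and slowing down or pausing could look roughly like the sketch below. It is not the crawler's actual logic. It assumes "400 errors" means any 4xx status, that the delay doubles after each such error, and that the crawl pauses for a day after three errors in a row. All names and thresholds (`crawl_with_backoff`, `max_errors`, `base_delay`, `pause`) are hypothetical. The `fetch` and `sleep` functions are passed in so the logic can be tried without making real requests or actually waiting.

```python
import time


def crawl_with_backoff(urls, fetch, max_errors=3, base_delay=1.0,
                       pause=86400.0, sleep=time.sleep):
    """Call fetch(url) for each URL; fetch must return an HTTP status code.

    After each 4xx response the delay before the next request doubles.
    After max_errors consecutive 4xx responses, pause for `pause` seconds
    (one day by default) and reset the delay.
    """
    results = {}
    consecutive_errors = 0
    delay = base_delay
    for url in urls:
        status = fetch(url)
        results[url] = status
        if 400 <= status < 500:
            consecutive_errors += 1
            delay *= 2
            if consecutive_errors >= max_errors:
                sleep(pause)  # too many errors: long pause, then start fresh
                consecutive_errors = 0
                delay = base_delay
                continue
        else:
            consecutive_errors = 0
            delay = base_delay
        sleep(delay)
    return results


# Simulated run: canned status codes and a recorder in place of real sleeping.
statuses = iter([200, 400, 400, 400, 200])
sleeps = []
results = crawl_with_backoff(
    ["u1", "u2", "u3", "u4", "u5"],
    fetch=lambda url: next(statuses),
    sleep=sleeps.append,
)
print(sleeps)
```

This sketch records the failed status and moves on to the next URL; a real crawl would probably also queue the failed URLs for a retry after the pause.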