July 14, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda:

  • write to cloud support at Digital Alliance - Gy
  • need to troubleshoot on-going crawls - Gy
  • troubleshoot crawler problems to set up NYT archive crawl - Gy
  • re-run NYT archive Mid E & Israel crawl using Arbutus, at a very slow crawl rate - Gy
  • if time, download sets of crawled data in both JSON and CSV for FoxNews & WaPo twitter crawls and send link to Alejandro - Gy
  • meet Monday at 1pm to talk about postprocessor - Gy/Fr
  • continue sorting the different aspects of the button in plotly, and then try with different datasets - Fr
  • research d3 graph for possible use for network graph - Fr
    • if time, consider KPP dataset from sharepoint - Fr

Postprocessor

  • problem is probably the csv: some tweets contain a newline character, so some records span multiple lines and therefore can't be read properly
    • next step: test: "-n" for new line? or remove all extra spaces?
  • this may not be the only problem: still getting error messages from pandas, and not getting the expected values
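
One way to test the newline hypothesis is sketched below. It assumes the tweet text is properly quoted in the csv (if the fields are unquoted, a real CSV parser cannot tell a record break from a line break inside a tweet, and this will not help). The function name and the sample data are hypothetical, not taken from the postprocessor. It uses only Python's standard `csv` module to rewrite the file with one physical line per record, so the cleaned file can then be passed to pandas as usual.

```python
import csv
import io


def flatten_embedded_newlines(src, dst):
    """Copy CSV rows from src to dst, replacing line breaks inside fields with a space.

    Returns the number of rows written (including the header row).
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    rows = 0
    for row in reader:
        # splitlines() handles \n, \r\n and \r; joining with a space keeps words apart.
        writer.writerow([" ".join(field.splitlines()) for field in row])
        rows += 1
    return rows


# Hypothetical sample: the second record's text field spans two physical lines.
sample = 'id,text\n1,"single line tweet"\n2,"first line\nsecond line"\n'
cleaned = io.StringIO()
row_count = flatten_embedded_newlines(io.StringIO(sample, newline=""), cleaned)
print(row_count)
print(cleaned.getvalue())
```

With real files, both should be opened with `newline=""` as the `csv` module documentation recommends, so that embedded line breaks reach the parser unchanged.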

Crawls & Servers

  • Gy wrote to Digital Alliance support & will make the Graham server a backup
  • NYT archive crawl suddenly stopped working
    • "israel" keyword at 200,000 results - finished
    • "Mid E" keyword - need to use different syntax (like quotation marks) to ensure it gets only articles related to Middle East
  • Israel news site crawl: working fine
  • small domain: data is being corrupted

Visualization

  • recreated and backed up plotly modules in jupyterlab
  • next step: troubleshoot error where stacked area graph isn't filled in
  • next step: add buttons to simplify using different types of charts and graphs
  • D3 graph: started to look at documentation, need JSON
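
For the D3 item, a minimal sketch of a csv-to-JSON conversion is given below. It assumes the crawl output can be reduced to an edge list with one link per row, and it targets the `{"nodes": [...], "links": [...]}` layout that D3 force-directed graph examples commonly use. The column names `source` and `target` and the sample site names are hypothetical; the actual postprocessor columns may differ.

```python
import csv
import io
import json


def edges_csv_to_d3_json(src, source_col="source", target_col="target"):
    """Build a {"nodes": [...], "links": [...]} dict from an edge-list CSV."""
    nodes = {}  # dicts keep insertion order, so nodes appear in first-seen order
    links = []
    for row in csv.DictReader(src):
        s, t = row[source_col], row[target_col]
        nodes.setdefault(s, {"id": s})
        nodes.setdefault(t, {"id": t})
        links.append({"source": s, "target": t})
    return {"nodes": list(nodes.values()), "links": links}


# Hypothetical edge list: each row is one link from a source to a target.
sample = "source,target\nsiteA,siteB\nsiteA,siteC\nsiteB,siteC\n"
graph = edges_csv_to_d3_json(io.StringIO(sample))
print(json.dumps(graph, indent=2))
```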

Ongoing tasks:

  • check crawl every 2 days - Gy
  • update the MVP esp wrt format of data going into postprocessor and coming out, and then as input to the visualization environment - Gy/Fr
  • push corrected postprocessor code to master - Gy/Fr
  • postprocessor: write instructions documenting the order of utilities and the steps to use the postprocessor - Gy

Action Items:

  • postprocessor: 2 issues: (1) newline characters need to be deleted; (2) pandas errors - Gy/Fr
    • meet Monday at 1pm
  • revisit having a meeting with Nat to go over pandas error - everyone
  • rebuild Graham instance - Gy
  • check whether storage is being filled with tmp file or actual results - Gy
  • look at current NYT archive "Israel" keyword results for date range, and if possible, re-start using only uncrawled date range - Gy
    • potentially use a new IP and monitor for 400 errors, slowing down or pausing for a day if getting too many
  • re-structure query for "Middle East" to ensure only relevant results are obtained - Gy
  • look at small domain crawls to check for corruption - Gy
  • visualization: troubleshoot error where stacked area graph isn't filled in - Fr
  • visualization: add buttons to simplify using different types of charts and graphs and simplify jupyterlab files for users - Fr
  • D3: looking at how to convert csv to JSON - Fr
  • look at Colin's and Shengsong's instructions about how Alejandro (user) connects to server and decide whether still best way - Fr
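
The NYT archive item about monitoring for 400 errors and slowing down or pausing could look roughly like the sketch below. This is not the crawler's actual code: the function name, the parameters, and the decision to treat any 4xx status as an error are all assumptions. `fetch` stands in for whatever the crawler uses to request a page and is expected to return an HTTP status code. The delay doubles after each error, and the loop stops after too many consecutive errors so that the remaining URLs can be retried later (for example, the next day).

```python
import time


def crawl_with_backoff(urls, fetch, sleep=time.sleep,
                       base_delay=5.0, max_delay=3600.0, max_errors=5):
    """Fetch urls in order, slowing down after 4xx responses.

    Returns (done, failed, remaining): URLs fetched successfully, URLs that
    returned a 4xx status, and URLs not attempted because the crawl stopped.
    """
    delay = base_delay
    consecutive_errors = 0
    done, failed = [], []
    for i, url in enumerate(urls):
        status = fetch(url)
        if 400 <= status < 500:
            failed.append(url)
            consecutive_errors += 1
            if consecutive_errors >= max_errors:
                # Too many errors in a row: stop and leave the rest for later.
                return done, failed, list(urls[i + 1:])
            delay = min(delay * 2, max_delay)  # slow down after an error
        else:
            done.append(url)
            consecutive_errors = 0
            delay = base_delay  # back to the normal pace after a success
        sleep(delay)
    return done, failed, []


# Dry run with a fake fetch and a recorded sleep, so nothing is requested or delayed.
statuses = {"a": 200, "b": 429, "c": 429, "d": 200}
delays = []
done, failed, remaining = crawl_with_backoff(
    ["a", "b", "c", "d"], fetch=statuses.get, sleep=delays.append,
    base_delay=1.0, max_errors=2)
print(done, failed, remaining, delays)
```

Passing `sleep` as a parameter keeps the pacing logic testable without waiting. In a real crawl, the default `time.sleep` would apply the delays.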