July 21, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • postprocessor: 2 issues: (1) newline characters need to be deleted; (2) pandas errors - Gy/Fr
    • meet Monday at 1pm
  • revisit having a meeting with Nat to go over pandas error - everyone
  • rebuild Graham instance - Gy
  • check whether storage is being filled with tmp file or actual results - Gy
  • look at current NYT archive "Israel" keyword results for date range, and if possible, restart using only the uncrawled date range - Gy
    • potentially use a new IP and monitor for 400 errors, slowing down or pausing for a day if there are too many
  • re-structure query for "Middle East" to ensure only relevant results are obtained - Gy
  • look at small domain crawls to check for corruption - Gy
  • visualization: troubleshoot error where stacked area graph isn't filled in - Fr
  • visualization: add buttons to simplify using different types of charts and graphs and simplify jupyterlab files for users - Fr
  • D3: looking at how to convert csv to JSON - Fr
  • look at Colin's and Shengsong's instructions about how Alejandro (user) connects to server and decide whether still best way - Fr
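
For the D3 item above, one possible way to turn a CSV file into JSON is sketched below. It uses only the Python standard library and produces an array of objects, one per row. The column names (date, domain, hits) and the sample rows are made up for illustration and are not taken from the actual postprocessor output.

```python
import csv
import io
import json


def csv_to_records(src):
    """Read CSV text from a file-like object into a list of dicts, one per row."""
    return list(csv.DictReader(src))


# Hypothetical sample data; the real column names may differ.
sample = (
    "date,domain,hits\n"
    "2023-07-01,nytimes.com,12\n"
    "2023-07-02,nytimes.com,7\n"
)

records = csv_to_records(io.StringIO(sample))

# Serialize as a JSON array of objects.
json_text = json.dumps(records, indent=2)
print(json_text)
```

Note that csv.DictReader keeps every value as a string, so numeric columns would still need to be converted, either here or on the JavaScript side.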

Postprocessor

  • troubleshooting:
    • deleted the newline character, but then got different errors about columns
  • problem with Python file processor_twitter.py: it expects Twitter data that was structured differently from the data we have now
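
A minimal sketch of the newline clean-up discussed above, assuming the problem is newline characters embedded inside quoted CSV fields. It uses the standard csv module rather than the postprocessor's own code, and the sample data and function name are hypothetical.

```python
import csv
import io


def strip_embedded_newlines(src, dst):
    """Copy CSV rows from src to dst, replacing line breaks inside fields with spaces."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        # splitlines() drops the line breaks; join the pieces back with a space.
        writer.writerow([" ".join(field.splitlines()) for field in row])


# Hypothetical input: the second row has a newline inside a quoted field.
raw = 'id,text\n1,"first line\nsecond line"\n2,plain\n'

cleaned = io.StringIO()
strip_embedded_newlines(io.StringIO(raw), cleaned)
print(cleaned.getvalue())
```

Cleaning at the CSV level keeps the number of columns per row unchanged, which may help separate the newline issue from the column errors that appeared afterwards.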

Crawler

  • NYT archive crawl is still having errors; stealth mode may help, need to look at whether it is possible to integrate
  • Israel domain crawls are still going
  • small domain crawl is having trouble due to apify errors, perhaps run crawls separately?
  • tmp folder isn't taking up a lot of disk space
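
The agenda idea of monitoring for 400 errors and slowing down or pausing could be sketched as below. This is not part of the crawler: the class name, window size, ratios, and delays are all assumptions chosen for illustration.

```python
from collections import deque


class ErrorRateThrottle:
    """Suggest a delay based on the share of HTTP 400 responses in a recent window."""

    def __init__(self, window=50, slow_ratio=0.1, pause_ratio=0.5,
                 base_delay=1.0, slow_delay=30.0, pause_delay=24 * 3600.0):
        # All defaults are illustrative assumptions, not tuned values.
        self.recent = deque(maxlen=window)
        self.slow_ratio = slow_ratio
        self.pause_ratio = pause_ratio
        self.base_delay = base_delay
        self.slow_delay = slow_delay
        self.pause_delay = pause_delay

    def record(self, status_code):
        """Remember whether the latest response was a 400."""
        self.recent.append(status_code == 400)

    def next_delay(self):
        """Return the number of seconds to wait before the next request."""
        if not self.recent:
            return self.base_delay
        ratio = sum(self.recent) / len(self.recent)
        if ratio >= self.pause_ratio:
            return self.pause_delay  # too many errors: pause for a day
        if ratio >= self.slow_ratio:
            return self.slow_delay   # some errors: slow down
        return self.base_delay


throttle = ErrorRateThrottle(window=10)
for code in [200] * 10:
    throttle.record(code)
delay_when_healthy = throttle.next_delay()

for code in [400, 400]:
    throttle.record(code)
delay_when_some_errors = throttle.next_delay()

for code in [400] * 4:
    throttle.record(code)
delay_when_many_errors = throttle.next_delay()
```

The crawl loop would call record() after each response and sleep for next_delay() seconds before the next request.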

Ongoing tasks:

  • check crawl every 2 days - Gy
  • update the MVP, especially with respect to the format of data going into and coming out of the postprocessor, and then as input to the visualization environment - Gy/Fr
  • push corrected postprocessor code to master - Gy/Fr
  • postprocessor: write instructions documenting the order of utilities and the steps for using the postprocessor - Gy
  • backburner: figure out corruption in small domain crawl

Action Items:

  • take a look at the break in the domain crawler and read through - Gy
  • look at date range for NYT - Israel & Palestine and send email to Alejandro about what is included - Gy
  • look at which folder is taking up the most disk space on server - Gy
  • attempt to combine stealthy crawl on NYT archive - Gy
  • continue rebuilding Graham instance - Gy
  • visualization: troubleshoot error where stacked area graph isn't filled in - Fr
  • visualization: add buttons to simplify using different types of charts and graphs and simplify jupyterlab files for users - Fr
  • D3: looking at how to convert csv to JSON - Fr
  • look at Colin's and Shengsong's instructions about how Alejandro (user) connects to server and decide whether still best way - Fr
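
For the action item about finding which folder takes up the most space on the server, a rough sketch using only the Python standard library is shown below. The function names are hypothetical, and the root path to inspect would need to be chosen on the server.

```python
import os


def directory_sizes(root):
    """Return {subdirectory name: total bytes of regular files beneath it}."""
    sizes = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    if os.path.islink(path):
                        continue  # skip symlinks so files are not counted twice
                    try:
                        total += os.path.getsize(path)
                    except OSError:
                        pass  # file vanished or is unreadable; ignore it
            sizes[entry.name] = total
    return sizes


def largest_first(sizes):
    """Sort (name, bytes) pairs from largest to smallest."""
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Inspect the current directory by default.
    for name, size in largest_first(directory_sizes(".")):
        print(f"{size:>15,d}  {name}")
```

This reports file sizes rather than blocks actually allocated on disk, so the totals may differ somewhat from what a tool like du shows.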