July 28, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • take a look at the brake in the domain crawler and read through - Gy
  • look at date range for NYT - Israel & Palestine and send email to Alejandro about what is included - Gy
  • look at which folder is taking up most memory on server - Gy
  • attempt to combine stealthy crawl on NYT archive - Gy
  • continue rebuilding the Graham instance - Gy
  • visualization: troubleshoot error where stacked area graph isn't filled in - Fr
  • visualization: add buttons to simplify using different types of charts and graphs, and simplify JupyterLab files for users - Fr
  • D3: look at how to convert CSV to JSON - Fr
  • look at Colin's and Shengsong's instructions about how Alejandro (user) connects to the server and decide whether that is still the best way - Fr

Crawl

  • Brake:
    • wasn't on the NYT crawl; still has flaws; looking at proxy rotation; there may have been changes on the website
    • IP rotation is next step
  • Israeli domains crawl continues well
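
The IP rotation step noted under Brake could be sketched roughly as below. This is only an illustration of round-robin rotation, not the crawler's actual code: the `ProxyRotator` class, its method names, and the proxy addresses are all hypothetical, and the notes do not say how the crawler would consume the chosen proxy.

```python
class ProxyRotator:
    """Hypothetical helper: hand out proxies round-robin, skipping blocked ones."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        if not self._proxies:
            raise ValueError("at least one proxy is required")
        self._blocked = set()
        self._index = 0

    def mark_blocked(self, proxy):
        # Call this when a request through `proxy` is refused or rate-limited.
        self._blocked.add(proxy)

    def next_proxy(self):
        # Try each proxy at most once per call so we never loop forever.
        for _ in range(len(self._proxies)):
            proxy = self._proxies[self._index]
            self._index = (self._index + 1) % len(self._proxies)
            if proxy not in self._blocked:
                return proxy
        raise RuntimeError("all proxies are marked as blocked")


if __name__ == "__main__":
    # Placeholder addresses, not real proxies.
    rotator = ProxyRotator([
        "http://proxy-a.example:8080",
        "http://proxy-b.example:8080",
    ])
    for _ in range(3):
        print(rotator.next_proxy())
```

Each request would take the next proxy in turn, and a proxy that gets refused can be taken out of rotation with `mark_blocked`.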

Postprocessor

  • after the meeting with Nat, could get a small selection, but reading stopped after a selection of 500
  • could be an issue with formatting; maybe build a custom dataframe rather than using pandas
  • making changes to the postprocessor, including developer-friendly changes with documentation
  • error: tokenizing error at end of line while reading the CSV file; could be one record that isn't formatted properly, or a hidden character
    • one approach: replace the default pandas reader with a custom one written from scratch
    • challenge: not sure how much work producing a custom dataframe would take
    • another approach is to modify the files: a tool that properly formats and cleans up the CSV before giving it to the postprocessor
    • challenge: will this modify the data?
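
Before choosing between a custom reader and a cleanup tool, it may help to locate the offending record without changing the files. The sketch below uses Python's standard `csv` module to flag rows whose field count differs from the header, or that contain non-printable characters. The function name `find_bad_rows` and the sample columns are assumptions for illustration, not part of the postprocessor, and the actual cause of the tokenizing error may be something this check does not catch (for example, an unbalanced quote).

```python
import csv
import io


def find_bad_rows(text, delimiter=","):
    """Return (line_number, issue) pairs for rows that look malformed.

    Read-only check: the input text is never modified.
    """
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    problems = []
    for row in reader:
        issues = []
        if len(row) != expected:
            issues.append(f"expected {expected} fields, got {len(row)}")
        # Flag control or zero-width characters; newlines and tabs are allowed.
        if any(
            ch not in "\n\t" and not ch.isprintable()
            for field in row
            for ch in field
        ):
            issues.append("hidden character")
        if issues:
            # line_num is the last physical line read for this record.
            problems.append((reader.line_num, "; ".join(issues)))
    return problems


if __name__ == "__main__":
    # Hypothetical sample; the real crawl output has different columns.
    sample = (
        "url,title,date\n"
        "a.example,Hello,2023-07-01\n"
        "b.example,Bad,row,2023-07-02\n"
    )
    for line_number, issue in find_bad_rows(sample):
        print(line_number, issue)
```

Because this only reports line numbers, it sidesteps the concern about modifying the data; any cleanup could then be limited to the specific records it flags.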

Ongoing tasks:

  • check crawl every 2 days - Gy
  • update the MVP esp wrt format of data going into postprocessor and coming out, and then as input to the visualization environment - Gy/Fr
  • push corrected postprocessor code to master - Gy/Fr
  • postprocessor: document with instructions the order of utilities and steps to use the postprocessor - Gy
  • backburner: figure out corruption in small domain crawl

Action Items:

  • upload NYT archive crawler with brake as separate branch and document what the difference is with the earlier version - Gy
  • speed up small domain crawl a bit - Gy
  • do a count of the Israeli domain crawl - Gy
  • crawl of NYT "Israel" for the years 2006-2009, and use article filter - Gy
  • continue with the postprocessor - Fr