Aug 18, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda:

  • troubleshoot the postprocessor results to see why they aren't accurate - Fr
  • add comments to postprocessor - Fr
  • if time allows, run postprocessor on Fox News and Washington Post Twitter crawls - Fr
  • document issue of deleting instance without detaching storage - Gy
  • try new crawl techniques as best as possible and experiment with new IP address - Gy
  • with new instance and new code, try NYT archive crawl - Gy
  • attempt to create instance and then attach storage - Gy

Postprocessor

  • URL expander wasn't working; it is now fixed
    • URL expander: the JavaScript was updated and then ran into bugs
  • added documentation to header cleaner: give it a directory of files and it outputs a new directory with all the cleaned files
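
The directory-in, directory-out behaviour described for the header cleaner could look roughly like the sketch below. This is not the actual postprocessor code: `clean_text` and `clean_directory` are hypothetical names, and the cleaning rule (trimming whitespace around comma-separated fields on the first line) is only an assumed placeholder for whatever the real cleaner does.

```python
from pathlib import Path


def clean_text(text: str) -> str:
    """Placeholder cleaning rule (an assumption, not the real header cleaner).

    Trims whitespace around each comma-separated field on the first line
    and leaves every other line unchanged.
    """
    lines = text.splitlines()
    if not lines:
        return text
    lines[0] = ",".join(field.strip() for field in lines[0].split(","))
    return "\n".join(lines) + "\n"


def clean_directory(src: Path, dst: Path) -> int:
    """Write a cleaned copy of every file in src into dst; return the file count."""
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.iterdir()):
        if path.is_file():
            cleaned = clean_text(path.read_text(encoding="utf-8"))
            (dst / path.name).write_text(cleaned, encoding="utf-8")
            count += 1
    return count
```

The input directory is never modified, so a bad cleaning run can be discarded by deleting the output directory and running again.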

Crawls and Server

  • managed to delete the volume
  • crawls: getting blocked very quickly, so now using new code which stops the crawl after 3 days for a 1-day break
    • will put all the crawls on new code
    • will consult with Nat
  • error message about corruption: disk image malformed -- suspect that it's due to Apify itself
    • the error message mentions a problem with Apify
    • will consult with Nat about the error and also about the possibility of copying/pasting the Apify folder
      • if the Apify folder is present, the crawler does not re-crawl the URLs; if it isn't present, it re-does the whole crawl
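
The two behaviours noted above (a 3-day crawl followed by a 1-day break, and resuming only when the Apify folder exists) can be sketched as follows. This is a minimal illustration, not the crawler's code: `crawl_phase`, `will_resume`, and the constants are hypothetical names, and the assumption is that the cycle simply repeats from the crawl's first start time.

```python
from datetime import timedelta
from pathlib import Path

# Assumed cycle, taken from the notes: crawl for 3 days, then pause for 1 day.
RUN_PERIOD = timedelta(days=3)
BREAK_PERIOD = timedelta(days=1)


def crawl_phase(elapsed: timedelta) -> str:
    """Return "crawl" or "break" for the time elapsed since the first start."""
    position = elapsed % (RUN_PERIOD + BREAK_PERIOD)
    return "crawl" if position < RUN_PERIOD else "break"


def will_resume(storage_dir: Path) -> bool:
    """True if the storage folder exists, i.e. the crawl would pick up where
    it left off; False means the whole crawl would be redone."""
    return storage_dir.is_dir()
```

For example, `crawl_phase(timedelta(days=3, hours=12))` falls in the break, and the crawl starts again once four full days have elapsed.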

On-going task:

  • check crawl every 2 days - Gy
  • update the MVP, especially with respect to the format of data going into and coming out of the postprocessor, and then as input to the visualization environment - Gy/Fr
  • push corrected postprocessor code to master - Gy/Fr
  • postprocessor: write instructions documenting the order of utilities and the steps to use the postprocessor - Gy
  • backburner: figure out corruption in small domain crawl

Action Items:

  • troubleshoot the postprocessor results to see why they aren't accurate
  • run postprocessor on Fox News and Washington Post Twitter crawls