Aug 11, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda:

  • develop script and documentation to remove extra header lines from twitter crawl output prior to postprocessing - Fr
  • check whether the URL extender is the most up-to-date version - Fr
  • run URL extender on test twitter crawl output (~23,000) and run postprocessor on the resulting output - Fr
  • check results of postprocessor on test data - Al
  • if results work, run URL extender on all twitter crawl (Fox News and Washington Post, keeping separate) and postprocess - Fr
  • check if new IP address created with new instance - Gy
  • pause Israeli domain crawl while testing other crawl technique - Gy
  • set up individual crawls for Israeli domains to test crawl technique, and check regularly to see if multiple errors have caused a break - Gy
  • if new IP address is created with new instance, try NYT archive crawl - Gy

Postprocessor:

  • problem with the postprocessed results: haaretz in crawler output and strange results in postprocessor output
  • developed script and documentation for removal of extra header lines from twitter crawl output
  • fixed the warning messages from running postprocessor
  • cleaned up the postprocessor documentation
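
The header-removal step noted above could look roughly like the sketch below. It is not the project's actual script. It assumes the twitter crawl output is a line-oriented file (for example CSV) whose first line is the header, and that the extra header lines are exact repeats of that first line. The function name `strip_extra_headers` is hypothetical.

```python
def strip_extra_headers(lines):
    """Keep the first line as the header and drop later exact repeats of it.

    Assumption: extra headers are byte-for-byte repeats of the first line
    (ignoring the trailing newline). Any other line is kept unchanged.
    """
    header = None
    kept = []
    for line in lines:
        stripped = line.rstrip("\r\n")
        if header is None:
            # The first line seen is treated as the header.
            header = stripped
            kept.append(line)
        elif stripped == header:
            # A repeated header line in the middle of the data: skip it.
            continue
        else:
            kept.append(line)
    return kept
```

Applied to a file, the function would take the list returned by `readlines()` and the result would be written back out with `writelines()` before running the postprocessor.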

Servers and crawls:

  • created new instance on Graham, and then ran into a problem attaching storage because the storage was still connected to the deleted instance
  • wrote script for running small domain crawl to ease pressure on domains
  • suspect that on some domains the server blocks any crawler

Ongoing tasks:

  • check crawl every 2 days - Gy
  • update the MVP, especially with respect to the format of data going into and coming out of the postprocessor, and then as input to the visualization environment - Gy/Fr
  • push corrected postprocessor code to master - Gy/Fr
  • postprocessor: document the order of utilities and the steps for using the postprocessor - Gy
  • backburner: figure out corruption in small domain crawl

Action Items:

  • troubleshoot the postprocessor results to see why they aren't accurate - Fr
  • add comments to postprocessor - Fr
  • if time allows, run postprocessor on Fox News and Washington Post twitter crawls - Fr
  • document issue of deleting instance without detaching storage - Gy
  • try new crawl techniques as best as possible and experiment with new IP address - Gy
  • with new instance and new code, try NYT archive crawl - Gy
  • attempt to create instance and then attach storage - Gy