Aug 4, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • postprocessor

  • upload the NYT archive crawler with brake as a separate branch, and document how it differs from the earlier version - Gy

  • speed up small domain crawl a bit - Gy

  • do a count of the Israeli domain crawl - Gy

  • crawl of NYT "Israel" for the years 2006-2009, and use the article filter - Gy

  • continue with the postprocessor - Fr

Postprocessor

  • 2 problems were giving us trouble: first, an additional header line in the input; second, a copy/paste error in which the full last line wasn't being copied before the postprocessor was run
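
The extra-header-line problem can be handled with a small cleanup pass before postprocessing. The sketch below is illustrative only (the function name and the sample data are hypothetical, not the project's actual script): it keeps the first line as the header and drops any later line that repeats it, which is what happens when several crawl batches are concatenated.

```python
def strip_extra_headers(text, header=None):
    """Remove repeated header lines from concatenated crawler CSV output.

    Keeps the first line as the header and drops any later line that is
    identical to it. An explicit `header` string may be supplied instead.
    """
    lines = text.splitlines()
    if not lines:
        return text
    if header is None:
        header = lines[0]
    # Keep the first line, filter out later duplicates of the header.
    cleaned = [lines[0]] + [ln for ln in lines[1:] if ln != header]
    return "\n".join(cleaned) + "\n"

# Hypothetical sample mimicking two concatenated crawl batches:
raw = "id,url,text\n1,a,hello\nid,url,text\n2,b,world\n"
print(strip_extra_headers(raw))
```

A line-identity check like this is deliberately conservative: it only removes exact repeats of the header, so ordinary data rows are never dropped.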

Crawls & Server

  • figured out the right URL (finally sent by the Digital Alliance) and created a new instance
  • if the message "all requests have been processed" comes back with few results, the crawler is likely being blocked
  • provide a separate IP address if something is flagged
    • creating a new address might help
  • the small domain crawl was separated out, and at first only Jewish Journal wasn't working (the same one that is corrupted), but after five days a couple more stopped working
    • the 2 that stopped working (Jewish Currents & Peter Beinart) were put together and are using a new IP address
    • on Arbutus server, the Jewish Journal data are not marked as corrupted
  • Israeli domain crawl:
    • really slow: 15,000 a week, some only a few results
    • only 1 with lots of crawler results

On-going tasks:

  • check crawl every 2 days - Gy
  • update the MVP esp wrt format of data going into postprocessor and coming out, and then as input to the visualization environment - Gy/Fr
  • push corrected postprocessor code to master - Gy/Fr
  • postprocessor: document with instructions the order of utilities and steps to use the postprocessor - Gy
  • backburner: figure out corruption in small domain crawl

Action Items:

  • develop a script and documentation to remove extra header lines from twitter crawl output prior to postprocessing - Fr
  • check the URL extender to see if it is the most updated version - Fr
  • run URL extender on test twitter crawl output (~23,000) and run postprocessor on the resulting output - Fr
  • check results of postprocessor on test data - Al
  • if results work, run URL extender on all twitter crawl (Fox News and Washington Post, keeping separate) and postprocess - Fr
  • check if new IP address created with new instance - Gy
  • pause Israeli domain crawl while testing other crawl technique - Gy
  • set up individual crawls for Israeli domains to test the crawl technique, and check regularly to see if multiple errors have triggered the brake - Gy
  • if new IP address is created with new instance, try NYT archive crawl - Gy