March 30, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • pay top up
  • Israeli news sites: finish setting up directory and start crawl
  • keep working on postprocessor

Crawler Issues

  • still playing with the numbers
  • jewishjournal is returning errors: not 503s but 502s (Bad Gateway, i.e. an invalid response from an upstream server)
  • try changing the user agent:
    • it is part of the request header and identifies the client making the call
    • https://pptr.dev/api/puppeteer.browser.useragent
  • can also make a curl request or a standalone puppeteer request to isolate the problem
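A minimal sketch of the standalone-request idea, written in Python rather than curl or puppeteer so it can run next to the postprocessor. It sends one request with an explicit User-Agent header and reports the status code, which helps show whether the 502 depends on the user agent or appears for every client. The function names, the user-agent strings, and the example.com URL are placeholders for illustration, not values from the crawler.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def probe(url, user_agent, timeout=30):
    """Send one request and return the HTTP status code, including error codes."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (such as 502) are raised as HTTPError;
        # the status code is available on the exception.
        return err.code


# Example use (performs real network calls, so it is left commented out):
# for ua in ("probe/0.1", "Mozilla/5.0 (X11; Linux x86_64)"):
#     print(ua, "->", probe("https://example.com/", ua))
```

If the status code changes between user agents, the site is likely filtering on that header; if every client gets a 502, the problem is more likely on the server side or in the network path.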

Twitter postprocessing

  • managed to process a slightly larger csv
    • processing completes but produces no output (possibly too few tweets)
    • earlier error: unclear whether the cause was the file size or the combining function
  • new problem with the URL expander: it never finishes expanding
  • also get a "malformed node or string" error
    • wrap the function in a try/catch, log the ID of each failing tweet, and collect the failures into a separate set
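A rough sketch of the try/catch idea above. "malformed node or string" is the message Python's ast.literal_eval raises (as a ValueError) when given something that is not a valid literal, so the sketch assumes the failure comes from parsing a stringified list of URLs; that is a guess, not confirmed against the postprocessor. The function name and the "id" and "urls" field names are hypothetical.

```python
import ast
import logging

logger = logging.getLogger("url_expander")


def parse_url_lists(rows):
    """Parse each row's 'urls' field, setting aside rows that fail.

    Returns (parsed, failed): parsed maps tweet ID to the parsed value,
    failed is the list of rows that could not be parsed.
    """
    parsed = {}
    failed = []
    for row in rows:
        tweet_id = row["id"]
        try:
            value = ast.literal_eval(row["urls"])
        except (ValueError, SyntaxError) as exc:
            # Log the ID so the bad row can be inspected later,
            # then keep going instead of aborting the whole run.
            logger.warning("tweet %s: could not parse urls field: %s", tweet_id, exc)
            failed.append(row)
            continue
        parsed[tweet_id] = value
    return parsed, failed
```

Keeping the failed rows in their own list means the rest of the file still gets processed, and the failures can be written out separately and examined by ID.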

Israeli news sites crawl

  • currently running
  • initial plan was to copy the code from GitHub and start crawling
    • got errors
    • instead copied an existing crawler already on the server and are using that

Action Items

  • finish troubleshooting the URL expander: wrap the function in a try/catch, log the ID of each failing tweet, and collect the failures into a separate set
  • crawler: make a curl request or a standalone puppeteer request to isolate the problem
  • keep both crawls going

For new work studies:

  • figure out why the last line of the output file in the Twitter crawl is being cut off
  • issue of email when crawler breaks