June 2, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Alejandro: update on KPP & update on NYT archive dataset

    • Met with the group; they are enthusiastic about the results, although they haven't reviewed them yet.
    • RA has been hired and we are ready to start writing presentation for the fall.
  • crawl updates:

    • NYT archive politics - crawl has finished. The NYT crawler had small bugs, which Shensong fixed. Postprocessing still needs to be done.
    • small domain crawl - in progress. Apify stealthy mode works. Puppeteer is being used to send an email if we are locked out. Authentication for emails must be done each week. We discussed this.
    • theguardian & others? (CNN) - The Guardian changed how it responds with 429 (Too Many Requests). Apify catches the error internally, so there is no way to intercept it and add it to the fail counts. The workaround is to set the retry count to 0, so that a failed request is not retried immediately. Adding the request back to the queue does not work (it is marked as 'done' even if re-added). Reached 720,000 crawled over the past week or two.
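
The 429 workaround above (retry count of 0, failures counted rather than retried at once) can be sketched roughly as follows. This is a minimal Python illustration only, not the crawler's actual code, which runs on Apify. The names `crawl_round`, `fetch`, and `max_fails` are hypothetical, and deferring failed URLs to a later round is an assumption about how the failures would be handled.

```python
from typing import Callable, Iterable

TOO_MANY_REQUESTS = 429


def crawl_round(urls: Iterable[str], fetch: Callable[[str], int], max_fails: int = 3):
    """Visit each URL at most once; never retry within the round.

    `fetch` is a hypothetical callable returning an HTTP status code.
    Returns (done, deferred, paused).
    """
    done, deferred = [], []
    fails = 0
    paused = False
    for url in urls:
        if paused:
            # Crawl is paused: remaining URLs wait for the next round.
            deferred.append(url)
            continue
        status = fetch(url)
        if status == TOO_MANY_REQUESTS:
            fails += 1
            # Retry count 0: record the failure, do not retry right away.
            deferred.append(url)
            if fails >= max_fails:
                paused = True
        else:
            done.append(url)
    return done, deferred, paused
```

A later round would call `crawl_round` again on the `deferred` list once the site stops returning 429.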

documentation

  • Shensong to add documentation about:
    • crawler numbers for error registering and pausing the crawler
    • brake when queue goes to 0
    • Apify crawl in rounds
    • In-code documentation is complete, but additional .md files still need to be written.

crawl updates

  • when to delete the old small domain crawl
  • list of domains in this crawl
  • soon: re-run postprocessor with new terms on NYT archive sets

postprocessor refactoring

  • Postprocessor refactoring is probably a little over half done, with much more still to do. The postprocessor will work from .csv instead of JSON, which makes things much faster.
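
As a rough illustration of working from .csv, the sketch below streams rows one at a time with Python's standard `csv` module instead of loading a whole JSON document first. The notes do not say why .csv is faster, so row-by-row streaming is only an assumed reason. The column names `url` and `domain`, the sample data, and the function `count_by_domain` are hypothetical.

```python
import csv
import io

# Hypothetical sample input; the real column layout is not given in the notes.
SAMPLE = (
    "url,domain\n"
    "https://example.com/a,example.com\n"
    "https://example.org/b,example.org\n"
    "https://example.com/c,example.com\n"
)


def count_by_domain(handle):
    """Count rows per domain, reading one CSV row at a time."""
    counts = {}
    for row in csv.DictReader(handle):
        counts[row["domain"]] = counts.get(row["domain"], 0) + 1
    return counts


if __name__ == "__main__":
    print(count_by_domain(io.StringIO(SAMPLE)))
```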

Backburner

  • Twitter: embedded tweet issue
  • Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
  • using crawler proxies
  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now.
  • what to do with htz.li
  • finding language function
  • image_reference function
  • dealing with embedded versus cited tweets
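
The two link categories proposed above for the regular postprocessor output (non-scope links ending in .co.il, and links to a tweet or Twitter handle) could be detected along these lines. This is a hedged Python sketch, not existing postprocessor code. `classify_link`, `scope_domains`, the host list, and the URL patterns are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Assumed set of Twitter hostnames; not taken from the project.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
# Assumed URL shapes: /<handle>/status/<id> for tweets, /<handle> for profiles.
TWEET_PATH = re.compile(r"^/[A-Za-z0-9_]{1,15}/status/\d+")
HANDLE_PATH = re.compile(r"^/[A-Za-z0-9_]{1,15}/?$")


def classify_link(url, scope_domains):
    """Return 'tweet', 'twitter_handle', 'co_il', or None for other links."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in TWITTER_HOSTS:
        if TWEET_PATH.match(parsed.path):
            return "tweet"
        # Note: reserved paths such as /home would also match this pattern.
        if HANDLE_PATH.match(parsed.path):
            return "twitter_handle"
        return None
    # Only hosts outside the crawl scope count as non-scope .co.il links.
    if host.endswith(".co.il") and host not in scope_domains:
        return "co_il"
    return None
```

For example, `classify_link("https://twitter.com/someuser/status/12345", set())` falls into the tweet category, while a .co.il host that is listed in `scope_domains` is left unclassified.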

Action Items

  1. Shensong to write documentation on the following: crawler numbers for error registering and pausing the crawler, brake when queue goes to 0, and Apify crawl in rounds.
  2. Shensong to send NYT archive politics crawl to Alejandro after postprocessing.
  3. Shensong to comment back to Apify developers so they are aware of limitations of error reporting.
  4. Shensong to continue working on the post-processor refactoring.
  5. Shensong to send Kirsta and Alejandro info re: data structure.

Next Agenda: Discuss crawling progress & action items? Review data structure if we don't manage an asynchronous discussion.