June 2, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

Alejandro: update on KPP & update on NYT archive dataset
- Met with group and they are enthusiastic about results, although they haven't reviewed them yet.
- RA has been hired and we are ready to start writing presentation for the fall.
crawl updates:
- NYT archive politics - crawl has finished. The NYT crawler had small bugs, but Shensong fixed them. Postprocessing still needs to be done.
- small domain crawl - in progress. Apify stealthy mode works. Puppeteer is being used to send an email if we are locked out. Authentication for emails must be done each week. We discussed this.
- theguardian & others? (CNN) - The Guardian changed how it handles 429 responses. Apify catches the error internally, so there is no way to intercept it and add it to the fail counts. The workaround is to set the retry number to 0, so that a failed request is not retried immediately. Adding the request back to the queue does not work (it is marked as 'done' even if re-added). The crawl got to 720,000 over a week or two.
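
A minimal sketch of the fail-count idea discussed above, written independently of Apify. The class name, the threshold value, and the choice to treat 5xx responses as failures are assumptions for illustration, not the crawler's actual code. It only shows how consecutive 429 responses could be counted to decide when to pause, assuming the response status is visible to the caller.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureTracker:
    """Counts consecutive failed requests and signals when to pause.

    Hypothetical helper: the threshold below is not taken from the notes.
    """

    pause_threshold: int = 5
    consecutive_failures: int = 0
    failed_urls: List[str] = field(default_factory=list)

    def record(self, url: str, status: int) -> bool:
        """Record one response; return True if the crawler should pause."""
        if status == 429 or status >= 500:
            # Rate-limited or server error: count it and remember the URL
            # so it can be retried in a later pass instead of immediately.
            self.consecutive_failures += 1
            self.failed_urls.append(url)
        else:
            # Any success resets the streak.
            self.consecutive_failures = 0
        return self.consecutive_failures >= self.pause_threshold
```

Keeping the failed URLs in a separate list, rather than pushing them back onto the crawl queue, sidesteps the problem of re-added requests being marked as 'done'.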

Shensong to add documentation about:
- crawler numbers for error registering and pausing the crawler
- brake when queue goes to 0
- apify crawl in rounds
- In-code documentation is complete, but additional .md files still need to be written.
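
Until that documentation exists, here is one possible reading of "crawl in rounds" with a brake on an empty queue, as a rough sketch. The function name, the `fetch` callback, and the `round_size` and `max_rounds` parameters are all hypothetical. The real crawler may define a round differently.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_in_rounds(
    seeds: Iterable[str],
    fetch: Callable[[str], Iterable[str]],
    round_size: int = 100,
    max_rounds: int = 10,
) -> List[str]:
    """Process the queue in fixed-size rounds; stop when it is empty.

    `fetch` is a stand-in for the real page fetcher and returns the
    links found on a page.
    """
    queue = deque()
    seen = set()
    for url in seeds:
        if url not in seen:
            seen.add(url)
            queue.append(url)

    crawled: List[str] = []
    for _ in range(max_rounds):
        if not queue:
            break  # the "brake": nothing left to crawl
        # Only handle what is queued at the start of the round, capped
        # at round_size; newly found links wait for a later round.
        for _ in range(min(round_size, len(queue))):
            url = queue.popleft()
            crawled.append(url)
            for link in fetch(url):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return crawled
```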

Postprocessor refactoring is probably a little over half done, with a lot more to do. The postprocessor will work from .csv instead of JSON, which makes things much faster.

Twitter: embedded tweet issue
Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
using crawler proxies
adding to regular postprocessor output:
1. any non-scope domain hyperlink that ends in .co.il
2. any link to a tweet or twitter handle
- This is a bit outside our normal functionality, so I will put it on the backburner for now.
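
A rough sketch of how the two proposed link filters could look, should this come off the backburner. The function name, the label strings, and the list of Twitter hostnames are assumptions for illustration. "Ends in .co.il" is interpreted here as the link's domain ending in .co.il.

```python
import re
from urllib.parse import urlparse

# Assumed set of Twitter hostnames; extend as needed.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
# /<handle>/status/<id> is a tweet; /<handle> alone is a profile.
TWEET_PATH = re.compile(r"^/[A-Za-z0-9_]{1,15}/status/\d+/?$")
HANDLE_PATH = re.compile(r"^/[A-Za-z0-9_]{1,15}/?$")


def classify_link(url, scope_domains):
    """Return a label for links of interest, or None for everything else."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    if host in TWITTER_HOSTS:
        if TWEET_PATH.match(parsed.path):
            return "tweet"
        # Note: this also matches non-profile pages such as /home or
        # /search, which would need an explicit exclusion list.
        if HANDLE_PATH.match(parsed.path):
            return "twitter_handle"
        return None

    # Compare without a leading "www." so scope matching is consistent.
    bare = host[4:] if host.startswith("www.") else host
    if bare.endswith(".co.il") and bare not in scope_domains:
        return "out_of_scope_co_il"
    return None
```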
what to do with htz.li
finding language function
image_reference function
dealing with embedded versus cited tweets

Shensong to write documentation on the following: crawler numbers for error registering and pausing the crawler, brake when queue goes to 0 and apify crawl in rounds.
Shensong to send NYT archive politics crawl to Alejandro after postprocessing.
Shensong to comment back to Apify developers so they are aware of limitations of error reporting.
Shensong to continue working on the post-processor refactoring.
Shensong to send Kirsta and Alejandro info re: data structure.

Next Agenda: Discuss crawling progress & action items; review data structure if we don't manage an asynchronous discussion.