Feb 15, 2024 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

continue combining results and document (reading week) - Gy
monitor NYT archive and send email to Alejandro to update - Gy
follow up with IA regarding connection refused error - Ra
ask Nat for meeting about connection refused error - Ra
try another kind of crawl to see if there's a refused error - Ra
try to update version of node to see if that helps - Ra
take a sample of Wa/Po and see if the right result can be reproduced - Fr
follow up by email about Wa/Po output number - Fr
unit-testing each function of the postprocessor for IA dataset - Ar
if unit-testing shows accuracy, then request IA Electronic Intifada dataset from Raazia and proceed with postprocessing - Ar

Nat helped with workarounds:
- separate downloading of URLs through CDX from crawling
- store failed responses and try them again later (while continuing through the successful responses)
- randomize the pagination attempts
- filter after the URLs are downloaded: check for duplicate URLs before assigning them to the crawl queue
some of the problem may be the responsiveness of the IA servers, so slow down requests
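
A minimal sketch of how these workarounds could fit together, assuming the URL list has already been downloaded in a separate step. This is illustrative only and not the MediaCAT crawler code: `dedupe_urls`, `shuffled_pages`, `crawl_with_retries`, and the `fetch` callback are hypothetical names, and the retry count and delay are placeholder values.

```python
import random
import time
from collections import deque
from typing import Callable, Iterable, Optional


def dedupe_urls(urls: Iterable[str]) -> list[str]:
    """Drop duplicate URLs, keeping first-seen order, before queueing."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


def shuffled_pages(page_count: int, seed: Optional[int] = None) -> list[int]:
    """Return page indices 0..page_count-1 in random order."""
    pages = list(range(page_count))
    random.Random(seed).shuffle(pages)
    return pages


def crawl_with_retries(
    urls: Iterable[str],
    fetch: Callable[[str], str],      # hypothetical: fetches one URL
    max_attempts: int = 3,            # placeholder value
    delay_seconds: float = 1.0,       # placeholder value; raise to slow down
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[dict[str, str], list[str]]:
    """Crawl a pre-downloaded URL list, retrying failures later."""
    queue = deque((url, 1) for url in dedupe_urls(urls))
    results: dict[str, str] = {}
    failed: list[str] = []
    while queue:
        url, attempt = queue.popleft()
        try:
            results[url] = fetch(url)
        except OSError:  # ConnectionRefusedError is a subclass of OSError
            if attempt < max_attempts:
                # Re-queue at the back so the other URLs are tried first.
                queue.append((url, attempt + 1))
            else:
                failed.append(url)
        sleep(delay_seconds)  # pause between requests to go easy on the server
    return results, failed
```

Re-queueing a failed URL at the back is one reading of "try them again": the crawl keeps moving through URLs that respond, and a refused URL gets another chance after some time has passed. URLs that still fail after `max_attempts` are returned separately so they can be logged or retried in a later run.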

test whether the NYT Mid E Archive crawl can reach a speed of 2000 results a day; if it can, continue; if not, abandon - Gy
continue combining results and document (reading week) - Gy
integrate Nat's suggestions and test the NYT Mid E crawl again - Ra