Nov 30, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
IA NYT crawl: (1) count the unique URLs, excluding landing pages; (2) count how many of those fall under /world/ and /world/middleeast/; (3) check whether the URLs from (2) show a time-bound pattern, then start the crawl for those URLs - Ra
continue dialogue with IA: (1) bulk download, (2) ask about rate of calls - Ra
update documentation on JSON/CSV conversion - Ar
develop unit testing for foxnews postprocessed results, for example, on text alias & hyperlinks - Ar
send small Wa/Po twitter files into postprocessor to troubleshoot error - Fr
test the postprocessor with a sample of IA results (about 36,000) - Ar
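The URL counts in the IA NYT crawl item could be sketched roughly as follows. This is a minimal sketch, not the project's actual tooling: the function name `summarize_urls` is hypothetical, and it assumes that NYT article paths start with /YYYY/MM/DD/ and that anything else (for example /section/world) is a landing page. It also assumes two URLs are the same page when their paths match, ignoring query strings.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: article paths look like /YYYY/MM/DD/<section>/...;
# any path that does not match is treated as a landing page.
DATE_PATH = re.compile(r"^/(\d{4})/\d{2}/\d{2}/(.*)$")


def summarize_urls(urls):
    """Count unique article URLs and those under /world/ and /world/middleeast/."""
    # Deduplicate on the path only, so query strings and trailing slashes
    # do not produce separate entries.
    paths = {urlsplit(u).path.rstrip("/") for u in urls}

    articles = []  # (year, path after the date prefix)
    for path in paths:
        match = DATE_PATH.match(path)
        if match:
            articles.append((match.group(1), match.group(2)))

    world = [a for a in articles if a[1].startswith("world/")]
    middleeast = [a for a in world if a[1].startswith("world/middleeast/")]

    return {
        "unique_articles": len(articles),
        "world": len(world),
        "world_middleeast": len(middleeast),
        # Per-year counts give a first look at any time-bound pattern.
        "middleeast_by_year": dict(Counter(year for year, _ in middleeast)),
    }
```

Calling `summarize_urls` on the list of URLs returned by IA gives the three counts plus a per-year breakdown; if the landing-page assumption turns out to be wrong for older NYT URLs, only `DATE_PATH` needs to change.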
Crawls
Graham: error messages again, followed by a connection timeout; we will stop using Graham
Arbutus: a sudden, unexplained error; checking with Raazia
Internet Archive
getting a new error; could we be blocked on some servers but not on others?
Action Items
contact Nat about new kind of error with Arbutus cloud - Gy
check whether any Arbutus settings have changed - Ra
compare old results of timesofisrael to new results and eliminate duplicates - Gy
continue investigating the issue with retrieving electronicintifada from IA - Ra
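For the timesofisrael comparison item above, one possible approach is to merge the old and new results while keeping only the first record seen for each URL. This is a minimal sketch under stated assumptions: each result is a dict with a "url" key (the real crawl output may use a different schema), and the names `normalize_url` and `merge_results` are hypothetical.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to host + path so trivial variants compare equal."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def merge_results(old, new):
    """Return old + new records with duplicate URLs removed.

    Old records come first, so when a URL appears in both sets
    the old record is the one that is kept.
    """
    seen = set()
    merged = []
    for record in [*old, *new]:
        key = normalize_url(record["url"])  # assumed field name
        if key in seen:
            continue
        seen.add(key)
        merged.append(record)
    return merged
```

Keeping the old record on a clash is a design choice; passing the arguments in the other order would prefer the newer crawl instead.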