June 10, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda:
- Crawler
- Twinnt
- NYT Mid East crawl
Crawler ticket:
- Raiyan implemented the naming convention and tested it with NYT, Jadaliyya, and Al Jazeera. About 15,000 links were crawled in 36 hours, but the Al Jazeera crawl triggered the infinite loop, and there may also be a memory issue.
- Raiyan is still trying to narrow down how to troubleshoot the infinite loop issue, including debugging records.
- He is still thinking through a few options for how to move forward, but the path is not yet clear; he will be in touch with Kirsta if he gets blocked.
Twinnt:
- Both Raiyan and John read the TWINNT code.
- Set up the TWINNT crawler together for the NYT Twitter handles.
NYT Mid East:
- Raiyan will set this up to test it separately.
- If things go well with the infinite loop and missing JSON issues, then try the Mid East section crawl.