June 10, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda:

  • Crawler

  • Twinnt

  • NYT Mid East crawl

Crawler ticket:

  • Raiyan implemented the naming convention and tested it with NYT, Jadaliyya, and Al Jazeera. About 15,000 links were crawled in 36 hours, but the Al Jazeera crawl went into an infinite loop, and there may also be a memory issue

  • Raiyan is still trying to narrow down how to troubleshoot the infinite loop issue, including through debugging records

  • Raiyan is still thinking through a few options for moving forward, but none is clear yet; will be in touch with Kirsta if blocked
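
The notes do not identify what causes the infinite loop, so the following is only a minimal sketch of one common guard, assuming the loop comes from the crawler re-queueing pages it has already seen (for example, the same page reached through URL variants). The names `normalize`, `crawl`, `get_links`, and `max_pages` are hypothetical and are not taken from the MediaCAT crawler; Python is used here purely for illustration.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize(url):
    """Collapse trivial URL variants: drop the fragment and any trailing
    slash, and lowercase the scheme and host."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def crawl(seeds, get_links, max_pages=1000):
    """Breadth-first crawl that queues each normalized URL at most once.

    get_links is a hypothetical callable returning the outgoing links of a
    page; max_pages is a hard cap that stops the crawl even if new URLs keep
    appearing (for example, endlessly generated calendar or search pages).
    """
    seen = set()
    queue = deque()
    for seed in seeds:
        key = normalize(seed)
        if key not in seen:
            seen.add(key)
            queue.append(key)

    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            key = normalize(link)
            if key not in seen:  # skip anything already queued or visited
                seen.add(key)
                queue.append(key)
    return visited
```

The `seen` set also bounds the queue to one entry per distinct URL, which may help if the suspected memory issue is related to the same repeated links; that connection is an assumption, not something the notes confirm.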

Twinnt:

  • both Raiyan and John read the TWINNT code

  • set up the TWINNT crawler together for the NYT Twitter handles

NYT Mid East:

  • Raiyan will set this up to test it separately

  • if the infinite loop and missing JSON issues are resolved, then try the Mid East section crawl.