July 14, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Twitter crawler diff -- what we found
  • Domain crawler documentation (in-code & higher level to MVP), branch clean up, and pushing to master -- done?
  • Chance to try 1+ instance on 1 server -- how far along?
  • Timelining:
    1. meeting of Alejandro & Raiyan to go over folders and instances in Compute Canada
    2. assess whether Raiyan will have time to upgrade to Apify 1.0 in the next 2-3 weeks
    3. meeting of Alejandro & Kirsta to come up with work-study and co-op plans for Fall/Winter
  • Hire a minimum of 1 work-study student for Fall/Winter
  • Hire a minimum of 1 co-op placement for Winter

Twitter crawler diff

  • we didn't do the NYTimes Twitter list; Alejandro will get the list from June
  • Danhua may have limited the number of Twitter handles crawled at a time, which could account for the missed handles
  • Raiyan is going through the code; the handles that were not crawled are now being crawled
  • creating ticket to identify the issue: https://github.com/UTMediaCAT/mediacat-twitter-crawler/issues/10
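
A minimal sketch of how the missed handles could be listed, assuming the expected handles and the crawled handles are each available as a plain list of strings. The function names and the sample handles are hypothetical and are not taken from the crawler code; the comparison ignores case and a leading "@".

```python
def normalize(handle: str) -> str:
    """Strip whitespace and a leading '@', and lowercase the handle."""
    return handle.strip().lstrip("@").lower()


def missed_handles(expected, crawled):
    """Return the expected handles that do not appear among the crawled ones."""
    crawled_set = {normalize(h) for h in crawled}
    expected_set = {normalize(h) for h in expected}
    return sorted(expected_set - crawled_set)


# Hypothetical sample data, for illustration only.
expected = ["@Outlet_A", "outlet_b", "@outlet_c"]
crawled = ["outlet_a"]
print(missed_handles(expected, crawled))  # ['outlet_b', 'outlet_c']
```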

Domain crawler documentation:

  • Raiyan wrote the documentation and updated the master branch; also updated to MVP
  • Kirsta put together a diagram to show the flow of information for the domain crawler; several items were clarified
  • Crawler is officially working!
  • 160 hours of continuous crawling with no errors, at a steady 13 links per minute (100,000+ JSON files created)
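
As a rough sanity check on these figures, the rate and duration can be multiplied out. This is only a sketch: it assumes the 13 links per minute rate held for the full 160 hours and that each crawled link yields roughly one JSON file, neither of which is stated in the notes.

```python
# Figures from the notes; the steady-rate assumption is ours.
HOURS_CRAWLED = 160
LINKS_PER_MINUTE = 13


def estimated_links(hours: float, links_per_minute: float) -> int:
    """Estimate the total number of links crawled at a steady rate."""
    return int(hours * 60 * links_per_minute)


total = estimated_links(HOURS_CRAWLED, LINKS_PER_MINUTE)
print(total)  # 124800
```

Under those assumptions the estimate is consistent with the 100,000+ JSON files reported.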

Update Apify:

1+ instance on 1 server:

  • Raiyan is working on this presently, to get them reading from the same queue
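
The idea of several instances reading from the same queue can be sketched as follows. This is an illustration only, not the crawler's implementation: threads in one Python process stand in for separate crawler instances, the standard-library queue stands in for whatever shared queue the instances would actually use, and the names and URLs are hypothetical.

```python
import queue
import threading


def run_workers(urls, num_workers=2):
    """Let several workers drain one shared queue; each URL is taken once."""
    work = queue.Queue()
    for url in urls:
        work.put(url)

    processed = []
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                url = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            with lock:
                processed.append((name, url))

    threads = [
        threading.Thread(target=worker, args=(f"instance-{i}",))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed


# Hypothetical URLs, for illustration only.
urls = [f"https://example.com/page/{i}" for i in range(10)]
results = run_workers(urls)
print(len(results))  # 10
```

Which worker handles which URL varies from run to run, but every URL is processed exactly once because the queue hands each item to a single consumer.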

Meetings

  • Both meetings are set: Raiyan with Alejandro, and Kirsta with Alejandro