July 14, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

Twitter crawler diff -- what we found
Domain crawler documentation (in-code & higher level to MVP), branch clean up, and pushing to master -- done?
Chance to try 1+ instance on 1 server -- how far along?
Timelining:

meeting of Alejandro & Raiyan to go over folders and instances in Compute Canada
assess whether Raiyan will have time to upgrade to Apify 1.0 in next 2-3 weeks
meeting of Alejandro & Kirsta to come up with work-study & co-op plans for Fall/Winter

Hire a minimum of 1 work-study for fall/winter
Hire a minimum of 1 co-op placement for winter

Twitter crawler diff

we didn't do the NYTimes Twitter list; Alejandro will get the list from June
apparently Danhua may have limited the number of Twitter handles crawled at a time, which may account for the missed handles
Raiyan is going through the code; the handles that were not crawled are now being crawled
creating ticket to identify the issue: https://github.com/UTMediaCAT/mediacat-twitter-crawler/issues/10
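
A minimal sketch of how such a diff could be computed, assuming the expected and crawled handles are available as plain lists of strings. The function names `missed_handles` and `batches` are hypothetical and not taken from the crawler code; the batch helper only illustrates how a per-run limit on handles would split a list, which is the suspected cause noted above.

```python
def normalize(handle):
    """Lower-case a handle and drop a leading '@' (handles are case-insensitive)."""
    return handle.strip().lstrip("@").lower()


def missed_handles(expected, crawled):
    """Return normalized handles that were expected but never crawled, in input order."""
    crawled_set = {normalize(h) for h in crawled}
    seen = set()
    missed = []
    for handle in expected:
        key = normalize(handle)
        if key not in crawled_set and key not in seen:
            seen.add(key)
            missed.append(key)
    return missed


def batches(handles, size):
    """Split a list of handles into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [handles[i:i + size] for i in range(0, len(handles), size)]


if __name__ == "__main__":
    # Hypothetical sample data, not the real handle lists.
    expected = ["@nytimes", "BBCWorld", "cnn"]
    crawled = ["NYTimes", "@cnn"]
    print(missed_handles(expected, crawled))
    print(batches(expected, 2))
```

If a limit on handles per run is in place, comparing the output of `missed_handles` against the later batches would show whether the missed handles all fall past the cut-off.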

Domain crawler documentation:

Raiyan wrote the documentation and updated the master branch; also updated to MVP
Kirsta put together a diagram to show the flow of information for the domain crawler; several items were clarified
Crawler is officially working!
160 hours of continuous crawling with no errors: a steady 13 links per minute (100,000+ JSON files created)
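
As a quick sanity check on these figures, the reported rate and duration can be multiplied out. This is only a sketch: it assumes the rate held for the full run and that each crawled link produces one JSON file, neither of which the notes state explicitly. The function name `expected_links` is hypothetical.

```python
def expected_links(hours, links_per_minute):
    """Total links crawled at a constant rate over the given number of hours."""
    return hours * 60 * links_per_minute


if __name__ == "__main__":
    total = expected_links(hours=160, links_per_minute=13)
    print(total)  # 124800
```

Under those assumptions the total comes to 124,800 links, which is consistent with the 100,000+ JSON files reported.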

Update Apify:

update Apify and assess the crawler; Raiyan to look into it: https://github.com/UTMediaCAT/mediacat-domain-crawler/issues/34
not sure how long it will take, because things may break
Raiyan will communicate with Compute Canada about the reason for the vulnerability and about the Apify update

1+ instance on 1 server:

Raiyan is currently working on this, to get the instances reading from the same queue
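
The general idea of several instances reading from one queue can be sketched as follows. This is a generic illustration only and not the crawler's actual design: the real crawler is built on Apify, and the notes do not say how its queue is shared. Here Python threads stand in for crawler instances, an in-memory `queue.Queue` stands in for the shared queue, and `run_workers` is a hypothetical name.

```python
import queue
import threading


def run_workers(urls, num_workers=2):
    """Let several workers drain one shared queue; return (worker, url) pairs."""
    work = queue.Queue()
    for url in urls:
        work.put(url)

    processed = []
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                # Each URL is handed to exactly one worker.
                url = work.get_nowait()
            except queue.Empty:
                return  # Queue is drained, so this worker stops.
            with lock:
                processed.append((name, url))
            work.task_done()

    threads = [
        threading.Thread(target=worker, args=(f"instance-{i}",))
        for i in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


if __name__ == "__main__":
    # Hypothetical URLs for illustration.
    sample = [f"https://example.com/page/{i}" for i in range(5)]
    for name, url in run_workers(sample, num_workers=2):
        print(name, url)
```

The property that matters is that every URL is processed exactly once, no matter how many workers run. Separate crawler processes on one server would need a queue that lives outside any single process (for example on disk or in a database) to get the same guarantee.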

Meetings

The Raiyan and Alejandro meeting and the Kirsta and Alejandro meeting are both set