August 4, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

SWPP for Raiyan
timesheets
apify update
results of crawls now available on MVP?
politics subdomain?
cutting a beta
ideas for running parallel instances for next person?

Apify

updated to Apify 1.3.1; lots of major changes, especially to the queue
queue is no longer file-based JSON but rather a DB
it works like it used to
bad news: need to re-start politics and middle east from the beginning
results JSON is the same, so it should still work as input to the postprocessor
probably any other updates won't be such a hassle
could be that the multiple instances will be easier now
if John were to look at the possibility of multiple instances:

first read apify documentation
then try running the crawler multiple times in different terminals, keeping an eye on the queue to ensure there are no errors

a couple of functions were deprecated, which means replacements are coming, but it's not clear when:
e.g., the goto function (used to skip blacklisted domains, videos, etc., to help crawl faster) will probably be removed; Raiyan will try to remove it and use a workaround that does essentially the same thing
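The notes do not record which workaround was chosen. One possible shape, sketched here in Python for illustration only (Apify's SDK is JavaScript), is to filter URLs before they are added to the queue. The domain and extension lists and the `should_crawl` name are hypothetical and not taken from the crawler.

```python
from urllib.parse import urlparse

# Hypothetical examples; the real blacklist lives in the crawler's own config.
BLOCKED_DOMAINS = {"youtube.com", "vimeo.com"}
BLOCKED_EXTENSIONS = (".mp4", ".mov", ".avi")


def should_crawl(url: str) -> bool:
    """Return False for URLs on a blocked domain or pointing at a video file."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Match the domain itself and any of its subdomains.
    if any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
        return False
    # Skip links whose path ends in a blocked file extension.
    return not parsed.path.lower().endswith(BLOCKED_EXTENSIONS)
```

Filtering before enqueueing keeps unwanted pages out of the queue entirely, so it would not depend on the deprecated function.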

Current Crawls

NYT twitter crawl is accessible and will be updated to MVP
NYT/middle east & NYT/politics will be restarted, and old data expunged

Cutting a release

we're not sure how to do a release
twitter and domain crawler documentation should be good; we're not sure about the postprocessor documentation

other ideas:

with DB: running the new Python script ("master crawler"), which runs the crawler itself (not the bash script) multiple times, might work: Raiyan tested it and it did work some of the time
suggestion: the next dev should get in touch with the Apify development team to see if they have suggestions for how to run multiple instances; the Apify devs are good about responding
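As a rough illustration of the "master crawler" idea (not the actual script, which is not reproduced in these notes), a Python launcher could start several copies of a command in parallel and report which ones failed. The function names are hypothetical, and the demo command is a harmless stand-in for the real crawler invocation.

```python
import subprocess
import sys
from typing import List, Sequence


def run_instances(command: Sequence[str], count: int) -> List[int]:
    """Start `count` copies of `command` in parallel and return their exit codes."""
    # Launch every process first so they all run concurrently.
    processes = [subprocess.Popen(list(command)) for _ in range(count)]
    # Then wait for each one in turn; wait() returns the exit code.
    return [p.wait() for p in processes]


def failed_instances(exit_codes: Sequence[int]) -> List[int]:
    """Return the indices of instances that exited with a non-zero code."""
    return [i for i, code in enumerate(exit_codes) if code != 0]


if __name__ == "__main__":
    # Harmless stand-in; replace with the real crawler command.
    demo = [sys.executable, "-c", "print('instance finished')"]
    codes = run_instances(demo, 3)
    print("failed instances:", failed_instances(codes))
```

Checking exit codes only shows whether an instance crashed; whether the instances share the queue correctly would still need to be checked by watching the queue itself.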

Next meeting

in person at UTSC for week of August 30th?