December 16, 2021
Agenda
- John is going to monitor crawls and put them through the post-processor. Clean up the post-processor, since there are multiple versions on Graham, so that there is a single well-documented version. John to see what is involved (or how best to implement it with our resources) in running multiple instances of the crawler concurrently.
- Colin to generate the .csv for Al-Monitor and look into a viable Python crawler. See if he can find a pattern for the failures.
  - filter out in-domain urls?
- question: alternative crawl strategy?
crawls update:
- all the crawls are done except jewishjournal, and the first ones are already finished
- strategy: it's fine to do a bunch of crawls on the same instance; the readme gives instructions on how to run the instances in separate directories (a rough launcher sketch follows this list)
- some notes: there's a script called clean.tmp that periodically cleans out the temp folder, which otherwise uses up a ton of storage (only 7GB); an approximate sketch of its behaviour follows this list
- clean.tmp deletes older files first; John researched this and found that the files there aren't needed for long
- it's possible to set the limit (currently 2GB), but John will change the script so the limit can be set on the command line
- John will commit this script to the domain crawler repository and also add documentation to the readme
- with 5 different crawls running, we were able to crawl approx. 35,000 urls a day; some sites are slower, and there's no interference between the crawls
- so this crawl strategy can work for smaller sites
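
As a very rough illustration of the separate-directory setup mentioned above, here is a minimal Python sketch of launching several crawls on one instance, each from its own working directory. The crawler command below is a placeholder, not the project's actual invocation; the readme in the domain crawler repository is the real reference for how to run the instances.

```python
import subprocess
from pathlib import Path

# Hypothetical command for a single crawl; the real invocation is
# documented in the domain crawler readme.
CRAWLER_CMD = ["python", "crawler.py", "--site"]

def launch_crawls(sites):
    """Start one crawler process per site, each in its own directory."""
    procs = []
    for site in sites:
        workdir = Path(site)
        workdir.mkdir(exist_ok=True)  # separate directory per crawl
        procs.append(subprocess.Popen(CRAWLER_CMD + [site], cwd=workdir))
    return procs

if __name__ == "__main__":
    # e.g. five concurrent crawls on the same instance
    for proc in launch_crawls(["siteA", "siteB", "siteC", "siteD", "siteE"]):
        proc.wait()
```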
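
The actual clean.tmp script is the one John will commit to the domain crawler repository; the sketch below is only an approximation of the behaviour described in these notes: delete the oldest files in the temp folder until it is back under a size limit, with the limit settable from the command line. The 2GB default comes from the discussion; everything else is assumed.

```python
#!/usr/bin/env python3
"""Approximate sketch of a clean.tmp-style script (not the actual one):
delete the oldest files in a temp folder until it is under a size limit."""
import argparse
from pathlib import Path

def clean_tmp(tmp_dir: Path, limit_bytes: int) -> None:
    # Sort files oldest-first by modification time
    files = sorted(
        (f for f in tmp_dir.rglob("*") if f.is_file()),
        key=lambda f: f.stat().st_mtime,
    )
    total = sum(f.stat().st_size for f in files)
    for f in files:
        if total <= limit_bytes:
            break
        total -= f.stat().st_size
        f.unlink()  # remove the oldest file and keep going

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("tmp_dir", type=Path)
    parser.add_argument("--limit-gb", type=float, default=2.0,
                        help="size limit for the temp folder, settable from the command line")
    args = parser.parse_args()
    clean_tmp(args.tmp_dir, int(args.limit_gb * 1024 ** 3))
```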
- Al-Monitor:
- deleting in-domain urls would take some doing; look at it in the new year
- preparing the CSV is very time-consuming: processing 1 row can take 1 second
- Colin will look at speeding it up in the new year (for example, by splitting the work across multiple processes; a rough sketch follows this list)
- using built-in dictionaries right now; pandas or other data structures might be much faster
- current requested csv is still being processed
- John changed it (like the Twitter crawl) to include the found_url, and now the post-processor is finding many more relevant references
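
As a sketch of the kind of speed-up discussed for the Al-Monitor csv (splitting the per-row work across multiple processes), the example below filters in-domain urls out of a crawl csv with a multiprocessing pool. The column names (url, found_url), the file names, and the same-domain check are illustrative assumptions, not the actual post-processor logic.

```python
import csv
from multiprocessing import Pool
from urllib.parse import urlparse

def keep_cross_domain(row):
    """Return the row only if found_url points outside the source domain.
    Column names here are hypothetical."""
    src = urlparse(row["url"]).netloc
    found = urlparse(row["found_url"]).netloc
    return row if found and found != src else None

def filter_csv(in_path, out_path, workers=4):
    with open(in_path, newline="") as fin:
        reader = csv.DictReader(fin)
        fieldnames = reader.fieldnames
        with Pool(workers) as pool:
            # chunksize keeps inter-process overhead low when there are many rows
            kept = [r for r in pool.imap(keep_cross_domain, reader, chunksize=1000) if r]
    with open(out_path, "w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)

if __name__ == "__main__":
    # hypothetical file names
    filter_csv("al_monitor_crawl.csv", "al_monitor_filtered.csv")
```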
Utils update
- found_url script: John pushed it to the utils folder in mediacat-backend and documented it
- post-processor: John cleaned up the multiple versions, pushed them to mediacat-backend, and documented them
- optimizing resources: Shengsong will look at this in the new year
- John will write up a readme giving an overview of where to run things and where the documentation is for the different utils/functions/scripts, and link it from the home page of the GitHub docs (and email it to Shengsong)
- python crawler or optimizing: at the first meeting in January, discuss whether Colin should keep working on the Python crawler or move to optimization
- Twint problem:
- follow the issue here: https://github.com/twintproject/twint/pull/1307
- one possible solution (the linked PR) doesn't seem to work
- need to look at it again in the new year
- crawl strategy:
- possibility of crawling subdomains of large domains (like NYT Middle East): the problem is that the subdomain urls don't follow a pattern (see the sketch after this list)
- probably better to first attempt to optimize resources
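
To make the subdomain problem concrete: a scope filter like the hedged sketch below only works when a section's article urls share a predictable prefix. The prefixes and urls here are made up for illustration; the point noted above is that large sites often do not keep, say, Middle East articles under any one consistent path, so this kind of filter misses them.

```python
from urllib.parse import urlparse

# Illustrative prefixes only (hypothetical); the issue above is that many
# large-site section articles do not share a predictable path like this.
SECTION_PREFIXES = ("/section/world/middleeast", "/world/middleeast")

def in_section(url: str) -> bool:
    """Rough scope check: keep a url only if its path starts with a known prefix."""
    return urlparse(url).path.startswith(SECTION_PREFIXES)

# The section landing page matches, but a dated article url may not,
# even when the story is about the Middle East.
print(in_section("https://example.com/section/world/middleeast"))            # True
print(in_section("https://example.com/2021/12/16/world/some-article.html"))  # False
```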
Action Items:
- John will finalize the documentation and readme overview (including where the results of the different crawls are)
- John will document & push the clean.tmp script
- Colin: finalize the CSV; in the new year, look at optimizing