December 16, 2021
Agenda
- John is going to monitor crawls and put them through the post-processor. Clean up the post-processor, since there are multiple versions on Graham, so that there is a single well-documented version. John to see what is involved (or how best to implement it with our resources) in running multiple instances of the crawler concurrently.
- Colin to generate the .csv for Al-Monitor and look into a viable Python crawler. See if he can find a pattern for the failures.
  - filter out in-domain urls?
- question: alternative crawl strategy?
crawls update:
- all the crawls are done except jewishjournal, and the first ones are already finished
- strategy: it's fine to do a bunch of crawls on the same instance; the readme gives instructions on how to run the instances in separate directories (a rough launcher sketch follows this list)
- some notes: there's a script called clean.tmp that periodically cleans out the temp folder, which otherwise uses up a ton of storage (only 7GB); an approximate sketch of its behaviour follows this list
- clean.tmp deletes older files first; John researched this and found that the files there aren't needed for long
- it's possible to set the limit (currently 2GB), but John will change the script so the limit can be set on the command line
- John will commit this script to the domain crawler repository and also add documentation to the readme
- with 5 different crawls running, we were able to crawl approx. 35,000 urls a day; some sites are slower, and there's no interference between the crawls
- so this crawl strategy can work for smaller sites
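
As a very rough illustration of the separate-directory setup mentioned above, here is a minimal Python sketch of launching several crawls on one instance, each from its own working directory. The crawler command below is a placeholder, not the project's actual invocation; the readme in the domain crawler repository is the real reference for how to run the instances.

```python
import subprocess
from pathlib import Path

# Hypothetical command for a single crawl; the real invocation is
# documented in the domain crawler readme.
CRAWLER_CMD = ["python", "crawler.py", "--site"]

def launch_crawls(sites):
    """Start one crawler process per site, each in its own directory."""
    procs = []
    for site in sites:
        workdir = Path(site)
        workdir.mkdir(exist_ok=True)  # separate directory per crawl
        procs.append(subprocess.Popen(CRAWLER_CMD + [site], cwd=workdir))
    return procs

if __name__ == "__main__":
    # e.g. five concurrent crawls on the same instance
    for proc in launch_crawls(["siteA", "siteB", "siteC", "siteD", "siteE"]):
        proc.wait()
```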
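
The actual clean.tmp script is the one John will commit to the domain crawler repository; the sketch below is only an approximation of the behaviour described in these notes: delete the oldest files in the temp folder until it is back under a size limit, with the limit settable from the command line. The 2GB default comes from the discussion; everything else is assumed.

```python
#!/usr/bin/env python3
"""Approximate sketch of a clean.tmp-style script (not the actual one):
delete the oldest files in a temp folder until it is under a size limit."""
import argparse
from pathlib import Path

def clean_tmp(tmp_dir: Path, limit_bytes: int) -> None:
    # Sort files oldest-first by modification time
    files = sorted(
        (f for f in tmp_dir.rglob("*") if f.is_file()),
        key=lambda f: f.stat().st_mtime,
    )
    total = sum(f.stat().st_size for f in files)
    for f in files:
        if total <= limit_bytes:
            break
        total -= f.stat().st_size
        f.unlink()  # remove the oldest file and keep going

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("tmp_dir", type=Path)
    parser.add_argument("--limit-gb", type=float, default=2.0,
                        help="size limit for the temp folder, settable from the command line")
    args = parser.parse_args()
    clean_tmp(args.tmp_dir, int(args.limit_gb * 1024 ** 3))
```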
- Al-Monitor:
- deleting in-domain urls would take some doing; look at it in the new year
- preparing the CSV is very time-consuming: processing 1 row can take 1 second
- Colin will look at speeding it up in the new year (for example, by splitting the work across multiple processes; a rough sketch follows this list)
- using built-in dictionaries right now; pandas or other data structures might be much faster
- current requested csv is still being processed
- John changed it (like the Twitter crawl) to include the found_url, and now the post-processor is finding many more relevant references
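
As a sketch of the kind of speed-up discussed for the Al-Monitor csv (splitting the per-row work across multiple processes), the example below filters in-domain urls out of a crawl csv with a multiprocessing pool. The column names (url, found_url), the file names, and the same-domain check are illustrative assumptions, not the actual post-processor logic.

```python
import csv
from multiprocessing import Pool
from urllib.parse import urlparse

def keep_cross_domain(row):
    """Return the row only if found_url points outside the source domain.
    Column names here are hypothetical."""
    src = urlparse(row["url"]).netloc
    found = urlparse(row["found_url"]).netloc
    return row if found and found != src else None

def filter_csv(in_path, out_path, workers=4):
    with open(in_path, newline="") as fin:
        reader = csv.DictReader(fin)
        fieldnames = reader.fieldnames
        with Pool(workers) as pool:
            # chunksize keeps inter-process overhead low when there are many rows
            kept = [r for r in pool.imap(keep_cross_domain, reader, chunksize=1000) if r]
    with open(out_path, "w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)

if __name__ == "__main__":
    # hypothetical file names
    filter_csv("al_monitor_crawl.csv", "al_monitor_filtered.csv")
```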
Utils update
- found_url script: John pushed it to the utils folder in mediacat-backend and documented it
- post-processor: John cleaned up the multiple versions, pushed them to mediacat-backend, and documented them
- optimizing resources: Shengsong will look at this in the new year
- John will write up a readme giving an overview of where to run things and where the documentation is for the different utils/functions/scripts, and link it from the home page of the GitHub docs (and email it to Shengsong)
- python crawler or optimizing: at the first meeting in January, discuss whether Colin should keep working on the Python crawler or move to optimization
- Twint problem:
- follow the issue here: https://github.com/twintproject/twint/pull/1307
- one possible solution (the linked PR) doesn't seem to work
- need to look at it again in the new year
- crawl strategy:
- possibility of crawling subdomains of large domains (like NYT Middle East): the problem is that the subdomain urls don't follow a pattern (see the sketch after this list)
- probably better to first attempt to optimize resources
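
To make the subdomain problem concrete: a scope filter like the hedged sketch below only works when a section's article urls share a predictable prefix. The prefixes and urls here are made up for illustration; the point noted above is that large sites often do not keep, say, Middle East articles under any one consistent path, so this kind of filter misses them.

```python
from urllib.parse import urlparse

# Illustrative prefixes only (hypothetical); the issue above is that many
# large-site section articles do not share a predictable path like this.
SECTION_PREFIXES = ("/section/world/middleeast", "/world/middleeast")

def in_section(url: str) -> bool:
    """Rough scope check: keep a url only if its path starts with a known prefix."""
    return urlparse(url).path.startswith(SECTION_PREFIXES)

# The section landing page matches, but a dated article url may not,
# even when the story is about the Middle East.
print(in_section("https://example.com/section/world/middleeast"))            # True
print(in_section("https://example.com/2021/12/16/world/some-article.html"))  # False
```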
Action Items:
- John will finalize the documentation and readme overview (including where the results of the different crawls are)
- John will document & push the clean.tmp script
- Colin: finalize the CSV; in the new year, look at optimizing