December 16, 2021

Agenda

crawls update:

  • all the crawls are done except jewishjournal (the first ones finished earlier)

  • strategy: it is fine to run several crawls on the same instance; the readme gives instructions for running each crawl in a separate directory

    • some notes: there's a script called clean.tmp that periodically cleans out the temp folder, which otherwise eats up the instance's limited storage (only 7GB total); a minimal sketch follows this list
      • clean.tmp deletes the oldest files first; John researched this and found that the files there stop being needed soon after they are created
      • the size limit is currently hard-coded at 2GB, but John will change the script so the limit can be set on the command line
      • John will commit this script to the domain crawler repository and also add documentation to the readme
    • with 5 concurrent crawls we were able to crawl approx 35,000 urls a day; some sites are slower, and the crawls do not appear to interfere with one another
      • so this crawl strategy can work for smaller sites
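
The notes don't include the script itself, so the following is only a minimal sketch of what a clean.tmp-style cleanup could look like, assuming a Python implementation and a hypothetical `--limit-gb` flag (the actual script lives in the domain crawler repository):

```python
#!/usr/bin/env python3
"""Sketch of a clean.tmp-style cleanup: delete the oldest files in a
temp directory until the total size drops under a limit that can be
set on the command line (default 2GB, as in the notes)."""
import argparse
from pathlib import Path

def clean(tmp_dir: Path, limit_bytes: int) -> None:
    # Gather (mtime, size, path) for every regular file under tmp_dir.
    files = []
    for f in tmp_dir.rglob("*"):
        if f.is_file():
            st = f.stat()
            files.append((st.st_mtime, st.st_size, f))
    total = sum(size for _, size, _ in files)
    # Oldest first: keep deleting until we are back under the limit.
    for _, size, path in sorted(files):
        if total <= limit_bytes:
            break
        path.unlink()
        total -= size

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="trim a temp dir to a size limit")
    parser.add_argument("tmp_dir", type=Path, help="temp directory to clean")
    parser.add_argument("--limit-gb", type=float, default=2.0,
                        help="size limit in GB (default: 2)")
    args = parser.parse_args()
    clean(args.tmp_dir, int(args.limit_gb * 1024 ** 3))
```

Run periodically (e.g. from cron), something like this keeps the temp folder from filling the instance's 7GB of storage.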
  • al-monitor:

    • deleting in-domain urls would take some doing; look at it in the new year
    • preparing the CSV is very time-consuming: processing a single row can take about 1 second
      • Colin will look at speeding it up in the new year (for example, by splitting the work across multiple processes)
      • the script currently uses built-in Python dictionaries; pandas or other data structures might be much faster (see the sketch below)
      • the currently requested csv is still being processed
    • John changed the crawl output (as with the Twitter crawl) to include found_url, and the post-processor is now finding far more relevant references
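
The notes don't pin down where the per-row second actually goes, so the following is only a rough sketch of the contrast raised in the meeting: a pure-Python per-row dict lookup versus a vectorized pandas pass over the same data. The column names (`domain`, `source`) are hypothetical:

```python
"""Sketch of the two CSV-preparation approaches discussed;
column names are hypothetical."""
import pandas as pd

def label_rows_loop(rows: list, scope: dict) -> list:
    # Current style: one built-in dict lookup per row, in a Python loop.
    for row in rows:
        row["source"] = scope.get(row["domain"], "out-of-scope")
    return rows

def label_rows_pandas(df: pd.DataFrame, scope: dict) -> pd.DataFrame:
    # Vectorized alternative: map the whole column against the dict at once.
    out = df.copy()
    out["source"] = out["domain"].map(scope).fillna("out-of-scope")
    return out
```

Splitting the rows across worker processes (e.g. multiprocessing.Pool) is the other option Colin raised, and the two approaches can be combined.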

Utils update

  • found_url script pushed to utils folder in mediacat-backend:

    • John pushed and documented it (a hypothetical sketch of the matching step follows)
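
As a purely hypothetical illustration of the matching step the found_url change enables (the real script is in mediacat-backend; the field names and one-JSON-record-per-line layout here are assumptions):

```python
"""Hypothetical sketch: yield records whose found_urls point at an
in-scope domain. Field names ("url", "found_urls") are assumptions."""
import json
from urllib.parse import urlparse

def in_scope_references(crawl_file: str, scope_domains: set):
    # Assumes one JSON record per line in the crawl output file.
    with open(crawl_file, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            for url in record.get("found_urls", []):
                # Normalize "www." so bare and www domains compare equal.
                domain = urlparse(url).netloc.removeprefix("www.")
                if domain in scope_domains:
                    yield record.get("url"), url
```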
  • post-processor:

    • John consolidated the multiple versions, pushed the result to the backend repository, and documented it
  • optimizing resources:

    • Shengsong will look at this in the new year
    • John will write up a readme giving an overview of where to run things and where the documentation for the different utils/functions/scripts lives, link it from the home page of the github docs, and email it to Shengsong
  • python crawler vs. optimizing:

    • at the first meeting in January, discuss whether Colin should keep working on it or move to optimization
  • Twint problem:

  • crawl strategy:

    • possibility of crawling subdomains of large domains (like NYT Middle East): the problem is that the subdomain urls don't follow a predictable pattern
    • probably better to first attempt to optimize resources

Action Items:

  • John will finalize the documentation and readme overview (including where the results of the different crawls live)
  • John will document & push the clean.tmp script
  • Colin: finalize the CSV; in the new year, look at optimizing