February 3, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • link to project?
  • research on user-designation of temp folder location and size in domain crawler
    • need for better documentation strategy on storage
  • update on Graham Cloud config & crawl
  • Twitter API crawl

Tmp Folder

  • increase /tmp folder size instead of moving to larger disk
  • server can be resized to any size so no point in moving it
  • Shengsong will document how to resize /tmp, how to recreate an instance from backup, and, over time, what data is stored in which instance.
  • /tmp folder is currently 80 GB; with one crawler and JupyterLab running it is at 24% usage, and the crawler is now at 43,000 URLs after 36 hours
  • al-monitor.com crawl should finish in another day
  • Shengsong will rename folders so that crawl date and scope are clear, e.g. 2022/02/03 al-monitor.com crawl
  • for multiple crawls: Shengsong suggests only 1 crawl per instance; with 3 crawlers running at once, speed drops considerably
  • multiple threads: one crawler with multiple threads -- manage through code what each thread handles
  • Raiyan: tried multiple threads reading from the same queue but it didn't work
  • problem with multi-threading: could it be that only someone with a high-performance server can run it?
  • one possibility is to try on a desktop and see how it does with a small site
  • documenting Puppeteer config
  • monitoring how Puppeteer is working to get a better understanding, and then change the Puppeteer config
  • measure how much memory and CPU the crawler uses when it runs -- to let us know how to move forward
  • optimizing Puppeteer config?
  • search for Ubuntu packages to measure memory and CPU
  • another possibility is to look at NYT semantic API
  • high-level idea: output the htop/top info to a file every hour or so, and review it manually
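
The hourly logging idea above could be sketched roughly as follows. This is a minimal sketch, not the project's actual tooling: it assumes a Linux (Ubuntu) server, uses only the Python standard library, and records system-wide /tmp usage, memory usage, and 1-minute load average rather than the per-process view that htop/top give. All function names and the log line format are hypothetical.

```python
import os
import shutil
import time
from datetime import datetime


def parse_meminfo(text):
    """Extract MemTotal and MemAvailable (in kB) from /proc/meminfo text."""
    wanted = {"MemTotal", "MemAvailable"}
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted and rest.split():
            values[key] = int(rest.split()[0])
    return values


def percent_used(used, total):
    """Percentage used, rounded to one decimal; 0.0 if total is zero."""
    return round(100.0 * used / total, 1) if total else 0.0


def take_snapshot(tmp_path="/tmp", meminfo_path="/proc/meminfo"):
    """Collect one reading of disk, memory, and load (Linux only)."""
    disk = shutil.disk_usage(tmp_path)
    with open(meminfo_path) as f:
        mem = parse_meminfo(f.read())
    mem_used = mem["MemTotal"] - mem["MemAvailable"]
    return {
        "time": datetime.now().isoformat(timespec="seconds"),
        "tmp_used_pct": percent_used(disk.used, disk.total),
        "mem_used_pct": percent_used(mem_used, mem["MemTotal"]),
        "load_1min": round(os.getloadavg()[0], 2),
    }


def format_line(snap):
    """Render a snapshot as one tab-separated log line."""
    return "{time}\ttmp={tmp_used_pct}%\tmem={mem_used_pct}%\tload1={load_1min}".format(**snap)


def log_forever(log_path, interval_seconds=3600):
    """Append one snapshot line to log_path every interval (default: hourly)."""
    while True:
        with open(log_path, "a") as f:
            f.write(format_line(take_snapshot()) + "\n")
        time.sleep(interval_seconds)
```

Calling `log_forever("crawl_usage.log")` alongside a crawl (the file name is only an example) would produce one line per hour for manual review. Per-process numbers for the crawler itself would need a different approach.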

Twitter API Crawl

  • set up documentation; limit of 500 per call, but the response tells you where it finished
  • set up keys etc
  • either direct HTTP requests or a Python wrapper
  • very easy to reconstruct the tweet's url and can be automated
  • will meet with Shengsong next week to do knowledge transfer and documentation to get crawl started
  • meet Monday
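
The two points above, resuming from where a call finished and rebuilding a tweet's URL, might look something like the sketch below. It makes several assumptions not stated in the notes: that "where it finished" is a continuation token returned with each page, that `fetch_page` is a hypothetical function (supplied by the caller, wrapping either an HTTP request or a Python wrapper) returning `(tweets, next_token)`, and that each tweet is a dict with an `id`. The URL patterns are the standard public twitter.com forms.

```python
def tweet_url(tweet_id, username=None):
    """Rebuild a tweet's public URL from its ID (and author, if known)."""
    if username:
        return f"https://twitter.com/{username}/status/{tweet_id}"
    # Without a username, this generic form is commonly used instead.
    return f"https://twitter.com/i/web/status/{tweet_id}"


def collect_all(fetch_page, max_pages=None):
    """Call fetch_page repeatedly, passing back the token from the last page.

    fetch_page(token) is a hypothetical callable returning (tweets, next_token);
    token is None on the first call, and next_token is None on the last page.
    """
    tweets = []
    token = None
    pages = 0
    while True:
        batch, token = fetch_page(token)
        tweets.extend(batch)
        pages += 1
        if token is None or (max_pages is not None and pages >= max_pages):
            return tweets
```

Passing `max_pages` keeps a test run small before starting a full crawl.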