February 3, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- link to project?
- research on user-designation of temp folder location and size in domain crawler
- need for better documentation strategy on storage
- update on Graham Cloud config & crawl
- Twitter API crawl
Tmp Folder
- increase /tmp folder size instead of moving to larger disk
- server can be resized to any size so no point in moving it
- Shengsong will document how to resize /tmp, how to recreate an instance from backup, and, over time, what data is stored in which instance.
- the /tmp folder is currently 80 GB; with one crawler and JupyterLab running it is at 24% usage, and the crawler has reached 43,000 URLs in 36 hours
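A quick way to track the /tmp usage figures quoted above is Python's standard library. A minimal sketch (the `/tmp` path comes from the notes; the function name is illustrative):

```python
import shutil

def disk_usage_percent(path: str) -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

if __name__ == "__main__":
    # e.g. "/tmp usage: 24.0%" at the level reported in these notes
    print(f"/tmp usage: {disk_usage_percent('/tmp'):.1f}%")
```

This could be run from cron or inside JupyterLab to warn before the crawler fills the folder.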
- al-monitor.com crawl should finish in another day
- Shengsong will rename folders so that we are clear about each crawl's date and scope, e.g. "2022/02/03 al-monitor.com crawl"
- for multiple crawls: Shengsong suggests only one crawl per instance; with 3 crawlers running at once there was a considerable speed reduction
- multi-threading: one crawler with multiple threads -- manage through code what each thread handles
- Raiyan: tried multiple threads reading from the same queue, but it didn't work
- possible problem with multi-threading: it may only be feasible on a high-performance server?
- one possibility is to try on a desktop and see how it does with a small site
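The "multiple threads reading from one queue" pattern discussed above can be sketched with Python's standard library (the real crawler is Puppeteer/Node, so this only illustrates the pattern; `crawl_url` is a hypothetical stand-in for the per-URL work):

```python
import queue
import threading

def crawl_url(url: str) -> str:
    # Placeholder for the real per-URL fetch/parse work.
    return f"crawled {url}"

def worker(url_queue: queue.Queue, results: list, lock: threading.Lock) -> None:
    # Each thread pulls URLs from the shared queue until it is empty.
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        result = crawl_url(url)
        with lock:
            results.append(result)
        url_queue.task_done()

def crawl_all(urls, num_threads: int = 3):
    url_queue: queue.Queue = queue.Queue()
    for url in urls:
        url_queue.put(url)
    results: list = []
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(url_queue, results, lock))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A desktop trial against a small site, as suggested above, would just mean calling `crawl_all` with that site's URL list and a small `num_threads`.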
- documenting puppeteer config
- monitor how Puppeteer is working to get a better understanding, and then adjust its configuration
- when the crawler runs, measure how much memory and CPU it uses -- this will tell us how to move forward
- optimizing puppeteer config?
- search for Ubuntu packages that can measure memory and CPU usage
- another possibility is to look at NYT semantic API
- high-level idea: output the htop/top info to a file every hour or so, and review it manually
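The hourly-snapshot idea above could be done by redirecting `top` in batch mode to a file; the same idea for a process's own memory/CPU, sketched in Python (Unix only; the log filename and one-hour interval are assumptions taken from the notes):

```python
import datetime
import resource
import time

def resource_snapshot() -> str:
    """One timestamped line of memory/CPU stats for the current process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    now = datetime.datetime.now().isoformat(timespec="seconds")
    # ru_maxrss is reported in KB on Linux, bytes on macOS.
    return (f"{now} maxrss={usage.ru_maxrss} "
            f"utime={usage.ru_utime:.2f}s stime={usage.ru_stime:.2f}s")

def log_forever(path: str = "crawler_usage.log", interval_s: int = 3600) -> None:
    # Append a snapshot every hour for later manual review.
    while True:
        with open(path, "a") as f:
            f.write(resource_snapshot() + "\n")
        time.sleep(interval_s)
```

This only covers the process it runs in; for the Node/Puppeteer crawler itself, an external tool (`top -b`, `pidstat`, etc.) pointed at its PID would be the equivalent.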
Twitter API Crawl
- set up documentation; the API returns up to 500 results per call, but it does tell you where it finished
- set up keys etc
- access is possible through either raw HTTP requests or a Python wrapper
- it is very easy to reconstruct a tweet's URL, and this can be automated
- will meet with Shengsong next week to do knowledge transfer and documentation to get crawl started
- meet Monday
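The two points above -- pagination that "tells you where it finished" and automated tweet-URL reconstruction -- can be sketched as follows. This assumes Twitter API v2's pagination-token behaviour; `fetch_page` is a hypothetical callable standing in for the actual HTTP request or wrapper call:

```python
def tweet_url(username: str, tweet_id: str) -> str:
    """Reconstruct a tweet's public URL from its author handle and ID."""
    return f"https://twitter.com/{username}/status/{tweet_id}"

def collect_all(fetch_page, query: str):
    """Drain a paginated endpoint.

    `fetch_page(query, next_token)` is assumed to return
    (tweets, next_token), with next_token=None on the final page --
    mirroring the "tells you where it finished" behaviour in the notes.
    """
    tweets, token = [], None
    while True:
        page, token = fetch_page(query, token)
        tweets.extend(page)
        if token is None:
            return tweets
```

Persisting the last token between runs would let an interrupted crawl resume where it left off.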