February 3, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- link to project?
- research on user-designation of temp folder location and size in domain crawler
- need for better documentation strategy on storage
- update on Graham Cloud config & crawl
- Twitter API crawl
Tmp Folder
- increase /tmp folder size instead of moving to larger disk
- server can be resized to any size so no point in moving it
- Shengsong will document how to resize /tmp, how to recreate an instance from backup, and, over time, what data is stored in which instance.
- the /tmp folder is currently 80 GB; with one crawler and JupyterLab running it is at 24% usage, and the crawler has reached 43,000 URLs in 36 hours
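A quick way to track the /tmp usage figures quoted above is Python's standard library. A minimal sketch (the `/tmp` path comes from the notes; the function name is illustrative):

```python
import shutil

def disk_usage_percent(path: str) -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

if __name__ == "__main__":
    # e.g. "/tmp usage: 24.0%" at the level reported in these notes
    print(f"/tmp usage: {disk_usage_percent('/tmp'):.1f}%")
```

This could be run from cron or inside JupyterLab to warn before the crawler fills the folder.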
- al-monitor.com crawl should finish in another day
- Shengsong will rename folders so that we are clear about each crawl's date and scope, e.g. "2022/02/03 al-monitor.com crawl"
- for multiple crawls: Shengsong suggests only one crawl per instance; with 3 crawlers running at once there was a considerable speed reduction
- multi-threading: one crawler with multiple threads -- manage through code what each thread handles
- Raiyan: tried multiple threads reading from the same queue, but it didn't work
- possible problem with multi-threading: it may only be feasible on a high-performance server?
- one possibility is to try on a desktop and see how it does with a small site
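The "multiple threads reading from one queue" pattern discussed above can be sketched with Python's standard library (the real crawler is Puppeteer/Node, so this only illustrates the pattern; `crawl_url` is a hypothetical stand-in for the per-URL work):

```python
import queue
import threading

def crawl_url(url: str) -> str:
    # Placeholder for the real per-URL fetch/parse work.
    return f"crawled {url}"

def worker(url_queue: queue.Queue, results: list, lock: threading.Lock) -> None:
    # Each thread pulls URLs from the shared queue until it is empty.
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        result = crawl_url(url)
        with lock:
            results.append(result)
        url_queue.task_done()

def crawl_all(urls, num_threads: int = 3):
    url_queue: queue.Queue = queue.Queue()
    for url in urls:
        url_queue.put(url)
    results: list = []
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(url_queue, results, lock))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A desktop trial against a small site, as suggested above, would just mean calling `crawl_all` with that site's URL list and a small `num_threads`.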
- documenting puppeteer config
- monitor how Puppeteer is working to get a better understanding, and then adjust its configuration
- when the crawler runs, measure how much memory and CPU it uses -- this will tell us how to move forward
- optimizing puppeteer config?
- search for Ubuntu packages that can measure memory and CPU usage
- another possibility is to look at NYT semantic API
- high-level idea: output the htop/top info to a file every hour or so, and review it manually
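The hourly-snapshot idea above could be done by redirecting `top` in batch mode to a file; the same idea for a process's own memory/CPU, sketched in Python (Unix only; the log filename and one-hour interval are assumptions taken from the notes):

```python
import datetime
import resource
import time

def resource_snapshot() -> str:
    """One timestamped line of memory/CPU stats for the current process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    now = datetime.datetime.now().isoformat(timespec="seconds")
    # ru_maxrss is reported in KB on Linux, bytes on macOS.
    return (f"{now} maxrss={usage.ru_maxrss} "
            f"utime={usage.ru_utime:.2f}s stime={usage.ru_stime:.2f}s")

def log_forever(path: str = "crawler_usage.log", interval_s: int = 3600) -> None:
    # Append a snapshot every hour for later manual review.
    while True:
        with open(path, "a") as f:
            f.write(resource_snapshot() + "\n")
        time.sleep(interval_s)
```

This only covers the process it runs in; for the Node/Puppeteer crawler itself, an external tool (`top -b`, `pidstat`, etc.) pointed at its PID would be the equivalent.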
Twitter API Crawl
- set up documentation; the API returns up to 500 results per call, but it does tell you where it finished
- set up keys etc
- access is possible through either raw HTTP requests or a Python wrapper
- it is very easy to reconstruct a tweet's URL, and this can be automated
- will meet with Shengsong next week to do knowledge transfer and documentation to get crawl started
- meet Monday
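The two points above -- pagination that "tells you where it finished" and automated tweet-URL reconstruction -- can be sketched as follows. This assumes Twitter API v2's pagination-token behaviour; `fetch_page` is a hypothetical callable standing in for the actual HTTP request or wrapper call:

```python
def tweet_url(username: str, tweet_id: str) -> str:
    """Reconstruct a tweet's public URL from its author handle and ID."""
    return f"https://twitter.com/{username}/status/{tweet_id}"

def collect_all(fetch_page, query: str):
    """Drain a paginated endpoint.

    `fetch_page(query, next_token)` is assumed to return
    (tweets, next_token), with next_token=None on the final page --
    mirroring the "tells you where it finished" behaviour in the notes.
    """
    tweets, token = [], None
    while True:
        page, token = fetch_page(query, token)
        tweets.extend(page)
        if token is None:
            return tweets
```

Persisting the last token between runs would let an interrupted crawl resume where it left off.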