January 27, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- move back to GitHub issues?
- benchmarking and server discussion
  - update Ubuntu?
  - re-do distribution of resources
  - re-do of al-monitor crawl & timeline
  - probably need to re-do five small sites crawl
  - question about ComCan training
  - benchmark speed?
- results issues:
  - Shengsong's findings doing forensics of errors
  - easy to add title to domain crawler?
  - question about prefixed subdomain (en.globes.co.il)
- Action items from last day to check:
  - ComCan map link to wiki home page (like John's onboarding message)
  - Alejandro & Colin meet for Jupyter update
  - Maybe have Colin move to Twitter crawl?
moving back to GitHub tickets
server questions
- CC says okay to stick with Ubuntu 18, no need to go to 20 yet
- solution to the temp folder storage limitation: move it to a larger disk, but first need to back up the whole OS
- first deal with temp folder storage limit and then see if it's necessary to re-do the large instance of server
- Shengsong will email us with an update
- Shengsong will try to make the location of the temp folder a configurable option
- Shengsong will get back to us about whether the documentation and training offered by ComCan will be useful
- no point right now in looking at benchmarks of speed
- softlink didn't work, so the temp folder itself needs to be moved
  - why it didn't work: the temp folder doesn't work through a softlink
- Shengsong will link the ComCan map from the wiki home page
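
One way the configurable temp folder location could look, as a minimal sketch only. It assumes the component writing temp files is Python and uses the standard `tempfile` module; the environment variable name `MEDIACAT_TMPDIR` and the function name are hypothetical, not taken from the project.

```python
import os
import tempfile
from pathlib import Path


def configure_temp_dir(env_var="MEDIACAT_TMPDIR"):
    """Point the tempfile module at a configurable directory.

    env_var is a hypothetical setting name. If it is unset, Python's
    default temp directory is left unchanged.
    """
    configured = os.environ.get(env_var)
    if configured:
        path = Path(configured).expanduser()
        # Create the directory on the larger disk if it does not exist yet.
        path.mkdir(parents=True, exist_ok=True)
        # All later tempfile.* calls in this process will use this directory.
        tempfile.tempdir = str(path)
    return tempfile.gettempdir()
```

Calling `configure_temp_dir()` once at startup would be enough here, since `tempfile.tempdir` is a process-wide setting.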
results issues:
- the JSON files from the domain crawler already have the title
- Colin will look at adding title to postprocessor, Alejandro & Colin will meet to go over it
- distinguishing prefixed subdomain -- Shengsong will look into this later
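
A possible starting point for the prefixed subdomain question, as a sketch under assumptions: it only compares hostnames using the standard library, and the function name and the example URLs' paths are hypothetical.

```python
from urllib.parse import urlsplit


def subdomain_prefix(url, base_domain):
    """Return the subdomain prefix of url relative to base_domain.

    Returns "" when the host is exactly base_domain, the prefix
    (e.g. "en") when the host is a subdomain of it, and None when
    the host does not belong to base_domain at all.
    """
    host = (urlsplit(url).hostname or "").lower()
    base = base_domain.lower()
    if host == base:
        return ""
    if host.endswith("." + base):
        # Strip the trailing ".<base_domain>" to leave only the prefix.
        return host[: -len(base) - 1]
    return None
```

With this, `subdomain_prefix("https://en.globes.co.il/some-article", "globes.co.il")` gives `"en"`, so results from the prefixed site could be tagged separately from the base domain.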
Twitter API
- Colin will start on this as long as hours last
Action Items
- Shengsong will continue the move of the temp folder and update us by email
- Shengsong will also try and make the location of the temp folder a configurable option
- Shengsong: assuming temp folder issue is solved, re-do al-monitor.com crawl
- Shengsong: look at ComCan training materials and let us know if it might be worth it
- Colin: start on Twitter API crawling
- Colin: if time allows, look at adding title to postprocessor and renaming other keys