January 12, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
Meeting time:
- ask Kirsta and Nat whether 12 pm or 4 pm can be the regular time
Update on server?
- half node, rsync, NFS
- got the half node
- NFS: makes more sense for us because we're modifying basic files, we don't have objects, and it gives us a shared file system
- rsync:
- still running to get the data to Arbutus
- commands ready for weekly or daily backups
- current Graham cloud: a little over 2 TB, but a large chunk is in a directory called backup (/media/data/backup) that includes 1.4 TB of data
- it contains many previous crawls, both Twitter and domain
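The weekly or daily backup commands mentioned above could be sketched as below; the source path and the Arbutus destination are hypothetical placeholders, not the project's actual endpoints, and the flags are one reasonable choice rather than the settled configuration.

```python
import shlex

# Hypothetical endpoints -- replace with the real Graham source and
# Arbutus destination before scheduling this via cron.
SRC = "/media/data/"
DEST = "backup_user@arbutus.example.org:/backup/mediacat/"

def rsync_command(src, dest, dry_run=False):
    """Build an incremental rsync backup command:
    archive mode, compression, and deletion of files removed at the source."""
    cmd = ["rsync", "-az", "--delete", "--partial", "--info=stats2"]
    if dry_run:
        # Preview what would be transferred without copying anything.
        cmd.append("--dry-run")
    cmd += [src, dest]
    return cmd

# Print the command so it can be reviewed before running for real.
print(shlex.join(rsync_command(SRC, DEST, dry_run=True)))
```

Running with `dry_run=True` first is a cheap way to confirm the source/destination pairing before committing to a scheduled job.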
Twitter results
- possible to run through the post-processor?
- with Israel/Palestine scope
Start Fox News Twitter crawl?
Questions for next meeting:
- should we delete /media/data/backup? It looks like it was created on January 5, 2022, perhaps when Shengsong started
- ongoing rsync backup -- from where to where?
- set up a master key pair so that access doesn't rely on individual devs passing on information
- ensure we aren't using /dev
- how best to maintain record of crawls
- start the Fox Twitter crawl?
- can we tell how far we got on crawls that were in storage?
- question about twitter crawl:
- Thanks for letting me know about Pbump. I wonder whether, if you specifically skip the problematic tweet, the crawl can pick up again.
- find other users that were missing
Action Items
- write to Irfan about changing the storage structure, setting up backups, and changing the key pair
- start Fox News Twitter crawl
- continue rsync
- look through logs of domain crawls to see if we can figure out how close they were to being done:
- small domain: /media/data/Domain_crawler/small_domains_2022_05_26
- NYT domain crawl: /media/data/Domain_crawler/nytimes_2022_04_09
- Guardian crawl: /media/data/Domain_crawler/guardian_2022_05_12
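Estimating how close those crawls were to finishing could start with something like the sketch below. It assumes, hypothetically, that each completed fetch leaves a log line containing a marker token such as "crawled"; the real marker and log layout would need to be confirmed against the crawler's actual output.

```python
from pathlib import Path

# Hypothetical log convention: one line per completed fetch containing the
# marker token. Adjust `marker` to match the crawler's real log format.
def crawl_progress(log_dir, marker="crawled"):
    """Count marker lines across all .log files under log_dir (recursively)."""
    total = 0
    for log in Path(log_dir).glob("**/*.log"):
        # errors="replace" tolerates stray bytes in old log files.
        with open(log, errors="replace") as f:
            total += sum(1 for line in f if marker in line.lower())
    return total
```

Comparing the count against the size of the original seed/URL list would give a rough completion percentage for each of the crawls listed above.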
Questions for next meeting:
- meeting times