January 12, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
Meeting time:
- ask Kirsta and Nat whether 12 pm or 4 pm can be the regular time
Update on server?
- half node, rsync, NFS
- got the half node
- NFS: makes more sense for us because we're modifying basic files, we don't have objects, and it gives us a shared file system
- rsync:
- still running to get the data to Arbutus
- commands ready for weekly or daily backups
- current Graham cloud: a little over 2 TB, but a large chunk is in a directory called backup (/media/data/backup) that includes 1.4 TB of data
- it contains many previous crawls, both Twitter and domain
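The weekly or daily backup commands mentioned above could be sketched as below; the source path and the Arbutus destination are hypothetical placeholders, not the project's actual endpoints, and the flags are one reasonable choice rather than the settled configuration.

```python
import shlex

# Hypothetical endpoints -- replace with the real Graham source and
# Arbutus destination before scheduling this via cron.
SRC = "/media/data/"
DEST = "backup_user@arbutus.example.org:/backup/mediacat/"

def rsync_command(src, dest, dry_run=False):
    """Build an incremental rsync backup command:
    archive mode, compression, and deletion of files removed at the source."""
    cmd = ["rsync", "-az", "--delete", "--partial", "--info=stats2"]
    if dry_run:
        # Preview what would be transferred without copying anything.
        cmd.append("--dry-run")
    cmd += [src, dest]
    return cmd

# Print the command so it can be reviewed before running for real.
print(shlex.join(rsync_command(SRC, DEST, dry_run=True)))
```

Running with `dry_run=True` first is a cheap way to confirm the source/destination pairing before committing to a scheduled job.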
Twitter results
- possible to run through the post-processor?
- with Israel/Palestine scope
Start Fox News Twitter crawl?
Questions for next meeting:
- should we delete /media/data/backup? It looks like it was created on January 5, 2022, perhaps when Shengsong started
- ongoing rsync backup -- from where to where?
- set up a master key pair so that access doesn't rely on individual devs passing on information
- ensure we aren't using /dev
- how best to maintain record of crawls
- start the Fox Twitter crawl?
- can we tell how far we got on crawls that were in storage?
- question about twitter crawl:
- Thanks for letting me know about Pbump. I wonder whether, if you specifically skip the problematic tweet, the crawl can pick up again.
- find other users that were missing
Action Items
- write to Irfan about changing the storage structure, setting up backups, and changing the key pair
- start Fox News Twitter crawl
- continue rsync
- look through logs of domain crawls to see if we can figure out how close they were to being done:
- small domain: /media/data/Domain_crawler/small_domains_2022_05_26
- NYT domain crawl: /media/data/Domain_crawler/nytimes_2022_04_09
- Guardian crawl: /media/data/Domain_crawler/guardian_2022_05_12
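Estimating how close those crawls were to finishing could start with something like the sketch below. It assumes, hypothetically, that each completed fetch leaves a log line containing a marker token such as "crawled"; the real marker and log layout would need to be confirmed against the crawler's actual output.

```python
from pathlib import Path

# Hypothetical log convention: one line per completed fetch containing the
# marker token. Adjust `marker` to match the crawler's real log format.
def crawl_progress(log_dir, marker="crawled"):
    """Count marker lines across all .log files under log_dir (recursively)."""
    total = 0
    for log in Path(log_dir).glob("**/*.log"):
        # errors="replace" tolerates stray bytes in old log files.
        with open(log, errors="replace") as f:
            total += sum(1 for line in f if marker in line.lower())
    return total
```

Comparing the count against the size of the original seed/URL list would give a rough completion percentage for each of the crawls listed above.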
Questions for next meeting:
- meeting times