January 3, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- meeting with Irfan in order to think through server issues
- update on Twitter crawl
Server issues
- we are now able to access the smaller instance on Graham
- moved the storage to the smaller instance and all data were there
- will rsync the data over to Arbutus
- still can't access test_2 which is the larger instance
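
A minimal sketch of how the planned rsync transfer to Arbutus might be assembled, assuming rsync over SSH is available on both ends. The paths and host name below are hypothetical placeholders, not the project's real locations, and the helper `build_rsync_command` is only an illustration. It defaults to `--dry-run` so nothing is copied until the output has been reviewed.

```python
import shlex


def build_rsync_command(source, destination, dry_run=True):
    """Return an rsync argument list for copying `source` to `destination`."""
    # --archive keeps permissions and timestamps, --partial keeps partially
    # transferred files so an interrupted copy can resume.
    command = ["rsync", "--archive", "--compress", "--partial", "--verbose"]
    if dry_run:
        # Only report what would be transferred; copy nothing.
        command.append("--dry-run")
    command += [source, destination]
    return command


# Hypothetical source path and destination host (assumptions, replace both).
cmd = build_rsync_command(
    "/path/on/graham/storage/",
    "user@arbutus-host:/path/on/arbutus/storage/",
)
print(shlex.join(cmd))
```

The printed command can be checked by eye and then rerun with `dry_run=False`, for example through `subprocess.run(cmd, check=True)`.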
Twitter crawl
- finished Washington Post crawl
- the following users no longer exist:
  - annmarieadams
  - WithEdSimon
  - WPLyndaRobinson
  - rbbrenner
  - MattSchudel
  - faaawnt
  - WJuckno
- earlier batch of users:
  - raulp_213
  - jooleesah
  - DamonYoungVSB
  - leslieagarrettf
Action Items
- set up rsync between Graham storage and Arbutus storage
- check whether NFS storage is possible
- look for missing data: https://docs.google.com/document/d/14yylvd_zl5BaOvD8WbM0AEV8opVIgr_zJ2pXuMFBzcQ/edit#heading=h.rpzfzugep3c5
- Questions for next meeting:
  - ongoing rsync backup: from where to where?
  - set up a master key pair so that it is not the devs who pass on information
  - ensure we aren't using /dev
  - how best to maintain a record of crawls
  - start the Fox Twitter crawl?
  - can we tell how far we got on crawls that were in storage?
- question about Twitter crawl:
  - Thanks for letting me know about Pbump. I wonder whether the crawl can pick up again if you specifically skip the problematic tweet.
- find other users that were missing
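
One possible way to look for other missing users, sketched under the assumption that the expected handles and the handles actually present in the crawl output are both available as plain lists. The function name `find_missing_users` and the sample handles are hypothetical. Handles are compared case-insensitively because Twitter treats them that way.

```python
def find_missing_users(expected, crawled):
    """Return the expected handles that do not appear in the crawled handles."""
    # Lowercase once so the comparison ignores capitalization differences.
    crawled_lower = {handle.lower() for handle in crawled}
    return [handle for handle in expected if handle.lower() not in crawled_lower]


# Hypothetical sample data, not taken from the real crawl scope.
expected_handles = ["Example_One", "example_two", "example_three"]
crawled_handles = ["example_one", "EXAMPLE_THREE"]
print(find_missing_users(expected_handles, crawled_handles))
```

The result keeps the original order and spelling of the expected list, which makes it easy to paste back into the notes.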