January 12, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

meeting time:

  • ask Kirsta and Nat whether 12pm or 4pm can be the regular meeting time

Update on server?

  • half node, rsync, NFS
  • got the half node
  • NFS: makes more sense for us because we are modifying ordinary files rather than storing objects, and it gives us a shared file system
  • rsync:
    • still running to get the data to Arbutus
    • commands are ready for weekly or daily backups
    • current Graham cloud: a little over 2 TB, but a large chunk of that (1.4 TB) is a directory called backup (/media/data/backup)
      • it holds many previous crawls, both Twitter and domain
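
The backup commands themselves are not recorded in these notes. As a rough sketch only, a recurring one-way rsync copy could be assembled as below. The source /media/data comes from the notes; the destination string, the choice of flags, and the function names are assumptions for illustration and would need to be replaced with whatever is agreed for Arbutus.

```python
import shlex


def build_rsync_command(source, destination, dry_run=True, excludes=()):
    """Build an rsync argument list for a one-way copy of source into destination.

    The flags here are an assumption, not the team's actual command:
    -a keeps permissions and timestamps, --partial keeps partly
    transferred files so an interrupted run can resume.
    """
    cmd = ["rsync", "-a", "--partial", "--human-readable", "--stats"]
    if dry_run:
        # Report what would be transferred without copying anything.
        cmd.append("--dry-run")
    for pattern in excludes:
        cmd.append(f"--exclude={pattern}")
    # A trailing slash on the source copies the directory's contents
    # rather than nesting the directory itself inside the destination.
    cmd.append(source.rstrip("/") + "/")
    cmd.append(destination)
    return cmd


def as_shell(cmd):
    """Render the argument list as a single shell-safe line (e.g. for a crontab)."""
    return " ".join(shlex.quote(part) for part in cmd)


# Hypothetical destination; the real Arbutus host and path are not in the notes.
EXAMPLE_DESTINATION = "backup-user@arbutus.example:/backups/mediacat/"

example = build_rsync_command(
    "/media/data",
    EXAMPLE_DESTINATION,
    dry_run=True,
    # Anchored pattern: skips /media/data/backup only, pending the
    # decision on whether that directory should be kept.
    excludes=["/backup/"],
)
print(as_shell(example))
```

Running with dry_run=True first shows what would be copied; the printed line can then be placed in a cron entry or passed to subprocess.run once the source and destination are confirmed.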

Twitter results

  • possible to run through the post-processor?
    • with Israel/Palestine scope

Start Fox News Twitter crawl?

Questions for next meeting:

  • should we delete /media/data/backup? It looks like it was created January 5, 2022, perhaps when Shengsong started
  • ongoing rsync backup: from where to where?
  • set up a master key pair so that access does not depend on devs passing on information
  • ensure we aren't using /dev
  • how best to maintain a record of crawls
  • start the Fox News Twitter crawl?
  • can we tell how far we got on crawls that were in storage?
  • question about Twitter crawl:
    • Thanks for letting me know about Pbump. I wonder whether the crawl can pick up again if you specifically skip the problematic tweet.
    • find other users that were missing

Action Items

  • write to Irfan about changing the storage structure, setting up backups, and changing the key pair
  • start Fox News Twitter crawl
  • continue rsync
  • look through logs of domain crawls to see if we can figure out how close they were to being done:
    • small domain: /media/data/Domain_crawler/small_domains_2022_05_26
    • NYT domain crawl: /media/data/Domain_crawler/nytimes_2022_04_09
    • Guardian crawl: /media/data/Domain_crawler/guardian_2022_05_12
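
One way to start on the log review above is to summarize each crawl directory before reading anything by hand. The sketch below is only a starting point: the three paths come from the notes, but the assumption that logs match `*.log` is hypothetical, and the function names are made up for illustration. It reports file count, total size, the most recent modification time, and the last few lines of the newest log, which may hint at where a crawl stopped.

```python
from collections import deque
from datetime import datetime
from pathlib import Path

# Crawl directories listed in the action items.
CRAWL_DIRS = [
    "/media/data/Domain_crawler/small_domains_2022_05_26",
    "/media/data/Domain_crawler/nytimes_2022_04_09",
    "/media/data/Domain_crawler/guardian_2022_05_12",
]


def tail(path, n=5):
    """Return the last n lines of a text file without loading it all into memory."""
    with open(path, "r", errors="replace") as fh:
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


def summarize_crawl(directory, log_glob="*.log"):
    """Summarize one crawl directory; returns None if the directory is missing.

    log_glob is an assumption: adjust it to the crawler's real log file names.
    """
    root = Path(directory)
    if not root.is_dir():
        return None
    files = [p for p in root.rglob("*") if p.is_file()]
    logs = sorted(
        (p for p in root.rglob(log_glob) if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    newest = max((p.stat().st_mtime for p in files), default=None)
    return {
        "directory": str(root),
        "file_count": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
        "last_modified": (
            datetime.fromtimestamp(newest).isoformat(timespec="seconds")
            if newest is not None
            else None
        ),
        "latest_log": str(logs[-1]) if logs else None,
        "latest_log_tail": tail(logs[-1]) if logs else [],
    }


for crawl_dir in CRAWL_DIRS:
    summary = summarize_crawl(crawl_dir)
    print(crawl_dir, "->", summary if summary is not None else "not found")
```

The last-modified time gives a rough idea of when each crawl stopped writing output; whether a crawl actually finished still has to be judged from the log contents.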

Questions for next meeting:

  • meeting times