January 20, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda:

  • move to issues to organize tasks
  • make Compute Canada resource map private?
  • finalize any loose ends on Compute Canada updates
  • benchmarking
  • look at txt for scope
  • Colin's suggestions about making spreadsheets
  • Twitter crawler estimate
  • If time: small site crawl? needs post-processing?

Compute Canada:

  • Graham Cloud is running the latest OS, as far as Shengsong can tell.
  • prepare email to Compute Canada (CC) cloud IT asking whether any action is needed to update the OS on the Graham and Arbutus instances (we seem to be up to date)
  • figure out if SSH keys will be affected by a change to my CC password

Benchmarking:

  • problem with storage: Nat suggested a workaround using soft links (symlinks)
  • Colin will move some of the folders
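
The soft-link workaround could look roughly like the sketch below: move a folder to the volume that has space, then leave a symlink at the old path so existing scripts keep working. The function name and the example paths are hypothetical and are not taken from the MediaCAT setup.

```python
import shutil
from pathlib import Path


def relocate_with_symlink(src, dest_root):
    """Move src under dest_root and leave a symlink at the old location.

    Hypothetical helper for illustration; the paths passed in are assumptions.
    """
    src = Path(src)
    dest_root = Path(dest_root)
    dest = dest_root / src.name

    if src.is_symlink():
        raise ValueError(f"{src} is already a symlink")
    if dest.exists():
        raise FileExistsError(f"{dest} already exists")

    dest_root.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))

    # The old path now points at the new location, so existing
    # scripts that read or write the old path keep working.
    src.symlink_to(dest.resolve(), target_is_directory=True)
    return dest
```

For example, `relocate_with_symlink("/home/user/results", "/mnt/bigvolume")` (both paths made up) would leave `/home/user/results` as a link into `/mnt/bigvolume/results`.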

Twitter crawler:

  • comment on TWINT #1295 that the fix from early December isn't working
  • research new Twitter crawlers:
    • question 1: how long would it take to integrate a new Twitter crawler?
      • is the current Twitter config sufficiently modular to easily swap in a new (Python?) crawler?
    • question 2: would a JavaScript crawler that simulates a human user work better? (Is Apify a JavaScript crawler?)
    • others that Danhua considered: Twarc & GetOldTweets (see here, scroll down)
    • also look at Apify & Twitter API
      • does the Twitter API have a cost associated?
    • are there other new Twitter crawlers out there?
  • currently, it seems that none of these non-API Twitter crawlers are working
  • academic research account:
    • archive search limit is 500 per request, see here
    • developer TOS here
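
For the crawler estimate, a rough back-of-the-envelope sketch: given the 500-per-request archive search limit noted above, how many requests a crawl needs and roughly how long it would take. The rate-limit values (`requests_per_window`, `window_minutes`) are placeholders to be filled in from the Twitter API documentation; they are not stated in these notes.

```python
import math

# From the notes: archive search returns at most 500 results per request.
PER_REQUEST_LIMIT = 500


def archive_search_requests(total_tweets, per_request=PER_REQUEST_LIMIT):
    """Number of requests needed to page through total_tweets results."""
    if total_tweets < 0:
        raise ValueError("total_tweets must be non-negative")
    if per_request <= 0:
        raise ValueError("per_request must be positive")
    return math.ceil(total_tweets / per_request)


def estimated_minutes(total_tweets, requests_per_window, window_minutes,
                      per_request=PER_REQUEST_LIMIT):
    """Rough upper bound on crawl time under an assumed rate limit.

    requests_per_window and window_minutes are assumptions supplied by
    the caller; check the current API documentation for real values.
    """
    if requests_per_window <= 0:
        raise ValueError("requests_per_window must be positive")
    requests = archive_search_requests(total_tweets, per_request)
    windows = math.ceil(requests / requests_per_window)
    return windows * window_minutes
```

This ignores any monthly tweet cap on the account, which would need to be checked separately.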

Producing Spreadsheets:

  • the output file is much smaller when plain text is excluded
  • we're not sure whether the plain text error is produced by the crawler or the postprocessor

Action Items:

  • Shengsong: will move Compute Canada resource map to another space
  • Alejandro: send email to Compute Canada
  • Alejandro: look at scope text file
  • Alejandro: update server notes in new google doc & change password for Graham
  • Colin: delete SSH keys from old developers
  • Shengsong: set up firewall for our instance, link here
    • make sure to enable all ports that we are using, for example, for Jupyter & SSH
    • Nat recommends using UFW (Uncomplicated Firewall)
  • Colin: move folders as agreed
  • Shengsong: will try to re-start the benchmarking once the re-organization and soft-linking are done
  • Colin: will check whether the plain text in the postprocessed spreadsheet matches the plain text in the JSON, to see whether the error is produced by the postprocessor
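
The plain text check in the last item could be scripted along these lines. This is a minimal sketch, assuming the crawler JSON is a list of records and that both files share an identifier and a plain text field; the names `id` and `plain_text` are hypothetical and would need to match the real output.

```python
import csv
import json


def load_json_records(path):
    """Load crawler output, assumed here to be a JSON list of objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_csv_records(path):
    """Load the postprocessed spreadsheet as a list of dict rows."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def plain_text_mismatches(json_records, csv_rows, key="id", field="plain_text"):
    """Return (key, reason) pairs where the spreadsheet disagrees with the JSON.

    The key and field names are assumptions for illustration.
    """
    csv_by_key = {str(row[key]): (row.get(field) or "") for row in csv_rows}
    mismatches = []
    for rec in json_records:
        k = str(rec[key])
        crawled = rec.get(field) or ""
        if k not in csv_by_key:
            mismatches.append((k, "missing from spreadsheet"))
        elif csv_by_key[k] != crawled:
            mismatches.append((k, "plain text differs"))
    return mismatches
```

If the two sources agree and the text is still wrong, the problem more likely originates in the crawler; if they differ, the postprocessor is the more likely source. Very long text fields may require raising `csv.field_size_limit` before reading the spreadsheet.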