January 27, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • move back to GitHub issues?
  • benchmarking and server discussion
    • update Ubuntu?
    • re-do distribution of resources
    • re-do of al-monitor crawl & timeline
      • probably need to re-do five small sites crawl
    • question about ComCan training
    • benchmark speed?
  • results issues:
    • Shengsong's findings doing forensics of errors
    • easy to add title to domain crawler?
    • question about prefixed subdomain (en.globes.co.il)
  • Action items from last day to check:
    • ComCan map link to wiki home page (like John's onboarding message)
  • Alejandro & Colin meet for Jupyter update
  • Maybe have Colin move to Twitter crawl?

moving back to GitHub tickets

server questions

  • CC says okay to stick with Ubuntu 18, no need to go to 20 yet
  • solution to the temp folder storage limit: move the folder to a larger disk; the whole OS needs to be backed up first
    • deal with the temp folder storage limit first, then see whether the large server instance needs to be re-done
    • Shengsong will email us with an update
    • Shengsong will try to make the location of the temp folder a configurable option
  • Shengsong will get back to us about whether the documentation and training offered by ComCan will be useful
  • no point right now in looking at benchmarks of speed
  • a softlink (symbolic link) didn't work, so the temp folder itself needs to be moved
    • why it didn't work: the temp folder doesn't work when it is reached through a softlink
  • Shengsong will link the ComCan map from the wiki homepage
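
A minimal sketch of what a configurable temp folder location could look like, assuming the setting is read from an environment variable. The variable name `MEDIACAT_TMP_DIR` and the helper `resolve_temp_dir` are hypothetical and are not taken from the existing crawler code.

```python
import os
import tempfile
from pathlib import Path


def resolve_temp_dir(env_var="MEDIACAT_TMP_DIR"):
    """Return the directory to use for temporary files.

    Assumption: the location is configured through an environment
    variable (hypothetical name). If it is unset, fall back to the
    system default temp directory.
    """
    configured = os.environ.get(env_var)
    base = Path(configured) if configured else Path(tempfile.gettempdir())
    # Create the directory on the larger disk if it does not exist yet.
    base.mkdir(parents=True, exist_ok=True)
    # Return the real path, with any symbolic links resolved.
    return base.resolve()
```

With a setup like this, the temp folder could be pointed at the larger disk by setting the variable before starting a crawl, without changing the code.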

results issues:

  • the JSON files from the domain crawler have the title
  • Colin will look at adding the title to the postprocessor; Alejandro & Colin will meet to go over it
  • distinguishing prefixed subdomain -- Shengsong will look into this later
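
One possible way to distinguish a prefixed subdomain such as en.globes.co.il from its parent domain is to pick the longest in-scope domain that matches a URL's host. The sketch below is only illustrative: the function `match_scope_domain`, the scope list, and the example URL paths are assumptions and are not part of the existing crawler or postprocessor.

```python
from urllib.parse import urlsplit


def match_scope_domain(url, scope_domains):
    """Return the most specific scope domain matching the URL's host.

    A host matches a domain when it equals the domain or ends with
    "." + domain. When several domains match, the longest one wins, so
    en.globes.co.il is preferred over globes.co.il. Returns None when
    nothing matches.
    """
    host = (urlsplit(url).hostname or "").lower()
    best = None
    for domain in scope_domains:
        d = domain.lower()
        if host == d or host.endswith("." + d):
            if best is None or len(d) > len(best):
                best = d
    return best


# Hypothetical scope list and URL, for illustration only.
scope = ["globes.co.il", "en.globes.co.il"]
print(match_scope_domain("https://en.globes.co.il/en/article-1", scope))
```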

Twitter API

  • Colin will start on this for as long as his hours last

Action Items

  • Shengsong will continue move of temp folder and update us by email
    • Shengsong will also try to make the location of the temp folder a configurable option
  • Shengsong: assuming the temp folder issue is solved, re-do the al-monitor.com crawl
  • Shengsong: look at the ComCan training materials and let us know whether they might be worth it
  • Colin: start on Twitter API crawling
  • Colin: if time allows, look at adding the title to the postprocessor and renaming other keys