October 21, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  1. John will continue looking at the various approaches to optimizing the postprocessor speed:
  • try solving the race condition when using the same dictionary by adding a locking mechanism
  • when using separate dictionaries, try a formal map-reduce from an existing Python library
  • with one shared dictionary, try assigning separate keys to each process.
  2. Colin will look at the repo front-end, especially the visualizations ticket (#7); he will author a new pull request, which we will review next week
  3. John will deprecate the Voyage repository
  4. Alejandro will follow up on the SSH issue for Compute Canada
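
The separate-dictionaries option in item 1 could look roughly like the sketch below. It is only an illustration under assumptions: the record shape (`url`, `referrer`) and the function names are hypothetical and not taken from the postprocessor. Each chunk is mapped to its own private dictionary, so no locking is needed, and the partial dictionaries are merged at the end.

```python
from collections import defaultdict
from functools import reduce


def map_chunk(records):
    """Map step: build a private dictionary for one chunk of records.

    Each record is assumed to be a (url, referrer) pair (hypothetical shape).
    """
    partial = defaultdict(list)
    for url, referrer in records:
        partial[url].append(referrer)
    return dict(partial)


def merge(left, right):
    """Reduce step: combine two partial dictionaries into a new one."""
    merged = {key: list(values) for key, values in left.items()}
    for key, values in right.items():
        merged.setdefault(key, []).extend(values)
    return merged


def map_reduce(chunks, mapper=map):
    """Run map_chunk over every chunk, then merge the partial results.

    `mapper` defaults to the built-in map (single process). Passing
    `pool.map` from a multiprocessing.Pool runs the map step in
    separate processes instead.
    """
    return reduce(merge, mapper(map_chunk, chunks), {})
```

Because no dictionary is shared during the map step, the race condition does not arise; the cost is the extra merge pass at the end.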

Notes

  1. Optimizing Postprocessor speed
  • John tried several approaches to optimization and discussed them with KS and Nat; the first approach is the best one. Run time went from 2 days to 4 hours.
  • John spent the week looking at the alternatives and verified that his original approach was the correct one to take.
  • New problem: the process ran out of memory at the very end and did not process the Twitter crawler data, so John needs to re-run the crawler. He is working on an approach to monitor size and write to disk if the process is at risk, and is now seeking to re-run.
  • John is having a problem with Graham cloud that he is trying to address (he keeps getting kicked out). He will write to Alejandro, who will write the Compute Canada stuff.
  2. Problem with SSH into Compute Canada resources
  • Alejandro tried to sort this out with Colin, but they are still experiencing the same problem. Adding the IP under security groups did not help. Jacqueline added John, so he doesn't know how it was done. The documented steps did not work. Alejandro will email Jacqueline to see if she can clarify; if we can get in, we need to update the documentation.
  3. Visualizations
  • Colin can't run the existing repositories because the data set doesn't have dates on the domain side.
  • https://github.com/UTMediaCAT/mediacat-domain-crawler/issues/35 was created to make sure we look at the date functions and see what's going on.
  • Colin will create a .csv for Alejandro
  • Colin will author a PR (we reviewed the PR and squashed it)
  • Colin will reprocess the JSON file to add dates from the URL where available so that he can try to run the stacked area graph. He will also try to get the network diagram (force vector) working if time allows.
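
As an illustration of the reprocessing step in the Visualizations notes, the sketch below pulls a date out of a URL path when one is present. It assumes URLs carry the date as `/YYYY/MM/DD/` or `/YYYY-MM-DD/` and that each JSON record has `url` and `date` fields; both the URL convention and the field names are assumptions, not the actual crawler output format.

```python
import re
from datetime import date

# Matches a /YYYY/MM/DD or /YYYY-MM-DD path segment (assumed URL convention).
DATE_IN_URL = re.compile(r"/((?:19|20)\d{2})[/-](\d{1,2})[/-](\d{1,2})(?=/|$)")


def date_from_url(url):
    """Return an ISO date string found in the URL path, or None."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        # Digits matched the pattern but are not a real calendar date.
        return None


def add_dates(records):
    """Fill in a missing "date" field from the record's "url" where possible."""
    for record in records:
        if not record.get("date"):
            found = date_from_url(record.get("url", ""))
            if found is not None:
                record["date"] = found
    return records
```

Records whose URL carries no recognizable date are left unchanged, so they can still be excluded or handled separately when building the stacked area graph.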

Action Items

John

  • Finishing work on the size monitoring feature for the postprocessor and testing it
  • Writing to Alejandro about the problem with Graham and getting kicked out
  • If time permits, beginning work on the troubleshooting of Metascraper in ticket 35
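
One possible shape for the size monitoring feature is sketched below. It is not the postprocessor's actual implementation: the class name, the JSON Lines output, and the use of serialized length as a stand-in for memory use are all assumptions made for illustration.

```python
import json


class SpillingBuffer:
    """Collects records in memory and appends them to disk past a size limit."""

    def __init__(self, path, max_bytes):
        self.path = path
        self.max_bytes = max_bytes
        self.records = []
        self.approx_bytes = 0

    def add(self, record):
        line = json.dumps(record)
        self.records.append(line)
        # Serialized length is a rough proxy, not the true process memory.
        self.approx_bytes += len(line)
        if self.approx_bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        """Append buffered records to the file, one JSON object per line."""
        if not self.records:
            return
        with open(self.path, "a", encoding="utf-8") as handle:
            for line in self.records:
                handle.write(line + "\n")
        self.records.clear()
        self.approx_bytes = 0
```

Calling `flush()` once more at the end of a run writes whatever is still buffered, so a late out-of-memory failure would lose at most the records added since the last flush.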

Colin

  • Providing .csv to Alejandro
  • Reprocessing the output JSON to include the dates that are available in URL for the domain hits
  • Trying to build the stacked area or network diagrams if time permits
  • Responding to any info requests on the thread that Alejandro starts with Jacqueline

Alejandro

  • Starting thread with Jacqueline - we need to find out how to properly add Colin to the Compute Canada resources and update the documentation to be correct.