October 7, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Update on Action Items
  • SSH into Compute Canada for Colin
  • GitHub vulnerability

Action Items from last day:

  1. John to feed the post-processor the same data again (the output from the twitter and domain crawler) and feed it a different scope based on the tab "formatted_for_Mediacat" in this spreadsheet: https://docs.google.com/spreadsheets/d/1oYA1dkNvvsz_J5xlhl0_1NrayHbVo2E3W-FTZccL8nA/edit#gid=1838968997. Please let us know what data won't transfer to the post-processing scope. This output should be passed to Colin, or communicate roadblocks.
  • still running since last Thursday: the postprocessor has handled about 600,000 of the 4.5 million Twitter articles --> estimate of 40 days just to process, and it then still needs to write the JSON files.

  • possibility: John could look into modifying postprocessor to write articles' results before continuing to postprocess Twitter

  • possibility: optimize resource use? Trying to separate the postprocessing into different threads probably wouldn't work.

  • next step: kill the current postprocess and feed only the domain articles to the postprocessor, which should give results in 2-4 days

  • potential next step: look into rewriting the postprocessor to enable multiprocessing

  • potential next step: with results of article postprocessing in hand, Colin and John could ensure that there isn't some unnecessary data or duplication in resulting files
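A minimal sketch of the two postprocessor ideas above (multiprocessing, and writing results out as they are produced rather than at the end). This is not the actual postprocessor: `postprocess_article`, the `id`/`text` fields, and the JSON Lines output file are hypothetical stand-ins for illustration only.

```python
import json
from multiprocessing import Pool


def postprocess_article(article):
    """Hypothetical stand-in for the real per-article postprocessing step."""
    text = article.get("text", "")
    return {"id": article.get("id"), "word_count": len(text.split())}


def write_results(results, out_path):
    """Write each result as one JSON line as soon as it arrives."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for result in results:
            out.write(json.dumps(result) + "\n")
            written += 1
    return written


def postprocess_parallel(articles, out_path, workers=4, chunksize=100):
    """Spread articles over worker processes and stream results to disk."""
    with Pool(processes=workers) as pool:
        # imap yields results in input order while the workers run ahead,
        # so nothing has to wait for the whole batch before being written.
        results = pool.imap(postprocess_article, articles, chunksize=chunksize)
        return write_results(results, out_path)


if __name__ == "__main__":
    sample = [{"id": i, "text": "word " * i} for i in range(1000)]
    print(postprocess_parallel(sample, "postprocessed.jsonl"))
```

Writing one line per article means a killed run still leaves usable partial output. Whether this helps in practice depends on how much of the real postprocessing is independent per article.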

  1. John to review the Python script here: https://github.com/UTMediaCAT/mediacat-frontend/blob/master/utils/postprocessing_stacked_area_chart_single_domain_crawl.ipynb and try to run it against the output from the twitter and domain crawler. We assume that June has built an alternative to the post-processor, and want to see the output. This output should be passed to Colin, or communicate roadblocks.
  • John doesn't think the script will help: it only uses a dictionary, so the scope had to be input through another config; it organizes by date, but too many of the dates are 0

  • we will move on from this script

  1. Colin to try making graphs based on output.json (force vector diagram) and interest-output.json (??). Just the twitter output might be an interesting diagram.
  • problem with network diagram: too many individual nodes to process on a laptop; Colin was able to do a pie chart with just the number of references from

  • potential future function that Colin suggests: make the Mediacat web interface interactive.

  • new script: reads interest-output.json and converts it into CSV/JSON -- question: where should this be stored?
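For reference, a rough sketch of what a JSON-to-CSV conversion like this could look like. It is not Colin's script. The layout of interest-output.json is not recorded in these notes, so the code assumes the top level is either a list of objects or a dict keyed by ID; `json_to_csv` and the file paths are hypothetical.

```python
import csv
import json


def json_to_csv(json_path, csv_path):
    """Flatten a JSON file of records into a CSV; returns the number of rows."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: top level is a list of objects or a dict keyed by ID.
    records = list(data.values()) if isinstance(data, dict) else list(data)

    # Collect the union of keys across all records, in first-seen order.
    fields = []
    for rec in records:
        for key in rec:
            if key not in fields:
                fields.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for rec in records:
            # Nested lists/dicts are stored as JSON strings, one row per record.
            writer.writerow({
                k: json.dumps(v) if isinstance(v, (list, dict)) else v
                for k, v in rec.items()
            })
    return len(records)
```

Note that `json.load` reads the whole file into memory, so a very large crawler output would likely need a streaming parser instead.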

  1. John is also re-running a full crawl with the full scope and taking down some data points about how fast (slow) it is.
  • this is waiting for the end of the postprocessing. Paused for now.

SSH into Compute Canada for Colin:

  • Colin is still having issues trying to SSH into Compute Canada; he'll try to find a resource with an explanation, or else let Alejandro know, and Alejandro will write to [email protected] or [email protected]

GitHub Vulnerability:

  • Buffer overflow in Pillow (critical severity): package pillow, CVE-2021-34552, flagged in UTMediaCAT/Voyage requirements.txt
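A small sketch for checking whether a requirements.txt pins an affected Pillow version. It assumes the fix for CVE-2021-34552 first shipped in Pillow 8.3.0 (worth confirming against the GitHub advisory) and only handles exact `==` pins; the function names are hypothetical.

```python
import re

# Assumption: first Pillow release containing the CVE-2021-34552 fix.
FIXED_IN = (8, 3, 0)


def pinned_pillow_version(requirements_text):
    """Return the ==-pinned Pillow version as a tuple of ints, or None."""
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.fullmatch(
            r"pillow\s*==\s*(\d+(?:\.\d+)*)", line, flags=re.IGNORECASE
        )
        if match:
            return tuple(int(part) for part in match.group(1).split("."))
    return None


def is_vulnerable(requirements_text):
    """True/False for a pinned version, None if Pillow is not pinned with ==."""
    version = pinned_pillow_version(requirements_text)
    if version is None:
        return None
    # Pad short versions such as 8.2 to 8.2.0 before comparing.
    padded = version + (0,) * (3 - len(version))
    return padded < FIXED_IN
```

For example, `is_vulnerable(open("requirements.txt").read())` could be run against each repository that still depends on Pillow.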

Action Items

  1. Pending confirmation with Kirsta: John will kill the postprocessor and run it only on the domain article crawl output

  2. Pending confirmation with Kirsta: John will look into multiprocessing with the postprocessor

  3. Colin will receive the postprocessor output in about 4 days, look it over for potential issues, and look at creating a few kinds of diagrams with it

  4. Colin will find a place to store the script for reading large output JSON files from the crawlers, and add this to the MVP notes.

  5. Colin will figure out the SSH into Compute Canada as above

  6. John will deprecate the Voyage repository