October 7, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

John to feed the post-processor the same data again (the output from the Twitter and domain crawlers) and feed it a different scope based on the tab "formatted_for_Mediacat" in this spreadsheet: https://docs.google.com/spreadsheets/d/1oYA1dkNvvsz_J5xlhl0_1NrayHbVo2E3W-FTZccL8nA/edit#gid=1838968997. Please let us know what data won't transfer to the post-processing scope. John should pass this output to Colin, or communicate any roadblocks.

still running since last Thursday: postprocessing the Twitter articles, about 600,000 out of 4.5 million processed so far --> estimate of 40 days just to process, and then it still needs to write the JSON files.
possibility: John could look into modifying postprocessor to write articles' results before continuing to postprocess Twitter
possibility: optimize the resource? probably wouldn't work to try and separate out different threads of postprocessing.
next step: kill the current postprocess, and feed only domain articles to the postprocessor, which should give us results in 2-4 days
potential next step: look into rewriting the postprocessor to enable multiprocessing
potential next step: with results of article postprocessing in hand, Colin and John could ensure that there isn't some unnecessary data or duplication in resulting files
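
The two ideas above (writing results before the whole run finishes, and multiprocessing) could be combined roughly as follows. This is only a sketch under assumptions, not the actual postprocessor: `postprocess_article`, `batches`, `run`, the record fields `id` and `text`, and the `batch_NNNNN.json` file naming are all hypothetical stand-ins.

```python
import json
import tempfile
from multiprocessing import Pool
from pathlib import Path


def postprocess_article(article):
    """Hypothetical stand-in for the real per-article postprocessing step."""
    return {"id": article["id"], "word_count": len(article["text"].split())}


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run(articles, out_dir, batch_size=2, workers=2):
    """Postprocess in parallel and write one JSON file per finished batch."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with Pool(processes=workers) as pool:
        for n, batch in enumerate(batches(articles, batch_size)):
            # Each batch is spread across the worker processes.
            results = pool.map(postprocess_article, batch)
            # Results reach disk as soon as the batch is done, so a killed
            # run keeps everything written up to that point.
            path = out_dir / f"batch_{n:05d}.json"
            path.write_text(json.dumps(results))
            written.append(path)
    return written


if __name__ == "__main__":
    # Made-up sample records, only to exercise the sketch.
    sample = [{"id": i, "text": "word " * (i + 1)} for i in range(5)]
    with tempfile.TemporaryDirectory() as tmp:
        for path in run(sample, tmp):
            print(path.name, json.loads(path.read_text()))
```

Whether this helps depends on where the real postprocessor spends its time; if the bottleneck is matching across articles rather than per-article work, splitting into independent batches may not apply directly.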

John to review the Python script here: https://github.com/UTMediaCAT/mediacat-frontend/blob/master/utils/postprocessing_stacked_area_chart_single_domain_crawl.ipynb and try to run it against the output from the Twitter and domain crawlers. We assume that June has built an alternative to the post-processor, and want to see the output. John should pass this output to Colin, or communicate any roadblocks.

John doesn't think the script will help, only uses a dictionary, so had to input scope through another config; organizes by date, but too many 0 dates
we will move on from this script

Colin to try making graphs based on output.json (force vector diagram) and interest-output.json (??). Just the Twitter output might make an interesting diagram.

problem with network diagram: too many individual nodes to process on a laptop; Colin was able to do a pie chart with just the number of references from
potential future function that Colin suggests: Mediacat web interface to be interactive.
new script: reads interest-output.json and converts it into CSV/JSON -- question, where should this be stored?
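
For reference, a conversion script of the kind described could look something like the sketch below. It is not Colin's script, and the layout of interest-output.json is an assumption here: the top level is taken to be either a list of objects or an object mapping an id to an object. The helper names `records_from_json` and `to_csv` are hypothetical.

```python
import csv
import io
import json


def records_from_json(text):
    """Parse JSON text into a list of dict rows.

    Assumption: the top level is either a list of objects, or an object
    mapping an id to an object (the id is then stored under "id").
    """
    data = json.loads(text)
    if isinstance(data, dict):
        return [{"id": key, **value} for key, value in data.items()]
    return list(data)


def to_csv(rows):
    """Render rows as CSV text; nested values are kept as JSON strings."""
    # Collect column names in first-seen order across all rows.
    fields = []
    for row in rows:
        for key in row:
            if key not in fields:
                fields.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({
            k: json.dumps(v) if isinstance(v, (list, dict)) else v
            for k, v in row.items()
        })
    return buffer.getvalue()
```

For a real run the text would come from the crawler output file, for example `to_csv(records_from_json(open("interest-output.json").read()))`. Loading the whole file at once may not be practical for the largest outputs, in which case a streaming parser would be needed instead.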

John is also re-running a full crawl with the full scope and taking down some data points about how fast (slow) it is.

Colin is still having issues trying to SSH into Compute Canada; he'll try to find a resource with an explanation or else let Alejandro know, and Alejandro will write to [email protected] or [email protected]

Buffer overflow in Pillow (critical severity, CVE-2021-34552): pillow in UTMediaCAT/Voyage requirements.txt

Pending confirmation with Kirsta: John will kill postprocessor and run only domain article crawl output
Pending confirmation with Kirsta: John will look into multiprocessing with the postprocessor
Colin would receive this in about 4 days, and would look it over for potential issues, and look at creating a few kinds of diagrams with it
Colin will find a place to store script for reading large output jsons from crawlers, and add this to the MVP notes.
Colin will figure out the SSH into Compute Canada as above
John will deprecate the Voyage repository