October 7, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- Update on Action Items
- SSH into Compute Canada for Colin
- GitHub vulnerability
Action Items from last day:
- John to feed the post-processor the same data again (the output from the Twitter and domain crawlers), this time with a different scope based on the tab "formatted_for_Mediacat" in this spreadsheet: https://docs.google.com/spreadsheets/d/1oYA1dkNvvsz_J5xlhl0_1NrayHbVo2E3W-FTZccL8nA/edit#gid=1838968997. Please let us know what data won't transfer to the post-processing scope. John should pass the output to Colin, or communicate any roadblocks.
  - still running since last Thursday: postprocessing of the Twitter articles has covered about 600,000 out of 4.5 million, giving an estimate of 40 days just to process, after which it still needs to write the JSON files.
  - possibility: John could look into modifying the postprocessor to write the articles' results before continuing to postprocess Twitter
  - possibility: optimize resource use? It probably wouldn't work to try to separate out different threads of postprocessing.
  - next step: kill the current postprocess and feed only domain articles to the postprocessor, which should give us results in 2-4 days
  - potential next step: look into rewriting the postprocessor to enable multiprocessing
  - potential next step: with the results of article postprocessing in hand, Colin and John could check that there isn't unnecessary data or duplication in the resulting files
- John to review the Python script here: https://github.com/UTMediaCAT/mediacat-frontend/blob/master/utils/postprocessing_stacked_area_chart_single_domain_crawl.ipynb and try to run it against the output from the Twitter and domain crawlers. We assume that June has built an alternative to the post-processor, and want to see the output. John should pass the output to Colin, or communicate any roadblocks.
  - John doesn't think the script will help: it only uses a dictionary, so he had to input the scope through another config; it organizes by date, but there are too many 0 dates
  - we will move on from this script
- Colin to try making graphs based on output.json (force vector diagram) and interest-output.json (??). Just the Twitter output might make an interesting diagram.
  - problem with network diagram: too many individual nodes to process on a laptop; Colin was able to do a pie chart with just the number of references from
  - potential future function that Colin suggests: making the Mediacat web interface interactive.
  - new script: reads interest-output.json and converts it into CSV/JSON. Question: where should this be stored?
- John is also re-running a full crawl with the full scope and taking down some data points about how fast (slow) it is.
  - this is waiting for the end of the postprocessing. Paused for now.
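The multiprocessing idea raised above for the postprocessor could look roughly like the sketch below. It is not the actual postprocessor code: `postprocess_article`, `postprocess_all`, and the `id`/`links` fields are hypothetical names, and the sketch assumes each article can be processed independently of the others, which may not hold if the postprocessor needs cross-references between articles.

```python
from multiprocessing import Pool


def postprocess_article(article):
    """Hypothetical stand-in for the per-article work the postprocessor does."""
    # Assumption: an article is a dict with an "id" and an optional "links" list.
    return {"id": article["id"], "n_links": len(article.get("links", []))}


def postprocess_all(articles, workers=4, chunksize=100):
    """Process independent articles, optionally across several worker processes."""
    if workers <= 1:
        # Serial fallback, handy for debugging or small inputs.
        return [postprocess_article(a) for a in articles]
    with Pool(processes=workers) as pool:
        # imap preserves input order; chunksize batches articles sent to each
        # worker to reduce inter-process overhead.
        return list(pool.imap(postprocess_article, articles, chunksize=chunksize))
```

When run as a script, the call with `workers` greater than 1 should sit under an `if __name__ == "__main__":` guard. Because `imap` yields results as they are produced, the same loop could also write each result to disk incrementally instead of collecting a list.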
SSH into Compute Canada for Colin:
- Colin is still having issues trying to SSH into Compute Canada; he'll try to find a resource with an explanation, or else let Alejandro know, and Alejandro will write to [email protected] or [email protected]
GitHub Vulnerability:
- Buffer overflow in Pillow (critical severity), CVE-2021-34552, flagged in requirements.txt of UTMediaCAT/Voyage
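A quick way to check whether a requirements file pins an affected Pillow release is sketched below. It assumes 8.3.0 is the first release containing the fix for CVE-2021-34552 (the advisory covers Pillow through 8.2.0) and only understands exact `==` pins. The function names are hypothetical and this is not a replacement for GitHub's own alert.

```python
import re

# Assumption: 8.3.0 is treated as the first Pillow release with the fix.
PATCHED = (8, 3, 0)


def pinned_pillow_version(requirements_text):
    """Return the '==' pinned Pillow version as a tuple of ints, or None."""
    for line in requirements_text.splitlines():
        match = re.match(r"\s*pillow\s*==\s*([0-9]+(?:\.[0-9]+)*)", line, re.IGNORECASE)
        if match:
            return tuple(int(part) for part in match.group(1).split("."))
    return None


def is_vulnerable(requirements_text):
    """True/False for an exact pin, None when no exact Pillow pin is found."""
    version = pinned_pillow_version(requirements_text)
    if version is None:
        return None
    # Pad short versions such as "8.3" to three components before comparing.
    padded = (version + (0, 0, 0))[:3]
    return padded < PATCHED
```

For example, `is_vulnerable(open("requirements.txt").read())` would flag a file that pins `Pillow==8.2.0`.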
Action Items
- Pending confirmation with Kirsta: John will kill the postprocessor and run it on only the domain article crawl output
- Pending confirmation with Kirsta: John will look into multiprocessing with the postprocessor
- Colin would receive the domain article output in about 4 days, and would look it over for potential issues and look at creating a few kinds of diagrams with it
- Colin will find a place to store the script for reading large output JSONs from the crawlers, and add this to the MVP notes.
- Colin will figure out the SSH into Compute Canada as above
- John will deprecate the Voyage repository
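For reference, a script that converts crawler JSON output into CSV could be as small as the sketch below. This is not Colin's script: the function name is hypothetical, and it assumes the input file holds a JSON array of objects, which has not been confirmed for interest-output.json. It also loads the whole file into memory, so very large outputs would need a streaming parser instead.

```python
import csv
import json


def json_records_to_csv(json_path, csv_path):
    """Write a JSON array of objects to CSV; returns the number of records."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)  # Assumption: top level is a list of dicts.

    # Collect column names in first-seen order across all records.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for record in records:
            # Nested lists/dicts are stored as JSON strings in a single cell;
            # keys missing from a record are left as empty cells.
            row = {
                key: json.dumps(value) if isinstance(value, (list, dict)) else value
                for key, value in record.items()
            }
            writer.writerow(row)
    return len(records)
```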