October 14, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

Action Items

Pending confirmation with Kirsta: John will kill postprocessor and run only domain article crawl output

Complete and passed to Colin.

Pending confirmation with Kirsta: John will look into multiprocessing with the postprocessor

John is setting up multiprocessing and has tried two approaches. 1. A shared Python dictionary that every process reads from and writes to: it ran very quickly, but race conditions left data missing, so it didn't work out. 2. Each process keeps its own dictionary, with a final merge before output: the output seems to be what we expected (the same as running with a single process). The merge still takes a lot of time, so this saved only 7 minutes, but with a bigger scope we'd likely see a bigger difference. He is now running the full scope with the new multiprocess version; it has done about a quarter of the domain issues in about an hour, with 10 processes running for each of domain and twitter (20 altogether).
Cross-referencing is separate from merging. Merging goes to each article ID and accumulates the references from each of the processes into the final Python dictionary.
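A minimal sketch of the second approach (private dictionary per process, then a merge), not the actual postprocessor code. The record fields `id` and `links_to`, the `SAMPLE` data, and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from multiprocessing import Pool

# Hypothetical records: each article lists the article IDs it links to.
SAMPLE = [
    {"id": "a", "links_to": ["b", "c"]},
    {"id": "b", "links_to": ["c"]},
    {"id": "d", "links_to": ["c", "b"]},
]

def build_partial(articles):
    """Worker step: build a private {article_id: [referrer ids]} dictionary."""
    partial = defaultdict(list)
    for article in articles:
        for target_id in article["links_to"]:
            partial[target_id].append(article["id"])
    return dict(partial)

def merge_partials(partials):
    """Merge step: for each article ID, accumulate the references from every worker."""
    merged = defaultdict(list)
    for partial in partials:
        for article_id, refs in partial.items():
            merged[article_id].extend(refs)
    return dict(merged)

def split(items, n):
    """Split items into n roughly equal chunks, one per process."""
    return [items[i::n] for i in range(n)]

def run_parallel(articles, processes=10):
    """Run the worker step in a process pool, then merge in the parent."""
    with Pool(processes) as pool:
        partials = pool.map(build_partial, split(articles, processes))
    return merge_partials(partials)
```

`run_parallel(SAMPLE, processes=2)` would be called under an `if __name__ == "__main__":` guard. The merge runs in a single process, which fits the observation that merging dominates the runtime on a small scope.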
Alejandro notes on multiprocessing postprocessor:
1st approach: shared reading and writing -- ran quickly, but missing referrals; race condition when 2 processes were writing to the same key
To address the race condition: could use a queue, but it wouldn’t apply in this case; the issue is when writing to the same key (John thinks)
Nat: is there a locking mechanism? Worth a shot to look at.
2nd approach: each process keeps its own dictionary, and then merge: expected output, but merging step takes a lot of time, only 7 minutes faster; likely a bigger difference in time with the whole scope
Merging step: each id for an article, accumulating the references for each one ; 10 processes -- need
Reproducibility & documentation: single line of code can split ; to set number ; add a script to give the number ; logic of original script is still there, it’s just a number that needs to be changed
Nat: try a more formal map-reduce using existing libraries in Python
3rd approach: assign keys to each process, but shared dictionary
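One possible reading of the 3rd approach, sketched under assumptions: every key (article ID) is owned by exactly one worker, so no two processes ever write the same key. The `(article_id, referrer)` pair format and all function names here are hypothetical. A stable hash (`zlib.crc32`) is used because Python's built-in string `hash` can differ between processes.

```python
import zlib

def owner_of(article_id, n_workers):
    """Map an article ID to the index of the worker that owns it (stable across processes)."""
    return zlib.crc32(article_id.encode("utf-8")) % n_workers

def partition_by_key(pairs, n_workers):
    """Route each (article_id, referrer) pair to the bucket of the owning worker."""
    buckets = [[] for _ in range(n_workers)]
    for article_id, referrer in pairs:
        buckets[owner_of(article_id, n_workers)].append((article_id, referrer))
    return buckets

def build_owned(bucket):
    """Worker step: accumulate referrers for the keys this worker owns."""
    owned = {}
    for article_id, referrer in bucket:
        owned.setdefault(article_id, []).append(referrer)
    return owned
```

Because the key sets are disjoint, the per-worker results can be combined with a plain `dict.update` (or written into a shared dictionary) without a per-key merge. The trade-off is an extra routing pass over the pairs before the workers start.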

Colin would receive this in about 4 days, and would look it over for potential issues, and look at creating a few kinds of diagrams with it

Output looks smaller all around, probably with less redundancy. Looked at the NYT scope domain crawl using the old postprocessor and it looked fine. The output file is smaller, as is interest-output.

Colin will find a place to store script for reading large output jsons from crawlers, and add this to the MVP notes.

Colin will look at the repo front-end, and esp look at visualizations ticket (#7); will author a new pull request and we'll look at this request next week

Colin will figure out the SSH into Comp Canada as above

Alejandro will follow up on this

John will deprecate the Voyage repository

John will do that this week

We will reference the following project: https://github.com/orgs/UTMediaCAT/projects/1

Kirsta & Alejandro: answer Shengsong

Action Items:

John will continue looking at the various approaches to optimizing the postprocessor speed:

try solving the race condition when using the same dictionary by adding a locking mechanism
when using separate dictionaries, try a formal map-reduce from an existing Python library
for one dictionary, try assigning separate keys to each process, with a shared dictionary.
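For the locking idea, a rough sketch assuming the shared dictionary is a `multiprocessing.Manager().dict()`. The function names and the `(article_id, referrer)` pair format are hypothetical, not taken from the postprocessor. The point is that the read and the write-back for a key happen inside one locked section, so two processes cannot overwrite each other's update.

```python
from multiprocessing import Manager, Process

def add_reference(shared, lock, article_id, referrer):
    """Append one referrer under article_id as a single locked read-modify-write."""
    with lock:
        refs = shared.get(article_id, [])
        # Reassign the value so a Manager dict proxy registers the change.
        shared[article_id] = refs + [referrer]

def worker(shared, lock, pairs):
    """Process a chunk of (article_id, referrer) pairs."""
    for article_id, referrer in pairs:
        add_reference(shared, lock, article_id, referrer)

def run_locked(chunks):
    """Start one process per chunk, all writing to one lock-protected shared dict."""
    with Manager() as manager:
        shared = manager.dict()
        lock = manager.Lock()
        procs = [Process(target=worker, args=(shared, lock, chunk)) for chunk in chunks]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return dict(shared)
```

`run_locked` would be called under an `if __name__ == "__main__":` guard. A single global lock serializes every write, so this may give up much of the speed seen in the unlocked shared-dictionary run.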

Colin will look at the repo front-end, and esp look at visualizations ticket (#7); will author a new pull request and we'll look at this request next week
John will deprecate Voyage repository
Alejandro will follow up on SSH issue for Compute Canada