September 30, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Postprocessor update

  • Any Compute Canada issues -- Colin SSH?

  • Action items from last day:

    • Kill the running crawler processes
    • Re-run the postprocessor with the full output of both NYT & Twitter
    • What to do with the interest?
    • Alejandro will communicate with Amy about the "output" & "interest-output" distinction
    • Colin will attempt to stream (or whatever it's called) the interest.json
    • If time allows, Colin will attempt a visualization
  • Alejandro question: can we start a large crawl soon?

Postprocessor update

  • John sent it to Colin, and

New Scope Document that we should use in future: https://docs.google.com/spreadsheets/d/1oYA1dkNvvsz_J5xlhl0_1NrayHbVo2E3W-FTZccL8nA/edit#gid=1838968997

Action Items

  1. John to feed the post-processor the same data again (the output from the Twitter and domain crawlers), this time with a different scope based on the tab "formatted_for_Mediacat" in this spreadsheet: https://docs.google.com/spreadsheets/d/1oYA1dkNvvsz_J5xlhl0_1NrayHbVo2E3W-FTZccL8nA/edit#gid=1838968997. Please let us know what data won't transfer to the post-processing scope. Pass the output to Colin, or communicate any roadblocks.

  2. John to review the Python notebook here: https://github.com/UTMediaCAT/mediacat-frontend/blob/master/utils/postprocessing_stacked_area_chart_single_domain_crawl.ipynb and try to run it against the output from the Twitter and domain crawlers. We assume that June has built an alternative to the post-processor, and we want to see its output. Pass the output to Colin, or communicate any roadblocks.

  3. Colin to try making graphs based on output.json (force vector diagram) and interest-output.json (??). The Twitter output on its own might make an interesting diagram.

  4. John is also re-running a full crawl with the full scope and taking down some data points about how fast (slow) it is.
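
For item 3, one possible starting point is to reshape the post-processor records into a node list and a weighted link list, the general shape that force-layout tools usually take as input. This is only a rough sketch: the field names `url` and `found_urls` are assumptions made for illustration and are not taken from the actual output.json schema, so they would need to be swapped for whatever keys the post-processor really emits.

```python
import json
from collections import Counter


def build_graph(records, source_key="url", targets_key="found_urls"):
    """Turn crawl records into a node list and a weighted link list.

    source_key and targets_key are hypothetical field names; replace them
    with the keys that actually appear in the post-processor output.
    """
    weights = Counter()
    nodes = set()
    for record in records:
        source = record.get(source_key)
        if not source:
            # Skip records that have no source identifier.
            continue
        nodes.add(source)
        for target in record.get(targets_key) or []:
            nodes.add(target)
            # Count repeated source -> target references as link weight.
            weights[(source, target)] += 1
    return {
        "nodes": [{"id": name} for name in sorted(nodes)],
        "links": [
            {"source": s, "target": t, "value": w}
            for (s, t), w in sorted(weights.items())
        ],
    }


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"url": "a.example", "found_urls": ["b.example", "c.example"]},
        {"url": "b.example", "found_urls": ["c.example"]},
        {"url": "a.example", "found_urls": ["b.example"]},
    ]
    print(json.dumps(build_graph(sample), indent=2))
```

Counting repeated references as a link weight keeps the diagram readable when the same pair of sources appears many times. Whether that is the right aggregation for interest-output.json is still an open question.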