September 23, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
update on Compute Canada: both John and Colin understanding? Follow up questions to Raiyan?
update on postprocessor: "John will run the postprocessor, and record the size of data from TWINNT & Domain Crawler, and how long it takes, and size of output file"
single-site post-processor?
coop hiring
finalizing paperwork for Colin?
Postprocessor
the TWINNT output wasn't in the format the postprocessor was expecting, so John wrote a bridging function
memory issue: some crawled JSON files were too big to read, so the results are not complete
processed: 10,000 small JSONs (8 were skipped) and all the TWINNT data
total time: 3151 seconds (about 52.5 minutes)
Output format does seem to meet the expectations of the spreadsheet output format
Problem: can't open the largest JSON (3 GB)
question: output (regular output) & interest-output (outside of scope):
Compute Canada Issues
John followed up with Raiyan, and Raiyan said that the Chrome browser should be killed when the crawler terminates; not sure why that isn't happening. He suggested a reboot (John will test this)
there are some new files created in the crawler, but we will focus on the output that
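For the leftover Chrome processes, a minimal sketch of a cleanup helper is below. It assumes a Linux node where `ps` is available and that the stray processes have "chrome" somewhere in their command line; the function names `find_pids` and `terminate_leftovers` are hypothetical and not part of the crawler.

```python
import getpass
import os
import signal
import subprocess


def find_pids(ps_output: str, needle: str) -> list[int]:
    """Return PIDs whose command line contains `needle` (case-insensitive)."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)  # "PID command line..."
        if len(parts) == 2 and needle.lower() in parts[1].lower():
            pids.append(int(parts[0]))
    return pids


def terminate_leftovers(needle: str = "chrome") -> list[int]:
    """Send SIGTERM to the current user's processes matching `needle`."""
    # List only this user's processes: PID and full command line, no header.
    out = subprocess.run(
        ["ps", "-u", getpass.getuser(), "-o", "pid=", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
    # Never signal this script itself, even if its own name matches.
    pids = [p for p in find_pids(out, needle) if p != os.getpid()]
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # process exited between listing and signalling
    return pids
```

Calling `terminate_leftovers()` after a crawl returns the PIDs it signalled. The match is a plain substring test, so it is worth printing `find_pids(...)` first to check nothing unrelated is caught.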
Tasks:
kill the running processes of crawler
re-run postprocessor with full output of both NYT & twitter
what to do with the interest?
Alejandro will communicate with Amy about the "output" & "interest-output" distinction
Colin will attempt to stream (or whatever it's called) the interest.json
if time allows, Colin will attempt a visualization
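On streaming the large JSON: a minimal sketch, using only the Python standard library, that yields records one at a time instead of loading the whole file into memory. It assumes the file is a single top-level JSON array (the actual layout of interest.json is not recorded in these notes), and `iter_json_array` is a hypothetical helper name.

```python
import json
from typing import Iterator, TextIO


def iter_json_array(fp: TextIO, chunk_size: int = 65536) -> Iterator[object]:
    """Yield the elements of a top-level JSON array one at a time."""
    decoder = json.JSONDecoder()
    buf = ""
    while not buf:  # skip any leading whitespace before the '['
        chunk = fp.read(chunk_size)
        if not chunk:
            raise ValueError("empty input")
        buf = chunk.lstrip()
    if not buf.startswith("["):
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]
    while True:
        # Drop whitespace and element separators (lenient about extra commas).
        buf = buf.lstrip(" \t\r\n,")
        if buf.startswith("]"):
            return
        try:
            item, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            # Element is incomplete in the buffer: read more and retry.
            chunk = fp.read(chunk_size)
            if not chunk:
                raise  # real syntax error or truncated file
            buf += chunk
            continue
        if end == len(buf):
            # The value touches the buffer edge (e.g. a number cut in half),
            # so fetch more data before trusting the parse.
            chunk = fp.read(chunk_size)
            if chunk:
                buf += chunk
                continue
        yield item
        buf = buf[end:]
```

Usage would look like `with open("interest.json", encoding="utf-8") as fp: for record in iter_json_array(fp): ...`. Each incomplete element is re-parsed from its start whenever a new chunk arrives, so this suits files made of many modest-sized records; a third-party streaming parser such as `ijson` may be a better fit if individual records are very large.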