November 5, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

Action Items from last meeting

  • John to finish up work on memory issue and start work on: https://github.com/UTMediaCAT/mediacat-backend/issues/12
  • Colin to build stacked area graph using existing stacked area chart builder with twitter data
  • Colin to provide sample date information to Alejandro
  • Alejandro (or RA) to review date information for accuracy
  • Colin to create documentation page around SSH for ComputeCanada resources.
  • John to talk to Jacqueline about the role of Metascraper/how to run and in the next meeting we will update the programmatic flow diagram in Padlet to document this.
  • John to ask Raiyan and confirm that non-scope URLs would still wind up in FoundURLs in the JSON so we can see it's a bug in the domain crawler leading to the behaviour we are seeing in the output.
  • John to start a domain crawl at al-monitor.com if he can confirm that discovered URLs are not thrown out (where they are written).
  • Colin to look at Voyage repository sometime in the next couple of weeks probably (not for the next meeting)
  • Alejandro to reach out to Shensong to see if he can come to some earlier MediaCat meetings

Notes

  • Alejandro got in the ComputeCanada application! (Hooray!) The KPP stuff wound up being a great connection to make in the application. Perhaps applying for a new and larger grant. Asked for 16 vCPUs and 64 RAM (will need to figure out multiprocessing).

  • John to finish up work on memory issue: the Graham issue came up again, but when he went through the same steps to reproduce it, he got a new message about not being able to retrieve projects (eep). Close to a solution, but unsure.

  • John worked on implementing this in the post-processor: https://github.com/UTMediaCAT/mediacat-backend/issues/12 The twitter loading step took hours.

  • Script that creates new copies of .csv with unfurled URLs. Added a function to add this to the twitter crawler. It seems to work fine, but could be expensive and slow down the crawl. John wrote a script to rewrite the .csv that we could also use to process the data after it was crawled. Updated https://github.com/UTMediaCAT/mediacat-backend/issues/12

  • Raiyan didn't answer John re: empty keys in JSON and foundurl. John also asked him about metascraper. We'll wait another week.

  • Colin to build stacked area graph using existing stacked area chart builder with twitter data. Colin refactored the code and needs to double check that the stacked area chart is correct. Sent it to Alejandro.

  • Colin to provide sample date information to Alejandro - complete!

  • Alejandro (or RA) to review date information for accuracy - complete!

  • Colin to create documentation page around SSH for ComputeCanada resources - added to the wiki! (There's now a setting up SSH section)

  • John to talk to Jacqueline about the role of Metascraper/how to run and in the next meeting we will update the programmatic flow diagram in Padlet to document this.

  • John to ask Raiyan and confirm that non-scope URLs would still wind up in FoundURLs in the JSON so we can see it's a bug in the domain crawler leading to the behaviour we are seeing in the output.

  • John to start a domain crawl at al-monitor.com if he can confirm that discovered URLs are not thrown out (where they are written).

  • Colin to look at Voyage repository sometime in the next couple of weeks probably (not for the next meeting)

  • Alejandro to reach out to Shensong to see if he can come to some earlier MediaCat meetings

  • Alejandro asks: Is it possible to get a CSV with the following columns in addition to what there is: tweet author (handle), tweet text, date, specific expanded URL in tweet (where there is one), all mentioned twitter handles with spikes in between.
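
For reference, the note above about rewriting the .csv with unfurled URLs after a crawl could look roughly like the sketch below. This is not John's script. The column names `url` and `unfurled_url`, the function names, and the resolver are all assumptions made for illustration. The resolver is passed in as a parameter so that the expensive network step can be swapped out or cached, which is the cost concern raised in the notes.

```python
import csv
import io
import urllib.request


def resolve_with_urllib(url, timeout=10.0):
    """Hypothetical resolver: follow redirects and return the final URL.

    Falls back to the original URL if the request fails.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.geturl()
    except (OSError, ValueError):
        return url


def rewrite_csv_with_unfurled_urls(src, dst, resolve,
                                   url_column="url",
                                   out_column="unfurled_url"):
    """Copy CSV rows from src to dst, adding a column with the unfurled URL.

    src and dst are open text file objects. Column names are assumptions.
    Each distinct URL is resolved once and then served from a cache.
    Returns the number of data rows written.
    """
    cache = {}
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or [])
    if out_column not in fieldnames:
        fieldnames.append(out_column)
    writer = csv.DictWriter(dst, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()

    rows = 0
    for row in reader:
        short = (row.get(url_column) or "").strip()
        if short and short not in cache:
            cache[short] = resolve(short)
        # Rows without a URL get an empty value in the new column.
        row[out_column] = cache.get(short, "")
        writer.writerow(row)
        rows += 1
    return rows
```

Running this as a separate pass over finished crawl output keeps the crawl itself fast. Calling the same resolver inside the crawler would add one network round trip per uncached URL.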

Action Items

  • John to follow up with new issues in Graham cloud regarding memory.
  • John to update https://github.com/UTMediaCAT/mediacat-backend/issues/1 with progress on memory issues in application (He thinks we've got this resolved, but just need to test)
  • John to set up a twitter crawl with the modified code (that seeks to unfurl each URL) to see what kind of performance hit the crawler takes.
  • John to let Alejandro know if Raiyan hasn't responded by Monday.
  • Alejandro and Colin to check accuracy of stacked area graph.
  • Is Shensong coming to a meeting?
  • Colin to add dates workaround to repository: https://github.com/UTMediaCAT/mediacat-backend
  • John: If we can sort out stuff from Raiyan, great, but we will run al-monitor.com crawl regardless.
  • John to look at https://padlet.com/kirstastapelfeldt/sm88l9hv1rzy5ezx and the definition of the parts of the crawler. Make any suggestions for modifying this description for clarity and accuracy.
  • Colin to generate another .csv for Alejandro with these additional fields: tweet author (handle), tweet text, date, specific expanded URL in tweet (where there is one), all mentioned twitter handles with spikes in between.
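
As a starting point for the last item, here is a minimal sketch of how the requested columns could be assembled. It is not Colin's code. The input keys (`author`, `text`, `date`, `expanded_url`) and the output column names are hypothetical. The separator between handles is a parameter; the default `|` is only a guess at what "spikes" refers to. Mentions are pulled from the tweet text with a simple pattern, which is an assumption about where the handles come from.

```python
import csv
import io
import re

# Twitter handles: letters, digits, underscore, up to 15 characters.
# The lookbehind avoids matching the "@" inside an email address.
MENTION_RE = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,15})")

# Hypothetical output column names.
EXPORT_FIELDS = ["author", "text", "date", "expanded_url", "mentions"]


def build_export_row(tweet, separator="|"):
    """Build one output row from a tweet dict (input keys are assumptions)."""
    text = tweet.get("text", "")
    mentions = []
    seen = set()
    for handle in MENTION_RE.findall(text):
        key = handle.lower()
        if key not in seen:  # keep first occurrence, ignore case
            seen.add(key)
            mentions.append(handle)
    return {
        "author": tweet.get("author", ""),
        "text": text,
        "date": tweet.get("date", ""),
        "expanded_url": tweet.get("expanded_url", ""),  # empty when absent
        "mentions": separator.join(mentions),
    }


def write_export(tweets, dst, separator="|"):
    """Write all tweets to dst (an open text file object) as CSV."""
    writer = csv.DictWriter(dst, fieldnames=EXPORT_FIELDS)
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(build_export_row(tweet, separator))
```

Keeping all handles in one delimited cell means each tweet stays on a single row, so the file can still be sorted by date or author in a spreadsheet.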