October 28, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

TWITTER OUTPUT IS EXCITING!!

  • woot woot!
  • we're in business!!!
  • good work everyone!!!!
  • example of unexpected research idea: found several around wars on Gaza, and will be able to compare to NYT coverage
  • glad we have output-interest to gauge how common (e.g., cnn, etc)

update on John's action items:

  • Finishing work on size monitoring feature for post-processor and testing
  • Writing Alejandro about the problem with Graham and getting kicked out
  • If time permits, beginning work on the troubleshooting of Metascraper in ticket 35

update on Colin's action items:

  • Providing .csv to Alejandro
  • Reprocessing the output JSON to include the dates that are available in URL for the domain hits
  • Try building the stacked area or network diagrams if time permits
  • Responding with any info requests on the thread that Alejandro starts with Jacqueline

issues with summer domain & twitter crawl and/or postprocessor

  • not picking up hyperlinks--still in domain crawl output JSON?
  • not picking up dates & other metadata (author, article, title)
  • seems like domain crawl of nytimes.com/section/world/middleeast & politics
  • difference between csv output for visualizations and csv output as quasi-UI

project questions:

  • develop a second python domain crawler for domains where it will work?
  • do a crawl of a domain that is relatively small, e.g., al-monitor.com looks to be only 69,000 pages (nytimes.com is about 43 million).

Notes

  • Alejandro finds the Twitter output very interesting for research and Alejandro shares thanks.

John notes:

  • Got the size monitoring feature working with the entire scope, so the original memory issue is solved: data is now written to disk whenever it gets too big. One issue remained at the merging step, where it ran out of memory because the files to be merged were too big. John looked at them and they did seem abnormally large: the referrals contained many duplicates, with the same article ID listed many times when referring to a single article. He cleaned these out manually with a small script, which is how he got the results. He is now documenting and integrating this fix.
  • Writing Alejandro about the problem with Graham and getting kicked out - Couldn't find any memory info in the log, so a dead end. If it happens again, John will go there right away to see if there's a log message. Alejandro notes that we can follow up again with ComputeCanada if necessary. Will leave for now.
  • If time permits, beginning work on the troubleshooting of Metascraper in ticket 35 - not in depth yet, but when John reviewed the issues Alejandro sent, he did notice an entry "key" - metascraper title and metascraper author. No date information at all. Once John has resolved his current issues he'll move to ticket https://github.com/UTMediaCAT/mediacat-backend/issues/12 prior to returning to metascraper.
  • John goes to look and confirms that the Twitter files have dates.
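
A minimal sketch of the kind of referral clean-up John describes, assuming the referrals for an article can be treated as a plain list of article IDs. The function name and the list layout are hypothetical and are not taken from the post-processor code.

```python
def dedupe_referrals(referrals):
    """Drop repeated article IDs while keeping first-seen order.

    Assumption: `referrals` is a list of hashable article IDs (e.g. strings).
    """
    # dict.fromkeys keeps only the first occurrence of each key, and dicts
    # preserve insertion order, so the original ordering is retained.
    return list(dict.fromkeys(referrals))


if __name__ == "__main__":
    sample = ["id-1", "id-2", "id-1", "id-3", "id-2"]
    print(dedupe_referrals(sample))
```

Deduplicating before the files are written to disk, rather than at merge time, would likely keep the intermediate files smaller, but that choice depends on how the post-processor is structured.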

Colin notes:

  • Provided .csv to Alejandro
  • Reprocessing the output JSON to include the dates that are available in URL for the domain hits. Was able to get NYT articles. The only issue is that whole interest output JSON. The method he uses extracts the date from the HTML. He needs to double-check how the library works, but it's possible that we are bringing back dates of events mentioned in the article rather than the publication date.
  • Action item for next meeting: Colin could provide a sample of the dates he gleaned for the NYT so they can be checked against the publication dates of the articles.
  • Try building the stacked area or network diagrams if time permits - will do on Twitter data.
  • Responding with any info requests on the thread that Alejandro starts with Jacqueline - resolved.
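
As a point of comparison for the HTML-based dates, here is a small sketch of reading a date straight from the URL path. It assumes article URLs carry a /YYYY/MM/DD/ segment, which is an assumption about the URL layout and not something confirmed in the meeting; the function name and the example URLs are hypothetical.

```python
import re
from datetime import date

# Assumption: the publication date appears in the path as /YYYY/MM/DD/.
DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})/")


def date_from_url(url):
    """Return the date embedded in the URL path, or None if there is none."""
    match = DATE_IN_PATH.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the pattern but do not form a real calendar date.
        return None
```

A date found this way could be compared with the date extracted from the HTML; a mismatch would flag articles where the library may have picked up an event date instead of the publication date.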

issues with summer domain & twitter crawl and/or postprocessor

  • not picking up hyperlinks--still in domain crawl output JSON? In both the Twitter and domain crawls, hyperlinks are not being picked up, and shortened links are not being unfurled in the Twitter results. See for example: https://twitter.com/michaelroston/status/271310986978394112 which is in line 51. We are not unfurling the link, so it's not showing up as a match. Unfurling seems to fit better in the post-processor.
  • not picking up dates & other metadata (author, article, title). Attempted to sort this out in the meeting. Made issue: https://github.com/UTMediaCAT/mediacat-backend/issues/12
  • John checked the example Alejandro provided. Only links to the NY Times found.
  • seems like domain crawl of nytimes.com/section/world/middleeast & politics
  • difference between csv output for visualizations and csv output as quasi-UI
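
One possible way to unfurl shortened links is sketched below, using only the Python standard library. This is an illustration under assumptions, not the post-processor's method: the function name is hypothetical, and the `opener` parameter exists only so the function can be exercised without network access.

```python
from urllib.request import Request, urlopen


def unfurl(url, opener=urlopen, timeout=10):
    """Follow redirects and return the final URL; return the input on failure."""
    # A HEAD request avoids downloading the page body. Some servers reject
    # HEAD, in which case the original URL is returned unchanged.
    request = Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            # urlopen follows redirects; geturl() reports where it ended up.
            return response.geturl()
    except OSError:
        # URLError and HTTPError are subclasses of OSError.
        return url
```

The unfurled URL, rather than the shortened one, would then be checked against the scope so that links to in-scope domains show up as matches.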

project questions:

  • develop a second python domain crawler for domains where it will work?
  • --> can we add json results to one another?
  • do a crawl of a domain that is relatively small, e.g., al-monitor.com looks to be only 69,000 pages (nytimes.com is about 43 million).
  • Idea for speeding up the full crawl: We could refactor the old Python crawler, run it first, and then use the JavaScript crawler where we did not receive results.
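
On the question of adding JSON results to one another, a rough sketch follows. It assumes each crawler's output can be loaded as a dict keyed by URL, which is a hypothetical layout and not the actual MediaCAT schema; both function names are also hypothetical.

```python
def urls_missing_from(scope_urls, primary):
    """List the URLs the first crawler returned nothing for, in scope order."""
    return [url for url in scope_urls if url not in primary]


def merge_results(primary, fallback):
    """Combine two result dicts keyed by URL; `primary` wins on conflicts."""
    merged = dict(fallback)   # start from the second crawler's records
    merged.update(primary)    # overwrite with the first crawler's records
    return merged


if __name__ == "__main__":
    python_results = {"https://example.com/a": {"title": "A"}}
    todo = urls_missing_from(
        ["https://example.com/a", "https://example.com/b"], python_results
    )
    print(todo)  # URLs to hand to the second crawler
```

Which crawler should win on a conflict is a design choice; here the first (Python) crawler's record is kept, on the assumption that the second crawler is only run to fill gaps.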

Email from Alejandro

Action Items

  • John to finish up work on memory issue and start work on: https://github.com/UTMediaCAT/mediacat-backend/issues/12
  • Colin to build a stacked area graph from the Twitter data using the existing stacked area chart builder
  • Colin to provide sample date information to Alejandro
  • Alejandro (or RA) to review date information for accuracy
  • Colin to create documentation page around SSH for ComputeCanada resources.
  • John to talk to Jacqueline about the role of Metascraper/how to run and in the next meeting we will update the programmatic flow diagram in Padlet to document this.
  • John to ask Raiyan and confirm that non-scope URLs would still wind up in FoundURLs in the JSON so we can see it's a bug in the domain crawler leading to the behaviour we are seeing in the output.
  • John to start a domain crawl at al-monitor.com if he can confirm that discovered URLs are not thrown out (where they are written).
  • Colin to look at Voyage repository sometime in the next couple of weeks probably (not for the next meeting)
  • Alejandro to reach out to Shensong to see if he can come to some earlier MediaCat meetings