October 28, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

woot woot!
we're in business!!!
good work everyone!!!!
example of unexpected research idea: found several around wars on Gaza, and will be able to compare to NYT coverage
glad we have output-interest to gauge how common (e.g., cnn, etc)

Finishing work on size monitoring feature for post-processor and testing
Writing Alejandro about the problem with Graham and getting kicked out
If time permits, beginning work on the troubleshooting of Metascraper in ticket 35

Providing .csv to Alejandro
Reprocessing the output JSON to include the dates that are available in URL for the domain hits
Try building the stacked area or network diagrams if time permits
Responding with any info requests on the thread that Alejandro starts with Jacqueline

develop a second python domain crawler for domains where it will work?
do a crawl of a domain that is relatively small, e.g., al-monitor.com looks to be only 69,000 pages (nytimes.com is about 43 million).

Alejandro finds the Twitter output very interesting for research and shares his thanks.

Was able to get this to work with the entire scope, so the memory issue was solved: data is now written to disk whenever it gets too big. There is still an issue at the merging step, where it ran out of memory because the files it was trying to merge were too big. John looked at them and they did seem abnormally large: there were a lot of duplicates in the referrals, with the same article ID listed many times when it referred to a single article. He cleaned this out manually with a small script, which is how he got the results. He is now documenting and integrating this fix.
Writing Alejandro about the problem with Graham and getting kicked out - Couldn't find any memory info in the log, so a dead end. If it happens again, John will go there right away to see if there's a log message. Alejandro notes that we can follow up again with ComputeCanada if necessary. Will leave for now.
If time permits, beginning work on the troubleshooting of Metascraper in ticket 35 - not in depth yet, but when John reviewed the issues Alejandro sent, he did notice an entry "key" - metascraper title and metascraper author. No date information at all. Once John has resolved his current issues he'll move to ticket https://github.com/UTMediaCAT/mediacat-backend/issues/12 prior to returning to metascraper.
John looked and confirmed that the Twitter files have dates.
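The duplicate-referral cleanup described above could be sketched roughly as follows. This is not John's script: the record layout and the field names `id` and `referrals` are assumptions made for illustration.

```python
def dedupe_referrals(records):
    """Return copies of the records with repeated referral IDs removed.

    Assumes (hypothetically) that each record is a dict holding a list of
    article IDs under the key "referrals".
    """
    cleaned = []
    for record in records:
        # dict.fromkeys keeps the first occurrence of each ID, in order.
        unique_ids = list(dict.fromkeys(record.get("referrals", [])))
        cleaned.append({**record, "referrals": unique_ids})
    return cleaned


# Made-up sample data: "b2" is listed three times for a single article.
records = [
    {"id": "a1", "referrals": ["b2", "b2", "c3", "b2"]},
    {"id": "b2", "referrals": []},
]
cleaned = dedupe_referrals(records)
print(cleaned[0]["referrals"])
```

Because the function returns new dicts rather than editing in place, the original records stay available for comparing file sizes before and after.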

Provided .csv to Alejandro
Reprocessing the output JSON to include the dates that are available in URL for the domain hits - Was able to get NYT articles. The only issue is with the whole interest output JSON. The method he uses extracts the date from the HTML. Need to double-check how the library works, but it's possible that we are bringing back dates of events (other dates mentioned in the article) rather than the date published.
Action item for next meeting: Colin could provide a sample of the dates he gleaned for the NYT, so they can be checked against the publication dates of the articles.
Try building the stacked area or network diagrams if time permits - will do on Twitter data.
Responding with any info requests on the thread that Alejandro starts with Jacqueline - resolved.
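Taking the date from the URL, as discussed above, could look roughly like this. It is a minimal sketch that assumes the date appears as a /YYYY/MM/DD/ path segment, which holds for the NYT URLs quoted in these notes but may not hold for other domains. The function name `date_from_url` is hypothetical.

```python
import re
from datetime import date

# Matches a /YYYY/MM/DD/ path segment (an assumption about the URL layout).
URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})/")


def date_from_url(url):
    """Return the date embedded in the URL path, or None if there is none."""
    match = URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits that look like a date but are not one (e.g. month 13).
        return None


nyt = "https://www.nytimes.com/live/2021/05/13/world/israel-gaza-news/example"
tweet = "https://twitter.com/michaelroston/status/271310986978394112"
print(date_from_url(nyt), date_from_url(tweet))
```

A URL-derived date like this could also serve as a cross-check on the date extracted from the HTML: where the two disagree, the HTML date may be an event date mentioned in the article rather than the publication date.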

not picking up hyperlinks: still in domain crawl output JSON? In both the Twitter and the domain crawl results, hyperlinks are not being picked up, and shortened links are not being unfurled in the Twitter results. See for example: https://twitter.com/michaelroston/status/271310986978394112 which is in line 51. We are not unfurling the link, so it's not showing up as a match. Unfurling seems to fit better into the post-processor.
not picking up dates & other metadata (author, article, title) Attempting to sort this out in the meeting. - Made issue: https://github.com/UTMediaCAT/mediacat-backend/issues/12
John checked the example Alejandro provided. Only links to the NY Times found.
seems like domain crawl of nytimes.com/section/world/middleeast & politics
difference between csv output for visualizations and csv output as quasi-UI
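Unfurling shortened links before matching them against the scope could be sketched as below. This is an illustration under assumptions, not the post-processor's actual code: the function names are hypothetical, and the default resolver sends a HEAD request, which some link shorteners may reject.

```python
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def resolve_final_url(url, timeout=10):
    """Follow redirects and return the final URL (makes a network request)."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.geturl()


def unfurl_and_match(links, scope_domains, resolve=resolve_final_url):
    """Return (original, unfurled) pairs whose unfurled host is in scope."""
    matches = []
    for link in links:
        try:
            final_url = resolve(link)
        except (OSError, ValueError):
            # Could not resolve: fall back to the link as written.
            final_url = link
        host = (urlparse(final_url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in scope_domains):
            matches.append((link, final_url))
    return matches


# Offline usage with a stand-in resolver; short.example is a made-up shortener.
fake_redirects = {"https://short.example/abc": "https://www.nytimes.com/some/article"}
matches = unfurl_and_match(
    ["https://short.example/abc", "https://other.example/page"],
    ["nytimes.com"],
    resolve=lambda url: fake_redirects.get(url, url),
)
print(matches)
```

Passing the resolver in as a parameter keeps the matching logic testable without network access, and would let results be cached if the same short link shows up in many tweets.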

develop a second python domain crawler for domains where it will work?
--> can we add json results to one another?
do a crawl of a domain that is relatively small, e.g., al-monitor.com looks to be only 69,000 pages (nytimes.com is about 43 million).

Idea for speeding up the full crawl: we could refactor the old Python crawler, run it first, and then use the JavaScript crawler wherever we did not receive results.
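The "run one crawler first, then fall back" idea, together with the question of whether JSON results can be added to one another, could be sketched as follows. The record layout (a list of dicts keyed by `url`) and both function names are assumptions for illustration; the real crawler output may be structured differently.

```python
def domains_needing_fallback(scope_domains, results_by_domain):
    """Return the domains for which the first crawler produced no results."""
    return [d for d in scope_domains if not results_by_domain.get(d)]


def merge_crawl_results(primary, fallback):
    """Combine two lists of crawl records by URL; primary wins on conflicts."""
    merged = {record["url"]: record for record in fallback}
    merged.update({record["url"]: record for record in primary})
    return list(merged.values())


# Made-up sample data: the first crawler got nothing for b.example.
first_pass = {"a.example": [{"url": "https://a.example/1", "source": "python"}],
              "b.example": []}
retry = domains_needing_fallback(["a.example", "b.example"], first_pass)

second_pass = [{"url": "https://a.example/1", "source": "js"},
               {"url": "https://b.example/1", "source": "js"}]
combined = merge_crawl_results(first_pass["a.example"], second_pass)
print(retry, [r["source"] for r in combined])
```

Keying on the URL means a page found by both crawlers is stored once, which avoids adding new duplicates when the two outputs are combined.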

There are a lot of repeated examples, though some of these seem to be because new updates are added to basically the same page. For example, for JPost (search "Jerusalem Post"): https://www.nytimes.com/live/2021/05/13/world/israel-gaza-news/warnings-of-civil-war-as-arabs-and-jews-face-off-violently-in-israels-streets & https://www.nytimes.com/live/2021/05/13/world/israel-gaza-news/the-number-and-range-of-hamas-rockets-has-caught-israelis-by-surprise
Likewise, for those two JPost examples, not only is the term "Jerusalem Post" found as a text alias, but there is also an embedded hyperlink to JPost which isn't picked up.
The date range of articles is very limited: only from 2021.
The articles seem to come from a range of different topics that probably were never listed under NYTimes "middle east" or "politics".
There are also some false positives due to my own scope input ("NRG" is the name of stadiums and companies, or "Globes" as in Golden Globes).

John to finish up work on memory issue and start work on: https://github.com/UTMediaCAT/mediacat-backend/issues/12
Colin to build stacked area graph using existing stack area chart builder with twitter data
Colin to provide sample date information to Alejandro
Alejandro (or RA) to review date information for accuracy
Colin to create documentation page around SSH for ComputeCanada resources.
John to talk to Jacqueline about the role of Metascraper/how to run and in the next meeting we will update the programmatic flow diagram in Padlet to document this.
John to ask Raiyan and confirm that non-scope URLs would still wind up in FoundURLs in the JSON so we can see it's a bug in the domain crawler leading to the behaviour we are seeing in the output.
John to start a domain crawl at al-monitor.com if he can confirm that discovered URLs are not thrown out (where they are written).
Colin to look at Voyage repository sometime in the next couple of weeks probably (not for the next meeting)
Alejandro to reach out to Shensong to see if he can come to some earlier MediaCat meetings