November 18, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

John: regarding Raiyan's out-of-scope URLs notes, and what can be done in the case of a single-domain (or limited-domain) crawl, given that the domain crawler, not the postprocessor, populates found_urls
- thoughts on metascraper?
John: results of new twitter crawl with url unfurler?
- also: the new twitter crawl seems to include a lot of info that isn't being propagated to the postprocessed output (like, retweet, etc. counts)
  - e.g., having quote tweets signaled is important
John: al-monitor crawl?
John: pull requests from Colin
Colin: count with stacked area chart?
Colin: a chance to look at python crawler?
Alejandro: KPP/MediaCAT twitter crawl scope ready - when is a good time?
Alejandro: RA working on updating scope, where to add the link? MVP?

memory issue in notes
problem seems to be that the crawler writes to found_url, and in a single-domain crawl nothing gets written there
John will create a utility script to read the html in a single-domain crawl output to populate the found_url based on the complete scope
- then the postprocessor should work without modification
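
A minimal sketch of what such a utility could look like, assuming each crawl output record is a dict with `url` and `html` keys and that the scope is a plain list of domains. Those field names, and the names `LinkCollector`, `in_scope`, and `populate_found_urls`, are hypothetical and not taken from the crawler:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def in_scope(url, scope_domains):
    """True if the URL's host is a scope domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in scope_domains)


def populate_found_urls(record, scope_domains):
    """Return a copy of record with found_urls rebuilt from its HTML.

    Assumes (hypothetically) that record has "url" and "html" keys.
    """
    collector = LinkCollector()
    collector.feed(record.get("html", ""))
    found = []
    for href in collector.hrefs:
        # Resolve relative links against the page's own URL.
        absolute = urljoin(record.get("url", ""), href)
        if in_scope(absolute, scope_domains) and absolute not in found:
            found.append(absolute)
    return {**record, "found_urls": found}
```

Because the sketch only rewrites the found-URLs field, the postprocessor would read the records as before; how the real output files are laid out on disk still needs to be checked.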

doesn't think we can use the one that's in the crawler now -- need to research this
John will research Jacqueline's version of the metascraper and see if there's something better

seems to be working, and will be added to the twitter crawler
shouldn't affect anything else in crawler, self-contained
there's a variable with the option to unfurl -- John will document (add -e to command line)
already ran script to expand short urls on previous NYT twitter crawl
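
For illustration, short-URL expansion can be sketched with the standard library alone: `urllib.request.urlopen` follows redirects, and `geturl()` on the response gives the final address. This is not the twitter crawler's code; `expand_short_url`, `expand_urls`, and the `unfurl` parameter are hypothetical names that only mirror the idea of an on/off option:

```python
import urllib.request


def expand_short_url(url, timeout=10, opener=urllib.request.urlopen):
    """Follow redirects and return the final URL, or the input on failure."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.geturl()
    except (OSError, ValueError):
        # Network errors and malformed URLs: keep the original short URL.
        return url


def expand_urls(urls, unfurl=True, expand=expand_short_url):
    """Expand each URL when unfurl is on; otherwise return them unchanged."""
    if not unfurl:
        return list(urls)
    return [expand(u) for u in urls]
```

Keeping the expansion in its own function is one way to keep it self-contained, so switching it off leaves the rest of a crawl untouched.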

started on Monday: writing to temp directory that is only 7 GB, John is emptying manually
another issue: error when trying to read JSON; hopefully it won't crawl a page twice
can this be checked? John will look into whether it's possible to check for non-duplication based on names
73,000 JSONs so far, should finish soon hopefully
John will update Alejandro on whether it's done or not
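
One possible way to check for duplicates without relying on file names alone is to group the output JSONs by the URL stored inside each file. The sketch below assumes every JSON holds an object with a `url` key; that key and the function name `find_duplicate_pages` are hypothetical. Files that fail to parse are reported rather than stopping the scan:

```python
import json
from collections import defaultdict
from pathlib import Path


def find_duplicate_pages(output_dir, url_key="url"):
    """Return (duplicates, unreadable) for the JSON files in output_dir.

    duplicates maps each URL found in more than one file to those file names;
    unreadable lists the files that could not be parsed as JSON.
    """
    files_by_url = defaultdict(list)
    unreadable = []
    for path in sorted(Path(output_dir).glob("*.json")):
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            unreadable.append(path.name)
            continue
        url = record.get(url_key) if isinstance(record, dict) else None
        if url:
            files_by_url[url].append(path.name)
    duplicates = {u: names for u, names in files_by_url.items() if len(names) > 1}
    return duplicates, unreadable
```

If the file names turn out to be derived from the URL, comparing names may be enough; this version just avoids depending on that.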

John:
- make and run utility for populating found_url before entering into postprocessor
- merge short url expander and document how to use the variable
- update Alejandro on al-monitor crawl, and check to see if duplicates are created or not (question on whether the name is sufficient to check)
- research the metascraper (esp date) issue
Colin:
- produce csvs from newly-postprocessed NYT twitter data
- likewise new stacked area on re-processed NYT twitter data
- if time look at python crawler
Alejandro: update the MVP with scope section

John:
- utility to add relevant urls from scope to "found_url"
  - question: would we want a whole list of urls for each JSON file/node (article) or each row of csv (tweets)? Shouldn't the postprocessor decide which are relevant and which are not?
- document the twitter crawler feature of expanding short urls, and push to master
- update on the al-monitor crawl on Monday
- do some research on metascraper, looking into what Jacqueline had come up with and seeing if there is some other way
Colin:
- generate csv with information from revised NYT twitter data as discussed (i.e., with expanded short urls)
- generate new stacked area chart (esp with rows in orange in file 2021 KPP-MediaCAT Scope Source Sites)
- if time allows: look at python crawler
Alejandro:
- update MVP with new scopes