November 18, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- John: regarding Raiyan's out-of-scope URLs notes, and what can be done in the case of a single-domain (or limited-domain) crawl, given that found_urls is populated by the domain crawler and not by the postprocessor
- thoughts on metascraper?
- John: results of new twitter crawl with url unfurler?
- also: new twitter crawl seems to include a lot of info that isn't being propagated to the postprocessed output (like counts, retweet counts, etc.)
- e.g., having quote tweets signaled is important
- John: al-monitor crawl?
- John: pull requests from Colin
- Colin: count with stacked area chart?
- Colin: a chance to look at python crawler?
- Alejandro: KPP/MediaCAT twitter crawl scope ready - when good time?
- Alejandro: RA working on updating scope, where add link? MVP?
out-of-scope url issue
- memory issue in notes
- problem seems to be that the crawler is what writes to found_url, so in a single-domain crawl nothing gets written there for the other domains in the scope
- John will create a utility script to read the html in a single-domain crawl output to populate the found_url based on the complete scope
- then the postprocessor should work without modification
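A minimal sketch of what such a utility could look like, assuming each crawl record keeps the page's raw HTML and its own URL, and that the scope can be reduced to a list of domains. The names `LinkCollector`, `in_scope_links`, and `scope_domains` are hypothetical and are not taken from the crawler or postprocessor code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def _domain(url):
    # Normalise the host so "www.example.com" and "example.com" match.
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def in_scope_links(html, page_url, scope_domains):
    """Return the in-scope absolute URLs linked from one crawled page."""
    collector = LinkCollector()
    collector.feed(html)
    scope = {d.lower() for d in scope_domains}
    found = []
    for href in collector.hrefs:
        # Relative links are resolved against the page's own URL.
        absolute = urljoin(page_url, href)
        if _domain(absolute) in scope and absolute not in found:
            found.append(absolute)
    return found
```

The returned list could then be written into each record's found_url field before the postprocessor runs. The key names used in the real crawl JSONs would need to be checked first.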
metascraper
- John doesn't think we can use the one that's in the crawler now -- need to research this
- John will research Jacqueline's version of the metascraper and see if there's something better
Short URL expander
- seems to be working, and will be added to the twitter crawler
- shouldn't affect anything else in crawler, self-contained
- there's a variable with the option to unfurl -- John will document (add -e to command line)
- already ran script to expand short urls on previous NYT twitter crawl
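For illustration only, short URL expansion can be sketched as following redirects and keeping the final address. This is not the twitter crawler's actual implementation. The function names `resolve_with_head` and `expand_urls` are hypothetical, and the sketch assumes the shortener answers HEAD requests.

```python
import urllib.request


def resolve_with_head(url, timeout=10):
    """Follow redirects with a HEAD request and return the final URL."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()


def expand_urls(urls, resolve=resolve_with_head):
    """Map each URL to its expanded form, keeping the original on failure."""
    cache = {}
    for url in urls:
        if url in cache:
            # Each distinct short URL is resolved only once.
            continue
        try:
            cache[url] = resolve(url)
        except (OSError, ValueError):
            # Network errors (URLError is an OSError) or malformed URLs:
            # fall back to the short URL rather than dropping the row.
            cache[url] = url
    return cache
```

Passing the resolver as a parameter keeps the lookup logic testable without network access. It also fits running the expansion as a separate pass over an existing crawl, as was done for the previous NYT twitter crawl.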
Al-Monitor Crawl:
- started on Monday; it is writing to a temp directory that is only 7 GB, which John is emptying manually
- another issue: an error when trying to read a JSON; hopefully it won't crawl a page twice
- can this be checked? John will look into whether it's possible to check for non-duplication based on names
- 73,000 JSONs so far, should finish soon hopefully
- John will update Alejandro on whether it's done or not
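One way the duplicate check might be approached is sketched below. It assumes the crawl output is a directory tree of `.json` files and that each record stores its page address under a `url` key. That key name and the function `audit_crawl_output` are assumptions, not confirmed details of the crawler output. Since it is an open question whether file names alone are sufficient, the sketch groups files both by name and by stored URL, and lists files that fail to parse.

```python
import json
from collections import defaultdict
from pathlib import Path


def audit_crawl_output(directory, url_key="url"):
    """Group crawl JSON files by name and by stored URL; list unreadable ones."""
    by_name = defaultdict(list)
    by_url = defaultdict(list)
    unreadable = []
    for path in sorted(Path(directory).rglob("*.json")):
        by_name[path.name].append(str(path))
        try:
            with open(path, encoding="utf-8") as handle:
                record = json.load(handle)
        except (OSError, ValueError):
            # json.JSONDecodeError is a ValueError; record the file and move on.
            unreadable.append(str(path))
            continue
        if isinstance(record, dict) and record.get(url_key):
            by_url[record[url_key]].append(str(path))
    return {
        "duplicate_names": {k: v for k, v in by_name.items() if len(v) > 1},
        "duplicate_urls": {k: v for k, v in by_url.items() if len(v) > 1},
        "unreadable": unreadable,
    }
```

If `duplicate_urls` is non-empty while `duplicate_names` is empty, the file name alone would not be enough to detect a page crawled twice.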
Colin's pull requests
- John approved, Colin will merge
scopes for crawl
- Alejandro will add to MVP creating a section with documentation
- Alejandro will meet with John to go over the KPP/MediaCAT crawl
Action Items:
John:
- make and run utility for populating found_url before entering into postprocessor
- merge short url expander and document on how to use the variable
- update Alejandro on al-monitor crawl, and check to see if duplicates are created or not (question on whether the name is sufficient to check)
- research the metascraper (esp date) issue
Colin:
- produce csvs from newly-postprocessed NYT twitter data
- likewise new stacked area on re-processed NYT twitter data
- if time look at python crawler
Alejandro: update the MVP with scope section
Tasks (should be the same as action items):
John:
- utility to add relevant urls from scope to "found_url"
- question: would we want a whole list of urls for each JSON file/node (article) or each row of csv (tweets)? Shouldn't the postprocessor decide which are relevant and which are not?
- document the twitter crawler feature of expanding short urls, and push to master
- update on the al-monitor crawl on Monday
- do some research on metascraper, looking into what Jacqueline had come up with and seeing if there is some other way
Colin:
- generate csv with information from revised NYT twitter data as discussed (i.e., with expanded short urls)
- generate new stacked area chart (esp with rows in orange in file 2021 KPP-MediaCAT Scope Source Sites)
- if time allows: look at python crawler
Alejandro:
- update MVP with new scopes