November 18, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • John: regarding Raiyan's out-of-scope URLs notes, and what can be done in the case of a single-domain (or limited-domain) crawl, given that the domain crawler (not the postprocessor) is what populates found_urls
    • thoughts on metascraper?
  • John: results of new twitter crawl with url unfurler?
    • also: the new twitter crawl seems to include a lot of info that isn't being propagated to the postprocessed output (like, retweet, etc. counts)
      • e.g., having quote tweets signaled is important
  • John: al-monitor crawl?
  • John: pull requests from Colin
  • Colin: count with stacked area chart?
  • Colin: a chance to look at python crawler?
  • Alejandro: KPP/MediaCAT twitter crawl scope ready - when good time?
  • Alejandro: RA working on updating scope, where add link? MVP?

out-of-scope url issue

  • memory issue in notes
  • the problem seems to be that the crawler writes to found_url, and in a single-domain crawl nothing gets written there
  • John will create a utility script to read the html in a single-domain crawl output to populate the found_url based on the complete scope
    • then the postprocessor should work without modification
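
A minimal sketch of what such a utility could look like, not the script John will write. It assumes each crawl output record is a JSON object holding the page address and its raw HTML; the key names "url", "html_content" and "found_urls" and the function names are assumptions for illustration, as is representing the scope as a set of bare domain names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def in_scope_links(html, page_url, scope_domains):
    """Return absolute links found in `html` whose host is in `scope_domains`."""
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        host = urlparse(absolute).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host in scope_domains and absolute not in found:
            found.append(absolute)
    return found


def populate_found_urls(record, scope_domains):
    """Fill the (assumed) "found_urls" key of one crawl record from its HTML."""
    # Key names here are hypothetical; adjust to the real crawler output.
    html = record.get("html_content", "")
    record["found_urls"] = in_scope_links(html, record["url"], scope_domains)
    return record
```

Running populate_found_urls over every JSON file of a single-domain crawl, with scope_domains built from the complete scope, would leave the records in the shape the postprocessor already expects, under the assumptions above.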

metascraper

  • doesn't think we can use the one that's in the crawler now -- need to research this
  • John will research Jacqueline's version of the metascraper and see if there's something better

Short URL expander

  • seems to be working, and will be added to the twitter crawler
  • shouldn't affect anything else in crawler, self-contained
  • there's a variable with the option to unfurl -- John will document (add -e to command line)
  • already ran script to expand short urls on previous NYT twitter crawl
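
For readers unfamiliar with the idea, expanding a short URL amounts to following its redirect chain to the final address. The sketch below is a generic illustration using only the Python standard library; it is not the expander in the twitter crawler, and the function names, hop limit and timeout are assumptions.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_location(url, timeout=10):
    """Return the Location header if `url` redirects, otherwise None."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        opener.open(request, timeout=timeout)
    except urllib.error.HTTPError as err:
        if err.code in (301, 302, 303, 307, 308):
            return err.headers.get("Location")
        raise
    return None


def expand_short_url(url, resolve=fetch_location, max_hops=10):
    """Follow redirects from `url` until none remain, a loop is hit, or max_hops."""
    seen = {url}
    for _ in range(max_hops):
        location = resolve(url)
        if not location:
            return url
        url = urljoin(url, location)  # Location may be relative
        if url in seen:
            return url  # redirect loop: stop here
        seen.add(url)
    return url
```

The resolve parameter exists so the redirect lookup can be swapped out, for example with a dictionary lookup when testing without network access.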

Al-Monitor Crawl:

  • started on Monday: writing to temp directory that is only 7 GB, John is emptying manually
  • another issue: an error when trying to read a JSON; hopefully the crawler won't crawl a page twice
  • can this be checked? John will look into whether it's possible to check for non-duplication based on names
  • 73,000 JSONs so far, should finish soon hopefully
  • John will update Alejandro on whether it's done or not
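
One possible way to check for duplicates, sketched under assumptions: the crawl output is a directory of .json files, and each file is an object with a "url" key (a hypothetical key name). Comparing on a field inside the file rather than on the file name sidesteps the open question of whether names alone are sufficient, and unreadable files are reported instead of stopping the scan.

```python
import json
from collections import defaultdict
from pathlib import Path


def find_duplicates(output_dir, key="url"):
    """Group crawl output files by `key`; return (duplicates, unreadable files)."""
    by_key = defaultdict(list)
    unreadable = []
    for path in sorted(Path(output_dir).glob("*.json")):
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            # Keep going: a broken file should not hide duplicates elsewhere.
            unreadable.append(path.name)
            continue
        value = record.get(key) if isinstance(record, dict) else None
        if value is not None:
            by_key[value].append(path.name)
    duplicates = {k: names for k, names in by_key.items() if len(names) > 1}
    return duplicates, unreadable
```

An empty duplicates dictionary would suggest no page was crawled twice, at least as far as the chosen key can tell.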

Colin's pull requests

  • John approved, Colin will merge

scopes for crawl

  • Alejandro will add a section with documentation on scopes to the MVP
  • Alejandro will meet with John to go over the KPP/MediaCAT crawl

Action Items:

  • John:

    • make and run utility for populating found_url before entering into postprocessor
    • merge short url expander and document how to use the variable
    • update Alejandro on al-monitor crawl, and check to see if duplicates are created or not (question on whether the name is sufficient to check)
    • research the metascraper (esp date) issue
  • Colin:

    • produce csvs from newly-postprocessed NYT twitter data
    • likewise new stacked area on re-processed NYT twitter data
    • if time look at python crawler
  • Alejandro: update the MVP with scope section

Tasks (should be the same as action items):

  • John:

    • utility to add relevant urls from scope to "found_url"
      • question: would we want a whole list of urls for each JSON file/node (article) or each row of csv (tweets)? Shouldn't the postprocessor decide which are relevant and which are not?
    • document the twitter crawler feature of expanding short urls, and push to master
    • update on the al-monitor crawl on Monday
    • do some research on metascraper, looking into what Jacqueline had come up with and seeing if there is some other way
  • Colin:

    • generate csv with information from revised NYT twitter data as discussed (ie with expanded short url)
    • generate new stacked area chart (esp with rows in orange in file 2021 KPP-MediaCAT Scope Source Sites)
    • if time allows: look at python crawler
  • Alejandro:

    • update MVP with new scopes