April 28, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • postprocessor issue with text alias
  • finalize cleanup, updates, and documentation of the NYT crawl methods
  • assess whether any library updates are needed
  • KPP/MediaCAT postprocessed results
  • postprocess the NYT site crawl; consider why it cut off

Postprocessor issues

  • memory issues with larger datasets
    • the old instance (16 CPUs) could not handle 900,000 records; a larger instance (40 CPUs) was needed
    • we predict there will be a limit to the size of dataset the postprocessor can handle, but we can't know it in advance
    • this is another reason to do smaller crawls
  • text alias issue: a simple error, punctuation was being treated as part of the word (see the sketch after this list)
    • Shengsong will send the re-processed NYT Archive crawl results
  • KPP/MediaCAT Twitter data was processed without a hitch on the larger instance
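
The punctuation fix is small in code terms. Below is a hypothetical sketch (the names `find_alias_matches` and `aliases` are illustrative, not the actual postprocessor code) of stripping punctuation from tokens before comparing them against text aliases, so that "NYT," matches the alias "NYT".

```python
# Hypothetical sketch of the text alias fix: strip punctuation from each
# token before comparing it to the alias list, so "NYT," matches "NYT".
# Names here (find_alias_matches, aliases) are illustrative only.
import string


def find_alias_matches(text, aliases):
    """Return alias occurrences in text, ignoring surrounding punctuation."""
    matches = []
    for raw_token in text.split():
        token = raw_token.strip(string.punctuation)
        if token in aliases:
            matches.append(token)
    return matches


print(find_alias_matches("Reported by the NYT, on Thursday.", {"NYT"}))
# -> ['NYT']
```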

NYT crawl

  • documentation? (GitHub page and domain crawler)
  • why did the crawl stop at 900,000 articles?
    • once postprocessing is done we can see if older articles have invalid links

Updates for libraries

  • updates are done: removed unused dependencies such as metascraper
  • Apify v2.30 update is done
  • base languages
    • the master crawler in Python (timing, stopping script) is on the latest version
    • JS: Node.js is already updated
  • all major updates are done; there could be a few smaller ones in the postprocessor
  • Shengsong will look at these next week

KPP/MediaCAT results

  • results need to be checked
  • Shengsong will put them in groups of 500,000 (see the sketch below)
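
As a rough illustration of the 500,000-row grouping, here is a hedged sketch that assumes the postprocessed results are a CSV; the file names and the use of pandas are assumptions, not necessarily how it will actually be done.

```python
# Hedged sketch: split a large results CSV into files of 500,000 rows each.
# File names and the CSV assumption are illustrative, not project specifics.
import pandas as pd

CHUNK_SIZE = 500_000

reader = pd.read_csv("kpp_mediacat_results.csv", chunksize=CHUNK_SIZE)
for i, chunk in enumerate(reader):
    # Each chunk is an ordinary DataFrame holding up to 500,000 rows.
    chunk.to_csv(f"kpp_mediacat_results_part{i:02d}.csv", index=False)
```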

Small domain crawl

  • 900,000 articles went through the postprocessor and produced 6,000 rows
    • that seems low; Shengsong will check
  • there was an issue: an unhandled error caused the crawl to stop
    • it will be re-started; if the error returns, Shengsong will investigate (see the sketch after this list)
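
To make a repeat of the unhandled error easier to diagnose after the re-start, the crawl entry point could be wrapped so the traceback is logged before the process dies. This is only a sketch; `run_crawl()` is a hypothetical stand-in for the real small-domain-crawl script.

```python
# Hypothetical wrapper: log the full traceback of any unhandled error so a
# repeat failure after the re-start leaves something to troubleshoot.
import logging
import traceback

logging.basicConfig(filename="small_domain_crawl.log", level=logging.INFO)


def run_crawl():
    # Placeholder for the real small-domain-crawl logic.
    raise RuntimeError("placeholder error")


if __name__ == "__main__":
    try:
        run_crawl()
    except Exception:
        logging.error("Unhandled error stopped the crawl:\n%s", traceback.format_exc())
        raise
```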

Action Items

  • send the re-processed NYT Archive crawl results
  • document that there is a limit to the size of dataset the postprocessor can handle, and that we can't know it in advance
  • finalize post-processing of the NYT regular crawl, and check the earliest articles for invalid URLs
  • look at the smaller libraries in the postprocessor to see if they need updating
  • group the KPP/MediaCAT results into batches of 500,000
  • re-start the small domain crawl; if the error returns, Shengsong will troubleshoot
  • check why the postprocessor for the small domain crawl only produced 6,000 relevant hits

Backburner

  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now (a filter sketch follows at the end of this list).
  • how to get multithreading working in the postprocessor
  • what to do with htz.li
  • small domain crawl
  • Benchmarking
  • finish documenting where different data are on our server
  • finding language function
  • image_reference function
  • dealing with embedded versus cited tweets
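
For the two backburner output additions (non-scope .co.il links and tweet/handle links), the checks themselves are small. Here is a hedged sketch; the helper names and the twitter.com-only pattern are assumptions for illustration, not the postprocessor's actual schema.

```python
# Hedged sketch of the two backburner checks: domains ending in .co.il and
# links to a tweet or Twitter handle. Helper names and the twitter.com-only
# regex are assumptions for illustration.
import re
from urllib.parse import urlparse

TWEET_OR_HANDLE = re.compile(r"^https?://(www\.)?twitter\.com/[^/]+(/status/\d+)?/?$")


def is_co_il_link(url):
    """True if the link's domain ends in .co.il."""
    return urlparse(url).netloc.lower().endswith(".co.il")


def is_twitter_link(url):
    """True if the link points at a Twitter handle or an individual tweet."""
    return bool(TWEET_OR_HANDLE.match(url))


for link in [
    "https://www.haaretz.co.il/news",
    "https://twitter.com/example/status/123456789",
    "https://twitter.com/example",
    "https://www.example.com/article",
]:
    print(link, is_co_il_link(link), is_twitter_link(link))
```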