May 5, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • send re-processed NYT Archive crawl results
  • document that the postprocessor has a limit on the size of dataset it can process, and that we can't know in advance what that limit is
  • finalize post-processing of NYT regular crawl, and consider the earliest articles, looking for invalid URLs
  • look at smaller libraries in postprocessor to see if they need updating
  • group KPP/MediaCAT results in groups of 500,000
  • re-start small domain crawl: if error returns, Shengsong will trouble-shoot
  • check to see why postprocessor of small domain crawl only produced 6000 relevant hits
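
For the agenda item on grouping KPP/MediaCAT results in groups of 500,000, a minimal sketch of one way to do the split is given here. It assumes the results can be read as an iterable of rows; the helper name `batched` and the row source are hypothetical and not part of the existing postprocessor.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], size: int = 500_000) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from `rows`.

    Only one group is held in memory at a time, so the input can be
    a generator that streams rows from disk.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return  # input exhausted
        yield batch


# Small demonstration with a group size of 2 instead of 500,000.
group_sizes = [len(group) for group in batched(range(5), size=2)]
```

Each group could then be written to its own output file, so no single file exceeds 500,000 rows.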

NYT Archive

  • sent, Alejandro working through them

Regular NYT crawl

  • only got 60 rows when running the postprocessor on the whole crawl; very few of the 900,000 are from the Middle East section
  • 90% of the URLs are useless
  • until we figure out what is limiting the crawl, we may set aside this kind of NYT crawl for now
  • try cnn.com instead

Postprocessing limit issue

  • one possibility is to try something other than plain Python, like https://dask.org/

Small Domain Crawl

KPP/MediaCAT Twitter results

  • problem with recursive citations
  • mostly solved recursive citations

Library update

  • complete

Action Items

  • look at the 403 errors: verify that the problem is not caused by the way we are crawling
  • re-send KPP data with tags
  • postprocess the small domain crawl - without the domains that didn't work
  • look at using dask for postprocessing
  • start crawl of cnn.com

Backburner

  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now.
  • how to get multithreading with postprocessor
  • what to do with htz.li
  • small domain crawl
  • Benchmarking
  • finish documenting where different data are on our server
  • finding language function
  • image_reference function
  • dealing with embedded versus cited tweets
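
For the backburner item on adding extra links to the regular postprocessor output, the selection rule (non-scope `.co.il` hyperlinks, plus links to tweets or Twitter handles) might be sketched as below. This is only an illustration under stated assumptions: the function name `is_extra_link`, the `scope_domains` argument, and the list of Twitter hostnames are hypothetical and do not come from the postprocessor code.

```python
from urllib.parse import urlparse

# Assumed set of hostnames that serve tweets and handle pages.
# Shortened links (for example t.co) are not resolved here.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}


def is_extra_link(url, scope_domains):
    """Return True if `url` should be added to the output as an extra link.

    `scope_domains` is a collection of bare in-scope domains,
    such as {"inscope.co.il"} (a made-up example).
    """
    host = (urlparse(url).hostname or "").lower()
    if not host:
        # URLs without a scheme have no parsed hostname; skip them.
        return False
    if host in TWITTER_HOSTS:
        # Covers both tweet URLs and handle pages.
        return True
    in_scope = any(host == d or host.endswith("." + d) for d in scope_domains)
    return host.endswith(".co.il") and not in_scope
```

Matching on the parsed hostname rather than the raw string avoids false hits where `.co.il` or `twitter.com` only appears in a path or query string.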