August 2, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • work on headless browser URL expander
  • Twitter embed issue
  • code cleanup on D3 vector diagram
  • Israeli news site crawl

Israeli news site crawl

  • look at scope together
  • what about news site URLs that no longer exist?
  • what about preceding website or backslash website URLs?
  • save these questions for September

Headless Browser URL Expander

  • testing new headless browser -- should be done this week
  • it will slow down URL expansion a lot, but that shouldn't be an issue as long as it runs at the same speed as the domain crawler

Twitter embed issue

  • not yet

Visualization environment

  • still working on the code cleanup

Crawls & Postprocessing

  • re-do the KPP postprocessing -- not yet
    • need to change the storage distribution: only 1.1 TB on the large instance
  • restart WaPo/Foxnews Twitter crawl -- paused
  • restart the postprocessing of NYT politics archive -- running, should be done by the end of this week
  • Guardian: paused, at 1.9 million
  • small domain crawl -- still running, small instance, 1.7 million -- this is probably the best speed we can get
    • doesn't need to be re-started, should be fine running by itself
  • no point in starting new crawl right now

Action Items:

  • talk about the conference presentation at the last meeting; release: hope to meet with Kirsta and Nat before Labor Day
  • finish testing of new URL expander (headless browser)
  • finish code cleanup for the Visualization environment (documentation is done); record a session at the last meeting
  • after NYT politics archive postprocessing is done, next is KPP postprocessing

September Action

  • IMPORTANT FOR Sept: change storage distribution
  • Twitter embed issue
  • Re-start Guardian, WaPo/Foxnews
  • Begin new crawls: Israeli/Palestinian

Backburner

  • Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
  • using crawler proxies
  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or Twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now.
  • what to do with htz.li
  • finding language function
  • image_reference function
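
For when the postprocessor-output item above comes off the backburner, here is a rough sketch of how the two proposed rules (non-scope hyperlinks ending in .co.il, and links to a tweet or Twitter handle) might be checked. It is not the existing postprocessor code: `SCOPE_DOMAINS`, `classify_link`, the label strings, and the example domains are all hypothetical, and it assumes links are absolute URLs.

```python
from urllib.parse import urlparse

# Hypothetical scope list; the real crawl scope is not given in these notes.
SCOPE_DOMAINS = {"example.co.il", "example.com"}

# Assumed set of hostnames that count as Twitter links.
TWITTER_HOSTS = {"twitter.com", "mobile.twitter.com"}


def _host(url):
    """Return the lowercased hostname without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def in_scope(host):
    """True if host is a scope domain or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in SCOPE_DOMAINS)


def classify_link(url):
    """Label a hyperlink for the extra output, or return None to skip it."""
    host = _host(url)
    if not host:
        return None  # relative or malformed link
    if host in TWITTER_HOSTS:
        parts = [p for p in urlparse(url).path.split("/") if p]
        if len(parts) >= 3 and parts[1] == "status":
            return "tweet"  # e.g. /<handle>/status/<id>
        if len(parts) == 1:
            # Note: reserved paths such as /home would also land here.
            return "twitter_handle"
        return None
    if host.endswith(".co.il") and not in_scope(host):
        return "out_of_scope_co_il"
    return None
```

Returning a label rather than a boolean would let the two rules share one pass over each article's hyperlinks while still being reported separately.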