August 9, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda:

  • talk about conference presentation on last meeting; release: hope to meet with Kirsta and Nat before Labor Day
  • finish testing of new URL expander (headless browser)
  • finish code cleanup for Visualization environment (documentation is done); last meeting: record a session
  • after NYT politics archive postprocessing is done, next is KPP postprocessing

Conference

Release

  • make meeting week of Aug 29th?

New URL expander

  • basically done; it is based on the domain crawler, so should be good - the only remaining test is to see whether it will be flagged
  • however, a single GET request isn't that unusual; the concern is sending too many
  • is there any way to get the unshortened URL from the Twitter API? - not a lot of documentation on this
  • create an issue with Twitter API to see what they say
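
On the Twitter API question: in API v2, tweet objects requested with `tweet.fields=entities` carry an `entities.urls` list whose items include `url` (the t.co link), `expanded_url` and, on some tweets, `unwound_url`. The sketch below only shows how those fields could be read from a payload that has already been fetched. The function name `unshortened_urls` and the sample payload are hypothetical, and whether these fields fully resolve third-party shorteners would still need to be tested against real tweets.

```python
from typing import Any, Dict, List


def unshortened_urls(tweet: Dict[str, Any]) -> List[str]:
    """Return the best available long URL for each link in a v2 tweet object.

    Preference order: unwound_url, then expanded_url, then the t.co url.
    """
    results = []
    for item in tweet.get("entities", {}).get("urls", []):
        best = item.get("unwound_url") or item.get("expanded_url") or item.get("url")
        if best:
            results.append(best)
    return results


# Hypothetical payload shaped like a v2 tweet object (not real data).
sample_tweet = {
    "id": "0",
    "text": "example https://t.co/abc https://t.co/def",
    "entities": {
        "urls": [
            {"url": "https://t.co/abc", "expanded_url": "https://example.com/a"},
            {
                "url": "https://t.co/def",
                "expanded_url": "https://short.example/x",
                "unwound_url": "https://example.org/full-article",
            },
        ]
    },
}

print(unshortened_urls(sample_tweet))
```

If these fields turn out to be sufficient, the headless-browser expander would only be needed for links where neither field is populated.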

Storage issue:

  • possible to rsync to the smaller instance to free up storage
  • mount the larger storage to the larger instance
  • storage volumes can be attached or detached, so it's possible to connect both server instances to the larger storage
  • if this works, re-start Guardian crawl and WaPo/Foxnews Twitter crawl
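
The rsync step above could be sketched as follows. This is only an illustration: the paths and host name are hypothetical, and the defaults are deliberately cautious (dry run on, source files kept). The flags used (`--archive`, `--partial`, `--verbose`, `--dry-run`, `--remove-source-files`) are standard rsync options.

```python
import subprocess
from typing import List


def build_rsync_command(source_dir: str, destination: str,
                        dry_run: bool = True,
                        remove_source: bool = False) -> List[str]:
    """Build an rsync argument list for copying a directory tree."""
    command = ["rsync", "--archive", "--partial", "--verbose"]
    if dry_run:
        # Report what would be transferred without copying anything.
        command.append("--dry-run")
    if remove_source:
        # Delete each source file once it has been transferred (frees space).
        command.append("--remove-source-files")
    # Trailing slash: copy the contents of source_dir, not the directory itself.
    command.extend([source_dir.rstrip("/") + "/", destination])
    return command


def run_rsync(command: List[str]) -> int:
    """Run the command and return rsync's exit code (0 means success)."""
    return subprocess.run(command, check=False).returncode


# Hypothetical source path and destination host.
print(" ".join(build_rsync_command("/data/crawl_output",
                                   "user@small-instance:/data/crawl_output")))
```

Running with the dry-run default first and only then passing `dry_run=False` gives a chance to check the file list before anything is moved.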

Visualization environment

Crawl

  • small domain crawl: 1.8 million

Action Items

  • attach large instance to large storage and, if this works, re-start Guardian crawl and WaPo/Foxnews Twitter crawl
  • once a week in August: check that crawls are functioning
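
One way the weekly check could be automated is to look at how recently each crawl wrote output. The sketch below assumes each crawl writes files into its own directory while it is running; that assumption, the directory names, and the function names are all hypothetical.

```python
import os
import time
from typing import Dict, List, Optional


def newest_mtime(directory: str) -> Optional[float]:
    """Return the most recent modification time of any file under directory."""
    newest = None
    for root, _dirs, files in os.walk(directory):
        for name in files:
            try:
                mtime = os.path.getmtime(os.path.join(root, name))
            except OSError:
                continue  # file vanished between listing and stat
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def stalled_crawls(crawl_dirs: Dict[str, str], max_age_hours: float = 24.0,
                   now: Optional[float] = None) -> List[str]:
    """Return names of crawls with no output newer than max_age_hours."""
    now = time.time() if now is None else now
    stalled = []
    for name, directory in crawl_dirs.items():
        newest = newest_mtime(directory)
        # A missing or empty directory also counts as stalled.
        if newest is None or now - newest > max_age_hours * 3600:
            stalled.append(name)
    return stalled


# Hypothetical crawl names and output directories.
crawls = {"guardian": "/data/guardian_output", "wapo_twitter": "/data/wapo_twitter_output"}
print(stalled_crawls(crawls, max_age_hours=24))
```

A recent file only shows that the crawler is still writing; it does not confirm that the output is correct, so an occasional manual spot check would still be useful.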

September Action

  • IMPORTANT FOR Sept: change storage distribution
  • Twitter embed issue
  • Re-start Guardian, WaPo/Foxnews
  • Begin new crawls, Israeli Palestinian
    • look at scope together
    • what about news site URLs that no longer exist?
    • what about URLs with a prefix preceding the website name, or with a path following a slash after it?

Backburner

  • Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
  • using crawler proxies
  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now.
  • what to do with htz.li
  • finding language function
  • image_reference function
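
The two proposed additions to the postprocessor output (out-of-scope `.co.il` hyperlinks, and links to tweets or Twitter handles) could be detected with a small classifier like the one below. This is a sketch under stated assumptions, not the postprocessor's actual code: the function name, the label strings, the list of Twitter hosts, and the reserved-path list are hypothetical and not exhaustive.

```python
from typing import Iterable, Optional
from urllib.parse import urlsplit

# Assumed host names for Twitter links; extend as needed.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
# Single-segment Twitter paths that are not handles (not exhaustive).
RESERVED_PATHS = {"home", "search", "explore", "share", "i"}


def classify_extra_link(url: str, scope_domains: Iterable[str]) -> Optional[str]:
    """Label a hyperlink as 'tweet', 'twitter_handle', 'co_il_out_of_scope', or None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not host:
        return None  # relative or malformed link

    if host in TWITTER_HOSTS:
        segments = [s for s in parts.path.split("/") if s]
        if len(segments) >= 3 and segments[1] == "status":
            return "tweet"
        if len(segments) == 1 and segments[0].lower() not in RESERVED_PATHS:
            return "twitter_handle"
        return None

    scope = {d.lower() for d in scope_domains}
    in_scope = any(host == d or host.endswith("." + d) for d in scope)
    if host.endswith(".co.il") and not in_scope:
        return "co_il_out_of_scope"
    return None


# Hypothetical scope and links.
scope = {"inscope.co.il"}
for link in [
    "https://twitter.com/somehandle/status/123",
    "https://twitter.com/somehandle",
    "https://www.example.co.il/page",
    "https://www.inscope.co.il/news",
]:
    print(link, classify_extra_link(link, scope))
```

Keeping the check as one pure function would make it easy to leave switched off in the regular postprocessor until this item comes off the backburner.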