Oct 5, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • following up on Graham cloud issues - Gy
  • keeping an eye on Arbutus crawls - Gy
  • developing the new crawl strategy with IA as above - Ra
  • working on resolving the pattern-matching issues in the postprocessor (maybe meet with Nat) - Ar/Fr

Crawler/server

  • previous crawls running
  • i24 might be done or blocked

Internet Archive strategy

  • started with Python but couldn't find anything to parse the HTML
  • switched to JS, using metascraper to get metadata
    • started with a small sample from Mondoweiss
    • metascraper does make a call, but to the Internet Archive version
  • running into a problem with the parseHelper import
    • might be a dependency issue, different on server?

Postprocessor

  • problem in the function that finds Twitter citations
    • the for-loop terminates early, on the first iteration, by raising an exception
    • removing the checks would lose data
    • the issue isn't with pattern matching, so a fuzzy-matching library is not relevant for now
  • meeting with Nat today to think about using a pandas Series
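
One possible way to keep the checks without letting a single bad record end the loop is to catch the exception per iteration and record it. The following is only a sketch under stated assumptions, not the postprocessor's actual code: the function name, the input shape (a list of dicts with "id" and "text" keys), and the tweet URL pattern are all hypothetical.

```python
import re

# Hypothetical pattern: a tweet URL such as https://twitter.com/<user>/status/<id>.
TWEET_URL = re.compile(r"https?://(?:www\.)?twitter\.com/(\w+)/status/(\d+)")


def find_twitter_citations(articles):
    """Return (citations, errors) for a list of article dicts.

    Assumption: each article is a dict with "id" and "text" keys.
    """
    citations, errors = [], []
    for article in articles:
        try:
            text = article["text"]
            # The check stays in place; it just no longer aborts the whole loop.
            if not isinstance(text, str):
                raise TypeError(f"text is {type(text).__name__}, expected str")
            for user, tweet_id in TWEET_URL.findall(text):
                citations.append((article.get("id"), user, tweet_id))
        except (KeyError, TypeError) as exc:
            # Record the failure and continue with the next article.
            errors.append((article.get("id"), repr(exc)))
    return citations, errors
```

Returning the errors alongside the citations means the records that failed a check can be inspected later instead of being silently dropped.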

Action items

  • follow up on Graham cloud - Gy
  • separate email: ask about switching IP address for server - Gy
  • email Nat to check that making a call to IA with metascraper won't lead to being blocked - Ra
  • look at current crawler output to match IA output so that either will work as input for postprocessor - Ra
  • continuing to develop the IA strategy - Ra
  • give sample of postprocessor input to Raazia and prepare for next meeting - Fr
  • pattern-matching and meeting with Nat for the postprocessor - Ar/Fr
  • remove vulnerable files/libraries from archived postprocessor - Fr
  • next meeting: check if Graham server libraries can be updated - Al