Oct 5, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- following up on Graham cloud issues - Gy
- keeping an eye on Arbutus crawls - Gy
- developing the new crawl strategy with IA as above - Ra
- working on resolving the pattern-matching issues in the postprocessor (maybe meet with Nat) - Ar/Fr
Crawler/server
- previous crawls running
- i24 might be done or blocked
Internet Archive strategy
- started with Python but couldn't find anything to parse the HTML
- switched to JS, using metascraper to get the metadata
- started with a small sample from Mondoweiss
- metascraper does make a call, but to the Internet Archive version of the page
- running into a problem with the parseHelper import
- might be a dependency issue, different on server?
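For reference, a minimal sketch of what metadata extraction could look like on the Python side, using only the standard library's `html.parser`. This is not the metascraper approach chosen above and not project code: the names `MetaTagParser`, `extract_metadata` and `SAMPLE_HTML` are made up for illustration, and the sample page is invented rather than taken from a real Internet Archive snapshot.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta> name/property -> content pairs and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph tags use "property"; classic tags use "name".
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content is not None:
                # Keep the first value seen for each key.
                self.meta.setdefault(key, content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return the page title plus every meta tag found in the HTML string."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), **parser.meta}


# Invented sample page (assumption: real archived pages carry similar tags).
SAMPLE_HTML = """<html><head>
<title>Example article</title>
<meta property="og:title" content="Example article">
<meta name="author" content="A. Writer">
</head><body><p>Body text</p></body></html>"""

metadata = extract_metadata(SAMPLE_HTML)
```

This makes no network call, so it would not by itself raise the blocking concern; the HTML string would have to come from a separately fetched snapshot.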
Postprocessor
- problem in the function that finds Twitter citations
- the for-loop terminates early: an exception is raised on the first iteration
- removing the checks would lose data
- the issue isn't with pattern matching, so the fuzzy-matching library isn't relevant for now
- meeting with Nat today to think about using a pandas Series
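One possible way to stop a single bad record from ending the loop while keeping the checks is to catch the exception per iteration and log the failing record. The sketch below is an assumption-laden illustration, not the postprocessor's actual function: `find_handle_citations`, `HANDLE_RE`, and the `id`/`text` record fields are all hypothetical, and plain dicts stand in for whatever structure (for example a pandas Series) is eventually used.

```python
import re

# Hypothetical pattern: an @handle of up to 15 word characters that is not
# preceded by another word character (so email addresses are skipped).
HANDLE_RE = re.compile(r"(?<!\w)@(\w{1,15})")


def find_handle_citations(records):
    """Scan each record for @handles; collect failures instead of stopping."""
    results, failures = [], []
    for index, record in enumerate(records):
        try:
            text = record["text"]  # the check stays: a missing field raises KeyError
            if not isinstance(text, str):
                raise TypeError(f"expected str, got {type(text).__name__}")
            handles = HANDLE_RE.findall(text)
            results.append({"id": record.get("id"), "handles": handles})
        except (KeyError, TypeError) as exc:
            # Log the bad record and move on to the next iteration.
            failures.append((index, repr(exc)))
    return results, failures


# Invented sample input (assumption: real postprocessor input differs).
sample_records = [
    {"id": 1, "text": None},
    {"id": 2, "text": "via @example_user and @other, contact a@b.com"},
    {"id": 3},
]
results, failures = find_handle_citations(sample_records)
```

Keeping the failures in a separate list means the checks still run and the skipped records remain visible for later inspection.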
Action items
- follow up on Graham cloud - Gy
- separate email: ask about switching IP address for server - Gy
- email Nat to check that making a call to IA with metascraper won't lead to being blocked - Ra
- look at current crawler output to match IA output so that either will work as input for postprocessor - Ra
- continuing to develop the IA strategy - Ra
- give sample of postprocessor input to Raazia and prepare for next meeting - Fr
- pattern-matching and meeting with Nat for the postprocessor - Ar/Fr
- remove vulnerable files/libraries from archived postprocessor - Fr
- next meeting: check if Graham server libraries can be updated - Al