Sep 28, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda:

  • update on server
  • update on crawler strategies
  • postprocessor
    • reading on dask dataframes

Server

  • Graham cloud is down
  • arbutus:
    • 24i news, toi, inn, Jpost still running
    • small domain crawls still paused
  • proxy crawling: too hard and expensive

new crawl strategy: internet archive

  • plan to learn waybackpy, the Python library that wraps the Internet Archive (IA) APIs: https://pypi.org/project/waybackpy/
  • different workflow:
    1. get HTML from IA
    2. use the crawler's processor to extract metadata (including the HTML version of the article, and hyperlinks) into JSON output
  • on point 2, need to check whether there's a better library to extract metadata, especially (in descending order of importance):
    1. date + time stamp
    2. author
    3. title
  • in cases where an article has been crawled more than once, we want the most recently crawled version
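
The "last crawled version" rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Snapshot` record and both function names are hypothetical, and the snapshot list would in practice come from an IA lookup (for example through waybackpy), which is not shown here. It relies on Wayback Machine timestamps being 14-digit `YYYYMMDDhhmmss` strings and on the `id_` URL modifier returning the page as archived, without the Wayback toolbar.

```python
from typing import Dict, Iterable, NamedTuple


class Snapshot(NamedTuple):
    """One archived capture of a page (hypothetical record shape)."""
    original_url: str
    timestamp: str  # Wayback-style 14-digit UTC timestamp: YYYYMMDDhhmmss


def latest_snapshots(snapshots: Iterable[Snapshot]) -> Dict[str, Snapshot]:
    """Keep only the most recent snapshot for each original URL."""
    latest: Dict[str, Snapshot] = {}
    for snap in snapshots:
        current = latest.get(snap.original_url)
        # Fixed-width digit strings sort chronologically as plain strings.
        if current is None or snap.timestamp > current.timestamp:
            latest[snap.original_url] = snap
    return latest


def raw_html_url(snap: Snapshot) -> str:
    """Build the Wayback Machine URL for the archived HTML of a snapshot."""
    # "id_" after the timestamp requests the page without the Wayback toolbar.
    return f"https://web.archive.org/web/{snap.timestamp}id_/{snap.original_url}"


snaps = [
    Snapshot("https://example.com/a", "20230101000000"),
    Snapshot("https://example.com/a", "20230915120000"),
    Snapshot("https://example.com/b", "20220505050505"),
]
for snap in latest_snapshots(snaps).values():
    print(raw_html_url(snap))
```

Deduplicating before fetching keeps the number of requests to IA down, since only one HTML download per article is needed.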

from morning meeting:

  • Different approach to crawl
    • Expects WARC
    • Downloading IA content as WARC needs extra tooling
    • Crawlbase: free trial, stops after 1 result
  • IA
  • Different: need to make a call to get author info
  • More likely to crawl top story because it comes from many other sources
  • Wrappers that could help:
    • saveAPI
    • CDXserverAPI
  • Workflow: study the current code, then use a library (read***) to get the textual representation
    • Can we use our crawler code to get text output?
    • In what form is the metadata expected: author/date/title?
    • And how can it be extracted without making another call?
  • Are there libraries that can use the html code?
  • Develop and test locally, and then with larger scope, work on server
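
One way to approach the question of extracting author/date/title from HTML that has already been downloaded, without making another call, is to read the page's `<meta>` tags. The sketch below uses only Python's standard `html.parser`. The tag names in `META_KEYS` are an assumption about what news sites commonly emit, not something confirmed for the outlets we crawl, and `MetadataParser` and `extract_metadata` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Assumed mapping from common <meta> keys to the fields we care about.
META_KEYS = {
    "article:published_time": "date",
    "author": "author",
    "og:title": "title",
}


class MetadataParser(HTMLParser):
    """Collect date, author and title from <meta> tags, plus the <title> text."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: Dict[str, Optional[str]] = {"date": None, "author": None, "title": None}
        self.title_text: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            key = (attr.get("property") or attr.get("name") or "").lower()
            field = META_KEYS.get(key)
            content = attr.get("content")
            # Keep the first value found for each field.
            if field and content and self.fields[field] is None:
                self.fields[field] = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_text.append(data)


def extract_metadata(html: str) -> Dict[str, Optional[str]]:
    """Return date, author and title found in an HTML string (None if missing)."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    fields = dict(parser.fields)
    if fields["title"] is None:
        # Fall back to the <title> element when no og:title meta tag exists.
        fields["title"] = "".join(parser.title_text).strip() or None
    return fields
```

Because this works on a string, it can be developed and tested locally against saved pages before being pointed at IA snapshots on the server. A dedicated extraction library may handle more site-specific cases than this does.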

postprocessor

  • didn't get a meeting with Nat
  • Aryan: reviewing code, pandas and data-cleaning
    • currently using pattern-matching; the issues might be resolved with a more capable library, e.g. a fuzzy-matching library
    • research libraries to improve on pattern-matching
  • dataframe scaling problem is solved; the remaining work is figuring out pattern-matching
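
As a starting point for the fuzzy-matching research, the idea can be tried with the standard library's `difflib` before committing to a third-party package. This is a sketch only: the function names, the 0.8 threshold and the example strings are assumptions, not taken from the postprocessor code.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def best_fuzzy_match(query: str, candidates: Iterable[str],
                     threshold: float = 0.8) -> Optional[str]:
    """Return the candidate most similar to query, or None if none is close enough."""
    best, best_score = None, 0.0
    normalized_query = normalize(query)
    for candidate in candidates:
        score = SequenceMatcher(None, normalized_query, normalize(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    # The threshold (an assumed value) guards against weak, accidental matches.
    return best if best_score >= threshold else None


names = ["The Times of Israel", "Jerusalem Post", "Israel National News"]
print(best_fuzzy_match("Jeruslem Post", names))
```

Exact pattern-matching would miss the misspelled query above, while a similarity score tolerates it. Comparing every query against every candidate this way is slow on large dataframes, so a dedicated fuzzy library may still be the better fit at scale.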

Action Items:

  • following up with Graham cloud issues - Gy
  • keeping an eye on Arbutus crawls - Gy
  • developing the new crawl strategy with IA as above - Ra
  • working on resolving the pattern-matching issues in the postprocessor (maybe meet with Nat) - Ar/Fr