Sep 28, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda:

Graham cloud down
Arbutus:
- 24i news, toi, inn, Jpost still running
- small domain crawls still paused
proxy crawling: too hard and expensive

Plan to learn waybackpy, the Python library that wraps the Internet Archive (IA) API: https://pypi.org/project/waybackpy/
Different workflow:
1. get HTML from IA
2. use the crawler's processor to extract metadata (including the HTML version of the article, and hyperlinks) into JSON output
On point 2, need to check whether there's a better library to extract metadata, especially (in descending order of importance):
1. date + time stamp
2. author
3. title
In cases where an article has been crawled more than once, we want the most recently crawled version
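
A minimal sketch of the "last crawled version" rule, assuming capture listings come from the IA CDX server in its JSON form (first row is a header that includes `timestamp` and `original`). The helper names `latest_snapshots` and `snapshot_url` are hypothetical, not part of waybackpy or our crawler:

```python
from typing import Dict, List


def latest_snapshots(rows: List[List[str]]) -> Dict[str, str]:
    """Map each original URL to its most recent capture timestamp."""
    if not rows:
        return {}
    header, *records = rows
    ts_idx = header.index("timestamp")
    url_idx = header.index("original")
    latest: Dict[str, str] = {}
    for record in records:
        url, ts = record[url_idx], record[ts_idx]
        # Timestamps are fixed-width YYYYMMDDhhmmss, so string order is time order.
        if ts > latest.get(url, ""):
            latest[url] = ts
    return latest


def snapshot_url(original: str, timestamp: str, raw: bool = True) -> str:
    """Build a Wayback Machine URL for one capture.

    With raw=True the "id_" modifier asks for the archived HTML
    without the Wayback toolbar injected into the page.
    """
    modifier = "id_" if raw else ""
    return f"https://web.archive.org/web/{timestamp}{modifier}/{original}"
```

The resulting URLs could then be fetched and passed to the existing processor (step 2).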

Different approach to crawl
- Expects WARC
- Downloading IA content as WARC requires additional tooling
- Crawlbase – free trial – stops after 1 result
IA
- Take small sites:
- Download all the URLs: https://exposureninja.com/blog/extract-urls-archive-org/
- Ignore all media; filter for HTML
- Bring down 100 examples
- Then inspect the output
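
The steps above could be sketched as a single CDX server query, assuming the public endpoint at `web.archive.org/cdx/search/cdx`. The function names and the choice of `matchType=domain` are assumptions for illustration, not something decided in the meeting:

```python
import json
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain: str, limit: int = 100) -> str:
    """Build a CDX query listing up to `limit` archived HTML pages of a domain."""
    params = [
        ("url", domain),
        ("matchType", "domain"),           # include subdomains
        ("output", "json"),
        ("filter", "mimetype:text/html"),  # ignore images, video, etc.
        ("filter", "statuscode:200"),
        ("collapse", "urlkey"),            # one row per distinct URL
        ("limit", str(limit)),
    ]
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def fetch_rows(query_url: str) -> List[List[str]]:
    """Fetch the listing; the first row of the JSON output is the header."""
    with urlopen(query_url, timeout=60) as response:
        return json.load(response)
```

Note that `collapse=urlkey` keeps the first capture of each URL, so it suits listing URLs; picking the latest capture would need a second pass without it.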
Different: need to make a call to get author info
More likely to crawl top story because it comes from many other sources
Wrappers that could help:
- saveAPI
- CDXserverAPI
Workflow: study the current code, then use a library (read***) to get the textual representation
- Can we use our crawler code to get text output?
- How does it expect metadata: author/date/title?
- And how do we extract it without making another call?
Are there libraries that can work from the HTML code?
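
One possible starting point, using only the standard library: a rough sketch that reads date, author, and title from `<meta>` tags in HTML we already have, so no extra call is made. The key lists below are assumptions (common Open Graph / meta names); real sites vary and a dedicated extraction library may do better:

```python
from html.parser import HTMLParser
from typing import Dict, Optional, Sequence

# Assumed priority lists, for illustration only.
DATE_KEYS = ("article:published_time", "date", "pubdate")
AUTHOR_KEYS = ("author", "article:author")
TITLE_KEYS = ("og:title", "twitter:title")


class MetaExtractor(HTMLParser):
    """Collect <meta> name/property pairs and the <title> text."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: Dict[str, str] = {}
        self.title_text = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            key = attr.get("property") or attr.get("name")
            content = attr.get("content")
            if key and content:
                # Keep the first value seen for each key.
                self.meta.setdefault(key.lower(), content.strip())
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_text += data


def first_match(meta: Dict[str, str], keys: Sequence[str]) -> Optional[str]:
    """Return the value of the first key present in meta, else None."""
    for key in keys:
        if key in meta:
            return meta[key]
    return None


def extract_metadata(html: str) -> Dict[str, Optional[str]]:
    parser = MetaExtractor()
    parser.feed(html)
    title = first_match(parser.meta, TITLE_KEYS) or parser.title_text.strip() or None
    return {
        "date": first_match(parser.meta, DATE_KEYS),
        "author": first_match(parser.meta, AUTHOR_KEYS),
        "title": title,
    }
```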
Develop and test locally, and then with larger scope, work on server

didn't get a meeting with Nat
Aryan: reviewing code, pandas, and data cleaning
- currently using pattern-matching; issues might be resolved with more complex libraries, e.g. a fuzzy-matching library
- research libraries to improve on pattern-matching
Dataframe scaling problem is done; now just need to figure out pattern-matching
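
As a baseline before adopting a dedicated fuzzy library, a small sketch with the standard library's `difflib`. What exactly the postprocessor matches is not recorded here, so the example strings, the `best_fuzzy_match` name, and the 0.85 threshold are all assumptions:

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def best_fuzzy_match(
    query: str, candidates: Iterable[str], threshold: float = 0.85
) -> Optional[str]:
    """Return the candidate most similar to query, or None if below threshold."""
    normalized_query = normalize(query)
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, normalized_query, normalize(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

For example, `best_fuzzy_match("times of isreal", ["Times of Israel", "Jerusalem Post"])` tolerates the transposed letters, where an exact pattern would miss.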

following up with Graham cloud issues - Gy
keeping an eye on Arbutus crawls - Gy
developing the new crawl strategy with IA as above - Ra
working on resolving the pattern-matching issues in the postprocessor (maybe meet with Nat) - Ar/Fr