Sep 28, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda:
- update on server
- update on crawler strategies
- postprocessor
- reading on dask dataframes
Server
- Graham cloud down
- arbutus:
- 24i news, toi, inn, Jpost still running
- small domain crawls still paused
- proxy crawling: too hard and expensive
new crawl strategy: Internet Archive (IA)
- plan to learn waybackpy, the Python library that wraps the IA API: https://pypi.org/project/waybackpy/
- different workflow:
- get html from IA
- use the crawler's processor to extract metadata (including the HTML version of the article, and hyperlinks) into JSON output
- on the second step (metadata extraction), need to check whether there's a better library to extract metadata, especially (in descending order of importance):
- date + time stamp
- author
- title
- in cases where an article has been crawled more than once, we want the last crawled version
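A minimal sketch of the "last crawled version" rule, assuming the snapshot list comes from the Wayback Machine CDX API in its JSON form (a header row followed by one row per capture). The function name `latest_snapshots` is hypothetical, and waybackpy may already offer an equivalent, so this only illustrates the selection logic and does not call the network.

```python
from typing import Dict, List


def latest_snapshots(cdx_rows: List[List[str]]) -> Dict[str, str]:
    """Map each original URL to the archive URL of its most recent capture.

    cdx_rows is assumed to be CDX JSON output: the first row is the header
    (it must contain "timestamp" and "original"), the rest are captures.
    """
    if not cdx_rows:
        return {}
    header, *rows = cdx_rows
    ts_idx = header.index("timestamp")
    url_idx = header.index("original")

    newest: Dict[str, str] = {}
    for row in rows:
        url, ts = row[url_idx], row[ts_idx]
        # Timestamps are fixed-width YYYYMMDDhhmmss strings, so plain string
        # comparison orders them chronologically.
        if url not in newest or ts > newest[url]:
            newest[url] = ts

    # The "id_" suffix after the timestamp requests the archived page
    # without the Wayback Machine toolbar injected into the HTML.
    return {
        url: f"https://web.archive.org/web/{ts}id_/{url}"
        for url, ts in newest.items()
    }
```

The returned URLs can then be fetched to cover the "get html from IA" step, with the HTML passed on to the processor.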
from morning meeting:
- Different approach to crawl
- Expects WARC
- Downloading from IA as WARC needs tooling
- Crawlbase – free trial – stopping after 1 result
- IA
- Different: need to make a call to get author info
- More likely to crawl top story because it comes from many other sources
- Wrappers that could help:
- Workflow: study the current code, then use a library (read***) to get the textual representation
- Can we use our crawler code to get text output?
- How is the metadata expected: author/date/title?
- And how to extract it without making another call?
- Are there libraries that can use the html code?
- Develop and test locally, and then with larger scope, work on server
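On the question of extracting author/date/title from HTML we already have, without another call: a rough sketch using only the Python standard library. It assumes the pages expose these fields in `<meta>` tags. The tag names in `META_KEYS` are common conventions and an assumption here, not something checked against the target sites, and `extract_metadata` is a hypothetical helper rather than part of the existing crawler.

```python
from html.parser import HTMLParser
from typing import Dict, Optional

# Meta tag names often used by news sites (assumed, not exhaustive).
META_KEYS = {
    "article:published_time": "date",
    "date": "date",
    "author": "author",
    "article:author": "author",
    "og:title": "title",
}


class MetaExtractor(HTMLParser):
    """Collects the first date/author/title found in <meta> tags, plus <title> text."""

    def __init__(self) -> None:
        super().__init__()
        self.found: Dict[str, str] = {}
        self.title_text = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            key = (attr.get("property") or attr.get("name") or "").lower()
            field = META_KEYS.get(key)
            content = attr.get("content")
            # Keep the first value seen for each field.
            if field and content and field not in self.found:
                self.found[field] = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_text += data


def extract_metadata(html: str) -> Dict[str, Optional[str]]:
    """Return date, author and title from an HTML string (None when missing)."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    # Fall back to the <title> element when no og:title meta tag is present.
    title = parser.found.get("title") or parser.title_text.strip() or None
    return {
        "date": parser.found.get("date"),
        "author": parser.found.get("author"),
        "title": title,
    }
```

This works on a local HTML string, so it can be developed and tested locally before running at larger scope on the server. A dedicated article-extraction library may handle more site-specific cases; that comparison is still open.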
postprocessor
- didn't get a meeting with Nat
- Aryan: reviewing code, pandas and data-cleaning
- currently using pattern-matching; the issues might be resolved with more capable libraries, e.g. a fuzzy-matching library
- research libraries to improve on pattern-matching
- the dataframe scaling problem is solved; now just need to figure out pattern-matching
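To make the fuzzy-matching idea concrete, here is a small sketch using `difflib` from the standard library as a stand-in for a dedicated fuzzy library. The notes do not say which strings are being matched, so it assumes the task is mapping a slightly varying string onto a list of known names. `best_match`, the sample names, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def best_match(
    query: str, candidates: Iterable[str], threshold: float = 0.8
) -> Optional[Tuple[str, float]]:
    """Return (candidate, score) for the closest candidate at or above threshold.

    The score is SequenceMatcher.ratio(), between 0.0 and 1.0. Returns None
    when no candidate reaches the threshold.
    """
    q = normalize(query)
    best: Optional[Tuple[str, float]] = None
    for cand in candidates:
        score = SequenceMatcher(None, q, normalize(cand)).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (cand, score)
    return best


# Illustrative data only, not taken from the project.
known = ["Example Daily News", "Sample Times"]
match = best_match("example  daily new", known)
```

If a plain loop like this is too slow on the full dataframe, a dedicated fuzzy-matching library would be the next thing to evaluate.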
Action Items:
- following up on Graham cloud issues - Gy
- keeping an eye on Arbutus crawls - Gy
- developing the new crawl strategy with IA as above - Ra
- working on resolving the pattern-matching issues in the postprocessor (maybe meet with Nat) - Ar/Fr