May 5, 2022
Agenda
send re-processed NYT Archive crawl results
document that there is a limit to the size of the dataset the postprocessor can handle, and that we cannot know this limit in advance
finalize post-processing of the regular NYT crawl, and review the earliest articles for invalid URLs
look at the smaller libraries used by the postprocessor to see if they need updating
group KPP/MediaCAT results into batches of 500,000 (see the chunking sketch after this list)
re-start the small domain crawl; if the error returns, Shengsong will troubleshoot
check why the postprocessor for the small domain crawl produced only 6,000 relevant hits
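
A minimal sketch of the batching idea discussed above, assuming the results live in a flat CSV file; the file names and use of pandas are assumptions, not part of the MediaCAT codebase:

```python
import pandas as pd

CHUNK_SIZE = 500_000                 # batch size agreed on in the meeting
INPUT_FILE = "mediacat_results.csv"  # hypothetical file name

# Stream the results in fixed-size batches instead of loading
# everything into memory at once.
for i, chunk in enumerate(pd.read_csv(INPUT_FILE, chunksize=CHUNK_SIZE)):
    chunk.to_csv(f"mediacat_results_part{i:03d}.csv", index=False)
```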
NYT Archive
sent; Alejandro is working through them
Regular NYT crawl
postprocessing the whole crawl produced only 60 rows; very few of the 900,000 URLs are from the Middle East section (a pre-filtering sketch follows this list)
roughly 90% of the URLs are useless
until we understand what is limiting the crawl, we may set this kind of NYT crawling aside for now
try cnn.com instead
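
One way to cut down the 90% of useless URLs before postprocessing is to pre-filter by section. A sketch under assumptions: the crawl output is one JSON record per line with a "url" field, and the NYT section appears in the URL path (e.g. /world/middleeast/); file names are hypothetical:

```python
import json

SECTION_MARKER = "/world/middleeast/"  # assumed NYT section path

# Hypothetical input: one crawl record per line, each with a "url" field.
with open("nyt_crawl.jsonl") as src, open("nyt_middleeast.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        # Keep only articles whose URL falls under the Middle East section.
        if SECTION_MARKER in record.get("url", ""):
            dst.write(line)
```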
Postprocessing limit issue
one possibility is to try something other than plain Python, e.g. https://dask.org/
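
A sketch of what the Dask route could look like, assuming the results are CSV files; the file pattern and "domain" column are hypothetical:

```python
import dask.dataframe as dd

# Dask reads the CSV lazily in partitions, so the full dataset never
# has to fit in memory the way a plain pandas DataFrame would.
df = dd.read_csv("mediacat_results_*.csv")  # hypothetical file pattern

# Example: count rows per domain; nothing runs until .compute().
hits_per_domain = df.groupby("domain").size().compute()
print(hits_per_domain)
```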
Small Domain Crawl
1.5 million URLs
hitting 403 Forbidden errors on mondoweiss and middleeasteye
the sites could be blocking certain requests
worth trying a different IP,
or Postman,
or a different user agent
headless browser: looks like a real user, but rapid requests can still trigger blocking
may need to time requests differently
need to consider best practices for not overwhelming websites (see the request-pacing sketch below)
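
A minimal sketch of the pacing and user-agent ideas above, assuming the crawler uses Python requests; the header contents, delays, and backoff factor are illustrative, not settled values:

```python
import time
import requests

# Hypothetical courtesy settings: a descriptive user agent plus a fixed
# delay between requests; real values would need tuning per site.
HEADERS = {"User-Agent": "MediaCAT research crawler (contact: team@example.org)"}
DELAY_SECONDS = 2.0

urls = [
    "https://mondoweiss.net/",
    "https://www.middleeasteye.net/",
]

with requests.Session() as session:
    session.headers.update(HEADERS)
    for url in urls:
        resp = session.get(url, timeout=30)
        if resp.status_code == 403:
            # Still blocked: back off for longer before continuing.
            time.sleep(DELAY_SECONDS * 5)
        else:
            print(url, resp.status_code, len(resp.content))
        time.sleep(DELAY_SECONDS)
```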