April 28, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • postprocessor issue with text alias
  • finalize cleanup, updates, and documentation of the NYT crawl methods
  • assess whether any library updates are needed
  • KPP/MediaCAT postprocessed results
  • postprocess the NYT site crawl; consider why it cut off

Postprocessor issues

  • memory issues with larger datasets
    • the old instance (16 CPUs) could not handle 900,000 records; a larger instance (40 CPUs) was needed
    • we predict there will be a limit to the size of dataset the postprocessor can handle, but we can't know it in advance
    • this is another reason to do smaller crawls
  • text alias issue: a simple error, punctuation was being treated as part of the word (see the sketch after this list)
    • Shengsong will send the re-processed NYT Archive crawl results
  • KPP/MediaCAT Twitter data was processed without a hitch on the larger instance
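
The punctuation fix is small in code terms. Below is a hypothetical sketch (the names `find_alias_matches` and `aliases` are illustrative, not the actual postprocessor code) of stripping punctuation from tokens before comparing them against text aliases, so that "NYT," matches the alias "NYT".

```python
# Hypothetical sketch of the text alias fix: strip punctuation from each
# token before comparing it to the alias list, so "NYT," matches "NYT".
# Names here (find_alias_matches, aliases) are illustrative only.
import string


def find_alias_matches(text, aliases):
    """Return alias occurrences in text, ignoring surrounding punctuation."""
    matches = []
    for raw_token in text.split():
        token = raw_token.strip(string.punctuation)
        if token in aliases:
            matches.append(token)
    return matches


print(find_alias_matches("Reported by the NYT, on Thursday.", {"NYT"}))
# -> ['NYT']
```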

NYT crawl

  • documentation? (GitHub page and domain crawler)
  • why did the crawl stop at 900,000 articles?
    • once postprocessing is done we can see if older articles have invalid links

Updates for libraries

  • updates are done: removed unused dependencies such as metascraper
  • Apify v2.30 update is done
  • base languages
    • the master crawler in Python (timing, stopping script) is on the latest version
    • JS: Node.js is already updated
  • all major updates are done; there could be a few smaller ones in the postprocessor
  • Shengsong will look at these next week

KPP/MediaCAT results

  • results need to be checked
  • Shengsong will put them in groups of 500,000 (see the sketch below)
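
As a rough illustration of the 500,000-row grouping, here is a hedged sketch that assumes the postprocessed results are a CSV; the file names and the use of pandas are assumptions, not necessarily how it will actually be done.

```python
# Hedged sketch: split a large results CSV into files of 500,000 rows each.
# File names and the CSV assumption are illustrative, not project specifics.
import pandas as pd

CHUNK_SIZE = 500_000

reader = pd.read_csv("kpp_mediacat_results.csv", chunksize=CHUNK_SIZE)
for i, chunk in enumerate(reader):
    # Each chunk is an ordinary DataFrame holding up to 500,000 rows.
    chunk.to_csv(f"kpp_mediacat_results_part{i:02d}.csv", index=False)
```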

Small domain crawl

  • 900,000 articles went through the postprocessor and produced 6,000 rows
    • that seems low; Shengsong will check
  • there was an issue: an unhandled error caused the crawl to stop
    • it will be re-started; if the error returns, Shengsong will investigate (see the sketch after this list)
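
To make a repeat of the unhandled error easier to diagnose after the re-start, the crawl entry point could be wrapped so the traceback is logged before the process dies. This is only a sketch; `run_crawl()` is a hypothetical stand-in for the real small-domain-crawl script.

```python
# Hypothetical wrapper: log the full traceback of any unhandled error so a
# repeat failure after the re-start leaves something to troubleshoot.
import logging
import traceback

logging.basicConfig(filename="small_domain_crawl.log", level=logging.INFO)


def run_crawl():
    # Placeholder for the real small-domain-crawl logic.
    raise RuntimeError("placeholder error")


if __name__ == "__main__":
    try:
        run_crawl()
    except Exception:
        logging.error("Unhandled error stopped the crawl:\n%s", traceback.format_exc())
        raise
```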

Action Items

  • send the re-processed NYT Archive crawl results
  • document that there is a limit to the size of dataset the postprocessor can handle, and that we can't know it in advance
  • finalize post-processing of the NYT regular crawl, and check the earliest articles for invalid URLs
  • look at the smaller libraries in the postprocessor to see if they need updating
  • group the KPP/MediaCAT results into batches of 500,000
  • re-start the small domain crawl; if the error returns, Shengsong will troubleshoot
  • check why the postprocessor for the small domain crawl only produced 6,000 relevant hits

Backburner

  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now (a filter sketch follows at the end of this list).
  • how to get multithreading working in the postprocessor
  • what to do with htz.li
  • small domain crawl
  • Benchmarking
  • finish documenting where different data are on our server
  • finding language function
  • image_reference function
  • dealing with embedded versus cited tweets
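
For the two backburner output additions (non-scope .co.il links and tweet/handle links), the checks themselves are small. Here is a hedged sketch; the helper names and the twitter.com-only pattern are assumptions for illustration, not the postprocessor's actual schema.

```python
# Hedged sketch of the two backburner checks: domains ending in .co.il and
# links to a tweet or Twitter handle. Helper names and the twitter.com-only
# regex are assumptions for illustration.
import re
from urllib.parse import urlparse

TWEET_OR_HANDLE = re.compile(r"^https?://(www\.)?twitter\.com/[^/]+(/status/\d+)?/?$")


def is_co_il_link(url):
    """True if the link's domain ends in .co.il."""
    return urlparse(url).netloc.lower().endswith(".co.il")


def is_twitter_link(url):
    """True if the link points at a Twitter handle or an individual tweet."""
    return bool(TWEET_OR_HANDLE.match(url))


for link in [
    "https://www.haaretz.co.il/news",
    "https://twitter.com/example/status/123456789",
    "https://twitter.com/example",
    "https://www.example.com/article",
]:
    print(link, is_co_il_link(link), is_twitter_link(link))
```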