April 28, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
postprocessor issue with text alias
finalize clean-up, updating, and documentation of the NYT crawl methods
assessment of any updates needed for libraries
KPP/MediaCAT postprocessed results
postprocess NYT site crawl; think about why the NYT crawl was cut off
Postprocessor issues
memory issues with larger dataset
old instance (16 CPUs) couldn't handle 900,000 articles; had to use the larger instance (40 CPUs)
we expect there is a limit to the dataset size the postprocessor can handle, but we can't know it in advance
this is another reason to do smaller crawls
text alias issue: a simple error; punctuation was being treated as part of the word
Shengsong will send re-processed NYT Archive crawl results
KPP/MediaCAT Twitter data was processed without a hitch on the larger instance
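The alias bug above (punctuation attached to words) is the kind of error that stripping punctuation during tokenization avoids. A minimal sketch of the idea, not MediaCAT's actual code; the function names here are hypothetical:

```python
import string

def tokenize(text: str) -> list[str]:
    # Strip leading/trailing punctuation from each whitespace-split
    # token so that "Netanyahu," matches the alias "Netanyahu".
    return [tok.strip(string.punctuation) for tok in text.split()]

def contains_alias(text: str, alias: str) -> bool:
    # Case-insensitive match of a single-word alias against cleaned tokens.
    tokens = [t.lower() for t in tokenize(text)]
    return alias.lower() in tokens
```

Multi-word aliases would need a sliding-window comparison over the token list, but the punctuation fix is the same.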
NYT crawl
documentation: GitHub page and domain crawler
why did the crawl stop at 900,000 articles?
once postprocessing is done, we can check whether the older articles have invalid links
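Checking the older articles for invalid links could start with a structural test before any network check. A sketch with the standard library, assuming the links are available as a plain list of strings (this is not the project's actual validation logic, and liveness checks via HTTP requests would be a separate step):

```python
from urllib.parse import urlparse

def is_valid_url(url: str) -> bool:
    # Treat a link as structurally valid only if it has an http(s)
    # scheme and a non-empty host.
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

def find_invalid(urls: list[str]) -> list[str]:
    # Return the subset of links that fail the structural check.
    return [u for u in urls if not is_valid_url(u)]
```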
Updates for libraries
updates are done: removed unused dependencies, like metascraper
Apify v2.30 update done
basic language
master crawler in Python (timing, stopping script): on the latest version
JS: Node.js already updated
all major updates are done; there could be a few smaller ones in the postprocessor
Shengsong will look at these next week
KPP/MediaCAT results
needs to be checked
Shengsong will put them in groups of 500,000
small domain crawl
postprocessor finished on the 900,000 articles: only 6,000 rows of output
seems low, Shengsong will check
had an issue: an unhandled error caused it to stop
will restart it; if the error returns, Shengsong will investigate
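Splitting the results into groups of 500,000 rows, as planned above, can be done with simple slicing once the rows are in a list. A minimal sketch, not the actual MediaCAT code:

```python
def chunk(records: list, size: int = 500_000) -> list[list]:
    # Split the result list into consecutive groups of at most `size` rows.
    return [records[i:i + size] for i in range(0, len(records), size)]

# 1,200,000 rows would split into groups of 500,000 + 500,000 + 200,000.
groups = chunk(list(range(1_200_000)))
```

For datasets too large for memory, the same slicing pattern can be applied while streaming rows to numbered output files instead.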
Action Items
send re-processed NYT Archive crawl results
document that there is a limit to the dataset size the postprocessor can process, and that we can't know it in advance
finalize post-processing of the regular NYT crawl; check the earliest articles for invalid URLs
look at the smaller libraries in the postprocessor to see if they need updating
group KPP/MediaCAT results into batches of 500,000
restart the small domain crawl; if the error returns, Shengsong will troubleshoot
check why the postprocessor of the small domain crawl only produced 6,000 relevant hits
Backburner
adding to regular postprocessor output:
any non-scope domain hyperlink that ends in .co.il
any link to a tweet or twitter handle
This is a bit outside our normal functionality, so I will put it on the backburner for now.
how to get multithreading working in the postprocessor
what to do with htz.li
small domain crawl
Benchmarking
finish documenting where different data are on our server
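On the multithreading backburner item: postprocessing in CPython is typically CPU-bound, so processes rather than threads are the standard way to parallelize it (threads would be serialized by the GIL). A minimal standard-library sketch; `process_record` is a hypothetical stand-in for the real per-article work:

```python
from multiprocessing import Pool

def process_record(record: dict) -> dict:
    # Hypothetical stand-in for the real per-article postprocessing step.
    return {**record, "processed": True}

def postprocess_parallel(records: list[dict], workers: int = 4) -> list[dict]:
    # A process pool sidesteps the GIL at the cost of extra memory per
    # worker, which matters given the memory limits noted above.
    with Pool(workers) as pool:
        return pool.map(process_record, records)
```

Because each worker holds its own copy of in-flight data, combining this with the planned smaller batches would keep total memory use in check.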