April 21, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
finalize cleanup, updating, and documentation of the NYT crawl methods
look at retweet/tweet issue
re-run KPP/MediaCAT twitter crawl
run small domain crawl with information from Alejandro
Alejandro: think through proposals for CDHI conference
Alejandro: find new time for weekly meeting
proposals for CDHI conference
workshop
paper from Alejandro based on NYT crawl
retweet/tweet issue
fixed issue and running on KPP/MediaCAT
plain text: found what comes after RT @user and added it under one key
need to look at an example with a comment before the RT
should not include a duplicate of the comment before the RT
data should be available for checking in the next day or so.
NYT crawl
from email:
The general domain crawler ended up crawling 703,641 articles from NYTimes, and the NYTimes search crawler crawled 251,716 articles (without duplicates) from the three search URLs you gave me.
NYT site crawl stopped at 700,000+ because the Puppeteer queue ran down to 0
what is the earliest date?
With increase in the number of crawled articles and different types of articles, we have some new problems.
Fixed Issues:
The NYTimes search crawler sometimes stopped scrolling down after crawling about 10,000 articles, without reporting any error. This is a Puppeteer bug; I worked around it by restarting the crawler after every 5,000 crawled articles.
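The restart workaround could be sketched roughly as below. `launchCrawler` and its `crawl`/`close` methods are hypothetical stand-ins for the project's Puppeteer wrapper, not actual MediaCAT code.

```javascript
// Sketch (assumed API, not MediaCAT's actual code): relaunch the crawler
// every BATCH_SIZE articles so Puppeteer's silent scroll stall cannot
// accumulate across a long run.
const BATCH_SIZE = 5000;

async function crawlInBatches(totalWanted, launchCrawler) {
  const articles = [];
  while (articles.length < totalWanted) {
    const crawler = await launchCrawler(); // fresh instance per batch
    const batch = await crawler.crawl(
      Math.min(BATCH_SIZE, totalWanted - articles.length)
    );
    articles.push(...batch);
    await crawler.close(); // tear down before restarting
  }
  return articles;
}
```

Relaunching gives each batch a fresh browser state, so a stall in one batch cannot poison the rest of the run.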
Unresolved Issues:
The post-processor hit an odd memory error when trying to use multithreading to process the 251,716 URLs from the three NYTimes search URLs. The good news is that the single-threaded post-processor works fine. The output is in the attachment.
There are many different types of URLs from the NYTimes general crawler. Some cause Readability to get stuck while trying to extract the plain text (the meta-scraper got plain text for 120,000/703,641 URLs and then got stuck). I therefore added a 5-second timeout for each URL's plain-text extraction: after 5s the URL is skipped and processing continues with the remaining URLs (the meta-scraper then got plain text for 300,000/703,641 URLs). However, the meta-scraper was still unable to get plain text for all 703,641 articles due to a memory error.
now resolved: the metascraper was checking for duplicates, and the file of URLs grew too big, which caused the memory issue; it no longer checks for duplicates, and if there is a duplicate, it simply overwrites the earlier entry.
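The overwrite-on-duplicate fix amounts to keying results by URL rather than keeping a separate, ever-growing list of seen URLs in memory. A minimal illustrative sketch (not the actual metascraper code):

```javascript
// Sketch of the overwrite-on-duplicate idea: because results are keyed
// by URL, a repeated URL simply replaces the earlier entry, and no
// separate duplicate-check structure is needed.
function collectResults(records) {
  const byUrl = new Map();
  for (const { url, text } of records) {
    byUrl.set(url, text); // duplicate URL overwrites the previous value
  }
  return byUrl;
}
```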
now run postprocessor on NYT site crawl
author extraction seems better in the new archive crawl
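The 5-second per-URL timeout mentioned under Unresolved Issues could look roughly like this; `getPlainText` stands in for the metascraper/Readability extraction call, and the surrounding names are assumptions, not the project's actual code.

```javascript
// Race a (possibly hanging) promise against a timer; reject on timeout.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout')), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Extract plain text for each URL; skip any URL whose extraction hangs
// past the timeout and continue with the rest.
async function extractAll(urls, getPlainText, timeoutMs = 5000) {
  const texts = {};
  for (const url of urls) {
    try {
      texts[url] = await withTimeout(getPlainText(url), timeoutMs);
    } catch {
      texts[url] = null; // timed out or failed; move on
    }
  }
  return texts;
}
```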
update on small domain crawl
update by email
libraries
assessment of any updates needed for libraries
action items
postprocessor issue with text alias
finalize cleanup, updating, and documentation of the NYT crawl methods
assessment of any updates needed for libraries
kpp/mediacat postprocessed results
postprocess NYT site crawl - think about why the NYT crawl cut off
Backburner
adding to regular postprocessor output:
any non-scope domain hyperlink that ends in .co.il
any link to a tweet or twitter handle
This is a bit outside our normal functionality, so I will put it on the backburner for now.
how to get multithreading with postprocessor
what to do with htz.li
small domain crawl
Benchmarking
finish documenting where different data are on our server