April 21, 2022

Agenda

  • finalize cleanup, updating, and documentation of the NYT crawl methods
  • look at retweet/tweet issue
  • re-run KPP/MediaCAT twitter crawl
  • run small domain crawl with information from Alejandro
  • Alejandro: think through proposals for CDHI conference
  • Alejandro: find new time for weekly meeting

proposals for CDHI conference

  • workshop
  • paper from Alejandro based on NYT crawl

retweet/tweet issue

  • fixed the issue and re-running on KPP/MediaCAT
  • plain text: the text that follows RT @user is now captured in its own key
  • need to look at an example with a comment before the RT
  • the comment before the RT should not be duplicated in the retweet text (see the sketch below)
  • data should be available for checking in the next day or so
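
A minimal sketch of the intended split, assuming the tweet's plain text follows the usual `RT @user:` convention. The names `RT_PATTERN` and `splitRetweet` are illustrative, not the crawler's own:

```typescript
// Hypothetical pattern: captures an optional comment before "RT @user",
// the retweeted handle, and the retweeted text. The /s flag lets "."
// span newlines inside the tweet.
const RT_PATTERN = /^(?<comment>.*?)\s*RT @(?<user>\w+):?\s*(?<retweet>.*)$/s;

interface RetweetParts {
  comment: string; // text the user added before the RT (may be empty)
  user: string;    // handle being retweeted
  retweet: string; // only the text that follows "RT @user:"
}

function splitRetweet(text: string): RetweetParts | null {
  const m = text.match(RT_PATTERN);
  if (!m || !m.groups) return null; // not a retweet
  return {
    comment: m.groups.comment,
    user: m.groups.user,
    retweet: m.groups.retweet,
  };
}

// A comment before the RT lands in its own key and is not duplicated
// in the retweet text.
console.log(splitRetweet("Worth a read RT @nytimes: Breaking news ..."));
// -> { comment: "Worth a read", user: "nytimes", retweet: "Breaking news ..." }
```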

NYT crawl

  • from email:
    • The general domain crawler ended up crawling 703,641 articles from NYTimes, and the NYTimes search crawler crawled 251,716 articles (without duplicates) from the three search URLs you gave me.
    • NYT site crawl stopped at 700,000+ because the Puppeteer queue ran down to 0
    • what is the earliest date?
    • With the increase in the number and variety of crawled articles, we have some new problems.
    • Fixed Issues:
      • The NYTimes search crawler sometimes stopped scrolling down after crawling about 10,000 articles, without reporting any error. This is a Puppeteer bug; I fixed it by restarting the crawler after every 5,000 crawled articles (see the restart sketch after this list).
    • Unresolved Issues:
      1. The post-processor hit an odd memory error when trying to use multithreading to process the 251,716 URLs from the three NYTimes search URLs. The good news is that the single-threaded post-processor works fine. The output is in the attachment.
      2. There are many different types of URLs from the NYTimes general crawler. Some cause readability to get stuck when trying to extract the plain text (the meta-scraper got plain text for 120,000 of 703,641 URLs and then stalled). Therefore, I added a 5-second timeout per URL: after 5 seconds the URL is skipped and the crawl continues with the others (the meta-scraper then got plain text for 300,000 of 703,641 URLs). However, the meta-scraper was still unable to get plain text for all 703,641 articles due to a memory error (see the timeout sketch after this list).
      • now resolved: the meta-scraper was checking for duplicates, and the file of URLs grew too large, causing the memory issue; it no longer checks for duplicates, and a duplicate now simply overwrites the earlier output
      • next: run the post-processor on the NYT site crawl
  • author extraction seems better in the new archive crawl
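
The restart workaround can be as simple as recycling the browser process on a fixed schedule. A minimal sketch, assuming a flat list of URLs to visit; `crawlOne` and `crawlWithRestarts` are hypothetical names, and the real crawler drives a queue and a scrolling search page instead:

```typescript
import puppeteer, { Browser } from "puppeteer";

const BATCH_SIZE = 5000; // restart threshold mentioned in the notes

// Stand-in for the crawler's per-URL scraping logic.
async function crawlOne(browser: Browser, url: string): Promise<void> {
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: "domcontentloaded" });
    // ... extract and persist the article here ...
  } finally {
    await page.close();
  }
}

// Launch a fresh browser for every batch so the silent scrolling stall
// never survives more than BATCH_SIZE articles.
async function crawlWithRestarts(urls: string[]): Promise<void> {
  for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    const browser = await puppeteer.launch();
    try {
      for (const url of urls.slice(i, i + BATCH_SIZE)) {
        await crawlOne(browser, url);
      }
    } finally {
      await browser.close();
    }
  }
}
```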
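A minimal sketch of the 5-second timeout plus overwrite-on-duplicate behaviour described above. `getPlainText`, `withTimeout`, and `extractAll` are illustrative names; the real pipeline calls metascraper/readability rather than a bare fetch:

```typescript
import { createHash } from "node:crypto";
import * as fs from "node:fs/promises";
import * as path from "node:path";

// Stand-in for the project's metascraper/readability extraction step.
async function getPlainText(url: string): Promise<string> {
  const res = await fetch(url);
  return res.text();
}

// Race a promise against a deadline; on timeout the URL is skipped.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms),
    ),
  ]);
}

async function extractAll(urls: string[], outDir: string): Promise<void> {
  await fs.mkdir(outDir, { recursive: true });
  for (const url of urls) {
    try {
      const text = await withTimeout(getPlainText(url), 5000);
      // Name the output file by a hash of the URL: a duplicate URL just
      // overwrites its earlier file, so no in-memory duplicate set (and
      // no ever-growing URL file) is needed.
      const name = createHash("sha1").update(url).digest("hex");
      await fs.writeFile(path.join(outDir, `${name}.txt`), text);
    } catch {
      // Timed out or failed: skip this URL and continue with the rest.
    }
  }
}
```

One caveat with this pattern: `Promise.race` only abandons the result; a hung extraction keeps running in the background, so a long run can still accumulate memory, which may be one contributor to the remaining memory error.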

update on small domain crawl

  • update by email

libraries

  • assessment of any updates needed for libraries

action items

  • post-processor issue with the text alias
  • finalize cleanup, updating, and documentation of the NYT crawl methods
  • assessment of any updates needed for libraries
  • KPP/MediaCAT post-processed results
  • post-process the NYT site crawl; think about why the NYT crawl cut off

Backburner

  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now.
  • how to get multithreading working in the post-processor
  • what to do with htz.li
  • small domain crawl
  • Benchmarking
  • finish documenting where different data are on our server
  • language-detection function
  • image_reference function
  • dealing with embedded versus cited tweets