February 24, 2022

Agenda

  • 2 tickets:
    • readability & domain crawler errors
      • question: why did the last spreadsheet have only 800 rows?
    • heap memory error
  • documenting the two scopes on Padlet & updating the list of keys provided in the JSON export from the post-processor
  • Puppeteer update?
  • do we need a full re-crawl of al-monitor?
  • Twitter API crawl

readability & domain crawler errors

  • errors traced to readability, which retrieves the plain text from the linked URL
  • replaced with Puppeteer plain-text retrieval, which seems to be working nearly perfectly (see the sketch after this list)
    • Shengsong added the required filtering code
      • checked NYT as well as al-monitor.com -- filtering should not be domain-specific -- will test
      • simple filter -- update the domain crawler & commit to the main branch
  • this should speed up our crawl by removing the extra call to readability
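
A minimal sketch of what plain-text retrieval with Puppeteer might look like in place of the readability call; the function name and page-load settings are illustrative only, not the crawler's actual code:

```typescript
// Hypothetical sketch: fetch the rendered plain text of a page with Puppeteer
// instead of passing the URL to readability.
import puppeteer from "puppeteer";

async function getPlainText(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    // innerText returns the page's visible text with markup stripped
    return await page.evaluate(() => document.body.innerText);
  } finally {
    await browser.close();
  }
}
```

The simple filter mentioned above would then run over this text before it is written to the crawl output, rather than over readability's result.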

postprocessor updates

  • create a second version of the postprocessor, since the original one works
  • Shengsong's version will become the master branch, and Amy's will be v1
  • the main change is that Shengsong's version formally separates the crawl scope from the citation scope, and changes the names of the keys that the postprocessor outputs (see the sketch after this list)
  • changes are summarized here
  • the citation scope can be split into a smaller crawl_scope; the crawler creates the outputs, and the postprocessor brings the various outputs together and cross-references them
    • this adds flexibility, enabling multiple recursions of the postprocessing and the possibility of updating a crawl
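
As a rough illustration of the separation described above; all field and key names here are hypothetical, not the actual keys in Shengsong's JSON export:

```typescript
// Hypothetical sketch: treating the crawl scope and the citation scope as
// separate inputs to the postprocessor. Names are illustrative only.
interface ScopeEntry {
  name: string;   // e.g. a domain or Twitter handle
  url: string;
}

interface PostprocessorRun {
  crawl_scope: ScopeEntry[];     // sources the crawler actually visits
  citation_scope: ScopeEntry[];  // sources cross-referenced in the output
  crawl_outputs: string[];       // paths to output files from one or more crawls
}
```

Keeping the two scopes as separate inputs is what lets the postprocessor be re-run over accumulated crawl outputs without re-crawling.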

heap memory issue:

  • the heap memory was originally 4 GB; yesterday Shengsong increased it to 7 GB, and it can be increased to 16 GB
  • this is a sysadmin issue: the user should allocate heap memory when setting up the server
    • Shengsong will adjust the script to add a variable for heap memory (see the sketch after this list)
  • this should resolve the domain crawler issue
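
A minimal sketch of what exposing the heap size as a variable in the setup script might look like; the variable name CRAWLER_HEAP_MB and the entry point domain_crawler.js are assumptions, not the actual script:

```typescript
// Hypothetical sketch: launch the domain crawler with a configurable Node.js
// heap size via --max-old-space-size.
import { spawn } from "child_process";

const heapMb = Number(process.env.CRAWLER_HEAP_MB ?? 7168); // 7 GB default; raise to 16384 on a larger server

const crawler = spawn(
  "node",
  [`--max-old-space-size=${heapMb}`, "domain_crawler.js"],
  { stdio: "inherit" }
);

crawler.on("exit", (code) => process.exit(code ?? 0));
```

Passing the flag at launch time, rather than hard-coding it, lets whoever sets up the server match the heap allocation to the memory actually available.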

puppeteer update: from 1.5 to 2.2

  • could take 2-3 days
  • one deprecated function, but it shouldn't affect the structure

Twitter API

  • 2-3 days to get going

Action Items

  • re-do the al-monitor.com crawl & benchmark speed
  • Twitter API
  • testing the new Puppeteer filter code on 50 domains before documenting and committing
  • creating a second version of the postprocessor and making it the master branch, to preserve Amy's version
  • adjust the domain crawler setup script to add a variable for heap memory
    • document in mediacat domain crawler
  • Alejandro will update 2 tickets

Backburner

  • Benchmarking
  • Puppeteer update: the latest is 2.2 and we have 1.5; will take 2-3 days
  • re-do small domain crawl
  • finish documenting where different data are on our server
  • finding language function
  • image_reference function
  • documenting the two scopes on Padlet & updating the list of keys provided in the JSON export from the post-processor