February 24, 2022

Agenda

  • 2 tickets:
    • readability & domain crawler errors
      • question: why did the last spreadsheet have only 800 rows?
    • heap memory error
  • documenting the two scopes on Padlet & updating the list of keys provided in the JSON export from the post-processor
  • Puppeteer update?
  • do we need a full re-crawl of al-monitor?
  • Twitter API crawl

readability & domain crawler errors

  • errors traced to readability, which retrieves the plain text from the linked URL
  • replaced with Puppeteer plain-text retrieval, which seems to be working nearly perfectly (see the sketch after this list)
    • Shengsong added the required filtering code
      • checked NYT as well as al-monitor.com -- filtering should not be domain-specific -- will test
      • simple filter -- update the domain crawler & commit to the main branch
  • this should speed up our crawl by removing the extra call to readability
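
A minimal sketch of what plain-text retrieval with Puppeteer might look like in place of the readability call; the function name and page-load settings are illustrative only, not the crawler's actual code:

```typescript
// Hypothetical sketch: fetch the rendered plain text of a page with Puppeteer
// instead of passing the URL to readability.
import puppeteer from "puppeteer";

async function getPlainText(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    // innerText returns the page's visible text with markup stripped
    return await page.evaluate(() => document.body.innerText);
  } finally {
    await browser.close();
  }
}
```

The simple filter mentioned above would then run over this text before it is written to the crawl output, rather than over readability's result.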

postprocessor updates

  • create a second version of the postprocessor, since the original one works
  • Shengsong's version will become the master branch, and Amy's will be v1
  • the main change is that Shengsong's version formally separates the crawl scope from the citation scope, and changes the names of the keys that the postprocessor outputs (see the sketch after this list)
  • changes are summarized here
  • the citation scope can be split into a smaller crawl_scope; the crawler creates the outputs, and the postprocessor brings the various outputs together and cross-references them
    • this adds flexibility, enabling multiple recursions of the postprocessing and the possibility of updating a crawl
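
As a rough illustration of the separation described above; all field and key names here are hypothetical, not the actual keys in Shengsong's JSON export:

```typescript
// Hypothetical sketch: treating the crawl scope and the citation scope as
// separate inputs to the postprocessor. Names are illustrative only.
interface ScopeEntry {
  name: string;   // e.g. a domain or Twitter handle
  url: string;
}

interface PostprocessorRun {
  crawl_scope: ScopeEntry[];     // sources the crawler actually visits
  citation_scope: ScopeEntry[];  // sources cross-referenced in the output
  crawl_outputs: string[];       // paths to output files from one or more crawls
}
```

Keeping the two scopes as separate inputs is what lets the postprocessor be re-run over accumulated crawl outputs without re-crawling.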

heap memory issue:

  • the heap memory was originally 4 GB; yesterday Shengsong increased it to 7 GB, and it can be increased to 16 GB
  • this is a sysadmin issue: the user should allocate heap memory when setting up the server
    • Shengsong will adjust the script to add a variable for heap memory (see the sketch after this list)
  • this should resolve the domain crawler issue
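
A minimal sketch of what exposing the heap size as a variable in the setup script might look like; the variable name CRAWLER_HEAP_MB and the entry point domain_crawler.js are assumptions, not the actual script:

```typescript
// Hypothetical sketch: launch the domain crawler with a configurable Node.js
// heap size via --max-old-space-size.
import { spawn } from "child_process";

const heapMb = Number(process.env.CRAWLER_HEAP_MB ?? 7168); // 7 GB default; raise to 16384 on a larger server

const crawler = spawn(
  "node",
  [`--max-old-space-size=${heapMb}`, "domain_crawler.js"],
  { stdio: "inherit" }
);

crawler.on("exit", (code) => process.exit(code ?? 0));
```

Passing the flag at launch time, rather than hard-coding it, lets whoever sets up the server match the heap allocation to the memory actually available.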

puppeteer update: from 1.5 to 2.2

  • could take 2-3 days
  • one deprecated function, but it shouldn't affect the structure

Twitter API

  • 2-3 days to get going

Action Items

  • re-do the al-monitor.com crawl & benchmark speed
  • Twitter API
  • testing the new Puppeteer filter code on 50 domains before documenting and committing
  • creating a second version of the postprocessor and making it the master branch, to preserve Amy's version
  • adjust the domain crawler setup script to add a variable for heap memory
    • document in mediacat domain crawler
  • Alejandro will update 2 tickets

Backburner

  • Benchmarking
  • Puppeteer update: the latest is 2.2 and we have 1.5; will take 2-3 days
  • re-do small domain crawl
  • finish documenting where different data are on our server
  • finding language function
  • image_reference function
  • documenting the two scopes on Padlet & updating the list of keys provided in the JSON export from the post-processor