February 24, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
2 tickets:
readability & domain crawler errors
question: why does the last spreadsheet have only 800 rows?
heap memory error
documenting two scopes on padlet & update the list of keys provided on JSON export from the post-processor
Puppeteer update?
need to do full re-crawl of al-monitor?
Twitter API crawl
readability & domain crawler errors
errors came from readability: retrieving plain text from the link at each url
replaced with puppeteer plain text retrieval, which seems to be working nearly perfectly
Shengsong added code for the required filtering
checked NYT as well as al-monitor.com -- filtering should not be domain specific -- will test
simple filter -- update domain crawler & commit to main branch
this should speed up our crawl without the extra call to readability
postprocessor updates
create a second version since the original one works
Shengsong's version will become the master branch, and Amy's will be v1
main change is that Shengsong's formally separates the crawl scope from the citation scope, and changes the names of the keys that the postprocessor outputs
citation scope can be split into smaller crawl scopes: the crawler creates an output for each, and the postprocessor brings the various outputs together and cross-references them
adds flexibility: enables multiple recursions of the postprocessing and makes it possible to update a crawl
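The split between crawl scope and citation scope can be sketched as below. This is a minimal illustration in Python under stated assumptions, not the postprocessor's actual code: the function names, the record keys (`url`, `found_urls`, `crawl_scope`, `cited_by`) and the sample inputs are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_of(url):
    """Return the host part of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def merge_crawl_outputs(crawl_outputs, citation_scope):
    """Combine the outputs of several smaller crawls and cross-reference them.

    crawl_outputs: dict mapping a (hypothetical) crawl-scope name to a list
        of article records with the assumed keys 'url' and 'found_urls'.
    citation_scope: set of domains whose citations should be tracked.
    """
    articles = {}
    cited_by = defaultdict(set)

    for scope_name, records in crawl_outputs.items():
        for record in records:
            # Remember which crawl scope produced each article.
            articles[record["url"]] = {"crawl_scope": scope_name, "cited_by": []}
            # Collect only links that point into the citation scope.
            for link in record["found_urls"]:
                if domain_of(link) in citation_scope:
                    cited_by[link].add(record["url"])

    # Attach cross-references to articles that were actually crawled;
    # links to pages outside every crawl output are dropped in this sketch.
    for url, sources in cited_by.items():
        if url in articles:
            articles[url]["cited_by"] = sorted(sources)
    return articles
```

Because the merge only reads crawl outputs, it could be re-run whenever a new or updated crawl output is added, which is the flexibility the notes mention.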
heap memory issue:
original heap memory 4 GB, yesterday Shengsong increased to 7 GB, can be increased to 16 GB
this is a sysadmin issue: the user should allocate heap memory when setting up the server
Shengsong will adjust the script to add a variable for heap memory
this should resolve domain crawler issue
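One way the heap-memory variable could look is sketched below in Python. It is an assumption-laden illustration, not the actual setup script: the `HEAP_MB` environment variable, the `node_command` helper and the `crawl.js` script name are hypothetical. The only real piece is Node's `--max-old-space-size` flag, which takes a size in megabytes.

```python
import os


def node_command(script, default_heap_mb=4096):
    """Build a node command line with a configurable heap limit.

    Reads the limit (in MB) from the hypothetical HEAP_MB environment
    variable and falls back to default_heap_mb (4 GB) when it is unset.
    """
    heap_mb = int(os.environ.get("HEAP_MB", default_heap_mb))
    if heap_mb <= 0:
        raise ValueError("heap size must be a positive number of megabytes")
    # --max-old-space-size sets Node's old-generation heap limit in MB.
    return ["node", f"--max-old-space-size={heap_mb}", script]


if __name__ == "__main__":
    # 'crawl.js' is a placeholder name for the crawler entry point.
    print(" ".join(node_command("crawl.js")))
```

With this shape, setting `HEAP_MB=7168` would correspond to the 7 GB setting and `HEAP_MB=16384` to 16 GB.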
puppeteer update: from 1.5 to 2.2
could take 2-3 days
one deprecated function, but it shouldn't affect the structure
Twitter API
2-3 days to get going
Action Items
re-do al-monitor.com crawl & benchmarking speed
Twitter API
testing new puppeteer filter code on 50 domains before documenting and committing
creating a second version of the postprocessor and making it master, to preserve Amy's version
adjust the domain crawler setup script to add a variable for heap memory
document in mediacat domain crawler
Alejandro will update 2 tickets
Backburner
Benchmarking
puppeteer update: target is 2.2 and we have 1.5; would take 2-3 days
re-do small domain crawl
finish documenting where different data are on our server
finding language function
image_reference function
documenting two scopes on padlet & update the list of keys provided on JSON export from the post-processor