track down the function, echo its output, and inspect the plain text before it gets saved
also check whether the problem is in the queue: open the SQLite file with a database viewer to see how keys are being stored
also check the Puppeteer GitHub repository, including for updates to the Puppeteer code
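The queue check could also be done from a short script instead of a GUI viewer. This is a minimal sketch using Python's standard sqlite3 module; the function name dump_tables is hypothetical, and nothing here assumes a particular table layout, since these notes do not say how the queue database is structured.

```python
import sqlite3


def dump_tables(db_path, limit=5):
    """Return {table_name: (column_names, first `limit` rows)} for every table.

    db_path is the path to the queue's SQLite file (not specified in these notes).
    """
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master lists every object in the database; keep only tables.
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        summary = {}
        for table in tables:
            # Table names cannot be bound as parameters, so quote them by hand.
            quoted = '"' + table.replace('"', '""') + '"'
            cursor = conn.execute(f"SELECT * FROM {quoted} LIMIT ?", (limit,))
            columns = [description[0] for description in cursor.description]
            summary[table] = (columns, cursor.fetchall())
        return summary
    finally:
        conn.close()
```

Printing the returned dictionary for the queue file shows each table's column names and a few sample rows, which should be enough to see how the keys are actually being stored.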
probably a separate issue: error "FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory"
al-monitor/original/ - heap memory: memory leak; our code allocated memory and did not release what it had stored temporarily once it finished
this error only came up once
JS error, not a Puppeteer error
there is a workaround that increases the heap size (Node's --max-old-space-size flag), but the crawl can still fail
a shorter crawl (e.g., only URLs with /FA/ for Farsi) has a much lower error rate
Postprocessor and CSV creation
Shengsong streamlined and modified the find citation alias function
Postprocessor currently
finds URL articles and tweets with relevant citations (either text alias or hyperlink) and creates a row for each, but does not yet store and list the relevant citations themselves; Shengsong will correct this second part so that it does
Shengsong modified the find alias function
includes columns for language and image reference, which we will remove
the image reference info is there, but we will put it on the back burner
does not include a column for article title, which we will add
Changes to postprocessor column names:
change the name of "url or alias text" to "url"
change the name of "name/title" to "name"
change the name of "citation name/title" to "citation name"
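The three renames above could be applied to an existing postprocessor CSV with a small header rewrite. This is a sketch only: the function name rename_columns is hypothetical, and the sample row and the extra "language" column are made up for illustration; only the old and new column names come from these notes.

```python
import csv
import io

# Old header -> new header, as listed in these notes.
COLUMN_RENAMES = {
    "url or alias text": "url",
    "name/title": "name",
    "citation name/title": "citation name",
}


def rename_columns(src, dst, renames=COLUMN_RENAMES):
    """Copy CSV rows from src to dst, renaming any header field found in `renames`."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to write
    writer.writerow([renames.get(column, column) for column in header])
    writer.writerows(reader)  # data rows pass through unchanged


# Illustrative input; real postprocessor output will have more columns.
sample = io.StringIO(
    "url or alias text,name/title,citation name/title,language\n"
    "https://example.com,Example,Other,en\n"
)
result = io.StringIO()
rename_columns(sample, result)
print(result.getvalue())
```

Columns not named in the mapping keep their current headers, so the same function still works after the language and image reference columns are dropped.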
New terminology:
MediaCAT takes 2 kinds of scope:
crawl_scope = a set of domains (for the domain crawler) and/or Twitter accounts (for the Twitter API crawler) to be crawled
citation_scope = the set of news sites and Twitter accounts that the user wants the postprocessor to find in the crawled data; this is passed as input to the postprocessor
Crawl scope and citation scope can be different or the same depending on the needs of the user.
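To make the distinction between the two scopes concrete, here is a rough sketch of how they might be represented and how a citation_scope could be matched against hyperlinks found in crawled data. The Scope class, the helper names, and all example domains and handles are hypothetical; the actual postprocessor also matches text aliases, which this sketch leaves out.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Scope:
    """A set of domains and/or Twitter handles (hypothetical representation)."""
    domains: set = field(default_factory=set)
    twitter_handles: set = field(default_factory=set)


def normalize_domain(url):
    """Lower-case the host of a URL and drop a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def find_cited_domains(hyperlinks, citation_scope):
    """Return the citation_scope domains that any of the hyperlinks point to."""
    linked = {normalize_domain(url) for url in hyperlinks}
    return sorted(linked & citation_scope.domains)


# crawl_scope: what gets crawled. citation_scope: what the postprocessor looks for.
crawl_scope = Scope(domains={"example-news.org"}, twitter_handles={"@example_news"})
citation_scope = Scope(domains={"example-source.com", "example-wire.net"})

links_in_article = [
    "https://www.example-source.com/story/1",
    "https://unrelated.example/page",
]
print(find_cited_domains(links_in_article, citation_scope))
```

Keeping the two scopes as separate objects of the same type reflects the point that they can be identical or different depending on what the user needs.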
Twitter API crawl
Shengsong read through the crawl but has not yet had a chance to begin coding
Action Items
Shengsong will create 2 tickets:
readability, plain text, and debugging (priority)
JS heap memory error
Alejandro will add language about the two scopes (crawl scope and citation scope) to the MVP
Backburner
Benchmarking
re-do small domain crawl
finish documenting where different data are on our server