March 3, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • re-do al-monitor.com crawl & benchmarking speed
  • Twitter API
  • testing new puppeteer filter code on 50 domains before documenting and committing
  • creating a second version of the postprocessor and making it master, to preserve Amy's version
  • adjust domain crawler set up script to add a variable for heap memory
    • document in mediacat domain crawler
  • Alejandro will update 2 tickets: readability & JS Heap memory (couldn't find them)

Twitter API

  • Shengsong managed to get the Twitter API crawler working
  • will meet with Alejandro to go over which keys to include in the output
  • it may take a few days to coordinate the postprocessor for a combined Twitter and domain crawler output
  • 6,000 tweets retrieved in 1-2 minutes, and it is possible to increase speed with multi-processing, as the Twitter API allows each handle to be processed separately
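Since each handle can be processed independently, the per-handle fan-out could be sketched roughly as below. This is a minimal illustration, not the project's actual crawler: `fetch_tweets_for_handle` is a hypothetical placeholder for the real Twitter API call, and a thread pool stands in for the multi-processing the notes mention (API calls are I/O-bound, so either works).

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_tweets_for_handle(handle):
    """Hypothetical stand-in for the real Twitter API call.

    The actual crawler would page through the API for one handle;
    here we return placeholder records so the sketch is runnable.
    """
    return [{"handle": handle, "tweet_id": i} for i in range(3)]

def crawl_handles(handles):
    # Handles are independent, so they can be fanned out across
    # workers and the per-handle batches merged at the end.
    with ThreadPoolExecutor() as pool:
        batches = pool.map(fetch_tweets_for_handle, handles)
    return [tweet for batch in batches for tweet in batch]
```

The merged list would then feed the combined Twitter/domain-crawler postprocessing step discussed above.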

Re-Do al-monitor.com & Benchmarking

  • 60,000 per day
  • JS heap memory issues were resolved for this crawl

testing new puppeteer filter code

  • filter code: selects HTML content to extract plain text
  • tested on 20 domains; it worked without issue on 18, and with a small adjustment Shengsong got the crawler working on the other 2
  • Google Chrome: inspect the HTML content to see what the selector is

change to processing as a result: the crawler grabs raw_content, and the postprocessor will determine plain text & hyperlinks
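That split could look roughly like the sketch below: the crawler stores the raw HTML, and a postprocessing pass pulls out visible text and hyperlinks. The function and field names (`postprocess`, `plain_text`, `hyperlinks`) are assumptions for illustration, not the project's actual schema; the standard-library `HTMLParser` stands in for whatever parser the real postprocessor uses.

```python
from html.parser import HTMLParser

class RawContentParser(HTMLParser):
    """Walk raw HTML, collecting visible text and <a href> targets."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.hyperlinks = []
        self._skip = 0  # depth inside <script>/<style>, whose text is not visible

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hyperlinks.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())

def postprocess(raw_content):
    # Hypothetical output shape: one plain-text string plus a link list.
    parser = RawContentParser()
    parser.feed(raw_content)
    return {"plain_text": " ".join(parser.text_parts),
            "hyperlinks": parser.hyperlinks}
```

For example, `postprocess('<p>Hello <a href="http://example.com">link</a></p>')` would yield the text `Hello link` and the single link `http://example.com`.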

making v2 of postprocessor

  • done
  • will need an update to reflect the new plain-text extraction function

adjust domain crawler set-up script to add a variable for heap memory, and document:

  • done
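For context, Node's V8 heap limit is normally raised with the `--max-old-space-size` flag, so the set-up script's new variable might look something like the fragment below. The variable name `HEAP_MB` and the 4096 MB value are assumptions for illustration; the real script's names and default may differ, so see the mediacat-domain-crawler documentation.

```shell
# Hypothetical set-up script variable (real name/default may differ):
HEAP_MB=4096
# Raise the V8 heap limit for the crawler's Node process:
export NODE_OPTIONS="--max-old-space-size=${HEAP_MB}"
```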

Tickets:

  • readability: done
  • JS Heap Memory: done

Action Items

  • finalize Twitter API crawl
  • need a list of the JSON outputted from the crawler, with keys, and other documentation as mentioned in Padlet
  • get started on the Puppeteer update from 1.5 to 2.2; will take 2-3 days
  • once readability is stripped from domain crawler and domain crawler is updated, run small domain crawl
    • Alejandro will provide domain URLs for 5 smaller domains
  • move plain-text extraction to the postprocessor (as described above)

Backburner

  • Benchmarking
  • finish documenting where different data are on our server
  • finding language function
  • image_reference function