March 10, 2022

Agenda

  • finalize Twitter API crawl
  • need a list of the JSON files output by the crawler, with their keys, and other documentation as mentioned in the padlet
  • get started on the Puppeteer update: target is 2.2 and we are on 1.5; should take 2-3 days
  • once Readability is stripped from the domain crawler and the crawler is updated, run a small domain crawl
    • Alejandro will provide URLs for 5 smaller domains
  • move plain text extraction to the postprocessor (see the postprocessor update below)

Twitter API crawl

  • difficulty checking results due to large files
    • automate splitting CSV output into files of 1 million rows for error checking
    • write a script that either keeps the Twitter API output as a single file, or breaks it into multiple output files of at most 1 million tweets each (so they can be opened in Excel); see the splitting sketch after this list
  • URL extender
    • look into the extender that John developed and think about how it should be used: should it be added to the postprocessor, or kept as a separate script? (see the redirect-following sketch after this list)
  • crawl with all options (geolocation, etc.)
    • we are able to crawl with all the public metrics, including the geo and withheld fields; see the request sketch below
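
A minimal sketch of the splitting script, assuming the crawl output is already one large CSV; the file paths and the single-file flag are hypothetical placeholders:

```python
import csv

MAX_ROWS = 1_000_000  # stays under Excel's 1,048,576-row sheet limit

def split_csv(src_path, dest_prefix, single_file=False):
    """Copy the crawl CSV to one file, or to numbered parts of MAX_ROWS rows each."""
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        part, rows_in_part, out, writer = 0, 0, None, None
        for row in reader:
            if writer is None or (not single_file and rows_in_part >= MAX_ROWS):
                if out is not None:
                    out.close()
                part += 1
                suffix = "" if single_file else f"_part{part:03d}"
                out = open(f"{dest_prefix}{suffix}.csv", "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)  # repeat the header in every part
                rows_in_part = 0
            writer.writerow(row)
            rows_in_part += 1
        if out is not None:
            out.close()

# e.g. split_csv("crawl_output.csv", "crawl_output")
#   -> crawl_output_part001.csv, crawl_output_part002.csv, ...
```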
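John's extender itself isn't in these notes; purely to illustrate the idea, here is a stand-alone sketch that expands shortened links (e.g. t.co wrappers) by following redirects with requests:

```python
import requests

def expand_url(short_url, timeout=10.0):
    """Follow HTTP redirects and return the final destination URL."""
    try:
        resp = requests.head(short_url, allow_redirects=True, timeout=timeout)
        return resp.url
    except requests.RequestException:
        return short_url  # leave unresolvable URLs as-is
```

Whether this lives in the postprocessor or stays a separate script is the open question above; written as a pure function, it could be called from either.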
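For reference, requesting those optional field groups from the Twitter API v2 user-timeline endpoint looks roughly like the sketch below; the environment variable and user id are placeholders, not the project's actual crawler code:

```python
import os
import requests

BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]  # placeholder credential
USER_ID = "2244994945"  # placeholder user id

def fetch_timeline_page(pagination_token=None):
    """Fetch one page of a user timeline with the extra field groups enabled."""
    params = {
        "max_results": 100,
        "tweet.fields": "created_at,public_metrics,geo,withheld",
    }
    if pagination_token:
        params["pagination_token"] = pagination_token
    resp = requests.get(
        f"https://api.twitter.com/2/users/{USER_ID}/tweets",
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        params=params,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```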

postprocessor update

  • test Twitter API output with a small file of around 60,000 tweets
  • plain text extraction has been moved to the postprocessor:
    • looking over results; waiting on the RA (see the extraction sketch below)
  • next step: make the postprocessor handle Twitter API output
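
A minimal sketch of the extraction step, assuming each crawler JSON record stores the raw page HTML under a key such as html_content (that key name is an assumption) and using BeautifulSoup as a stand-in for whatever extractor the postprocessor actually adopts:

```python
import json
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_plain_text(record_path):
    """Load one crawler JSON record and flatten its stored HTML to plain text."""
    with open(record_path, encoding="utf-8") as f:
        record = json.load(f)
    soup = BeautifulSoup(record["html_content"], "html.parser")  # assumed key
    for tag in soup(["script", "style", "noscript"]):  # drop non-content markup
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)
```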

Padlet:

  • need a list of the JSON files output by the crawler, with their keys, and other documentation as mentioned in the padlet
    • documentation completed, but couldn't edit the padlet; ask Kirsta

Puppeteer update -- priority:

  • looking at a deprecation issue: the crawler still uses a method from before 1.0
    • pre-hook: checks and filters URLs before crawling; it has been updated quite a bit, so the migration is somewhat complicated
  • so far no other issues

Action Items

  • decide on a new policy for the URL extender
  • finalize the Puppeteer update
  • finalize crawl of timelines for KPP/MediaCAT: 60,001+ tweets
  • make the postprocessor able to read Twitter API output
  • Alejandro/RA to check the plain text output from the postprocessor extraction

Backburner

  • small domain crawl
  • benchmarking
  • finish documenting where different data are on our server
  • language detection function
  • image_reference function
  • dealing with embedded versus cited tweets