March 10, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
Twitter API: finalize crawl
need a list of the JSON output from the crawler, with its keys, plus the other documentation mentioned in the Padlet
get started on the Puppeteer update: latest is 2.2 and we have 1.5; should take 2-3 days
once readability is stripped from domain crawler and domain crawler is updated, run small domain crawl
Alejandro will provide domain URLs for 5 smaller domains
move plain text extraction to the postprocessor (as described above)
Twitter API crawl
difficulty checking results: the output files are large
automate splitting the CSV output into files of 1 million tweets for error checking
write a script that produces either a single output file from the Twitter API results, or multiple output files of at most 1 million tweets each (so that they can be opened in Excel)
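A minimal sketch of what that splitting script could look like, assuming the Twitter API results are available as an iterable of dicts. The function name `write_tweet_csvs`, the `stem` file naming, and the column names passed in `fields` are all hypothetical and not taken from the existing crawler; the 1 million cap keeps each file under Excel's 1,048,576-row sheet limit.

```python
import csv
from itertools import islice
from pathlib import Path

MAX_ROWS = 1_000_000  # with the header row, still under Excel's 1,048,576-row limit


def write_tweet_csvs(tweets, out_dir, fields, max_rows=MAX_ROWS, stem="tweets"):
    """Write tweets to one CSV, or to numbered CSVs of at most max_rows rows each.

    tweets   -- iterable of dicts (assumed shape of the crawler output)
    fields   -- column names to keep; any other keys are ignored
    max_rows -- rows per file; None means a single output file
    Returns the list of paths written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = iter(tweets)
    paths = []
    part = 1
    while True:
        # islice(rows, None) consumes everything, which gives the single-file case
        chunk = list(islice(rows, max_rows))
        if not chunk:
            break
        name = f"{stem}.csv" if max_rows is None else f"{stem}_{part:04d}.csv"
        path = out_dir / name
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(chunk)
        paths.append(path)
        part += 1
    return paths
```

Calling it with `max_rows=None` gives the single-file option; the default gives the Excel-sized pieces. Each chunk is held in memory before it is written, which keeps the sketch short but may need to become a streaming write for very large crawls.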
URL extender
look into the extender that John developed and decide how it should be used: should it be added to the postprocessor, or kept as a separate script
crawl with all options (geolocation etc)
we are able to crawl with all the public metrics, including geo & withheld
postprocessor update
test Twitter API output with a small file of around 60,000
moved plain text extraction to postprocessor:
looking over results, waiting on RA
next step: postprocess the Twitter API output
padlet:
need a list of the JSON output from the crawler, with its keys, plus the other documentation mentioned in the Padlet
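One possible way to produce that key list is to walk a sample of the crawler output and record every key path together with the value types seen there. This is only a sketch: the helper name `collect_key_paths` is hypothetical, and the sample record below is illustrative rather than real crawler output.

```python
import json
from collections import defaultdict


def collect_key_paths(record, prefix="", found=None):
    """Map every key path in a JSON value to the set of value types seen there."""
    if found is None:
        found = defaultdict(set)
    if isinstance(record, dict):
        for key, value in record.items():
            path = f"{prefix}.{key}" if prefix else key
            found[path].add(type(value).__name__)
            collect_key_paths(value, path, found)
    elif isinstance(record, list):
        for item in record:
            # all items of a list share one path, marked with []
            collect_key_paths(item, f"{prefix}[]", found)
    return found


# Illustrative input only; a real run would json.load() a crawler output file.
sample = json.loads(
    '[{"id": "1", "public_metrics": {"like_count": 3}}, {"id": "2", "geo": null}]'
)
keys = collect_key_paths(sample)
for path in sorted(keys):
    print(path, sorted(keys[path]))
```

The printed paths could be pasted into the documentation as a starting point, with a description added by hand for each key.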