February 10, 2022
Agenda
apologies for lack of action items!
Shengsong will document how to resize tmp, how to recreate an instance from backup, and, over time, which data is stored on which instance.
documenting Puppeteer
Twitter API crawler
Colin's last day -- best of luck to you!
Documentation
almost done
documentation of JupyterLab is already in the wiki; CSV processing documentation will go to the Mediacat backend repo
documenting Puppeteer
for next week
to be done when we start a new crawl
Twitter API Crawler
Colin finished authentication and fetching tweets
getting the scope is easy; the hard part is how to format the output for CSV processing, and whether to use the format of output.json or to output CSV directly from the Twitter crawler
probably need to use the current format in order to output data that is relevant to the researcher via the postprocessor
pandas via JupyterLab can both convert data to CSV and manipulate it (see the sketch after this list)
currently the API crawler is structured to give a format like our JSON, returned as a dict (dictionary)
if a quick visualization is wanted, output directly from the API is okay, but otherwise we need a format that the postprocessor can read
the last thing needed is a way to read the scope, and then we can start the crawl (Shengsong)
need to document and commit to a repo
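As a rough illustration of the two pieces above (the scope reader and the pandas conversion to CSV), here is a minimal sketch. It is not the project's actual crawler code: every file, column, and field name in it (scope.csv, the "Source" column, the tweet fields) is an assumption for illustration only.

```python
# Minimal sketch (not the actual crawler code): reading Twitter handles out of a
# scope CSV and flattening dict-shaped crawler output to CSV with pandas.
# Column and field names below ("Source", "found_urls", etc.) are assumptions.
import pandas as pd


def read_scope(scope_csv_path):
    """Return the list of Twitter handles found in the scope CSV."""
    scope = pd.read_csv(scope_csv_path)
    # Assumed column name; adjust to whatever the real scope file uses.
    handles = scope["Source"].dropna()
    return [h for h in handles if h.startswith("@")]


def tweets_to_csv(tweets_by_id, out_path):
    """Flatten a dict of tweets (keyed by tweet id, output.json-style) to CSV."""
    df = pd.DataFrame.from_dict(tweets_by_id, orient="index")
    df.to_csv(out_path, index_label="id")


if __name__ == "__main__":
    handles = read_scope("scope.csv")
    print(f"{len(handles)} Twitter handles in scope: {handles}")

    # Fake crawler output, shaped roughly like the domain crawler's output.json.
    sample = {
        "1489999999999999999": {
            "url": "https://twitter.com/example/status/1489999999999999999",
            "author": "@example",
            "date": "2022-02-10",
            "text": "Example tweet text",
            "found_urls": ["https://example.com/article"],
        }
    }
    tweets_to_csv(sample, "twitter_output.csv")
```

The idea follows the discussion above: keep the crawler's in-memory structure dict-shaped like output.json so the postprocessor can read it, and let pandas handle the flattening to CSV when a researcher wants a spreadsheet view.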
Postprocessor - make two?
Shengsong has been reading through the postprocessor
Nat: as much as possible, keep the standard postprocessing output format; this would enable easier analysis; a wrapper can always be added later to bind the two together
Colin: agreed; JupyterLab analysis is easier if both have the same output format
problem with the postprocessor:
what about linkages between the two data sets, i.e. when a URL-article is cited in a tweet or a tweet in a URL-article (see the sketch after this list)
need to think this question through more
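The linkage question is still open. Purely to make it concrete (this is not a decision from the meeting), one possible approach is to cross-reference the two outputs by URL: a tweet links to a URL-article if the article's URL appears among the tweet's found URLs, and a URL-article links to a tweet if it cites the tweet's URL. Field names below ("url", "found_urls") are assumptions, not the real schema.

```python
# Illustrative sketch only: cross-linking the Twitter and domain-crawl outputs
# by URL. Field names ("url", "found_urls") are assumptions, not the real schema.
def link_records(articles, tweets):
    """Return (article_id, tweet_id) pairs where one record cites the other."""
    links = []
    article_urls = {a["url"]: aid for aid, a in articles.items()}
    tweet_urls = {t["url"]: tid for tid, t in tweets.items()}
    for tid, tweet in tweets.items():
        # Tweet cites a URL-article.
        for url in tweet.get("found_urls", []):
            if url in article_urls:
                links.append((article_urls[url], tid))
    for aid, article in articles.items():
        # URL-article cites a tweet.
        for url in article.get("found_urls", []):
            if url in tweet_urls:
                links.append((aid, tweet_urls[url]))
    return links
```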
Action Items
Shengsong and Alejandro will meet to discuss postprocessor and its goals
Shengsong will develop reader for scope to start Twitter crawl
Colin will commit CSV processing to the Mediacat backend repo
Backburner
updating postprocessor category names and adding the "title" of the URL-article
re-do small domain crawl
finish documenting where different data are stored on our server