February 10, 2022
Agenda
apologies for lack of action items!
Shengsong will document how to resize tmp, how to recreate an instance from backup, and, over time, which data is stored on which instance.
documenting Puppeteer
Twitter API crawler
Colin's last day -- best of luck to you!
Documentation
almost done
documentation of JupyterLab is already in the wiki; CSV processing documentation will go to the Mediacat backend repo
documenting Puppeteer
for next week
to be done when we start a new crawl
Twitter API Crawler
Colin finished authentication and fetching tweets
getting the scope is easy; the hard part is how to format the output for CSV processing, and whether to use the format of output.json or to output CSV directly from the Twitter crawler
probably need to use the current format in order to output data that is relevant to the researcher via the postprocessor
pandas via JupyterLab can both convert data to CSV and manipulate it (see the sketch after this list)
currently the API crawler is structured to give a format like our JSON, returned as a dict (dictionary)
if a quick visualization is wanted, output directly from the API is okay, but otherwise we need a format that the postprocessor can read
the last thing needed is a way to read the scope, and then we can start the crawl (Shengsong)
need to document and commit to a repo
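As a rough illustration of the two pieces above (the scope reader and the pandas conversion to CSV), here is a minimal sketch. It is not the project's actual crawler code: every file, column, and field name in it (scope.csv, the "Source" column, the tweet fields) is an assumption for illustration only.

```python
# Minimal sketch (not the actual crawler code): reading Twitter handles out of a
# scope CSV and flattening dict-shaped crawler output to CSV with pandas.
# Column and field names below ("Source", "found_urls", etc.) are assumptions.
import pandas as pd


def read_scope(scope_csv_path):
    """Return the list of Twitter handles found in the scope CSV."""
    scope = pd.read_csv(scope_csv_path)
    # Assumed column name; adjust to whatever the real scope file uses.
    handles = scope["Source"].dropna()
    return [h for h in handles if h.startswith("@")]


def tweets_to_csv(tweets_by_id, out_path):
    """Flatten a dict of tweets (keyed by tweet id, output.json-style) to CSV."""
    df = pd.DataFrame.from_dict(tweets_by_id, orient="index")
    df.to_csv(out_path, index_label="id")


if __name__ == "__main__":
    handles = read_scope("scope.csv")
    print(f"{len(handles)} Twitter handles in scope: {handles}")

    # Fake crawler output, shaped roughly like the domain crawler's output.json.
    sample = {
        "1489999999999999999": {
            "url": "https://twitter.com/example/status/1489999999999999999",
            "author": "@example",
            "date": "2022-02-10",
            "text": "Example tweet text",
            "found_urls": ["https://example.com/article"],
        }
    }
    tweets_to_csv(sample, "twitter_output.csv")
```

The idea follows the discussion above: keep the crawler's in-memory structure dict-shaped like output.json so the postprocessor can read it, and let pandas handle the flattening to CSV when a researcher wants a spreadsheet view.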
Postprocessor - make two?
Shengsong has been reading through the postprocessor
Nat: as much as possible, keep the standard postprocessing output format; this would enable easier analysis; a wrapper can always be added later to bind the two together
Colin: agreed; JupyterLab analysis is easier if both have the same output format
problem with the postprocessor:
what about linkages between the two data sets, i.e. when a URL-article is cited in a tweet or a tweet in a URL-article (see the sketch after this list)
need to think this question through more
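The linkage question is still open. Purely to make it concrete (this is not a decision from the meeting), one possible approach is to cross-reference the two outputs by URL: a tweet links to a URL-article if the article's URL appears among the tweet's found URLs, and a URL-article links to a tweet if it cites the tweet's URL. Field names below ("url", "found_urls") are assumptions, not the real schema.

```python
# Illustrative sketch only: cross-linking the Twitter and domain-crawl outputs
# by URL. Field names ("url", "found_urls") are assumptions, not the real schema.
def link_records(articles, tweets):
    """Return (article_id, tweet_id) pairs where one record cites the other."""
    links = []
    article_urls = {a["url"]: aid for aid, a in articles.items()}
    tweet_urls = {t["url"]: tid for tid, t in tweets.items()}
    for tid, tweet in tweets.items():
        # Tweet cites a URL-article.
        for url in tweet.get("found_urls", []):
            if url in article_urls:
                links.append((article_urls[url], tid))
    for aid, article in articles.items():
        # URL-article cites a tweet.
        for url in article.get("found_urls", []):
            if url in tweet_urls:
                links.append((aid, tweet_urls[url]))
    return links
```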
Action Items
Shengsong and Alejandro will meet to discuss postprocessor and its goals
Shengsong will develop reader for scope to start Twitter crawl
Colin will commit CSV processing to the Mediacat backend repo
Backburner
updating postprocessor category names and adding the "title" of the URL-article
re-do small domain crawl
finish documenting where different data are stored on our server