February 10, 2022

Agenda

  • apologies for lack of action items!
  • Shengsong will document how to resize tmp, how to recreate an instance from backup, and, over time, what data is stored in which instance.
  • documenting Puppeteer
  • Twitter API crawler
  • Colin's last day -- best of luck to you!

Documentation

  • almost done
  • JupyterLab documentation is already in the wiki; CSV processing documentation will go to the MediaCAT backend repo

Documenting Puppeteer

  • for next week
  • we will document it when we start a new crawl

Twitter API Crawler

  • Colin finished authentication and fetching tweets
  • easy part: getting the scope; hard part: how to format the output for CSV processing, and whether to use the output.json format or to output CSV directly from the Twitter crawler
    • probably need to use the current format so that the postprocessor can output the data relevant to researchers
    • Pandas via JupyterLab can both convert the data to CSV and manipulate it (see the sketch after this list)
    • currently the API crawler is structured to produce a format like our JSON output, returned as a dictionary
    • if we only want a quick visualization, output straight from the API is fine, but otherwise we need a format the postprocessor can read
  • the last thing needed is a way to read the scope, and then we can start the crawl (Shengsong)
  • need to document the crawler and commit it to a repo
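
A minimal sketch of the CSV conversion idea discussed above, assuming the crawler returns a dictionary keyed by tweet ID that roughly mirrors our output.json; all field names here are hypothetical and would need to match the real crawler output:

```python
import json
import pandas as pd

# Load the crawler's JSON output (assumed filename); a dict keyed by tweet id.
with open("twitter_crawl_output.json") as f:
    tweets = json.load(f)

# Turn the dict-of-dicts into a DataFrame, one row per tweet.
df = pd.DataFrame.from_dict(tweets, orient="index")

# Keep only the columns the postprocessor / researchers care about (assumed names).
columns = ["url", "date", "author", "text", "citation_urls"]
df = df[[c for c in columns if c in df.columns]]

# Write the CSV that downstream processing would read.
df.to_csv("twitter_crawl_output.csv", index_label="id")
```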

postprocessor - make 2?

  • Shengsong has been reading through postprocessor
  • Nat: as much as possible, keep the standard postprocessing output format; that would enable easier analysis, and we can always add a wrapper later to bind the two together
  • Colin: agreed; JupyterLab analysis is easier if both have the same output format
  • open problem with the postprocessor: what about linkages between the two data sets, i.e. when a URL-article is cited in a tweet or a tweet in a URL-article (a rough sketch follows this list)
  • need to think this question through more
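
As a rough illustration of the linkage question above, a wrapper (not yet designed; all filenames and column names are hypothetical) could cross-reference the two postprocessor outputs by URL:

```python
import pandas as pd

# Hypothetical postprocessor outputs; real column names may differ.
articles = pd.read_csv("domain_output.csv")   # assumed columns: url, found_urls, ...
tweets = pd.read_csv("twitter_output.csv")    # assumed columns: tweet_url, found_urls, ...

# Tweets whose cited URLs include a crawled article.
tweets_citing_articles = tweets[
    tweets["found_urls"].fillna("").apply(
        lambda urls: any(u in urls for u in articles["url"])
    )
]

# Articles whose cited URLs include a crawled tweet.
articles_citing_tweets = articles[
    articles["found_urls"].fillna("").apply(
        lambda urls: any(t in urls for t in tweets["tweet_url"])
    )
]
```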

Action Items

  • Shengsong and Alejandro will meet to discuss postprocessor and its goals
  • Shengsong will develop a reader for the scope to start the Twitter crawl
  • Colin will commit the CSV processing to the MediaCAT backend repo

Backburner

  • updating postprocessor category names and adding the "title" of the URL-article
  • re-do small domain crawl
  • finish documenting where different data are on our server