March 10, 2022

Agenda

  • finalize Twitter API crawl
  • need a list of the JSON files output by the crawler, with their keys, and other documentation as mentioned in the padlet
  • get started on the Puppeteer update: target is 2.2 and we are on 1.5; should take 2-3 days
  • once Readability is stripped from the domain crawler and the crawler is updated, run a small domain crawl
    • Alejandro will provide URLs for 5 smaller domains
  • move plain text extraction to the postprocessor (see the postprocessor update below)

Twitter API crawl

  • difficulty checking results due to large files
    • automate splitting CSV output into files of 1 million rows for error checking
    • write a script that either keeps the Twitter API output as a single file, or breaks it into multiple output files of at most 1 million tweets each (so they can be opened in Excel); see the splitting sketch after this list
  • URL extender
    • look into the extender that John developed and think about how it should be used: should it be added to the postprocessor, or kept as a separate script? (see the redirect-following sketch after this list)
  • crawl with all options (geolocation, etc.)
    • we are able to crawl with all the public metrics, including the geo and withheld fields; see the request sketch below
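
A minimal sketch of the splitting script, assuming the crawl output is already one large CSV; the file paths and the single-file flag are hypothetical placeholders:

```python
import csv

MAX_ROWS = 1_000_000  # stays under Excel's 1,048,576-row sheet limit

def split_csv(src_path, dest_prefix, single_file=False):
    """Copy the crawl CSV to one file, or to numbered parts of MAX_ROWS rows each."""
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        part, rows_in_part, out, writer = 0, 0, None, None
        for row in reader:
            if writer is None or (not single_file and rows_in_part >= MAX_ROWS):
                if out is not None:
                    out.close()
                part += 1
                suffix = "" if single_file else f"_part{part:03d}"
                out = open(f"{dest_prefix}{suffix}.csv", "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)  # repeat the header in every part
                rows_in_part = 0
            writer.writerow(row)
            rows_in_part += 1
        if out is not None:
            out.close()

# e.g. split_csv("crawl_output.csv", "crawl_output")
#   -> crawl_output_part001.csv, crawl_output_part002.csv, ...
```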
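John's extender itself isn't in these notes; purely to illustrate the idea, here is a stand-alone sketch that expands shortened links (e.g. t.co wrappers) by following redirects with requests:

```python
import requests

def expand_url(short_url, timeout=10.0):
    """Follow HTTP redirects and return the final destination URL."""
    try:
        resp = requests.head(short_url, allow_redirects=True, timeout=timeout)
        return resp.url
    except requests.RequestException:
        return short_url  # leave unresolvable URLs as-is
```

Whether this lives in the postprocessor or stays a separate script is the open question above; written as a pure function, it could be called from either.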
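For reference, requesting those optional field groups from the Twitter API v2 user-timeline endpoint looks roughly like the sketch below; the environment variable and user id are placeholders, not the project's actual crawler code:

```python
import os
import requests

BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]  # placeholder credential
USER_ID = "2244994945"  # placeholder user id

def fetch_timeline_page(pagination_token=None):
    """Fetch one page of a user timeline with the extra field groups enabled."""
    params = {
        "max_results": 100,
        "tweet.fields": "created_at,public_metrics,geo,withheld",
    }
    if pagination_token:
        params["pagination_token"] = pagination_token
    resp = requests.get(
        f"https://api.twitter.com/2/users/{USER_ID}/tweets",
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        params=params,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```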

postprocessor update

  • test Twitter API output with a small file of around 60,000 tweets
  • plain text extraction has been moved to the postprocessor:
    • looking over results; waiting on the RA (see the extraction sketch below)
  • next step: make the postprocessor handle Twitter API output
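
A minimal sketch of the extraction step, assuming each crawler JSON record stores the raw page HTML under a key such as html_content (that key name is an assumption) and using BeautifulSoup as a stand-in for whatever extractor the postprocessor actually adopts:

```python
import json
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_plain_text(record_path):
    """Load one crawler JSON record and flatten its stored HTML to plain text."""
    with open(record_path, encoding="utf-8") as f:
        record = json.load(f)
    soup = BeautifulSoup(record["html_content"], "html.parser")  # assumed key
    for tag in soup(["script", "style", "noscript"]):  # drop non-content markup
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)
```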

Padlet:

  • need a list of the JSON files output by the crawler, with their keys, and other documentation as mentioned in the padlet
    • documentation completed, but couldn't edit the padlet; ask Kirsta

Puppeteer update -- priority:

  • looking at a deprecation issue: the crawler still uses a method from before 1.0
    • pre-hook: checks and filters URLs before crawling; it has been updated quite a bit, so the migration is somewhat complicated
  • so far no other issues

Action Items

  • decide on a new policy for the URL extender
  • finalize the Puppeteer update
  • finalize crawl of timelines for KPP/MediaCAT: 60,001+ tweets
  • make the postprocessor able to read Twitter API output
  • Alejandro/RA to check the plain text output from the postprocessor extraction

Backburner

  • small domain crawl
  • benchmarking
  • finish documenting where different data are on our server
  • language detection function
  • image_reference function
  • dealing with embedded versus cited tweets