July 12, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • finalize testing of the new postprocessor and merge to master if it works
  • start NYT Politics Archive postprocessing if postprocessor is done
  • continue learning D3 for edge-node
  • start new crawl with the Twitter accounts Alejandro will send
  • meet with Alejandro to finalize looking at datasets
  • Twitter: embedded tweet issue
  • to discuss next meeting:
    • how to cut a release
    • writing a paper about MediaCAT and architecture

Postprocessor

  • testing of the new postprocessor is complete; the code is in the repo "postprocessor"
  • on GitHub, the old postprocessor was in the repo "mediacat_backend"; it will now be moved to a new repo with a note that it is no longer in use
    • the old backend has a few utils that are now available directly in the new crawler, e.g. getting all URLs
  • started NYT Politics Archive postprocessing
  • low input - for domain crawls: the plain text is very long, so the input must first be converted to pandas and then to Dask for processing (Dask gives errors otherwise)
    • Shengsong will consult with Nat about this issue
    • it shouldn't add a lot of processing time
    • Shengsong will complete documentation
  • WaPo/FoxNews Twitter crawl: URL expander is running
    • can add the other tweets when the crawl is finalized after July 19

Crawls

  • Twitter: the WaPo/FoxNews Twitter crawl reached the 10 million per month maximum, because each embedded or replied-to tweet is also counted toward the quota
    • re-start twitter crawl on July 19 when the quota refreshes
  • small domain: still going, 1.1 million
  • The Guardian: 1.2 million
    • added blacklist for comments section
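
A blacklist like the one added for the comments section can be sketched as a simple URL filter. The patterns below are hypothetical placeholders, not the ones actually configured for the Guardian crawl, and the function names are made up for illustration.

```python
import re

# Hypothetical blacklist patterns; the real crawl configuration may differ.
BLACKLIST_PATTERNS = [
    re.compile(r"/discussion/"),   # assumed path segment for comment threads
    re.compile(r"#comments?$"),    # assumed fragment pointing at a comments block
]

def is_blacklisted(url: str) -> bool:
    """Return True if the URL matches any blacklist pattern."""
    return any(pattern.search(url) for pattern in BLACKLIST_PATTERNS)

def filter_urls(urls):
    """Keep only the URLs that are not blacklisted, preserving order."""
    return [url for url in urls if not is_blacklisted(url)]
```

For example, `filter_urls` would drop `https://example.com/discussion/p/abc` and keep `https://example.com/world/story`.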

Twitter embed

  • not yet

Visualizations

  • D3 visualizations: in testing

Action Items

  • document the new postprocessor input order: convert to pandas first, then to Dask
    • next week consult with Nat about this
  • on Jul 19: resume the WaPo/FoxNews Twitter crawl
  • Alejandro will send scope for Israeli and Palestinian news domains
  • Twitter embedding issue - this week

Backburner

  • Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
  • using crawler proxies
  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now.
  • what to do with htz.li
  • finding language function
  • image_reference function
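
The two backburner additions to the postprocessor output (non-scope hyperlinks ending in .co.il, and links to a tweet or Twitter handle) could be detected along these lines. This is a rough sketch, not the postprocessor's code: `classify_link`, the label strings, and the scope set are hypothetical, and the handle pattern is a simplification that would also match reserved Twitter paths such as `/home`.

```python
import re
from urllib.parse import urlparse

TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
# Twitter handles are up to 15 characters: letters, digits, underscore.
TWEET_PATH = re.compile(r"^/[A-Za-z0-9_]{1,15}/status/\d+")
HANDLE_PATH = re.compile(r"^/[A-Za-z0-9_]{1,15}/?$")

def classify_link(url, scope_domains):
    """Return a label for links of interest, or None for everything else.

    scope_domains is a set of bare in-scope domains (no "www." prefix).
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    if host in TWITTER_HOSTS:
        if TWEET_PATH.match(parsed.path):
            return "tweet"
        if HANDLE_PATH.match(parsed.path):
            return "twitter_handle"
        return None

    # Strip a leading "www." so the host can be compared with the scope list.
    bare = host[4:] if host.startswith("www.") else host
    if bare.endswith(".co.il") and bare not in scope_domains:
        return "non_scope_co_il"
    return None
```

Subdomains of in-scope domains are not handled here; a real implementation would need to decide whether, say, a `news.` subdomain counts as in scope.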