July 12, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

finalize testing of the new postprocessor and merge to master if it works
start NYT Politics Archive postprocessing if the postprocessor is done
continue learning D3 for edge-node
start new crawl with the Twitter accounts Alejandro will send
meet with Alejandro to finalize looking at datasets
Twitter: embedded tweet issue
to discuss next meeting:
- how to cut a release
- writing a paper about MediaCAT and architecture

testing of the new postprocessor is complete; the code is in the GitHub repo "postprocessor"
the old postprocessor was in the repo "mediacat_backend" and will now be moved to a new repo with a note that it is no longer in use
- the old backend has a few utils that are now available directly with the new crawler, e.g. get all urls
started NYT politics Archive
low input - for domain crawls: the plain text is very lengthy, so the input has to be converted to pandas first and then to Dask in order to be processed (Dask gives errors otherwise)
- Shengsong will consult with Nat about this issue
- it shouldn't add a lot of processing time
- Shengsong will complete documentation
WaPo/FoxNews Twitter crawl: URL expander is running
- can add the other tweets when the crawl is finalized after July 19

Twitter: the WaPo/FoxNews Twitter crawl reached the 10 million per month maximum, because each embedded or replied tweet is also counted
- re-start twitter crawl on July 19 when the quota refreshes
small domain: still going, 1.1 million
The Guardian: 1.2 million
- added blacklist for comments section
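
A blacklist like the one added for the comments section can be sketched as a list of URL patterns checked before a link is queued. The patterns and function names below are hypothetical placeholders, not the ones used in the actual crawl; the real comment URLs would need to be checked per domain.

```python
import re

# Hypothetical patterns for comment pages; replace with the real ones per domain.
BLACKLIST_PATTERNS = [
    re.compile(r"/discussion/"),
    re.compile(r"#comments?$"),
]


def is_blacklisted(url, patterns=BLACKLIST_PATTERNS):
    """Return True if the URL matches any blacklist pattern."""
    return any(p.search(url) for p in patterns)


def filter_urls(urls):
    """Keep only the URLs that are not blacklisted."""
    return [u for u in urls if not is_blacklisted(u)]


kept = filter_urls([
    "https://example.com/world/2022/jul/12/some-article",
    "https://example.com/discussion/p/abc123",
    "https://example.com/world/2022/jul/12/some-article#comments",
])
```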

documentation on the new order of postprocessor input, with conversion to pandas before conversion to Dask
- next week consult with Nat about this
on Jul 19: resume the WaPo/FoxNews Twitter crawl
Alejandro will send scope for Israeli and Palestinian news domains
Twitter embedding issue - this week

Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
using crawler proxies
adding to regular postprocessor output:
1. any non-scope domain hyperlink that ends in .co.il
2. any link to a tweet or twitter handle
- This is a bit outside our normal functionality, so I will put it on the back burner for now.
what to do with htz.li
finding language function
image_reference function
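
The two proposed additions to the regular postprocessor output (non-scope `.co.il` hyperlinks, and links to tweets or Twitter handles) could be detected along the lines of the sketch below. It assumes that "ends in .co.il" refers to the link's hostname; the scope set, the Twitter host list, and all function names are hypothetical and not part of the existing postprocessor.

```python
from urllib.parse import urlparse

# Hypothetical scope; the real scope comes from the crawl configuration.
IN_SCOPE_DOMAINS = {"example.co.il"}
# Assumed set of hostnames that count as Twitter links.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}


def hostname(url):
    """Lower-cased hostname of a URL, or an empty string if there is none."""
    return (urlparse(url).hostname or "").lower()


def is_out_of_scope_co_il(url, in_scope=IN_SCOPE_DOMAINS):
    """True for links whose host ends in .co.il and is not an in-scope domain."""
    host = hostname(url)
    if host.startswith("www."):
        host = host[4:]
    if not host.endswith(".co.il"):
        return False
    return not any(host == d or host.endswith("." + d) for d in in_scope)


def is_twitter_link(url):
    """True for links to a tweet or a Twitter handle."""
    return hostname(url) in TWITTER_HOSTS


def extra_links(urls):
    """Collect the links that would be added to the regular output."""
    return [u for u in urls if is_out_of_scope_co_il(u) or is_twitter_link(u)]
```

For example, `extra_links` would keep `https://other.co.il/story` and `https://twitter.com/some_handle` but drop `https://www.example.co.il/a`, because that domain is in the assumed scope.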