July 12, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
finalize testing of the new postprocessor and merge to master if it works
start NYT Politics Archive postprocessing if postprocessor is done
continue learning D3 for edge-node
start new crawl with twitter accounts Alejandro will send
meet with Alejandro to finalize looking at datasets
twitter: embedded tweet issue
to discuss next meeting:
how to cut a release
writing a paper about MediaCAT and architecture
Postprocessor
testing of the new postprocessor is complete; the code is in the repo "postprocessor"
in GitHub, the old postprocessor was in the repo "mediacat_backend"; it will now be moved to a new repo with a note that it is no longer in use
the old backend has a few utils that are now available directly in the new crawler, e.g. get all urls
started NYT politics Archive
low input - for domain crawls: the records contain very lengthy plain text, so the input first has to be loaded into pandas and then converted to Dask in order to be processed (Dask gives errors otherwise)
Shengsong will consult with Nat about this issue
it shouldn't add a lot of processing time
Shengsong will complete documentation
WaPo/Foxnews twitter crawl: url expander is running
can add the other tweets when the crawl is finalized after July 19
Crawls
Twitter: the WaPo/FoxNews twitter crawl reached the 10 million per month maximum, because each embedded or replied-to tweet is also counted toward the quota
re-start twitter crawl on July 19 when the quota refreshes
small domain: still going, 1.1 million
The Guardian: 1.2 million
added blacklist for comments section
twitter embed
not yet
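As a rough illustration of the comments-section blacklist mentioned for The Guardian crawl, a pattern-based URL filter might be sketched as follows. The patterns and the names `COMMENT_BLACKLIST`, `is_blacklisted` and `filter_links` are hypothetical and are not the actual blacklist entries used in the crawl.

```python
import re

# Hypothetical patterns for comment-section URLs (assumptions for this sketch).
COMMENT_BLACKLIST = [
    re.compile(r"/discussion/"),
    re.compile(r"[#?&]comments?\b"),
]


def is_blacklisted(url, patterns=COMMENT_BLACKLIST):
    """Return True if the URL matches any blacklist pattern."""
    return any(p.search(url) for p in patterns)


def filter_links(urls, patterns=COMMENT_BLACKLIST):
    """Keep only the URLs that the crawler should still visit."""
    return [u for u in urls if not is_blacklisted(u, patterns)]


links = [
    "https://www.theguardian.com/world/2022/jul/12/some-article",
    "https://www.theguardian.com/world/2022/jul/12/some-article#comments",
    "https://www.theguardian.com/discussion/p/abc",
]
kept = filter_links(links)
```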
Visualizations
D3 visualizations: in testing
Action Items
documentation on the new order of postprocessor input with conversion to pandas before conversion to DASK
next week consult with Nat about this
on Jul 19: resume the WaPo/Foxnews twitter crawl
Alejandro will send scope for Israeli and Palestinian news domains
Twitter embedding issue - this week
Backburner
Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
using crawler proxies
adding to regular postprocessor output:
any non-scope domain hyperlink that ends in .co.il
any link to a tweet or twitter handle
This is a bit outside our normal functionality, so I will put it on the backburner for now.
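If the two backburner additions to the postprocessor output are picked up later, the link check could be sketched along these lines. This is only a sketch under assumptions: `classify_link`, `TWITTER_HOSTS`, the label strings and the way scope domains are passed in are all made up for the example, and the handle check is deliberately simple (any single-segment twitter.com path counts as a handle).

```python
from urllib.parse import urlparse

# Hostnames treated as Twitter in this sketch (an assumption, not a complete list).
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}


def classify_link(url, scope_domains):
    """Return a label for links of interest, or None for everything else.

    scope_domains: set of bare in-scope domains, e.g. {"inscope.co.il"}.
    Matching against scope is exact, which is a simplification.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    if host in TWITTER_HOSTS:
        parts = [p for p in parsed.path.split("/") if p]
        # /<handle>/status/<id> is a link to a single tweet.
        if len(parts) >= 3 and parts[1] == "status":
            return "tweet"
        # /<handle> alone is treated as a link to a twitter handle.
        if len(parts) == 1:
            return "twitter_handle"
        return None

    bare = host[4:] if host.startswith("www.") else host
    # Any non-scope domain that ends in .co.il is flagged.
    if bare.endswith(".co.il") and bare not in scope_domains:
        return "co_il_out_of_scope"
    return None


scope = {"inscope.co.il"}
labels = [
    classify_link("https://twitter.com/washingtonpost/status/1234567890", scope),
    classify_link("https://twitter.com/FoxNews", scope),
    classify_link("https://news.example.co.il/story/1", scope),
    classify_link("https://www.inscope.co.il/story/2", scope),
    classify_link("https://example.com/page", scope),
]
```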