June 23, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda:
- the guardian crawl: filter out comments urls
- NYT Mid E archive: test on new postprocessor
- postprocessor: adding twitter counts to data structure
- update metascraper to include a db, to deal with errors and with restarting after being stopped
- visualizations: figure out jupyter
- Alejandro: need more examples of embedded tweet issue, and send list of visualizations
Crawls Update:
- small domain: same, pause
- the guardian: same, pause
- NYT Politics Archive postprocessing: not yet started
Postprocessor:
- Twitter: embedded tweet issue:
- NYT Mid E archive: test on new postprocessor
- KPP postprocessor: missing a few thousand, maybe small code error, troubleshooting
- postprocessor: adding twitter counts to data structure
- update metascraper to include a db, to deal with errors and with restarting after being stopped
- Shengsong will work on this in Arbutus while we wait for Graham to return
- tried adding multithreading to the postprocessor, but it ended up slower -- why? (Dask is supported)
- Nat will look at the video of the postprocessor changes
- the Dask method for looping through the data was not very fast; a plain for-loop is faster
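
One possible shape for the "db to deal with errors and restarting" item, as a minimal Python sketch. It is not the metascraper code: the table name `url_state`, the status values, and the `scrape` callback are all hypothetical, and SQLite is only an assumed choice of db. The idea is to record a status per URL so that a stopped run can be restarted and only unfinished or failed URLs are retried.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open (or create) the state db. Table and column names are hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS url_state ("
        "url TEXT PRIMARY KEY, status TEXT NOT NULL, error TEXT)"
    )
    return conn


def add_urls(conn, urls):
    """Register URLs as pending; URLs already in the db keep their status."""
    conn.executemany(
        "INSERT OR IGNORE INTO url_state (url, status) VALUES (?, 'pending')",
        [(u,) for u in urls],
    )
    conn.commit()


def run(conn, scrape):
    """Process every URL that is not done yet, recording errors instead of stopping."""
    rows = conn.execute(
        "SELECT url FROM url_state WHERE status != 'done' ORDER BY rowid"
    ).fetchall()
    for (url,) in rows:
        try:
            scrape(url)  # hypothetical callback that does the actual scraping
        except Exception as exc:
            conn.execute(
                "UPDATE url_state SET status = 'failed', error = ? WHERE url = ?",
                (str(exc), url),
            )
        else:
            conn.execute(
                "UPDATE url_state SET status = 'done', error = NULL WHERE url = ?",
                (url,),
            )
        conn.commit()  # commit per URL so progress survives a stop


def counts(conn):
    """Return a {status: count} summary of the state table."""
    return dict(conn.execute("SELECT status, COUNT(*) FROM url_state GROUP BY status"))
```

Calling `run` again after a crash or a manual stop picks up only the rows whose status is still `pending` or `failed`; passing a file path to `open_state` instead of `:memory:` is what makes the state persist between runs.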
Visualizations
- making jupyter notebook work
- some of the earlier work from Alice could be helpful here, by domain and url for vector diagrams
- Shengsong will send the stacked area charts and one way vector diagrams
- can use D3 for visualizations; it can run on a local server
Action Items
- send the stacked area charts and one way vector diagrams from KPP data
- troubleshooting errors on postprocessor discovered with KPP testing
- update metascraper to include a db, to deal with errors and with restarting after being stopped - use Arbutus
- Dask multithreading for postprocessor - troubleshooting why it is slower
- consider D3 or similar for visualizing vector diagrams
- Twitter: embedded tweet issue:
- when Graham is back: the guardian crawl: filter out comments urls
- for next week: consider Borealis to store datasets
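
For the "filter out comments urls" item, a minimal Python sketch of what such a filter could look like. The patterns are assumptions rather than a confirmed description of the guardian's URL scheme: a `discussion.` host, a path starting with `/discussion/`, or a fragment starting with `comment` are treated as comment URLs here, and the function names are hypothetical. The real patterns should be taken from the crawl output.

```python
from urllib.parse import urlsplit


def is_comment_url(url):
    """Guess whether a URL points at a comment page (patterns are assumptions)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("discussion."):
        return True
    if parts.path.startswith("/discussion/"):
        return True
    # e.g. an article URL ending in "#comments"
    return parts.fragment.startswith("comment")


def filter_out_comments(urls):
    """Keep only the URLs that do not look like comment pages."""
    return [u for u in urls if not is_comment_url(u)]
```

The same predicate could be applied either before a URL is queued in the crawl or afterwards when cleaning the crawl output.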
Backburner
- Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
- using crawler proxies
- adding to regular postprocessor output:
  - any non-scope domain hyperlink that ends in .co.il
  - any link to a tweet or twitter handle
  - This is a bit outside our normal functionality, so I will put it on the backburner for now.
- what to do with htz.li
- finding language function
- image_reference function
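
If the extra postprocessor output for .co.il links and tweet or twitter handle links is picked up later, the check could look roughly like the Python sketch below. Everything named here is hypothetical (`classify_link`, the label strings, the `RESERVED` list, which is certainly incomplete), and it assumes that "ends in .co.il" refers to the link's domain and that scope domains are given as bare hostnames.

```python
import re
from urllib.parse import urlsplit

TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
# First path segments that are site pages rather than handles (incomplete list).
RESERVED = {"home", "search", "hashtag", "intent", "share", "i", "explore"}
HANDLE_RE = re.compile(r"^[A-Za-z0-9_]{1,15}$")


def in_scope(host, scope_domains):
    """True if host is a scope domain or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in scope_domains)


def classify_link(url, scope_domains):
    """Return 'tweet', 'twitter_handle', 'co_il_out_of_scope', or None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    segments = [s for s in parts.path.split("/") if s]
    if host in TWITTER_HOSTS:
        if segments:
            first = segments[0]
            if first.lower() not in RESERVED and HANDLE_RE.match(first):
                # /<handle>/status/<numeric id> is treated as a tweet link
                if len(segments) >= 3 and segments[1] == "status" and segments[2].isdigit():
                    return "tweet"
                if len(segments) == 1:
                    return "twitter_handle"
        return None
    if host.endswith(".co.il") and not in_scope(host, scope_domains):
        return "co_il_out_of_scope"
    return None
```

For example, with `scope_domains = {"haaretz.co.il"}`, a link to `https://www.ynet.co.il/...` would be labelled `co_il_out_of_scope`, while a link to `https://www.haaretz.co.il/...` would return `None`.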