June 28, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- send the stacked area charts and one-way vector diagrams from KPP data
- troubleshooting errors in the postprocessor discovered with KPP testing
- update metascraper to include a db for handling errors and for restarting after being stopped - use Arbutus
- dask multithreading for postprocessor - troubleshooting why it is slower
- consider D3 or similar for visualizing vector diagrams
- Twitter: embedded tweet issue:
- when Graham is back online:
- the Guardian crawl: filter out comment URLs
- for next week: consider borealis to store datasets
Visualizations
- stacked area charts and vector diagrams?
- making jupyter notebook work
- some of the earlier work from Alice could be helpful here, by domain and URL for vector diagrams
- can use D3 for visualizations; it can run on a local server
Postprocessor:
- update metascraper to include a db for handling errors and for restarting after being stopped - use Arbutus
- seems to be working, and Shengsong will test on Graham when it is back online
- dask multithreading for postprocessor - troubleshooting why it is slower
- not tested yet, needs a large dataset
- errors with new postprocessor
- there seem to be errors involving capital letters, and possibly a problem parsing the citation scope
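The db-backed restart idea for metascraper can be sketched roughly as below. This is a minimal illustration in Python with SQLite, not the actual metascraper change: the table layout and the names `open_store`, `pending`, and `process_all` are all hypothetical, and `scrape` stands in for whatever function does the real work on one URL.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the status table. Pass a file path to persist across runs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        "url TEXT PRIMARY KEY, status TEXT NOT NULL, error TEXT)"
    )
    conn.commit()
    return conn


def pending(conn, urls):
    """Return the URLs that are not yet marked 'done' (new or previously failed)."""
    done = {row[0] for row in conn.execute(
        "SELECT url FROM urls WHERE status = 'done'")}
    return [u for u in urls if u not in done]


def process_all(conn, urls, scrape):
    """Run scrape(url) on every pending URL, recording the outcome after each one."""
    for url in pending(conn, urls):
        try:
            scrape(url)
            status, error = "done", None
        except Exception as exc:  # keep going; the error is stored for later review
            status, error = "error", str(exc)
        conn.execute(
            "INSERT OR REPLACE INTO urls (url, status, error) VALUES (?, ?, ?)",
            (url, status, error),
        )
        conn.commit()  # commit per URL so a stop loses at most the current item
```

Because the outcome is committed after every URL, running `process_all` again after a stop or crash skips everything already marked done and retries only failed or unseen URLs.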
crawls:
- assuming they are on hold
storage:
- make a private repo on GitHub and use it to store our datasets
- Alejandro will make a spreadsheet listing the crawls, with information on each
action items
- visualizations: try D3 or another library for better visualizations
- postprocessor:
- test changes to metascraper
- test changes with dask multithreading
- finalize troubleshooting of postprocessor differences on KPP data (capital letters, scope issue, etc.)
- make a private repo on GitHub and use it to store our datasets
- Alejandro will make a spreadsheet listing the crawls, with information on each
- Twitter: embedded tweet issue:
Backburner
- Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
- using crawler proxies
- adding to regular postprocessor output:
- any non-scope domain hyperlink that ends in .co.il
- any link to a tweet or twitter handle
- This is a bit outside our normal functionality, so I will put it on the backburner for now.
- what to do with htz.li
- finding language function
- image_reference function
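The backburner item about adding .co.il and Twitter links to the regular postprocessor output could look something like the sketch below. It is only an illustration under stated assumptions: `is_extra_link`, `host_of`, and `TWITTER_HOSTS` are hypothetical names, the scope is assumed to be a set of lowercase domain names, and any link on a Twitter host is treated as a tweet or handle link.

```python
from urllib.parse import urlparse

# Assumed list of hosts that count as Twitter links.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}


def host_of(url):
    """Return the lowercase hostname of a URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


def is_extra_link(url, scope_domains):
    """True for Twitter links and for non-scope hyperlinks whose domain ends in .co.il."""
    host = host_of(url)
    if not host:
        return False  # relative or malformed link
    if host in TWITTER_HOSTS:
        return True
    # A host is in scope if it equals a scope domain or is a subdomain of one.
    in_scope = any(host == d or host.endswith("." + d) for d in scope_domains)
    return host.endswith(".co.il") and not in_scope
```

Checking the parsed hostname rather than the raw string avoids matching URLs that merely contain ".co.il" somewhere in the path.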