June 16, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda:
- crawl/postprocessor updates
- documentation and new repo for new postprocessor
- adding twitter counts to data structure
- Twitter: embedded tweet issue
- testing the new postprocessor on KPP, old NYT, and new NYT data to check for discrepancies
- producing visualizations with KPP data
crawl/postprocessor updates
- small domain crawler still running
- NYT politics archive postprocessing: still ongoing as of yesterday, about halfway through
- theguardian crawl: going fine; some blocks due to "comments" URLs, which Shengsong will try to filter out through pre-navigation and regex
- NYT & KPP postprocessing:
  - KPP: ran on about 200,000 and got good results with no issues; checked against earlier results and they were the same
  - NYT Mid E archive: not tried yet
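The regex part of the "comments" URL filter for theguardian crawl could look roughly like the sketch below. The notes do not record what the blocked URLs look like, so the patterns here (a `/discussion/` path segment or a `#comments` fragment) are assumptions, and `is_comment_url` and `filter_crawl_queue` are hypothetical names. The crawler's pre-navigation step is not shown; this only illustrates the matching logic in Python.

```python
import re

# ASSUMPTION: these patterns are guesses at what a "comments" URL looks like;
# adjust them once real examples of the blocked URLs are collected.
COMMENT_URL_RE = re.compile(r"(/discussion/|[#?&]comments?\b)", re.IGNORECASE)


def is_comment_url(url: str) -> bool:
    """Return True if the URL looks like a comments page rather than an article."""
    return COMMENT_URL_RE.search(url) is not None


def filter_crawl_queue(urls):
    """Drop comment URLs from a list of candidate URLs (hypothetical helper)."""
    return [u for u in urls if not is_comment_url(u)]
```

The same pattern would need to be ported to wherever the crawler decides whether to navigate to a URL.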
new postprocessor documentation and repo
- metascraper updates:
  - if metascraper hits errors, there is currently no way to know what they are
  - if the server stops, there is no way to know where we were, so we need to restart from the beginning rather than from where it stopped
  - proposed solution: use a db to record which data has been finished, so a restart knows where to continue; could use pandas to store it
  - unlikely to have much effect on the speed of the metascraper
- adding twitter counts to data structure
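The checkpointing idea proposed for the metascraper (record what has finished, record errors, resume from where it stopped) can be sketched as follows. This is a minimal illustration, not the project's implementation: it uses Python's built-in `sqlite3` rather than pandas, and the table name `processed`, the functions `open_checkpoint` and `run_with_checkpoint`, and the `scrape` callback are all hypothetical.

```python
import sqlite3


def open_checkpoint(path=":memory:"):
    """Open (or create) the checkpoint db. Pass a file path to survive restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed ("
        "url TEXT PRIMARY KEY, status TEXT NOT NULL, error TEXT)"
    )
    return conn


def run_with_checkpoint(conn, urls, scrape):
    """Call scrape(url) for every URL not already finished successfully.

    `scrape` is a stand-in for the real metascraper call (hypothetical).
    Errors are stored in the db instead of being lost, and failed URLs
    are retried on the next run.
    """
    done = {
        row[0]
        for row in conn.execute("SELECT url FROM processed WHERE status = 'ok'")
    }
    for url in urls:
        if url in done:
            continue  # finished in an earlier run, skip it
        try:
            scrape(url)
            status, error = "ok", None
        except Exception as exc:
            status, error = "error", repr(exc)
        conn.execute(
            "INSERT OR REPLACE INTO processed (url, status, error) VALUES (?, ?, ?)",
            (url, status, error),
        )
        conn.commit()  # commit per URL so a crash loses at most one item
```

Committing after every URL costs one small write per item, which is in line with the expectation that the change should not affect speed much.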
embedded tweet issue
visualizations with KPP data
- stacked area diagram, also node vector
Action Items:
- theguardian crawl: filter out comments URLs
- NYT Mid E archive: test on new postprocessor
- postprocessor: adding twitter counts to data structure
- metascraper: add a db to deal with errors and with restarting after being stopped
- visualizations: figure out jupyter
- Alejandro: need more examples of embedded tweet issue, and send list of visualizations
Backburner
- Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
- using crawler proxies
- adding to regular postprocessor output:
  - any non-scope domain hyperlink that ends in .co.il
  - any link to a tweet or twitter handle
  - This is a bit outside our normal functionality, so I will put it on the backburner for now.
- what to do with htz.li
- finding language function
- image_reference function
- dealing with embedded versus cited tweets
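For the backburner item on extra postprocessor output (out-of-scope `.co.il` links, and links to tweets or twitter handles), a rough sketch of the link classification is given below. Everything here is illustrative: `classify_link`, the label strings, and the host list are hypothetical, and the sketch assumes `scope_domains` holds bare lowercase domains without a `www.` prefix.

```python
import re
from urllib.parse import urlparse

# ASSUMPTION: only these hostnames are treated as Twitter.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
# Twitter handles are up to 15 characters: letters, digits, underscore.
TWEET_PATH_RE = re.compile(r"^/[A-Za-z0-9_]{1,15}/status/\d+")
HANDLE_PATH_RE = re.compile(r"^/[A-Za-z0-9_]{1,15}/?$")


def classify_link(url, scope_domains):
    """Return 'tweet', 'twitter_handle', 'out_of_scope_co_il', or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in TWITTER_HOSTS:
        if TWEET_PATH_RE.match(parsed.path):
            return "tweet"
        # Note: reserved paths such as /home or /search also match here,
        # so a real version would need an exclusion list.
        if HANDLE_PATH_RE.match(parsed.path):
            return "twitter_handle"
        return None
    bare = host[4:] if host.startswith("www.") else host
    # Note: subdomains of an in-scope domain are not recognized as in scope.
    if bare.endswith(".co.il") and bare not in scope_domains:
        return "out_of_scope_co_il"
    return None
```

For example, `classify_link("https://twitter.com/someuser/status/1234567890", set())` returns `"tweet"`. This does not address the embedded versus cited tweet distinction, which depends on the page markup rather than the URL.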