July 5, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
visualizations: try D3 or another library for better visualizations
postprocessor:
test changes to metascraper
test changes with dask multithreading
finalize troubleshooting of postprocessor output differences on KPP data (capital letters, scope issue, etc.)
make a private repo on GitHub and use it to store our datasets
Alejandro will make a spreadsheet with a list of crawls and related information
Twitter: embedded tweet issue
Visualization
still looking through documentation of D3
Postprocessor:
fix postprocessor bugs connected to capital letters and to inconsistent scope definitions
for Twitter handles, never use capital letters when defining scope
dask multithreading: doesn't seem to work properly, not worth fiddling with it
metascraper: it's all working
after the KPP data question is resolved, we'll test on the NYT Middle East Archive search; if the postprocessor results aren't substantially different, the new postprocessor will be merged to master
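The capital-letter issue with Twitter handles noted above could be guarded against by normalizing handles before any scope comparison. The following is a minimal sketch, not the actual postprocessor code; the function names `normalize_handle`, `build_scope`, and `in_scope` are hypothetical and chosen only for illustration.

```python
def normalize_handle(handle: str) -> str:
    """Strip whitespace and a leading '@', then lowercase the handle."""
    return handle.strip().lstrip("@").casefold()


def build_scope(handles):
    """Build a set of normalized handles (hypothetical scope representation)."""
    return {normalize_handle(h) for h in handles}


def in_scope(handle: str, scope) -> bool:
    """Case-insensitive membership test against the normalized scope."""
    return normalize_handle(handle) in scope


# Scope entries written with mixed case still match lowercase mentions.
scope = build_scope(["@NYTimes", "guardian"])
```

Normalizing on both sides (when the scope is defined and when a mention is checked) means a stray capital letter in the scope file would no longer change the results.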
Crawls:
theguardian & small domain crawls now working again
theguardian: at about 1,000,000
small domain: 1,000,000
Alejandro will send twitter handles
storage of datasets:
started to move the datasets
also possible to run metascraper on old datasets
Action Items
finalize testing of the new postprocessor and merge to master if it works
start NYT Politics Archive postprocessing if postprocessor is done
continue learning D3 for edge-node
start new crawl with twitter accounts Alejandro will send
meet with Alejandro to finalize looking at datasets
twitter: embedded tweet issue
to discuss next meeting:
how to cut a release
writing a paper about MediaCAT and architecture
Backburner
Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
using crawler proxies
adding to regular postprocessor output:
any non-scope domain hyperlink that ends in .co.il
any link to a tweet or twitter handle
This is a bit outside our normal functionality, so I will put it on the backburner for now.
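If this is picked up later, the two extra output rules could be sketched roughly as below. This is an illustration under assumptions, not existing MediaCAT code: the function name `classify_link`, the `TWITTER_HOSTS` set, and the idea that scope is a set of lowercase bare domains are all hypothetical, and subdomains of in-scope domains are not handled.

```python
from urllib.parse import urlparse

# Assumed set of hostnames treated as Twitter links (hypothetical).
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}


def classify_link(url: str, scope_domains):
    """Return 'twitter', 'co.il', or None for a hyperlink.

    scope_domains is assumed to be a set of lowercase domains without 'www.'.
    """
    host = (urlparse(url).hostname or "").lower()
    if host in TWITTER_HOSTS:
        # Any link to a tweet or a Twitter handle.
        return "twitter"
    bare = host[4:] if host.startswith("www.") else host
    if bare.endswith(".co.il") and bare not in scope_domains:
        # Non-scope hyperlink whose domain ends in .co.il.
        return "co.il"
    return None
```

Links that return `None` would be handled by the regular postprocessor output as before.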