July 28, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
flagging issue -- any insights?
documentation on the new order of postprocessor input: conversion to pandas before conversion to Dask
next week consult with Nat about this
on Jul 19: resume the WaPo/Foxnews Twitter crawl
visualization
Alejandro will send scope for Israeli and Palestinian news domains
Twitter embedding issue - this week
to discuss next meeting:
how to cut a release
writing a paper about MediaCAT and architecture
text alias issue
problem with text alias
re-run postprocessing on KPP/MediaCAT data
Flagging
the URL expander seems to have set off alarm bells; the question is whether there's something we could do differently
information leakage flag: attempt to mimic one of their client's websites
very difficult to say what is getting us flagged
Python request library: acts like a crawler, trying to get the URL as best as possible, not using a headless browser
not a real person
how much slower with headless browser?
slightly faster than domain crawler
headless browser in Python: possible, but easier to re-use the headless browser we have
how long to develop headless browser URL expander?
probably a week
flag went up due to automated function
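A minimal sketch of the request-library approach discussed above, for comparison with a headless browser. It uses the standard library's urllib.request as a stand-in for whichever request library the expander actually uses; the function name expand_url, the opener parameter, and the User-Agent string are hypothetical. A request like this follows HTTP redirects but runs no JavaScript, so it behaves like a crawler rather than a browser.

```python
import urllib.request


def expand_url(short_url, opener=urllib.request.urlopen, timeout=10.0):
    """Follow HTTP redirects and return the final URL.

    Returns the input unchanged if the request fails or the URL is malformed.
    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        request = urllib.request.Request(
            short_url,
            # Hypothetical User-Agent, for illustration only.
            headers={"User-Agent": "url-expander-sketch/0.1"},
        )
        # urlopen follows 301/302/303/307 redirects by default;
        # geturl() reports the URL of the response that was finally served.
        with opener(request, timeout=timeout) as response:
            return response.geturl()
    except (OSError, ValueError):
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # ValueError covers URLs with no scheme.
        return short_url
```

On any network or parsing error the function returns its input, so URLs that could not be expanded can be identified and retried later.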
Twitter embedding issue
this is complicated: need to download a lot of tweets in order to look at the problem
need to get to it this week
documentation
file system documentation - to show where every file is
explaining Compute Canada - Shengsong updated: restart instance, transfer files, back up files, Compute Canada map, temp issue, how to completely rebuild an instance if something goes wrong, and setting up SSH
cutting a release
update documentation
in every repository, need same version number
need release policy or strategy
need automated environment to download everything
the versioning guidelines determine how long it will take
could release as different parts
document what is needed to run the entire thing
need to find the stale branches and remove them
can get a DOI - for each repo
we will evaluate next week, probably do early September
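One way the "same version number in every repository" requirement could be checked before a release, as a rough sketch. The function name and the repository names in the example are hypothetical, and how the version of each repository is read (tag, file, package metadata) is left open since the notes do not specify it.

```python
from collections import defaultdict


def find_version_mismatches(repo_versions):
    """Group repositories by version string.

    Returns an empty dict when every repository carries the same version,
    otherwise a dict mapping each distinct version to the repos that use it.
    A leading "v" (as in "v1.0.0") is ignored when comparing.
    """
    by_version = defaultdict(list)
    for repo, version in repo_versions.items():
        by_version[version.strip().lstrip("v")].append(repo)
    return dict(by_version) if len(by_version) > 1 else {}


# Hypothetical repository names and versions, for illustration only.
example = {"repo-a": "v1.0.0", "repo-b": "1.0.0", "repo-c": "0.9.2"}
mismatches = find_version_mismatches(example)
```

An empty result means the repositories agree; anything else lists which ones need to be bumped before the release is cut.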
Visualization
error for stacked area graph: was a simple label problem; correct graphs are now produced
D3 vector diagram: produces an HTML file, which is then interactive
when A is back in Toronto, Shengsong and A will record session about how to set up the environment
Crawls
WaPo/Foxnews: will be restarted today
postprocessing NYT archive politics: stopped, will be re-started, didn't lose what was done
small domain crawler: still running, 1.6 million
the Guardian, still running
Action Items
work on headless browser URL expander
Twitter embed issue
code cleanup on D3 vector diagram
re-do the KPP postprocessing
restart WaPo/Foxnews twitter crawl
restart the postprocessing of NYT politics archive
send new graphs for KPP data
Backburner
Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
using crawler proxies
adding to regular postprocessor output:
any non-scope domain hyperlink that ends in .co.il
any link to a tweet or twitter handle
This is a bit outside our normal functionality, so I will put it on the backburner for now.
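If this is picked up later, the two extra outputs could be collected along these lines. This is only a sketch: the function names, the shape of the input (a list of hyperlink strings plus a set of in-scope domains), and the list of Twitter hostnames are assumptions, not the postprocessor's actual interface.

```python
from urllib.parse import urlsplit

# Assumed set of hostnames that count as links to a tweet or Twitter handle.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}


def in_scope(host, scope_domains):
    """True if host is a scope domain or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in scope_domains)


def extra_links(hyperlinks, scope_domains):
    """Split out non-scope .co.il links and Twitter links from a link list."""
    co_il, twitter = [], []
    for link in hyperlinks:
        host = (urlsplit(link).hostname or "").lower()
        if host in TWITTER_HOSTS:
            twitter.append(link)
        elif host.endswith(".co.il") and not in_scope(host, scope_domains):
            co_il.append(link)
    return {"co_il": co_il, "twitter": twitter}


# Illustrative input only.
links = [
    "https://www.ynet.co.il/news/1",
    "https://www.haaretz.co.il/a",
    "https://twitter.com/user/status/123",
    "https://twitter.com/someone",
    "https://example.com/x",
]
result = extra_links(links, scope_domains={"haaretz.co.il"})
```

Matching on the parsed hostname rather than on the raw string avoids false hits from URLs that merely contain ".co.il" or "twitter.com" in their path or query.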