June 1, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

logistics
- communications
- hours
server and installation
- able to access current crawls?
- reading code?
- postprocessor
- other questions for Shawn
nytimes archive crawl issues
visualization environment - soon

Logistics

Server & Basic operations

access to Arbutus granted; Gy is able to log in, Francisco will confirm
Shawn didn't show how to set up an instance; he showed the different repositories and readmes

nytimes

nytimes archive crawl (Mid E/Israel/Palestinians) has two ranges of years (1979-1981, 2006-2011) with lower results; unclear whether this is a crawl error or a postprocessor error
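
One possible way to pin down which years are low, as a rough sketch only: compare each year's result count to the median across all years. The function name `flag_low_years`, the `threshold` parameter, and the input shape (a dict of year to result count) are assumptions for illustration, not part of the mediacat code.

```python
from statistics import median


def flag_low_years(counts_by_year, threshold=0.5):
    """Return the years whose count falls below threshold * median count.

    counts_by_year: dict mapping year (int) -> number of results (int).
    threshold: hypothetical cutoff; 0.5 means "less than half the median".
    """
    if not counts_by_year:
        return []
    typical = median(counts_by_year.values())
    # A year is flagged when it is well below the typical yearly count.
    return sorted(
        year for year, count in counts_by_year.items()
        if count < threshold * typical
    )
```

Running this once on per-year counts from the raw crawl output and once on counts from the postprocessor output could help show at which stage the drop appears.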

Action Items

do the two mandatory trainings
set up instance of mediacat domain crawler
- Gy to write email to Shawn to ask for a demo on setting up a crawl (domain and twitter) and on how to run the postprocessor, on Monday at 5:30pm
check on running crawls every 2-3 days - Gy
- figure out how to count total URLs crawled
look into the NYT archive crawl above to see if there are crawl errors - Gy
start looking at Shengsong's (Charles Xu) jupyterlab environment - Francisco
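
For the "count total URLs crawled" item, a minimal sketch, assuming the crawler writes one JSON file per crawled page with the page address stored under a `url` key. Both the file layout and the field name are assumptions to check against the actual crawl output; `count_crawled_urls` is a hypothetical helper, not an existing mediacat function.

```python
import json
from pathlib import Path


def count_crawled_urls(output_dir, url_field="url"):
    """Count distinct URLs found in JSON files under output_dir.

    Assumes (unverified) one JSON object per file, with the crawled
    address stored as a string under url_field.
    """
    seen = set()
    for path in Path(output_dir).rglob("*.json"):
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            # Skip files that are unreadable or only partially written.
            continue
        if isinstance(record, dict) and isinstance(record.get(url_field), str):
            seen.add(record[url_field])
    return len(seen)
```

Counting distinct URLs rather than files avoids double counting if the same page was saved more than once; comparing the number across the checks every 2-3 days would show whether a crawl is still making progress.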