January 13, 2022

Agenda

  • Move back to tickets?
  • Colin: setting up a Jupyter hub on our resources -- update
  • Shengsong: Compute Canada updates and mapping -- update
    • should the Compute Canada map be made private?
  • Shengsong: setting up an instance of the domain crawler for nytimes.com -- update
    • with benchmark for speed, can think about optimizing crawler
  • Shengsong: documentation for setting up the crawler was confusing, and there is an older version of the documentation (delete?)
  • Alejandro: list of commands needed first for jupyter
  • next steps for Shengsong:
    • looking to see if Twint issues are resolved and a crawl can be done
    • forensics on issues with al-monitor crawl (see link above)

List of commands for Jupyter (a pandas sketch of these operations follows the list):

  • organize by column
  • export to spreadsheet format
  • include the actual hyperlink or text alias found (see below)
  • filter results by certain values (e.g., Haaretz or another domain)
  • import the (modified) spreadsheet for visualization or other processing
  • visualization
  • some of these operations may be easier to perform after exporting to Excel or another application:
    • scroll through results
    • delete rows matching a particular regex (e.g., the landing pages from the al-monitor crawl)
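
A rough sketch of what these commands might look like in pandas; the file name results.csv and the columns url, citation, and text_alias are hypothetical placeholders, not the project's actual schema:

```python
import pandas as pd

# Load post-processed crawl results (hypothetical file and columns).
df = pd.read_csv("results.csv")

# Organize by column: sort rows by the citation column.
df = df.sort_values("citation")

# Filter by certain results, e.g. keep only rows citing Haaretz.
haaretz = df[df["citation"].str.contains("haaretz", case=False, na=False)]

# Export to spreadsheet format (CSV or Excel; Excel output needs openpyxl).
haaretz.to_csv("haaretz_citations.csv", index=False)
haaretz.to_excel("haaretz_citations.xlsx", index=False)

# Re-import a spreadsheet that was modified by hand (e.g., rows
# deleted in Excel) for visualization or further processing.
modified = pd.read_excel("haaretz_citations.xlsx")
```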

Jupyter hub on Compute Canada:

  • Colin has ported over the old script that converts post-processed results to a spreadsheet (CSV, Excel, or other formats) for export
  • Colin will also port over the visualization scripts
  • Colin produced a short document to connect to Jupyter
  • Colin also produced an Excel spreadsheet with all the relevant citations collected into a column called citations; this should also include the text alias
  • Colin will adapt the script to allow for either enumerating citations in distinct rows for a given URL-story (so that the URL-story appears in multiple rows of the results), or aggregating all citations into one cell for that URL-story (so that the URL-story appears in only one row); see the sketch below
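
A rough pandas illustration of the two layouts; the column names url and citations and the example values are hypothetical:

```python
import pandas as pd

# Hypothetical results: each URL-story carries a list of citations.
df = pd.DataFrame({
    "url": ["https://example.com/story-1"],
    "citations": [["https://www.haaretz.com/a", "https://www.nytimes.com/b"]],
})

# Layout 1: enumerate citations in distinct rows, so the URL-story
# appears once per citation.
per_row = df.explode("citations")

# Layout 2: aggregate all citations into a single cell, so the
# URL-story appears in only one row.
aggregated = per_row.groupby("url", as_index=False).agg(
    citations=("citations", "; ".join)
)
```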

Note on results:

  • some of the strange rows that Alejandro noted are probably due to large runs of whitespace within the "plain text" of the results, which are being interpreted as separate rows during the conversion process (a possible guard is sketched below)
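
One possible guard against this, assuming the converter splits rows on blank runs; this is a sketch, not the project's actual conversion code:

```python
import re

def collapse_blank_runs(text: str) -> str:
    # Collapse runs of two or more whitespace characters (including
    # blank lines) into a single space, so the spreadsheet conversion
    # does not read them as row breaks.
    return re.sub(r"\s{2,}", " ", text).strip()

# Example: the embedded blank run no longer produces an extra row.
print(collapse_blank_runs("first sentence.\n\n\n   second sentence."))
```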

Compute Canada:

  • Shengsong produced a map of resources
  • benchmark speed for the new domain crawler: 12,314 URLs in 24 hours (a quick throughput breakdown follows this list)
    • will see how it holds up over a few days
  • tried to run two crawlers on one instance -- this ends up using a lot of secondary resources
    • we aren't able to apportion the processes to the volumes; allocation is probably handled by something in the underlying server framework
    • Alejandro will ask Kirsta if Shengsong can meet with the systems person to rethink the volume allocations
    • Shengsong will attempt to set up two crawlers, one for nytimes.com and one for cnn.com, and see whether the benchmark speed is maintained
  • Shengsong will research updating OS and security groups
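
For reference, the benchmark figure above works out to roughly the following throughput (simple arithmetic only):

```python
# 12,314 URLs in 24 hours, from the benchmark above.
urls, hours = 12_314, 24
per_hour = urls / hours            # ~513 URLs per hour
per_minute = per_hour / 60         # ~8.6 URLs per minute
seconds_per_url = 3600 / per_hour  # ~7 seconds per URL
print(f"{per_hour:.0f} URLs/h, {per_minute:.1f} URLs/min, {seconds_per_url:.1f} s/URL")
```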

Twitter crawler:

  • Shengsong will try the Twitter crawler on Friday with the KPP/MediaCAT integrated scope to see if it is working, and will let Alejandro know by the end of the day whether it is returning results (a minimal smoke test is sketched below)
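
A minimal Twint smoke test of the kind that could confirm whether results come back; the target account and limit are arbitrary placeholders, and this assumes the crawler in question is Twint (mentioned in the agenda):

```python
import twint

# If Twint's upstream issues are resolved, this should print recent
# tweets; if it errors out or returns nothing, the crawl is still
# blocked. The username is a placeholder.
c = twint.Config()
c.Username = "nytimes"  # hypothetical target account
c.Limit = 20            # fetch a small batch only
twint.run.Search(c)
```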

Action Items:

  • Colin: finish porting the commands and visualizations for the Jupyter hub, and build out the commands listed above
  • Shengsong: continue with the benchmarking and looking at two crawlers simultaneously
  • Shengsong: research OS updates and security groupings for Compute Canada instances, and carry those out
  • Shengsong: hope to get a meeting with the DSU systems admin to talk about Compute Canada resources
  • Shengsong: try the Twint crawler to see if it is working again
  • Alejandro: write to Kirsta about a meeting for Shengsong