January 13, 2022

Agenda

  • Move back to tickets?
  • Colin: setting up a Jupyter hub on our resources -- update
  • Shengsong: Compute Canada updates and mapping -- update
    • should the Compute Canada map be made private?
  • Shengsong: setting up an instance of the domain crawler for nytimes.com -- update
    • with benchmark for speed, can think about optimizing crawler
  • Shengsong: documentation for setting up the crawler was confusing, and there is an older version of the documentation (delete?)
  • Alejandro: list of commands needed first for jupyter
  • next steps for Shengsong:
    • looking to see if Twint issues are resolved and a crawl can be done
    • forensics on issues with al-monitor crawl (see link above)

List of commands for Jupyter (a pandas sketch of these operations follows the list):

  • organize by column
  • export to spreadsheet format
  • include the actual hyperlink or text alias found (see below)
  • filter results by certain values (e.g., Haaretz or another domain)
  • import the (modified) spreadsheet for visualization or other processing
  • visualization
  • some of these operations may be easier to perform after exporting to Excel or another application:
    • scroll through results
    • delete rows matching a particular regex (e.g., the landing pages from the al-monitor crawl)
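
A rough sketch of what these commands might look like in pandas; the file name results.csv and the columns url, citation, and text_alias are hypothetical placeholders, not the project's actual schema:

```python
import pandas as pd

# Load post-processed crawl results (hypothetical file and columns).
df = pd.read_csv("results.csv")

# Organize by column: sort rows by the citation column.
df = df.sort_values("citation")

# Filter by certain results, e.g. keep only rows citing Haaretz.
haaretz = df[df["citation"].str.contains("haaretz", case=False, na=False)]

# Export to spreadsheet format (CSV or Excel; Excel output needs openpyxl).
haaretz.to_csv("haaretz_citations.csv", index=False)
haaretz.to_excel("haaretz_citations.xlsx", index=False)

# Re-import a spreadsheet that was modified by hand (e.g., rows
# deleted in Excel) for visualization or further processing.
modified = pd.read_excel("haaretz_citations.xlsx")
```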

Jupyter hub on Compute Canada:

  • Colin has ported over the old script that converts post-processed results to a spreadsheet (CSV, Excel, or other formats) for export
  • Colin will also port over the visualization scripts
  • Colin produced a short document to connect to Jupyter
  • Colin also produced an Excel spreadsheet with all the relevant citations collected into a column called citations; this should also include the text alias
  • Colin will adapt the script to allow for either enumerating citations in distinct rows for a given URL-story (so that the URL-story appears in multiple rows of the results), or aggregating all citations into one cell for that URL-story (so that the URL-story appears in only one row); see the sketch below
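
A rough pandas illustration of the two layouts; the column names url and citations and the example values are hypothetical:

```python
import pandas as pd

# Hypothetical results: each URL-story carries a list of citations.
df = pd.DataFrame({
    "url": ["https://example.com/story-1"],
    "citations": [["https://www.haaretz.com/a", "https://www.nytimes.com/b"]],
})

# Layout 1: enumerate citations in distinct rows, so the URL-story
# appears once per citation.
per_row = df.explode("citations")

# Layout 2: aggregate all citations into a single cell, so the
# URL-story appears in only one row.
aggregated = per_row.groupby("url", as_index=False).agg(
    citations=("citations", "; ".join)
)
```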

Note on results:

  • some of the strange rows that Alejandro noted are probably due to large runs of whitespace within the "plain text" of the results, which are being interpreted as separate rows during the conversion process (a possible guard is sketched below)
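
One possible guard against this, assuming the converter splits rows on blank runs; this is a sketch, not the project's actual conversion code:

```python
import re

def collapse_blank_runs(text: str) -> str:
    # Collapse runs of two or more whitespace characters (including
    # blank lines) into a single space, so the spreadsheet conversion
    # does not read them as row breaks.
    return re.sub(r"\s{2,}", " ", text).strip()

# Example: the embedded blank run no longer produces an extra row.
print(collapse_blank_runs("first sentence.\n\n\n   second sentence."))
```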

Compute Canada:

  • Shengsong produced a map of resources
  • benchmark speed for the new domain crawler: 12,314 URLs in 24 hours (a quick throughput breakdown follows this list)
    • will see how it holds up over a few days
  • tried to run two crawlers on one instance -- this ends up using a lot of secondary resources
    • we aren't able to apportion the processes to the volumes; allocation is probably handled by something in the underlying server framework
    • Alejandro will ask Kirsta if Shengsong can meet with the systems person to rethink the volume allocations
    • Shengsong will attempt to set up two crawlers, one for nytimes.com and one for cnn.com, and see whether the benchmark speed is maintained
  • Shengsong will research updating OS and security groups
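
For reference, the benchmark figure above works out to roughly the following throughput (simple arithmetic only):

```python
# 12,314 URLs in 24 hours, from the benchmark above.
urls, hours = 12_314, 24
per_hour = urls / hours            # ~513 URLs per hour
per_minute = per_hour / 60         # ~8.6 URLs per minute
seconds_per_url = 3600 / per_hour  # ~7 seconds per URL
print(f"{per_hour:.0f} URLs/h, {per_minute:.1f} URLs/min, {seconds_per_url:.1f} s/URL")
```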

Twitter crawler:

  • Shengsong will try the Twitter crawler on Friday with the KPP/MediaCAT integrated scope to see if it is working, and will let Alejandro know by the end of the day whether it is returning results (a minimal smoke test is sketched below)
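
A minimal Twint smoke test of the kind that could confirm whether results come back; the target account and limit are arbitrary placeholders, and this assumes the crawler in question is Twint (mentioned in the agenda):

```python
import twint

# If Twint's upstream issues are resolved, this should print recent
# tweets; if it errors out or returns nothing, the crawl is still
# blocked. The username is a placeholder.
c = twint.Config()
c.Username = "nytimes"  # hypothetical target account
c.Limit = 20            # fetch a small batch only
twint.run.Search(c)
```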

Action Items:

  • Colin: finish porting the commands and visualizations for the Jupyter hub, and build out the commands listed above
  • Shengsong: continue with the benchmarking and looking at two crawlers simultaneously
  • Shengsong: research OS updates and security groupings for Compute Canada instances, and carry those out
  • Shengsong: hope to get a meeting with the DSU systems admin to talk about Compute Canada resources
  • Shengsong: try the Twint crawler to see if it is working again
  • Alejandro: write to Kirsta about a meeting for Shengsong