November 10, 2020 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

Questions from Research Team

  • Can we prioritize a subsection of the scope (i.e., sources with an AllSides or Pew slant rating, as well as Israeli and Palestinian sources)?
  • We currently have tags spread across different columns; can we leave them like this, or must they be combined into one column separated by pipes?
  • Is it okay to leave some cells as “N/A” or should those be blank?

  • Will the domain crawler also crawl en.africanmanager.com if the scope reads africanmanager.com?
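
If the tags do need to be merged, the combination into one pipe-separated column (with "N/A" cells blanked) could look roughly like this minimal sketch; the column names and row layout here are hypothetical, not MediaCat's actual schema:

```python
# Hypothetical scope rows: tags spread across several columns, some cells "N/A"
rows = [
    {"domain": "africanmanager.com", "tag1": "allsides", "tag2": "pew"},
    {"domain": "example.org", "tag1": "N/A", "tag2": "israeli"},
]

TAG_COLS = ("tag1", "tag2")  # assumed names of the separate tag columns

def normalize(row):
    """Merge the tag columns into one pipe-separated field and blank N/A cells."""
    tags = "|".join(row[c] for c in TAG_COLS if row[c] and row[c] != "N/A")
    out = {k: ("" if v == "N/A" else v) for k, v in row.items() if k not in TAG_COLS}
    out["tags"] = tags
    return out

cleaned = [normalize(r) for r in rows]
```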

Ticket Review

  • Any progress on installation and running of the crawler?
  • Scheduling installation of the application on ComputeCanada resources and starting the crawl Nov 12-13th (due to planned outages)

Meeting Notes

  • Clarifications on input data formatting
    • Tags should be combined into one column separated by pipes
    • N/As should be switched to blanks
  • First attempt will be with the whole scope rather than a subset; based on the output we see, we can re-evaluate whether to run a subset crawl
  • Modify the filter to permit the storage of Twitter URLs
    • not completed yet; Alex will be taking this on
  • MediaCat Domain Crawler
    • bugs found by Jacqueline and Raiyan
    • changes made for PDF production may have affected the domain crawler's performance, and there are still cases where links are filtered out when they shouldn't be
  • Twitter Crawler Code
    • catches errors and returns a list of the Twitter handles that errored, rather than exiting on the first failure
    • adding a time dimension to manage crawling: allows specifying the time window examined for tweet collection
    • keyword search can also be incorporated
  • Modification of the crawler to gather plain-text versions of crawled articles
    • completed; code will be pushed after review
  • Test stack as it stands and start preliminary crawl (on ComputeCanada resources)
    • Jacqueline has created the instance, to be run at the end of the week
  • PDF capture
    • to be integrated as part of post-processing step
  • Integrate date detection into crawler
    • completed by Jacqueline and Alex, pending review
    • output to be concatenated to JSON
  • Post-processor framework
    • a column has been added to store the URL
    • cross-matching of Twitter and domain data is being developed against dummy data
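
The Twitter crawler behaviour described above (collect failing handles instead of exiting, restrict collection to a time window) could be sketched as follows; the handle names and the fetch function are hypothetical, and MediaCat's actual crawler may differ:

```python
from datetime import datetime, timezone

def crawl_handles(handles, fetch_tweets, start, end):
    """Collect tweets per handle within [start, end); record failing handles
    instead of exiting on the first error."""
    results, errored = {}, []
    for handle in handles:
        try:
            tweets = fetch_tweets(handle)  # hypothetical fetcher
        except Exception:
            errored.append(handle)  # remember the handle, keep crawling
            continue
        # keep only tweets inside the requested time window
        results[handle] = [t for t in tweets if start <= t["created_at"] < end]
    return results, errored

# Dummy fetcher for illustration: one good handle, one that raises
def fake_fetch(handle):
    if handle == "@broken":
        raise RuntimeError("suspended account")
    return [{"created_at": datetime(2020, 11, 1, tzinfo=timezone.utc), "text": "hi"}]

results, errored = crawl_handles(
    ["@good", "@broken"], fake_fetch,
    start=datetime(2020, 10, 1, tzinfo=timezone.utc),
    end=datetime(2020, 12, 1, tzinfo=timezone.utc),
)
```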
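
One way the dummy-data cross-matching might work is normalizing URLs cited in tweets and matching them against crawled domain articles; the record fields and normalization rule below are assumptions for illustration, not the post-processor's actual design:

```python
from urllib.parse import urlparse

# Dummy crawled articles and tweets citing URLs (schemas are assumptions)
articles = [
    {"url": "https://africanmanager.com/story-1", "title": "Story 1"},
]
tweets = [
    {"handle": "@reporter", "cited_url": "https://africanmanager.com/story-1?utm=x"},
    {"handle": "@other", "cited_url": "https://example.org/unrelated"},
]

def canonical(url):
    """Normalize a URL for matching: drop scheme, query string, trailing slash."""
    p = urlparse(url)
    return f"{p.netloc}{p.path}".rstrip("/")

# Index articles by canonical URL, then look each tweet's citation up
index = {canonical(a["url"]): a for a in articles}
matches = [
    (t["handle"], index[canonical(t["cited_url"])]["url"])
    for t in tweets if canonical(t["cited_url"]) in index
]
```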