October 13, 2020

Agenda

Notes Start Here

  • New GitHub project to track issues: MediaCat Refactor 2020

  • Twitter crawler - Danhua

    • the owner of the getoldtwitter Python library has updated it for the new Twitter rules; we will hold a session to learn how to use it for our project
  • Researching date recognition - Amy & Jacqueline

    • two pathways
      • estimating the date from Google indexing would require the Google API; there is a limit on how many searches can be performed per day (max 100)
      • the Python DateGuesser library can retrieve dates; JavaScript libraries are also being evaluated because they appear to be better maintained
    • Jacqueline is writing tests for existing date-retrieval libraries to measure what proportion of dates they capture (see the sketch after this item). Results are best for larger sites that include the date in the URL; dates are harder to find on multilingual sites.
    • where should the date-recognition step live?
      • if written in JavaScript, it can be part of the crawler
      • if written in Python, it will be a separate tool
    • Jacqueline & Raiyan will make a decision about which method to use for date capture by next week
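
A minimal sketch of the kind of coverage test described above, assuming the Python date_guesser package (the "DateGuesser" library mentioned in the notes); the sample URLs and the coverage metric are hypothetical illustrations, not the actual test suite:

```python
# Sketch of a coverage test for a date-extraction library.
# Assumes the `date_guesser` package; the sample article list below is a
# hypothetical fixture, not project data.
import requests
from date_guesser import guess_date

SAMPLE_ARTICLES = [
    # hypothetical test URLs
    "https://www.aljazeera.com/news/2020/10/13/example-article",
    "https://www.example.com/opinion/some-undated-piece",
]

def date_coverage(urls):
    """Return the fraction of URLs for which a publication date was found."""
    found = 0
    for url in urls:
        html = requests.get(url, timeout=10).text
        guess = guess_date(url=url, html=html)
        if guess.date is not None:
            found += 1
    return found / len(urls)

if __name__ == "__main__":
    print(f"dates found for {date_coverage(SAMPLE_ARTICLES):.0%} of sample articles")
```
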
  • MediaCat Domain Crawler - Raiyan & Alex

    • Alex's filter function checks that a URL is in scope and is not the domain URL itself, and removes duplicate URLs
    • a crawl run (using the filter function) was done for two domains: the IDF and Al Jazeera
      • it successfully retrieves article text, titles, and HTML content
      • it went five articles deep and ended up on an Al Jazeera homepage (homepage URLs are not always an exact match for the default domain URL)
      • URLs outside the domain were ignored but still collected, grouped by domain name
      • the pseudo-URL definition determines which links qualify to be crawled, e.g. aljazeera.com/news as the pseudo-URL will not retrieve aljazeera.com (see the sketch after this item)
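
A minimal Python sketch of the filtering and pseudo-URL rules described above; the actual filter function lives in the domain crawler (and may be written in JavaScript), so the names and normalization details here are assumptions made for illustration:

```python
# Illustrative sketch of the filter rules: keep a link only if it falls
# under a pseudo-URL prefix, is not the domain URL itself, and has not
# been seen before.  Names and details are assumptions, not the
# crawler's actual code.
from urllib.parse import urlparse

def normalize(url):
    """Strip the scheme, a leading 'www.' and any trailing slash."""
    parsed = urlparse(url if "://" in url else "https://" + url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return (host + parsed.path).rstrip("/")

def filter_links(links, pseudo_urls, domain_url, seen):
    """Split links into (kept, ignored) according to the rules above."""
    kept, ignored = [], []
    for link in links:
        norm = normalize(link)
        in_scope = any(norm.startswith(normalize(p)) for p in pseudo_urls)
        if not in_scope or norm == normalize(domain_url) or norm in seen:
            ignored.append(link)
            continue
        seen.add(norm)
        kept.append(link)
    return kept, ignored

# aljazeera.com/news as the pseudo-URL admits article links under /news
# but not the bare aljazeera.com homepage.
kept, ignored = filter_links(
    ["https://www.aljazeera.com/news/2020/10/13/story",
     "https://www.aljazeera.com/",
     "https://example.com/other"],
    pseudo_urls=["aljazeera.com/news"],
    domain_url="aljazeera.com",
    seen=set(),
)
```
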
  • Postprocessor

    • new issue created to address postprocessor development
  • Scope parser demo - Danhua

    • created functions for the validation checks (valid URL, valid Twitter handle, type of source, valid CSV); see the sketch below
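
A minimal sketch of the kind of validation checks mentioned in the demo, assuming a scope CSV with Source and Type columns; the column names, patterns, and accepted source types are illustrative assumptions, not the actual scope parser:

```python
# Illustrative validation checks for a scope CSV; the column names
# ("Source", "Type") and the accepted source types are assumptions,
# not the actual scope parser.
import csv
import re

URL_PATTERN = re.compile(r"^(https?://)?([\w-]+\.)+[a-z]{2,}(/\S*)?$", re.IGNORECASE)
HANDLE_PATTERN = re.compile(r"^@\w{1,15}$")
VALID_TYPES = {"domain", "twitter handle"}  # assumed source types

def is_valid_url(value):
    return bool(URL_PATTERN.match(value.strip()))

def is_valid_twitter_handle(value):
    return bool(HANDLE_PATTERN.match(value.strip()))

def is_valid_type(value):
    return value.strip().lower() in VALID_TYPES

def validate_csv(path):
    """Return a list of human-readable problems found in the scope CSV."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        for row_num, row in enumerate(csv.DictReader(f), start=2):
            source = row.get("Source", "")
            if not (is_valid_url(source) or is_valid_twitter_handle(source)):
                problems.append(f"row {row_num}: invalid source {source!r}")
            if not is_valid_type(row.get("Type", "")):
                problems.append(f"row {row_num}: unknown type {row.get('Type')!r}")
    return problems
```
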