October 27, 2020

  • Ticket Review
  • Return count of URLs/tweets crawled for each item (domain/Twitter handle) in the crawl

Meeting Notes

  • Twitter crawler update from Danhua

    • the Twint library was updated and used to perform the crawl
    • for each Twitter handle, the crawler generates an output CSV of all of that handle's tweets, including fields such as date, time, timezone, URLs included, mentions of other Twitter handles, reply count, retweet count, like count, and hashtags (see the sketch after this item)
    • crawls should be run on SciNet resources
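
As a reference for the Twint workflow described above, here is a minimal sketch of a per-handle crawl. The handle list and output path are hypothetical placeholders, not the project's actual configuration; the listed CSV columns are standard Twint output fields.

```python
# Minimal sketch of a per-handle Twint crawl; handles and output path are
# hypothetical placeholders, not the project's actual configuration.
import twint

handles = ["example_handle_one", "example_handle_two"]  # hypothetical Twitter handles

for handle in handles:
    config = twint.Config()
    config.Username = handle                 # crawl tweets from this handle
    config.Store_csv = True                  # write results to a CSV file
    config.Output = f"output/{handle}.csv"   # one CSV per handle
    # The resulting CSV includes columns such as date, time, timezone, urls,
    # mentions, replies_count, retweets_count, likes_count, and hashtags.
    twint.run.Search(config)
```
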
  • Update from Alex on integrating date detection into the crawler

    • two approaches are being explored
    • one approach works recursively: each time metascraper finds a date, it recursively goes through each domain; because of its asynchronous nature it becomes slow as the scope widens, and it may run into stack overflow issues
    • the other approach uses loops and performs faster, but other issues arise: the crawl will sometimes just stop (or time out?), which is likely also a sign that callbacks are needed (see the sketch after this list)
    • Jacqueline suggested incorporating callbacks
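
The actual date detection runs through metascraper in the Node.js crawler, so the following Python sketch is only a conceptual illustration of the control-flow point above: replacing recursion with an explicit queue keeps the traversal flat and avoids stack overflow on deep link chains. The meta-tag lookup stands in for metascraper's date detection and is an assumption, not the real implementation.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl_dates(seed_urls, max_pages=100):
    """Loop-based traversal: an explicit queue replaces recursive calls,
    so deep link chains cannot overflow the call stack."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    dates = {}
    while queue:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip pages that fail or time out instead of stopping
        soup = BeautifulSoup(html, "html.parser")
        # Stand-in for metascraper's date detection: read a common meta tag.
        meta = soup.find("meta", attrs={"property": "article:published_time"})
        if meta and meta.get("content"):
            dates[url] = meta["content"]
        # Enqueue outgoing links instead of recursing into them.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen and len(seen) < max_pages:
                seen.add(link)
                queue.append(link)
    return dates
```
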
  • Domain crawler update from Raiyan

    • Raiyan tested some URLs; the output looks as expected
    • the crawler tends to find main topic landing pages first (e.g. a /politics landing page) rather than news articles
    • e.g. when crawling with a depth of 20, 16 of the results will be these landing pages and only 4 will be actual articles
    • the number of pages crawled will need to be increased to reach more articles
    • pseudo-URLs need to be created for every URL provided to the crawler; this is in the process of being built so that they are generated automatically (see the sketch after this list)
    • Nat, Alex, and Raiyan will need to consult with each other to make sure that outputs can be consolidated and that making extra calls is avoided
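
A minimal sketch of generating pseudo-URLs automatically from the seed URLs, as mentioned above. It assumes the Apify-style convention of a `[.*]` wildcard appended under each seed's domain; the actual pattern scheme used by the domain crawler may differ.

```python
from urllib.parse import urlparse

def make_pseudo_urls(seed_urls):
    """Build one pseudo-URL pattern per seed URL so the crawler only follows
    links that stay under the same domain (assumed pattern scheme)."""
    patterns = []
    for url in seed_urls:
        parsed = urlparse(url)
        # e.g. "https://www.example.com/politics" -> "https://www.example.com/[.*]"
        patterns.append(f"{parsed.scheme}://{parsed.netloc}/[.*]")
    return patterns

# Example usage:
print(make_pseudo_urls(["https://www.example.com/politics"]))
# ['https://www.example.com/[.*]']
```
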
  • Post-processor update from Amy

    • citation linking is implemented
    • out-of-scope URL sources need to be stored and their mentions counted, in order to save potential new links to crawl
    • a 'Sample of Potential Interest' template for out-of-scope URLs is in the MediaCat Data for Testing sheet
    • next steps include extracting Twitter handles, identifying matches using text aliases, and creating the JSON/CSV output for these "out-of-scope" potential sources
    • generating a UUID as the key, associated with the referring record IDs, in the post-processor output (see the sketch below)
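
A minimal sketch of the UUID-keyed output discussed above, combining the stored out-of-scope URLs, their mention counts, and the referring record IDs. The field names (url, mention_count, referring_record_ids) are illustrative assumptions, not the post-processor's actual schema.

```python
import json
import uuid
from collections import defaultdict

def build_out_of_scope_records(citations):
    """Group out-of-scope URLs, count their mentions, and key each record with
    a freshly generated UUID associated with the referring record IDs.
    `citations` is an iterable of (out_of_scope_url, referring_record_id) pairs."""
    grouped = defaultdict(list)
    for url, referring_id in citations:
        grouped[url].append(referring_id)

    records = {}
    for url, referring_ids in grouped.items():
        key = str(uuid.uuid4())  # UUID used as the record key
        records[key] = {
            "url": url,                             # potential new source to crawl
            "mention_count": len(referring_ids),    # how often it was cited in-scope
            "referring_record_ids": referring_ids,  # records that cited it
        }
    return records

# Example usage:
sample = [("https://example.org/story", "rec-1"), ("https://example.org/story", "rec-2")]
print(json.dumps(build_out_of_scope_records(sample), indent=2))
```
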