October 20, 2020 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Ticket Review
  • Matching algorithm in post-processor (Danhua's work)
  • How to pass data from crawlers to post-processor
  • Deployment Timeline (follow up with SciNet resources)

Meeting notes

  • Twitter crawler is working, using snscrape

    • encoding issue
    • add a delay mechanism to limit requests before getting blocked
    • current output is CSV file
  • Date recognition

    • metascraper has been chosen as the tool for date recognition
    • tests for date verification have been created
  • Domain crawler

    • filtering out-of-scope crawls is defined
    • date data from metascraper will be incorporated by taking the links from the domain crawler's output JSON and adding the metascraper date data to that JSON
    • the resulting JSON will include: title, author, clean text content of article, HTML content of article body with links, list of all links in article, date from metascraper, length of article in characters
  • Scope parser complete

  • Creation of post-processor framework

    • Raiyan providing sample output from domain crawler
    • Danhua's focus will be on Twitter output first
  • Process of ID linking for references to scope handles and articles

    • Extract all citations possible from tweet or article
    • If a hyperlink or Twitter handle is in the dataset, keep the citation
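
The two ID-linking steps above could be sketched roughly as follows. This is only an illustration, not the post-processor's actual code: the function names (`extract_citations`, `keep_in_scope`), the regular expressions, and the rule that a hyperlink counts as "in the dataset" when its host equals a scope domain are all assumptions made for the example.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns; real tweets and articles may need sturdier parsing.
URL_RE = re.compile(r"https?://[^\s\"'<>)]+")
HANDLE_RE = re.compile(r"(?<![\w@])@(\w{1,15})")


def extract_citations(text):
    """Step 1: extract every hyperlink and Twitter handle found in the text."""
    # Trim punctuation that commonly trails a URL at the end of a sentence.
    urls = [u.rstrip(".,;:!?") for u in URL_RE.findall(text)]
    # Remove URLs first so an "@name" inside a URL path is not read as a handle.
    text_without_urls = URL_RE.sub(" ", text)
    handles = ["@" + h.lower() for h in HANDLE_RE.findall(text_without_urls)]
    return urls + handles


def keep_in_scope(citations, scope_domains, scope_handles):
    """Step 2: keep a citation only if its handle or domain is in the scope."""
    kept = []
    for citation in citations:
        if citation.startswith("@"):
            if citation in scope_handles:
                kept.append(citation)
        else:
            host = (urlparse(citation).hostname or "").lower()
            if host.startswith("www."):
                host = host[len("www."):]
            # Assumption: exact host match against the scope domains.
            if host in scope_domains:
                kept.append(citation)
    return kept


if __name__ == "__main__":
    sample = ("Read https://www.example.com/story, via @NewsDesk and @other. "
              "Also http://offscope.org/x")
    found = extract_citations(sample)
    print(keep_in_scope(found, {"example.com"}, {"@newsdesk"}))
```

Handles are lowercased before comparison because Twitter handles are case-insensitive; the scope sets passed in are assumed to be lowercased the same way.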