March 04, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Meeting notes

  • Output format - Sample Output_v2
    • Entry type will be described by the type column (twitter handle, tweet, domain, article, text alias)
    • Text aliases can be coalesced with general domain mentions
  • Problematic URLs sheet has been added as well
  • Crawler performance
    • Batching changes seem to improve the crawl for some sites but not all
    • Some unidentified articles are gathered as null; they may or may not belong to a domain
    • With the Apify queue, the crawl needs to be restarted when it crashes (it can be picked up where it stopped)
    • Raiyan continues to look into automating batching with Apify
    • An email should be sent to notify the user when Apify crashes (this currently doesn't seem to work)
    • Further work is needed on batching and on implementing email notification
  • Crawler next steps
    • NYTimes to be run as a batch
    • Exploring gathering the content of the <a> tag instead of the title, to retrieve anchor text
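
The last crawler item, retrieving anchor text from the content of a link rather than from its title, could look roughly like the sketch below. This is only an illustration using Python's standard html.parser, not the crawler's actual code. The names AnchorTextParser and extract_anchor_text are hypothetical, and reading "title" as the link's title attribute is an assumption, so both values are returned side by side for comparison.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect href, title attribute, and visible text for every <a> element.

    Hypothetical helper for illustration; not part of the MediaCAT crawler.
    """

    def __init__(self):
        super().__init__()
        self.links = []       # finished links, one dict per <a> element
        self._current = None  # link currently being read, None outside <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            self._current = {
                "href": attr_map.get("href"),
                "title": attr_map.get("title"),  # assumed meaning of "title"
                "text_parts": [],
            }

    def handle_data(self, data):
        # Text nodes inside nested tags (e.g. <b>) are still inside the link.
        if self._current is not None:
            self._current["text_parts"].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            raw = "".join(self._current.pop("text_parts"))
            # Collapse runs of whitespace so the anchor text is one clean string.
            self._current["text"] = " ".join(raw.split())
            self.links.append(self._current)
            self._current = None


def extract_anchor_text(html):
    """Return a list of dicts with 'href', 'title', and 'text' for each link."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<p>See <a href="https://example.com/a" title="Example">the <b>full</b> report</a>.</p>'
    for link in extract_anchor_text(sample):
        print(link["href"], "|", link["text"], "|", link["title"])
```

Keeping the title next to the anchor text makes it easy to check, per site, which of the two is the more useful value for the output sheet.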