March 04, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Meeting notes

Output format - Sample Output_v2
- Entry type will be described by the type column (twitter handle, tweet, domain, article, text alias)
- Text aliases can be coalesced with general domain mentions
- A Problematic URLs sheet has been added as well
Crawler performance
- Batching changes seem to improve the crawl for some sites but not all
- There are unidentified articles that are gathered as null - may or may not belong to a domain
- With the Apify queue, the crawl needs to be restarted when it crashes (it can be picked up where it stopped)
- Raiyan continues to look into automating batching with Apify
- An email should be sent to notify the user when Apify crashes (this currently doesn't seem to work)
- Further work on batching, and email notification implementation
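
The batching and restart behaviour noted above can be illustrated with a rough sketch. This is not the crawler's actual code and does not use Apify; `crawl_in_batches`, the `fetch` callback, and the JSON state file are all hypothetical names chosen for illustration. The idea is only that progress is saved after each completed batch, so a run that crashes can resume from the last finished batch.

```python
import json
import os


def crawl_in_batches(urls, batch_size, state_path, fetch):
    """Process urls in fixed-size batches, saving progress after each batch.

    state_path is a hypothetical JSON file holding the index of the next
    unprocessed URL; fetch is a caller-supplied function (an assumption).
    """
    start = 0
    if os.path.exists(state_path):
        # Resume from the batch after the last one that completed.
        with open(state_path) as f:
            start = json.load(f)["next_index"]

    for i in range(start, len(urls), batch_size):
        for url in urls[i:i + batch_size]:
            fetch(url)  # may raise; progress so far stays on disk
        with open(state_path, "w") as f:
            json.dump({"next_index": i + batch_size}, f)
```

If a batch fails partway through, the whole batch is retried on the next run, so `fetch` may see some URLs twice.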
Crawler next steps
- NYTimes to be run as a batch
- Exploring gathering the content of the `<a>` tag instead of the title, to retrieve anchor text
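
As a rough illustration of the anchor-text idea in the last item, the sketch below collects the text inside each link element rather than any title. It uses only Python's standard `html.parser` and is not the crawler's implementation; `AnchorTextParser` and `extract_anchor_text` are hypothetical names, and nested links are assumed not to occur.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects (href, anchor text) pairs from link elements."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (href, anchor text) tuples
        self._href = None
        self._parts = []
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        # Text from nested tags such as <b> also arrives here.
        if self._in_a:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._parts).split())
            self.links.append((self._href, text))
            self._in_a = False


def extract_anchor_text(html):
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, `extract_anchor_text('<a href="/x" title="T">read <b>more</b></a>')` yields the link's visible text, `read more`, and ignores the `title` attribute.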