October 27, 2020

  • Ticket Review
  • Return count of URLs/tweets crawled for each item (domain/Twitter handle) in the crawl

Meeting Notes

  • Twitter crawler update from Danhua

    • the Twint library was updated and used to perform the crawl
    • for each Twitter handle, the crawler generates an output CSV of all of that handle's tweets, including fields such as date, time, timezone, URLs included, mentions of other Twitter handles, reply count, retweet count, like count, and hashtags (see the sketch after this item)
    • crawls should be run on SciNet resources
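
As a reference for the Twint workflow described above, here is a minimal sketch of a per-handle crawl. The handle list and output path are hypothetical placeholders, not the project's actual configuration; the listed CSV columns are standard Twint output fields.

```python
# Minimal sketch of a per-handle Twint crawl; handles and output path are
# hypothetical placeholders, not the project's actual configuration.
import twint

handles = ["example_handle_one", "example_handle_two"]  # hypothetical Twitter handles

for handle in handles:
    config = twint.Config()
    config.Username = handle                 # crawl tweets from this handle
    config.Store_csv = True                  # write results to a CSV file
    config.Output = f"output/{handle}.csv"   # one CSV per handle
    # The resulting CSV includes columns such as date, time, timezone, urls,
    # mentions, replies_count, retweets_count, likes_count, and hashtags.
    twint.run.Search(config)
```
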
  • Update from Alex on integrating date detection into the crawler

    • two approaches are being explored
    • one approach works recursively: each time metascraper finds a date, it recursively goes through each domain; because of its asynchronous nature it becomes slow as the scope widens, and it may run into stack overflow issues
    • the other approach uses loops and performs faster, but other issues arise: the crawl will sometimes just stop (or time out?), which is likely also a sign that callbacks are needed (see the sketch after this list)
    • Jacqueline suggested incorporating callbacks
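
The actual date detection runs through metascraper in the Node.js crawler, so the following Python sketch is only a conceptual illustration of the control-flow point above: replacing recursion with an explicit queue keeps the traversal flat and avoids stack overflow on deep link chains. The meta-tag lookup stands in for metascraper's date detection and is an assumption, not the real implementation.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl_dates(seed_urls, max_pages=100):
    """Loop-based traversal: an explicit queue replaces recursive calls,
    so deep link chains cannot overflow the call stack."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    dates = {}
    while queue:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip pages that fail or time out instead of stopping
        soup = BeautifulSoup(html, "html.parser")
        # Stand-in for metascraper's date detection: read a common meta tag.
        meta = soup.find("meta", attrs={"property": "article:published_time"})
        if meta and meta.get("content"):
            dates[url] = meta["content"]
        # Enqueue outgoing links instead of recursing into them.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen and len(seen) < max_pages:
                seen.add(link)
                queue.append(link)
    return dates
```
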
  • Domain crawler update from Raiyan

    • Raiyan tested some URLs; the output looks as expected
    • the crawler tends to find main topic landing pages first (e.g. a /politics landing page) rather than news articles
    • e.g. when crawling with a depth of 20, 16 of the results will be these landing pages and only 4 will be actual articles
    • the number of pages crawled will need to be increased to reach more articles
    • pseudo-URLs need to be created for every URL provided to the crawler; this is in the process of being built so that they are generated automatically (see the sketch after this list)
    • Nat, Alex, and Raiyan will need to consult with each other to make sure that outputs can be consolidated and that making extra calls is avoided
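
A minimal sketch of generating pseudo-URLs automatically from the seed URLs, as mentioned above. It assumes the Apify-style convention of a `[.*]` wildcard appended under each seed's domain; the actual pattern scheme used by the domain crawler may differ.

```python
from urllib.parse import urlparse

def make_pseudo_urls(seed_urls):
    """Build one pseudo-URL pattern per seed URL so the crawler only follows
    links that stay under the same domain (assumed pattern scheme)."""
    patterns = []
    for url in seed_urls:
        parsed = urlparse(url)
        # e.g. "https://www.example.com/politics" -> "https://www.example.com/[.*]"
        patterns.append(f"{parsed.scheme}://{parsed.netloc}/[.*]")
    return patterns

# Example usage:
print(make_pseudo_urls(["https://www.example.com/politics"]))
# ['https://www.example.com/[.*]']
```
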
  • Post-processor update from Amy

    • citation linking is implemented
    • out-of-scope URL sources need to be stored and their mentions counted, in order to save potential new links to crawl
    • a 'Sample of Potential Interest' template for out-of-scope URLs is in the MediaCat Data for Testing sheet
    • next steps include extracting Twitter handles, identifying matches using text aliases, and creating the JSON/CSV output for these "out-of-scope" potential sources
    • generating a UUID as the key, associated with the referring record IDs, in the post-processor output (see the sketch below)
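
A minimal sketch of the UUID-keyed output discussed above, combining the stored out-of-scope URLs, their mention counts, and the referring record IDs. The field names (url, mention_count, referring_record_ids) are illustrative assumptions, not the post-processor's actual schema.

```python
import json
import uuid
from collections import defaultdict

def build_out_of_scope_records(citations):
    """Group out-of-scope URLs, count their mentions, and key each record with
    a freshly generated UUID associated with the referring record IDs.
    `citations` is an iterable of (out_of_scope_url, referring_record_id) pairs."""
    grouped = defaultdict(list)
    for url, referring_id in citations:
        grouped[url].append(referring_id)

    records = {}
    for url, referring_ids in grouped.items():
        key = str(uuid.uuid4())  # UUID used as the record key
        records[key] = {
            "url": url,                             # potential new source to crawl
            "mention_count": len(referring_ids),    # how often it was cited in-scope
            "referring_record_ids": referring_ids,  # records that cited it
        }
    return records

# Example usage:
sample = [("https://example.org/story", "rec-1"), ("https://example.org/story", "rec-2")]
print(json.dumps(build_out_of_scope_records(sample), indent=2))
```
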