October 20, 2020 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- Ticket Review
- Matching algorithm in post-processor (Danhua's work)
- How to pass data from crawlers to post-processor
- Deployment Timeline (follow up with SciNet resources)
Meeting notes
- Twitter crawler is working, using snscrape
  - encoding issue
  - to do: work in a delay mechanism that limits requests so the crawler is not blocked
  - current output is a CSV file
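
One way the delay mechanism could look is sketched below. This is only an illustration, not the crawler's actual code: the class name `RequestThrottle`, the `min_interval` parameter, and the injectable `clock` and `sleep` arguments are all assumptions made for the example. It spaces out requests by a fixed minimum interval.

```python
import time
from typing import Callable, Optional


class RequestThrottle:
    """Hypothetical helper: enforce a minimum gap between requests."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> None:
        """Block until at least `min_interval` seconds have passed
        since the previous call, then record the new request time."""
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

The crawler would call `throttle.wait()` immediately before each request. The right interval depends on how quickly Twitter starts blocking, which these notes do not specify.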
- Date recognition
  - metascraper has been chosen as the tool for date recognition
  - tests for date verification have been created
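
The notes do not say what the date verification tests check. As a rough sketch, assuming the date arrives as an ISO 8601 string and that "verification" means the string parses and falls in a plausible range, a check could look like this. The function names and the `EARLIEST` floor are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical lower bound; anything earlier is treated as a bad extraction.
EARLIEST = datetime(1990, 1, 1, tzinfo=timezone.utc)


def parse_iso_date(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 string, returning None if it is missing or invalid."""
    if not value:
        return None
    try:
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: dates without a timezone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def is_plausible_date(value: Optional[str], now: Optional[datetime] = None) -> bool:
    """True if the date parses and lies between EARLIEST and `now`."""
    parsed = parse_iso_date(value)
    if parsed is None:
        return False
    now = now or datetime.now(timezone.utc)
    return EARLIEST <= parsed <= now
```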
- Domain crawler
  - filtering of out-of-scope crawls has been defined
  - date data from metascraper will be incorporated by taking the links from the domain crawler's output JSON and adding the metascraper date data for them to that JSON
  - the resulting JSON will therefore include: title, author, clean text content of the article, HTML content of the article body with links, list of all links in the article, date from metascraper, and length of the article in characters
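
The merge step could be sketched as below. It assumes the metascraper dates have already been collected into a mapping from URL to date string by a separate step that is not shown. All key names (`url`, `article_text`, `date`, `article_len`) are placeholders, since the notes list the fields but not their JSON keys.

```python
from typing import Dict, List, Optional


def merge_dates(
    articles: List[dict],
    dates_by_url: Dict[str, Optional[str]],
) -> List[dict]:
    """Return copies of the crawler records with date and length added.

    `articles` stands in for the domain crawler's output JSON (one dict per
    article); `dates_by_url` stands in for the metascraper results.
    """
    merged = []
    for article in articles:
        record = dict(article)  # shallow copy, input is left untouched
        # None when metascraper found no date for this URL.
        record["date"] = dates_by_url.get(article["url"])
        # Length of the clean article text in characters.
        record["article_len"] = len(article.get("article_text", ""))
        merged.append(record)
    return merged
```

The merged list can then be written back out with `json.dump`, so later stages read one file per crawl rather than two.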
- Scope parser is complete
- Creation of post-processor framework
  - Raiyan is providing sample output from the domain crawler
  - Danhua's focus will be on Twitter output first
- Process of ID linking for references to scope handles and articles
  - Extract all possible citations from the tweet or article
  - If a hyperlink or Twitter handle is in the dataset, keep the citation
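
The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the post-processor's matching algorithm: the regular expressions, the function names, and the rule that "in the dataset" means an exact match on the URL (ignoring a trailing slash) or on the lowercased handle are all assumptions made for the example.

```python
import re
from typing import Iterable, List

# Simplified patterns; real tweets and articles may need more careful parsing.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")
HANDLE_RE = re.compile(r"(?<![\w@])@(\w{1,15})\b")


def extract_citations(text: str) -> List[str]:
    """Step 1: pull every hyperlink and Twitter handle out of the text."""
    urls = [u.rstrip(".,;:!?") for u in URL_RE.findall(text)]
    # Remove URLs first so that "@name" inside a link is not read as a handle.
    text_without_urls = URL_RE.sub(" ", text)
    handles = ["@" + h.lower() for h in HANDLE_RE.findall(text_without_urls)]
    return urls + handles


def keep_known(
    citations: Iterable[str],
    known_urls: Iterable[str],
    known_handles: Iterable[str],
) -> List[str]:
    """Step 2: keep only citations that point into the dataset."""
    urls = {u.rstrip("/") for u in known_urls}
    handles = {h.lower().lstrip("@") for h in known_handles}
    kept = []
    for citation in citations:
        if citation.startswith("@"):
            if citation[1:] in handles:
                kept.append(citation)
        elif citation.rstrip("/") in urls:
            kept.append(citation)
    return kept
```

For example, `keep_known(extract_citations(tweet_text), article_urls, scope_handles)` would return only the references that can be linked to an ID, where `article_urls` and `scope_handles` are placeholders for the crawled article URLs and the handles from the scope parser.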