October 20, 2020 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

Twitter crawler is working using snscrape
- encoding issue
- add a delay mechanism to limit the request rate before the crawler gets blocked
- current output is CSV file
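
A minimal sketch of one possible delay mechanism, assuming a fixed minimum interval between requests is enough to stay under the block threshold. The `Throttle` class and its parameters are hypothetical names for illustration and are not part of snscrape or the current crawler.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        # clock and sleep are injectable so the behaviour can be tested
        # without actually waiting.
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

The crawler would call `wait()` immediately before each request. The actual interval would need to be tuned against whatever limit triggers the block, which these notes do not specify.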
Date recognition
- metascraper has been chosen as the tool for date recognition
- tests for date verification have been created
Domain crawler
- filtering out-of-scope crawls is defined
- date data from metascraper will be incorporated by taking the links from the domain crawler's output JSON and adding the metascraper date for each link back into that JSON
- the resulting JSON will include: title, author, clean text content of the article, HTML content of the article body with links, list of all links in the article, date from metascraper, and length of the article in characters
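
The merge step described above could look roughly like the sketch below. All field names (`url`, `article_text`, `article_len`, `date`) and the `add_dates` function are assumptions made for illustration; the real domain crawler output may use different keys. The dates are assumed to have already been obtained from metascraper and keyed by article URL.

```python
def add_dates(articles, dates_by_url):
    """Return copies of crawler records with date and length fields added.

    articles:     list of dicts parsed from the domain crawler's output JSON
    dates_by_url: dict mapping an article URL to its metascraper date string
    """
    merged = []
    for article in articles:
        record = dict(article)  # copy so the crawler output is not mutated
        # None marks articles for which no date was found.
        record["date"] = dates_by_url.get(article["url"])
        # Length of the clean article text in characters.
        record["article_len"] = len(article.get("article_text", ""))
        merged.append(record)
    return merged
```

Using `None` for a missing date keeps every record in the output, so later stages can decide how to handle undated articles.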
Scope parser complete
Creation of post-processor framework
- Raiyan providing sample output from domain crawler
- Danhua's focus will be on Twitter output first
Process of ID linking for references to scope handles and articles
- Extract all citations possible from tweet or article
- If the hyperlink or Twitter handle is in the dataset, keep the citation
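
The two steps above can be sketched as follows. This is only an illustration under simple assumptions: citations are found with regular expressions in plain text, handles are compared case-insensitively, and the dataset is represented as a set of strings. The function names and patterns are hypothetical, not the post-processor's actual design.

```python
import re

# "@" followed by word characters, not preceded by a word character
# (so e-mail addresses such as a@b.com are not treated as handles).
HANDLE_RE = re.compile(r"(?<!\w)@(\w+)")
# A rough URL pattern; real-world link extraction may need an HTML parser.
URL_RE = re.compile(r"https?://[^\s<>\"')]+")


def extract_citations(text):
    """Extract every Twitter handle and hyperlink found in the text."""
    handles = {"@" + h.lower() for h in HANDLE_RE.findall(text)}
    # Strip punctuation that commonly trails a URL in running text.
    urls = {u.rstrip(".,;:!?") for u in URL_RE.findall(text)}
    return handles | urls


def keep_in_scope(citations, scope):
    """Keep only the citations that appear in the scope dataset."""
    return {c for c in citations if c in scope}
```

For example, with `scope = {"@nytimes", "https://example.com/story"}`, a tweet that mentions `@NYTimes`, `@someone`, and both `https://example.com/story` and an unrelated link would keep only the first handle and the first link.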