October 13, 2020

Agenda

Notes Start Here

  • New GitHub project to track issues: MediaCat Refactor 2020

  • Twitter crawler - Danhua

    • the owner of the getoldtwitter Python library has updated it for the new Twitter rules; we will hold a session to learn how to use it for our project
  • Researching date recognition - Amy & Jacqueline

    • two pathways
      • estimating the date from Google indexing would require the Google API; there is a limit on how many searches can be performed per day (max 100)
      • the Python DateGuesser library can retrieve dates; JavaScript libraries are also being evaluated because they appear to be better maintained
    • Jacqueline is writing tests for existing date-retrieval libraries to measure what proportion of dates they capture (see the sketch after this item). Results are best for larger sites that include the date in the URL; dates are harder to find on multilingual sites.
    • where should the date-recognition step live?
      • if written in JavaScript, it can be part of the crawler
      • if written in Python, it will be a separate tool
    • Jacqueline & Raiyan will make a decision about which method to use for date capture by next week
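
A minimal sketch of the kind of coverage test described above, assuming the Python date_guesser package (the "DateGuesser" library mentioned in the notes); the sample URLs and the coverage metric are hypothetical illustrations, not the actual test suite:

```python
# Sketch of a coverage test for a date-extraction library.
# Assumes the `date_guesser` package; the sample article list below is a
# hypothetical fixture, not project data.
import requests
from date_guesser import guess_date

SAMPLE_ARTICLES = [
    # hypothetical test URLs
    "https://www.aljazeera.com/news/2020/10/13/example-article",
    "https://www.example.com/opinion/some-undated-piece",
]

def date_coverage(urls):
    """Return the fraction of URLs for which a publication date was found."""
    found = 0
    for url in urls:
        html = requests.get(url, timeout=10).text
        guess = guess_date(url=url, html=html)
        if guess.date is not None:
            found += 1
    return found / len(urls)

if __name__ == "__main__":
    print(f"dates found for {date_coverage(SAMPLE_ARTICLES):.0%} of sample articles")
```
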
  • MediaCat Domain Crawler - Raiyan & Alex

    • Alex's filter function checks that a URL is in scope and is not the domain URL itself, and removes duplicate URLs
    • a crawl run (using the filter function) was done for two domains: the IDF and Al Jazeera
      • it successfully retrieves article text, titles, and HTML content
      • it went five articles deep and ended up on an Al Jazeera homepage (homepage URLs are not always an exact match for the default domain URL)
      • URLs outside the domain were ignored but still collected, grouped by domain name
      • the pseudo-URL definition determines which links qualify to be crawled, e.g. aljazeera.com/news as the pseudo-URL will not retrieve aljazeera.com (see the sketch after this item)
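
A minimal Python sketch of the filtering and pseudo-URL rules described above; the actual filter function lives in the domain crawler (and may be written in JavaScript), so the names and normalization details here are assumptions made for illustration:

```python
# Illustrative sketch of the filter rules: keep a link only if it falls
# under a pseudo-URL prefix, is not the domain URL itself, and has not
# been seen before.  Names and details are assumptions, not the
# crawler's actual code.
from urllib.parse import urlparse

def normalize(url):
    """Strip the scheme, a leading 'www.' and any trailing slash."""
    parsed = urlparse(url if "://" in url else "https://" + url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return (host + parsed.path).rstrip("/")

def filter_links(links, pseudo_urls, domain_url, seen):
    """Split links into (kept, ignored) according to the rules above."""
    kept, ignored = [], []
    for link in links:
        norm = normalize(link)
        in_scope = any(norm.startswith(normalize(p)) for p in pseudo_urls)
        if not in_scope or norm == normalize(domain_url) or norm in seen:
            ignored.append(link)
            continue
        seen.add(norm)
        kept.append(link)
    return kept, ignored

# aljazeera.com/news as the pseudo-URL admits article links under /news
# but not the bare aljazeera.com homepage.
kept, ignored = filter_links(
    ["https://www.aljazeera.com/news/2020/10/13/story",
     "https://www.aljazeera.com/",
     "https://example.com/other"],
    pseudo_urls=["aljazeera.com/news"],
    domain_url="aljazeera.com",
    seen=set(),
)
```
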
  • Postprocessor

    • new issue created to address postprocessor development
  • Scope parser demo - Danhua

    • created functions for the validation checks (valid URL, valid Twitter handle, type of source, valid CSV); see the sketch below
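
A minimal sketch of the kind of validation checks mentioned in the demo, assuming a scope CSV with Source and Type columns; the column names, patterns, and accepted source types are illustrative assumptions, not the actual scope parser:

```python
# Illustrative validation checks for a scope CSV; the column names
# ("Source", "Type") and the accepted source types are assumptions,
# not the actual scope parser.
import csv
import re

URL_PATTERN = re.compile(r"^(https?://)?([\w-]+\.)+[a-z]{2,}(/\S*)?$", re.IGNORECASE)
HANDLE_PATTERN = re.compile(r"^@\w{1,15}$")
VALID_TYPES = {"domain", "twitter handle"}  # assumed source types

def is_valid_url(value):
    return bool(URL_PATTERN.match(value.strip()))

def is_valid_twitter_handle(value):
    return bool(HANDLE_PATTERN.match(value.strip()))

def is_valid_type(value):
    return value.strip().lower() in VALID_TYPES

def validate_csv(path):
    """Return a list of human-readable problems found in the scope CSV."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        for row_num, row in enumerate(csv.DictReader(f), start=2):
            source = row.get("Source", "")
            if not (is_valid_url(source) or is_valid_twitter_handle(source)):
                problems.append(f"row {row_num}: invalid source {source!r}")
            if not is_valid_type(row.get("Type", "")):
                problems.append(f"row {row_num}: unknown type {row.get('Type')!r}")
    return problems
```
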