November 17, 2020 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

To keep better track of PRs - name each PR after its ticket
Crawler has been started! Twitter crawler is working smoothly; domain crawler is returning errors we need to address
Scope parser validation - the "url is alive" function times out the script
- checking format of urls for validation
- Nat's suggestion: check headers in response to validate
- adding "http" when making the call if missing - will likely need to be in Raiyan's script as well
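
A minimal sketch of the three validation ideas above (format check, header-only liveness check, adding "http" when missing). This is an illustration under assumptions, not the scope parser's actual code: the function names `normalize_url`, `has_valid_format` and `is_alive` and the 5-second timeout are hypothetical, and it uses only the Python standard library.

```python
import http.client
import urllib.request
from urllib.parse import urlparse


def normalize_url(raw: str) -> str:
    """Prepend "http://" when the scope entry has no scheme (assumed default)."""
    url = raw.strip()
    if not url.lower().startswith(("http://", "https://")):
        url = "http://" + url
    return url


def has_valid_format(url: str) -> bool:
    """Cheap format check that needs no network access."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and "." in parts.netloc


def is_alive(url: str, timeout: float = 5.0) -> bool:
    """Send a HEAD request so only the response headers come back.

    The timeout bounds each call, so one slow host cannot stall the script.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # Covers HTTP errors, DNS failures, refused connections and timeouts.
        return False
```

Running the format check first keeps malformed entries from ever reaching the network call. Some servers reject HEAD requests, so a failed `is_alive` is a hint that the url needs a second look rather than proof that it is dead.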
Accepting a .csv file - to be integrated by Raiyan
Modify filter to permit storage of urls
- completed & reviewed by Alex and Raiyan
Test stack as it stands - first crawl has been started, and errors need to be dealt with
- once a full run of the scope has been completed this ticket will be closed
Twitter crawler code - completed by Danhua
- now accepts variables for time & keywords
Integrate date detection in crawler - completed & needs to be tested
MediaCat Domain crawler
- as the domain crawler goes through links, the domain it checks against is not updated properly
  - ex. if crawling CNN, it checks against CNN, but after starting to crawl NYTimes it still checks against CNN and dismisses entries as out of scope
  - addressed by a loop that checks the current link against all scope domain links (slows performance)
- UUID for each JSON node needs to be re-added for the post-processor's use as key
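
One possible way to avoid both the stale domain and the per-link loop, sketched under assumptions: collect the scope domains into a set once, then test each link's own domain against that set. The sketch also attaches a UUID to each node. The helper names and the node fields (`id`, `url`, `domain`) are hypothetical and are not taken from the crawler.

```python
import uuid
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the host of a URL, lowercased and without a leading "www."."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def build_scope(scope_urls):
    """Collect the scope domains once, before the crawl starts."""
    return {domain_of(u) for u in scope_urls}


def in_scope(link: str, scope_domains) -> bool:
    """Check the link's own domain with a single set lookup."""
    return domain_of(link) in scope_domains


def make_node(link: str) -> dict:
    """Attach a fresh UUID so the post-processor has a key for the node."""
    return {"id": str(uuid.uuid4()), "url": link, "domain": domain_of(link)}


scope = build_scope(["https://www.cnn.com", "https://www.nytimes.com"])
print(in_scope("https://www.nytimes.com/2020/11/17/story.html", scope))
```

Because the check depends only on the link itself, it does not matter which site the crawler started from. The sketch matches exact hosts, so subdomains other than "www." would need extra handling.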
Constructing framework for application
- framework to be constructed
Modification of crawler to gather plain text version of the crawled articles
- completed
Meeting on Friday to see if we can relaunch the crawl