November 17, 2020 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- Reminder of coding standards
  - Name pull requests/branches after ticket as per coding standards document
  - Check if your build is passing
- Ticket Review
Meeting Notes
- To keep better track of PRs, name each PR after its ticket
- Crawler has been started! The Twitter crawler is working smoothly; the domain crawler is returning errors that we need to address
- Scope parser validation - the "url is alive" function times out the script
  - checking the format of URLs for validation
  - Nat's suggestion: check the headers in the response to validate
  - adding "http" when making the call if it is missing; this will likely need to be in Raiyan's script as well
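The validation ideas above (prepend a missing scheme, check the URL format, then look at the response headers with a bounded wait) could be sketched roughly as follows. This is only an illustration in Python using the standard library; the function names `ensure_scheme`, `has_valid_format`, and `responds_with_headers` and the 5-second timeout are assumptions, not the scope parser's actual code.

```python
import urllib.request
from urllib.parse import urlparse


def ensure_scheme(url, default_scheme="http"):
    """Prepend a scheme when a scope entry lacks one (e.g. 'cnn.com')."""
    url = url.strip()
    if "://" not in url:
        url = f"{default_scheme}://{url}"
    return url


def has_valid_format(url):
    """Cheap syntactic check: http(s) scheme, a dotted host name, no spaces."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname or ""
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. a broken IPv6 literal
        return False
    return parsed.scheme in ("http", "https") and "." in host and " " not in url


def responds_with_headers(url, timeout=5.0):
    """Send a HEAD request and report whether a non-error response came back.

    The timeout bounds each request so that one dead host cannot stall the
    whole script (hypothetical default of 5 seconds).
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # URLError, HTTPError, and socket timeouts are all OSError subclasses
        return False
```

Running the format check before any network call means malformed entries are rejected without waiting on a request, and only well-formed URLs pay the cost of the header check.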
- Accepting a .csv file - to be integrated by Raiyan
- Modify filter to permit storage of URLs
  - completed & reviewed by Alex and Raiyan
- Test stack as it stands - first crawl has been started, and errors need to be dealt with
  - once a full run of the scope has been completed, this ticket will be closed
- Twitter crawler code - completed by Danhua
  - now accepts variables for time & keywords
- Integrate date detection in crawler - completed & needs to be tested
- MediaCat Domain crawler
  - as the domain crawler goes through links, the domain is not updated properly
    - e.g. if crawling CNN, it checks against CNN, but after starting to crawl NYTimes it still checks against CNN and dismisses entries as out of scope
    - addressed by a loop that checks the current link against the scope domain links (slows performance)
  - UUID for each JSON node needs to be re-added for the post-processor's use as key
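One way to avoid the slow loop over scope links would be to reduce the scope to a set of domains once and test each link's host against that set. The sketch below is in Python for illustration only; `domain_of`, `build_scope`, `in_scope`, `make_node`, and the node fields are hypothetical names, and the crawler's real data structures are not described in these notes. It also shows a fresh UUID being attached to each node so the post-processor has a key.

```python
import uuid
from urllib.parse import urlparse


def domain_of(url):
    """Lower-cased host name of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def build_scope(scope_urls):
    """Collect the scope domains once, skipping entries with no host."""
    return {d for d in (domain_of(u) for u in scope_urls) if d}


def in_scope(link, scope_domains):
    """True if the link's host is a scope domain or a subdomain of one."""
    parts = domain_of(link).split(".")
    # Try the host and each parent domain: edition.cnn.com, then cnn.com, ...
    return any(".".join(parts[i:]) in scope_domains for i in range(len(parts)))


def make_node(url, scope_domains):
    """Build a node dict with a fresh UUID for use as a key downstream."""
    return {
        "id": str(uuid.uuid4()),
        "url": url,
        "in_scope": in_scope(url, scope_domains),
    }
```

Because every link is checked against the full set rather than against a single "current" domain, a NYTimes link is judged in scope regardless of which site the crawl started from, and each check costs a few set lookups instead of a pass over all scope links.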
- Constructing framework for application
  - framework to be constructed
- Modification of crawler to gather plain text version of the crawled articles
  - completed
- Meeting on Friday to see if we can relaunch the crawl