November 17, 2020 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Reminder of coding standards
  • Name pull requests/branches after ticket as per coding standards document
  • Check if your build is passing
  • Ticket Review

Meeting Notes

  • To keep better track of PRs, name each PR and branch after its ticket

  • The crawl has been started. The Twitter crawler is working smoothly; the domain crawler is returning errors that we need to address

  • Scope parser validation: the "url is alive" function causes the script to time out

    • checking the format of URLs as part of validation
    • Nat's suggestion: check the headers in the response to validate
    • adding "http" when making the call if it is missing; this will likely need to be in Raiyan's script as well
  • Accepting a .csv file - to be integrated by Raiyan

  • Modify filter to permit storage of URLs

    • completed & reviewed by Alex and Raiyan
  • Test stack as it stands - first crawl has been started, and errors need to be dealt with

    • once a full run of the scope has been completed this ticket will be closed
  • Twitter crawler code - completed by Danhua

    • now accepts variables for time & keywords
  • Integrate date detection in crawler - completed & needs to be tested

  • MediaCat Domain crawler

    • as the domain crawler goes through links, the domain it checks against is not updated properly
      • e.g. while crawling CNN it checks links against CNN, but after starting to crawl NYTimes it still checks against CNN and dismisses entries as out of scope
      • addressed by a loop that checks the current link against the scope's domain links (this slows performance)
    • the UUID for each JSON node needs to be re-added so the post-processor can use it as a key
  • Constructing framework for application

    • framework to be constructed
  • Modification of crawler to gather plain text version of the crawled articles

    • completed
  • meeting on Friday to see if we can relaunch crawl
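
The scope parser validation items above (format check, header check, adding "http" when missing) could be sketched roughly as follows. This is only an illustration: the function names, the 5-second timeout, and the choice of Python's standard library are assumptions, not the project's actual code.

```python
import http.client
import urllib.error
import urllib.request
from urllib.parse import urlparse


def normalize_url(raw):
    """Prepend "http://" when the scope entry has no http(s) scheme."""
    url = raw.strip()
    if not url.lower().startswith(("http://", "https://")):
        url = "http://" + url
    return url


def has_valid_format(url):
    """Cheap structural check: http(s) scheme and a dotted host name."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and "." in parts.netloc


def is_alive(url, timeout=5.0):
    """Send a HEAD request so only headers are fetched, never the body.

    The explicit timeout (an assumed value) bounds how long one dead
    host can stall the run; any network failure counts as "not alive".
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return False
```

Running the format check first means malformed entries are rejected without any network call, and the per-request timeout keeps a single unresponsive site from timing out the whole script. Some servers reject HEAD requests, so a fallback to GET may be needed in practice.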
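
For the domain crawler issue noted above, where each link is compared against the scope domains in a loop, one possible alternative is to precompute a set of scope hosts and look up each link's host in it. The sketch below is hypothetical: the helper names and the "www." stripping rule are assumptions, and the real crawler may be written in a different language.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lower-cased host name without a leading "www."."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def build_scope(scope_urls):
    """Precompute the set of in-scope hosts once, before crawling."""
    hosts = {host_of(u) for u in scope_urls}
    hosts.discard("")  # drop entries whose host could not be parsed
    return hosts


def in_scope(link, scope_hosts):
    """True if the link's host is a scope host or a subdomain of one."""
    host = host_of(link)
    if not host:
        return False
    labels = host.split(".")
    # Try "edition.cnn.com", then "cnn.com", then "com".
    return any(".".join(labels[i:]) in scope_hosts for i in range(len(labels)))


scope = build_scope(["https://www.cnn.com", "https://www.nytimes.com"])
print(in_scope("https://www.nytimes.com/2020/11/17/world/story.html", scope))
print(in_scope("https://example.org/page", scope))
```

Because every link is checked against the whole scope rather than only the site currently being crawled, a NYTimes link is not dismissed while the crawler still has CNN as its starting point, and the set lookup avoids scanning the full scope list for each link.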