November 24, 2020

Agenda

  • Progress of the crawl
  • Passing domain names to the crawler function needs to be pushed and tested
  • Crawler is only bringing back the homepage
  • Date crawler needs to be re-written
  • Demo of post processor
  • OS updates
  • Progress on visualizations
  • Ticket Review

Meeting notes

  • Twitter crawl was completed and full historical data was collected! This took ~1 week

    • the alternate version that uses the Twitter API will be polished and provided publicly as part of MediaCAT
  • Crawler only bringing back the homepage and exiting

    • likely due to the link-crawling limit being set at 20; the crawler reaches the limit before it actually gets to crawl any articles
    • defining this limit: since JSON is currently output only after the crawl completes, how do we define limits? Or do we write to individual JSON files while the crawl is running?
    • do we want to use a database instead (e.g. MongoDB)?
    • the asynchronous nature of the crawls, and making the crawl "infinite", mean that we may be writing to JSON files while reading from them; this is a reason to consider a database
    • currently there is one JSON file output for everything; this will create tracking issues in the future, which is a reason to switch to one JSON file per link
      • JSON files can be written to a directory as the crawl is happening
    • individual JSON files need to be created (see the sketch below), but a database is not a priority at this point
      • this will affect the crawl output, as well as the postprocessing and the date crawler
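
A minimal sketch of the one-JSON-file-per-link idea, assuming the crawler produces a result dict per URL; the directory layout and field names here are illustrative, not the project's actual schema:

```python
import json
import uuid
from pathlib import Path

def write_result(result: dict, output_dir: str = "crawl_output") -> Path:
    """Write one crawled link's data to its own JSON file, named by a fresh UUID."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    record_id = str(uuid.uuid4())  # the UUID doubles as filename and record ID
    record = {"id": record_id, **result}
    path = out / f"{record_id}.json"
    # Write to a temp file first, then rename, so a concurrent reader
    # (e.g. the post-processor) never sees a half-written JSON file.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(record, ensure_ascii=False, indent=2))
    tmp.rename(path)
    return path

# Hypothetical call inside the crawl loop:
# write_result({"url": url, "title": title, "found_urls": links})
```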
  • Postprocessor update

    • the matching between domain and Twitter data is working
    • right now the code can read all the individual JSON files at once
    • next: incorporating updates at a defined interval (e.g. every 3 hours) on the number of URLs per domain that have been crawled in a given timeframe (see the sketch below)
    • to be re-run (manually triggered) at different intervals to create additional linkages
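
A sketch of what the interval update could look like: counting crawled URLs per domain among the per-link JSON files newer than a cutoff. Using the file modification time as the crawl time, and a "url" field in each record, are assumptions:

```python
import json
from collections import Counter
from datetime import datetime, timedelta
from pathlib import Path
from urllib.parse import urlparse

def count_urls_per_domain(output_dir: str, since_hours: int = 3) -> Counter:
    """Count crawled URLs per domain among JSON records newer than the cutoff."""
    cutoff = datetime.now() - timedelta(hours=since_hours)
    counts = Counter()
    for path in Path(output_dir).glob("*.json"):
        # File mtime stands in for the crawl time; a timestamp stored in
        # the record itself would work just as well.
        if datetime.fromtimestamp(path.stat().st_mtime) < cutoff:
            continue
        record = json.loads(path.read_text())
        domain = urlparse(record.get("url", "")).netloc
        if domain:
            counts[domain] += 1
    return counts

# e.g. count_urls_per_domain("crawl_output", since_hours=3)
```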
  • Ticket review

    • Add a function to get the date from the command line and validate the date arguments given (a sketch follows this list)
      • this was implemented to verify the crawler dates
    • Scope fix merged
    • Add pre_processor for twitter output - to be reviewed
    • Upgrade OS merged
    • Scope parser validation - error checking needs to be added to the initial validation to ensure each URL begins with http:// or https:// (see the second sketch after this list)
    • Accepting a .csv file from the parser to populate the initial queue - in progress
    • MediaCat Domain Crawler to-dos
      • make multiple individual JSON files
      • add UUIDs
      • provide sample of individual JSONs to Amy for testing
    • Create post-processor framework - in progress and will need testing
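
A minimal sketch of the command-line date validation described above, using argparse; the flag names and the YYYY-MM-DD format are assumptions, not the crawler's actual interface:

```python
import argparse
from datetime import datetime

def valid_date(value: str) -> datetime:
    """Parse a YYYY-MM-DD argument, raising an argparse-friendly error otherwise."""
    try:
        return datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        raise argparse.ArgumentTypeError(f"not a valid date (expected YYYY-MM-DD): {value!r}")

parser = argparse.ArgumentParser(description="date crawler")
parser.add_argument("--start-date", type=valid_date, required=True)
parser.add_argument("--end-date", type=valid_date, required=True)
args = parser.parse_args()
if args.start_date > args.end_date:
    parser.error("--start-date must not be after --end-date")
```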
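
And a combined sketch for the two scope tickets: reading a .csv of scope URLs into the initial queue while rejecting entries that do not start with http:// or https://. The "url" column name and one-URL-per-row layout are assumptions about the scope file:

```python
import csv

def load_scope_queue(csv_path: str) -> list:
    """Populate the initial crawl queue from a scope CSV, validating URL schemes."""
    queue = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            url = (row.get("url") or "").strip()
            if not url.startswith(("http://", "https://")):
                raise ValueError(f"scope URL must begin with http:// or https://: {url!r}")
            queue.append(url)
    return queue
```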
  • New tickets

    • task added to MediaCat Domain Crawler - write crawler progress (domain and number of URLs crawled) to a CSV, e.g. every 3 hours (see the sketch below)
    • New repository with API-based code - for the Twitter API-supported version of the Twitter crawler
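
A sketch of the progress-CSV ticket, appending one timestamped row per domain; the file name, the columns, and the reuse of the per-domain counts from the post-processor sketch above are all assumptions:

```python
import csv
from datetime import datetime
from pathlib import Path

def append_progress(counts: dict, csv_path: str = "crawl_progress.csv") -> None:
    """Append one timestamped (domain, urls_crawled) row per domain to the progress CSV."""
    path = Path(csv_path)
    new_file = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["timestamp", "domain", "urls_crawled"])
        now = datetime.now().isoformat(timespec="seconds")
        for domain, n in sorted(counts.items()):
            writer.writerow([now, domain, n])

# A scheduler (cron, or a timer inside the crawler) would call this every 3 hours,
# e.g. append_progress(count_urls_per_domain("crawl_output", since_hours=3))
```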
  • Resources