November 24, 2020

Agenda

  • Progress of the crawl
  • Passing domain names to the crawler function needs to be pushed and tested
  • Crawler is only bringing back the homepage
  • Date crawler needs to be re-written
  • Demo of post processor
  • OS updates
  • Progress on visualizations
  • Ticket Review

Meeting notes

  • Twitter crawl was completed and full historical data was collected! This took ~1 week

    • the alternate version that uses the Twitter API will be polished and provided publicly as part of MediaCAT
  • Crawler only bringing back the homepage and exiting

    • likely due to the link-crawling limit being set at 20; the crawler reaches the limit before it actually gets to crawl any articles
    • defining this limit: since JSON is currently output only after the crawl completes, how do we define limits? Or do we write to individual JSON files while the crawl is running?
    • do we want to use a database instead (e.g. MongoDB)?
    • the asynchronous nature of the crawls, and making the crawl "infinite", mean that we may be writing to JSON files while reading from them; this is a reason to consider a database
    • currently there is one JSON file output for everything; this will create tracking issues in the future, which is a reason to switch to one JSON file per link
      • JSON files can be written to a directory as the crawl is happening
    • individual JSON files need to be created (see the sketch below), but a database is not a priority at this point
      • this will affect the crawl output, as well as the postprocessing and the date crawler
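
A minimal sketch of the one-JSON-file-per-link idea, assuming the crawler produces a result dict per URL; the directory layout and field names here are illustrative, not the project's actual schema:

```python
import json
import uuid
from pathlib import Path

def write_result(result: dict, output_dir: str = "crawl_output") -> Path:
    """Write one crawled link's data to its own JSON file, named by a fresh UUID."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    record_id = str(uuid.uuid4())  # the UUID doubles as filename and record ID
    record = {"id": record_id, **result}
    path = out / f"{record_id}.json"
    # Write to a temp file first, then rename, so a concurrent reader
    # (e.g. the post-processor) never sees a half-written JSON file.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(record, ensure_ascii=False, indent=2))
    tmp.rename(path)
    return path

# Hypothetical call inside the crawl loop:
# write_result({"url": url, "title": title, "found_urls": links})
```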
  • Postprocessor update

    • the matching between domain and Twitter data is working
    • right now the code can read all the individual JSON files at once
    • next: incorporating updates at a defined interval (e.g. every 3 hours) on the number of URLs per domain that have been crawled in a given timeframe (see the sketch below)
    • to be re-run (manually triggered) at different intervals to create additional linkages
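
A sketch of what the interval update could look like: counting crawled URLs per domain among the per-link JSON files newer than a cutoff. Using the file modification time as the crawl time, and a "url" field in each record, are assumptions:

```python
import json
from collections import Counter
from datetime import datetime, timedelta
from pathlib import Path
from urllib.parse import urlparse

def count_urls_per_domain(output_dir: str, since_hours: int = 3) -> Counter:
    """Count crawled URLs per domain among JSON records newer than the cutoff."""
    cutoff = datetime.now() - timedelta(hours=since_hours)
    counts = Counter()
    for path in Path(output_dir).glob("*.json"):
        # File mtime stands in for the crawl time; a timestamp stored in
        # the record itself would work just as well.
        if datetime.fromtimestamp(path.stat().st_mtime) < cutoff:
            continue
        record = json.loads(path.read_text())
        domain = urlparse(record.get("url", "")).netloc
        if domain:
            counts[domain] += 1
    return counts

# e.g. count_urls_per_domain("crawl_output", since_hours=3)
```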
  • Ticket review

    • Add a function to get the date from the command line and validate the date arguments given (a sketch follows this list)
      • this was implemented to verify the crawler dates
    • Scope fix merged
    • Add pre_processor for twitter output - to be reviewed
    • Upgrade OS merged
    • Scope parser validation - error checking needs to be added to the initial validation to ensure each URL begins with http:// or https:// (see the second sketch after this list)
    • Accepting a .csv file from the parser to populate the initial queue - in progress
    • MediaCat Domain Crawler to-dos
      • make multiple individual JSON files
      • add UUIDs
      • provide sample of individual JSONs to Amy for testing
    • Create post-processor framework - in progress and will need testing
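
A minimal sketch of the command-line date validation described above, using argparse; the flag names and the YYYY-MM-DD format are assumptions, not the crawler's actual interface:

```python
import argparse
from datetime import datetime

def valid_date(value: str) -> datetime:
    """Parse a YYYY-MM-DD argument, raising an argparse-friendly error otherwise."""
    try:
        return datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        raise argparse.ArgumentTypeError(f"not a valid date (expected YYYY-MM-DD): {value!r}")

parser = argparse.ArgumentParser(description="date crawler")
parser.add_argument("--start-date", type=valid_date, required=True)
parser.add_argument("--end-date", type=valid_date, required=True)
args = parser.parse_args()
if args.start_date > args.end_date:
    parser.error("--start-date must not be after --end-date")
```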
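
And a combined sketch for the two scope tickets: reading a .csv of scope URLs into the initial queue while rejecting entries that do not start with http:// or https://. The "url" column name and one-URL-per-row layout are assumptions about the scope file:

```python
import csv

def load_scope_queue(csv_path: str) -> list:
    """Populate the initial crawl queue from a scope CSV, validating URL schemes."""
    queue = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            url = (row.get("url") or "").strip()
            if not url.startswith(("http://", "https://")):
                raise ValueError(f"scope URL must begin with http:// or https://: {url!r}")
            queue.append(url)
    return queue
```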
  • New tickets

    • task added to MediaCat Domain Crawler - write crawler progress (domain and number of URLs crawled) to a CSV, e.g. every 3 hours (see the sketch below)
    • New repository with API-based code - for the Twitter API-supported version of the Twitter crawler
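
A sketch of the progress-CSV ticket, appending one timestamped row per domain; the file name, the columns, and the reuse of the per-domain counts from the post-processor sketch above are all assumptions:

```python
import csv
from datetime import datetime
from pathlib import Path

def append_progress(counts: dict, csv_path: str = "crawl_progress.csv") -> None:
    """Append one timestamped (domain, urls_crawled) row per domain to the progress CSV."""
    path = Path(csv_path)
    new_file = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["timestamp", "domain", "urls_crawled"])
        now = datetime.now().isoformat(timespec="seconds")
        for domain, n in sorted(counts.items()):
            writer.writerow([now, domain, n])

# A scheduler (cron, or a timer inside the crawler) would call this every 3 hours,
# e.g. append_progress(count_urls_per_domain("crawl_output", since_hours=3))
```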
  • Resources