September 16, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Welcome Colin!
  • Meeting scheduling going forward
  • Where is the NYT data that needs to be processed for MediaCat? The Twitter crawler has completed its crawl of the NYTimes Twitter handles; the results are stored in '/media/data/twitter_crawl/mediacat-twitter-crawler/csv-nytimes/'.

The domain crawler is currently running (as of August 6th, 2021) under two screens: one for the NYTimes Politics section and another for the NYTimes Middle East section. To check which screens are alive, use the screen -ls command; this displays the screens along with their ID numbers. To reattach one of the screens, use the command screen -r <ID>. The domain crawlers are stored in the '/media/data/batch-crawl' directory.

The results folders containing the JSON files are:

  • NYTimes Politics: '/media/data/batch-crawl/nytimes-politics/mediacat-domain-crawler/newCrawler/Results/'
  • NYTimes Middle East: '/media/data/batch-crawl/nytimes-middle-east/mediacat-domain-crawler/newCrawler/Results/'

The log files are found under:

  • NYTimes Politics: '/media/data/batch-crawl/nytimes-politics/mediacat-domain-crawler/newCrawler/logs/'
  • NYTimes Middle East: '/media/data/batch-crawl/nytimes-middle-east/mediacat-domain-crawler/newCrawler/logs/'

While a crawl is running, its log file sits in the newCrawler folder; once the crawl period has ended and the crawler is to be restarted, the log file is moved to the logs folder.
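The monitoring steps above can be sketched as a small shell helper. The screen commands are the ones from the notes; the `count_results` function and the demo directory are hypothetical illustrations, not part of the MediaCat codebase:

```shell
#!/bin/sh
# Commands from the notes for managing the crawl screens:
#   screen -ls        # list live screen sessions and their IDs
#   screen -r <ID>    # reattach to a session

# Hypothetical helper: count crawl result JSON files in a Results directory,
# e.g. .../mediacat-domain-crawler/newCrawler/Results/
count_results() {
  find "$1" -maxdepth 1 -name '*.json' | wc -l | tr -d ' '
}

# Self-contained demo against a temporary mock Results directory:
demo_dir="$(mktemp -d)"
touch "$demo_dir/article1.json" "$demo_dir/article2.json"
count_results "$demo_dir"   # prints 2
rm -rf "$demo_dir"
```

Running `count_results` against each of the two Results directories above would give a quick progress check without reattaching to the screens.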

Update

  • John has managed to run the postprocessor.

Next steps:

  • Colin & John will meet so that John can introduce Compute Canada
  • Alejandro will paste any Compute Canada resource issues into the agenda
  • Reach out to Raiyan to see if he can find a recording about Compute Canada
  • John will run the postprocessor and record the size of the input data from TWINT & the domain crawler, how long the run takes, and the size of the output file
  • John sees a memory error; it could be that the crawl processes never finished and are now taking up memory. We will write to Raiyan to ask whether we can kill these processes to free space for John to run the postprocessor.
  • The next issue is to examine the postprocessor's output and determine whether we can run a "single-site" postprocessor on that output or whether some change is needed.
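Before writing to Raiyan about the memory error, a quick look at what is consuming memory on the server may help confirm the leftover-process theory. This is a generic Linux sketch; the filter term `node` is an assumption about the crawler's process name, not something stated in the notes:

```shell
# Show the ten processes using the most memory (GNU/Linux ps):
ps aux --sort=-%mem | head -n 10

# Look for crawler processes by name ("node" is an assumed process name;
# adjust to however the domain crawler actually runs). "|| true" keeps the
# script from failing when no match is found.
pgrep -af node || true
```

If the old crawl processes show up near the top of the list, their PIDs can be passed to Raiyan along with the request to kill them.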