September 16, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Welcome Colin!
  • Meeting scheduling going forward
  • Where is the NYT data that needs to be processed for MediaCat? The Twitter crawler has completed its crawl of the NYTimes Twitter handles; the results are stored in '/media/data/twitter_crawl/mediacat-twitter-crawler/csv-nytimes/'.

The domain crawler is currently running (as of August 6th, 2021) under two screens: one for the NYTimes Politics section and another for the NYTimes Middle East section. To check which screens are alive, use the screen -ls command; this displays the screens along with their ID numbers. To reattach one of the screens, use the command screen -r <ID>. The domain crawlers are stored in the '/media/data/batch-crawl' directory.

The results folders containing the JSON files are:

  • NYTimes Politics: '/media/data/batch-crawl/nytimes-politics/mediacat-domain-crawler/newCrawler/Results/'
  • NYTimes Middle East: '/media/data/batch-crawl/nytimes-middle-east/mediacat-domain-crawler/newCrawler/Results/'

The log files are found under:

  • NYTimes Politics: '/media/data/batch-crawl/nytimes-politics/mediacat-domain-crawler/newCrawler/logs/'
  • NYTimes Middle East: '/media/data/batch-crawl/nytimes-middle-east/mediacat-domain-crawler/newCrawler/logs/'

While a crawl is running, its log file sits in the newCrawler folder; once the crawl period has ended and the crawler is to be restarted, the log file is moved to the logs folder.
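The monitoring steps above can be sketched as a small shell helper. The screen commands are the ones from the notes; the `count_results` function and the demo directory are hypothetical illustrations, not part of the MediaCat codebase:

```shell
#!/bin/sh
# Commands from the notes for managing the crawl screens:
#   screen -ls        # list live screen sessions and their IDs
#   screen -r <ID>    # reattach to a session

# Hypothetical helper: count crawl result JSON files in a Results directory,
# e.g. .../mediacat-domain-crawler/newCrawler/Results/
count_results() {
  find "$1" -maxdepth 1 -name '*.json' | wc -l | tr -d ' '
}

# Self-contained demo against a temporary mock Results directory:
demo_dir="$(mktemp -d)"
touch "$demo_dir/article1.json" "$demo_dir/article2.json"
count_results "$demo_dir"   # prints 2
rm -rf "$demo_dir"
```

Running `count_results` against each of the two Results directories above would give a quick progress check without reattaching to the screens.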

Update

  • John has managed to run the postprocessor.

Next steps:

  • Colin & John will meet so that John can introduce Compute Canada
  • Alejandro will paste any Compute Canada resource issues into the agenda
  • Reach out to Raiyan to see if he can find a recording about Compute Canada
  • John will run the postprocessor and record the size of the input data from TWINT & the domain crawler, how long the run takes, and the size of the output file
  • John sees a memory error; it could be that the crawl processes never finished and are now taking up memory. We will write to Raiyan to ask whether we can kill these processes to free space for John to run the postprocessor.
  • The next issue is to examine the postprocessor's output and determine whether we can run a "single-site" postprocessor on that output or whether some change is needed.
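Before writing to Raiyan about the memory error, a quick look at what is consuming memory on the server may help confirm the leftover-process theory. This is a generic Linux sketch; the filter term `node` is an assumption about the crawler's process name, not something stated in the notes:

```shell
# Show the ten processes using the most memory (GNU/Linux ps):
ps aux --sort=-%mem | head -n 10

# Look for crawler processes by name ("node" is an assumed process name;
# adjust to however the domain crawler actually runs). "|| true" keeps the
# script from failing when no match is found.
pgrep -af node || true
```

If the old crawl processes show up near the top of the list, their PIDs can be passed to Raiyan along with the request to kill them.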