June 30, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Crawler Update: I restarted the crawler twice while working on a script that restarts it automatically. So far, 41,940 JSON files have been created over roughly 100 hours, approximately 7 links per minute. The Python script runs the crawler and automatically restarts it after a specified amount of time: given a 24-hour limit, for example, it will restart the crawler every 24 hours without any user intervention. It also preserves the logs from each run separately, which is important because previously every subsequent run would overwrite the log files and we would lose them. The script is mostly done; it just needs more testing with longer intervals to make sure it keeps working. I also have to analyze the logs further to determine the best restart interval. Once I have a good idea of when errors start appearing frequently enough to become a problem, I will set that as the default restart time. For now, it is set to 24 hours.
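
    A rough sketch of the restart logic described above (the function name, arguments, and log-archiving scheme here are illustrative, not the actual script):

    ```python
    import shutil
    import subprocess
    import time
    from pathlib import Path

    def run_with_restarts(cmd, restart_after, log_file, archive_dir, max_cycles=None):
        """Run `cmd`, restarting it every `restart_after` seconds.

        After each cycle, the crawler's log file is copied into `archive_dir`
        under a timestamped name, so a later run cannot overwrite it.
        """
        archive = Path(archive_dir)
        archive.mkdir(parents=True, exist_ok=True)
        cycle = 0
        while max_cycles is None or cycle < max_cycles:
            proc = subprocess.Popen(cmd)
            try:
                # Wait up to the time limit; returns early if the crawler exits.
                proc.wait(timeout=restart_after)
            except subprocess.TimeoutExpired:
                # Time limit reached: stop the crawler so the loop restarts it.
                proc.terminate()
                proc.wait()
            # Preserve this cycle's log under a unique name before the next run.
            log = Path(log_file)
            if log.exists():
                stamp = time.strftime("%Y%m%d-%H%M%S")
                shutil.copy(log, archive / f"{log.stem}-{stamp}-cycle{cycle}{log.suffix}")
            cycle += 1
    ```

    With `max_cycles=None` the loop runs indefinitely, matching the "restart every 24 hours without a user having to do anything" behaviour; passing `restart_after=24 * 60 * 60` would correspond to the current default.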

  • Twitter Crawler Update: I ran the Twitter crawler, and it finished in around 29 hours for 320 Twitter handles. I looked through some of the generated CSV files: some contained only a couple of tweets (file sizes around 4.9 KB), while others contained a very large number of tweets (file sizes around 443.4 MB). There were no critical errors or bugs, and the crawler stopped once it was complete.