June 17, 2021

Status of Twitter Crawl:

  • running as expected, and the documentation is sufficient to set it up

  • the exact status of the current run is unclear

  • is there anything in the documentation about monitoring? there doesn't seem to be; each handle has its own CSV, so the number of CSVs could serve as a progress check (see the sketch below); Raiyan will check how many CSVs to expect
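
A rough way to monitor progress in the meantime (not part of the documented tooling) would be to count the per-handle CSVs as they appear; the output directory name below is an assumption:

```python
import glob
import os

# Hypothetical output directory; adjust to wherever the Twitter
# crawler writes its per-handle CSVs.
OUTPUT_DIR = "twitter_output"

csv_paths = sorted(glob.glob(os.path.join(OUTPUT_DIR, "*.csv")))
print(f"{len(csv_paths)} handle CSVs written so far")
for path in csv_paths:
    print(os.path.basename(path))
```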

  • ?: is it possible to resume the crawl from where it left off? Raiyan will check

Crawler issues:

  • meeting scheduled for next week with Nat & other devs

  • ran NYT alone with some modifications and with error monitoring, but the monitoring didn't seem to work properly; next idea: add timestamps to logged errors (sketch below)
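
A minimal sketch of what timestamped error logging could look like, assuming a Python monitoring script and using only the standard library; the log file name is illustrative:

```python
import logging

# Every record gets an asctime prefix, so errors can be matched
# against the crawl timeline.
logging.basicConfig(
    filename="crawl_errors.log",
    level=logging.ERROR,
    format="%(asctime)s %(levelname)s %(message)s",
)

try:
    raise RuntimeError("example crawl failure")
except RuntimeError as exc:
    logging.error("crawl failed: %s", exc)
```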

  • checking the crawler yesterday: 14,902 links crawled, 121,000+ links still to crawl, and 10,697 JSON files created, so not every crawled link has a JSON file yet

  • ?: any chance of automatically renaming output files to prevent overwriting, e.g. by adding a timestamp to the file name? (sketch below)
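
One possible naming scheme, sketched here as a suggestion rather than the implemented behaviour, is to append a Unix timestamp before the extension:

```python
import os
import time

def timestamped_name(path: str) -> str:
    """Return path with a Unix timestamp appended, so a rerun
    never overwrites an earlier result."""
    base, ext = os.path.splitext(path)
    return f"{base}_{int(time.time())}{ext}"

print(timestamped_name("article.json"))  # e.g. article_1623945600.json
```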

  • with the post-processor it should be possible to identify which JSONs weren't produced: gather all crawled URLs and diff them against the URLs that do have a JSON (see the sketch below). It may not be worth worrying about JSONs that aren't produced once the timestamp implementation is complete and ensures JSONs aren't overwritten; the timestamp implementation has probably already avoided this, and work is now under way on a feature that tells the user to restart when there's an infinite-loop error
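
A sketch of that diff, with hypothetical paths and the assumption that each result JSON stores its source URL under a "url" key:

```python
import glob
import json
import os

CRAWLED_URLS_FILE = "crawled_urls.txt"  # one URL per line (assumed)
RESULTS_DIR = "results"                 # per-article JSON files (assumed)

with open(CRAWLED_URLS_FILE) as fh:
    crawled = {line.strip() for line in fh if line.strip()}

produced = set()
for path in glob.glob(os.path.join(RESULTS_DIR, "*.json")):
    with open(path) as fh:
        produced.add(json.load(fh).get("url"))

missing = crawled - produced
print(f"{len(missing)} URLs have no JSON result")
```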

  • another issue: whether we can create a larger Compute Canada instance; Raiyan will take a day to learn how Compute Canada works and see whether the next NYT crawl can run on a bigger instance

  • ?: question about what accounts for the 28.9 GB of storage used

  • it's probable that the space issue actually comes from the result files (created by us) and the queue JSON files:

  • the queue is not just a list of URLs; rather, each URL in the queue is stored as its own JSON file created by Apify (see the sketch below)
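
Since each queued request is its own JSON file, the queue's footprint can be measured by summing those files; the path below assumes the Apify SDK's default local storage layout and may differ in this deployment:

```python
import glob
import os

QUEUE_DIR = "apify_storage/request_queues/default"  # assumed default path

paths = glob.glob(os.path.join(QUEUE_DIR, "*.json"))
total_bytes = sum(os.path.getsize(p) for p in paths)
print(f"{len(paths)} queued request files, {total_bytes / 1e6:.1f} MB total")
```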

  • 3 things to check after 24 hours of crawling: (1) the size of a typical crawled-article JSON file, (2) the size of all folders, i.e. an analysis of where storage is going, and (3) the total size of the ~10,000 JSON files that correspond to crawled articles (sketch below)
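
A sketch of those three checks in one script; the directory names are assumptions standing in for the real crawl output locations:

```python
import glob
import os

RESULTS_DIR = "results"  # per-article JSON files (assumed)
CRAWL_ROOT = "."         # root containing the crawl folders (assumed)

def dir_size(root: str) -> int:
    """Total size in bytes of every file under root."""
    return sum(
        os.path.getsize(os.path.join(dirpath, name))
        for dirpath, _, names in os.walk(root)
        for name in names
    )

article_files = glob.glob(os.path.join(RESULTS_DIR, "*.json"))
sizes = sorted(os.path.getsize(p) for p in article_files)

# (1) size of a typical article JSON (median)
if sizes:
    print(f"typical article JSON: {sizes[len(sizes) // 2] / 1024:.1f} KB")

# (2) per-folder storage breakdown
for entry in sorted(os.listdir(CRAWL_ROOT)):
    full = os.path.join(CRAWL_ROOT, entry)
    if os.path.isdir(full):
        print(f"{entry}: {dir_size(full) / 1e9:.2f} GB")

# (3) combined size of all crawled-article JSON files
print(f"{len(article_files)} article JSONs, {sum(sizes) / 1e9:.2f} GB total")
```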