June 03, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

  • Issue 19: how the crawler is running with Raiyan's update:

** ran the whole scope: 2 days in, the infinite loop problem started again

** giving equal importance to all domains worked; however, the memory issue is causing most domains not to work

** as the queue processes more and more links, it uses the call stack, and there is a limit to how large the call stack can be; once it fails to crawl a number of links repeatedly, it goes into an infinite loop

** same problem both locally and on server

** make smaller batches to try and solve memory issue

** tested NYT alone and it worked; then created a small batch of 3 domains, which also worked

** checked some yesterday: crawling well, e.g. NYT: 23,000+ links, 3,600+ JSON files over 2 days; 40,421 links over the 3 domains, 9,507 JSON files

** Raiyan will add a feature so that the program prints a message when it is done, so we can better gauge the crawl speed.

** why there are fewer JSON files than links crawled: each JSON is opened as a file and then written into, so previous ones could be overwritten if they have the same name; some of the URLs are similar, so the hash could be generating an identical name; UUIDs are being used, but it is not clear how the URL is used to create the UUID; Raiyan will take a look to ensure that they don't end up identical; could append date/time after the UUID

** Raiyan will also check out the bash script, but for now
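
The file-naming concern above can be sketched as follows. This is a minimal Python illustration, not the crawler's actual code: `json_filename` is a hypothetical helper, and it assumes the name is derived from the URL with a name-based UUID (`uuid.uuid5`), which maps the same URL to the same UUID every time. Appending a timestamp, as suggested in the notes, would keep a re-crawl of the same URL from overwriting the earlier file. The second URL below is made up for the comparison.

```python
import uuid
from datetime import datetime, timezone


def json_filename(url, crawled_at=None):
    """Hypothetical helper: build a JSON file name from a URL and crawl time."""
    # uuid5 is deterministic: the same URL always gives the same UUID,
    # while different URLs (even very similar ones) give different UUIDs.
    url_id = uuid.uuid5(uuid.NAMESPACE_URL, url)
    # Assumption: a UTC timestamp with microseconds is distinct per crawl.
    when = crawled_at or datetime.now(timezone.utc)
    stamp = when.strftime("%Y%m%dT%H%M%S%f")
    return f"{url_id}_{stamp}.json"


# Two similar URLs (the second is illustrative) get different file names.
first = json_filename("https://www.nytimes.com/section/world/middleeast")
second = json_filename("https://www.nytimes.com/section/world/middleeast/")
print(first != second)
```

If the existing naming turns out to use a random UUID (`uuid.uuid4`) instead, collisions from similar URLs would be very unlikely and the missing files would need another explanation.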

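The call-stack limit discussed above can be illustrated with a small sketch. It assumes, hypothetically, that failed links are put back for another attempt; `crawl_all`, `fetch`, and `MAX_ATTEMPTS` are illustrative names and are not taken from the crawler. Processing an explicit queue in a loop keeps the stack depth constant, and capping attempts per link means a link that keeps failing is eventually dropped rather than retried forever.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed cap; not a value from the crawler


def crawl_all(start_urls, fetch):
    """Process links from an explicit queue instead of recursing per link.

    `fetch(url)` is a hypothetical callable that returns a list of newly
    found links, or raises an exception when crawling `url` fails.
    """
    queue = deque((url, 1) for url in start_urls)
    seen = set(start_urls)
    done, dropped = [], []
    while queue:  # a loop, so stack depth does not grow with queue size
        url, attempt = queue.popleft()
        try:
            links = fetch(url)
        except Exception:
            if attempt < MAX_ATTEMPTS:
                queue.append((url, attempt + 1))  # retry later
            else:
                dropped.append(url)  # give up, so the loop can end
            continue
        done.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, 1))
    return done, dropped
```

The `dropped` list also gives a record of which links never succeeded, which may help when checking why some domains do not work.
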
  • Re: NYT crawl: set up separately, and babysit. Need to circle back regarding the TWINT crawl.

** we will start with https://www.nytimes.com/section/world/middleeast

  • John: managed to get the instance running locally, and now needs to ensure that the documented version is the main instance/branch

** Raiyan has opened a pull request; John will comment on it and then Raiyan will commit

  • John: getting "access denied" when trying to log in to the Graham cloud

** SSH keys were already added; John showed the problem

  • NEXT STEPS

** we will revisit how John should work with MediaCAT next week; he will probably work on the Twint crawler next, with Raiyan