January 28, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Meeting Notes

  • Post-processor framework
    • Ran the post-processor on 10 Twitter users, which took 2 days
    • Began getting output for the whole scope; ran into a CSV issue caused by newlines in the plaintext
    • This quotation problem occurs ~20 times in 5 million items
    • Output has reached 600,000 of 5,000,000 items since the weekend
  • Metascraper integration - merged and closed!
  • Crawler performance
    • Jacqueline implemented a manual queue so that more than 1 link is returned for the affected domains; this has introduced a lot of bugs
      • Is the crawler not able to mark links properly when manually queueing?
    • Error code: it indicates either an error within the API or a user error in configuring the crawler
    • Jacqueline wrote a script that should restart the crawl once it has stopped
      • Also built in an email notification tool to alert the user once the crawl has stopped
    • The above is based on Cheerio, but Cheerio and Puppeteer both had the same issue
    • Nat's advice: get some of the sites you know are going to fail, find and isolate the pieces of code that are of note in the block, and communicate to the library developers/maintainers with the relevant information
  • Raiyan is still working on selective rendering
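
The CSV newline issue noted under the post-processor framework can be illustrated with a small sketch. The notes do not record how the post-processor writes its output, so this assumes the rows are written or counted by line-based code somewhere in the pipeline. The row contents and the helper names `write_rows` and `read_rows` are hypothetical. The sketch uses Python's standard `csv` module to quote every field, so a newline inside the plaintext stays within a single record, and compares that with a naive line count.

```python
import csv
import io

# Hypothetical rows; "plaintext" stands in for the scraped text field.
rows = [
    {"id": "1", "plaintext": "first line\nsecond line"},
    {"id": "2", "plaintext": 'She said "hi", then left'},
]


def write_rows(rows):
    """Serialize rows to CSV text, quoting every field."""
    # newline="" stops the buffer from translating line endings,
    # mirroring open(path, "w", newline="") for a real file.
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(
        buf, fieldnames=["id", "plaintext"], quoting=csv.QUOTE_ALL
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def read_rows(text):
    """Parse CSV text back into dicts; quoted newlines stay in one record."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


csv_text = write_rows(rows)
parsed = read_rows(csv_text)

# A line-based count splits the first record in two, so it overcounts.
naive_count = len(csv_text.splitlines()) - 1  # subtract the header line

print(len(parsed), naive_count)
```

Parsing with `csv.DictReader` recovers both records intact, while counting physical lines reports one record too many. If the affected ~20 items are being split this way, reading and writing with a CSV-aware parser rather than by line may be enough to avoid the problem.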