March 18, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Housekeeping
  • Hours check-in (how many hours does everyone have left?)
  • Meeting schedule going forward
  • Script from June?

Meeting notes

  • Jacqueline has been pruning the project's branches and adding test files
    • Mediacat-backend needs some attention as there are 6 branches
  • Notes on our instances
    • Creating an instance currently uses our entire resource allocation, so nothing is left over to back the instance up
    • For backups to work, all instances and backups running at the same time must fit within the total resources we have been given
    • Making smaller instances would leave room for backups
  • Raiyan tried batching with a limit of 5 pages per domain, and this seemed to work (5 pages were gathered from most domains)
    • The goal is to keep batches small while still being able to crawl
    • There is likely an optimal combination of domains per batch and pages crawled per batch; Raiyan is working on determining it
  • Post-processor framework
    • Amy completed the refactor of the output and tested it on the small output to confirm that it works
      • Text aliases are now stored as a list instead of a pipe-separated string
    • If a node has type "domain", "twitter article", or "text alias" and has no referrals, it is excluded from the post-processor output
    • If the homepage is crawled, the crawled node has type "article" and takes priority over the original static entry from the source input, which has type "domain"
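
The batching idea in the notes above can be illustrated with a minimal sketch. This is not the crawler's actual code: the function names `make_batches` and `page_budget`, the example domains, and the batch size of 2 are all hypothetical. Only the limit of 5 pages per domain comes from the notes.

```python
from typing import Dict, Iterator, List

# Per-domain page limit tried in the meeting notes.
PAGES_PER_DOMAIN = 5


def make_batches(domains: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `domains`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(domains), batch_size):
        yield domains[start:start + batch_size]


def page_budget(batch: List[str],
                pages_per_domain: int = PAGES_PER_DOMAIN) -> Dict[str, int]:
    """Give every domain in the batch the same page cap (hypothetical helper)."""
    return {domain: pages_per_domain for domain in batch}


if __name__ == "__main__":
    # Placeholder domains, not taken from the project's scope file.
    domains = ["a.example", "b.example", "c.example", "d.example", "e.example"]
    for batch in make_batches(domains, batch_size=2):
        print(page_budget(batch))
```

Varying `batch_size` and `pages_per_domain` independently is one way to search for the combination mentioned in the notes.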
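
The post-processor rules above can be sketched roughly as follows. This is an illustration under assumptions, not the post-processor's actual implementation: the dictionary keys `url`, `type`, and `referrals` and the function names are hypothetical. Only the three node types, the referral rule, the "article" over "domain" priority, and the pipe-to-list change come from the notes.

```python
from typing import Dict, List

# Node types that are excluded when they have no referrals (from the notes).
PRUNABLE_TYPES = {"domain", "twitter article", "text alias"}


def split_aliases(raw: str) -> List[str]:
    """Turn an old pipe-separated alias string into a list of aliases."""
    return [alias.strip() for alias in raw.split("|") if alias.strip()]


def merge_nodes(nodes: List[dict]) -> Dict[str, dict]:
    """Keep one node per URL; a crawled 'article' replaces a static 'domain'."""
    merged: Dict[str, dict] = {}
    for node in nodes:
        current = merged.get(node["url"])
        if current is None or (
            current["type"] == "domain" and node["type"] == "article"
        ):
            merged[node["url"]] = node
    return merged


def prune(nodes: Dict[str, dict]) -> List[dict]:
    """Drop nodes of a prunable type that have no referrals."""
    return [
        node for node in nodes.values()
        if not (node["type"] in PRUNABLE_TYPES and not node["referrals"])
    ]


if __name__ == "__main__":
    # Hypothetical input: a static scope entry plus a crawled homepage.
    nodes = [
        {"url": "https://example.com/", "type": "domain", "referrals": []},
        {"url": "https://example.com/", "type": "article", "referrals": []},
        {"url": "https://other.example/", "type": "domain", "referrals": []},
    ]
    for node in prune(merge_nodes(nodes)):
        print(node["url"], node["type"])
```

Merging before pruning matters in this sketch: a homepage that was crawled is already typed "article" by the time the referral check runs, so it is not dropped as an unreferenced "domain".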