March 18, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

Jacqueline has been pruning branches in our project and adding test files
- Mediacat-backend needs some attention as there are 6 branches
Notes on our instances
- We currently use up all of our allocated resources when we create an instance, so the instance cannot be backed up: nothing is left over for the backup
- To back up files, all the instances and backups running together must fit within the total resources we have been given
- Next step: make smaller instances
Raiyan attempted batching with a limit of 5 pages per domain, and this seemed to work (5 pages were gathered from most domains)
- We want to keep batches small while still being able to crawl
- There is likely an optimal combination of domains per batch and pages crawled per batch, so Raiyan is working on determining it
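
A minimal sketch of the batching idea above, for illustration only. It is not the crawler's actual code: `make_batches`, `max_pages_per_batch`, and the example domains are hypothetical names, and the only value taken from the notes is the limit of 5 pages per domain.

```python
from itertools import islice
from typing import Iterable, Iterator, List

# Limit tried in the meeting notes: 5 pages per domain.
PAGE_LIMIT_PER_DOMAIN = 5


def make_batches(domains: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches containing at most batch_size domains."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(domains)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def max_pages_per_batch(batch: List[str], page_limit: int = PAGE_LIMIT_PER_DOMAIN) -> int:
    """Upper bound on pages crawled for one batch (domains x per-domain limit)."""
    return len(batch) * page_limit


if __name__ == "__main__":
    # Hypothetical domain list, split into batches of 2.
    for batch in make_batches(["a.com", "b.com", "c.com"], batch_size=2):
        print(batch, "-> at most", max_pages_per_batch(batch), "pages")
```

Tuning then comes down to varying `batch_size` and `page_limit` and observing which combinations still crawl reliably.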
Post-processor framework
- Amy completed the refactor of the output and tested it on the small output to confirm it is working
  - Text aliases are now stored as a list instead of a pipe-separated string
- If a node has type "domain", "twitter article", or "text alias" and has no referrals, it is excluded from the post-processor output
- If a homepage is crawled, the crawled node has type 'article' and takes priority over the original static entry from the source input, which has type 'domain'
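
The two output rules above can be sketched roughly as follows. This is an assumption-laden illustration, not the post-processor's real code: the dictionary fields `url`, `type`, and `referrals` and the function `select_output_nodes` are hypothetical, and the sketch does not attempt to merge referrals between a static entry and its crawled counterpart.

```python
from typing import Dict, List

# Node types that are dropped from the output when nothing refers to them.
STATIC_TYPES = {"domain", "twitter article", "text alias"}


def select_output_nodes(nodes: List[dict]) -> List[dict]:
    """Apply the priority rule, then the no-referrals exclusion rule."""
    chosen: Dict[str, dict] = {}
    for node in nodes:
        current = chosen.get(node["url"])
        # A crawled 'article' node replaces the static 'domain' entry
        # for the same URL; otherwise the first node seen is kept.
        if current is None or (
            current["type"] == "domain" and node["type"] == "article"
        ):
            chosen[node["url"]] = node
    # Exclude static-type nodes that have no referrals.
    return [
        node
        for node in chosen.values()
        if not (node["type"] in STATIC_TYPES and not node["referrals"])
    ]


if __name__ == "__main__":
    # Hypothetical input: one crawled homepage and two static entries.
    example = [
        {"url": "https://example.com/", "type": "domain", "referrals": []},
        {"url": "https://example.com/", "type": "article", "referrals": []},
        {"url": "https://other.org/", "type": "domain", "referrals": []},
    ]
    for node in select_output_nodes(example):
        print(node["url"], node["type"])
```

Applying the priority rule first matters here: a crawled homepage stays in the output even when its static 'domain' entry has no referrals.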