Jacqueline has been pruning the branches of our project and adding test files
Mediacat-backend needs some attention as there are 6 branches
Notes on our instances
We currently max out all our resources when we create an instance, so there are no resources left to back the instance up
To back up files, the combined resource usage of all instances and backups running together has to stay under the total resources we have been given
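The constraint above can be sketched as a simple quota check. This is only an illustration: the resource names (vcpus, disk_gb), all the numbers, and the assumption that a backup consumes disk but no vCPUs are hypothetical and are not taken from our actual allocation.

```python
def fits_within_quota(quota, allocations):
    """Return True if the summed usage stays within the quota for every resource."""
    for resource, limit in quota.items():
        used = sum(a.get(resource, 0) for a in allocations)
        if used > limit:
            return False
    return True


# Hypothetical numbers, for illustration only.
quota = {"vcpus": 8, "disk_gb": 100}

full_instance = {"vcpus": 8, "disk_gb": 100}   # instance that maxes out the quota
backup_of_full = {"disk_gb": 100}              # assumed: a backup only uses disk

small_instance = {"vcpus": 4, "disk_gb": 40}   # smaller instance leaves headroom
backup_of_small = {"disk_gb": 40}

can_back_up_full = fits_within_quota(quota, [full_instance, backup_of_full])
can_back_up_small = fits_within_quota(quota, [small_instance, backup_of_small])
print(can_back_up_full, can_back_up_small)
```

With these made-up numbers the maxed-out instance cannot be backed up, while the smaller instance and its backup fit together.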
Making smaller instances
Raiyan attempted batching with a limit of 5 pages per domain, and this seemed to work (5 pages were gathered from most domains)
Want to keep batches small, while still being able to crawl
There is likely an optimal number of domains per batch and pages crawled per batch, so Raiyan is working on determining it
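A minimal sketch of the batching idea, to make the two tuning knobs concrete. The function name plan_batches, the plan layout, and the example domains are hypothetical; this is not the crawler's actual interface, only the limit of 5 pages per domain comes from the notes above.

```python
def plan_batches(domains, domains_per_batch, pages_per_domain=5):
    """Split domains into batches, each carrying its per-domain page limit."""
    if domains_per_batch < 1:
        raise ValueError("domains_per_batch must be at least 1")
    return [
        {
            "domains": domains[i:i + domains_per_batch],
            "pages_per_domain": pages_per_domain,
        }
        for i in range(0, len(domains), domains_per_batch)
    ]


def max_pages(batch):
    """Upper bound on pages crawled in one batch."""
    return len(batch["domains"]) * batch["pages_per_domain"]


# Placeholder domains, for illustration only.
domains = [f"site{n}.example" for n in range(7)]
batches = plan_batches(domains, domains_per_batch=3)

for batch in batches:
    print(batch["domains"], "-> at most", max_pages(batch), "pages")
```

Seven domains at three per batch give batches of 3, 3 and 1 domains, so no batch crawls more than 15 pages. Varying domains_per_batch and pages_per_domain is one way to search for the point Raiyan is looking for.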
Post-processor framework
Amy completed the refactor of the output and tested it on the small output to confirm it is working
Text aliases are now in a list format instead of separated by pipe
If a node has type "domain", "twitter article" or "text alias" and has no referrals, it is excluded from the post-processor output
If a homepage is crawled, the crawled node will have type 'article', and will be prioritized over the original static entry from the source input, which has type 'domain'
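The two rules above could be sketched roughly as follows. The node layout (dicts with "url", "type" and "referrals" keys), the function name, and the choice to resolve duplicates before filtering are assumptions made for illustration; Amy's actual post-processor may be structured differently.

```python
# Types that are dropped from the output when nothing refers to them.
EXCLUDED_WITHOUT_REFERRALS = {"domain", "twitter article", "text alias"}


def postprocess(nodes):
    """Prefer crawled 'article' nodes over static 'domain' entries, then filter."""
    by_url = {}
    for node in nodes:
        existing = by_url.get(node["url"])
        if existing is None:
            by_url[node["url"]] = node
        elif existing["type"] == "domain" and node["type"] == "article":
            # The crawled homepage replaces the static source entry.
            by_url[node["url"]] = node

    return [
        node
        for node in by_url.values()
        if not (node["type"] in EXCLUDED_WITHOUT_REFERRALS and not node["referrals"])
    ]


# Placeholder nodes, for illustration only.
nodes = [
    {"url": "https://example.com", "type": "domain", "referrals": []},
    {"url": "https://example.com", "type": "article", "referrals": []},
    {"url": "https://example.org", "type": "domain", "referrals": []},
    {"url": "https://example.net", "type": "domain",
     "referrals": ["https://example.com"]},
]

output = postprocess(nodes)
print([(n["url"], n["type"]) for n in output])
```

Here the crawled homepage for example.com survives as an 'article', example.org is dropped because it is a 'domain' with no referrals, and example.net is kept because it has a referral.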