March 11, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
Meeting Notes
- Crawler update
- Raiyan created a batching setup with Apify that splits the scope into smaller batches, which are run one after the other
- Marking batches as completed when they are processed
- Estimates of how long a batch takes to run?
- Jacqueline has set up a virtual machine, but we need to set up multiple instances on separate VMs
- This is so that the instances can run concurrently
- NYTimes data is being run. Can we gather references without running the referenced domains as instances?
- Yes, using a separate script from June that will extract the referenced domains' data
- Post-processor
- Saving and resuming post-processing is working
- Amy's work on refactoring output
- A source may have multiple text aliases or multiple Twitter handles
- Twitter handles are not to be grouped; they can be accounted for separately
- Text aliases will be grouped together with a pipe separator in the output, as lists cannot be used as keys
- Use the `uuid` library to create unique IDs
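
The output decisions above (pipe-separated text aliases as keys, Twitter handles kept separate, `uuid` for IDs) could look roughly like the sketch below. This is only an illustration, not the actual post-processor code: the field names `text_aliases` and `twitter_handles`, the helper names, the sample source, and the choice of `uuid.uuid4()` are all assumptions.

```python
import uuid


def alias_key(text_aliases):
    """Join a source's text aliases into one pipe-separated string.

    A Python list is unhashable and cannot be a dict key, so the aliases
    are grouped into a single string instead.
    """
    return "|".join(text_aliases)


def build_output(sources):
    """Build an output dict keyed by the grouped text aliases.

    The record layout here is hypothetical; only the pipe-separated key
    and the use of uuid come from the meeting notes.
    """
    output = {}
    for source in sources:
        key = alias_key(source["text_aliases"])
        output[key] = {
            # Assumption: a random (version 4) UUID is used as the unique id.
            "id": str(uuid.uuid4()),
            # Twitter handles are not grouped; each one stays a separate entry.
            "twitter_handles": list(source["twitter_handles"]),
        }
    return output


# Hypothetical example source with two text aliases and two handles.
sources = [
    {
        "text_aliases": ["New York Times", "NYTimes"],
        "twitter_handles": ["@nytimes", "@nytimesworld"],
    },
]
output = build_output(sources)
```

Splitting a key on `|` recovers the individual aliases, which works as long as no alias itself contains a pipe character.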