March 11, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

Crawler update
- Raiyan created a batching setup with Apify that splits the scope into smaller batches, which are run one after the other
  - Marking batches as completed when they are processed
  - Estimates of how long a batch takes to run?
- Jacqueline has set up a virtual machine, but we need to set up multiple instances on separate VMs
  - This is so that the separate VMs can run concurrently
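
The batching idea above (split the scope, run batches in sequence, mark each as completed) can be sketched roughly as follows. This is only an illustration in Python and not the Apify setup itself; `split_into_batches`, `run_batches`, and the `crawl_batch` callback are hypothetical names.

```python
from typing import Callable, List, Optional, Set


def split_into_batches(urls: List[str], batch_size: int) -> List[List[str]]:
    """Split the crawl scope into consecutive batches of at most batch_size URLs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


def run_batches(
    batches: List[List[str]],
    crawl_batch: Callable[[List[str]], None],  # hypothetical stand-in for the crawler
    completed: Optional[Set[int]] = None,
) -> Set[int]:
    """Run batches one after the other, skipping any already marked completed."""
    completed = set() if completed is None else completed
    for index, batch in enumerate(batches):
        if index in completed:
            continue  # processed in an earlier run
        crawl_batch(batch)
        completed.add(index)  # mark as completed only after processing
    return completed
```

Timing each `crawl_batch` call would be one way to get the per-batch run-time estimates raised above.
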
NYTimes data is being run; can we gather references without running the referenced domains as instances?
- Yes, using a separate script from June that will extract the referenced domains' data
Post-processor
- Saving and picking up post-processing is working
- Amy's work on refactoring output
  - A source may have multiple text aliases or multiple Twitter handles
  - Twitter handles are not to be grouped; they can be accounted for separately
  - Text aliases will be grouped together with a pipe separator in output, as lists cannot be used as keys
- Use the uuid library to create unique IDs
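
A minimal sketch of the output keying discussed above, assuming the post-processor is written in Python. The function names and the example source are hypothetical; only the pipe separator, the separate handling of Twitter handles, and the use of `uuid` come from the notes.

```python
import uuid
from typing import List


def output_keys(text_aliases: List[str], twitter_handles: List[str]) -> List[str]:
    """Return the keys for one source: a single pipe-joined key for all text
    aliases (a list cannot be a dict key), plus one key per Twitter handle."""
    keys = []
    if text_aliases:
        keys.append("|".join(text_aliases))
    keys.extend(twitter_handles)  # handles stay separate, not grouped
    return keys


def new_id() -> str:
    """Create a unique ID as a random (version 4) UUID string."""
    return str(uuid.uuid4())


# Hypothetical example source
keys = output_keys(["New York Times", "NYT"], ["@nytimes", "@nytimesworld"])
counts = {key: 0 for key in keys}
```

One caveat with this approach: an alias that itself contains a pipe character would make the joined key ambiguous, so such aliases would need escaping or a different separator.
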