March 11, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

Meeting Notes

  • Crawler update
    • Raiyan created a batching setup with Apify that splits the scope into smaller batches, which are run one after the other
      • Marking batches as completed when they are processed
      • Estimates of how long a batch takes to run?
    • Jacqueline has set up a virtual machine, but we need to set up multiple instances on separate VMs
      • So that the separate VMs can run concurrently
  • NYTimes data is being run. Can we gather references without running the referenced domains as instances?
    • Yes, with a separate script from June that will extract the referenced-domain data
  • Post-processor
    • Saving and picking up post-processing is working
    • Amy's work on refactoring output
      • A source may have multiple text aliases, or multiple twitter handles
      • Twitter handles are not to be grouped; they can be accounted for separately
      • Text aliases will be grouped together with a pipe separator in output, as lists cannot be used as keys
    • Use the uuid library to create unique IDs
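
The post-processor output points above could be sketched roughly as follows. This is only an illustration under assumptions: the function names, the record fields, and the choice of `uuid.uuid4()` are hypothetical and not taken from the actual post-processor; the notes only say that text aliases are joined with a pipe, that Twitter handles stay separate, and that the uuid library is used for IDs.

```python
import uuid


def alias_key(text_aliases):
    """Join text aliases into one pipe-separated string.

    A list is unhashable and cannot be a dict key, but a string can.
    """
    return "|".join(text_aliases)


def build_source_record(name, text_aliases, twitter_handles):
    """Build one output record for a source (field names are hypothetical)."""
    return {
        # Assumption: a random version-4 UUID is used as the unique ID.
        "id": str(uuid.uuid4()),
        "name": name,
        # Text aliases are grouped into a single pipe-separated value.
        "text_aliases": alias_key(text_aliases),
        # Twitter handles are not grouped; each is kept as its own entry.
        "twitter_handles": list(twitter_handles),
    }


record = build_source_record(
    "Example Source",
    ["Example Source", "ExSource"],
    ["@example", "@exsource"],
)

# The grouped alias string can now serve as a dictionary key.
mention_counts = {record["text_aliases"]: 0}
```

One caveat with this sketch: an alias that itself contains a pipe character would make the grouped string ambiguous, so a real implementation might need to escape or reject such aliases.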