Sep 21, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • forms
  • new time
  • email Nat about the NPM error and the history issue when restarting a crawl
  • Aryan will work with Francisco to figure out what troubleshooting needs to be done
  • Gy will help Raazia learn how to set up an instance

Postprocessor

  • running the postprocessor on foxnews today to see whether it works on a large dataset
  • still need to troubleshoot accuracy

Crawler/server

  • worked on running the crawler
  • reading about web crawling in general, plus some resources
  • reading the NYT crawl and the batch
  • meeting with Nat tomorrow

Action Items

  • Aryan and Francisco: still learning the codebase
    • first, learn Dask DataFrames with Nat in a meeting
    • then turn to fixing accuracy of the postprocessor
    • Francisco will send some resources for Aryan about Dask and Pandas
  • Raazia: continue reading about web crawling, meet with Nat and Gy, and look at the web archive API