February 9, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Nat -- any new ideas about counting URLs?
  • key pair -- Alejandro and Irfan comms
  • Shawn -- 100 URLs per website?
  • Shawn -- checking numbers for new small domain crawl
  • no answer yet about /nearline
  • any answer from Shengsong or news about json-csv conversion?
  • postprocessor

old small domain

  • the json-csv conversion is a problem
    • Shengsong couldn't find the converter
    • Nat helped
  • Shengsong said that running the same call should re-start the crawl
    • let's see if it works: wait 2 weeks and then check
    • Shawn will document this and do a bit of code review of the re-start function
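
For the documentation item, a minimal sketch of what a json-csv conversion step might look like. This is not the project's converter (which the notes say was hard to locate); the function name `json_records_to_csv` and the field names `url`, `domain`, and `found_urls` are hypothetical, and the input is assumed to be a JSON array of flat-ish crawl records.

```python
import csv
import io
import json


def json_records_to_csv(json_text, fieldnames=None):
    """Convert a JSON array of records (or a single record) to CSV text.

    Hypothetical helper for illustration only; not the MediaCAT converter.
    """
    records = json.loads(json_text)
    if isinstance(records, dict):
        # Allow a single object as input by wrapping it in a list.
        records = [records]

    if fieldnames is None:
        # Collect column names in first-seen order across all records.
        fieldnames = []
        for rec in records:
            for key in rec:
                if key not in fieldnames:
                    fieldnames.append(key)

    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for rec in records:
        # Nested lists/dicts are stored as JSON strings so each record stays on one row.
        row = {
            k: (json.dumps(v) if isinstance(v, (list, dict)) else v)
            for k, v in rec.items()
        }
        writer.writerow(row)  # missing keys are written as empty cells
    return buf.getvalue()
```

Collecting the header from all records, rather than only the first, avoids dropping columns when some records carry fields that others lack.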

other ways to count URLs

  • just use the Internet Archive summary page

postprocessor

  • has counts for both Twitter crawls; the crawl list will be updated

Action Items

  • Shawn: add documentation about the json-csv conversion and re-starting the crawl
  • Shawn: do a bit of code review of the re-start function