Nov 23, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Action Items from last day

  • use rsync to transfer data from Arbutus to Graham, including the apify storage folder inside the small domain folder, to prevent re-crawling the same urls - Gy
  • nytimes archive crawl with keyword "Middle East": try to run on Graham cloud - Gy
  • review 2 pull requests before merging - Gy
  • if we don't hear from Nat today, write to the IA email address and ask about the error, the speed limits for crawls, and where documentation for crawling can be found - Ra
  • (1) figure out how many unique urls without landing pages; (2) how many are /world/ and /world/middleeast/; (3) see if there is a time-bound characteristic to 2, and start crawl for those urls - Ra
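A minimal sketch of how counts (1) to (3) could be computed from a list of crawled URLs. It assumes NYT article URLs start with a dated `/YYYY/MM/DD/` path and treats everything else as a landing page; that heuristic and the function name `summarize` are assumptions for illustration, not part of the existing crawler.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: article paths look like /2023/11/20/world/middleeast/slug.html
DATE_RE = re.compile(r"^/(\d{4})/(\d{2})/(\d{2})/")

def summarize(urls):
    """Return (article_count, section_counts, middleeast_by_year)."""
    # Deduplicate on host + path, ignoring query strings and fragments.
    unique = set()
    for url in urls:
        parts = urlsplit(url.strip())
        unique.add((parts.netloc.lower(), parts.path.rstrip("/")))

    sections = Counter()
    by_year = Counter()
    articles = 0
    for _host, path in unique:
        match = DATE_RE.match(path)
        if not match:
            continue  # no dated prefix: treated as a landing page
        articles += 1
        rest = path[match.end():]
        if rest.startswith("world/"):
            sections["world"] += 1  # includes world/middleeast
        if rest.startswith("world/middleeast/"):
            sections["world/middleeast"] += 1
            by_year[match.group(1)] += 1  # year, to look for time-bound patterns
    return articles, sections, by_year
```

Here `sections["world"]` counts every article under `/world/`, including the `/world/middleeast/` ones, and `by_year` gives a rough view of whether the Middle East URLs cluster in particular years.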

Gy update

  • Currently, I am transferring data from the Arbutus server to the Graham server using rsync. This process seems slower than previous transfers from Graham to Arbutus, and I am investigating the cause. Once this data transfer completes, I will resume the small domain crawls and initiate the Middle East NYT crawl.
  • Regarding our ongoing Israel news domain crawls, we have gathered 137,245 results from Israel National News, 33,971 from Jerusalem Post, and 67,230 from Times of Israel so far.

Internet Archive crawl

  • we're unblocked as of today

Postprocessor

  • finished JSON to CSV conversion (or vice versa)
  • postprocessor run on WaPo is hitting a dataframe problem.
    • file too large to find the problematic line manually
    • cutting it up into smaller files to locate the problematic line
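The cut-into-smaller-files search can also be done as a bisection over line prefixes. The sketch below is only an illustration: `first_bad_line` and `rows_are_consistent` are hypothetical names, and the check (every CSV row has as many fields as the header) is an assumed stand-in for whatever actually trips the dataframe load.

```python
import csv

def rows_are_consistent(lines):
    """Assumed check: every CSV row has the same field count as the header."""
    rows = list(csv.reader(lines))
    if not rows:
        return True
    width = len(rows[0])
    return all(len(row) == width for row in rows)

def first_bad_line(lines, loads_ok):
    """Return the 0-based index of the first line that makes loads_ok fail.

    Returns None when the whole input passes. Prefixes are tested so the
    header line is always included in what gets checked.
    """
    if loads_ok(lines):
        return None
    lo, hi = 0, len(lines)  # invariant: lines[:lo] passes, lines[:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if loads_ok(lines[:mid]):
            lo = mid
        else:
            hi = mid
    return hi - 1
```

A call such as `first_bad_line(open(path, encoding="utf-8").readlines(), rows_are_consistent)` would report the offending line index. This assumes records do not span multiple lines; quoted fields with embedded newlines would need a record-aware split instead.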

Action Items

  • IA NYT crawl (1) figure out how many unique urls without landing pages; (2) how many are /world/ and /world/middleeast/; (3) see if there is a time-bound characteristic to 2, and start crawl for those urls - Ra
  • continue dialogue with IA: (1) bulk download, (2) ask about rate of calls - Ra
  • update documentation on JSON/CSV conversion - Ar
  • develop unit testing for foxnews postprocessed results, for example, on text alias & hyperlinks - Ar
  • send small WaPo twitter files into postprocessor to troubleshoot error - Fr
  • do test on postprocessor with sample IA results (about 36,000) - Ar
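For the JSON/CSV documentation item, a minimal round-trip sketch using only the Python standard library may be a useful starting point. It assumes flat records (no nested objects); the function names and the `url`/`title` fields in the usage note are hypothetical and not taken from the postprocessor.

```python
import csv
import io
import json

def json_to_csv(json_text, fieldnames):
    """Convert a JSON array of flat objects to CSV text with the given columns."""
    records = json.loads(json_text)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    for record in records:
        # Missing keys become empty cells; keys not in fieldnames are dropped.
        writer.writerow({key: record.get(key, "") for key in fieldnames})
    return buf.getvalue()

def csv_to_json(csv_text):
    """Convert CSV text back to a JSON array; all values come back as strings."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)
```

For example, `csv_to_json(json_to_csv('[{"url": "https://example.com/a", "title": "Hello, world"}]', ["url", "title"]))` yields the same two fields back, with the comma inside the title preserved by CSV quoting. Numeric values do not survive the round trip as numbers, which is worth noting in the documentation.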