Aug 4, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • postprocessor

  • upload the NYT archive crawler with brake as a separate branch, and document how it differs from the earlier version - Gy

  • speed up small domain crawl a bit - Gy

  • do a count of the Israeli domain crawl - Gy

  • crawl of NYT "Israel" for the years 2006-2009, and use the article filter - Gy

  • continue with the postprocessor - Fr

Postprocessor

  • 2 problems were giving us trouble: first, an additional header line in the input; second, a copy/paste error in which the full last line wasn't being copied before the postprocessor was run
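
The extra-header-line problem can be handled with a small cleanup pass before postprocessing. The sketch below is illustrative only (the function name and the sample data are hypothetical, not the project's actual script): it keeps the first line as the header and drops any later line that repeats it, which is what happens when several crawl batches are concatenated.

```python
def strip_extra_headers(text, header=None):
    """Remove repeated header lines from concatenated crawler CSV output.

    Keeps the first line as the header and drops any later line that is
    identical to it. An explicit `header` string may be supplied instead.
    """
    lines = text.splitlines()
    if not lines:
        return text
    if header is None:
        header = lines[0]
    # Keep the first line, filter out later duplicates of the header.
    cleaned = [lines[0]] + [ln for ln in lines[1:] if ln != header]
    return "\n".join(cleaned) + "\n"

# Hypothetical sample mimicking two concatenated crawl batches:
raw = "id,url,text\n1,a,hello\nid,url,text\n2,b,world\n"
print(strip_extra_headers(raw))
```

A line-identity check like this is deliberately conservative: it only removes exact repeats of the header, so ordinary data rows are never dropped.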

Crawls & Server

  • figured out the right URL (finally sent by the Digital Alliance) and created a new instance
  • if the message "all requests have been processed" comes back with few results, the crawler is likely being blocked
  • provide a separate IP address if something is flagged
    • creating a new address might help
  • the small domain crawl was separated out, and at first only Jewish Journal wasn't working (the same one that is corrupted), but after five days a couple more stopped working
    • the 2 that stopped working (Jewish Currents & Peter Beinart) were put together and are using a new IP address
    • on Arbutus server, the Jewish Journal data are not marked as corrupted
  • Israeli domain crawl:
    • really slow: 15,000 a week, some only a few results
    • only 1 with lots of crawler results

On-going tasks:

  • check crawl every 2 days - Gy
  • update the MVP esp wrt format of data going into postprocessor and coming out, and then as input to the visualization environment - Gy/Fr
  • push corrected postprocessor code to master - Gy/Fr
  • postprocessor: document with instructions the order of utilities and steps to use the postprocessor - Gy
  • backburner: figure out corruption in small domain crawl

Action Items:

  • develop a script and documentation to remove extra header lines from twitter crawl output prior to postprocessing - Fr
  • check the URL extender to see if it is the most updated version - Fr
  • run URL extender on test twitter crawl output (~23,000) and run postprocessor on the resulting output - Fr
  • check results of postprocessor on test data - Al
  • if results work, run URL extender on all twitter crawl (Fox News and Washington Post, keeping separate) and postprocess - Fr
  • check if new IP address created with new instance - Gy
  • pause Israeli domain crawl while testing other crawl technique - Gy
  • set up individual crawls for Israeli domains to test the crawl technique, and check regularly to see if multiple errors have triggered the brake - Gy
  • if new IP address is created with new instance, try NYT archive crawl - Gy