July 28, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

take a look at the brake in the domain crawler and read through - Gy
look at date range for NYT - Israel & Palestine and send email to Alejandro about what is included - Gy
look at which folder is taking up most memory on server - Gy
attempt to combine stealthy crawl on NYT archive - Gy
continue rebuilding Graham instance - Gy
visualization: troubleshoot error where stacked area graph isn't filled in - Fr
visualization: add buttons to simplify using different types of charts and graphs and simplify jupyterlab files for users - Fr
D3: look at how to convert CSV to JSON - Fr
look at Colin's and Shengsong's instructions about how Alejandro (user) connects to server and decide whether still best way - Fr
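
For the D3 item above (CSV to JSON), a minimal sketch using only the Python standard library. The column names and sample rows are hypothetical, since the notes do not specify the layout of the visualization input; `csv_to_json` is an illustrative helper, not an existing MediaCAT utility.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(reader), indent=2)


# Hypothetical sample data; real column names are an assumption.
sample = (
    "date,domain,count\n"
    "2023-07-01,example.com,12\n"
    "2023-07-02,example.org,7\n"
)
records = json.loads(csv_to_json(sample))
```

Note that `csv.DictReader` keeps every value as a string, so numeric columns would still need to be converted before charting.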

Brake:
- wasn't on the NYT crawl; still has flaws; looking at proxy rotation; could be that there were changes on the website
- IP rotation is next step
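
As a rough illustration of the proxy/IP rotation idea mentioned above, here is a minimal round-robin sketch. It is not the crawler's actual code: `ProxyRotator` is a hypothetical helper, and the addresses are placeholders from the reserved documentation range 192.0.2.0/24.

```python
from itertools import cycle


class ProxyRotator:
    """Hand out proxies in round-robin order (hypothetical helper)."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_proxy(self):
        # Each call returns the next proxy, wrapping around at the end.
        return next(self._pool)


# Placeholder proxy addresses, not real servers.
rotator = ProxyRotator(["http://192.0.2.10:8080", "http://192.0.2.11:8080"])
assigned = [rotator.next_proxy() for _ in range(3)]
```

Each outgoing request would take `rotator.next_proxy()` so that consecutive requests leave from different addresses.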
Israeli domains crawl continues well

after meeting with Nat, could get small selection but stopped reading after 500 selection
could be an issue with formatting; maybe make a custom dataframe rather than use pandas
making changes to postprocessor, and developer friendly changes with documentation
error: tokenizing error at end of line when reading the CSV file; could be one record that isn't formatted properly, or a hidden character
- one approach: replace the default pandas reader with a custom one written from scratch
- challenge: not sure how much work producing a custom dataframe would be
- another approach is to modify the files: a tool that properly formats and cleans up the CSV before giving it to the postprocessor
- challenge: will this modify the data?
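
Before choosing between the two approaches above, it may help to locate the offending record first. The sketch below is a hypothetical diagnostic (`find_bad_rows` is not part of the postprocessor) that only reports rows with an unexpected field count or non-printable characters, and never modifies the data. The sample columns are assumptions.

```python
import csv
import io


def find_bad_rows(csv_text, delimiter=","):
    """Return (line_number, reason) pairs for rows that look malformed."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text, newline=""), delimiter=delimiter)
    expected = None  # field count taken from the first non-empty row
    try:
        for row in reader:
            if not row:
                continue  # skip blank lines
            if expected is None:
                expected = len(row)
            elif len(row) != expected:
                problems.append(
                    (reader.line_num,
                     f"expected {expected} fields, found {len(row)}")
                )
            for field in row:
                for ch in field:
                    if not ch.isprintable() and ch not in "\t\r\n":
                        problems.append(
                            (reader.line_num,
                             f"non-printable character U+{ord(ch):04X}")
                        )
    except csv.Error as exc:
        problems.append((reader.line_num, f"parser error: {exc}"))
    return problems


# Hypothetical sample: line 3 has an extra field, line 4 hides a zero-width space.
sample = (
    "url,title,date\n"
    "https://a.example,One,2023-01-01\n"
    "https://b.example,Two,extra,2023-01-02\n"
    "https://c.example,Thr\u200bee,2023-01-03\n"
)
problems = find_bad_rows(sample)
```

If only a handful of records turn out to be bad, pandas' own `on_bad_lines` option in `read_csv` may be enough, without a custom reader.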

check crawl every 2 days - Gy
update the MVP esp wrt format of data going into postprocessor and coming out, and then as input to the visualization environment - Gy/Fr
push corrected postprocessor code to master - Gy/Fr
postprocessor: document with instructions the order of utilities and steps to use the postprocessor - Gy
backburner: figure out corruption in small domain crawl

upload NYT archive crawler with brake as separate branch and document what the difference is with the earlier version - Gy
speed up small domain crawl a bit - Gy
do a count of the Israeli domain crawl - Gy
crawl of NYT "Israel" for the years 2006-2009, and use article filter - Gy
continue with the postprocessor - Fr