July 28, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- take a look at the brake in the domain crawler and read through - Gy
- look at date range for NYT - Israel & Palestine and send email to Alejandro about what is included - Gy
- look at which folder is taking up most memory on server - Gy
- attempt to combine stealthy crawl on NYT archive - Gy
- continue rebuild Graham instance - Gy
- visualization: troubleshoot error where stacked area graph isn't filled in - Fr
- visualization: add buttons to simplify using different types of charts and graphs and simplify jupyterlab files for users - Fr
- D3: look at how to convert CSV to JSON - Fr
- look at Colin's and Shengsong's instructions about how Alejandro (user) connects to server and decide whether still best way - Fr
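For the CSV to JSON item above, a minimal sketch using only the Python standard library. The column names and sample rows are hypothetical, not the actual postprocessor output; the function name `csv_to_json` is also made up for illustration. D3 can load the resulting array of objects directly.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    rows = list(reader)  # one dict per record, keyed by the header names
    return json.dumps(rows, indent=2)


# Hypothetical sample; these column names are illustrative only.
sample = (
    "date,domain,count\n"
    "2023-07-01,example.com,12\n"
    "2023-07-02,example.com,7\n"
)
print(csv_to_json(sample))
```

All values come out as strings, so numeric columns would still need converting, either here or on the D3 side.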
Crawl
- Brake:
- wasn't on for the NYT crawl; still flaws; looking at proxy rotation; possibly the website changed
- IP rotation is next step
- Israeli domains crawl continues well
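As a rough illustration of the proxy/IP rotation idea above, a minimal sketch that cycles through a proxy list and picks a randomized pause between requests. It assumes the brake is a delay between requests, which these notes do not spell out. The proxy addresses, URLs, and function names are hypothetical, and no requests are actually sent.

```python
import itertools
import random


def proxy_cycle(proxies):
    """Return an endless round-robin iterator over the given proxies."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    return itertools.cycle(proxies)


def brake_delay(base_seconds, jitter_seconds):
    """Pick a pause between requests: a fixed base plus random jitter."""
    return base_seconds + random.uniform(0, jitter_seconds)


# Hypothetical proxies (addresses from a documentation-only IP range).
PROXIES = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]
rotation = proxy_cycle(PROXIES)

# Hypothetical URLs; a real crawler would fetch each one through `proxy`
# and then sleep for `delay` seconds before the next request.
for url in ["https://example.com/a", "https://example.com/b"]:
    proxy = next(rotation)
    delay = brake_delay(2.0, 1.0)
    print(f"{url} via {proxy}, then pause {delay:.2f}s")
```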
Postprocessor
- after meeting with Nat, could read a small selection, but reading stopped after a selection of 500
- could be an issue with formatting; maybe make a custom dataframe rather than using Pandas
- making changes to postprocessor, and developer friendly changes with documentation
- error: tokenizing error at end of line when reading the csv file; could be one record that isn't formatted properly, or a hidden character
- one approach: instead of the default pandas reader, write a custom one from scratch
- challenge: not sure how much work producing a custom dataframe would be
- another approach is to modify the files: a tool that properly formats and cleans up the csv before giving it to the postprocessor
- challenge: will this modify the data?
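Before choosing between a custom reader and a cleanup tool, it may help to locate the offending record first. The sketch below is one possible diagnostic, not the postprocessor's own code: it uses the standard `csv` module to report rows whose field count differs from the header, and separately lists hidden control or format characters. The function names and sample data are hypothetical. It only reads and reports, so the data itself is not modified.

```python
import csv
import io
import unicodedata


def find_bad_rows(csv_text, delimiter=","):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(csv_text, newline=""), delimiter=delimiter)
    expected = len(next(reader))  # the header row defines the expected width
    bad = []
    for row in reader:
        if len(row) != expected:
            bad.append((reader.line_num, len(row)))
    return bad


def find_hidden_chars(csv_text):
    """Return (line_number, code_point) for control/format characters other than tab and CR."""
    hits = []
    for line_no, line in enumerate(csv_text.split("\n"), start=1):
        for ch in line:
            if ch not in "\t\r" and unicodedata.category(ch) in ("Cc", "Cf"):
                hits.append((line_no, f"U+{ord(ch):04X}"))
    return hits


# Hypothetical sample: line 3 has an extra field, line 4 hides a zero-width space.
sample = (
    "url,title,date\n"
    "https://example.com/1,First,2023-07-01\n"
    "https://example.com/2,Second,extra,2023-07-02\n"
    "https://example.com/3,Thi\u200brd,2023-07-03\n"
)
print(find_bad_rows(sample))
print(find_hidden_chars(sample))
```

If the report points to only a handful of records, repairing those by hand is likely less work than a custom dataframe, and it keeps the changes to the data explicit and reviewable.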
Ongoing tasks:
- check crawl every 2 days - Gy
- update the MVP esp wrt format of data going into postprocessor and coming out, and then as input to the visualization environment - Gy/Fr
- push corrected postprocessor code to master - Gy/Fr
- postprocessor: document the order of utilities and the steps for using the postprocessor - Gy
- backburner: figure out corruption in small domain crawl
Action Items:
- upload NYT archive crawler with brake as separate branch and document what the difference is with the earlier version - Gy
- speed up small domain crawl a bit - Gy
- do a count of the Israeli domain crawl - Gy
- crawl of NYT "Israel" for the years 2006-2009, and use article filter - Gy
- continue with the postprocessor - Fr