June 30, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- ask Shawn about NYT archive (politics) - Gy
- set up domain crawler on server - Fr
- make a new tab in the crawl index spreadsheet to track the number for the small domain crawl - Al
- add numbers for each domain to new tab - Gy
- postprocessor: continue troubleshooting, especially the formatting, and ask Shawn - Gy
- postprocessor: write instructions documenting the order of utilities and the steps for using the postprocessor - Gy
- assess whether Matplotlib has features that would make building the UI faster - Fr
- use existing data to test on the JupyterLab platform - Fr
- research whether there are existing libraries with a friendlier UI than JupyterLab - Fr
- write to Kirsta and Nat about the use of JupyterLab - Al
- meet on Monday, June 26th, 11am Toronto time to talk about the postprocessor - Gy/Fr
Crawler/Server
- no answers from Shawn
- solved the bug in the output Python file and the compiler Python bug; Shawn might have done it a different way
- still getting zero results with test data
- once all the bugs are solved and tested, push to master
- NYT archive crawls: Mid E/Palestinian/Israel
- Palestinian archive crawl has the expected number of 60,000, but Mid E (less than half) and Israel (about half) did not reach the expected number
- Francisco got the domain crawler working on National Post
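
A minimal sketch of how the crawl counts above could be checked against their expected numbers to decide which crawls to restart. The function name `find_short_crawls`, the 90% tolerance, and every count except the 60,000 figure are placeholders for illustration, not numbers from the crawls themselves:

```python
def find_short_crawls(actual, expected, tolerance=0.9):
    """Return (name, count, target) for crawls below tolerance * target.

    `actual` and `expected` map a crawl name to an article count.
    The 0.9 default tolerance is an assumption, not a project setting.
    """
    short = []
    for name, target in expected.items():
        count = actual.get(name, 0)  # a missing crawl counts as zero
        if count < tolerance * target:
            short.append((name, count, target))
    return short


if __name__ == "__main__":
    # Placeholder counts for illustration only; replace them with the
    # numbers recorded in the crawl index spreadsheet.
    expected = {"Palestinian": 60000, "Mid E": 60000, "Israel": 60000}
    actual = {"Palestinian": 60000, "Mid E": 25000, "Israel": 30000}
    for name, count, target in find_short_crawls(actual, expected):
        print(f"{name}: {count}/{target} ({count / target:.0%}), restart candidate")
```

Any crawl the function returns would be a candidate for a restart, and the same check can be re-run after each restart.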
Visualization
- researching UI and visualization platforms: a Plotly integration that works with Matplotlib and JupyterLab for the user interface
- run through the JupyterLab environment; once in that environment, the user interface can be used through Plotly
- using existing data
Ongoing tasks:
- check crawl every 2 days - Gy
- update the MVP, especially with respect to the format of the data going into and coming out of the postprocessor, and then as input to the visualization environment - Gy/Fr
- push corrected postprocessor code to master - Gy/Fr
- postprocessor: write instructions documenting the order of utilities and the steps for using the postprocessor - Gy
Action Items:
- try to restart the Mid E and then the Israel NYT Archive crawls, and if that doesn't lead to the expected result, redo each one sequentially - Gy
- check March 31 NYT archive results against NYT Archive postprocess results, as well as new NYT Archive results - Gy
- keep troubleshooting the postprocessor, including meeting together - Gy/Fr
- find a list of possible graphics available through Matplotlib and email it to Alejandro - Fr
- continue research into Plotly integration in the JupyterLab environment - Fr
- research D3 graphs for possible use as a network graph - Fr
- if time allows, consider the KPP dataset from SharePoint - Fr