July 6, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
try to restart the Mid E and then Israel NYT Archive crawls, and if that doesn't lead to expected result, re-do each one sequentially - Gy
check March 31 NYT archive results against NYT Archive postprocess results, as well as new NYT Archive results - Gy
keep trouble-shooting postprocessor including meeting together - Gy/Fr
find a list of possible graphics available through Matplotlib and email it to Alejandro - Fr
continue research into plotly integration in jupyterlab environment - Fr
research d3 graph for possible use for network graph - Fr
if time, consider KPP dataset from sharepoint - Fr
Postprocessor
able to get output from the postprocessor, but not with the large dataset of real Twitter output
is it loading twitter dataframe properly? yes
something wrong with getting the input
are there headers in the input and consistent with headers in the code?
make sure able to load the data first, 1st milestone
make sure that no errors about loading the dataset
need to learn about pandas and dask dataframes; dask is being used to handle the large volume; without dask, Python complains about the large dataset
work with just the chunk of code that loads the dataset into dask, and see how it is loading
line 23 in post_processor: read csv should work on multiple CSVs rather than a single CSV, so it shouldn't matter how many there are; twitter_df should have the number of rows expected; something splits every 10 or 100,000
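The header and row-count checks above can be sketched as a small pre-check that runs before anything is loaded into dask. This is a minimal sketch using only the standard library, not the postprocessor's own code: the function name `check_csv_parts`, the file pattern, and the column names in `EXPECTED_HEADERS` are all hypothetical and would need to match what the postprocessor actually expects.

```python
import csv
import glob

# Hypothetical column names; replace with the headers the postprocessor code expects.
EXPECTED_HEADERS = ["id", "created_at", "text"]


def check_csv_parts(pattern, expected_headers):
    """Return (total_rows, problems) for every CSV file matching pattern."""
    total_rows = 0
    problems = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, None)
            if header is None:
                problems.append((path, "empty file"))
                continue
            missing = [name for name in expected_headers if name not in header]
            if missing:
                problems.append((path, "missing columns: " + ", ".join(missing)))
            # Count non-blank data rows; the header line has already been consumed.
            total_rows += sum(1 for row in reader if row)
    return total_rows, problems
```

The total returned here is the number to compare against the row count of `twitter_df` once it is loaded; a mismatch would suggest the problem is in the loading step rather than in the input files.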
server
an issue has come up with the Graham cloud: unable to upload a folder quickly, and after trying other possible ways, permission was denied when trying to SSH into Graham
crawls
NYT archive (Mid E) for March 31, 2022 does have results for years where there seem to be zeros, such as around the 1980s and 2008-9
also: 1989 has fewer total results even though the number of results with hits doesn't decrease
could be a combination of crawler and postprocessor errors.
need to troubleshoot new crawler problem
need to troubleshoot on-going crawls
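One way to pin down the year-by-year gaps noted above is to count records per year in two result sets and list only the years where the counts differ. This is a minimal sketch: it assumes each record's date is available as an ISO-style string such as `1989-05-17`, which may not match the actual crawl output, and the function names are hypothetical.

```python
from collections import Counter


def count_by_year(dates):
    """Count records per year from ISO-style date strings such as '1989-05-17'."""
    return Counter(int(date[:4]) for date in dates)


def compare_years(reference_dates, new_dates):
    """Return {year: (reference_count, new_count)} for years where the counts differ."""
    reference = count_by_year(reference_dates)
    new = count_by_year(new_dates)
    years = sorted(set(reference) | set(new))
    # A Counter returns 0 for a missing year, so a year absent from one set shows up as 0.
    return {year: (reference[year], new[year]) for year in years if reference[year] != new[year]}
```

Running this on the older archive results and on the new ones would show at a glance which years drop to zero, which helps separate crawler gaps from postprocessor gaps.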
Visualization
Plotly is separate from Matplotlib - different utilities
Matplotlib: can't do buttons; Plotly: managed to do one
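For reference, a Plotly button can be described as a plain figure dictionary with an `updatemenus` entry in the layout. This is a minimal sketch, not the code from the meeting: the function name `build_toggle_figure` and the trace names are hypothetical, and the two series are placeholders for whatever data the visualization uses.

```python
def build_toggle_figure(x, series_a, series_b):
    """Build a Plotly figure specification with two buttons that switch traces."""
    return {
        "data": [
            {"type": "scatter", "x": x, "y": series_a, "name": "A", "visible": True},
            {"type": "scatter", "x": x, "y": series_b, "name": "B", "visible": False},
        ],
        "layout": {
            "updatemenus": [
                {
                    "type": "buttons",
                    "buttons": [
                        # Each button sets the visibility of the traces, in order.
                        {"label": "A", "method": "update", "args": [{"visible": [True, False]}]},
                        {"label": "B", "method": "update", "args": [{"visible": [False, True]}]},
                    ],
                }
            ]
        },
    }
```

Assuming Plotly is installed in the JupyterLab environment, the dictionary can be passed to `plotly.graph_objects.Figure(...)` and displayed with `.show()`; building it as a dictionary first makes it easy to try the same button with different datasets.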
On-going task:
check crawl every 2 days - Gy
update the MVP, especially with respect to the format of data going into the postprocessor and coming out, and then as input to the visualization environment - Gy/Fr
push corrected postprocessor code to master - Gy/Fr
postprocessor: write instructions documenting the order of utilities and the steps to use the postprocessor - Gy
Action Items:
write to cloud support at Digital Alliance - Gy
need to troubleshoot on-going crawls - Gy
troubleshoot crawler problems to set up NYT archive crawl - Gy
re-run NYT archive Mid E & Israel crawls using Arbutus, at a very slow rate - Gy
if time, download sets of crawled data in both JSON and CSV for FoxNews & WaPo twitter crawls and send link to Alejandro - Gy
meet Monday at 1pm to talk about postprocessor - Gy/Fr
continue sorting the different aspects of the button in plotly, and then try with different datasets - Fr
research d3 graph for possible use for network graph - Fr
if time, consider KPP dataset from sharepoint - Fr