July 6, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • try to restart the Mid E and then Israel NYT Archive crawls, and if that doesn't lead to expected result, re-do each one sequentially - Gy
  • check March 31 NYT archive results against NYT Archive postprocess results, as well as new NYT Archive results - Gy
  • keep trouble-shooting postprocessor including meeting together - Gy/Fr
  • find list of possible graphics through Matplotlib and email to Alejandro - Fr
  • continue research into plotly integration in jupyterlab environment - Fr
  • research d3 graph for possible use for network graph - Fr
    • if time, consider KPP dataset from sharepoint - Fr

Postprocessor

  • able to get output from the postprocessor, but not with the large dataset of real Twitter output
    • is it loading the Twitter dataframe properly? yes
    • something is wrong with getting the input
    • are there headers in the input, and are they consistent with the headers in the code?
    • 1st milestone: make sure the data can be loaded
    • make sure there are no errors when loading the dataset
    • need to learn about pandas and dask dataframes; dask is used to handle the large volume, and without dask Python complains about the size of the dataset
    • work with just the chunk of code that loads the dataset into dask, and see how it is loading
    • line 23 in post_processor: read_csv should work on multiple CSVs rather than a single CSV; it shouldn't matter how many files there are, twitter_df should have the expected number of rows; something splits every 10 or 100,000
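
One way to test the header and row-count questions above without going through the dask loader is a standard-library pass over the CSV part files. This is a minimal sketch, not the postprocessor's own code: the column names in EXPECTED_COLUMNS and the function name inspect_csv_parts are hypothetical placeholders.

```python
import csv
from pathlib import Path

# Hypothetical column names; replace with the headers the postprocessor expects.
EXPECTED_COLUMNS = ["id", "text", "created_at"]


def inspect_csv_parts(folder, pattern="*.csv", expected=EXPECTED_COLUMNS):
    """Return (total_rows, problems) for every CSV part file in `folder`."""
    total_rows = 0
    problems = []
    for path in sorted(Path(folder).glob(pattern)):
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, None)
            if header is None:
                problems.append(f"{path.name}: empty file")
                continue
            missing = [col for col in expected if col not in header]
            if missing:
                problems.append(f"{path.name}: missing columns {missing}")
            # Count data rows only; the header was already consumed.
            total_rows += sum(1 for _ in reader)
    return total_rows, problems
```

If the total returned here differs from len(twitter_df) after the dask load, the problem is more likely in how the files are read than in the crawl output itself.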

server

  • issue has come up with the Graham cloud: unable to upload a folder quickly, and after trying other possible ways, got permission denied when trying to SSH into Graham

crawls

  • NYT archive (Mid E) for March 31, 2022 does have results for years where there seem to be zeros, like around 1980s and 2008-9
    • also: 1989 has fewer total results even though the number of results with hits doesn't decrease
    • could be a combination of crawler and postprocessor errors.
  • need to troubleshoot new crawler problem
  • need to troubleshoot on-going crawls
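
To narrow down whether gaps like the ones noted for the NYT archive come from the crawler or the postprocessor, per-year counts from two runs can be compared side by side. A rough sketch, assuming each run has already been reduced to a dict mapping year to number of results; both function names are made up for illustration.

```python
def compare_year_counts(old_counts, new_counts):
    """Return (year, old, new) tuples for years where the two runs disagree."""
    rows = []
    for year in sorted(set(old_counts) | set(new_counts)):
        old = old_counts.get(year, 0)
        new = new_counts.get(year, 0)
        if old != new:
            rows.append((year, old, new))
    return rows


def zero_years(counts, first_year, last_year):
    """List years in the inclusive range that have no results at all."""
    return [y for y in range(first_year, last_year + 1) if counts.get(y, 0) == 0]
```

Running the same comparison on raw crawl output and on postprocessed output would show at which stage a year's results disappear.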

Visualization

  • Plotly is separate from Matplotlib - different utilities
    • Matplotlib: can't do buttons; Plotly: managed to do one
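
For the Plotly button work, buttons are configured through the layout's updatemenus setting. The sketch below only builds that setting as plain Python data, so it runs without Plotly installed; the helper name visibility_buttons is hypothetical, and it assumes one button per trace plus an "All" button is the behaviour wanted.

```python
def visibility_buttons(trace_names):
    """Build a Plotly `updatemenus` value: an "All" button plus one per trace."""
    count = len(trace_names)
    buttons = [{
        "label": "All",
        "method": "update",
        "args": [{"visible": [True] * count}],
    }]
    for index, name in enumerate(trace_names):
        # Show only the trace at `index`; hide the others.
        visible = [i == index for i in range(count)]
        buttons.append({
            "label": name,
            "method": "update",
            "args": [{"visible": visible}],
        })
    return [{"type": "buttons", "direction": "right", "buttons": buttons}]
```

With an existing figure, this could be applied as fig.update_layout(updatemenus=visibility_buttons([t.name for t in fig.data])), assuming every trace has a name.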

On-going task:

  • check crawl every 2 days - Gy
  • update the MVP esp wrt format of data going into postprocessor and coming out, and then as input to the visualization environment - Gy/Fr
  • push corrected postprocessor code to master - Gy/Fr
  • postprocessor: document with instructions the order of utilities and steps to use the postprocessor - Gy

Action Items:

  • write to cloud support at the Digital Research Alliance of Canada - Gy
  • need to troubleshoot on-going crawls - Gy
  • troubleshoot crawler problems to set up NYT archive crawl - Gy
  • re-run NYT archive Mid E & Israel crawl on Arbutus, at a very slow crawl rate - Gy
  • if time, download sets of crawled data in both JSON and CSV for FoxNews & WaPo twitter crawls and send link to Alejandro - Gy
  • meet Monday at 1pm to talk about postprocessor - Gy/Fr
  • continue sorting the different aspects of the button in plotly, and then try with different datasets - Fr
  • research d3 graph for possible use for network graph - Fr
    • if time, consider KPP dataset from sharepoint - Fr