Graph analysis - googleinterns/data-dependency-graph-analysis GitHub Wiki
Introduction
Two modules analyse a graph stored in GraphML format.
The graph is read with networkx, and its features are extracted into pandas DataFrames. Visualisations are built from these DataFrames with Altair, a Python visualisation library based on Vega and Vega-Lite. Each chart is converted to JSON and rendered in a Flask app. There are two separate apps: one analyses two connected datasets in a graph (A->B analysis), and the other helps to understand the graph structure.
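The first two steps of this pipeline (GraphML to networkx, networkx to pandas) can be sketched as follows. This is a minimal illustration, not the code from the repository: the toy graph, the `node_type` attribute and the variable names are assumptions made for the example.

```python
import os
import tempfile

import networkx as nx
import pandas as pd

# Hypothetical toy graph; the "node_type" attribute name is an assumption.
toy = nx.DiGraph()
toy.add_node("1", node_type="dataset")
toy.add_node("2", node_type="system")
toy.add_node("3", node_type="dataset")
toy.add_edges_from([("1", "2"), ("2", "3")])

# Round-trip through a GraphML file, as the apps read their input from disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.graphml")
    nx.write_graphml(toy, path)
    graph = nx.read_graphml(path)  # node ids are read back as strings

# One row per node, one column per node attribute.
nodes_df = pd.DataFrame.from_dict(dict(graph.nodes(data=True)), orient="index")

# A simple feature: how many nodes of each type the graph contains.
type_counts = nodes_df["node_type"].value_counts()
```

A DataFrame like `nodes_df` could then be passed to `altair.Chart(...)`, and the resulting chart serialised with `chart.to_json()` for rendering in a web page.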
A->B analysis
In this module we take two datasets, one upstream and one downstream, and analyse the paths between them.
Usage
python3 graph_analysis/a_b_app.py \
--dataset_A 1069 \
--dataset_B 8770 \
--graph_path "graph_1_10.graphml"
Parameters:
- dataset_A - id of an upstream dataset
- dataset_B - id of a downstream dataset
- graph_path - path to a graph in graphml format
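For illustration, the three flags above could be parsed with `argparse` roughly as shown here. Whether `a_b_app.py` actually uses `argparse`, and whether it treats ids as strings, are assumptions; only the flag names and their meanings come from this page.

```python
import argparse


def build_parser():
    """Build a parser for the three documented flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="A->B analysis (sketch)")
    # Ids are kept as strings here because GraphML node ids are read as strings
    # by default; this choice is an assumption.
    parser.add_argument("--dataset_A", required=True, help="id of an upstream dataset")
    parser.add_argument("--dataset_B", required=True, help="id of a downstream dataset")
    parser.add_argument("--graph_path", required=True, help="path to a graph in graphml format")
    return parser


# Parse the same values as in the usage example above.
args = build_parser().parse_args(
    ["--dataset_A", "1069", "--dataset_B", "8770", "--graph_path", "graph_1_10.graphml"]
)
```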
Paths between two nodes
Between the two datasets we find all possible paths through datasets and systems, and take the top n shortest paths to visualise.
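A minimal sketch of selecting the n shortest paths with networkx is shown below. It uses `nx.shortest_simple_paths`, which yields simple paths lazily from shortest to longest, instead of enumerating every path first; the repository may do this differently. The helper name `top_n_shortest_paths` and the toy graph are hypothetical.

```python
from itertools import islice

import networkx as nx


def top_n_shortest_paths(graph, source, target, n):
    """Return up to n simple paths from source to target, shortest first."""
    return list(islice(nx.shortest_simple_paths(graph, source, target), n))


# Hypothetical toy graph: datasets A, B, d1 and systems s1, s2, s3.
G = nx.DiGraph()
G.add_edges_from([
    ("A", "s1"), ("s1", "B"),                               # short route
    ("A", "s2"), ("s2", "d1"), ("d1", "s3"), ("s3", "B"),   # longer route
])

paths = top_n_shortest_paths(G, "A", "B", n=2)
```

Note that `nx.shortest_simple_paths` raises `NetworkXNoPath` when the downstream dataset cannot be reached from the upstream one.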
SLO
For the shortest path we look up the processing time SLO of each dataset and visualise the SLO for the whole path.
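This page does not say how the per-dataset SLOs are combined. The sketch below assumes the path SLO is the running sum of the processing time SLOs of the datasets along the path; the SLO values, the units and the variable names are all hypothetical.

```python
from itertools import accumulate

# Hypothetical processing time SLOs per dataset, in seconds (assumed units).
slo_by_dataset = {"A": 60, "d1": 300, "B": 120}

# A path alternating datasets and systems; systems carry no SLO in this sketch.
path = ["A", "s2", "d1", "s3", "B"]

# Keep only the datasets, in path order.
dataset_path = [node for node in path if node in slo_by_dataset]

# Running total along the path; the last value is the SLO of the whole path
# under the summation assumption.
cumulative_slo = list(accumulate(slo_by_dataset[d] for d in dataset_path))
path_slo = cumulative_slo[-1]
```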
Data integrity
To understand data integrity metrics for the shortest path, we create charts of reconstruction, regeneration and backup restoration times.
Graph structure analysis
In this module we look into the graph structure, more specifically cycles in the graph and the inputs and outputs of nodes.
Usage
python3 graph_analysis/graph_app.py \
--graph_path "graph_1_10.graphml"
Parameters:
- graph_path - path to a graph in graphml format
Cycles
Here we find all the cycles in the graph, count them, and save them to a file. We also visualise the number of unique nodes of each type that appear in cycles.
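The cycle steps can be sketched with `nx.simple_cycles`, as below. This is an illustration under assumptions: the `node_type` attribute, the one-cycle-per-line file format and the toy graph are hypothetical and may not match the repository.

```python
import os
import tempfile
from collections import Counter

import networkx as nx

# Hypothetical toy graph with two cycles and one node (d3) outside any cycle.
G = nx.DiGraph()
for name in ("d1", "d2", "d3"):
    G.add_node(name, node_type="dataset")
for name in ("s1", "s2", "s3"):
    G.add_node(name, node_type="system")
G.add_edges_from([
    ("d1", "s1"), ("s1", "d2"), ("d2", "s2"), ("s2", "d1"),  # cycle of 4 nodes
    ("d2", "s3"), ("s3", "d2"),                              # cycle of 2 nodes
    ("s3", "d3"),                                            # d3 is not in a cycle
])

# Each cycle is returned as a list of nodes.
cycles = list(nx.simple_cycles(G))
cycle_count = len(cycles)

# Save one cycle per line (assumed format), then read the lines back.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "cycles.txt")
    with open(out_path, "w") as f:
        for cycle in cycles:
            f.write(" -> ".join(cycle) + "\n")
    with open(out_path) as f:
        saved_lines = f.read().splitlines()

# Count the unique nodes of each type that appear in at least one cycle.
nodes_in_cycles = {node for cycle in cycles for node in cycle}
type_counts = Counter(G.nodes[node]["node_type"] for node in nodes_in_cycles)
```

For large graphs the number of simple cycles can grow very quickly, so materialising them all in a list, as done here, may not be practical.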
Nodes input / output
We count the different elements in groups and visualise the density of these counts. You can observe that most collections have up to 5 elements.
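One plausible reading of node inputs and outputs is the in-degree and out-degree of each node. The sketch below counts how many nodes have each degree, which is the kind of distribution a density chart could be drawn from. This interpretation and the toy graph are assumptions, not taken from the repository.

```python
from collections import Counter

import networkx as nx

# Hypothetical toy graph of datasets (d*) and systems (s*).
G = nx.DiGraph()
G.add_edges_from([
    ("d1", "s1"), ("d2", "s1"), ("s1", "d3"), ("d3", "s2"), ("s2", "d4"),
])

# Number of incoming and outgoing edges per node.
in_degree = dict(G.in_degree())
out_degree = dict(G.out_degree())

# Distribution: degree value -> number of nodes with that degree.
in_hist = Counter(in_degree.values())
out_hist = Counter(out_degree.values())
```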