Graph analysis - googleinterns/data-dependency-graph-analysis GitHub Wiki

Introduction

Two modules analyse a graph stored in GraphML format.

The graph is read with networkx, and its features are extracted into pandas DataFrames. Visualisations are built from these DataFrames with Altair, a Python visualisation library based on Vega and Vega-Lite. Each chart is converted to JSON and rendered in a Flask app. There are two separate apps: one analyses two connected datasets in a graph (A->B analysis), and the other explores the graph structure.
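The read and feature-extraction steps could look roughly like the sketch below. It is not the project's actual code: the helper name `graph_to_frames`, the `node_type` attribute and the demo node ids are assumptions made for illustration, and the Altair and Flask steps are left out.

```python
import os
import tempfile

import networkx as nx
import pandas as pd


def graph_to_frames(graph_path):
    """Read a GraphML file and return (nodes_df, edges_df). Hypothetical helper."""
    graph = nx.read_graphml(graph_path)
    # One row per node; node attributes become columns.
    nodes_df = pd.DataFrame.from_dict(dict(graph.nodes(data=True)), orient="index")
    nodes_df.index.name = "node_id"
    # One row per edge, with "source" and "target" columns plus any edge attributes.
    edges_df = nx.to_pandas_edgelist(graph)
    return nodes_df, edges_df


# Build a tiny illustrative graph and round-trip it through GraphML.
# The "node_type" attribute name is an assumption, not taken from the project.
demo = nx.DiGraph()
demo.add_node("1069", node_type="dataset")
demo.add_node("s1", node_type="system")
demo.add_node("8770", node_type="dataset")
demo.add_edges_from([("1069", "s1"), ("s1", "8770")])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.graphml")
    nx.write_graphml(demo, path)
    nodes_df, edges_df = graph_to_frames(path)
```

Keeping nodes and edges in separate DataFrames makes it straightforward to pass either one to a charting library afterwards.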

A->B analysis

This module takes two datasets, one upstream and one downstream, and analyses the paths between them.

Usage

python3 graph_analysis/a_b_app.py \
     --dataset_A 1069 \
     --dataset_B 8770 \
     --graph_path "graph_1_10.graphml"

Parameters:

  • dataset_A - id of an upstream dataset
  • dataset_B - id of a downstream dataset
  • graph_path - path to a graph in graphml format

Paths between two nodes

We find all possible paths of datasets and systems between the two datasets, and visualise the top n shortest ones.
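One way to pick the top n shortest paths is sketched below. It uses `nx.shortest_simple_paths`, which yields simple paths from shortest to longest, so only the first n need to be generated. The function name `top_n_shortest_paths` and the demo graph are hypothetical, and the project may select paths differently.

```python
from itertools import islice

import networkx as nx


def top_n_shortest_paths(graph, source, target, n=3):
    """Return up to n simple paths from source to target, shortest first."""
    try:
        # The generator is lazy, so islice stops after the first n paths.
        return list(islice(nx.shortest_simple_paths(graph, source, target), n))
    except nx.NetworkXNoPath:
        return []


# Illustrative graph: datasets A, B, C connected through systems s1, s2, s3.
demo = nx.DiGraph()
demo.add_edges_from([
    ("A", "s1"), ("s1", "B"),                          # short route
    ("A", "s2"), ("s2", "C"), ("C", "s3"), ("s3", "B"),  # longer route
])

paths = top_n_shortest_paths(demo, "A", "B", n=3)
no_paths = top_n_shortest_paths(demo, "B", "A", n=3)
```

Here `paths` holds the two available routes, shortest first, and `no_paths` is empty because B has no route back to A.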


SLO

For the shortest path, we look up the processing-time SLO of each dataset and visualise the SLO for the whole path.
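The sketch below shows one possible way to combine per-dataset SLOs into a path-level figure. It assumes that the SLO of the whole path is the sum of the per-dataset SLOs, which the text above does not confirm. The function `path_slo`, the dataset id `42`, the system names and all SLO values are made up for illustration.

```python
from itertools import accumulate


def path_slo(path, slo_by_dataset):
    """Return (rows, total) for the datasets on a path.

    Nodes without an SLO entry (for example systems) are skipped.
    Assumption: the path SLO is the sum of the dataset SLOs.
    """
    datasets = [node for node in path if node in slo_by_dataset]
    slos = [slo_by_dataset[node] for node in datasets]
    cumulative = list(accumulate(slos))
    # Each row is (dataset id, its own SLO, running total up to this dataset).
    rows = list(zip(datasets, slos, cumulative))
    total = cumulative[-1] if cumulative else 0
    return rows, total


# Hypothetical path and SLO values in seconds.
path = ["1069", "system_1", "42", "system_2", "8770"]
slo_seconds = {"1069": 60, "42": 300, "8770": 120}
rows, total = path_slo(path, slo_seconds)
```

The running total in each row is the kind of value a cumulative chart along the path would plot.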


Data integrity

To understand data integrity along the shortest path, we chart the reconstruction, regeneration and backup restoration times.

Graph structure analysis

This module looks into the graph structure, specifically the cycles in the graph and the inputs and outputs of nodes.

Usage

python3 graph_analysis/graph_app.py \
     --graph_path "graph_1_10.graphml"

Parameters:

  • graph_path - path to a graph in graphml format

Cycles

Here we find all the cycles in the graph, count them, and save them to a file. We also visualise the number of unique nodes of each type that appear in cycles.
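A minimal sketch of these steps follows, using `nx.simple_cycles` to enumerate cycles in a directed graph. The function name `analyse_cycles`, the `node_type` attribute, the JSON output format and the demo graph are assumptions, not the project's actual choices.

```python
import json
import os
import tempfile
from collections import Counter

import networkx as nx


def analyse_cycles(graph, out_path, type_attr="node_type"):
    """Save all simple cycles to out_path; return (cycle count, unique nodes per type)."""
    cycles = list(nx.simple_cycles(graph))
    with open(out_path, "w") as handle:
        json.dump(cycles, handle)
    # A node that appears in several cycles is counted once.
    unique_nodes = {node for cycle in cycles for node in cycle}
    type_counts = Counter(
        graph.nodes[node].get(type_attr, "unknown") for node in unique_nodes
    )
    return len(cycles), dict(type_counts)


# Illustrative graph with two overlapping cycles.
demo = nx.DiGraph()
for name in ("d1", "d2", "d3"):
    demo.add_node(name, node_type="dataset")
for name in ("s1", "s2"):
    demo.add_node(name, node_type="system")
demo.add_edges_from([
    ("d1", "s1"), ("s1", "d2"), ("d2", "s2"), ("s2", "d1"),  # d1 -> s1 -> d2 -> s2 -> d1
    ("s2", "d3"), ("d3", "s1"),                              # s1 -> d2 -> s2 -> d3 -> s1
])

out_path = os.path.join(tempfile.mkdtemp(), "cycles.json")
cycle_count, type_counts = analyse_cycles(demo, out_path)
```

The number of simple cycles can grow exponentially with graph size, so on large graphs it may be necessary to iterate over the generator lazily instead of building the full list.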


Nodes input / output

We count the elements in each group and visualise their density. In the resulting chart, most collections have up to 5 elements.
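One reading of node inputs and outputs is the in-degree and out-degree of each node. Under that assumption, the sketch below counts how many nodes have each number of inputs and outputs, which is the kind of data a density chart would be built from. The function name `degree_distribution` and the demo graph are hypothetical.

```python
from collections import Counter

import networkx as nx


def degree_distribution(graph):
    """Return ({in-degree: node count}, {out-degree: node count})."""
    in_hist = Counter(degree for _, degree in graph.in_degree())
    out_hist = Counter(degree for _, degree in graph.out_degree())
    return dict(in_hist), dict(out_hist)


# Illustrative graph: a and b both feed c, which feeds d.
demo = nx.DiGraph()
demo.add_edges_from([("a", "c"), ("b", "c"), ("c", "d")])

in_hist, out_hist = degree_distribution(demo)
```

In this demo, `in_hist` shows two nodes with no inputs, one with a single input and one with two inputs.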

Anomaly detection