# Graph analysis - googleinterns/data-dependency-graph-analysis GitHub Wiki

## Introduction

Two modules analyse a graph stored in GraphML format.

The graph is read with networkx, and features are extracted into pandas DataFrames. Visualisations are built from these DataFrames with Altair, a Python visualisation library based on Vega and Vega-Lite. Each chart is converted to JSON and rendered in a Flask app. There are two separate apps: one analyses two connected datasets in a graph (A->B analysis), and the other explores the graph structure.
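As a rough illustration of the first two steps (reading a GraphML file with networkx and extracting features into a pandas DataFrame), here is a minimal sketch. The example graph, the `node_type` attribute and the column names are assumptions made for illustration; they are not taken from the repository.

```python
import os
import tempfile

import networkx as nx
import pandas as pd

# Build a tiny example graph; the "node_type" attribute is hypothetical.
graph = nx.DiGraph()
graph.add_node("1", node_type="dataset")
graph.add_node("2", node_type="system")
graph.add_node("3", node_type="dataset")
graph.add_edges_from([("1", "2"), ("2", "3")])

# Round-trip through a GraphML file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    graph_path = os.path.join(tmp_dir, "example.graphml")
    nx.write_graphml(graph, graph_path)
    loaded = nx.read_graphml(graph_path)

# One row per node, with the node attributes as columns.
nodes_df = pd.DataFrame(
    [{"node_id": node_id, **attrs} for node_id, attrs in loaded.nodes(data=True)]
)

# Add two simple structural features.
nodes_df["in_degree"] = nodes_df["node_id"].map(dict(loaded.in_degree()))
nodes_df["out_degree"] = nodes_df["node_id"].map(dict(loaded.out_degree()))
print(nodes_df)
```

A DataFrame shaped like this can then be passed to a charting library; the chart step is left out here to keep the sketch small.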

### A->B analysis

This module takes two datasets, one upstream and one downstream, and analyses the paths between them.

**Usage**

```
python3 graph_analysis/a_b_app.py \
--dataset_A 1069 \
--dataset_B 8770 \
--graph_path "graph_1_10.graphml"
```

Parameters:

- dataset_A - id of an upstream dataset
- dataset_B - id of a downstream dataset
- graph_path - path to a graph in graphml format
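For reference, the three flags could be parsed with `argparse` roughly as follows. This is only a sketch: how `a_b_app.py` actually reads its arguments is not shown here, and treating the dataset ids as strings is an assumption.

```python
import argparse


def build_parser():
    """Build a parser for the three flags listed above (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="A->B analysis of two datasets in a graph."
    )
    # Ids are kept as strings here; this is an assumption.
    parser.add_argument("--dataset_A", required=True, help="id of an upstream dataset")
    parser.add_argument("--dataset_B", required=True, help="id of a downstream dataset")
    parser.add_argument("--graph_path", required=True, help="path to a graph in GraphML format")
    return parser


# Parse the same values as in the usage example.
args = build_parser().parse_args(
    ["--dataset_A", "1069", "--dataset_B", "8770", "--graph_path", "graph_1_10.graphml"]
)
print(args.dataset_A, args.dataset_B, args.graph_path)
```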

**Paths between two nodes**

We find all possible paths of datasets and systems between the two datasets, and visualise the top n shortest paths.
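One way to get the top n shortest paths with networkx is `nx.shortest_simple_paths`, which yields simple paths from shortest to longest. The sketch below assumes an unweighted directed graph; the helper name `top_n_shortest_paths` and the node names are hypothetical, and the module itself may select paths differently.

```python
from itertools import islice

import networkx as nx


def top_n_shortest_paths(graph, source, target, n):
    """Return up to n simple paths from source to target, shortest first."""
    try:
        # The generator yields paths in order of increasing length,
        # so taking the first n gives the n shortest.
        return list(islice(nx.shortest_simple_paths(graph, source, target), n))
    except nx.NetworkXNoPath:
        return []


# Hypothetical graph: datasets A, B, C connected through systems s1, s2, s3.
graph = nx.DiGraph()
graph.add_edges_from(
    [("A", "s1"), ("s1", "B"), ("A", "s2"), ("s2", "C"), ("C", "s3"), ("s3", "B")]
)

paths = top_n_shortest_paths(graph, "A", "B", n=2)
print(paths)
```

Taking only the first n items avoids enumerating every path, which can be expensive in a large graph.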

**SLO**

For the shortest path, we look up the SLO for each dataset's processing time and visualise the SLO for the whole path.
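A minimal sketch of collecting per-dataset SLO values along a path is shown below. It assumes the SLO is stored in a node attribute called `slo` and that the SLO for the whole path is the sum of the per-dataset values; both are assumptions, as are the helper name `path_slo` and the example numbers.

```python
import networkx as nx


def path_slo(graph, path, attr="slo"):
    """Return (node, SLO) pairs along a path and their sum.

    Nodes without the attribute (for example systems) are skipped.
    Summing the values is an assumption about how the path SLO is defined.
    """
    per_node = [
        (node, graph.nodes[node][attr]) for node in path if attr in graph.nodes[node]
    ]
    total = sum(value for _, value in per_node)
    return per_node, total


# Hypothetical path: dataset A -> system s1 -> dataset B.
graph = nx.DiGraph()
graph.add_node("A", slo=10)
graph.add_node("s1")
graph.add_node("B", slo=30)
graph.add_edges_from([("A", "s1"), ("s1", "B")])

per_node, total = path_slo(graph, ["A", "s1", "B"])
print(per_node, total)
```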

**Data integrity**

To understand data integrity metrics for the shortest path, we create charts of reconstruction, regeneration and backup restoration times.

### Graph structure analysis

This module looks into the graph structure, specifically the cycles in a graph and the inputs and outputs of nodes.

**Usage**

```
python3 graph_analysis/graph_app.py \
--graph_path "graph_1_10.graphml"
```

Parameters:

- graph_path - path to a graph in graphml format

**Cycles**

Here we find all the cycles in a graph, count them, and save them to a file. We also visualise the number of unique nodes of each type that appear in cycles.
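The steps above could look roughly like this with `nx.simple_cycles`. The `node_type` attribute, the JSON output format and the file name are assumptions for illustration; the module may store cycles differently.

```python
import json
import os
import tempfile
from collections import Counter

import networkx as nx

# Hypothetical graph with two cycles; d3 is not part of any cycle.
graph = nx.DiGraph()
graph.add_nodes_from(["d1", "d2", "d3"], node_type="dataset")
graph.add_nodes_from(["s1", "s2"], node_type="system")
graph.add_edges_from(
    [("d1", "s1"), ("s1", "d2"), ("d2", "s2"), ("s2", "d1"), ("s1", "d1"), ("d2", "d3")]
)

# Enumerate all simple cycles and count them.
cycles = list(nx.simple_cycles(graph))
cycle_count = len(cycles)

# Save the cycles to a file (JSON is an assumed format), then read them back.
with tempfile.TemporaryDirectory() as tmp_dir:
    cycles_path = os.path.join(tmp_dir, "cycles.json")
    with open(cycles_path, "w") as cycles_file:
        json.dump(cycles, cycles_file)
    with open(cycles_path) as cycles_file:
        saved = json.load(cycles_file)

# Count the unique nodes of each type that appear in at least one cycle.
nodes_in_cycles = set().union(*cycles)
type_counts = Counter(graph.nodes[node]["node_type"] for node in nodes_in_cycles)
print(cycle_count, dict(type_counts))
```

Note that the number of simple cycles can grow very quickly with graph size, so full enumeration may be slow on large graphs.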

**Nodes input / output**

We count the different elements in groups and visualise the density. You can observe that most collections have up to 5 elements.
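As one possible reading of node inputs and outputs, the sketch below counts incoming and outgoing edges per node and tallies how often each count occurs, which is the kind of data a density chart could be built from. Interpreting inputs and outputs as in-degree and out-degree is an assumption, and the example graph is hypothetical.

```python
from collections import Counter

import networkx as nx

# Hypothetical graph of datasets (d*) and systems (s*).
graph = nx.DiGraph()
graph.add_edges_from(
    [("d1", "s1"), ("d2", "s1"), ("s1", "d3"), ("d3", "s2"), ("s2", "d4")]
)

# Number of inputs (incoming edges) and outputs (outgoing edges) per node.
in_counts = dict(graph.in_degree())
out_counts = dict(graph.out_degree())

# How many nodes have 0, 1, 2, ... inputs or outputs.
in_distribution = Counter(in_counts.values())
out_distribution = Counter(out_counts.values())
print(dict(in_distribution), dict(out_distribution))
```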