Data Access for ML - clumsyspeedboat/Decision-Tree-Neo4j GitHub Wiki

Description of Neo4j, focusing on data access for machine learning

In layman's terms, applying machine learning to graph data is called graph ML. ML tasks are defined rather differently on graphs. They can be categorized into four types: node classification, link prediction, graph classification (learning over the whole graph), and community detection.

Why use graph data, and what does it offer for ML? Humans respond faster to visualized data: given a list of increasing numbers, we do not notice a steep increase as quickly as when the same numbers are plotted. Similarly, computers can find interesting patterns hidden in large graph representations of data. The main aim is to take graph-structured data and build functions that operate over graphs to achieve greater efficiency.

Accessing data using a Graph Data Science Library:

The Neo4j Graph Data Science (GDS) plugin can access the database directly and run Cypher queries on it. Think of it as a graph view that our algorithm runs on, without our having to change the stored graph data to create a new structure. It acts as a virtual graph on which we run our algorithm, with operations ranging from aggregation to summarisation and from copying to contraction of the graph, and the result can be used as a dataset for training. One important point to take into account is data leakage: to avoid it, we have to split the graph into training and test subgraphs. The result can be converted into a Pandas DataFrame, which can in turn be loaded into NetworkX as a graph and used with libraries other than GDS (converting to and from other data formats: https://networkx.org/documentation/networkx-1.10/reference/generated/networkx.convert_matrix.from_pandas_dataframe.html). NetworkX is a Python library that offers many graph operations, from creating and mutating graphs to visualizing them.
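The DataFrame-to-NetworkX step can be sketched as follows. This is a minimal illustration, not the project's own code: the column names (`source`, `target`, `weight`) and the sample rows are hypothetical stand-ins for whatever a Cypher query returns. It assumes NetworkX 2.1 or later, where the conversion function is `from_pandas_edgelist`; the 1.10 page linked above documents the older `from_pandas_dataframe`.

```python
import networkx as nx
import pandas as pd

# Hypothetical edge list, standing in for the rows returned by a Cypher query.
edges_df = pd.DataFrame(
    {
        "source": ["Alice", "Alice", "Bob", "Carol"],
        "target": ["Bob", "Carol", "Carol", "Dave"],
        "weight": [1.0, 2.0, 1.5, 0.5],
    }
)

# Build an undirected graph; each row becomes one edge carrying its weight.
graph = nx.from_pandas_edgelist(
    edges_df, source="source", target="target", edge_attr="weight"
)

# The graph can now be used with any NetworkX algorithm.
print(graph.number_of_nodes(), graph.number_of_edges())
print(nx.degree_centrality(graph))
```

Passing `create_using=nx.DiGraph` to the same call would keep the edge direction, which may matter for relationships exported from Neo4j.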

Link: https://neo4j.com/developer/graph-data-science/link-prediction/scikit-learn/
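To make the train/test split for link prediction concrete, here is a minimal sketch of a random edge holdout. It is one simple option and not necessarily the strategy used in the linked guide; the function name `split_edges`, its parameters, and the sample edges are hypothetical.

```python
import random


def split_edges(edges, test_fraction=0.2, seed=42):
    """Randomly hold out a fraction of the edges as the test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(edges)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical edge list.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
train_edges, test_edges = split_edges(edges, test_fraction=0.2)

# No edge appears in both sets, so features computed on the training
# graph never see the held-out links.
print(len(train_edges), len(test_edges))
```

Graph features for training should then be computed only on the graph built from `train_edges`, otherwise information about the held-out links leaks into the model.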

Using Py2Neo to connect to a Neo4j database from Jupyter:

Install py2neo by calling pip install py2neo. Called without arguments, py2neo.Database() looks for a Neo4j instance on the local machine at its default address and port.
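A connection from a notebook might look like the sketch below. The URI, user name, password, and the helper names `connect` and `fetch_records` are placeholders chosen for illustration. It uses py2neo's `Graph` class and `Graph.run(...).data()`; the actual database call is left commented out because it needs a running Neo4j instance.

```python
def connect(uri="bolt://localhost:7687", user="neo4j", password="password"):
    """Open a py2neo connection; the URI and credentials are placeholders."""
    # Imported lazily so the helper below can be used without py2neo installed.
    from py2neo import Graph

    return Graph(uri, auth=(user, password))


def fetch_records(graph, cypher):
    """Run a Cypher query and return the rows as a list of dictionaries."""
    return graph.run(cypher).data()


# Usage against a running database (not executed here):
# graph = connect(password="your-password")
# rows = fetch_records(
#     graph,
#     "MATCH (a)-[r]->(b) RETURN id(a) AS source, id(b) AS target LIMIT 5",
# )
```

The list of dictionaries returned by `fetch_records` can be passed straight to `pandas.DataFrame(rows)` for further processing.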

Using the StellarGraph library for machine learning on graphs and networks for data access:

StellarGraph supports loading data from Pandas DataFrames, NumPy arrays, Neo4j, and NetworkX graphs. The following example shows the application of GraphSAGE on the Cora dataset stored in Neo4j: https://colab.research.google.com/github/stellargraph/stellargraph/blob/master/demos/connector/neo4j/undirected-graphsage-on-cora-neo4j-example.ipynb#scrollTo=SYwDGD2MgXKc. In this example, subgraphs are sampled directly from Neo4j, which eliminates the need to store the whole graph structure in NetworkX.