Training GDS Financial Fraud - tomgeudens/practical-neo4j GitHub Wiki

Context: Cut-and-paste commands for the Graph Data Science - Financial Fraud session.

Prerequisite: This document will assume you have a Neo4j instance running and are connected to it with the Neo4j Browser. You also need to have the Payment Simulation database loaded.

Setup

Make sure you're in the correct database

:use paysim

Preliminary Data Analysis

What does the model look like?

CALL db.schema.visualization();

Full graph counts

CALL apoc.meta.stats();

Node (label) frequencies

CALL db.labels() YIELD label
CALL apoc.cypher.run('MATCH (:`'+label+'`) RETURN count(*) as freq', {})
YIELD value
WITH label, value.freq AS freq
CALL apoc.meta.stats() YIELD nodeCount
WITH *, 3 AS precision
WITH *, 10^precision AS factor, toFloat(freq)/toFloat(nodeCount) AS relFreq
RETURN label AS nodeLabel, freq AS frequency,
round(relFreq*factor)/factor AS relativeFrequency
ORDER BY freq DESC;

Relationship (type) frequencies

CALL db.relationshipTypes() YIELD relationshipType as type
CALL apoc.cypher.run('MATCH ()-[:`'+type+'`]->() RETURN count(*) as freq', {})
YIELD value
WITH type AS relationshipType, value.freq AS freq
CALL apoc.meta.stats() YIELD relCount
WITH *, 3 AS precision
WITH *, 10^precision AS factor, toFloat(freq)/toFloat(relCount) as relFreq
RETURN relationshipType, freq AS frequency,
round(relFreq*factor)/factor AS relativeFrequency
ORDER BY freq DESC;

Property types

CALL apoc.meta.data() YIELD label, property, type, elementType
WHERE type<>'RELATIONSHIP'
RETURN elementType,label, property, type
ORDER BY elementType, label, property;

Value proposition

MATCH (t:Transaction)
WITH sum(t.amount) AS globalSum, count(t) AS globalCnt
WITH *, 10^3 AS scaleFactor
UNWIND ['CashIn', 'CashOut', 'Payment', 'Debit', 'Transfer'] AS txType
  CALL apoc.cypher.run('MATCH (t:' + txType + ')
    RETURN sum(t.amount) as txAmount, count(t) AS txCnt', {})
  YIELD value
RETURN txType,value.txAmount AS TotalMarketValue,
  100*round(scaleFactor*(toFloat(value.txAmount)/toFloat(globalSum)))
    /scaleFactor AS `%MarketValue`,
  100*round(scaleFactor*(toFloat(value.txCnt)/toFloat(globalCnt)))
    /scaleFactor AS `%MarketTransactions`,
  toInteger(toFloat(value.txAmount)/toFloat(value.txCnt)) AS AvgTransactionValue,
  value.txCnt AS NumberOfTransactions
ORDER BY `%MarketTransactions` DESC;

First Party Fraud

Identify pairs of clients sharing Personal Identifiable Information

MATCH (c1:Client)-[:HAS_EMAIL|:HAS_PHONE|:HAS_SSN]->(n)<-[:HAS_EMAIL|:HAS_PHONE|:HAS_SSN]-(c2:Client)
WHERE c1.id<>c2.id
RETURN c1.id, c2.id, count(*) AS freq
ORDER BY freq DESC;

How many potential clients with issues do we have?

MATCH (c1:Client)-[:HAS_EMAIL|:HAS_PHONE|:HAS_SSN]->()<-[:HAS_EMAIL|:HAS_PHONE|:HAS_SSN]-(c2:Client)
WHERE c1.id<>c2.id
RETURN count(DISTINCT c1.id) AS potential;

Create a new relationship (and then explain in your own words what this does)

MATCH (c1:Client)-[:HAS_EMAIL|:HAS_PHONE|:HAS_SSN]->(n)<-[:HAS_EMAIL|:HAS_PHONE|:HAS_SSN]-(c2:Client)
WHERE c1.id<>c2.id
WITH c1, c2, count(*) as cnt
MERGE (c1)-[:SHARED_IDENTIFIERS {count: cnt}]->(c2);

Visualize what you just created ...

MATCH p=(c:Client)-[s:SHARED_IDENTIFIERS]-() 
WHERE s.count >= 2 
RETURN p limit 25;

WCC - Step 1 - Memory Estimation

CALL gds.graph.create.cypher.estimate(
'MATCH (c:Client) RETURN id(c) AS id',
'MATCH (c1:Client)-[r:SHARED_IDENTIFIERS]-(c2:Client)
WHERE c1.id<>c2.id
RETURN id(c1) AS source, id(c2) AS target, r.count AS weight')
YIELD requiredMemory,nodeCount,relationshipCount;

WCC - Step 2 - Any leftovers to clean up

CALL gds.graph.list();
CALL gds.graph.drop('name-of-your-graph');

WCC - Step 3 - Native Projection

CALL gds.graph.create('WCC', 'Client',
	{
    	SHARED_IDENTIFIERS:{
        	type: 'SHARED_IDENTIFIERS',
        	properties: {
            	count: {
                	property: 'count'
                }
            }
        }
	}
) YIELD graphName,nodeCount,relationshipCount,createMillis;

WCC - Step 4 - Estimate the memory usage of the algorithm

CALL gds.wcc.stream.estimate('WCC');

WCC - Step 5 - Do the stats make sense?

CALL gds.wcc.stats('WCC');

WCC - Step 6 - Run in streaming mode

CALL gds.wcc.stream('WCC')
YIELD componentId, nodeId
WITH componentId AS cluster, gds.util.asNode(nodeId) AS client
WITH cluster, collect(client.id) AS clients
WITH *, size(clients) AS clusterSize
WHERE clusterSize > 1
RETURN cluster, clusterSize,clients
ORDER by clusterSize DESC;

WCC - Step 7 - Write back to the graph (using streaming???)

CALL gds.wcc.stream('WCC')
YIELD componentId, nodeId
WITH componentId AS cluster, gds.util.asNode(nodeId) AS client
WITH cluster, collect(client.id) AS clients
WITH *, size(clients) AS clusterSize
WHERE clusterSize > 1
UNWIND clients AS client
MATCH (c:Client)
WHERE c.id = client
SET c.firstPartyFraudGroup=cluster;

WCC - Step 8 - Visualization

MATCH (c:Client)
WITH c.firstPartyFraudGroup AS fpGroupID, collect(c.id) AS fGroup
WITH *, size(fGroup) AS groupSize WHERE groupSize >= 9
WITH collect(fpGroupID) AS fraudRings
MATCH p=(c:Client)-[:HAS_SSN|HAS_EMAIL|HAS_PHONE]->()
WHERE c.firstPartyFraudGroup IN fraudRings
RETURN p

Similarity - Step 1 - Cypher Projection

CALL gds.graph.create.cypher('Similarity',
'MATCH(c:Client)
    WHERE exists(c.firstPartyFraudGroup)
    RETURN id(c) AS id,labels(c) AS labels
UNION
MATCH(n)
    WHERE n:Email OR n:Phone OR n:SSN
    RETURN id(n) AS id,labels(n) AS labels',
'MATCH(c:Client)-[:HAS_EMAIL|:HAS_PHONE|:HAS_SSN]->(ids)
WHERE exists(c.firstPartyFraudGroup)
RETURN id(c) AS source,id(ids) AS target')
YIELD graphName,nodeCount,relationshipCount,createMillis;

Similarity - Step 2 - Stream

CALL gds.nodeSimilarity.stream('Similarity', {topK:15})
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).id AS client1,
    gds.util.asNode(node2).id AS client2, similarity
ORDER BY similarity DESC;

Similarity - Step 3 - Mutate

CALL gds.nodeSimilarity.mutate('Similarity',{topK:15,
  mutateProperty:'jaccardScore', mutateRelationshipType:'SIMILAR_TO'});

Similarity - Step 3 - Verify mutation ...

CALL gds.graph.list('Similarity');

Similarity - Step 4 - Write back to the graph

CALL gds.graph.writeRelationship('Similarity','SIMILAR_TO','jaccardScore');

Similarity - Step 5 - Visualization

MATCH (c:Client)
WITH c.firstPartyFraudGroup AS fpGroupID, collect(c.id) AS fGroup
WITH *, size(fGroup) AS groupSize WHERE groupSize >= 9
WITH collect(fpGroupID) AS fraudRings
MATCH p=(c:Client)-[:SIMILAR_TO]->()
WHERE c.firstPartyFraudGroup IN fraudRings
RETURN p

Degree Centrality - Step 1 - Streaming mode and reusing what we already have

CALL gds.degree.stream('Similarity',
    {
        nodeLabels: ['Client'],
        relationshipTypes: ['SIMILAR_TO'],
        relationshipWeightProperty:'jaccardScore'
    }
) YIELD nodeId,score
RETURN gds.util.asNode(nodeId).id AS client,score
ORDER BY score DESC;

Degree Centrality - Step 2 - Write back to the graph

CALL gds.degree.write('Similarity',{nodeLabels:['Client'],
    relationshipTypes:['SIMILAR_TO'],
    relationshipWeightProperty:'jaccardScore',
    writeProperty:'firstPartyFraudScore'});

Degree Centrality - Step 3 - Enhance the graph (notice the trick?)

MATCH(c:Client)
WHERE exists(c.firstPartyFraudScore)
WITH percentileCont(c.firstPartyFraudScore, 0.95)
    AS firstPartyFraudThreshold
MATCH(c:Client)
WHERE c.firstPartyFraudScore>firstPartyFraudThreshold
SET c:FirstPartyFraudster;

Cleanup

CALL gds.graph.list()
YIELD graphName AS namedGraph
WITH namedGraph
CALL gds.graph.drop(namedGraph)
YIELD graphName
RETURN graphName;

Second Party Fraud

Clients that were not identified but do interact with First Party Fraudsters

MATCH p=(:Client:FirstPartyFraudster)-[]-(:Transaction)-[]-(c:Client)
WHERE NOT c:FirstPartyFraudster
RETURN p;

And what types of transactions are we talking about?

MATCH (:Client:FirstPartyFraudster)-[]-(txn:Transaction)-[]-(c:Client)
WHERE NOT c:FirstPartyFraudster
UNWIND labels(txn) AS transactionType
RETURN transactionType, count(*) AS freq;

Create the (aggregated) MonoPartite graph ...

MATCH (c1:FirstPartyFraudster)-[]->(t:Transaction)-[]->(c2:Client)
WHERE NOT c2:FirstPartyFraudster
WITH c1,c2,sum(t.amount) AS totalAmount
SET c2:SecondPartyFraudSuspect
CREATE (c1)-[:TRANSFER_TO {amount:totalAmount}]->(c2);

Same, in the other direction

MATCH (c1:FirstPartyFraudster)<-[]-(t:Transaction)<-[]-(c2:Client)
WHERE NOT c2:FirstPartyFraudster
WITH c1,c2,sum(t.amount) AS totalAmount
SET c2:SecondPartyFraudSuspect
CREATE (c1)<-[:TRANSFER_TO {amount:totalAmount}]-(c2);

Visualize

MATCH p=(:Client:FirstPartyFraudster)-[:TRANSFER_TO]-(c:Client)
WHERE NOT c:FirstPartyFraudster
RETURN p;

Native projection ... because now we can of course

CALL gds.graph.create('SecondPartyFraudNetwork','Client','TRANSFER_TO',{relationshipProperties:'amount'});

Streaming WCC to find relevant clusters (is this starting to look familiar?)

CALL gds.wcc.stream('SecondPartyFraudNetwork')
YIELD nodeId,componentId
WITH gds.util.asNode(nodeId) AS client,componentId AS clusterId
WITH clusterId,collect(client.id) AS cluster
WITH clusterId,size(cluster) AS clusterSize,cluster
WHERE clusterSize>1
RETURN clusterId,clusterSize
ORDER BY clusterSize DESC;

Same write back trick ... why did we do this again?

CALL gds.wcc.stream('SecondPartyFraudNetwork')
YIELD nodeId,componentId
WITH gds.util.asNode(nodeId) AS client,componentId AS clusterId
WITH clusterId,collect(client.id) AS cluster
WITH clusterId,size(cluster) AS clusterSize,cluster
WHERE clusterSize>1
UNWIND cluster AS client
MATCH(c:Client {id:client})
SET c.secondPartyFraudGroup=clusterId;

Pagerank is going to tell us this time

CALL gds.pageRank.stream('SecondPartyFraudNetwork',{relationshipWeightProperty:'amount'})
YIELD nodeId,score
WITH gds.util.asNode(nodeId) AS client,score AS pageRankScore
RETURN client.secondPartyFraudGroup,client.name,labels(client),pageRankScore
ORDER BY client.secondPartyFraudGroup,pageRankScore DESC;

And another write back trick ...

CALL gds.pageRank.stream('SecondPartyFraudNetwork',{relationshipWeightProperty:'amount'})
YIELD nodeId,score
WITH gds.util.asNode(nodeId) AS client,score AS pageRankScore
WHERE exists(client.secondPartyFraudGroup)
	AND pageRankScore >1 AND NOT client:FirstPartyFraudster
MATCH(c:Client {id:client.id})
SET c:SecondPartyFraud
SET c.secondPartyFraudScore=pageRankScore;

Visualize

MATCH p=(:Client:FirstPartyFraudster)-[:TRANSFER_TO]-(c:Client)
WHERE NOT c:FirstPartyFraudster
RETURN p;

Cleanup ... and done ...

CALL gds.graph.list()
YIELD graphName AS namedGraph
WITH namedGraph
CALL gds.graph.drop(namedGraph)
YIELD graphName
RETURN graphName;