Training GDS Financial Fraud - tomgeudens/practical-neo4j GitHub Wiki
Context: Cut-and-paste commands for the Graph Data Science - Financial Fraud session.
Prerequisite: This document will assume you have a Neo4j instance running and are connected to it with the Neo4j Browser. You also need to have the Payment Simulation database loaded.
Setup
Make sure you're in the correct database
:use paysim
Preliminary Data Analysis
What does the model look like?
CALL db.schema.visualization();
Full graph counts
CALL apoc.meta.stats();
Node (label) frequencies
CALL db.labels() YIELD label
CALL apoc.cypher.run('MATCH (:`'+label+'`) RETURN count(*) as freq', {})
YIELD value
WITH label, value.freq AS freq
CALL apoc.meta.stats() YIELD nodeCount
WITH *, 3 AS precision
WITH *, 10^precision AS factor, toFloat(freq)/toFloat(nodeCount) AS relFreq
RETURN label AS nodeLabel, freq AS frequency,
round(relFreq*factor)/factor AS relativeFrequency
ORDER BY freq DESC;
Relationship (type) frequencies
CALL db.relationshipTypes() YIELD relationshipType as type
CALL apoc.cypher.run('MATCH ()-[:`'+type+'`]->() RETURN count(*) as freq', {})
YIELD value
WITH type AS relationshipType, value.freq AS freq
CALL apoc.meta.stats() YIELD relCount
WITH *, 3 AS precision
WITH *, 10^precision AS factor, toFloat(freq)/toFloat(relCount) as relFreq
RETURN relationshipType, freq AS frequency,
round(relFreq*factor)/factor AS relativeFrequency
ORDER BY freq DESC;
Property types
CALL apoc.meta.data() YIELD label, property, type, elementType
WHERE type<>'RELATIONSHIP'
RETURN elementType,label, property, type
ORDER BY elementType, label, property;
Value proposition
MATCH (t:Transaction)
WITH sum(t.amount) AS globalSum, count(t) AS globalCnt
WITH *, 10^3 AS scaleFactor
UNWIND ['CashIn', 'CashOut', 'Payment', 'Debit', 'Transfer'] AS txType
CALL apoc.cypher.run('MATCH (t:' + txType + ')
RETURN sum(t.amount) as txAmount, count(t) AS txCnt', {})
YIELD value
RETURN txType,value.txAmount AS TotalMarketValue,
100*round(scaleFactor*(toFloat(value.txAmount)/toFloat(globalSum)))
/scaleFactor AS `%MarketValue`,
100*round(scaleFactor*(toFloat(value.txCnt)/toFloat(globalCnt)))
/scaleFactor AS `%MarketTransactions`,
toInteger(toFloat(value.txAmount)/toFloat(value.txCnt)) AS AvgTransactionValue,
value.txCnt AS NumberOfTransactions
ORDER BY `%MarketTransactions` DESC;
First Party Fraud
Identify pairs of clients sharing Personal Identifiable Information
MATCH (c1:Client)-[:HAS_EMAIL|:HAS_PHONE|:HAS_SSN]->(n)<-[:HAS_EMAIL|:HAS_PHONE|:HAS_SSN]-(c2:Client)
WHERE c1.id<>c2.id
RETURN c1.id, c2.id, count(*) AS freq
ORDER BY freq DESC;
How many potential clients with issues do we have?
MATCH (c1:Client)-[:HAS_EMAIL|:HAS_PHONE|:HAS_SSN]->()<-[:HAS_EMAIL|:HAS_PHONE|:HAS_SSN]-(c2:Client)
WHERE c1.id<>c2.id
RETURN count(DISTINCT c1.id) AS potential;
Create a new relationship (and then explain in your own words what this does)
MATCH (c1:Client)-[:HAS_EMAIL|:HAS_PHONE|:HAS_SSN]->(n)<-[:HAS_EMAIL|:HAS_PHONE|:HAS_SSN]-(c2:Client)
WHERE c1.id<>c2.id
WITH c1, c2, count(*) as cnt
MERGE (c1)-[:SHARED_IDENTIFIERS {count: cnt}]->(c2);
Visualize what you just created ...
MATCH p=(c:Client)-[s:SHARED_IDENTIFIERS]-()
WHERE s.count >= 2
RETURN p limit 25;
WCC - Step 1 - Memory Estimation
CALL gds.graph.create.cypher.estimate(
'MATCH (c:Client) RETURN id(c) AS id',
'MATCH (c1:Client)-[r:SHARED_IDENTIFIERS]-(c2:Client)
WHERE c1.id<>c2.id
RETURN id(c1) AS source, id(c2) AS target, r.count AS weight')
YIELD requiredMemory,nodeCount,relationshipCount;
WCC - Step 2 - Any leftovers to clean up
CALL gds.graph.list();
CALL gds.graph.drop('name-of-your-graph');
WCC - Step 3 - Native Projection
CALL gds.graph.create('WCC', 'Client',
{
SHARED_IDENTIFIERS:{
type: 'SHARED_IDENTIFIERS',
properties: {
count: {
property: 'count'
}
}
}
}
) YIELD graphName,nodeCount,relationshipCount,createMillis;
WCC - Step 4 - Estimate the memory usage of the algorithm
CALL gds.wcc.stream.estimate('WCC');
WCC - Step 5 - Do the stats make sense?
CALL gds.wcc.stats('WCC');
WCC - Step 6 - Run in streaming mode
CALL gds.wcc.stream('WCC')
YIELD componentId, nodeId
WITH componentId AS cluster, gds.util.asNode(nodeId) AS client
WITH cluster, collect(client.id) AS clients
WITH *, size(clients) AS clusterSize
WHERE clusterSize > 1
RETURN cluster, clusterSize,clients
ORDER by clusterSize DESC;
WCC - Step 7 - Write back to the graph (using streaming???)
CALL gds.wcc.stream('WCC')
YIELD componentId, nodeId
WITH componentId AS cluster, gds.util.asNode(nodeId) AS client
WITH cluster, collect(client.id) AS clients
WITH *, size(clients) AS clusterSize
WHERE clusterSize > 1
UNWIND clients AS client
MATCH (c:Client)
WHERE c.id = client
SET c.firstPartyFraudGroup=cluster;
WCC - Step 8 - Visualization
MATCH (c:Client)
WITH c.firstPartyFraudGroup AS fpGroupID, collect(c.id) AS fGroup
WITH *, size(fGroup) AS groupSize WHERE groupSize >= 9
WITH collect(fpGroupID) AS fraudRings
MATCH p=(c:Client)-[:HAS_SSN|HAS_EMAIL|HAS_PHONE]->()
WHERE c.firstPartyFraudGroup IN fraudRings
RETURN p
Similarity - Step 1 - Cypher Projection
CALL gds.graph.create.cypher('Similarity',
'MATCH(c:Client)
WHERE exists(c.firstPartyFraudGroup)
RETURN id(c) AS id,labels(c) AS labels
UNION
MATCH(n)
WHERE n:Email OR n:Phone OR n:SSN
RETURN id(n) AS id,labels(n) AS labels',
'MATCH(c:Client)-[:HAS_EMAIL|:HAS_PHONE|:HAS_SSN]->(ids)
WHERE exists(c.firstPartyFraudGroup)
RETURN id(c) AS source,id(ids) AS target')
YIELD graphName,nodeCount,relationshipCount,createMillis;
Similarity - Step 2 - Stream
CALL gds.nodeSimilarity.stream('Similarity', {topK:15})
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).id AS client1,
gds.util.asNode(node2).id AS client2, similarity
ORDER BY similarity DESC;
Similarity - Step 3 - Mutate
CALL gds.nodeSimilarity.mutate('Similarity',{topK:15,
mutateProperty:'jaccardScore', mutateRelationshipType:'SIMILAR_TO'});
Similarity - Step 3 - Verify mutation ...
CALL gds.graph.list('Similarity');
Similarity - Step 4 - Write back to the graph
CALL gds.graph.writeRelationship('Similarity','SIMILAR_TO','jaccardScore');
Similarity - Step 5 - Visualization
MATCH (c:Client)
WITH c.firstPartyFraudGroup AS fpGroupID, collect(c.id) AS fGroup
WITH *, size(fGroup) AS groupSize WHERE groupSize >= 9
WITH collect(fpGroupID) AS fraudRings
MATCH p=(c:Client)-[:SIMILAR_TO]->()
WHERE c.firstPartyFraudGroup IN fraudRings
RETURN p
Degree Centrality - Step 1 - Streaming mode and reusing what we already have
CALL gds.degree.stream('Similarity',
{
nodeLabels: ['Client'],
relationshipTypes: ['SIMILAR_TO'],
relationshipWeightProperty:'jaccardScore'
}
) YIELD nodeId,score
RETURN gds.util.asNode(nodeId).id AS client,score
ORDER BY score DESC;
Degree Centrality - Step 2 - Write back to the graph
CALL gds.degree.write('Similarity',{nodeLabels:['Client'],
relationshipTypes:['SIMILAR_TO'],
relationshipWeightProperty:'jaccardScore',
writeProperty:'firstPartyFraudScore'});
Degree Centrality - Step 3 - Enhance the graph (notice the trick?)
MATCH(c:Client)
WHERE exists(c.firstPartyFraudScore)
WITH percentileCont(c.firstPartyFraudScore, 0.95)
AS firstPartyFraudThreshold
MATCH(c:Client)
WHERE c.firstPartyFraudScore>firstPartyFraudThreshold
SET c:FirstPartyFraudster;
Cleanup
CALL gds.graph.list()
YIELD graphName AS namedGraph
WITH namedGraph
CALL gds.graph.drop(namedGraph)
YIELD graphName
RETURN graphName;
Second Party Fraud
Clients that were not identified but do interact with First Party Fraudsters
MATCH p=(:Client:FirstPartyFraudster)-[]-(:Transaction)-[]-(c:Client)
WHERE NOT c:FirstPartyFraudster
RETURN p;
And what types of transactions are we talking about?
MATCH (:Client:FirstPartyFraudster)-[]-(txn:Transaction)-[]-(c:Client)
WHERE NOT c:FirstPartyFraudster
UNWIND labels(txn) AS transactionType
RETURN transactionType, count(*) AS freq;
Create the (aggregated) MonoPartite graph ...
MATCH (c1:FirstPartyFraudster)-[]->(t:Transaction)-[]->(c2:Client)
WHERE NOT c2:FirstPartyFraudster
WITH c1,c2,sum(t.amount) AS totalAmount
SET c2:SecondPartyFraudSuspect
CREATE (c1)-[:TRANSFER_TO {amount:totalAmount}]->(c2);
Same, in the other direction
MATCH (c1:FirstPartyFraudster)<-[]-(t:Transaction)<-[]-(c2:Client)
WHERE NOT c2:FirstPartyFraudster
WITH c1,c2,sum(t.amount) AS totalAmount
SET c2:SecondPartyFraudSuspect
CREATE (c1)<-[:TRANSFER_TO {amount:totalAmount}]-(c2);
Visualize
MATCH p=(:Client:FirstPartyFraudster)-[:TRANSFER_TO]-(c:Client)
WHERE NOT c:FirstPartyFraudster
RETURN p;
Native projection ... because now we can of course
CALL gds.graph.create('SecondPartyFraudNetwork','Client','TRANSFER_TO',{relationshipProperties:'amount'});
Streaming WCC to find relevant clusters (is this starting to look familiar?)
CALL gds.wcc.stream('SecondPartyFraudNetwork')
YIELD nodeId,componentId
WITH gds.util.asNode(nodeId) AS client,componentId AS clusterId
WITH clusterId,collect(client.id) AS cluster
WITH clusterId,size(cluster) AS clusterSize,cluster
WHERE clusterSize>1
RETURN clusterId,clusterSize
ORDER BY clusterSize DESC;
Same write back trick ... why did we do this again?
CALL gds.wcc.stream('SecondPartyFraudNetwork')
YIELD nodeId,componentId
WITH gds.util.asNode(nodeId) AS client,componentId AS clusterId
WITH clusterId,collect(client.id) AS cluster
WITH clusterId,size(cluster) AS clusterSize,cluster
WHERE clusterSize>1
UNWIND cluster AS client
MATCH(c:Client {id:client})
SET c.secondPartyFraudGroup=clusterId;
Pagerank is going to tell us this time
CALL gds.pageRank.stream('SecondPartyFraudNetwork',{relationshipWeightProperty:'amount'})
YIELD nodeId,score
WITH gds.util.asNode(nodeId) AS client,score AS pageRankScore
RETURN client.secondPartyFraudGroup,client.name,labels(client),pageRankScore
ORDER BY client.secondPartyFraudGroup,pageRankScore DESC;
And another write back trick ...
CALL gds.pageRank.stream('SecondPartyFraudNetwork',{relationshipWeightProperty:'amount'})
YIELD nodeId,score
WITH gds.util.asNode(nodeId) AS client,score AS pageRankScore
WHERE exists(client.secondPartyFraudGroup)
AND pageRankScore >1 AND NOT client:FirstPartyFraudster
MATCH(c:Client {id:client.id})
SET c:SecondPartyFraud
SET c.secondPartyFraudScore=pageRankScore;
Visualize
MATCH p=(:Client:FirstPartyFraudster)-[:TRANSFER_TO]-(c:Client)
WHERE NOT c:FirstPartyFraudster
RETURN p;
Cleanup ... and done ...
CALL gds.graph.list()
YIELD graphName AS namedGraph
WITH namedGraph
CALL gds.graph.drop(namedGraph)
YIELD graphName
RETURN graphName;