12.3 BN Scalability Lessons – Wu et al. 2018

This page summarizes key insights from Wu et al. (2018) on Dynamic Bayesian Networks (DBNs) for spoofing detection, and outlines how Kor.ai can apply these principles when designing Bayesian models for market abuse risk.

⚠️ Scalability Concerns in Wu et al. (2018)

| Concern | Description |
| --- | --- |
| High Dimensionality | Modeling raw trade/order data at high frequency leads to explosive node counts across time slices. |
| Large CPTs | Nodes with many parents produce large Conditional Probability Tables (CPTs), which grow exponentially with parent count and are computationally expensive. |
| Real-Time Constraints | Exact inference in large DBNs is too slow for sub-second alerting. |
| Overfitting with Sparse Data | Few labeled abuse cases make parameter learning unstable. |
| Temporal Slice Depth | Longer temporal horizons multiply the node count, so time-slice history must be limited. |
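
The growth behind the first two concerns can be checked with simple arithmetic. The following is a minimal sketch, assuming discrete nodes and full (non-canonical) CPTs; the function names are illustrative and do not come from Wu et al.

```python
from math import prod


def cpt_entries(child_states: int, parent_states: list[int]) -> int:
    """Number of probabilities in a full CPT: child states times the product of parent states."""
    return child_states * prod(parent_states)


def unrolled_nodes(nodes_per_slice: int, slices: int) -> int:
    """Node count after unrolling a DBN over a fixed number of time slices."""
    return nodes_per_slice * slices


if __name__ == "__main__":
    # A binary child with k binary parents: the table doubles with each added parent.
    for k in (2, 4, 8, 16):
        print(f"{k} parents -> {cpt_entries(2, [2] * k)} CPT entries")
    # 20 nodes per slice unrolled over 50 slices.
    print(unrolled_nodes(20, 50), "nodes after unrolling")
```

With 4 binary parents the table has 32 entries; with 16 it has 131072, which is why fan-in and slice depth both need a cap.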

✅ Kor.ai Design Principles Based on Wu et al.

1. Event Abstraction Instead of Raw Tick Data

Avoid modeling every millisecond/tick event.
Create behavioral abstraction nodes, e.g.:
- AggressiveOrder
- CancelSurge
- PriceImpactCluster
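
One way to derive such a node is to collapse a window of raw order events into a single boolean. The sketch below is only an illustration for `CancelSurge`: the `OrderEvent` fields, the window length, and both thresholds are assumptions chosen for the example, not values from Wu et al. or from Kor.ai.

```python
from dataclasses import dataclass


@dataclass
class OrderEvent:
    ts_ms: int  # event timestamp in milliseconds
    kind: str   # "place", "cancel", or "trade"
    qty: int


def cancel_surge(
    events: list[OrderEvent],
    window_ms: int = 1000,          # hypothetical window length
    min_cancels: int = 3,           # hypothetical burst size
    min_cancel_ratio: float = 0.7,  # hypothetical share of cancels in the window
) -> bool:
    """Return True if some sliding window of `window_ms` contains a burst of cancels."""
    ordered = sorted(events, key=lambda e: e.ts_ms)
    start = 0
    for end in range(len(ordered)):
        # Shrink the window from the left until it spans at most window_ms.
        while ordered[end].ts_ms - ordered[start].ts_ms > window_ms:
            start += 1
        window = ordered[start:end + 1]
        cancels = sum(1 for e in window if e.kind == "cancel")
        if cancels >= min_cancels and cancels / len(window) >= min_cancel_ratio:
            return True
    return False
```

The BN then sees one `CancelSurge` node per window instead of one node per tick.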

2. Limit Node Fan-In Using Latent Intermediates

Use intermediate nodes like:
- OrderAggressiveness
- AbuseLikelihood
- MarketVolatilityLevel
Cap each node at 3–4 parents, since CPT size grows exponentially with the number of parents.
Use noisy-OR or other canonical forms to simplify conditional probability definitions.
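
As a rough illustration of the canonical-form idea, a noisy-OR node needs one link probability per parent plus an optional leak term, and the full CPT can be expanded from those parameters on demand. The parent ordering and probabilities below are made-up values for the example.

```python
from itertools import product


def noisy_or_cpt(link_probs: list[float], leak: float = 0.0) -> dict[tuple[int, ...], float]:
    """Expand noisy-OR parameters into P(child = 1 | parent configuration).

    link_probs[i] is the probability that parent i alone turns the child on;
    leak is the probability that the child is on with no active parent.
    """
    cpt = {}
    for config in product((0, 1), repeat=len(link_probs)):
        p_off = 1.0 - leak
        for active, p in zip(config, link_probs):
            if active:
                p_off *= 1.0 - p  # each active parent independently fails to trigger
        cpt[config] = 1.0 - p_off
    return cpt


if __name__ == "__main__":
    # Hypothetical parents of AbuseLikelihood: AggressiveOrder, CancelSurge, PriceImpactCluster.
    cpt = noisy_or_cpt([0.6, 0.5, 0.3], leak=0.01)
    for config, p in cpt.items():
        print(config, round(p, 3))
```

Three parents need 4 numbers (3 links plus the leak) instead of 8 free CPT rows, and the gap widens quickly as parents are added.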

3. Tiered Inference Strategy

Use Tier 1 rules and summary statistics to pre-filter high-risk events.
Run Tier 2 BN inference only for flagged candidates.
Precompute partial inference graphs where possible.

4. Temporal Granularity Optimization

Use 3–5 behavior steps for spoofing (e.g., Place → Cancel → Trade).
Use 1–3 day slices for insider trading.
Represent transitions between behavioral states, not absolute time intervals.
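
Indexing slices by behavioral step rather than clock time can be sketched with standard forward filtering over a hidden intent state. The state names, probabilities, and the `max_steps` cap below are illustrative assumptions, not parameters from Wu et al.

```python
Belief = dict[str, float]


def forward_filter(
    observations: list[str],
    prior: Belief,
    transition: dict[str, Belief],
    emission: dict[str, Belief],
    max_steps: int = 5,  # hypothetical cap on behavioral history
) -> list[Belief]:
    """Return P(hidden state | observations so far) after each behavioral step.

    Each slice is one behavioral event (place, cancel, trade), not a time interval.
    Only the most recent `max_steps` observations are used.
    """
    recent = observations[-max_steps:]
    belief = dict(prior)
    history = []
    for step, obs in enumerate(recent):
        if step > 0:
            # Predict: push the belief through the state-transition model.
            belief = {
                s: sum(belief[r] * transition[r][s] for r in belief)
                for s in belief
            }
        # Update: weight by how well each state explains the observed event.
        unnorm = {s: belief[s] * emission[s][obs] for s in belief}
        total = sum(unnorm.values())
        belief = {s: v / total for s, v in unnorm.items()}
        history.append(belief)
    return history


if __name__ == "__main__":
    prior = {"benign": 0.95, "manip": 0.05}
    transition = {
        "benign": {"benign": 0.95, "manip": 0.05},
        "manip": {"benign": 0.10, "manip": 0.90},
    }
    emission = {
        "benign": {"place": 0.5, "cancel": 0.2, "trade": 0.3},
        "manip": {"place": 0.4, "cancel": 0.5, "trade": 0.1},
    }
    for belief in forward_filter(["place", "cancel", "cancel"], prior, transition, emission):
        print({s: round(p, 3) for s, p in belief.items()})
```

Because the unrolled network never exceeds `max_steps` slices, inference cost stays bounded regardless of how long the order stream runs.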

5. Robust Parameterization Under Sparse Labels

Use:
- Expert priors for key nodes (IntentToManipulate, AccessToMNPI).
- Synthetic labeled abuse cases for model testing.
- Unsupervised signals (e.g., Z-scores) as BN input nodes.
Apply Dirichlet priors (pseudo-counts) to keep CPT estimates stable.
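
The stabilizing effect of pseudo-counts can be seen on a single CPT row. This is a minimal sketch of the posterior-mean estimate under a Dirichlet prior; the state names and pseudo-count values are invented for the example and would in practice come from expert elicitation.

```python
from collections import Counter


def estimate_cpt_row(
    observed_states: list[str],
    states: list[str],
    pseudo_counts: dict[str, float],
) -> dict[str, float]:
    """Posterior-mean estimate of one CPT row under a Dirichlet prior.

    pseudo_counts[s] acts as the number of imaginary prior observations of state s.
    """
    counts = Counter(observed_states)
    total = sum(counts[s] + pseudo_counts[s] for s in states)
    return {s: (counts[s] + pseudo_counts[s]) / total for s in states}


if __name__ == "__main__":
    # Only three labeled cases, all "low": a plain frequency estimate would give P(high) = 0.
    row = estimate_cpt_row(["low", "low", "low"], ["low", "high"], {"low": 8, "high": 2})
    print(row)
```

With the prior in place, the unseen state keeps a small nonzero probability (2/13 here) instead of collapsing to zero after a handful of observations.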

🧠 Summary Table

| Design Element | Recommendation |
| --- | --- |
| Node Design | Use behavioral abstractions, not raw features. |
| CPT Construction | Apply latent layers + canonical forms. |
| Time Slice Depth | Max 3–5 behavior events (spoofing) or 1–3 days (insider). |
| Inference Strategy | Tiered (pre-filter → BN) or batch mode. |
| Data Strategy | Combine SME priors + synthetic examples + weak labels. |

📚 Reference

Wu, Y., Wang, H., Zhang, J., & Yu, P. S. (2018). “Detecting Spoofing Trades Using Dynamic Bayesian Networks.” IEEE International Conference on Machine Learning and Applications (ICMLA).