[25.04.03] KAN: Kolmogorov-Arnold Networks

Paper Reading Study Notes

General Information

  • Paper Title: KAN: KOLMOGOROV-ARNOLD NETWORKS
  • Authors: Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, Max Tegmark
  • Published In: arXiv preprint (arXiv:2404.19756)
  • Year: 2024
  • Link: https://arxiv.org/abs/2404.19756
  • Date of Discussion: 2025.04.03

Summary

  • Research Problem: Traditional Multi-Layer Perceptrons (MLPs) act as "black boxes," lacking interpretability, which hinders their use in scientific discovery where understanding the underlying function is crucial.
  • Key Contributions: The paper proposes Kolmogorov-Arnold Networks (KANs) as an interpretable alternative to MLPs. Inspired by the Kolmogorov-Arnold representation theorem, KANs place learnable activation functions on the edges of the network graph rather than fixed activations on the nodes as in MLPs. The authors claim KANs can be more accurate, more parameter-efficient, and more interpretable, potentially rediscovering mathematical and physical laws.
  • Methodology/Approach: KANs parameterize the learnable 1D activation functions on edges with B-splines; nodes simply sum their incoming signals (a minimal sketch of one such layer follows this list). Grid updating/extension is used to improve accuracy, while sparsity regularization, pruning, and symbolification (fitting the learned splines to known functions such as sin, exp, x^2) are used to enhance interpretability and extract symbolic formulas.
  • Results: The paper demonstrates KANs achieving better accuracy and more favorable scaling than MLPs on function-fitting and PDE-solving tasks. Interpretability is shown through examples from knot theory and physics. A major acknowledged limitation is slower training compared to MLPs.
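
To make the edge-centric construction concrete, below is a minimal PyTorch sketch of a KAN layer as I understand it from the paper: each (input, output) edge carries its own learnable 1D function of the form w_b·silu(x) + spline(x), the spline coefficients are trainable, and output nodes simply sum their incoming edges, mirroring f(x) = Σ_q Φ_q(Σ_p φ_{q,p}(x_p)). Names such as `KANLayer` and `bspline_basis` are my own; grid updating, regularization, and the authors' exact initialization are omitted, so treat this as an illustrative sketch rather than the pykan implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def bspline_basis(x, grid, k):
    """Order-k B-spline bases evaluated at x via the Cox-de Boor recursion.

    x:    (batch, in_dim) inputs
    grid: (in_dim, G + 2k + 1) knot positions per input dimension
    returns (batch, in_dim, G + k) basis values
    """
    x = x.unsqueeze(-1)                                    # (batch, in_dim, 1)
    # Degree-0 bases: indicator of each knot interval.
    bases = ((x >= grid[:, :-1]) & (x < grid[:, 1:])).to(x.dtype)
    for p in range(1, k + 1):
        left = (x - grid[:, : -(p + 1)]) / (grid[:, p:-1] - grid[:, : -(p + 1)])
        right = (grid[:, p + 1:] - x) / (grid[:, p + 1:] - grid[:, 1:-p])
        bases = left * bases[..., :-1] + right * bases[..., 1:]
    return bases


class KANLayer(nn.Module):
    """One KAN layer: a learnable 1D function on every (input, output) edge.

    Each edge computes w_b * silu(x) + w_s * spline(x); output nodes just
    sum their incoming edges (no further nonlinearity on nodes).
    """

    def __init__(self, in_dim, out_dim, grid_size=5, k=3, x_range=(-1.0, 1.0)):
        super().__init__()
        self.k = k
        # Uniform knot grid, extended by k knots on each side.
        h = (x_range[1] - x_range[0]) / grid_size
        knots = torch.arange(-k, grid_size + k + 1) * h + x_range[0]
        self.register_buffer("grid", knots.expand(in_dim, -1).contiguous())
        # One set of (grid_size + k) spline coefficients per edge.
        self.spline_coef = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, grid_size + k))
        self.spline_weight = nn.Parameter(torch.ones(out_dim, in_dim))
        self.base_weight = nn.Parameter(torch.randn(out_dim, in_dim) / in_dim ** 0.5)

    def forward(self, x):                                  # x: (batch, in_dim)
        basis = bspline_basis(x, self.grid, self.k)        # (batch, in_dim, G+k)
        coef = self.spline_weight.unsqueeze(-1) * self.spline_coef
        spline_part = torch.einsum("bic,oic->bo", basis, coef)
        base_part = F.silu(x) @ self.base_weight.t()       # residual "base" branch
        return base_part + spline_part


class KAN(nn.Module):
    """Stack of KAN layers, e.g. KAN([2, 5, 1]) for a 2-input scalar function."""

    def __init__(self, widths, **kwargs):
        super().__init__()
        self.layers = nn.ModuleList(
            KANLayer(a, b, **kwargs) for a, b in zip(widths[:-1], widths[1:]))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    # Toy regression on the paper's running example f(x, y) = exp(sin(pi*x) + y^2).
    torch.manual_seed(0)
    x = torch.rand(256, 2) * 2 - 1
    y = torch.exp(torch.sin(torch.pi * x[:, :1]) + x[:, 1:] ** 2)
    model = KAN([2, 5, 1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for step in range(500):
        loss = ((model(x) - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"final MSE: {loss.item():.4f}")
```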

Discussion Points

  • Strengths:

    • Novelty & Interpretability: The core concept (Kolmogorov-Arnold theorem inspiration, edge-based learnable activations) was found highly novel and promising for interpretability (0:35, 3:06). The potential to extract symbolic formulas is a key advantage, especially for scientific applications (27:47, 51:29).
    • Accuracy/Efficiency: KANs appeared more accurate and parameter-efficient than MLPs on tasks with underlying structure (function fitting, PDEs) (32:14; knot theory example 28:29).
    • Symbolification Process: The automated process of pruning non-contributing parts and suggesting/fitting symbolic functions to the learned splines was considered particularly impressive ("remarkable") (50:59); a rough sketch of this step appears after the Connections list below.
    • Theoretical Basis: Grounded in the Kolmogorov-Arnold theorem, offering a theoretical foundation different from the Universal Approximation Theorem (UAT) behind MLPs (0:08, Fig 1).
  • Weaknesses:

    • Training Speed: Repeatedly identified as the most significant drawback, potentially hindering practical adoption (34:49, 1:04:19). Attributed to per-edge spline computations, which do not map onto the optimized dense matrix operations that MLPs benefit from on GPUs (34:57, 1:00:23).
    • Theoretical Complexity: The approximation theorem (Thm 2.1) and the term "residue rate" were found confusing, although the implication that KANs might beat the curse of dimensionality was noted (14:41, 22:12, 25:42); a paraphrase of the bound appears after the Connections list below.
    • Stability/Practicality: Concerns about stability with deeper KANs or finer grids in some contexts (PDEs, continual learning) (55:42, Appendix B.4). Requires careful tuning and specific techniques (skip connections, sparsity) to work well (41:00, 45:33).
    • Interpretability Nuance: Discussion on whether KANs truly find the "ground truth" symbolic form or just a very good, structured approximation, especially for complex/image data where no simple formula exists (3:43 - 5:03).
  • Key Questions:

    • Can the training speed bottleneck be effectively addressed through software/hardware optimizations? (1:04:19)
    • How well do KANs scale to truly large datasets and model sizes common in deep learning? (Paper focuses on smaller scale)
    • What are the precise theoretical benefits and potential pathologies of deep KANs compared to the original 2-layer KAT? (2.6, E)
    • How robust are KANs to noise or non-smooth functions compared to MLPs?
  • Applications:

    • Scientific Discovery: Primary focus – identifying symbolic equations from data in physics, mathematics, etc. (27:47, 39:35, Examples U, T).
    • PDE Solving: Demonstrated strong performance (Section 4, Appendix B).
    • Interpretable Regression: Useful in any domain where understanding the learned function is important.
    • Component Replacement: Potential to replace MLP blocks within larger architectures (e.g., Transformers -> "Kansformers") (Application aspects).
  • Connections:

    • Kolmogorov-Arnold Theorem (Direct inspiration).
    • Universal Approximation Theorem (Contrast).
    • Fourier Analysis (Analogy for function decomposition) (1:13).
    • Spline Theory (Implementation mechanism).
    • Interpretable ML (GAMs, Symbolic Regression).
    • Neural Scaling Laws (Offers a potentially faster scaling perspective) (Appendix K).
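
As a concrete picture of the symbolification step highlighted under Strengths (50:59), the sketch below takes samples of one learned edge function and tries to "snap" it to a small library of primitives by fitting c·f(a·x + b) + d and ranking candidates by R². pykan exposes similar functionality (suggest_symbolic / fix_symbolic, if I recall the names correctly); the code here is my own simplified stand-in using scipy.optimize.curve_fit, and the candidate list, initial guesses, and function names are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

# Library of candidate primitives to test against a learned edge function.
CANDIDATES = {
    "x":    lambda x: x,
    "x^2":  lambda x: x ** 2,
    "sin":  np.sin,
    "tanh": np.tanh,
    "exp":  np.exp,
}


def suggest_symbolic(xs, ys, a_inits=(0.5, 1.0, 2.0, 5.0)):
    """Best candidate of the form c * f(a*x + b) + d, ranked by R^2.

    Several initial guesses for the inner scale `a` are tried because the
    affine fit is non-convex (the paper grid-searches the affine params similarly).
    """
    best = (None, None, -np.inf)
    ss_tot = np.sum((ys - ys.mean()) ** 2)
    for name, f in CANDIDATES.items():
        model = lambda x, a, b, c, d, f=f: c * f(a * x + b) + d
        for a0 in a_inits:
            try:
                params, _ = curve_fit(model, xs, ys, p0=[a0, 0.0, 1.0, 0.0], maxfev=5000)
            except RuntimeError:   # this start failed to converge; try the next one
                continue
            r2 = 1.0 - np.sum((ys - model(xs, *params)) ** 2) / ss_tot
            if r2 > best[2]:
                best = (name, params, r2)
    return best


# Example: samples from an edge whose learned shape is really 2*sin(3x) + 1.
xs = np.linspace(-1.0, 1.0, 200)
ys = 2.0 * np.sin(3.0 * xs) + 1.0
name, params, r2 = suggest_symbolic(xs, ys)
print(name, np.round(params, 2), round(r2, 4))   # the "sin" candidate should win with r2 ~ 1
```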
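On the approximation result that caused confusion under Weaknesses (Thm 2.1): as far as I can reconstruct it, the theorem says that if f admits a smooth Kolmogorov-Arnold representation, then replacing each 1D function with an order-k B-spline on a grid of size G gives an error that decays polynomially in G with a constant independent of the input dimension, which is the sense in which KANs might beat the curse of dimensionality. The statement below is paraphrased from memory and should be checked against the paper.

```latex
% Paraphrase of Theorem 2.1 (conditions and constants should be checked against the paper).
% If f = (\Phi_{L-1} \circ \cdots \circ \Phi_0)\, x with sufficiently smooth 1D components,
% then there exist order-k B-spline approximants \Phi^G_\ell on grids of size G such that
\[
  \bigl\| f - (\Phi^G_{L-1} \circ \cdots \circ \Phi^G_0)\, x \bigr\|_{C^m}
    \;\le\; C\, G^{-k-1+m}, \qquad 0 \le m \le k,
\]
% where C depends on f and its representation but not on the input dimension n.
```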

Notes and Reflections

  • Interesting Insights:

    • The architectural shift from node-centric (MLP) to edge-centric (KAN) computation is fundamental (Fig 1).
    • Interpretability can be built into the architecture design, enabling interaction (human collaboration) during model refinement (Fig 25).
    • The combination of learnable splines and sparsity allows the network to discover relevant functional forms and simplify itself (50:09); a rough sketch of the regularizer and pruning criterion follows at the end of this section.
    • The necessity of residual/skip connections (SiLU base) suggests optimization plays a key role alongside representation power (41:00).
  • Lessons Learned:

    • Mathematical theorems can inspire novel and potentially powerful ML architectures.
    • Practical deep learning requires significant engineering beyond theoretical representation guarantees (e.g., optimization, regularization, efficient computation).
    • Interpretability and accuracy are not always mutually exclusive; architectures can be designed to promote both.
  • Future Directions:

    • Efficiency: Overcoming the training speed limitation is paramount.
    • Scaling: Testing KANs on large-scale benchmark datasets.
    • Theory: Deeper understanding of deep KANs, smoothness, and approximation capabilities.
    • Hybrid Models: Exploring combinations of KAN and MLP components.
    • Robustness: Investigating performance on noisy or less structured data.
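
To illustrate the "sparsify, prune, then read off formulas" loop referenced in the insights above (50:09), here is a rough sketch of an L1-plus-entropy regularizer on edge activations and the node-pruning criterion, reusing the `KANLayer` / `bspline_basis` helpers sketched under the Summary. The exact scoring, entropy weighting, and thresholds in pykan differ, so this is illustrative only; the coefficients `lam_l1` / `lam_entropy` are placeholders.

```python
import torch


def edge_scores(layer, x):
    """Mean |spline activation| of every (out, in) edge of one KANLayer over a
    batch x -- roughly the paper's notion of an activation's L1 norm."""
    basis = bspline_basis(x, layer.grid, layer.k)                   # (batch, in, G+k)
    coef = layer.spline_weight.unsqueeze(-1) * layer.spline_coef    # (out, in, G+k)
    return torch.einsum("bic,oic->boi", basis, coef).abs().mean(dim=0)  # (out, in)


def sparsity_penalty(layer, x, lam_l1=1.0, lam_entropy=2.0):
    """L1 + entropy regularizer that pushes most edges toward zero, so the
    network keeps only the functional pieces it actually needs."""
    scores = edge_scores(layer, x)
    p = scores / (scores.sum() + 1e-8)
    entropy = -(p * (p + 1e-8).log()).sum()
    return lam_l1 * scores.sum() + lam_entropy * entropy


def hidden_node_mask(kan, x, threshold=1e-2):
    """For a [in, hidden, out] KAN, keep a hidden node only if its strongest
    incoming AND strongest outgoing edge exceed the threshold."""
    incoming = edge_scores(kan.layers[0], x).max(dim=1).values                  # (hidden,)
    outgoing = edge_scores(kan.layers[1], kan.layers[0](x)).max(dim=0).values   # (hidden,)
    return (incoming > threshold) & (outgoing > threshold)  # bool mask; use it to rebuild a smaller KAN
```

In training, the penalty would simply be added to the data loss for every layer, and pruning is applied afterwards to shrink the network before symbolification, matching the workflow the paper describes.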

Transcript: transcript.txt