HPC AI Sparse Matrix System - yizhihenpidehou/yzhpdh-s-bookcase GitHub Wiki

Sparse matrix analysis

  1. Jiajia Li, Guangming Tan, Mingyu Chen, and Ninghui Sun. 2013. SMAT: an input adaptive auto-tuner for sparse matrix-vector multiplication. SIGPLAN Not. 48, 6 (June 2013), 117–126. https://doi.org/10.1145/2499370.2462181 [Users only need to prepare their sparse matrices in CSR format as the input, and SMAT automatically determines the optimal storage format and implementation on a given architecture; see the CSR sketch after this list.] PDF

  2. Anik, Md Saidul Hoque, et al. "iSpLib: A Library for Accelerating Graph Neural Networks using Auto-tuned Sparse Operations." arXiv preprint arXiv:2403.14853 (2024). [Predicts the optimal embedding size for the current environment.]

  3. Zhao, Yue, et al. "Bridging the gap between deep learning and sparse matrix format selection." Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2018. [Sparse matrix representation as CNN input + a CNN model + transfer learning for the cross-architecture case; the density-feature idea is also illustrated in the sketch after this list.]

  • Automatic Selection of Sparse Matrix Representation on GPUs (2015 ICS) PDF
  • SMAT: An Input Adaptive Auto-tuner for Sparse Matrix-vector Multiplication (2013 PLDI)
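
To make the two recurring ideas above concrete — CSR as the canonical tuner input (item 1) and an image-like density feature for learned format selection (item 3) — here is a minimal SciPy sketch. The matrix values and the 2×2 feature grid are made up for illustration; this is not code from either paper.

```python
# Minimal sketch: a CSR matrix as the input an SMAT-style auto-tuner expects,
# plus a coarse density grid of the kind a CNN-based format selector could
# consume. All values are illustrative.
import numpy as np
import scipy.sparse as sp

dense = np.array([[4.0, 0.0, 0.0, 1.0],
                  [0.0, 3.0, 0.0, 0.0],
                  [2.0, 0.0, 5.0, 0.0],
                  [0.0, 0.0, 0.0, 6.0]])
A = sp.csr_matrix(dense)        # CSR = row pointers + column indices + values
print(A.indptr, A.indices, A.data)

def density_histogram(mat, bins=2):
    """Collapse a sparse matrix into a fixed-size (bins x bins) grid of
    non-zero counts -- an image-like feature for a format classifier."""
    coo = mat.tocoo()
    rows = np.minimum(coo.row * bins // mat.shape[0], bins - 1)
    cols = np.minimum(coo.col * bins // mat.shape[1], bins - 1)
    hist = np.zeros((bins, bins), dtype=np.int64)
    np.add.at(hist, (rows, cols), 1)
    return hist

print(density_histogram(A))
```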

System

1. Kyrola, Aapo, Guy Blelloch, and Carlos Guestrin. "GraphChi: Large-scale graph computation on just a PC." 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). 2012. PDF

[Single machine / motivation from the underlying problem / shortcomings of alternative approaches / analysis of the I/O costs / lines of code, i.e., complex functionality in relatively little code / main execution flow / applications (what functionality is actually implemented)] The notes below describe Pregel, a system for large-scale graph computation.

Pregel adopts a vertex-centric computation model: a program executes as a sequence of iterations called supersteps, and in each superstep every vertex runs the same user-defined function, which can receive messages sent in the previous superstep, send messages to other vertices, and modify its own state and out-edges. This synchronous model makes it easier to reason about program correctness. Pregel provides a C++ API that lets users define the types of vertex values, edge values, and message values and implement custom computation logic; the API also offers aggregators, combiners, and similar mechanisms to improve performance (the purpose of each module is introduced). The implementation is optimized for large-scale distributed environments, with partitioning, fault tolerance, and related mechanisms, and runs efficiently on clusters of thousands of commodity machines.
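
The superstep model described above can be condensed into a short sketch. This is a minimal single-process simulation of single-source shortest paths in the vertex-centric style, not Pregel's actual C++ API; the toy graph, vertex names, and seeding of the source are made up for illustration.

```python
# Toy simulation of the Pregel-style model: every active vertex runs the same
# compute step each superstep, reads messages from the previous superstep,
# sends messages along its out-edges, and votes to halt. A vertex is
# reactivated when it receives a message.
import math

graph = {            # vertex -> list of (neighbor, edge weight)
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2)],
    "c": [],
}
value = {v: math.inf for v in graph}   # per-vertex state: distance from "a"
inbox = {v: [] for v in graph}
inbox["a"] = [0]                       # seed the source vertex

active = set(graph)
while active:
    outbox = {v: [] for v in graph}
    for v in list(active):
        new_dist = min([value[v]] + inbox[v])    # combiner: min over messages
        if new_dist < value[v]:
            value[v] = new_dist
            for nbr, w in graph[v]:              # send along out-edges
                outbox[nbr].append(new_dist + w)
        active.discard(v)                        # vote to halt
    inbox = outbox
    active |= {v for v, msgs in inbox.items() if msgs}

print(value)   # {'a': 0, 'b': 1, 'c': 3}
```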

Concrete implementation

  • The paper presents several graph algorithms implemented on Pregel, such as PageRank and shortest paths, along with performance results. (applications)
  • Compared with MapReduce and similar frameworks, Pregel's computation model is better suited to graph processing: graph algorithms are easier to express and run faster. (performance)

2. Gonzalez, Joseph E., et al. "PowerGraph: distributed Graph-Parallel computation on natural graphs." 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). 2012. PDF

[Distributed] The paper introduces PowerGraph, a new graph-parallel abstraction for large-scale computation on natural graphs. Natural graphs typically have highly skewed, power-law degree distributions, which challenge existing graph-parallel abstractions such as Pregel and GraphLab and limit their performance and scalability.

PowerGraph is designed around the characteristics of power-law graphs:

  • Vertex programs are explicitly decomposed into gather, sum, apply, and scatter phases, which better exploits the graph structure and increases parallelism (a minimal sketch of this decomposition follows below).
  • A delta (incremental) caching mechanism dynamically maintains computation state and avoids redundant recomputation.
  • A new distributed graph data layout efficiently represents and stores power-law graphs.
  • Theoretical analysis and experimental evaluation show that, compared with existing systems, PowerGraph significantly reduces network communication and storage overhead and improves algorithm convergence.

The paper presents three implementation strategies for PowerGraph, evaluates them on large EC2 clusters, and demonstrates large performance gains on real applications.
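
As referenced in the list above, here is a rough sketch of the gather / sum / apply / scatter decomposition, expressed sequentially in Python on a toy PageRank example. It only illustrates the programming abstraction; PowerGraph itself runs these phases in parallel over distributed partitions, which the sketch does not model.

```python
# Gather/sum/apply/scatter decomposition of PageRank on a made-up toy graph.
graph = {          # vertex -> list of out-neighbors
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
in_edges = {v: [u for u, outs in graph.items() if v in outs] for v in graph}
rank = {v: 1.0 / len(graph) for v in graph}
damping = 0.85

def gather(u, v):
    """Contribution carried by the in-edge (u -> v)."""
    return rank[u] / len(graph[u])

def apply(v, acc):
    """Fold the accumulated gather result into the new vertex value."""
    return (1 - damping) / len(graph) + damping * acc

for _ in range(20):
    new_rank = {}
    for v in graph:
        acc = sum(gather(u, v) for u in in_edges[v])   # gather + sum
        new_rank[v] = apply(v, acc)                    # apply
    rank = new_rank    # "scatter": publish new values; a real scatter would
                       # also (re)activate neighbors for the next iteration

print(rank)
```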

3. Nguyen, Donald, Andrew Lenharth, and Keshav Pingali. "A lightweight infrastructure for graph analytics." Proceedings of the twenty-fourth ACM symposium on operating systems principles. 2013. PDF

Single machine, multi-threaded

Background:

  • Several domain-specific languages (DSLs) for parallel graph analytics have been proposed recently.
  • Improvements to the Galois system.

4. Ching, Avery, et al. "One trillion edges: Graph processing at Facebook-scale." Proceedings of the VLDB Endowment 8.12 (2015): 1804-1815. PDF

The paper describes the system optimizations and extensions Facebook made to Apache Giraph in order to process extremely large graphs (up to one trillion edges). Social networking and web companies need to analyze large-scale graph data to gain insights for content ranking and recommendation, but existing graph processing systems had typically only been evaluated on benchmark graphs of at most 6.6 billion edges, two orders of magnitude smaller than real industrial graphs.

The paper describes the following platform improvements to Apache Giraph:

  • A flexible input model that loads vertex data and edge data from separate data sources, eliminating preprocessing overhead.
  • Multithreaded execution, improving computation efficiency.
  • Optimized memory usage, resolving out-of-memory and garbage-collection problems.
  • Extensions to the Pregel model, such as master computation and sharded aggregators, supporting a broader range of production-grade graph applications.
  • Large-scale graph applications running in Facebook's production environment, such as PageRank and k-means, with performance results that demonstrate the effect of these optimizations.

The paper closes with the authors' operational experience processing very large graphs with Giraph, as a reference for others doing similar work.

Background:

  • Analyzing large graphs provides valuable insights for social networking and web companies in content ranking and recommendations.
  • Graph analysis is an emerging and important application area.

Challenges: scalability, and the fact that real-world applications often require much more complex graph processing workflows than previously evaluated.

Contributions:

  • Usability, performance, and scalability improvements to Apache Giraph that enable trillion-edge graph computations.
  • Extensions to the Pregel model and why they proved useful for Facebook's graph applications.
  • Real-world applications and their performance on the Facebook graph.
  • Operational experience running large-scale production applications on existing MapReduce infrastructure.
  • Contribution of production code (including all extensions described in the paper) to the open-source Apache Giraph project.

5. Han, Wook-Shin, et al. "TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC." Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2013. PDF

Single machine

Background:

  • Graphs are used to model many real objects such as social networks, web graphs, chemical compounds, and biological structures.
  • The size of real graphs is very large. For example, Facebook reached one billion users on Oct. 4, 2012.

6. Cheng, Jiefeng, et al. "VENUS: Vertex-centric streamlined graph computation on a single PC." 2015 IEEE 31st International Conference on Data Engineering. IEEE, 2015. PDF

Single machine. Background: Recent studies show that disk-based graph computation on just a single PC can be highly competitive with cluster-based computing systems on large-scale problems.

7. Malewicz, Grzegorz, et al. "Pregel: a system for large-scale graph processing." Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 2010. PDF

Background: Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs, in some cases billions of vertices and trillions of edges, poses challenges to their efficient processing.

8. P. Faldu, J. Diamond and B. Grot, "Domain-Specialized Cache Management for Graph Analytics," 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), San Diego, CA, USA, 2020, pp. 234-248, doi: 10.1109/HPCA47549.2020.00028. PDF code

A common property of graphs used in the domain of graph analytics is a power-law distribution of vertex connectivity, wherein a small number of vertices are responsible for a high fraction of all connections in the graph. [power-law property of graphs]

In response, we propose GRASP, domain-specialized cache management at the last-level cache for graph analytics. GRASP augments existing cache policies to maximize reuse of hot vertices by protecting them against cache thrashing, while maintaining sufficient flexibility to capture the reuse of other vertices as needed. [augments existing cache policies]

GRASP keeps hardware cost negligible by leveraging lightweight software support to pinpoint hot vertices. [lightweight]

GRASP augments existing cache insertion and hit-promotion policies to provide preferential treatment to cache blocks containing hot vertices to shield them from thrashing. To cater to the irregular access patterns, GRASP policies are designed to be flexible to cache other blocks exhibiting reuse. Unlike pinning, GRASP maximizes cache efficiency based on observed access patterns.
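
The insertion and hit-promotion idea can be illustrated with a toy software model of a single cache set; the hot-vertex address range, the 4-way set, and the priority values below are invented for illustration and are not the paper's hardware parameters.

```python
# Toy model (not the hardware design): blocks whose address falls in a
# software-identified hot-vertex region are inserted with high protection and
# promoted to the top on a hit, while other blocks are inserted near eviction
# but can still earn some protection when reused.
HOT_REGION = range(0, 64)     # hot-vertex addresses (e.g. after vertex reordering)
WAYS = 4                      # one 4-way cache set
cache = {}                    # addr -> priority (0 = evict first, 3 = most protected)

def access(addr):
    if addr in cache:                              # hit: promote
        cache[addr] = 3 if addr in HOT_REGION else min(cache[addr] + 1, 2)
        return "hit"
    if len(cache) >= WAYS:                         # miss: evict lowest priority
        victim = min(cache, key=cache.get)
        del cache[victim]
    cache[addr] = 2 if addr in HOT_REGION else 0   # insertion policy
    return "miss"

# A streaming scan (addresses 100..199) cannot displace the hot vertices 1 and 2.
for addr in [1, 2] + list(range(100, 200)) + [1, 2]:
    access(addr)
print(sorted(cache))   # hot blocks 1 and 2 survive the scan
```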

9. Hu, Yuwei, et al. "Featgraph: A flexible and efficient backend for graph neural network systems." SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020. PDF

10. Yan, Mingyu, et al. "Alleviating irregularity in graph analytics acceleration: A hardware/software co-design approach." Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 2019. PDF

11. Liu, Hang, and H. Howie Huang. "Graphene: Fine-Grained IO Management for Graph Computing." 15th USENIX Conference on File and Storage Technologies (FAST 17). 2017. PDF code

12. Pandey, Prashant, et al. "Terrace: A hierarchical graph container for skewed dynamic graphs." Proceedings of the 2021 international conference on management of data. 2021. PDF

 

13. Basak, Abanti, et al. "Analysis and optimization of the memory hierarchy for graph processing workloads." 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019. PPT

 

14. Malicevic, Jasmina, Baptiste Lepers, and Willy Zwaenepoel. "Everything you always wanted to know about multicore graph processing but were afraid to ask." 2017 USENIX Annual Technical Conference (USENIX ATC 17). 2017. PDF

15. Vora, Keval. "LUMOS: Dependency-Driven disk-based graph processing." 2019 USENIX Annual Technical Conference (USENIX ATC 19). 2019. PDF

16. Zhou, Yangjie, et al. "uGrapher: High-performance graph operator computation via unified abstraction for graph neural networks." Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 2023. PDF

Dataset

Timothy A. Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38, 1, Article 1 (November 2011), 25 pages. https://doi.org/10.1145/2049662.2049663 [PDF](https://dl.acm.org/doi/abs/10.1145/2049662.2049663)

Python interface: https://sparse.tamu.edu/interfaces
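
A minimal sketch of fetching one matrix from the collection and loading it with SciPy. The Matrix Market download URL pattern and the example matrix HB/west0479 are assumptions based on the collection's website; the interfaces page above also lists dedicated Python clients (e.g., ssgetpy) that wrap this kind of download.

```python
# Download one SuiteSparse matrix (Matrix Market format) and load it as CSR.
# URL pattern https://sparse.tamu.edu/MM/<group>/<name>.tar.gz is assumed.
import io
import tarfile
import urllib.request

import scipy.io
import scipy.sparse as sp

group, name = "HB", "west0479"
url = f"https://sparse.tamu.edu/MM/{group}/{name}.tar.gz"

with urllib.request.urlopen(url) as resp:
    archive = tarfile.open(fileobj=io.BytesIO(resp.read()), mode="r:gz")
with archive.extractfile(f"{name}/{name}.mtx") as f:
    A = sp.csr_matrix(scipy.io.mmread(f))     # Matrix Market -> CSR

print(A.shape, A.nnz)
```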