Overview of PaddlePaddle 3.0 Official Version

Note: This document was translated by Baidu Translate.

As China's first independently developed industrial-grade deep learning platform, PaddlePaddle has always followed the open-source path to support the intelligent upgrade of industries. PaddlePaddle Framework 3.0 not only carries forward the dynamic-static unification and training-inference integration that characterized the 2.0 series, but also achieves breakthroughs in automatic parallelism, neural network compilation, and higher-order automatic differentiation, providing strong support for technological innovation and industrial application in the era of large models and offering developers a one-stop, high-performance deep learning development experience. Whether for cutting-edge algorithm research or the deployment of industrial-grade large models, PaddlePaddle Framework 3.0 aims to be the tool of choice for developers. The key features are described below:

  • Unified dynamic-static automatic parallelism: This feature significantly reduces the cost of industrial development and training. Users only need to add a small number of tensor sharding annotations on a single device, and the PaddlePaddle framework automatically derives the distributed sharding information and inserts the communication operators needed to guarantee logical correctness. Based on the model structure and cluster information, and combined with optimizations at the memory and scheduling layers, PaddlePaddle can also automatically find the most efficient distributed parallel strategy, dramatically reducing the development cost of hybrid parallel training and letting developers focus on model and algorithm innovation. The automatic parallel architecture has undergone in-depth verification and polishing to better support the pre-training + fine-tuning workflow for common large model scenarios such as dense models, sparse (MoE) models, and multi-modal understanding models. It improves the operator sharding derivation rules and supports converting automatic-parallel training parameters into manual-parallel parameters for downstream inference, making it fully usable and helping users reduce the development cost of parallel programs for large models. To further simplify distributed development, a new paddle.distributed.parallel interface is introduced; built as a wrapper over the distributed tensor annotation syntax, it lets users configure common parallel strategies such as data parallelism, model parallelism, and pipeline parallelism non-intrusively, outside the model definition. In addition, the static graph automatic parallel architecture has been comprehensively upgraded on top of PIR: the underlying basic components, core modules, parallel strategies, and performance optimization strategies are all implemented uniformly on the extended PIR DistDialect, further improving the dynamic-static consistency of automatic parallelism and reaching performance on the Llama series models that matches or even surpasses manual parallelism.
  • Integrated training and inference for large models: Since version 2.0, PaddlePaddle has followed the design philosophy of "dynamic-static unification, training-inference integration", and version 3.0 continues it. Thanks to the unified architecture and interface design, PaddlePaddle fully supports both dynamic and static graph modes and offers excellent whole-graph export capability: the success rate of whole-graph dynamic-to-static export in PaddlePaddle reaches 95%, versus 62% for PyTorch. "Integrated training and inference" means that training and inference code, especially the model definition code, can be reused within the same framework; after model development and training are complete, only a small amount of additional work is needed for fast inference deployment. This provides a streamlined development experience for industry, reusing training and inference capabilities and delivering a unified development experience and high training efficiency across the whole large model pipeline. Through dynamic-to-static conversion, training and inference tasks connect seamlessly. Multiple mainstream large models are supported, and the full version of DeepSeek-R1 can be deployed on a single machine with doubled throughput.
  • Higher-order automatic differentiation for scientific computing: PaddlePaddle Framework 3.0 provides higher-order automatic differentiation, compilation optimization, and distributed training capabilities for scientific computing. Experiments on 41 different equations in NVIDIA Modulus show that PaddlePaddle solves differential equations on average 115% faster than PyTorch with compiler optimization enabled (see the sketch after this list for the basic higher-order differentiation primitive). PaddlePaddle has also built the PaddleScience toolkit for general scientific problems and the PaddleHelix toolkit focused on biological computing. In addition, PaddlePaddle Framework 3.0 natively supports complex-number computation, which is important for data feature analysis in scenarios such as weather forecasting and aerodynamic analysis of automobiles and aircraft.
  • Neural network compiler: This feature significantly reduces the cost of performance optimization. PaddlePaddle's compiler is designed as an integral part of the framework and supports efficient training and variable-shape inference for generative models, scientific computing models, and more, striking a good balance between computational flexibility and high performance. With the CINN compiler enabled, over 60% of models show significant performance improvements, with an average gain of 27.4%. The CINN neural network compiler has been comprehensively improved in both completeness and performance. In this version we optimized both the front end and the back end of the compiler: adding an automatic re-compute mechanism for backward computation graphs, optimizing front-end Pass performance, upgrading the symbolic derivation mechanism, optimizing operator fusion strategies, and enhancing the back-end Schedule strategies and subscript expression simplification. We also investigated and fixed a large number of correctness and performance issues, systematically improving the compiler's general optimization capability.
  • Heterogeneous multi-chip adaptation: One of PaddlePaddle's key features is its ability to adapt to heterogeneous multi-chip environments and fully exploit hardware potential. On the access side, PaddlePaddle provides simple, efficient abstract interfaces and a basic operator system, reducing adaptation cost. On the execution side, it optimizes scheduling and storage-sharing mechanisms to improve scheduling efficiency. At the operator kernel level, it offers a compiler-based automatic fusion and tuning solution to improve end-to-end performance. PaddlePaddle has also established R&D infrastructure for new hardware vendors, including code integration, continuous integration, and model regression testing; these mechanisms bring new hardware into PaddlePaddle's regular release system, so users can install and try it directly without compiling from source. PaddlePaddle's comprehensive functionality and low-cost access mechanism have attracted hardware vendors to contribute a total of 4001 pull requests (PRs), encompassing 26584 commits.
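
As a concrete reference for the higher-order differentiation mentioned in the scientific computing bullet, below is a minimal sketch using nested paddle.grad calls with create_graph=True; the toy function and shapes are illustrative assumptions, not taken from the release notes.

```python
import paddle

# A toy "solution" u(x); we differentiate it twice with nested paddle.grad calls.
x = paddle.linspace(0.0, 1.0, 16)
x.stop_gradient = False

u = paddle.sin(x) * paddle.exp(x)

# First derivative du/dx; create_graph=True keeps the graph so the result
# can be differentiated again.
du_dx = paddle.grad(u, x, create_graph=True)[0]
# Second derivative d2u/dx2, as needed for second-order PDE residuals.
d2u_dx2 = paddle.grad(du_dx, x, create_graph=True)[0]

print(d2u_dx2.shape)  # [16]
```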

In addition to the core features above, to enhance the extensibility of the PaddlePaddle framework we have developed the highly extensible Paddle Intermediate Representation (PIR), which systematically abstracts the underlying core concepts and provides flexible, efficient components. As an infrastructure, PIR supports technologies such as dynamic-to-static conversion, automatic differentiation, automatic parallelism, combined operators, and graph optimization, and is widely used in distributed training, model compression, and inference deployment scenarios. The Declarative Rewrite Rule (DRR) mechanism provided by PIR reduces the development cost of a Pass by 60%. PIR has been verified in all scenarios and is enabled by default, supporting one-click dynamic-to-static conversion while ensuring excellent performance and good extensibility of the framework. Beyond these new features, the existing capabilities of the 2.0 framework have been continuously improved, bringing significant gains in user experience, performance, ease of secondary development, and hardware adaptability. This release further enriches and enhances the APIs to cover more scenarios, optimizes distributed parallel strategies and inference functionality for large model scenarios, makes thorough usability improvements in compilation and installation (with updated installation methods and synchronized upgrades of dependent packages), comprehensively hardens system security, and corrects errors throughout the product documentation. A large amount of obsolete code has also been cleaned up to keep the architecture lean.

Incompatible Upgrades

PaddlePaddle APIs support implicit type promotion. For the most common computations such as addition, subtraction, multiplication, and division, when the two inputs have different data types, the output data type must be determined. Historically, PaddlePaddle supported implicit type promotion only partially, and the actual rules were unclear. In practice this manifested as inconsistencies between dynamic and static graphs, inconsistencies between APIs and operator overloading, and violations of commutativity; in particular, with the widespread use of mixed bf16/fp16 and fp32 computation in large models, unexpected problems were prone to occur and hard to locate. Starting from the 3.0 beta, PaddlePaddle has clarified its implicit data type promotion rules, defining in detail the result types for Tensor-Tensor and Tensor-Scalar computations, ensuring that computation obeys commutativity, that operator overloading is consistent with the corresponding binary API, and that dynamic and static graphs produce consistent results. This matches user expectations and industry conventions. #60638, #63842, #60011
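
As an illustration of the promotion rules described above, the following minimal sketch shows Tensor-Tensor and Tensor-Scalar cases; the expected dtypes in the comments reflect our reading of the rules and should be treated as assumptions to verify.

```python
import paddle

a = paddle.ones([2, 2], dtype="float16")
b = paddle.ones([2, 2], dtype="float32")

# Tensor + Tensor: the result is promoted to the wider floating type.
print((a + b).dtype)            # paddle.float32
print((b + a).dtype)            # paddle.float32 -- commutativity holds

# Operator overloading and the binary API now agree.
print(paddle.add(a, b).dtype)   # paddle.float32

# Tensor + Python scalar keeps the tensor's dtype.
print((a + 1.0).dtype)          # paddle.float16
```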

Discontinued Features

Support for 0-dimensional Tensors has been stable for two releases. In this version, the switch FLAGS_set_to_1d, which in some cases converted a 0-dimensional Tensor into a 1-dimensional Tensor with a single element, has been removed. This switch existed only to accommodate incorrect code in some suites that represented a 0-dimensional Tensor as a 1-dimensional Tensor containing one element. PaddlePaddle now fully distinguishes the semantics of a 0-dimensional Tensor from those of a 1-dimensional Tensor containing a single element; the two are not equivalent. #61227
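
A minimal sketch of the distinction, using paddle.to_tensor and a full reduction; the shapes in the comments follow the 0-D semantics described above.

```python
import paddle

scalar = paddle.to_tensor(3.14)     # 0-dimensional Tensor
vector = paddle.to_tensor([3.14])   # 1-dimensional Tensor with one element

print(scalar.shape)                 # []
print(vector.shape)                 # [1]

# A reduction over all elements returns a true 0-D Tensor.
s = paddle.ones([2, 3]).sum()
print(s.shape)                      # []
```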

1. User experience upgrade

New Features

API Function Enhancement

Bug Fixes

Document optimization

2. Basic execution architecture

PIR has been fully rolled out and is enabled by default, supporting one-click dynamic-to-static conversion while ensuring excellent performance and good extensibility of the framework.
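
As a reference for the one-click dynamic-to-static workflow, here is a minimal sketch using paddle.jit.to_static and paddle.jit.save; the network, shapes, and output path are illustrative assumptions.

```python
import paddle
from paddle.static import InputSpec

class Net(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        self.fc = paddle.nn.Linear(16, 4)

    def forward(self, x):
        return paddle.nn.functional.relu(self.fc(x))

net = Net()
# Convert the dynamic-graph Layer to a static graph program.
static_net = paddle.jit.to_static(
    net, input_spec=[InputSpec(shape=[None, 16], dtype="float32")]
)
out = static_net(paddle.randn([8, 16]))

# Export the whole graph for inference deployment
# (a PIR program is saved as .json plus .pdiparams).
paddle.jit.save(static_net, "./export/net")
```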

Bug Fixes

Function optimization

New Features

Changes unrelated to ordinary users

Security Issues

  • Introduced approval rules for IR (Intermediate Representation) save/load operations to enhance security and governance during model serialization. #65737

Others

Developer

Performance optimization

Discontinued Features

3. Compiler architecture

The CINN compiler has seen comprehensive improvements in completeness and performance. In this version we thoroughly optimized both the front end and the back end of the compiler, including adding an automatic re-compute mechanism for backward computation graphs, optimizing front-end Pass performance, upgrading the symbolic derivation mechanism, optimizing operator fusion strategies, and enhancing the back-end Schedule strategies and subscript expression simplification. We also investigated and fixed a large number of correctness and performance issues, systematically enhancing the compiler's general optimization capability. With the CINN compiler enabled on the PaddlePaddle PaddleX series models, over 60% of the models show significant performance improvements compared to dynamic graph mode.
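
For orientation, below is a minimal sketch of enabling CINN on a dynamic-graph function via the backend="CINN" option of paddle.jit.to_static; it assumes a Paddle build with CINN support, and the toy normalization function is illustrative.

```python
import paddle

def layer_norm_like(x, w, b):
    # A small elementwise/reduction pattern that CINN can fuse into one kernel.
    mean = x.mean(axis=-1, keepdim=True)
    var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
    return (x - mean) / paddle.sqrt(var + 1e-5) * w + b

# Compile the whole graph with the CINN backend.
compiled = paddle.jit.to_static(layer_norm_like, backend="CINN", full_graph=True)

x = paddle.randn([32, 1024])
w = paddle.ones([1024])
b = paddle.zeros([1024])
out = compiled(x, w, b)
```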

New Features

  1. New hardware backend support: Added support for two new backends, HIP and SYCL. (#65146, #65329, #69554, #71204, #65438, #66476, #66620, #67813)
  2. Added support for manually setting value ranges, equality constraints, and other information for symbolic dimensions in inference scenarios. (#67628, #67384)

Function optimization

  1. Optimized error message printing to enhance the development and debugging experience. (#67738, #68769, #71076)
  2. Added support for the Welford algorithm, which ensures both the performance and accuracy of BatchNorm-related operator kernels. (#71184, #71057)

Performance optimization

  1. New backend optimization strategies such as GridReduce, Loop merging, Transpose tuning, and automatic vectorization have been added, significantly enhancing Kernel performance across various dimensional spaces and under different hardware configurations in all scenarios. (#67236, #68897, #69409, #65336, #66419, #68338, #68364, #71087, #68019, #68122, #65187, #66742, #67083, #68667, #68750, #69376, #69350, #69740, #68918, #70092, #69607, #69794, #70258, #70547, #70581, #70649, #69732, #70786, #70942, #71014, #71263, #71249, #71340, #71301, #71380)
  2. Optimize operator fusion strategies, upgrading various strategies including horizontal fusion, multi-downstream fusion, Reshape alignment fusion, etc., to further enhance the fusion capabilities of operators and improve end-to-end optimization performance. (#66034, #67829, #68171, #69478, #69691, #70665, #71103, #70873)
  3. The simplification capability of backend subscript expressions has been upgraded, supporting the simplification of complex expressions with dynamic and static dimensions, significantly reducing the subscript computation overhead in the generated backend Kernel. (#68011, #68617, #68624, #68685, #68220, #68720, #68753, #68986, #68987, #69071, #69164, #69282, #69522, #69857, #70208, #70355, #70427, #70450, #68737, #70500, #70953, #70933, #71026, #70456, #70257, #70461, #70142, #71018, #71278)
  4. A new automatic Re-Compute mechanism for reverse computation graphs has been added, which can effectively reduce model training memory usage and improve performance. (#69342, #70255, #68241, #69954, #70832)
  5. Optimize the backend Host and Device code compilation process to reduce compilation time and improve the processing performance of branches in the Broadcast scenario. (#65669, #65916, #66109, #65611, #65990, #66088, #66207, #66537, #66768, #70685, #71410, #66062)
  6. Improved and upgraded the mechanisms for symbol derivation, simplification, and caching in dynamic dimensions, added symbol derivation interface implementations for all conventional operators (580+), and provided more constraint information for Kernel compilation.(#65343#66582#65500#65591#66637#68208#68056#68015#68096#68236#68973#68967#69133#68550#68882#69005#69911#70376#71153#66644#66650#66642#66729#66838#66762#66580#66612#66625#66643#66837#66946#67018#67049#66956#67008#66930#66877#66896#67120#67117#67098#67136#67294#67327#66827#67201#66892#67377#66619#67037#67412#67394#67374#67418#67348#67337#67390#67407#67491#67422#67461#67458#67486#67490#67462#67364#67435#67665#67426#67507#67730#67776#67806#67803#67788#67705#67814#67858#67751#67875#67663#67434#67818#68180#68547#68548#68670#68964#68929#68907#68917#68984#68644#69167#68975#68947#68978#68980#68979#69329#69055#69331#69414#69335#69017#69344#69069#69698#69919#69964#70337#70282#70741#70818#71031#70541#66609#66889#66633#66735#66935#66627#66730#67210#67115#67275#67472#67577#67328#67566#67451#68098#68225#68177#68102#67951#67957#68235#68447#68446#68183#68318#68385#67635#65623#65956#66063#65992#65880#66343#65889#66606#66618#66737#66607#66579#66732#66849#66400#66952#66570#66967#66595#67121#67206#67444#67494#67499#67267#67567#67455#67161#67581#67539#67625#67690#67454#67731#67734#67735#67607#67413#67387#67882#67864#67503#67861#67888#67884#67826#68044#67851#68276#69888#70093#70436#70914#71222)
  7. Optimized some front-end passes to enhance the robustness of the front-end processing flow and improve the performance of computationally intensive subgraphs. (#65142, #67466, #69228, #70994, #71226, #71297, #71443)
  8. Designed new backend IR basic components and related Pass interfaces to provide a more concise and efficient way of developing optimization strategies. Through automatic pruning strategies, it can effectively reduce the traversal overhead of backend IR. (#70485, #70765, #71042, #70952, #69454, #70361, #70334, #70406, #70191, #70462, #70548, #70592, #70437, #70619, #70543, #69611, #70739, #70533, #70696, #70498, #70829, #71111, #70883)

Bug fixes

  1. Fix some bugs in the derivation and implementation logic of operator symbols. (#65185, #65231, #65266, #65951, #67142, #67286, #65958, #65955, #66470, #66764, #66036, #66662, #66741, #66745, #66807, #66791, #66859, #66880, #66962)
  2. Fixed bugs in the lowering of some special operators to the compiler. (#68698, #68699, #68691, #68948, #70144, #70895)
  3. Fixed the issue of errors reported in some scenarios when integrating operators. (#67038, #67400, #67655, #67723, #68029, #68042, #68888, #69250, #69937, #70924)
  4. Fixed correctness issues in the backend when handling extreme values, improving the robustness of the compiler. (#68327)
  5. Fixed implementation logic bugs in the backend Schedule and post-processing tuning process, resolving errors and performance issues in some cases. (#68605, #68937, #68587, #69060, #69608, #71471, #71068)
  6. Resolved the issue of randomness in the operator fusion process. (#69547, #70931)

4. Automatic parallel architecture

In the official 3.0 release, we have conducted in-depth verification and refinement of the automatic parallel architecture to better support the pre-training + fine-tuning workflow for common large model scenarios such as text-only dense models, text-only sparse (MoE) models, and multi-modal understanding models. Specifically, we added slicing derivation rules for more than 20 operators needed by these scenarios, and we support converting automatic-parallel training parameters into manual-parallel parameters for downstream inference, making automatic parallelism fully usable and helping users reduce the development cost of parallel programs for large models. To further simplify distributed development, we introduced a new paddle.distributed.parallel interface; built as a wrapper over the distributed tensor annotation syntax, it lets users configure common parallel strategies such as data parallelism, model parallelism, and pipeline parallelism non-intrusively, outside the model definition. Furthermore, the static graph automatic parallel architecture has been comprehensively upgraded on top of PIR: the underlying basic components, core modules, parallel strategies, and performance optimization strategies are all implemented uniformly on the extended PIR DistDialect. This further improves the dynamic-static consistency of automatic parallelism, achieving performance on the Llama series models that matches or even surpasses manual parallelism.
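
As a concrete reference for the tensor annotation syntax that paddle.distributed.parallel wraps, here is a minimal sketch of dynamic-graph automatic parallelism with shard_tensor; the 1x2 mesh, layer sizes, and launch command are illustrative assumptions.

```python
import paddle
import paddle.distributed as dist

# Run with: python -m paddle.distributed.launch --devices=0,1 demo.py
mesh = dist.ProcessMesh([0, 1], dim_names=["mp"])

class MLP(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        # Annotate only the weights; Paddle derives the sharding of every other
        # tensor and inserts the required communication automatically.
        self.w0 = dist.shard_tensor(
            self.create_parameter(shape=[1024, 4096]), mesh, [dist.Shard(1)])
        self.w1 = dist.shard_tensor(
            self.create_parameter(shape=[4096, 1024]), mesh, [dist.Shard(0)])

    def forward(self, x):
        return paddle.matmul(paddle.matmul(x, self.w0), self.w1)

mlp = MLP()
x = dist.shard_tensor(paddle.randn([8, 1024]), mesh, [dist.Replicate()])
loss = mlp(x).mean()
loss.backward()
```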

New Features

  • Added the paddle.distributed.parallel interface to support configuring common parallel strategies outside of model networking, simplifying the distributed development process. #69004, #69033, #69077, #69136, #69169, #69212, #69217, #69283, #69288, #69326, #69365, #69384, #69426, #69443, #69462, #69492, #69628, #69677, #69697, #69776, #69896, #70138, #70182, #70539, #71116, #71210
  • For pure text sparse scenarios, it supports MoE expert parallelism, implements an expert parallelism to mesh partitioning conversion mechanism, and supports automatic invocation of all2all communication. #66462, #66750, #68004, #68053, #68187, #68477, #69098, #69262, #69296, #70715, #71292, #71320
  • To meet the needs of users in extreme manual optimization scenarios for managing segmentation status and communication operations, and to address the issue of being unable to use tensor segmentation syntax in some non-SPMD scenarios, we have added the LocalLayer interface to support a hybrid network of automatic and manual parallelism. #70519, #70525, #70600, #71232, #71264, #71373
  • To enable users to run automatic parallel programs using domestic hardware, we have completed the adaptation for Kunlun chips, and support for other chips is also underway. #70997, #71126, #71229, #71289, #71425, #71500
  • For situations where the data dimension cannot be divided evenly by the device dimension, non-balanced splitting derivation and splitting transformation are supported. #66103, #67756, #69265, #70072
  • The shard_dataloader function has been upgraded to support setting the gradient accumulation step count through batch_sampler, and also supports scenarios with multiple model inputs. #65325, #70659
  • Upgrades have been made to the parameter saving and loading functions, supporting asynchronous storage of parameters, mutual loading of master_weight between dynamic and static graphs, as well as parameter version control and offload functions. #66858, #67427, #70105, #70639
  • To meet users' needs for converting dynamic networking involving PyLayer to static, support has been added for PyLayer in static graph mode, allowing distributed tensors to be run within PyLayer. #67326, #68190, #69089, #70831
  • To address incorrect dynamic-to-static conversion caused by mismatches between the data stream's input format and the input_spec actually required by the model, the dynamic-to-static interface supports a user-defined input_spec, allowing users to supply the required input_spec themselves. #69183
  • For hybrid parallel scenarios, the gradient clipping strategy has been adapted and supported. #65259, #65928, #69287, #69760, #71421
  • For scenarios where the number of model layers is not divisible by the number of devices, a non-balanced pipeline parallel strategy is supported, allowing users to split different numbers of network layers at different pipeline stages. #69728, #70164, #70230
  • Added set_mesh and get_mesh interfaces to enable users to easily set and retrieve the global mesh. #69999
  • Added automatic and manual parallelism accuracy alignment switches to facilitate the conversion of existing manual parallelism models to automatic parallelism and verify the accuracy of the results. #67681

Functional improvements

Improved and optimized operator slicing derivation rules

  • Added slicing derivation rules for the add_n, split, and softmax_grad operators. #65606, #69439
  • Added slicing derivation rules for the assign and embedding_grad operators. #67457
  • Added the slicing derivation rule for the clip operator. #70632
  • Added slicing derivation rules for the dist_stack and gather_nd operators. #65426
  • Added the slicing derivation rule for the dropout operator. #70216
  • Added the slicing derivation rule for the fused_dropout_add operator. #67722
  • Added the slicing derivation rule for the fast_ln custom operator. #68148
  • Added slicing derivation rules for the greater_equal and less_equal operators. #68868
  • Added slicing derivation rules for the greater_than and less_than operators. #68133
  • Added the slicing derivation rule for the if operator. #69357
  • Added slicing derivation rules for the logical_and, logical_not, logical_or, and logical_xor operators. #67840
  • Added the slicing derivation rule for the logsumexp operator. #67840
  • Added the slicing derivation rule for the non_zero operator. #67996
  • Added the slicing derivation rule for the pad operator. #68304
  • Added the slicing derivation rule for the p_norm operator. #68317
  • Added the slicing derivation rule for the scatter_nd operator. #67980
  • Added the slicing derivation rule for the sigmoid operator. #71092

Upgrade of the PIR-based static graph automatic parallel architecture

Bug fixes

  • Fixed bugs in the segmentation derivation mechanism and the segmentation derivation rules for several operators. #65702, #65835, #66098, #66955, #67052, #67059, #67101, #67283, #67729, #67996, #68413, #68455, #68533, #68976, #68977, #69027, #69203, #69223, #69862, #69991, #70100, #70624, #71024, #71152, #71214, #71253, #71388
  • Fixed several bugs in the segmentation conversion mechanism. #65060, #65820, #67630, #67809, #68115, #68468, #70023
  • Fixed the bug of incorrect derivation of shard_degree in parameter slice parallelism. #68781, #69214
  • Fixed issues in scenarios such as inconsistent results between dynamic and static graphs in shard_dataloader, slicing dict-type data, and custom sampler scenarios. #65262, #66096, #66882, #69620
  • Fixed the bug where the recompute setting with use_reentrant=false was incompatible with parameter slicing. #65188
  • Fixed bugs in the parameter loading and saving functions. #66266, #69764
  • Fixed bugs in operators such as Conv2D, fill_constant, flash_attn_grad, reduce_scatter, if, tuple_push, and tuple_pop. #67587, #68008, #68586, #68589, #69519, #70207
  • Fixed bugs in communication operators such as reduce_scatter, p_send, and p_recv. #67386, #71433
  • Fixed bugs related to tensor type promotion. #66541, #68342
  • Fixed the bug where automatic allocation of GPU memory occurred when converting uninitialized distributed tensors to NumPy arrays on some cards. #66361
  • Fixed the bug that triggered data copying when calling to_tensor on non-segmented tensors. #67169
  • Fixed the bug related to the segmentation of the scaler parameter. #68289
  • Fixed the accuracy issue of enable_delay_scale_loss. #68525
  • Fixed the hang issue caused by different creation orders of communication groups. #68847
  • Fixed the bug of incorrect op_role setting in static graph scenarios. #67850, #67986, #68156
  • Fixed the bug where the output variable of the random number operator could not be sliced in static graphs. #67589, #67750, #68067
  • Fixed the bug where the graph cache mechanism failed in static graphs. #68488
  • Fixed the bug of index out-of-bounds in paddle.distributed.to_distributed. #70174
  • Fixed a bug in the pipeline parallel visualization tool. #71386

5. Operator mechanism

Operator-related PRs cover the decomposition of composite operators, the adaptation of operator kernels for new hardware, sparse operator work, and the retirement of legacy IR operators, laying the foundation for PIR-based compiler support and performance advantages across multiple hardware platforms. Standardization of the operator system has improved the code structure, reduced technical debt, and increased maintainability.

New Features

Bug Fixes

Others

Discarded

Developer-related

Improvement

  • Supported more data types. #69143
  • Updated the XPU interface. #69800
  • Improved operator printing functionality. #69916
  • Upgraded the normalize operation to support more scenarios. #70152
  • Extended group_norm to handle cases where the rank is greater than 5. #68774
  • Improved the usage of backward_blacklist. #69356

Performance improvement

  • Optimized the performance of the where_double_grad operator. #70404
  • Change "for range" to "slice" to speed up the execution of grad. #69938

6. Framework performance optimization

Performance-related PRs, covering operator performance optimization, kernel performance enhancement, memory usage optimization, and namespace cleanup, aim to provide users with a better development experience.

New Features

Functional improvements

Bug Fixes

Performance optimization

Others

Discarded

7. Inference deployment

This release focuses on two core directions, building the ecosystem of the new-generation Paddle Intermediate Representation (PIR) and optimizing large model inference. The main breakthroughs include:

  1. Deep integration of PIR and TensorRT
  • Completed refactoring and code optimization of the core execution mechanism and developed over 50 operator converters
  • Added low-precision support (FP16/INT8) and Generic Plugin execution capability (see the sketch after this list)
  • Built a complete unit testing system covering the full model loading/saving workflow
  2. Leap in large model inference performance
  • Added full-pipeline support for Mixture of Experts (MoE), covering Hopper architecture optimizations
  • Supports 128K ultra-long sequences, enhancing long-text inference capability
  • Implemented cutting-edge quantization schemes such as FP8/W8A8 to reduce memory usage
  3. Comprehensive infrastructure upgrade
  • OneDNN upgraded to version 3.6, significantly improving CPU inference performance
  • Model loading speed improved by over 40%, with fast loading of PIR models
  • Improved distributed inference support and fixed allreduce data type issues
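
As a reference for the low-precision inference path, below is a minimal sketch of the Paddle Inference API with the TensorRT FP16 engine enabled; the model files (assumed to be a PIR .json program plus .pdiparams, as exported earlier), shapes, and tuning values are illustrative assumptions.

```python
import numpy as np
import paddle.inference as paddle_infer

# Load an exported model and enable the GPU + TensorRT FP16 path.
config = paddle_infer.Config("./export/net.json", "./export/net.pdiparams")
config.enable_use_gpu(256, 0)  # 256 MB initial GPU memory pool, device 0
config.enable_tensorrt_engine(
    workspace_size=1 << 30,
    max_batch_size=8,
    min_subgraph_size=3,
    precision_mode=paddle_infer.PrecisionType.Half,  # FP16 low-precision path
)

predictor = paddle_infer.create_predictor(config)
input_handle = predictor.get_input_handle(predictor.get_input_names()[0])
input_handle.copy_from_cpu(np.random.rand(8, 16).astype("float32"))
predictor.run()
output = predictor.get_output_handle(predictor.get_output_names()[0]).copy_to_cpu()
```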

New Features

Functional improvements

  • The Inference functional mechanisms are fully established under PIR
  • The executor supports loading .json models #65223
  • Support for controllably enabling/disabling PIR mode #65596
  • Improved large model inference mechanisms
  • Optimized GEMM algorithm search (cuBLASLt global search / offline caching) #65597, #66132
  • Enhanced type system compatibility (PD_VISIT_FLOATING_AND_HALF_TYPES) #71022
  • Optimized attention mechanisms (multi-block MMHA / XPU support) #67211, #68104

Performance optimization

  • OneDNN has been upgraded to version 3.6, resulting in a general improvement in model inference performance on GNR/EMR devices #69386
  • Operator performance optimization (layer_norm/top_p_sampling) #65711
  • Model loading acceleration (regular/PIR model) #69110, #70219

Bug fixes

Other modifications

  • Code cleanup and maintenance (API deprecation/compilation warning fixes) #68048, #70384
  • Third-party integration optimization (OpenVINO submodule management) #70313, #70425

8. Hardware adaptation

Continuously improved and upgraded functionality on platforms such as Kunlun Core XPU and Haiguang DCU to enhance the user experience.

New Features

The addition of operations (ops) and improvement of functions on Kunlun Core XPU involve the following ops: flash attention/flash_attn_unpadded, multinomial, matmul, repeat_interleave, logsumexp, index_put_grad, mean_grad, pow, pow_grad, rsqrt, full, rms_norm, rms_norm_grad, put_along_axis, Cumsum, argmin, masked_select/grad, expand_v2/grad, all2all, expand, reduce_sum, reduce_max, reduce_min, moe, fused_linear_param_grad_add, adamw, clip/clip_grad, tan, acos, blha_get_max_len, gather/gather_grad, scatter/scatter_grad, round, index_select/sindex_select_grad, isfinite, isinf, quantize_linear, dequantize_linear, conv3d_transpose, logsumexp_grad, index_add_grad, eye, gather_element, tril, triu, set_value_grad, argmax, take_along_axis, etc #65413, #64846, #65656, #65963, #66143, #66482, #66585, #67077, #67173, #67551, #63989, #67919, #68052, #68176, #68408, #68454, #68478, #68473, #68453, #68770, #68933, #69042, #68713, #69368, #69723, #69767, #69898, #69970, #69771, #70176, #70428, #70573, #70576, #70633, #70114, #70627, #71038, #71132, #71228, #71274, #71364, #71375, #71431, #71451, #67585, #67637, #67914, #67641, #67913, #67955, #68411, #68560, #68423, #68894, #71053, #71047, #69056, #70843, #65653, #68023, #67780, #68622, #67215

Added support for rocsolver and warpctc on Haiguang DCU, and added ops and improved functionality. The ops involved include: flash_attention, hipblaslt, fastgelu, multiclass_nms3

#68066, #69457, #68603, #65599, #70587, #71337, #70173

Bug fixes

Bug fix for OP on Kunlun Core XPU #65020, #65251, #65418, #65387, #65525, #65613, #65533, #65705, #65915, #66238, #66485, #67349, #67372, #67276, #67460, #67496, #67530, #67828, #68010, #68157, #68172, #68388, #68213, #68501, #68504, #68585, #69229, #69374, #69424, #69440, #69614, #68542, #69990, #70351, #70479, #70431, #70638, #70856, #70974, #70973, #71027, #71062, #71115, #71110, #70858, #71147, #71212, #71361, #71423, #70859, #71492, #71493, #69826, #67341, #68906, #71171

Bug fix for OP on Haiguang DCU #69617, #65716, #66630, #65399

Performance optimization

Upgraded basic components such as streams on Kunlun Core XPU and optimized the performance of certain operators. #65102, #69727, #69899, #69942, #70025, #70640

Upgrade of hardware underlying basic libraries

The upgrade of the basic library supports Kunlun Core P800, as well as the support for basic components #65494, #65924, #69752, #70835, #65554, #66998, #65278, #70614, #71012, #71178, #71168, #68740, #71100, #65221, #67983

Others

Modifications to related modules such as op test #65654, #66233, #66728, #67959, #68169, #68418, #68434, #68445, #68877, #68993, #69006, #70471, #70706, #67777, #65698, #68433, #65689

9. Environment update

  • We optimized the framework's stability and cross-platform compatibility, fixed issues related to test coverage and compilation environment compatibility, and enhanced support for multiple platforms such as Windows, XPU, and DCU. Simultaneously, we streamlined the code structure, removed obsolete code and unnecessary dependent libraries to reduce maintenance costs, upgraded key dependencies such as CUDA, further optimized the CI/CD process, improved build speed, and enhanced overall system stability.

Bug Fixes

Improvement and Upgrade

New Features

Discarded

10. Other

  • Changes unrelated to ordinary users, including cleanup of obsolete code, code migration, unit test cleanup, debugging, and upgrades to monitoring mechanisms.

Developer-related content

Discarded

11. List of contributors

0x3878f, 0x45f, 2742195759, 86kkd, A-nnonymous, ADream-ki, Aganlengzi, Albresky, AndPuQing, AndSonder, Aoraki-Dream, ApricityXX, Asthestarsfalll, Aurelius84, BHmingyang, BeingGod, Betelgeu, BiynXu, CJ77Qi, Caogration, DDDivano, Dale1314, Deleter-D, DesmonDay, Difers, Dmovic, DongBaiYue, DrRyanHuang, DrownFish19, Eddie-Wang1120, EgoistSA, FeixLiu, ForFishes, Fripping, From00, Function-Samuel, GoldenStain, Guanhuachen2003, GuoxiaWang, Hanyonggong, HarperCy, Hongqing-work, HydrogenSulfate, JZ-LIANG, Jeff114514, JiaWenxuan, LLee233, LanCole, Lans1ot, Layssy, Leoforever123, LiYuRio, LielinJiang, LittleHeroZZZX, Liujie0926, Liyulingyue, Luohongzhige, Marcusryz, MarisaSparkL, Micalling, MikhayEeer, MrXnneHang, MufanColin, NKNaN, Neo-WY, NeroLoh, PolaKuma, Qin-sx, QingshuChen, RachelXu7, RichardWooSJTU, RuohengMa, SCUcookie, Sekiro-x, SigureMo, Sunny-bot1, SylarTiaNII, Sylence8, TBD1, TR666, TimeYWL, Tom-Zheng, Turingg, Victor-Bayim, Vvsmile, WAYKEN-TSE, Wanglongzhi2001, Wangzheee, Waynezee, Wennie396, Whsjrczr, Wizard-ZP, Wong4j, XavierZXY, XiaociZhang, XieYunshen, Xing-lil, Xreki, YKTian-x2b, YZW-explorer, YanhuiDua, YuanRisheng, ZHOU05030, ZhangHandi, ZhangX-21, ZibinGuo, a2064968462, anderson101866, aooxin, aquagull, baoqiwen, bapijun, blacksheep-Aristotle, bukejiyu, carryyu, ccsuzzh, chang-wenbin, changeyoung98, chen2016013, ckl117, cmcamdy, co63oc, continue-coding, cqulilujia, crazyxiaoxi, cszdrg, cubehan3, cyber-pioneer, danleifeng, decade-afk, deepllz, dynamicheart, eee4017, eggman-1024, enkilee, epiphanyer, ethan-sem, fangfangssj, feixi21, fightfat, fufu0615, fxfxfxfxfxfxfxfx, fxy1699, gitliuyf, gongel, gongshaotian, gongweibao, gouzil, gsq7474741, guixxiic, gzy19990617, hanyang2508, haoyu2022, heavyrain-lzy, houj04, huangjiyi, huangkr03, hxzd5568, icpcccpc, inaomIIsfarell, iosmers, jeff41404, jerrywgz, jiachengdai, jiahy0825, jinmingyi1998, jinyouzhi, joseflv, jychen21, jzhang533, kangguangli, kanze1, kineast, kircle888, l1cacheDell, leo0519, lifulll, linkk08, little1d, liufengwei0103, liuruyan, lixcli, liym27, liyongchao911, lizexu123, lizhenyun01, lj970926, lshpku, lszxb, ltd0924, luotao1, lwkhahaha, lxd-cumt, mayang002, megemini, mikemikimike, ming1753, monster1015, mori0umi, ndyysheep, nizne9, nobodynobody, ooooo-create, penPenf28, phlrain, pkuzyc, qili93, rich04lin, risemeup1, ronny1996, rsmallblue, runzhech, skywalker2012, smile2game, sneaxiy, successfulbarrier, sunzhongkai588, swgu98, tc20042008, tianhaodongbd, tianshuo78520a, tizhou86, tlxd, uanu2002, umiswing, vivienfanghuagood, waliwali777, walkalone20, wanghuancoder, wangna11BD, will-jl944, winffke, winter-wang, wwwuyan, xiaoguoguo626807, xiaoluomi, xiaoyao0115, xingmingyyj, xkkkkkk23, xu8117, xuxinyi389, xz-alex, yangrongxinuser, yeteye, yinfan98, yongqiangma, yuan20041218, yuanlehome, yuguo-Jack, yumin066, zbt78, zeroRains, zhangbo9674, zhanghonggeng, zhanglirong1999, zhangting2020, zhangyk0314, zhangyuqin1998, zhiminzhang0830, zhink, zhiqiu, zhouquan32, zhoutianzi666, zhwesky2010, zoooo0820, zrr1999, zty-king, zxcd, zyfncg