PaddlePaddle 2.6.0 Release Note EN

1. Important Updates

  • Paddle new-generation IR (PIR): To further improve the scalability of the PaddlePaddle framework, we have developed a new-generation intermediate representation. It abstracts the underlying core concepts of the PaddlePaddle framework, such as Operation, Attribute and Type, providing developers with flexible and efficient basic components. By introducing the Dialect mechanism, PIR can comprehensively and hierarchically satisfy the needs of each module for intermediate representations, greatly enhancing the scalability of the framework. PIR strictly follows the Static Single Assignment (SSA) principle, ensuring a unified top-level structure and the harmonious coexistence of "operator sequentiality" and "computational graph semantics". In addition, PIR provides a more concise and low-cost Pass development process, with a series of rich and functional built-in Pass optimization strategies, providing technical support for the ultimate performance optimization of large-scale models.
  • Static graph construction and compiler optimization architecture: To further improve the performance of the framework, PaddlePaddle's dynamic-to-static training capability has been comprehensively upgraded to support adaptive graph construction. This has been tested on more than 700 PaddlePaddle industry-level models, with a 100% success rate of starting static training with a one-line code change. Meanwhile, the Compiler Infrastructure for Neural Networks (CINN) of the PaddlePaddle framework is integrated into the main Paddle repository, making the compiler and PaddlePaddle more tightly integrated. CINN completes architectural optimization and improves its expansion capability, increasing system stability. Based on the PIR framework, it is much easier to bind dynamic-to-static conversion, primitive operators, the executor and the compiler together, providing more room for boosting the overall performance of the PaddlePaddle framework.
  • Enhanced dynamic graph distributed capability: Large models pose higher demands on the distributed training performance of the framework. PaddlePaddle has carried out comprehensive optimizations in the dimensions of communication library, graph analysis, distributed strategy and task enable/disable, enhancing the distributed computing capability of PaddlePaddle's dynamic graph and providing support for efficient training of large models. In terms of performance, training performance is further improved by reducing pipeline-parallel GPU memory occupation, adopting TensorFusion technology, implementing communication-computation overlap, and reducing non-essential data synchronization copies. Meanwhile, the flexibility of hybrid-parallel debugging is improved through environment variables that control the Optimizer. In addition, the stability of the system is significantly improved by fixing related bugs.
  • Auto parallel architecture with dynamic-static unification: To further reduce the difficulty of programming and optimizing large models, PaddlePaddle has fully optimized the Semi-Auto Parallel programming paradigm with dynamic-static unification, simplifying programming complexity for developers. Developers do not need to deeply understand the complex concepts and APIs of the manual parallel programming paradigm, such as row parallelism and column parallelism; they only need a small number of tensor distribution annotations to implement hybrid parallelism. The distribution specification is propagated to all tensors and operators automatically, and the framework handles the communication and synchronization needed by distributed training appropriately. Meanwhile, it supports dynamic-to-static distributed training by adding only one extra line of code, allowing developers to efficiently implement any hybrid parallelism strategy and greatly simplifying the development process of the hybrid-parallel training paradigm.
  • Hardware Integration Solution (CustomDevice): With increased demand for parallel training on new hardware in large model scenarios, PaddlePaddle has added support for advanced distributed policies, custom operators, and custom fusion policies. The distributed communication library is upgraded, with newly added support for many advanced distributed policies such as MP, GroupShared, PP, SP and MOE. Moreover, it allows vendors to flexibly access Transformer operator libraries of different granularities and to modify the computation graph through Fusion Passes for performance acceleration.
  • Installation and development experience: Modular compilation is adopted to optimize the logic of the CMake code and improve the efficiency of both full and incremental compilation of PaddlePaddle, which also increases the efficiency of daily development. Compilation with Python 3.12, CUDA 12 and the Hopper architecture is supported, and Clang and other tools are introduced to comprehensively optimize code formatting. In addition, C++ unit tests are changed from linking static libraries to linking dynamic libraries to reduce compilation size. These optimizations provide users with a smoother and more efficient installation and development experience.

2. Incompatible Upgrade

  • To avoid misuse, we removed the 0-dimensional Tensor compatibility switch, so that API behaviors are now consistent with mainstream industry conventions. The previous version already supported the 0-dimensional Tensor, but a compatibility switch was added to avoid error reporting in some models as much as possible; that is, in some scenarios where model suites were used frequently and had not yet been adapted, a 1-dimensional Tensor with only one element was still used by default in place of the 0-dimensional Tensor. In this version, the compatibility switch is removed, so a 1-dimensional Tensor with only one element will no longer replace the 0-dimensional Tensor in any scenario. The behaviors of 376 APIs that should support the 0-dimensional Tensor have been corrected and unified, thoroughly completing support for the 0-dimensional Tensor. #57036, #54581, #54500
  • To improve API usability, paddle.nn.functional.diag_embed has been streamlined to paddle.diag_embed, and the Tensor.diag_embed method is also supported (see the usage sketch after this list). #58223
  • To solve the problem of differentiation errors caused by Tensor index writes (e.g., tensor[0] = 10) under static graphs, and to comply with static graph specifications, this version introduces the paddle.static.setitem API. In static graph environments, this API is recommended for indexed write operations on a Tensor, instead of the subscript operator (see the sketch after this list). This change does not affect dynamic graph environments, where index writes using the subscript operator are still allowed. #53682
  • The paddle.fluid API is completely retired in this version. In this update, we removed all paddle.fluid APIs and deleted the fluid directory. Meanwhile, a small number of underlying PaddlePaddle public components have been consolidated into the paddle.base directory. PaddlePaddle users no longer need to pay attention to fluid-related concepts and APIs, which further simplifies the PaddlePaddle API system and improves readability. #56576, #54424, #54829, #53992, #54806, #55754, #55986, #55345, #56099, #51717, #54152, #55522, #55757, #58521, #54936, #55007, #55661, #55970
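
A minimal sketch illustrating the two API changes referenced above. The exact call forms shown here (paddle.diag_embed on a vector, and paddle.static.setitem(x, index, value) returning the updated tensor) are assumptions based on the descriptions in this section, not a complete reference.

```python
import paddle

# diag_embed is now available directly under the paddle namespace and as a
# Tensor method, in addition to paddle.nn.functional.diag_embed.
vec = paddle.to_tensor([1.0, 2.0, 3.0])
mat = paddle.diag_embed(vec)      # 3x3 matrix with vec on the diagonal
mat2 = vec.diag_embed()           # equivalent Tensor-method form

# Under static graphs, indexed writes should go through paddle.static.setitem
# instead of the subscript operator (dynamic graphs keep `x[0] = 10` as before).
paddle.enable_static()
main_prog, startup_prog = paddle.static.Program(), paddle.static.Program()
with paddle.static.program_guard(main_prog, startup_prog):
    x = paddle.zeros([3], dtype="float32")
    x = paddle.static.setitem(x, 0, 10.0)   # returns the updated tensor
exe = paddle.static.Executor()
exe.run(startup_prog)
(out,) = exe.run(main_prog, fetch_list=[x])
paddle.disable_static()
```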

3. Training Framework (including Distributed)

Python API

Upgrade Tensor indexing mechanism

This version comprehensively optimizes the basic, advanced and joint (combined) indexing functions of Tensor, to better comply with industry standards and user habits. Specifically, we added view support in basic indexing, fixed some incorrect behaviors in advanced indexing, and implemented the read function of joint indexing. In addition, we have sunk index parsing to the C++ level, improved the performance of advanced indexing operators, and removed redundant computations in bool indexing. With these optimizations, the performance of Tensor's basic, advanced and joint indexing has been improved comprehensively. #56893, #58643, #57986, #56272, #58856, #55211, #57023, #56613, #55602, #59281, #57737
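
A short sketch of the three indexing forms mentioned above. The view behavior of basic indexing follows the description in this section; the concrete shapes are illustrative only.

```python
import paddle

x = paddle.arange(24, dtype="float32").reshape([4, 6])

# Basic indexing (slices / single integers): with the new view support,
# the result shares storage with x, so in-place writes are visible in x.
row = x[1]
row[:] = 0.0          # reflected in x under the view semantics described above

# Advanced indexing with integer or bool tensors.
idx = paddle.to_tensor([0, 2])
picked = x[idx]       # gathers rows 0 and 2
mask = x > 10.0
big = x[mask]         # bool indexing returns the masked elements

# Joint indexing mixes basic and advanced indices in one expression (read path).
sub = x[idx, 1:4]     # rows 0 and 2, columns 1..3
```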

Upgrade Inplace mechanism

In earlier versions, to ensure the correctness of backward differentiation, when the backward computation of an API depended on its forward input data, PaddlePaddle avoided providing an Inplace version, since it might overwrite the original input data. This mechanism simplified the implementation, but also limited the ability of many APIs to offer Inplace functionality, which affected the user experience. In this version, PaddlePaddle has fully upgraded the Inplace mechanism: it automatically detects whether the backward computation depends on the forward inputs and saves the input data when needed, so more Inplace operations can be supported. This improvement not only increases memory usage efficiency, but also enhances the functionality of the APIs. In addition, we have added 109 new APIs that support Inplace operations, including paddle.abs_, paddle.sin_/cos_/tan_, comparison operations such as paddle.greater_than_/less_than_/equal_, logical operations such as paddle.logical_and_/logical_or_/logical_not_, paddle.neg_ and paddle.log_. While enriching the feature set of PaddlePaddle, this improves users' efficiency and convenience in numerical computation and deep learning tasks. #54683, #55078, #55576, #56888, #55509, #57093
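
A minimal sketch of the trailing-underscore Inplace APIs listed above; only a few of the 109 new APIs are shown.

```python
import paddle

x = paddle.to_tensor([-1.0, 0.5, 2.0])

# Trailing-underscore APIs modify their input in place instead of allocating
# a new output tensor.
paddle.abs_(x)     # x -> [1.0, 0.5, 2.0]
x.log_()           # Tensor-method form of the same kind of in-place op
x.neg_()           # negates x in place

flag = paddle.to_tensor([True, False])
paddle.logical_not_(flag)   # in-place logical op on a bool tensor
```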

Other new APIs

  • Added paddle.nn.functional.scaled_dot_product_attention. This significantly improves the computational efficiency of the attention mechanism in large models, meeting the demand for high-performance computation in large-scale deep learning models. #55242
  • Added a series of new scientific-computing APIs, including paddle.cummax and paddle.cummin for cumulative maximum and minimum computation, paddle.index_fill and paddle.masked_fill for filling a tensor by index or mask, paddle.linalg.pca_lowrank for low-rank principal component analysis, paddle.hypot for calculating the length of the hypotenuse of a right triangle, and paddle.atleast_1d, paddle.atleast_2d, and paddle.atleast_3d to ensure a tensor is at least one, two, or three dimensional. We also provide paddle.select_scatter and paddle.diagonal_scatter for more flexible selection and scattering of tensor data, and paddle.multigammaln for computing the natural logarithm of the multivariate gamma function. In addition, new optimizer-related APIs are added in this version, including paddle.optimizer.lr.LinearLR and paddle.optimizer.lr.CosineAnnealingWarmRestarts for learning-rate scheduling, and paddle.io.SubsetRandomSampler to support random sampling from a subset of data. These APIs further enhance the flexibility and efficiency of PaddlePaddle in various application scenarios; a brief usage sketch of a few of them follows this list. #57416, #53546, #53743, #57295, #57726, #58764, #58323, #57720, #58209, #58214, #57792, #51395, #57724, #57355, #57744, #58244, #57599, #59343, #57879
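
A brief sketch of a few of the APIs above. The return convention assumed for paddle.cummax (values plus indices) and the single-tensor return of paddle.atleast_2d follow common conventions; treat them as assumptions rather than a full reference.

```python
import paddle

x = paddle.to_tensor([[3.0, 1.0, 2.0],
                      [0.0, 4.0, 4.0]])

# Running (cumulative) maximum along the last axis, with the positions
# where each running maximum was attained.
cmax, cmax_idx = paddle.cummax(x, axis=-1)

# Hypotenuse length: hypot(a, b) = sqrt(a^2 + b^2), elementwise.
h = paddle.hypot(paddle.to_tensor([3.0]), paddle.to_tensor([4.0]))   # 5.0

# Promote a 1-D input to at least 2 dimensions.
m = paddle.atleast_2d(paddle.to_tensor([1.0, 2.0]))   # shape [1, 2]
```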

New Generation of Paddle Intermediate Representation (PIR)

PIR systematically abstracts the underlying core concepts such as Operation, Attribute and Type, building a set of flexible and powerful base components for developers. In addition, by introducing the concept of Dialect, PaddlePaddle can comprehensively and hierarchically manage the requirements of each module on the Intermediate Representation (IR), and it supports developers in customizing Dialect extensions according to specific needs, significantly improving the scalability and adaptability of the framework. In terms of design, PIR strictly follows the Static Single Assignment (SSA) principle, unifies the top-level structure, and realizes the compatibility of "operator sequentiality" and "computational graph semantics", providing a clear and consistent view of the complex computation process. To further optimize the performance of large models, PIR also provides a more concise and low-cost Pass development process, including the Declarative Rewrite Rule (DRR) and Pattern Rewriter mechanisms. In addition, a series of rich and full-featured Pass optimization strategies are built in, to deeply optimize applications according to the characteristics of large models, thus providing strong support for the ultimate performance of large models. Through these innovative designs and optimization methods, PIR lays a solid foundation for the efficient operation and continuous expansion of the PaddlePaddle framework.

New features

Function optimization

Performance optimization

  • Added Passes for optimizing PIR Program operators and structure, such as DCE and constant_folding_pass. #54935,#59430,#58753,#58732
  • Added operator-fusion Passes such as fused_attention, fused_dropout_add, fused_gemm_epilogue_pass, fused_linear_param_grad_add_pass, fused_weight_only_linear_pass, and fused_softmax_mask_upper_triangle, to improve training and inference performance. #57557,#58272,#58188,#58401,#59366,#57655,#57360,#56672,#58537,#56247,#59391,#58897,#54933

Dynamic to static capability enhancement

Dynamic-to-static graph conversion is a key technology in deep learning frameworks. It allows developers to find the best balance between flexibility and training efficiency. This version of PaddlePaddle fully upgrades the core dynamic-to-static functionality; the success rate of dynamic-to-static training reaches 100% across 700+ models in the PaddlePaddle industry-grade model library.
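
A minimal sketch of the one-line conversion; the tiny network, shapes and the data-dependent branch are illustrative only.

```python
import paddle


class Net(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        self.fc = paddle.nn.Linear(8, 4)

    def forward(self, x):
        y = self.fc(x)
        # Data-dependent control flow: the adaptive graph construction decides
        # what is captured statically and where a graph break is needed.
        if paddle.mean(y) > 0:
            y = paddle.nn.functional.relu(y)
        return y


net = Net()
static_net = paddle.jit.to_static(net)   # the one line that enables static training
out = static_net(paddle.randn([2, 8]))
```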

New features

  • Adopted Python Eval Frame and virtual-machine simulated execution technology to innovatively implement an adaptive Graph Break mechanism. This mechanism is designed especially for control-flow scenarios. By introducing the CallLayer mechanism, it makes full use of the advantage of PaddlePaddle's dynamic-static unification, and supports a hybrid mode of Abstract Syntax Tree (AST) transcription and bytecode simulation. It efficiently captures control-flow operators, dramatically improving how much of the computational graph can be made static. At the cache optimization level, advanced optimization techniques such as common sub-expression elimination are integrated, significantly improving the execution efficiency of Guards. These optimizations not only reduce redundant computations, but also improve the overall execution speed of the system. To enhance the robustness of the system, a simple and efficient data intermediate-layer structure is designed, which supports correct recovery of SideEffects, ensuring the stability and reliability of the system in complex environments. In addition, it is widely compatible with mainstream interpreter versions from Python 3.8 to 3.11, providing users with a wide range of applicability. #57824,#55887,#58155,#56107,#57490,#58829,#57240,#57588,#58117,#59823,#56077,#58956,#57653,#59855,#59017,#58424,#58187,#57793,#59698,#59747,#59710,#59297,#58423,#56262,#58103,#58538,#58771,#59191,#57754,#59439,#59816,#59035
  • Added dynamic-to-static syntax transcription parsing for PyLayer functions, making PyLayer's conversion between dynamic and static graphs smoother. Users can now seamlessly carry out dynamic-to-static training on models containing PyLayer, and easily export inference models. #56108,#56531,#57066,#57633

Bug Fix

  • Fixed abnormal GPU memory usage in some dynamic-to-static scenarios under is_test=True mode. #58350
  • Fixed the issue with exporting a function decorated by @to_static as a jit.save model in scenarios like foo(x, x, y). #55963
  • Fixed inconsistencies between the dynamic- and static-graph behaviors of some APIs, improving the success rate and user experience of dynamic-to-static conversion. #56092

Fixed vulnerability

  • Fixed a potential security vulnerability in use of eval() in dynamic to static syntax transcription module. #60100

Enhanced distributed dynamic graph capability

To meet the needs of large models, this version focuses on improving the distributed computing capability of PaddlePaddle's dynamic graph. Various improvements have been made in the communication library, graph analysis, distributed policies and task enable/disable, to provide comprehensive support for large model training. In terms of performance, we further improved training performance by reducing pipeline-parallel GPU memory occupation, adopting TensorFusion technology, implementing communication-computation overlap, and reducing non-essential data synchronization copies. Meanwhile, the flexibility of hybrid-parallel debugging is improved through environment variables that control the Optimizer. In addition, the stability of the system is further improved by fixing related bugs.

New features

  • Added the TraceHang function in the communication library, to quickly locate the faulty node when a cluster training job hangs. #59217
  • To improve training efficiency and reduce memory usage, the dynamic graph supports the stride mechanism. #55156,#54762,#55850,#59190,#57005,#57331,#58033,#58303,#57835,#57189
  • Enhanced paddleviz function to facilitate analysis of computational graphs. #56837,#57626
  • In distributed Sharding strategies (Stage1,2,3), added main_grad function to support higher precision gradient accumulation, and reduce precision loss caused by low precision accumulation. #57972,#57934,#57473,#57537,#59611,#57960
  • In Sharding Stage1 strategy, added a switch variable to control whether to perform fusion calculation on Optimizer. #58790
  • In Recompute function, added support for Tuple input parameters, enhancing calling ability of Recompute interface. #56793
  • Enhanced Launch function, allowing distributed training without specifying endpoints in dynamic graphs. #54636

Function optimization

  • Implemented a new communication library with dynamic-static unification. Communication operators are fully adapted to the PHI operator system, reducing development and maintenance costs, and better supporting dynamic graphs and the auto parallel architecture upgrade. #54417,#57768,#57897,#55537,#56604,#57519,#56088,#57153,#57161,#57252,#57251,#57208,#57305,#57424,#57548,#57560,#57564,#57233,#55726,#58073
  • TCPStore is changed to a single instance to support dynamic graphs and auto parallel more flexibly. #55956
  • Improved maintainability and flexibility of distributed policies such as MP/PP/SP, including addition of printing warning and error messages, structural cleanup of code files, and optimization of PP restrictions on inputs. #54448,#59762,#55462,#54788,#54664,#56456,#55540
  • In PP strategy, added support for P2P communication in computation flow, making communication mode more flexible. #54747
  • Sharding strategy supports reduce Operation on gradient. #58842,#57967,#55495

Performance optimization

  • Implemented timely release of the last layer in the PP strategy, to save GPU memory. #54505
  • In MP-strategy Tensor fusion, supported passing in parameter groups, enhancing the Tensor fusion function. Improved allreduce asynchronous communication performance, and enhanced training performance through overlapping computation with communication. #57690,#55662
  • In the Sharding strategy, overlapped backward computation with gradient communication to improve training performance. For Sharding Stage 1, added Tensor fusion and fused grad clip and optimizer to improve computational efficiency, and supported overlap between VPP and DP/Sharding Stage 1 to improve communication and computation parallelism. Optimized the performance of Sharding Stage 1 under FP16: only the gradients this sharding rank is responsible for are checked in the check-finite stage, reducing computation overhead; added environment variables to control whether the Optimizer runs, to save GPU memory and allow model training debugging with fewer resources. #55598,#55427,#56063,#55766,#59848
  • In the Hybrid Parallel strategy, moved Tensor fusion under PP/VPP ahead of execution, to solve the extra GPU memory overhead of runtime fusion. Improved model training performance by reducing non-essential synchronous memcpy. #54403,#57215

Bug Fix

Auto parallel

This release fully optimizes the Auto Parallel programming paradigm with dynamic-static unification, to simplify programming complexity for developers. Developers do not need to understand the complex concepts and APIs of the manual parallel programming paradigm, such as row parallelism and column parallelism; only a small number of tensor distribution annotations is required to build a hybrid parallel model. The framework handles the derivation of the distributed states of all tensors and operators, and adds appropriate communication operators. Meanwhile, it supports dynamic-to-static distributed training with just one extra line of code, enabling developers to efficiently and easily implement any hybrid parallel strategy. This can significantly reduce the development cost of hybrid parallel training code.
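
A minimal sketch of the annotation-based paradigm described above, intended to run under paddle.distributed.launch on two devices. The helpers shown (dist.ProcessMesh, dist.shard_tensor, dist.Shard, dist.Replicate) follow the semi-auto parallel API described in this section; the shapes and mesh layout are illustrative assumptions.

```python
import paddle
import paddle.distributed as dist

# A 1-D process mesh over two devices; "x" names the mesh dimension.
mesh = dist.ProcessMesh([0, 1], dim_names=["x"])

w = paddle.randn([512, 1024])
# One annotation marks w as sharded along dim 0 over the "x" mesh axis;
# the distribution of downstream tensors/operators is derived automatically.
w_dist = dist.shard_tensor(w, mesh, [dist.Shard(0)])

x = paddle.randn([8, 512])
x_dist = dist.shard_tensor(x, mesh, [dist.Replicate()])

# The framework inserts whatever communication the distributed matmul needs.
y = paddle.matmul(x_dist, w_dist)
```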

Improved auto parallel core functions

Enhanced semi-auto parallel capability of dynamic graph

Enhanced semi-auto parallel for static graphs

  • Added Sequence Parallelism; added FThenB, Interleaved 1F1B, Eager 1F1B, VPP and other scheduling modes for Pipeline Parallelism, supported hybrid parallelism combining these new modes with the original parallelism, and supported visualization of pipeline scheduling. Upgraded the gradient synchronization mechanism to support gradient synchronization when data is sharded on any broadcast dimension. #57605,#54727,#54409,#54787,#58313,#59179,#59416,#59719,#59822,#59057,#59522,#57061
  • Adapted the executor to PIR, and supported PIR optimization Passes. In distributed scenarios, supports fusions such as fuse_linear to improve performance. #58459,#58528,#55555,#59757,#59102,#57917
  • Upgraded underlying architecture: upgraded the executor to reuse the results of data-flow dependency analysis and static kernel selection; upgraded entire graph based sharding completion mechanism, to switch to new sharding derivation rules and support some long-tailed cases; optimized the support of control flow under distributed static graph to adapt to more scenarios; reduced the graph compilation time and refined error message format to improve user experience. #55389,#55650,#54938,#57447,#57751,#57742,#59524,#59526,#58669,#57616,#56511,#55727,#58906,#56016,#54897
  • Optimized GPU memory usage in static graph mode, and added a refined recomputing strategy; optimized the auto mixed precision pass, allowed users to manually specify the auto-cast region, and fixed some bugs; supported parallel computation of cross-entropy; supported fusion operators such as scaled_dot_product_attention and fuse_rope; performed scheduling optimization to support better overlap between communication and computation in tensor parallelism and pipeline parallelism. #58421,#58533,#59498,#59187,#59188,#58172,#58628,#56185,#56696,#59497,#58304,#58977

AutoTuner

This release implements AutoTuner, a profiling-based automatic search and tuning tool for parallel strategies, which automatically combines parallel and optimization strategies. Users can select effective combination configurations for experiments, and AutoTuner will search for the optimal configuration for large model training and inference given the model and the hardware specification. In addition, AutoTuner implements a variety of pruning methods, including GPU-memory-modeling-based pruning, so the search space and search time can be significantly reduced. #54460,#54668,#59794,#59727,#59782,#54834,#58127,#56968,#55466,#56939,#58183,#58314,#55499,#59748

Operator library

Incompatible upgrade

To improve the maintainability of the PaddlePaddle framework, some deprecated operators in the framework (e.g., diag_v1, isfinite_v1, pad2d_v1) have been removed. Models that use these operators and were saved with PaddlePaddle 1.x training can no longer be used for inference on the new version of PaddlePaddle. #57895,#57892,#57898,#57730,#57732,#57810,#57884,#57794,#57926,#57925,#57807,#57808

Operator library enhancements

Fixed bug

CUDA

New features

  • Added the debugging API paddle.amp.debugging.check_numerics, which computes and returns the number of abnormal values (NaN, Inf) and zero elements in a Tensor. #54301
  • Added the fused_rope fusion operator to accelerate training of LLaMA-class large models. #54351
  • Updated the cuDNN Frontend API version to v0.9.1 and added the fused_scale_bias_add_relu fusion operator to accelerate ResNet networks. Note that this feature is experimental and disabled by default. #58367, #54949, #58504
  • Based on Flash-Attention v2, added Tensor-like Mask function support. Inverse operator supports deterministic computation for debugging. #57276, #56363
  • Modified sparse conv3d backend implementation to support 2d shapes, avoiding front-end reshape overhead. #54707
  • Added matmul_int8 operator. (#55228)

Function optimization

  • Optimized CUDA Graph's support for random number operators. #58310
  • Enhanced the default functionality of automatic mixed-precision training (a usage sketch follows this list), including:
    • Optimized the experience of using the automatic mixed-precision training interface. #58152,#55364,#57903
    • Added matrix-computation operators such as fused_attention, fused_feedforward, and fused_gemm_epilogue to the framework's default whitelist, and unified the default black and white list settings for dynamic and static graphs. #55373, #55713
    • The argsort, dist, erfinv, nanmedian and poisson operators and the lamb optimizer operator support FP16 and BF16 low-precision computing. #51662, #55105, #55287, #55824, #56056, #56184, #55641
    • Fixed the low-precision implementation of the elementwise_max operator, changing it to use the FP32 type for numerical computation to reduce precision loss. #54799
    • Changed the temporary result Tensor needed by Reduce-class operators to the FP32 type, to avoid the precision loss caused by converting intermediate results to low precision. #55709
  • Optimized GPU codes for flip, roll & roll_grad, index_put & index_put_grad, etc. Removed unnecessary C++ templates to optimize compilation time and reduce compiled binary size without performance degradation. #57309, #57525
  • For the bernoulli operator, added a check on legitimacy of input probabilities. #59174
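
A minimal dynamic-graph AMP sketch related to the default black/white list behavior above; the operator names placed in the custom lists are purely illustrative.

```python
import paddle

model = paddle.nn.Linear(16, 16)
opt = paddle.optimizer.AdamW(parameters=model.parameters())
scaler = paddle.amp.GradScaler(init_loss_scaling=2.0 ** 16)

data = paddle.randn([4, 16])
# O1 mixed precision; per-model entries can still be added on top of the
# framework's default black/white lists if needed.
with paddle.amp.auto_cast(level="O1",
                          custom_white_list={"elementwise_add"},
                          custom_black_list={"reduce_sum"}):
    loss = model(data).mean()

scaled = scaler.scale(loss)   # scale the loss before backward
scaled.backward()
scaler.step(opt)              # unscale gradients and run the optimizer step
scaler.update()
opt.clear_grad()
```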

Performance optimization

  • Optimized BroadcastKernel's support for large Tensors: the large Tensor is split and the INT32 version of the implementation is called multiple times, improving operator performance by 7.27x. #57313, #57996
  • Optimized the performance of the Tensor save interface by copying the Tensor to CPU before converting it to numpy, avoiding the overhead of automatically converting a non-contiguous Tensor to a contiguous one. #57040

Bug Fix

  • Fixed a bug of the memory_efficient_attention operator in supporting sm_90. #58070
  • Fixed the NaN problem of softmax operator when axis=-1 and length is greater than 100000. #57851
  • Fixed bug of GPU access error in some cases for set_constant operator. #59905
  • Fixed GPU storage read/write contention issue in fast implementation version of layer_norm operator. #56435

Expanded Compiler Infrastructure for Neural Networks (CINN)

In this update, PaddlePaddle CINN focuses on optimizing its architecture and comprehensively expanding its capabilities. In view of the increasing demand for dynamic shapes from large models, effective operation and optimization strategies of the compiler under dynamic shapes are initially explored and implemented. At the architectural level, a Python DSL is introduced, significantly improving CINN's development convenience and debugging capability and enabling developers to write and debug code more efficiently. Meanwhile, the Schedule logic has been refactored to be dominated by GroupSchedule, enabling more general and stable optimization strategies at the operator Group level. To enhance the stability of CINN, a strong-constraint component is explored and introduced, which can effectively reduce uncertainties and potential errors in the system. In addition, CINN's historical tool classes and software structure are systematically organized, optimized and improved, to further enhance the readability and maintainability of the code. In terms of integration with other PaddlePaddle components, the tight integration of CINN with PIR and Paddle has been further strengthened, making the compiler more coherent with the overall PaddlePaddle framework. This improvement not only enhances the performance of the compiler, but also provides developers with a smoother and more unified development experience.

Compatibility upgrade

  • Updated storage read interface to be compatible with Paddle 2.0. #55836
  • Updated relu6 Op Mapper compatibility. #55611

Modification deprecation

  • Removed old Schedule form. #55566,#55391
  • Removed some obsolete tests. #56245,#57987
  • Removed the remove_nested_block Visitor tool that no longer works. #56972
  • Removed other useless codes. #55413

New features

Function optimization

Performance optimization

  • Fusion of vit attention. #54139
  • Optimized block reduce. #58196

Fixed bug

Documentation

4. Deployment Direction (Paddle Inference)

General inference optimization

This version improves the performance and ease of use of the inference engine on GPU and CPU, reducing user cost and the application cost of online inference. On GPU: a high-performance multi-threaded asynchronous executor is supported, and the inference performance of each model improves by 5%-10%; the new version of TensorRT and BF16 inference capabilities are also supported, and TensorRT inference performance and ease of use are further improved. On CPU: the latest version of OneDNN high-performance inference is supported, and SwinTransformer, FastRCNN and other series of models have greatly improved performance.
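
A minimal GPU inference configuration sketch using the standard Paddle Inference Python API; the model/parameter file names are placeholders, and the TensorRT settings are illustrative.

```python
import paddle.inference as paddle_infer

# Placeholder paths: point these at a real exported inference model.
config = paddle_infer.Config("model.pdmodel", "model.pdiparams")
config.enable_use_gpu(256, 0)   # 256 MB initial GPU memory pool on device 0

# Optionally hand TensorRT-supported subgraphs to TensorRT (FP16 here).
config.enable_tensorrt_engine(
    workspace_size=1 << 30,
    max_batch_size=1,
    min_subgraph_size=3,
    precision_mode=paddle_infer.PrecisionType.Half,
    use_static=False,
    use_calib_mode=False,
)

predictor = paddle_infer.create_predictor(config)
```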

Large model inference optimized

This version implements fine-grained fusion inference optimization for generative large models. The optimization solution ensures both high-performance inference capability and excellent extensibility. Users can flexibly combine various fine-grained fusion operators and PaddlePaddle native operators to build the network structure of generative large models as required, thus achieving efficient and low-cost inference. In addition, the solution supports mainstream generative large model structures, significantly reducing the deployment cost of inference for such models and strongly supporting efficient, low-cost deployment of generative large models.

  • Supports the FMHA/MMHA for CacheKV division block scheduling. #59462
  • RoPE encoding fusion operator supports input sin/cos values. #55415
  • Added fine-grained fusion operators, supporting high-performance inference optimization of generative large models. Added operators such as quant_linear, weight_quantize, and linear_compress to support quantized inference of large models. #57852,#55128,#59090,#56706,#59951,#55490,#59291,#59441,#59778,#59651,#55301,#58637,#56673,#56401
  • Supports variable length inference series API. #57948
  • Supports the GQA inference. #58472,#58836
  • Added masked multihead attention. Supports high performance MMHA inference. #55344,#56411,#58134,#57936
  • weight_quantize/weight_only_linear supports the Volta architecture. #58082
  • Added weight_only_linear_grad to support gradient back-propagation for weight-only quantization of large models. #57685
  • Fixed large model dynamic-to-static bugs. Optimized the inter-card communication initialization logic under static graphs. #56390,#57169,#56688,#56592,#58868
  • Optimized top_p_sampling random number generation logic. #59494

Paddle-TensorRT Inference Optimization

Modification deprecation

  • Removed fc_elementwise_add fusion from OneDNN. #55504
  • Removed redundant ops. #54442

Bug Fix

5. Hardware Support

Hardware Integration Solution (Custom Device)

This update adds support for advanced distributed strategies, custom operators, and custom fusion strategies. By upgrading the distributed communication library, it supports MP, GroupShared, PP, SP, MOE and other advanced distributed strategies. Meanwhile, it enables vendors to flexibly access Transformer operator libraries of different granularities, and to modify the computation graph through Fusion Passes for performance acceleration.

New features

  • Upgraded CustomDevice to support Paddle's latest distributed communication library CommContext. Added a variety of advanced distributed strategies such as GroupShared and MOE. #56301,#54671,#57957,#56669,#54384,#54572,#54573,#54676
  • Upgraded CustomDevice to support CustomOP. Users can register operators that are not defined in the Paddle PHI operator library, and CustomDevice supports CustomOP via the C API. #57038,#55532,#56755,#55533,#55659
  • Added CustomDevice's support for CustomPass function. Modified the computation graph IR through Python API. #55511,#55728
  • Added CustomDevice’s support for Paddle run_check. #56318
  • Added CustomDevice’s support for StreamSafeAllocator. #55393,#56380,#56536,#58035
  • Added CustomDevice’s support for DataTransform. #56627

Function optimization

  • Added CustomDevice’s support for more PaddlePaddle APIs such as Variable.set_value, adamw, share_external_data, mp_allreduce_sum, tensor.numpy, get_paddle_place, and GeneratorState. #55272, #56386, #57253, #56927,#56189,#55225,#55247
  • Modified CustomDevice dynamic library loading method from RTLD_NOW to RTLD_LAZY, to facilitate subsequent checking of compatibility of CustomDevice related software stack version. #57544
  • Added CustomDevice's detection function for FP16 operator under mixed precision training. #56053,#56176

Bug Fix

Kunlunxin XPU

New features

  • Added XPTI (XPU Profiling Tool Interface) to support collection and analysis function of runtime performance data. #54685,#54690,#54800
  • Supports Paddle's latest distributed communication library CommContext. #59418
  • Added XPU fusion operators, for example, fast_where. #55628
  • Added support for the XPU Plugin function, facilitating users in developing XPU customized operators through XTDK programming. #55101,#59326
  • Added XPU’s support for AutoGrowthAllocator. #54121
  • Added operator support list of Kunlun3. #57683

Function optimization

  • Upgraded XPU Inference API. #54342
  • Optimized performance of some XPU operators. Added support for bf16 in some XPU operators, including unique/index_put, squeeze/unsqueeze kernels, swish/swish_grad, scatter_nd_add_grad/slice, rsqrt/bitwise_or/arange_tensor, where, collective. #56582,#58161,#58440,#58580,#58950,#58616,#59273
  • Optimized XPU memory management to avoid memory leakage. #59334,#54847
  • Supports INT8 inference. #57258
  • Added support for FP16 series inference operators. #55642,#54410
  • Supports share_external_memory interface to pass input and output. #55170
  • Supports open source quantization model XPU inference. #58568
  • Added context_gm_size configuration, instead of allocating global memory in Pass. #54674
  • Added embedding and fast_gather_nd plugin. #56488,#56103
  • Supports fusion of fast_layternorm + leaky_relu. #57113
  • Supports elementwise_min/max/floordiv/where inference in KL1 and KL2 precision. #58422
  • Supports autotune configuration of fc and conv2d operator. #58801
  • Supports conv and fc dynamic quantization. #59307
  • fc + act fusion support for sigmoid, swish and relu6. #54486
  • elementwise_sub/elementwise_div supports int data type. #55920

Bug Fix

Hygon DCU

Bug Fix

6. Environment Adaptation

Adopted modular compilation to optimize the CMake code logic, improving the efficiency of compiling PaddlePaddle and increasing the efficiency of developers' local builds. Meanwhile, supports compilation with Python 3.12, CUDA 12 and the Hopper architecture, and uses the Clang tool to comprehensively optimize code formatting. In addition, C++ unit tests are changed from linking static libraries to linking dynamic libraries to reduce compilation size. These improvements provide users with a smoother and more efficient installation and development experience.

Thanks to Our Contributors

Azure-Tang, zhaoyinglia, From00, JZ-LIANG, xysheng-baidu, SylarTiaNII, kuizhiqing, zhiqiu, FeixLiu, liuzhenhai93, GhostScreaming, pangengzheng, xiaoyewww, wanghuancoder, ForFishes, hitywt, danleifeng, tianshuo78520a, ykkk2333, houj04, lj970926, XiaociZhang, HarperCy, cqulilujia, runzhech, RuohengMa, Caozhou1995, kangguangli, heavyrain-lzy, zyfncg, SigureMo, YuanRisheng, lchdl, LiYuRio, AndSonder, Wennie396, zhangbo9674, liudongxue01, risemeup1, phlrain, winter-wang, yuanlehome, NALLEIN, Liujie0926, yuguo-Jack, gitliuyf, zh794390558, Aurelius84, 6clc, GGBond8488, xiaoguoguo626807, Wong4j, iosmers, xiaoxiaohehe001, LielinJiang, carryyu, Difers, yangxiaoyu14, xuxinyi389, cxxly, gongshaotian, jjyaoao, lijialin03, lxd-cumt, cyber-pioneer, HydrogenSulfate, MayYouBeProsperous, Charles-hit, Patrick-Star125, ScottWong98, huangjiyi, DrRyanHuang, jinyouzhi, BeingGod, Wanglongzhi2001, yangguohao, zyt1024, longranger2, 2742195759, megemini, thisjiang, kevincheng2, zhoutianzi666, Wangzheee, ming1753, tianhaodongbd, freeliuzc, zhenyun-li, MARD1NO, RichardWooSJTU, eee4017, leo0519, csy0225, wwbitejotunn, bukejiyu, jiweibo, iamsonderr, ckl117, ronny1996, zhanglirong1999, LLee233, ZHUI, wangxn12138, zhwesky2010, Courtesy-Xs, zoooo0820, llyyxx0413, Asthestarsfalll, zxcd, pkuzyc, idontkonwher, sneaxiy, hong19860320, ZibinGuo, leolishaohao, MuShangCC, zhupengyang, shentanyue, Travis-Lee, wz1qqx, frank-oops, newway, QingshuChen, zhangyk0314, HandSomeLEEw, Shixiaowei02, zhangyuqin1998, Xing-lil, zhhsplendid, jiahy0825, xinyu-intel, MarioLulab, 0x45f, Tom-Zheng, xingmingyyj, zhangbopd, gouzil, zeroRains, BiynXu, WintersMontagne10335, wuhuachaocoding, GreatV, chenwhql, deepllz, parap1uie-s, ozogxyz, FisherWY, changeyoung98, zhiboniu, YangQun1 dynamicheart, Xreki, liugddx, Lylinnnnn, YSF-A, zzjjay, YanhuiDua, lishicheng1996, USTCKAY, abenmao, cocoshe, HermitSun, ccsuzzh, sanbuphy, enkilee, RedContritio, Liyulingyue, zrr1999, chen2016013, Galaxy1458, chalsliu, mrcangye, XieYunshen, zhiheng-liu, haohongxiang, ZzSean, JamesLim-sy, yuehuayingxueluo, niuliling123, umiswing, sijunhe, littsk, SecretXV, zhurou603, zhangjun, caizejun, yangjianfengo1, vivienfanghuagood, Xinyu302, lizexu123, yghstill, Li-fAngyU, VigiZhang, co63oc, dhanush-2501, ooooo-create, PommesPeter, zeus2x7, akshatvishu, jzhang533, Sekiro-x, gumblex, BernieHuang2008, YibinLiu666, qiuwenbogdut, XavierZXY, MqLeet, zhangting2020, mingxu1067, Ainavo, SSKlearns, yuchen202, silverling, zade23, wenxiaohahaha, NKNaN, Tsaiyue, fsczz, Tomoko-hjf, rhmaaa, zbt78, Hhankyangg, wangzhen38, zhengqiwen1997, engineer1109, onepick, qili93, Rane2021, nemonameless, DesmonDay, RachelXu7, ceci3, lyuwenyu, liuruyan, LokeZhou, shiyutang, lanxianghit, feifei-111, Sahala08, sunzhongkai588, Kaedeharai, Candy2Tang, liyongchao911, whisky-12, InsaneOnion, yoyoIcy, KongAKun, linzeyang, MuhammadNizamani, eltociear, Ligoml, LUZY0726, Windfarer, FlyingQianMM, jeng1220, junelotus, zlsh80826, Vvsmile, Frida-a, TonibMw, guoshengCS, zhink, ZhangYulongg, AlbertVan, fengxin-hello, mjp9527, entired, DanGuge.