PaddlePaddle 2.6.0 Release Note EN

1. Important Updates

  • Paddle new-generation IR (PIR): To further improve the scalability of the PaddlePaddle framework, we have developed a new-generation intermediate representation. It abstracts the underlying core concepts of the PaddlePaddle framework, such as Operation, Attribute and Type, providing developers with flexible and efficient basic components. By introducing the Dialect mechanism, PIR can comprehensively and hierarchically satisfy the needs of each module for intermediate representations, greatly enhancing the scalability of the framework. PIR strictly follows the Static Single Assignment (SSA) principle, ensuring a unified top-level structure and the harmonious coexistence of "operator sequentiality" and "computational graph semantics". In addition, PIR provides a more concise and low-cost Pass development process, with a series of rich and functional built-in Pass optimization strategies, providing technical support for the ultimate performance optimization of large-scale models.
  • Static graph construction and compiler optimization architecture: To further improve the performance of the framework, PaddlePaddle's dynamic-to-static training capability has been comprehensively upgraded to support adaptive graph construction. This has been tested on more than 700 PaddlePaddle industry-level models, with a 100% success rate of starting static training with a one-line code change. Meanwhile, the Compiler Infrastructure for Neural Networks (CINN) of the PaddlePaddle framework is integrated into the main Paddle repository, making the compiler and PaddlePaddle more tightly integrated. CINN completes architectural optimization and improves its expansion capability, increasing system stability. Based on the PIR framework, it is much easier to bind dynamic-to-static conversion, primitive operators, the executor and the compiler together, providing more room for boosting the overall performance of the PaddlePaddle framework.
  • Enhanced dynamic graph distributed capability: Large models pose higher demands on the distributed training performance of the framework. PaddlePaddle has carried out comprehensive optimizations in the dimensions of communication library, graph analysis, distributed strategy and task enable/disable, enhancing the distributed computing capability of PaddlePaddle's dynamic graph and providing support for efficient training of large models. In terms of performance, training performance is further improved by reducing pipeline-parallel GPU memory occupation, adopting TensorFusion technology, implementing communication-computation overlap, and reducing non-essential data synchronization copies. Meanwhile, the flexibility of hybrid-parallel debugging is improved through environment variables that control the Optimizer. In addition, the stability of the system is significantly improved by fixing related bugs.
  • Auto parallel architecture with dynamic-static unification: To further reduce the difficulty of programming and optimizing large models, PaddlePaddle has fully optimized the Semi-Auto Parallel programming paradigm with dynamic-static unification, simplifying programming complexity for developers. Developers do not need to deeply understand the complex concepts and APIs of the manual parallel programming paradigm, such as row parallelism and column parallelism; they only need a small number of tensor distribution annotations to implement hybrid parallelism. The distribution specification is propagated to all tensors and operators automatically, and the framework handles the communication and synchronization needed by distributed training appropriately. Meanwhile, it supports dynamic-to-static distributed training by adding only one extra line of code, allowing developers to efficiently implement any hybrid parallelism strategy and greatly simplifying the development process of the hybrid-parallel training paradigm.
  • Hardware Integration Solution (CustomDevice): With increased demand for parallel training on new hardware in large model scenarios, PaddlePaddle has added support for advanced distributed policies, custom operators, and custom fusion policies. The distributed communication library is upgraded, with newly added support for many advanced distributed policies such as MP, GroupShared, PP, SP and MOE. Moreover, it allows vendors to flexibly access Transformer operator libraries of different granularities and to modify the computation graph through Fusion Passes for performance acceleration.
  • Installation and development experience: Modular compilation is adopted to optimize the logic of the CMake code and improve the efficiency of both full and incremental compilation of PaddlePaddle, which also increases the efficiency of daily development. Compilation with Python 3.12, CUDA 12 and the Hopper architecture is supported, and Clang and other tools are introduced to comprehensively optimize code formatting. In addition, C++ unit tests are changed from linking static libraries to linking dynamic libraries to reduce compilation size. These optimizations provide users with a smoother and more efficient installation and development experience.

2. Incompatible Upgrade

  • To avoid misuse, we removed the 0-dimensional Tensor compatibility switch, so that API behaviors are now consistent with mainstream industry conventions. The previous version already supported the 0-dimensional Tensor, but a compatibility switch was added to avoid error reporting in some models as much as possible; that is, in some scenarios where model suites were used frequently and had not yet been adapted, a 1-dimensional Tensor with only one element was still used by default in place of the 0-dimensional Tensor. In this version, the compatibility switch is removed, so a 1-dimensional Tensor with only one element will no longer replace the 0-dimensional Tensor in any scenario. The behaviors of 376 APIs that should support the 0-dimensional Tensor have been corrected and unified, thoroughly completing support for the 0-dimensional Tensor. #57036, #54581, #54500
  • To improve API usability, paddle.nn.functional.diag_embed has been streamlined to paddle.diag_embed, and the Tensor.diag_embed method is also supported (see the usage sketch after this list). #58223
  • To solve the problem of differentiation errors caused by Tensor index writes (e.g., tensor[0] = 10) under static graphs, and to comply with static graph specifications, this version introduces the paddle.static.setitem API. In static graph environments, this API is recommended for indexed write operations on a Tensor, instead of the subscript operator (see the sketch after this list). This change does not affect dynamic graph environments, where index writes using the subscript operator are still allowed. #53682
  • The paddle.fluid API is completely retired in this version. In this update, we removed all paddle.fluid APIs and deleted the fluid directory. Meanwhile, a small number of underlying PaddlePaddle public components have been consolidated into the paddle.base directory. PaddlePaddle users no longer need to pay attention to fluid-related concepts and APIs, which further simplifies the PaddlePaddle API system and improves readability. #56576, #54424, #54829, #53992, #54806, #55754, #55986, #55345, #56099, #51717, #54152, #55522, #55757, #58521, #54936, #55007, #55661, #55970
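
A minimal sketch illustrating the two API changes referenced above. The exact call forms shown here (paddle.diag_embed on a vector, and paddle.static.setitem(x, index, value) returning the updated tensor) are assumptions based on the descriptions in this section, not a complete reference.

```python
import paddle

# diag_embed is now available directly under the paddle namespace and as a
# Tensor method, in addition to paddle.nn.functional.diag_embed.
vec = paddle.to_tensor([1.0, 2.0, 3.0])
mat = paddle.diag_embed(vec)      # 3x3 matrix with vec on the diagonal
mat2 = vec.diag_embed()           # equivalent Tensor-method form

# Under static graphs, indexed writes should go through paddle.static.setitem
# instead of the subscript operator (dynamic graphs keep `x[0] = 10` as before).
paddle.enable_static()
main_prog, startup_prog = paddle.static.Program(), paddle.static.Program()
with paddle.static.program_guard(main_prog, startup_prog):
    x = paddle.zeros([3], dtype="float32")
    x = paddle.static.setitem(x, 0, 10.0)   # returns the updated tensor
exe = paddle.static.Executor()
exe.run(startup_prog)
(out,) = exe.run(main_prog, fetch_list=[x])
paddle.disable_static()
```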

3. Training Framework (including Distributed)

Python API

Upgrade Tensor indexing mechanism

This version comprehensively optimizes the basic, advanced and joint (combined) indexing functions of Tensor, to better comply with industry standards and user habits. Specifically, we added view support in basic indexing, fixed some incorrect behaviors in advanced indexing, and implemented the read function of joint indexing. In addition, we have sunk index parsing to the C++ level, improved the performance of advanced indexing operators, and removed redundant computations in bool indexing. With these optimizations, the performance of Tensor's basic, advanced and joint indexing has been improved comprehensively. #56893, #58643, #57986, #56272, #58856, #55211, #57023, #56613, #55602, #59281, #57737
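
A short sketch of the three indexing forms mentioned above. The view behavior of basic indexing follows the description in this section; the concrete shapes are illustrative only.

```python
import paddle

x = paddle.arange(24, dtype="float32").reshape([4, 6])

# Basic indexing (slices / single integers): with the new view support,
# the result shares storage with x, so in-place writes are visible in x.
row = x[1]
row[:] = 0.0          # reflected in x under the view semantics described above

# Advanced indexing with integer or bool tensors.
idx = paddle.to_tensor([0, 2])
picked = x[idx]       # gathers rows 0 and 2
mask = x > 10.0
big = x[mask]         # bool indexing returns the masked elements

# Joint indexing mixes basic and advanced indices in one expression (read path).
sub = x[idx, 1:4]     # rows 0 and 2, columns 1..3
```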

Upgrade Inplace mechanism

In earlier versions, to ensure the correctness of backward differentiation, when the backward computation of an API depended on its forward input data, PaddlePaddle avoided providing an Inplace version, since it might overwrite the original input data. This mechanism simplified the implementation, but also limited the ability of many APIs to offer Inplace functionality, which affected the user experience. In this version, PaddlePaddle has fully upgraded the Inplace mechanism: it automatically detects whether the backward computation depends on the forward inputs and saves the input data when needed, so more Inplace operations can be supported. This improvement not only increases memory usage efficiency, but also enhances the functionality of the APIs. In addition, we have added 109 new APIs that support Inplace operations, including paddle.abs_, paddle.sin_/cos_/tan_, comparison operations such as paddle.greater_than_/less_than_/equal_, logical operations such as paddle.logical_and_/logical_or_/logical_not_, paddle.neg_ and paddle.log_. While enriching the feature set of PaddlePaddle, this improves users' efficiency and convenience in numerical computation and deep learning tasks. #54683, #55078, #55576, #56888, #55509, #57093
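
A minimal sketch of the trailing-underscore Inplace APIs listed above; only a few of the 109 new APIs are shown.

```python
import paddle

x = paddle.to_tensor([-1.0, 0.5, 2.0])

# Trailing-underscore APIs modify their input in place instead of allocating
# a new output tensor.
paddle.abs_(x)     # x -> [1.0, 0.5, 2.0]
x.log_()           # Tensor-method form of the same kind of in-place op
x.neg_()           # negates x in place

flag = paddle.to_tensor([True, False])
paddle.logical_not_(flag)   # in-place logical op on a bool tensor
```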

Other new APIs

  • Added paddle.nn.functional.scaled_dot_product_attention. This significantly improves the computational efficiency of the attention mechanism in large models, meeting the demand for high-performance computation in large-scale deep learning models. #55242
  • Added a series of new scientific-computing APIs, including paddle.cummax and paddle.cummin for cumulative maximum and minimum computation, paddle.index_fill and paddle.masked_fill for filling a tensor by index or mask, paddle.linalg.pca_lowrank for low-rank principal component analysis, paddle.hypot for calculating the length of the hypotenuse of a right triangle, and paddle.atleast_1d, paddle.atleast_2d, and paddle.atleast_3d to ensure a tensor is at least one, two, or three dimensional. We also provide paddle.select_scatter and paddle.diagonal_scatter for more flexible selection and scattering of tensor data, and paddle.multigammaln for computing the natural logarithm of the multivariate gamma function. In addition, new optimizer-related APIs are added in this version, including paddle.optimizer.lr.LinearLR and paddle.optimizer.lr.CosineAnnealingWarmRestarts for learning-rate scheduling, and paddle.io.SubsetRandomSampler to support random sampling from a subset of data. These APIs further enhance the flexibility and efficiency of PaddlePaddle in various application scenarios; a brief usage sketch of a few of them follows this list. #57416, #53546, #53743, #57295, #57726, #58764, #58323, #57720, #58209, #58214, #57792, #51395, #57724, #57355, #57744, #58244, #57599, #59343, #57879
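
A brief sketch of a few of the APIs above. The return convention assumed for paddle.cummax (values plus indices) and the single-tensor return of paddle.atleast_2d follow common conventions; treat them as assumptions rather than a full reference.

```python
import paddle

x = paddle.to_tensor([[3.0, 1.0, 2.0],
                      [0.0, 4.0, 4.0]])

# Running (cumulative) maximum along the last axis, with the positions
# where each running maximum was attained.
cmax, cmax_idx = paddle.cummax(x, axis=-1)

# Hypotenuse length: hypot(a, b) = sqrt(a^2 + b^2), elementwise.
h = paddle.hypot(paddle.to_tensor([3.0]), paddle.to_tensor([4.0]))   # 5.0

# Promote a 1-D input to at least 2 dimensions.
m = paddle.atleast_2d(paddle.to_tensor([1.0, 2.0]))   # shape [1, 2]
```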

New Generation of Paddle Intermediate Representation (PIR)

PIR systematically abstracts the underlying core concepts such as Operation, Attribute and Type, building a set of flexible and powerful base components for developers. In addition, by introducing the concept of Dialect, PaddlePaddle can comprehensively and hierarchically manage the requirements of each module on the Intermediate Representation (IR), and it supports developers in customizing Dialect extensions according to specific needs, significantly improving the scalability and adaptability of the framework. In terms of design, PIR strictly follows the Static Single Assignment (SSA) principle, unifies the top-level structure, and realizes the compatibility of "operator sequentiality" and "computational graph semantics", providing a clear and consistent view of the complex computation process. To further optimize the performance of large models, PIR also provides a more concise and low-cost Pass development process, including the Declarative Rewrite Rule (DRR) and Pattern Rewriter mechanisms. In addition, a series of rich and full-featured Pass optimization strategies are built in, to deeply optimize applications according to the characteristics of large models, thus providing strong support for the ultimate performance of large models. Through these innovative designs and optimization methods, PIR lays a solid foundation for the efficient operation and continuous expansion of the PaddlePaddle framework.

New features

Function optimization

Performance optimization

  • Added Passes for optimizing PIR Program operators and structure, such as DCE and constant_folding_pass. #54935,#59430,#58753,#58732
  • Added operator-fusion Passes such as fused_attention, fused_dropout_add, fused_gemm_epilogue_pass, fused_linear_param_grad_add_pass, fused_weight_only_linear_pass, and fused_softmax_mask_upper_triangle, to improve training and inference performance. #57557,#58272,#58188,#58401,#59366,#57655,#57360,#56672,#58537,#56247,#59391,#58897,#54933

Dynamic to static capability enhancement

Dynamic-to-static graph conversion is a key technology in deep learning frameworks. It allows developers to find the best balance between flexibility and training efficiency. This version of PaddlePaddle fully upgrades the core dynamic-to-static functionality; the success rate of dynamic-to-static training reaches 100% across 700+ models in the PaddlePaddle industry-grade model library.
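
A minimal sketch of the one-line conversion; the tiny network, shapes and the data-dependent branch are illustrative only.

```python
import paddle


class Net(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        self.fc = paddle.nn.Linear(8, 4)

    def forward(self, x):
        y = self.fc(x)
        # Data-dependent control flow: the adaptive graph construction decides
        # what is captured statically and where a graph break is needed.
        if paddle.mean(y) > 0:
            y = paddle.nn.functional.relu(y)
        return y


net = Net()
static_net = paddle.jit.to_static(net)   # the one line that enables static training
out = static_net(paddle.randn([2, 8]))
```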

New features

  • Adopted Python Eval Frame and virtual-machine simulated execution technology to innovatively implement an adaptive Graph Break mechanism. This mechanism is designed especially for control-flow scenarios. By introducing the CallLayer mechanism, it makes full use of the advantage of PaddlePaddle's dynamic-static unification, and supports a hybrid mode of Abstract Syntax Tree (AST) transcription and bytecode simulation. It efficiently captures control-flow operators, dramatically improving how much of the computational graph can be made static. At the cache optimization level, advanced optimization techniques such as common sub-expression elimination are integrated, significantly improving the execution efficiency of Guards. These optimizations not only reduce redundant computations, but also improve the overall execution speed of the system. To enhance the robustness of the system, a simple and efficient data intermediate-layer structure is designed, which supports correct recovery of SideEffects, ensuring the stability and reliability of the system in complex environments. In addition, it is widely compatible with mainstream interpreter versions from Python 3.8 to 3.11, providing users with a wide range of applicability. #57824,#55887,#58155,#56107,#57490,#58829,#57240,#57588,#58117,#59823,#56077,#58956,#57653,#59855,#59017,#58424,#58187,#57793,#59698,#59747,#59710,#59297,#58423,#56262,#58103,#58538,#58771,#59191,#57754,#59439,#59816,#59035
  • Added dynamic-to-static syntax transcription parsing for PyLayer functions, making PyLayer's conversion between dynamic and static graphs smoother. Users can now seamlessly carry out dynamic-to-static training on models containing PyLayer, and easily export inference models. #56108,#56531,#57066,#57633

Bug Fix

  • Fixed abnormal GPU memory usage in some dynamic-to-static scenarios under is_test=True mode. #58350
  • Fixed the issue with exporting a function decorated by @to_static as a jit.save model in scenarios like foo(x, x, y). #55963
  • Fixed inconsistencies between the dynamic- and static-graph behaviors of some APIs, improving the success rate and user experience of dynamic-to-static conversion. #56092

Fixed vulnerability

  • Fixed a potential security vulnerability in use of eval() in dynamic to static syntax transcription module. #60100

Enhanced distributed dynamic graph capability

To meet the needs of large models, this version focuses on improving the distributed computing capability of PaddlePaddle's dynamic graph. Various improvements have been made in the communication library, graph analysis, distributed policies and task enable/disable, to provide comprehensive support for large model training. In terms of performance, we further improved training performance by reducing pipeline-parallel GPU memory occupation, adopting TensorFusion technology, implementing communication-computation overlap, and reducing non-essential data synchronization copies. Meanwhile, the flexibility of hybrid-parallel debugging is improved through environment variables that control the Optimizer. In addition, the stability of the system is further improved by fixing related bugs.

New features

  • Added the TraceHang function in the communication library, to quickly locate the faulty node when a cluster training job hangs. #59217
  • To improve training efficiency and reduce memory usage, the dynamic graph supports the stride mechanism. #55156,#54762,#55850,#59190,#57005,#57331,#58033,#58303,#57835,#57189
  • Enhanced paddleviz function to facilitate analysis of computational graphs. #56837,#57626
  • In distributed Sharding strategies (Stage1,2,3), added main_grad function to support higher precision gradient accumulation, and reduce precision loss caused by low precision accumulation. #57972,#57934,#57473,#57537,#59611,#57960
  • In Sharding Stage1 strategy, added a switch variable to control whether to perform fusion calculation on Optimizer. #58790
  • In Recompute function, added support for Tuple input parameters, enhancing calling ability of Recompute interface. #56793
  • Enhanced Launch function, allowing distributed training without specifying endpoints in dynamic graphs. #54636

Function optimization

  • Implemented a new communication library with dynamic-static unification. Communication operators are fully adapted to the PHI operator system, reducing development and maintenance costs, and better supporting dynamic graphs and the auto parallel architecture upgrade. #54417,#57768,#57897,#55537,#56604,#57519,#56088,#57153,#57161,#57252,#57251,#57208,#57305,#57424,#57548,#57560,#57564,#57233,#55726,#58073
  • TCPStore is changed to a single instance to support dynamic graphs and auto parallel more flexibly. #55956
  • Improved maintainability and flexibility of distributed policies such as MP/PP/SP, including addition of printing warning and error messages, structural cleanup of code files, and optimization of PP restrictions on inputs. #54448,#59762,#55462,#54788,#54664,#56456,#55540
  • In PP strategy, added support for P2P communication in computation flow, making communication mode more flexible. #54747
  • Sharding strategy supports reduce Operation on gradient. #58842,#57967,#55495

Performance optimization

  • Implemented timely release of the last layer in the PP strategy, to save GPU memory. #54505
  • In MP-strategy Tensor fusion, supported passing in parameter groups, enhancing the Tensor fusion function. Improved allreduce asynchronous communication performance, and enhanced training performance through overlapping computation with communication. #57690,#55662
  • In the Sharding strategy, overlapped backward computation with gradient communication to improve training performance. For Sharding Stage 1, added Tensor fusion and fused grad clip and optimizer to improve computational efficiency, and supported overlap between VPP and DP/Sharding Stage 1 to improve communication and computation parallelism. Optimized the performance of Sharding Stage 1 under FP16: only the gradients this sharding rank is responsible for are checked in the check-finite stage, reducing computation overhead; added environment variables to control whether the Optimizer runs, to save GPU memory and allow model training debugging with fewer resources. #55598,#55427,#56063,#55766,#59848
  • In the Hybrid Parallel strategy, moved Tensor fusion under PP/VPP ahead of execution, to solve the extra GPU memory overhead of runtime fusion. Improved model training performance by reducing non-essential synchronous memcpy. #54403,#57215

Bug Fix

Auto parallel

This release fully optimizes the Auto Parallel programming paradigm with dynamic-static unification, to simplify programming complexity for developers. Developers do not need to understand the complex concepts and APIs of the manual parallel programming paradigm, such as row parallelism and column parallelism; only a small number of tensor distribution annotations is required to build a hybrid parallel model. The framework handles the derivation of the distributed states of all tensors and operators, and adds appropriate communication operators. Meanwhile, it supports dynamic-to-static distributed training with just one extra line of code, enabling developers to efficiently and easily implement any hybrid parallel strategy. This can significantly reduce the development cost of hybrid parallel training code.
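
A minimal sketch of the annotation-based paradigm described above, intended to run under paddle.distributed.launch on two devices. The helpers shown (dist.ProcessMesh, dist.shard_tensor, dist.Shard, dist.Replicate) follow the semi-auto parallel API described in this section; the shapes and mesh layout are illustrative assumptions.

```python
import paddle
import paddle.distributed as dist

# A 1-D process mesh over two devices; "x" names the mesh dimension.
mesh = dist.ProcessMesh([0, 1], dim_names=["x"])

w = paddle.randn([512, 1024])
# One annotation marks w as sharded along dim 0 over the "x" mesh axis;
# the distribution of downstream tensors/operators is derived automatically.
w_dist = dist.shard_tensor(w, mesh, [dist.Shard(0)])

x = paddle.randn([8, 512])
x_dist = dist.shard_tensor(x, mesh, [dist.Replicate()])

# The framework inserts whatever communication the distributed matmul needs.
y = paddle.matmul(x_dist, w_dist)
```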

Improved auto parallel core functions

Enhanced semi-auto parallel capability of dynamic graph

Enhanced semi-auto parallel for static graphs

  • Added Sequence Parallelism; added FThenB, Interleaved 1F1B, Eager 1F1B, VPP and other scheduling modes for Pipeline Parallelism, supported hybrid parallelism combining these new modes with the original parallelism, and supported visualization of pipeline scheduling. Upgraded the gradient synchronization mechanism to support gradient synchronization when data is sharded on any broadcast dimension. #57605,#54727,#54409,#54787,#58313,#59179,#59416,#59719,#59822,#59057,#59522,#57061
  • Adapted the executor to PIR, and supported PIR optimization Passes. In distributed scenarios, supports fusions such as fuse_linear to improve performance. #58459,#58528,#55555,#59757,#59102,#57917
  • Upgraded underlying architecture: upgraded the executor to reuse the results of data-flow dependency analysis and static kernel selection; upgraded entire graph based sharding completion mechanism, to switch to new sharding derivation rules and support some long-tailed cases; optimized the support of control flow under distributed static graph to adapt to more scenarios; reduced the graph compilation time and refined error message format to improve user experience. #55389,#55650,#54938,#57447,#57751,#57742,#59524,#59526,#58669,#57616,#56511,#55727,#58906,#56016,#54897
  • Optimized GPU memory usage in static graph mode, and added a refined recomputing strategy; optimized the auto mixed precision pass, allowed users to manually specify the auto-cast region, and fixed some bugs; supported parallel computation of cross-entropy; supported fusion operators such as scaled_dot_product_attention and fuse_rope; performed scheduling optimization to support better overlap between communication and computation in tensor parallelism and pipeline parallelism. #58421,#58533,#59498,#59187,#59188,#58172,#58628,#56185,#56696,#59497,#58304,#58977

AutoTuner

This release implements AutoTuner, a profiling-based automatic search and tuning tool for parallel strategies, which automatically combines parallel and optimization strategies. Users can select effective combination configurations for experiments, and AutoTuner will search for the optimal configuration for large model training and inference given the model and the hardware specification. In addition, AutoTuner implements a variety of pruning methods, including GPU-memory-modeling-based pruning, so the search space and search time can be significantly reduced. #54460,#54668,#59794,#59727,#59782,#54834,#58127,#56968,#55466,#56939,#58183,#58314,#55499,#59748

Operator library

Incompatible upgrade

To improve the maintainability of the PaddlePaddle framework, some deprecated operators in the framework (e.g., diag_v1, isfinite_v1, pad2d_v1) have been removed. Models that use these operators and were saved with PaddlePaddle 1.x training can no longer be used for inference on the new version of PaddlePaddle. #57895,#57892,#57898,#57730,#57732,#57810,#57884,#57794,#57926,#57925,#57807,#57808

Operator library enhancements

Fixed bug

CUDA

New features

  • Added the debugging API paddle.amp.debugging.check_numerics, which computes and returns the number of abnormal values (NaN, Inf) and zero elements in a Tensor. #54301
  • Added the fused_rope fusion operator to accelerate training of LLaMA-class large models. #54351
  • Updated the cuDNN Frontend API version to v0.9.1 and added the fused_scale_bias_add_relu fusion operator to accelerate ResNet networks. Note that this feature is experimental and disabled by default. #58367, #54949, #58504
  • Based on Flash-Attention v2, added Tensor-like Mask function support. Inverse operator supports deterministic computation for debugging. #57276, #56363
  • Modified sparse conv3d backend implementation to support 2d shapes, avoiding front-end reshape overhead. #54707
  • Added matmul_int8 operator. (#55228)

Function optimization

  • Optimized CUDA Graph's support for random number operators. #58310
  • Enhanced the default functionality of automatic mixed-precision training (a usage sketch follows this list), including:
    • Optimized the experience of using the automatic mixed-precision training interface. #58152,#55364,#57903
    • Added matrix-computation operators such as fused_attention, fused_feedforward, and fused_gemm_epilogue to the framework's default whitelist, and unified the default black and white list settings for dynamic and static graphs. #55373, #55713
    • The argsort, dist, erfinv, nanmedian and poisson operators and the lamb optimizer operator support FP16 and BF16 low-precision computing. #51662, #55105, #55287, #55824, #56056, #56184, #55641
    • Fixed the low-precision implementation of the elementwise_max operator, changing it to use the FP32 type for numerical computation to reduce precision loss. #54799
    • Changed the temporary result Tensor needed by Reduce-class operators to the FP32 type, to avoid the precision loss caused by converting intermediate results to low precision. #55709
  • Optimized GPU codes for flip, roll & roll_grad, index_put & index_put_grad, etc. Removed unnecessary C++ templates to optimize compilation time and reduce compiled binary size without performance degradation. #57309, #57525
  • For the bernoulli operator, added a check on legitimacy of input probabilities. #59174
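
A minimal dynamic-graph AMP sketch related to the default black/white list behavior above; the operator names placed in the custom lists are purely illustrative.

```python
import paddle

model = paddle.nn.Linear(16, 16)
opt = paddle.optimizer.AdamW(parameters=model.parameters())
scaler = paddle.amp.GradScaler(init_loss_scaling=2.0 ** 16)

data = paddle.randn([4, 16])
# O1 mixed precision; per-model entries can still be added on top of the
# framework's default black/white lists if needed.
with paddle.amp.auto_cast(level="O1",
                          custom_white_list={"elementwise_add"},
                          custom_black_list={"reduce_sum"}):
    loss = model(data).mean()

scaled = scaler.scale(loss)   # scale the loss before backward
scaled.backward()
scaler.step(opt)              # unscale gradients and run the optimizer step
scaler.update()
opt.clear_grad()
```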

Performance optimization

  • Optimized BroadcastKernel's support for large Tensors: the large Tensor is split and the INT32 version of the implementation is called multiple times, improving operator performance by 7.27x. #57313, #57996
  • Optimized the performance of the Tensor save interface by copying the Tensor to CPU before converting it to numpy, avoiding the overhead of automatically converting a non-contiguous Tensor to a contiguous one. #57040

Bug Fix

  • Fixed a bug of the memory_efficient_attention operator in supporting sm_90. #58070
  • Fixed the NaN problem of softmax operator when axis=-1 and length is greater than 100000. #57851
  • Fixed bug of GPU access error in some cases for set_constant operator. #59905
  • Fixed GPU storage read/write contention issue in fast implementation version of layer_norm operator. #56435

Expanded Compiler Infrastructure for Neural Networks (CINN)

In this update, PaddlePaddle CINN focuses on optimizing its architecture and comprehensively expanding its capabilities. In view of the increasing demand for dynamic shapes from large models, effective operation and optimization strategies of the compiler under dynamic shapes are initially explored and implemented. At the architectural level, a Python DSL is introduced, significantly improving CINN's development convenience and debugging capability and enabling developers to write and debug code more efficiently. Meanwhile, the Schedule logic has been refactored to be dominated by GroupSchedule, enabling more general and stable optimization strategies at the operator Group level. To enhance the stability of CINN, a strong-constraint component is explored and introduced, which can effectively reduce uncertainties and potential errors in the system. In addition, CINN's historical tool classes and software structure are systematically organized, optimized and improved, to further enhance the readability and maintainability of the code. In terms of integration with other PaddlePaddle components, the tight integration of CINN with PIR and Paddle has been further strengthened, making the compiler more coherent with the overall PaddlePaddle framework. This improvement not only enhances the performance of the compiler, but also provides developers with a smoother and more unified development experience.

Compatibility upgrade

  • Updated storage read interface to be compatible with Paddle 2.0. #55836
  • Updated relu6 Op Mapper compatibility. #55611

Modification deprecation

  • Removed old Schedule form. #55566,#55391
  • Removed some obsolete tests. #56245,#57987
  • Removed the remove_nested_block Visitor tool that no longer works. #56972
  • Removed other useless codes. #55413

New features

Function optimization

Performance optimization

  • Fusion of vit attention. #54139
  • Optimized block reduce. #58196

Fixed bug

Documentation

4. Deployment Direction (Paddle Inference)

General inference optimization

This version improves the performance and ease of use of the inference engine on GPU and CPU, reducing user cost and the application cost of online inference. On GPU: a high-performance multi-threaded asynchronous executor is supported, and the inference performance of each model improves by 5%-10%; the new version of TensorRT and BF16 inference capabilities are also supported, and TensorRT inference performance and ease of use are further improved. On CPU: the latest version of OneDNN high-performance inference is supported, and SwinTransformer, FastRCNN and other series of models have greatly improved performance.
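
A minimal GPU inference configuration sketch using the standard Paddle Inference Python API; the model/parameter file names are placeholders, and the TensorRT settings are illustrative.

```python
import paddle.inference as paddle_infer

# Placeholder paths: point these at a real exported inference model.
config = paddle_infer.Config("model.pdmodel", "model.pdiparams")
config.enable_use_gpu(256, 0)   # 256 MB initial GPU memory pool on device 0

# Optionally hand TensorRT-supported subgraphs to TensorRT (FP16 here).
config.enable_tensorrt_engine(
    workspace_size=1 << 30,
    max_batch_size=1,
    min_subgraph_size=3,
    precision_mode=paddle_infer.PrecisionType.Half,
    use_static=False,
    use_calib_mode=False,
)

predictor = paddle_infer.create_predictor(config)
```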

Large model inference optimized

This version implements fine-grained fusion inference optimization for generative large models. The optimization solution ensures both high-performance inference capability and excellent extensibility. Users can flexibly combine various fine-grained fusion operators and PaddlePaddle native operators to build the network structure of generative large models as required, thus achieving efficient and low-cost inference. In addition, the solution supports mainstream generative large model structures, significantly reducing the deployment cost of inference for such models and strongly supporting efficient, low-cost deployment of generative large models.

  • Supports the FMHA/MMHA for CacheKV division block scheduling. #59462
  • RoPE encoding fusion operator supports input sin/cos values. #55415
  • Added fine-grained fusion operators, supporting high-performance inference optimization of generative large models. Added operators such as quant_linear, weight_quantize, and linear_compress to support quantized inference of large models. #57852,#55128,#59090,#56706,#59951,#55490,#59291,#59441,#59778,#59651,#55301,#58637,#56673,#56401
  • Supports variable length inference series API. #57948
  • Supports the GQA inference. #58472,#58836
  • Added masked multihead attention. Supports high performance MMHA inference. #55344,#56411,#58134,#57936
  • weight_quantize/weight_only_linear supports the Volta architecture. #58082
  • Added weight_only_linear_grad to support gradient back-propagation for weight-only quantization of large models. #57685
  • Fixed large model dynamic-to-static bugs. Optimized the inter-card communication initialization logic under static graphs. #56390,#57169,#56688,#56592,#58868
  • Optimized top_p_sampling random number generation logic. #59494

Paddle-TensorRT Inference Optimization

Modification deprecation

  • Removed fc_elementwise_add fusion from OneDNN. #55504
  • Removed redundant ops. #54442

Bug Fix

5. Hardware Support

Hardware Integration Solution (Custom Device)

This update adds support for advanced distributed strategies, custom operators, and custom fusion strategies. By upgrading the distributed communication library, it supports MP, GroupShared, PP, SP, MOE and other advanced distributed strategies. Meanwhile, it enables vendors to flexibly access Transformer operator libraries of different granularities, and to modify the computation graph through Fusion Passes for performance acceleration.

New features

  • Upgraded CustomDevice to support Paddle's latest distributed communication library CommContext. Added a variety of advanced distributed strategies such as GroupShared and MOE. #56301,#54671,#57957,#56669,#54384,#54572,#54573,#54676
  • Upgraded CustomDevice to support CustomOP. Users can register operators that are not defined in the Paddle PHI operator library, and CustomDevice supports CustomOP via the C API. #57038,#55532,#56755,#55533,#55659
  • Added CustomDevice's support for CustomPass function. Modified the computation graph IR through Python API. #55511,#55728
  • Added CustomDevice’s support for Paddle run_check. #56318
  • Added CustomDevice’s support for StreamSafeAllocator. #55393,#56380,#56536,#58035
  • Added CustomDevice’s support for DataTransform. #56627

Function optimization

  • Added CustomDevice’s support for more PaddlePaddle APIs such as Variable.set_value, adamw, share_external_data, mp_allreduce_sum, tensor.numpy, get_paddle_place, and GeneratorState. #55272, #56386, #57253, #56927,#56189,#55225,#55247
  • Modified CustomDevice dynamic library loading method from RTLD_NOW to RTLD_LAZY, to facilitate subsequent checking of compatibility of CustomDevice related software stack version. #57544
  • Added CustomDevice's detection function for FP16 operator under mixed precision training. #56053,#56176

Bug Fix

Kunlunxin XPU

New features

  • Added XPTI (XPU Profiling Tool Interface) to support collection and analysis function of runtime performance data. #54685,#54690,#54800
  • Supports Paddle's latest distributed communication library CommContext. #59418
  • Added XPU fusion operators, for example, fast_where. #55628
  • Added support for the XPU Plugin function, facilitating users in developing XPU customized operators through XTDK programming. #55101,#59326
  • Added XPU’s support for AutoGrowthAllocator. #54121
  • Added operator support list of Kunlun3. #57683

Function optimization

  • Upgraded XPU Inference API. #54342
  • Optimized performance of some XPU operators. Added support for bf16 in some XPU operators, including unique/index_put, squeeze/unsqueeze kernels, swish/swish_grad, scatter_nd_add_grad/slice, rsqrt/bitwise_or/arange_tensor, where, collective. #56582,#58161,#58440,#58580,#58950,#58616,#59273
  • Optimized XPU memory management to avoid memory leakage. #59334,#54847
  • Supports INT8 inference. #57258
  • Added support for FP16 series inference operators. #55642,#54410
  • Supports share_external_memory interface to pass input and output. #55170
  • Supports open source quantization model XPU inference. #58568
  • Added context_gm_size configuration, instead of allocating global memory in Pass. #54674
  • Added embedding and fast_gather_nd plugin. #56488,#56103
  • Supports fusion of fast_layternorm + leaky_relu. #57113
  • Supports elementwise_min/max/floordiv/where inference in KL1 and KL2 precision. #58422
  • Supports autotune configuration of fc and conv2d operator. #58801
  • Supports conv and fc dynamic quantization. #59307
  • fc + act fusion support for sigmoid, swish and relu6. #54486
  • elementwise_sub/elementwise_div supports int data type. #55920

Bug Fix

Hygon DCU

Bug Fix

6. Environment Adaptation

Adopted modular compilation to optimize the CMake code logic, improving the efficiency of compiling PaddlePaddle and increasing the efficiency of developers' local builds. Meanwhile, supports compilation with Python 3.12, CUDA 12 and the Hopper architecture, and uses the Clang tool to comprehensively optimize code formatting. In addition, C++ unit tests are changed from linking static libraries to linking dynamic libraries to reduce compilation size. These improvements provide users with a smoother and more efficient installation and development experience.

Thanks to Our Contributors

Azure-Tang, zhaoyinglia, From00, JZ-LIANG, xysheng-baidu, SylarTiaNII, kuizhiqing, zhiqiu, FeixLiu, liuzhenhai93, GhostScreaming, pangengzheng, xiaoyewww, wanghuancoder, ForFishes, hitywt, danleifeng, tianshuo78520a, ykkk2333, houj04, lj970926, XiaociZhang, HarperCy, cqulilujia, runzhech, RuohengMa, Caozhou1995, kangguangli, heavyrain-lzy, zyfncg, SigureMo, YuanRisheng, lchdl, LiYuRio, AndSonder, Wennie396, zhangbo9674, liudongxue01, risemeup1, phlrain, winter-wang, yuanlehome, NALLEIN, Liujie0926, yuguo-Jack, gitliuyf, zh794390558, Aurelius84, 6clc, GGBond8488, xiaoguoguo626807, Wong4j, iosmers, xiaoxiaohehe001, LielinJiang, carryyu, Difers, yangxiaoyu14, xuxinyi389, cxxly, gongshaotian, jjyaoao, lijialin03, lxd-cumt, cyber-pioneer, HydrogenSulfate, MayYouBeProsperous, Charles-hit, Patrick-Star125, ScottWong98, huangjiyi, DrRyanHuang, jinyouzhi, BeingGod, Wanglongzhi2001, yangguohao, zyt1024, longranger2, 2742195759, megemini, thisjiang, kevincheng2, zhoutianzi666, Wangzheee, ming1753, tianhaodongbd, freeliuzc, zhenyun-li, MARD1NO, RichardWooSJTU, eee4017, leo0519, csy0225, wwbitejotunn, bukejiyu, jiweibo, iamsonderr, ckl117, ronny1996, zhanglirong1999, LLee233, ZHUI, wangxn12138, zhwesky2010, Courtesy-Xs, zoooo0820, llyyxx0413, Asthestarsfalll, zxcd, pkuzyc, idontkonwher, sneaxiy, hong19860320, ZibinGuo, leolishaohao, MuShangCC, zhupengyang, shentanyue, Travis-Lee, wz1qqx, frank-oops, newway, QingshuChen, zhangyk0314, HandSomeLEEw, Shixiaowei02, zhangyuqin1998, Xing-lil, zhhsplendid, jiahy0825, xinyu-intel, MarioLulab, 0x45f, Tom-Zheng, xingmingyyj, zhangbopd, gouzil, zeroRains, BiynXu, WintersMontagne10335, wuhuachaocoding, GreatV, chenwhql, deepllz, parap1uie-s, ozogxyz, FisherWY, changeyoung98, zhiboniu, YangQun1 dynamicheart, Xreki, liugddx, Lylinnnnn, YSF-A, zzjjay, YanhuiDua, lishicheng1996, USTCKAY, abenmao, cocoshe, HermitSun, ccsuzzh, sanbuphy, enkilee, RedContritio, Liyulingyue, zrr1999, chen2016013, Galaxy1458, chalsliu, mrcangye, XieYunshen, zhiheng-liu, haohongxiang, ZzSean, JamesLim-sy, yuehuayingxueluo, niuliling123, umiswing, sijunhe, littsk, SecretXV, zhurou603, zhangjun, caizejun, yangjianfengo1, vivienfanghuagood, Xinyu302, lizexu123, yghstill, Li-fAngyU, VigiZhang, co63oc, dhanush-2501, ooooo-create, PommesPeter, zeus2x7, akshatvishu, jzhang533, Sekiro-x, gumblex, BernieHuang2008, YibinLiu666, qiuwenbogdut, XavierZXY, MqLeet, zhangting2020, mingxu1067, Ainavo, SSKlearns, yuchen202, silverling, zade23, wenxiaohahaha, NKNaN, Tsaiyue, fsczz, Tomoko-hjf, rhmaaa, zbt78, Hhankyangg, wangzhen38, zhengqiwen1997, engineer1109, onepick, qili93, Rane2021, nemonameless, DesmonDay, RachelXu7, ceci3, lyuwenyu, liuruyan, LokeZhou, shiyutang, lanxianghit, feifei-111, Sahala08, sunzhongkai588, Kaedeharai, Candy2Tang, liyongchao911, whisky-12, InsaneOnion, yoyoIcy, KongAKun, linzeyang, MuhammadNizamani, eltociear, Ligoml, LUZY0726, Windfarer, FlyingQianMM, jeng1220, junelotus, zlsh80826, Vvsmile, Frida-a, TonibMw, guoshengCS, zhink, ZhangYulongg, AlbertVan, fengxin-hello, mjp9527, entired, DanGuge.