PaddlePaddle 3.2.0 Release Note EN
Important Update
The PaddlePaddle framework version 3.2 has further enhanced its performance in large model training and inference, hardware adaptation, and support for mainstream large models and high-performance acceleration libraries.
- In terms of large model training, the PaddlePaddle framework has undergone upgrades in three aspects: computation, parallel strategy, and fault tolerance:
- At the level of basic computational performance, FlashMask V3, a sparse-mask attention computation that overlaps memory access with computation, is introduced to maximize the efficiency of attention. In addition, an efficient FP8 mixed-precision training technique delivers accuracy on par with lossless training.
- At the level of distributed parallel strategy, a dynamically adaptive VRAM offloading strategy is proposed to achieve optimal balance between memory and computation. Combined with an innovatively designed VRAM-friendly pipeline parallel scheduling, it further reduces VRAM overhead.
- Enhanced the framework's native fault-tolerance capability by implementing a fault-tolerance system for large-scale cluster training, which can monitor silent data corruption and other hard-to-detect faults online without affecting training efficiency, and by implementing a highly available checkpoint disaster-recovery method that reduces the cost of recovering from interruptions.
- In terms of hardware adaptation, we have comprehensively upgraded the plug-in adaptation solution for CUDA-like chips. In terms of device resource management and scheduling, as well as high-performance collective communication libraries, management interface upgrades and communication capability enhancements have been made for CUDA-like chips, with a particular emphasis on enhancing distributed communication capabilities, aligning XCCL with the various structures and functions of NCCL.
- Added a registration mechanism for CUDA-like operators. Taking the Muxi adaptation as an example, operator kernel registration can be completed with a single line of code by reusing GPU operator kernels; statistics show that the kernel reuse rate can reach 92%, significantly reducing hardware adaptation costs.
- In terms of user experience, the focus has been on compatibility, covering development interfaces compatible with industry practices, compatibility with the SafeTensors model format, and compatibility with third-party high-performance acceleration libraries.
- The newly added and modified development interfaces are compatible with industry practices, introducing a series of new APIs and API aliases, as well as parameter aliases and both API-specific and common parameters.
- Fully compatible with the Safetensors model format. The newly added FlexCheckpoint mechanism supports automatic parameter re-sharding across distributed strategies and model structures, significantly reducing the cost of weight conversion and thereby enhancing the end-to-end training and inference development efficiency of large models.
- The system has systematically enhanced its interface compatibility and operator registration capabilities, enabling one-click import of high-performance acceleration libraries. These libraries can be directly reused in PaddlePaddle's model training and inference acceleration processes without requiring code modifications.
1. User experience
New features
- New APIs: `paddle.msort`, `paddle.ravel`, `paddle.nn.functional.dropout1d`, `paddle.Tensor.type_as`, `paddle.Tensor.requires_grad`, `paddle.view_as_complex`, `paddle.view_as_real`, `paddle.nn.Parameter`, `paddle.broadcast_shapes`, `paddle.range`, `paddle.as_tensor`, `paddle.scatter_reduce/scatter_reduce_`, `paddle.scatter_add`, `paddle.tensor`, `paddle.softmax`, `paddle.Tensor.softmax`, `paddle.rand_like`, `paddle.is_autocast_enabled`, `paddle.get_autocast_gpu_dtype`, `paddle.Tensor.repeat`, and `paddle.permute`. #74421, #74439, #74444, #74454, #74459, #74491, #74466, #74438, #74594, #74542, #74694, #74564, #74540, #74586, #74651, #74807, #74632, #74834, #74952, #74772, #74441, #74561, #74525
- Added a series of APIs under `paddle.compat.*` to support common usage in the industry and facilitate code migration, including `paddle.compat.median`, `paddle.compat.nanmedian`, `paddle.compat.softmax`, `paddle.compat.sort`, `paddle.compat.split`, `paddle.compat.min/max`, and `paddle.compat.Unfold`. #74865, #74874
- Added a series of initialization APIs to support commonly used parameter initialization methods in the industry, including `paddle.nn.init.kaiming_uniform_`, `paddle.nn.init.xavier_uniform_`, `paddle.nn.init.uniform_`, `paddle.nn.init.kaiming_normal_`, `paddle.nn.init.xavier_normal_`, `paddle.nn.init.normal_`, `paddle.nn.init.calculate_gain`, `paddle.nn.init.constant_`, `paddle.nn.init.dirac_`, `paddle.nn.init.eye_`, `paddle.nn.init.ones_`, `paddle.nn.init.orthogonal_`, `paddle.nn.init.trunc_normal_`, and `paddle.nn.init.zeros_`. #74478
- Added parameter aliases to APIs, allowing more flexible input options such as `x` or `input`. This includes `paddle.maximum`, `paddle.minimum`, `paddle.sqrt`, `paddle.topk`, `paddle.polar`, `paddle.stack`, `paddle.cos`, `paddle.floor`, `paddle.log`, `paddle.pow`, `paddle.rsqrt`, `paddle.sign`, `paddle.sin`, `paddle.multiply`, and `paddle.where`. #74683, #74795, #74887, #74592
- `paddle.Tensor` now supports multiple initialization methods, enabling flexible Tensor creation. #74619, #75022, #75065
- Added API-specific parameters to enhance existing functionality, including `paddle.nn.functional.gelu`, `paddle.divide/div/div_`, `paddle.add`, `paddle.Tensor.copy_`, `paddle.norm`, `paddle.linalg.norm`, `paddle.nn.functional.silu`, and `paddle.repeat_interleave`. #74485, #74562, #74420, #74768, #74855, #74903, #74788, #74631, #74947
- Added common parameters `out`, `device`, `dtype`, `requires_grad`, `pin_memory`, and `bias` to enhance existing functionality. This includes `paddle.zeros`, `paddle.zeros_like`, `paddle.ones`, `paddle.ones_like`, `paddle.arange`, `paddle.eye`, `paddle.empty`, `paddle.empty_like`, `paddle.full`, `paddle.full_like`, `paddle.randn`, `paddle.Tensor.new_full`, `paddle.Tensor.new_empty`, `paddle.Tensor.new_ones`, `paddle.Tensor.new_zeros`, `paddle.tril/triu`, `paddle.bmm`, `paddle.nn.Conv1D/Conv2D/Conv3D/Embedding`, `paddle.diff`, `paddle.cumsum`, `paddle.var`, `paddle.multinomial`, and `paddle.mean`. #74477, #74526, #74711, #74582, #74624, #74849, #74612, #74875, #74641, #74949, #74918, #74914, #74934, #74920, #74955, #74226, #74946
- Added aliases to APIs to support more calling conventions, including `paddle.Tensor.mul_/mul`, `paddle.autograd.Function`, `paddle.argwhere`, `paddle.cat`, `paddle.clamp`, `paddle.ger`, `paddle.take_along_dim`, `paddle.linalg.matmul`, `paddle.special.logsumexp`, `paddle.concatenate`, `paddle.eq/gt`, `paddle.Tensor.take_along_dim`, `paddle.nn.Conv1d/Conv2d/Conv3d`, etc. #74493, #74569, #74870 (a usage sketch of several of these additions follows this list.)
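The following minimal sketch illustrates how several of the compatibility-oriented additions above can be used together. Exact signatures may differ; the keyword names and conventions shown here (e.g. the `input` alias, `requires_grad` on creation ops, and the chunk-size convention of `paddle.compat.split`) are assumptions based on the common industry usage these APIs are described as following.

```python
import paddle
import paddle.nn as nn

# In-place parameter initialization via the new paddle.nn.init.* APIs.
linear = nn.Linear(128, 256)
paddle.nn.init.kaiming_uniform_(linear.weight)
paddle.nn.init.zeros_(linear.bias)

# Parameter alias: `input` is accepted alongside the original `x` argument.
a = paddle.to_tensor([1.0, 4.0, 9.0])
s = paddle.sqrt(input=a)  # equivalent to paddle.sqrt(x=a)

# New common parameters on creation ops (dtype / requires_grad per the list above).
z = paddle.zeros([2, 3], dtype="float32", requires_grad=True)

# paddle.compat.split is assumed to follow the chunk-size convention common in the
# industry, whereas paddle.split splits into a given number of sections.
x = paddle.arange(10)
chunks = paddle.compat.split(x, 4)  # pieces of size 4, 4, 2 under that assumption
```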
Bug fixes
- Fixed a precision issue in `paddle.nanmedian`. #74263
- Fixed an issue with `paddle.distributed.fleet.utils.hybrid_parallel_util.fused_allreduce_gradients` in 0-D scenarios. #74957
- Fixed an issue with `paddle.matmul` in distributed mode. #74989
Enhanced functionality
- For APIs that return multiple Tensor objects, the experience has been improved by wrapping the results in a Paddle data structure, including `paddle.topk` (a brief sketch follows this list). #74931
- Class APIs now support variable-length argument usage. #74494
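As a brief illustration of the structured return mentioned above (the field names are an assumption; the note only states that results are wrapped in a Paddle data structure):

```python
import paddle

x = paddle.to_tensor([1.0, 5.0, 3.0, 4.0])
result = paddle.topk(x, k=2)

# Tuple-style unpacking keeps working as before.
values, indices = result

# Field-style access is assumed to be available on the wrapped result
# (the names `values` / `indices` are hypothetical here).
print(result.values, result.indices)
```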
Documents
Other
- Optimization related to code style. #74654,#74655,#74665,#74660,#74667,#74664,#74662,#74661,#74658,#74657,#74666,#74659,#74663,#74656,#74673,#74672,#74671,#74674,#74675,#74670,#74669,#74677,#74709,#74714,#74712,#74713,#74704,#74746,#74748,#74743,#74742,#74744,#74745,#74747,#74794,#74789,#74793,#74786,#74791,#74787,#74827,#74608,#74288,#74287,#74385,#74395,#74475,#74647
- Optimization related to MKLDNN/ONEDNN. #74299,#74244,#74230,#74314,#74327,#74325,#74326,#74315,#74399,#74398,#74393,#74392,#74367,#74391,#74423,#74424,#74436,#74417,#74410,#74473,#74458,#74501,#74487,#74502,#74513,#74518,#74516,#74507,#74504,#74505,#74509,#74535,#74536,#74517,#74503,#74557,#74550,#74575,#74587,#74576,#74588,#74549,#74581,#74583,#74628,#74630,#74635,#74679,#74648,#74127,#74636,#74552,#74551,#74678,#74680,#74730,#74751,#74895,#74821,#74897,#74734
- Optimizations related to code implementation, variable and file renaming. #74309, #74597, #74613, #74376, #74479, #74960, #74968, #74977
- Optimizations related to unit tests, and bug fixes for unit test issues. #74595
- Compilation-related optimizations and CI issue fixes. #74356, #74936
- Optimized debugging and printing information, and improved error messages. #74765, #74381, #74384, #74386, #74387, #74383, #74519, #74520, #74468
- Optimizations related to custom operators. #74402
- Distributed FlexCheckpoint support. #74966, #74593, #74785, #74814
2. Basic execution architecture
New features
- Support for dynamic graphs. #74484
- Support for safetensors. #74642, #74609, #75049
- Added offloader to optimize computation efficiency. #74837
- Added API support for forward computation of conv_transpose. #74431
- Inference deployment adds W4AFP8 quantized inference, supporting pure permutation of W4AFP8 quantized weights and all2all communication. #74270
Bug fixes
- Core framework and infrastructure optimization. #74336, #74554, #74634
- Calculation accuracy and type handling. #74278, #74222, #74830
- Optimization of dynamic dimension check logic. #74633, #74650
- Memory and illegal access fixes. #74347, #73443, #74953
- Fixed printing of error/warning messages. #74474, #74533, #74685, #74721, #74754
- Code quality and documentation correction. #74378, #74828
- Fixed the processing logic of the flashmask API. #74928
- Fixed the issue where splitting CudaGraph subgraphs did not take effect in dynamic-to-static mode. (#74749)
Enhanced functionality
- C++ extension development. #74338
- Optimization of FlexCP function. #74752, #74981
- Optimize memory allocation. #74463
Deprecated
- Clean up old IR-related unit tests for dynamic, static, and transition scenarios. #74698, #74715, #74718, #74782, #74962
Other
- Update patch version. #74940
3. Distributed & automatic parallelism
Parallel strategy
In version 3.2, we made multiple enhancements to pipeline parallelism, including support for passing dictionary parameters and extending Pipeline Layer and SharedLayerDesc to work without pipeline parallelism. We also fixed several critical issues: IPC API exceptions for large tensors, evaluation batches and non-compute losses in pipeline parallelism, gradient release errors in MoE models, hangs caused by rebuilding NCCL communicators in PP scenarios, and event management errors in dual pipeline parallelism. In addition, we improved the computation-overlap efficiency of dual pipeline parallelism to boost training performance, and upgraded the clear_param_storage method to support clearing and resetting multiple color collections under sharding.
New Features
- Implement support for dictionary parameter passing in Pipeline Parallel. #74574, #74867
- Pipeline Layer and SharedLayerDesc support non-pipeline parallelism (nonpp parallel). #74573
Bug fixes
- Fixed the IPC API issue with large-sized tensors. #74472
- Fixed issues related to evaluation batch and non-compute_loss in pipeline parallelism. #74170
- Fixed the gradient release issue on MoE model. #74972
- Fixed the hang issue when rebuilding NCCL comm in the pp scenario. #73625
- Fixed the event management error in dual pipeline parallelism (dual pp). #74158
Optimization and improvement
- Optimize the efficiency of computation overlap in parallel dual pipelines to enhance training performance. #74527
- Upgrade the clear_param_storage method to support the clearing and resetting of multiple color collections under sharding. #74741
Automatic parallelism
Functional improvements
- Support the default splitting derivation rule when the same dimension of a distributed tensor is split across multiple mesh dimensions (see the sketch after this list). #74396
- Improved the slicing derivation rule of the `reshape` operator to support scenarios where the same dimension of a distributed tensor is sliced by multiple mesh dimensions. #74352, #74579, #74565
- Support changing the mesh of a tensor without altering the distributed tensor data. #74248
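A minimal sketch of the scenario described above, in which a single tensor dimension is split across multiple mesh dimensions. It assumes the standard auto-parallel placement API (`dist.ProcessMesh`, `dist.shard_tensor`, `dist.Shard`) and is intended to be launched with `paddle.distributed.launch` on four devices.

```python
import paddle
import paddle.distributed as dist

# A 2x2 process mesh; run with:
#   python -m paddle.distributed.launch --devices 0,1,2,3 demo.py
mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=["x", "y"])

data = paddle.randn([8, 16])

# Shard dimension 0 along both mesh dimensions; the new derivation rule
# covers propagation when the same tensor dimension is split this way.
dist_tensor = dist.shard_tensor(data, mesh, [dist.Shard(0), dist.Shard(0)])

print(dist_tensor)
```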
Bug fixes
- Fixed the bug of repeatedly creating communication groups when calling the `get_group` method of `ProcessMesh`. #73099
- Fixed the bug in the `get_local_slices` method in the MoE scenario. #74705
- Fixed the bug of gradient clipping in the MoE scenario. #74916
- Fixed the bug where the `stop_gradient` parameter could not be passed between different stages in the pipeline parallel scenario. #73459
- Fixed an accuracy bug of gradient clipping in pipeline parallel scenarios. #74409
- Fixed the bug of generating redundant outputs in the dynamic graph pipeline parallel scenario. #74913
- Fixed the bug that the `moe_combine` and `moe_gate_dispatch` operators did not work in the MoE scenario. #74645
Other
- Support accuracy alignment for manual and automatic parallelism of data loaders. #73941
- Optimize the dynamic graph pipeline parallel scheduling logic. #74720
Communication Library
In version 3.2, we fixed an error in DeepEP's support for sm90 compilation, added pre-allocation for the GPU memory requested by DeepEP, and upgraded its intranode and internode computation kernels, further improving performance and stability.
Bug fixes
- Fixed a bug in DeepEP support for sm90 compilation. #74762
Functional improvements
- Added pre-allocation function for the GPU memory allocation requested by DeepEP. #74465
- Upgraded the intranode and internode computation kernels of DeepEP. #74284
4. Operator mechanism
New features
- API compatibility support. #74506, #74676, #74558, #74572, #74691, #74703, #74750, #74757, #74802, #74546, #74547, #74802, #74859, #74910, #74873, #74882, #74901, #74899, #74449
- Added fused_partial_rope operator. #74577
Bug fixes
- 0-size Tensor related fixes. #74295, #74305, #74323, #74354
- Major Tensor-related fixes. #74242, #74293, #74289, #74279, #74330, #74329, #74342, #74369, #74370, #74404, #74537, #74451, #74172, #74324, #74964, #74360, #74379, #74377, #74380, #74362, #74197
- API compatibility-related fixes. #74764, #74869, #74935
- [Open Source Task] Investigate and resolve precision issues in Paddle CPU/GPU Kernels. #74149, #74598, #74719, #74625, #74555
- Other important fixes. #74282, #74313, #74303, #74306, #74298, #74044, #74290, #74348, #74364, #74332, #74224, #74382, #74406, #74434, #74448, #74457, #74322, #74530, #74716, #74839, #74842, #74854, #74919, #74767, #75003
Enhanced functionality
- Improved API compatibility. #74456, #74480, #74523, #74490, #74548, #74596, #74568, #74559, #74629, #74623, #74700, #74643, #74602, #74783, #74781, #74735, #74725, #74815, #74856, #74925, #74545, #74932, #74784
- Slice/stride related optimizations. #74731, #74740, #74769, #74810, #74841, #74954, #74888, #74944, #74312, #74291, #74271, #74320, #74344, #74727, #74637
- Operator optimization and CUDA support. #74693, #74922, #74967
- Improved debugging information and compatibility enhancements. #74372, #74622
- Operator function expansion and optimization. #74790, #74979
Performance optimization
- FP8 computation optimization. #74471, #74684, #74911
- Basic operator performance optimization. #74442, #74638
- Support FA3 variable-length-sequence backward computation and optimize the forward API. #73831
- Added FlashMask V2 function. #74729
Documents
- Fixed issues with English documentation and copyright year. #74737
Other
- The WITH_XPU_FFT option is enabled by default on XPU hardware. #74699
5. Hardware adaptation
Improved CUDA-like hardware integration solution
- The CUDA-like hardware integration solution supports reuse of cuBLAS kernels. #74591
- Fixed known issues in the CUDA-like hardware integration solution. #74397, #74411, #74428, #74877, #74939
The main repository supports unit tests for multiple hardware backends
New Custom Device API Support
6. Installation environment
Bug fixes
- Fixed a bug in the flashattention compilation cache. #74388
- Fixed the bug where site.USER_SITE was None. #74373
- Fixed the compilation bug of gtest in multi-architecture Linux systems. #74723
- Fixed multiple compilation errors in DEBUG mode when WITH_GPU=ON. #74401
- Fixed a compilation bug with CUDA 12.6 on Windows. #74990
- Fixed bugs in the api-benchmark baseline pipeline. #74770, #74778, #74779, #74780, #74800, #74803
Other
- Disable the test_custom_contiguous unit test. #74337
- Support for timed triggering of baseline tasks in the slice pipeline. #74419
- Support manually specifying the pr for adding slice recording baselines. #74445
- Check if there are any issues in the code. #74460
- Support CI PaddleX tasks on XPU. #74426
- Support slice pipeline exemption mechanism. #74482
- Updated the Paddle base image. #73423
- Pinned the Ninja version to 1.11 for Windows. #74590
- Support adding the ability to close PRs and cancel CIs. #74604
- Support for quickly skipping all CI. #74696
- Add an api-benchmark baseline pipeline. #74690
- Updated the NCCL version. #74809
- Updated the RD list for the approve pipeline. #74838, #74902
- Updated safetensors in the mirror. #74904
- Added a compilation flag for flashattention. #74959
- Temporarily disable the win-inference pipeline. #74980
- Support for compiling phi dynamic libraries on Windows. #74950
7. List of contributors
AIbin, Ayakouji, baiyue, baoqiwen, Chang Lu, Chen Zhiyang, co63oc, cyberslack_lee, cyy536, datutu-L, Deng Haodong, Difer, Eddie-Wang, enzodechine, fangfangssj, feri, fxyfxy777, ggggxm, GoldPancake, gouzil, Gu Shiwei, Haze188 灏喆, hohdiy, hong, HU Shenwei, huangjiyi, HydrogenSulfate, kjagsdq, LCStayingdullCircuit, Leo Guo, lightbrother, liufengwei0103, liuruyan, LiYuRio, LLSGYN, Lucas, Luckycheng222, lzy, Nana, Nyakku Shigure, ooo oo, Qianyue He, risemeup1, Ruibiao Chen, Ryan, Shuhao Liang, sneaxiy, Starrysea996, SUN Dong, Tao Luo, Tian, tianhaodongbd, tianshuo78520a, umiswing, waliwali777, wanghuancoder, Wenhao.Dai, wyw, XiaoguangHu, xiaoguoguo626807, xingmingyyj, Yichen Zhang, Yohanna, yongqiangma, Yuan Xiaolan, YUNSHEN XIE, Yuntao Nie, Yuqiang Ge, Yutian Rao, Zero Rains, Zhan Rongrui, Zhang Ting, zhanghonggeng, Zhaowu Pan, zhengshengning, ZhenxingLi, Zhou Xin, zhupengyang, zhwesky2010, Zichao, zty-king, Zx, zyfncg, zzm, 周周周, 正在学习, 苍天荒