PaddlePaddle 2.5.0 Release Note EN - PaddlePaddle/Paddle GitHub Wiki
- New dynamic-static unification architecture: Implement a new execution model that combines dynamic-to-static conversion with the compiler on top of basic operators, and complete the whole process of dynamic-to-static conversion, combinator, and neural network compiler optimization and acceleration on the ResNet50 and BERT models. For dynamic-to-static, complete development of the core whole-graph fallback function, which falls back to dynamic graph training when dynamic-to-static conversion fails. For the combinator, design a basic operator system containing more than 150 basic operators, implement the Python-layer forward operator splitting mechanism and the static-graph backward operator splitting mechanism, and split more than 70 commonly used forward and backward operators. For the CINN compiler, fix correctness bugs, develop key passes, add manual schedule rules, achieve automatic generation of kernel code, and improve performance of the ResNet50 model by 12% and the BERT model by 10%.
- Operator architecture unification of PHI operator library: Unify all remaining 350+ operator kernels under the original operator system into the PHI operator library. Unify the way operators are defined in the original operator system into the operator definition form of the PHI operator library (operator definitions configured in YAML), which enhances unity of the architecture and reduces the comprehension cost of framework development. Decouple all the Fluid header files that the PHI operator library depends on and compile them independently as dynamic link libraries, to provide lighter-weight reuse of the operator library for secondary development of the framework. Continue to standardize and adjust non-standard operators and operator kernels in the PaddlePaddle framework, so that they are easier for developers to understand and the cost of hardware integration is reduced.
- Full go-live of the new executor for static graphs: The new static graph executor implements a number of functions and performance optimizations, and completes unification and replacement of the original multiple sets of old executors. The new executor becomes the default back-end execution engine for the Python-side entry points of static graph single-card and distributed training, as well as for dynamic-to-static, control flow, CINN, etc. This significantly improves the scheduling performance of the framework, makes the functional architecture clearer, and significantly enhances secondary development capability.
- Python API supporting 0-dimensional tensor: Define clear semantics that distinguish a tensor of shape [1,] from a tensor of shape [], and fix many API behaviors, such as `paddle.sum`, to support tensors of shape [].
- New environment adaptation: Adapt to CUDA 12. Compilation with gcc12 is supported.
- PaddlePaddle API supports 0-dimensional tensor. PaddlePaddle previously used a 1-dimensional tensor of shape [1] in place of a 0-dimensional tensor, which differs from current mainstream practice. This increased the development and debugging cost of models and sometimes led to unintended errors. This release fixes 376 APIs that need to support 0-dimensional tensors, and supports tools widely used by the community, such as EinOps. For example, the output loss in model training used to be a 1-dimensional tensor, so taking out or printing the loss often required code like `loss.numpy()[0]`. After this modification, the output loss in model training is a 0-dimensional tensor, and `loss.numpy()` is enough to take out or print the loss. The code is shorter, easier to understand, and in line with industry practice.
- `paddle.fluid` API is fully decommissioned. According to the plan previewed in the last version, 1116 `paddle.fluid` APIs and related internal interfaces have been decommissioned, and the few remaining related internal interfaces will be cleaned up in the next version. The fluid API belongs to the historical APIs that PaddlePaddle 2.0 had planned to remove, but the cleanup was delayed in consideration of compatibility and other factors. This decommissioning will not affect programs developed based on PaddlePaddle 2.0, and the PaddlePaddle API system becomes more concise and easier to understand.
- Complete code cleanup of the old version of the dynamic graph on the Python side. From now on, the Python side only uses the new version of the dynamic graph to call the C++ core logic.
- To unify the data parallel training method for static graph models, the original single-process multi-card training method is abandoned, including the `paddle.static.ParallelExecutor` and `paddle.static.CompiledProgram().with_data_parallel()` APIs, because this set of APIs supports only single-machine multi-card training, does not support multi-machine multi-card training, and has poor underlying execution performance. It is recommended to use the multi-process multi-card training method uniformly, i.e., the `paddle.distributed.launch` API for distributed data parallel training. This upgrade affects only static graphs, and does not affect dynamic graphs or dynamic-to-static training. If you use the decommissioned APIs, please refer to the documentation on data parallel to modify your model code. #50351,#50501,#51240,#51701,#51616,#51369,#52671
- Remove the original adaptation code for Ascend NPU and Cambricon MLU from the framework, upgrade all of it to CustomDevice plug-in adaptation, and migrate the adaptation code for Ascend NPU and Cambricon MLU to the PaddleCustomDevice repository.
- API input supports 0-dimensional tensor, involving `paddle.reshape`, `paddle.trace`, `paddle.linalg.norm` and other 286 APIs. #53208, #53592, #47074, #53186, #47677, #49357, #50237, #46555, #47219, #47501, #47858, #47961, #48058, #48007, #49755, #51024, #51566, #51899, #49813, #47812, #47849, #47251, #53125, #53828, #51265, #47689, #48452, #49072, #48638, #49175, #49279, #50857, #49805, #47734, #45992, #49616, #49959, #50536, #49544, #49842, #46909, #49361, #50169, #48314, #48735, #49122, #49122, #49177, #49501, #49562, #49340, #49550, #49596, #49730, #49667, #49692, #49854, #49845, #49803, #49889, #49904, #49518, #49884, #49880, #49862, #49921, #49260, #49929, #49570, #49882, #50213, #49780, #50271, #50289, #50293, #49735, #50433, #49847, #50635, #50950, #50947, #49460, #53087, #51687, #52185, #54649
- API output supports 0-dimensional tensor, involving `paddle.sum`, `paddle.min/max`, `paddle.any/all` and other 90 APIs. #52891, #52861, #52775, #52850, #52843, #52857, #51721, #53051, #53192, #52739, #52741, #53175, #51889, #53199, #53242, #53421
- In addition to the support of 0-dimensional tensor, fix the original non-standard code, and provide hints and compatibility for non-standard usage in model code. #51562, #51586, #51757, #52197, #54117.
- Add `paddle.autograd.jacobian` and `paddle.autograd.hessian` APIs for scientific computing. #53331
- Add sparse computing APIs, for example `paddle.sparse.reshape`, `paddle.sparse.sum` and `paddle.sparse.slice`. #46694, #51513, #53794, #51406
- Add other APIs, for example `paddle.optimizer.LBFGS`, `paddle.index_put` and `paddle.logaddexp`. #53314, #51912, #52886, #50843, #47282, #52284
- Add paddle.nn.utils.clip_grad_norm_ for gradient clipping support and paddle.Tensor.data_ptr for getting the address of the Tensor data's memory/GPU memory. PR49935, PR48235, PR49173
- Add the saved_tensors_hooks mechanism, for temporary storage and retrieval of forward Tensor used in backward computation. PR45763, PR46215, PR48124
- Tensor supports pickler, for serialization of Tensor. PR47025, PR48179
- Add debug logs that print the forward Python stack when nan/inf appears in backward computation. PR53217 PR52639 PR52729
- Add higher-order differentiation support for expand_v2, tile, concat, assign, and slice. PR45941, PR45942, PR45940, PR45879, PR45960
- Optimize log printing for dynamic graphs, including log content, VLog level, and error reporting content. PR45783, PR46349, PR46934, PR47724
- Add FLAGS_auto_growth_chunk_size_in_mb for minimum chunk size settings of auto_growth_allocator. PR52204
- Fix bugs in some operators, including batch_norm, slice, set_value, scale, multinomial, adam, conv, transpose2_grad, conv2d_transpose_double_grad. PR47802, PR47634, PR47349, PR46124, PR46147, PR50388, PR48626, PR48519, PR50386, PR48432, PR51851
- Fix some PyLayer bugs. PR51740, PR47154, PR47323, PR54041, PR48533
- Ensure sync_batch_norm executes in a consistent order in backward computation, to avoid hangs or precision errors caused by misordering. PR52268, PR52860, PR52779
- Fix a bug of linspace under AMP. PR46088
- Fix Python C API’s incorrect call that causes Windows to crash. PR46833
- Fix the bug that DataLoader may fail to delete /dev/shm. PR48511
- Fix some bugs of paddle.grad. PR47151
- Add error message for operators that do not support higher order differentiation. PR47231
- Add numpy array support for Python operators. PR48229
- Delete one of the duplicate element_size APIs. PR49631
- Fix the crash that occurs when enabling VLOG for the old dynamic graph. PR47115
- For XPU, change to d2h+h2d in case of d2d, to solve the multi-threading problem. PR48373
- Python operators sink to C++ implementation, to improve API performance. There is a 3x to 6x performance improvement in this class of APIs after sinking. PR45811, PR46326, PR46329, PR46520, PR46542, PR46565, PR47060, PR47077, PR47174, PR47315
- Optimize the Optimizer CPU scheduling performance to reduce GPU Gap caused by Optimizer phase. PR49787, PR50188, PR51340, PR49864, PR50158, PR50335
- Sink API logic that can be moved to C++ into C++ to improve API performance. PR46412, PR46190
- Optimize unnecessary call logic on Python side under dynamic graph, to improve API performance. PR46221, PR49473, PR49574, PR49589, PR49612, PR49717, PR49733, PR49823, PR49508, PR46840
- Optimize use of Allocator to improve dynamic graph API scheduling performance. PR47125, PR48548, PR50995, PR47731
- Optimize fused_attention operator performance. PR48902
- For optimizer's _add_accumulator, if device is CPU and under dynamic graphs, use full to initialize var directly. PR48189
- Prune unnecessarily executed subgraphs from backward graphs to improve performance. PR47827
- Optimize performance of initializers. PR46033
- Add fused dropout add operator to improve computation performance when dropout and add are used together. #52903
The new executor for static graphs implements a number of functions and performance optimizations, and completes unification and replacement of the original multiple sets of old executors. The new executor becomes the default back-end execution engine for the Python-side entry points of static graph single-card and distributed training, as well as for dynamic-to-static, control flow, CINN, etc. This significantly improves the scheduling performance of the framework, makes the functional architecture clearer, and significantly enhances secondary development capability. #45913,#46025,#48911,#50239,#45696,#46092,#48158,#51389,#49708,#49275,#48789,#49939,#51149,#52652
New function support for the custom extension mechanism: bind C++ extension operation functions to the Python side, to further enhance the framework's secondary development capabilities. The extension lets custom hardware use the custom operator mechanism, meeting hardware manufacturers' need to implement operations that Paddle does not already provide. Custom operators now support high-level mechanisms such as `inplace`, `vector<Tensor>` output, and `optional<Tensor>` input. Optimize scheduling performance of custom operators in dynamic graph mode, with a 25.4% performance improvement for operators with multiple input parameters. Add new commonly used operators and APIs for custom operator Tensor extensions. Support chained calls to simplify code writing. Optimize the operator kernel selection mechanism. Improve the logic of some operator kernels, enhance supported data types and optimize performance. Add and improve 100+ XPU kernels. Fix 170+ bugs.
#49222, #51773, #51923, #53080, #50731, #50563, #50840, #50983, #51713, #48733, #50558, #50764, #51973, #52216, #51027, #50745, #50756, #50886, #50813, #50869, #51085, #51646, #51620, #51844, #52421, #52872, #52597, #50582, #52114, #52915, #50928, #48272, #48702, #52191, #52191, #47374, #47375, #47378, #54126, #47638, #47661, #50606, #53528, #50599, #51727, #50825, #50773, #50979, #53336, #53555, #53716, #53753, #53981, #53977, #53980, #54043, #54066, #52866, #53043, #53325, #54323, #54367, #51353, #53749, #50013, #47570, #50997, #51241, #49537
Unify all remaining 350+ operator kernels under the original operator system into the PHI operator library. Unify the way operators are defined in the original operator system into the operator definition form of the PHI operator library (operator definitions configured in YAML), which enhances unity of the architecture and reduces the comprehension cost of framework development. Decouple all Fluid header files that the PHI operator library depends on and compile them independently as dynamic link libraries, to provide lighter-weight reuse of the operator library for secondary development of the framework. Continue to standardize and adjust non-standard operators and operator kernels in the PaddlePaddle framework, so that they are easier for developers to understand and the cost of hardware integration is reduced. #47856, #49328, #49138, #52014, #52044, #52116, #52486, #52101, #52882, #53003, #53034, #51914, #49116, #52626, #52878, #52879, #52880, #52875, #51600, #51601, #51590, #51887, #51891, #52036, #52130, #52134, #51951, #51886, #52274, #52263, #51913, #52145, #52347, #52370, #52437, #52424, #52231, #52522, #52529, #52802, #52799, #52855, #52711, #52940, #53309, #47817, #48001, #48063, #48049, #48168, #48415, #48696, #48970, #50183, #50407, #50498, #50419, #50282, #50870, #50911, #50865, #51288, #53735, #47248, #47787, #52202, #47579, #49444, #45772, #51264, #51634, #51631, #47385, #46342, #47510, #47532, #47702, #47860, #49470, #50358, #49121, #50190, #52374, #52372, #52375, #52371
- Add the combination rules for combinators such as dropout, silu, stack, relu, expand, unsqueeze, pow, squeeze, meshgrid, batch_norm, layer_norm, group_norm, instance_norm, full_like, split, split_with_num, gelu, mean, flatten, rsqrt, hardswish. #50497, #50838, #50861, #50819, #50810, #51527, #51070, #51539, #51061, #49894, #50422, #51874, #51341, #50295, #50298, #50672, #51432, #51003
- Add the vjp rule for combinators such as gather_nd, reduce_max, group_norm, relu, reduce_max, gather, topk, sqrt, elementwise_pow, softmax, batch_norm, prod, multiply, expand, div, relu, slice, cumsum, sigmoid, layer_norm, sin, cos, roll, instance_norm, abs, assign, tile, scatter_nd_add, erf, floor, log, silu, leaky_relu, pad #50966, #51653, #52663, #51742, #52203, #50794, #50305, #50786, #50679, #51045, #51230, #51474, #51283, #51238, #49831, #51838, #50771, #50565, #51768, #51750, #51748, #52532, #52935, #50963, #51430, #53141, #52469, #50436, #51059, #51296, #52533, #53374
- Add the second-order differentiation rule for combinators such as matmul, tanh, and elementwise #50452, #52192, #53014
- Add the bf16 datatype support for combinators such as exp, reduce_mean, softmax, divide, cast, layer_norm, prod, meshgrid, expand_as, dropout, concat, gather_nd, elementwise_max, elementwise_pow, reduce_max #54263, #54236, #53865, #54175, #54399
- Add support for assigning semantics to containers in control flow in dynamic-to-static. #51248
- For to_static, add full graph fallback function. When dynamic-to-static conversion fails, the whole graph can fall back to the dynamic graph mode of execution. For the fallback mechanism, add the set_eval_frame API. #50111, #52006
- For to_static, support the combinator mechanism. Support the scenario of using register_hook under to_static decoration. #49836, #52948, #53572
- Add a backend parameter to the to_static API. It can be specified as `CINN` or None. When the parameter is specified as CINN, the CINN compiler will be used to accelerate training and inference. #52596
- Add the automatic code generation function for the primitive API. Based on operator definitions in ops.yaml and legacy_ops.yaml, automatically generate code for the primitive API. Automatically generate the Tensor computation API. #50315, #49654, #50642
- Add the function of forward combination of operators. By registering the combination rules of forward operators, it can split forward operators into base operators. #49605
- Add the combinator switch. You can set environmental variables in shell to split operators in different ways. #50309
- Add `OpTest` combination test function to guarantee accuracy of operators. Add elementwise class base operator unit test. Add batch_norm CINN unit test. #50509, #50807, #52815
- Add combinator to support FP16 operation and AMP O1 operation. Add AMP logic for softmax and layer_norm operators. #52397, #52598, #51473
- Simplify combination rules and vjp rules of the combinator batch_norm. #54012, #51827, #51933
- Optimize combination rules for combinators, and improve performance of combination rules that contain scalars. Optimize log printing for combinators. #51960, #50160
- Combinator supports the jit.save API. Add custom VJP rule API. #52344, #50885
- Remove the overwrite parameter from combinator gather_grad. #52707
- Clean up dynamic-to-static code style, optimize error message, and standardize logs. #48637, #46128, #52527, #46800,#46415
- For dynamic-to-static, call append backward to get `grad var name`, to fix the error in high-order gradient computation. #53250
- Upgrade the dynamic-to-static function, and clean up the temporary directory of to_static to speed up code conversion. Enhance to_static to automatically skip internal APIs. Support use of the to_static decorator in the program. #47102, #50596, #45768
- For dynamic-to-static, optimize `print` function conversion to support printing Tensor parameters at the networking stage. Upgrade the parameter collection mechanism. #48672, #50336
- For the combinator, fix cmake compilation errors. Fix cuda 12 test errors. Fix bugs of operators such as meshgrid, expand_as, concat, conv, and arange. #49643, #54622, #53951, #53951, #53350, #51486, #52764
- For the combinator, fix bugs in a number of scenarios such as rank=1, shape=-1, amp, and multi-process. #51413, #51435, #50518, #47301
- For the combinator, fix bugs in automatic code generation of composite grad maker and static prim api. Fix bugs that op creation attributes are missing, and some combination rules do not take effect. #50854, #51445, #50780, #52120
- Fix some other bugs for combinators #50086, #51208, #51577, #53598, #47500, #52119, #50397, #50527, #50788, #51014, #52154, #52752
- For dynamic-to-static, fix the bugs of dataloader, cond input dict, transformer import, T5 model memory leak, and grad var name parsing error. #49821, #47299, #50776, #50883, #51100, #51464, #51966, #52110, #52821
- For dynamic-to-static, fix the bugs of Lazy initialization, Windows training, is_paddle_func failure, and recurrent op failure to delete pass. #50785, #52580, #51585, #51763, #51763
- Add scope caching and reuse mechanism during execution of run_program_op in dynamic-to-static, to avoid passing new scope for each step. #45813
- Remove the distributed sharding API in the old dynamic graphs. #49334
- Upgrade fleet to distributed directory. #50834
- Optimize log printing for distributed strategies. #47761
- For re-computation, support hook mode, inplace function, and stop_gradient mode. Support more flexible use. #48471, #47985
- Data parallel
- For data parallel, support no_sync API for blocking parameter gradient communications. Support the parameter synchronization function. Add scale API to scale parameters. #47536,#51895,#47519
- Fix the GPU memory leak under data parallel. #47369,#47444,#48668
- Support sparse parameter gradient synchronization. #52785
- Pipeline parallel
- Optimize pipeline performance, and remove communication wait. Optimize scheduling and communication overlap. #46209,#54003,#54312,#53384,#54310,#46399,#46483,#46780,#46116
- Support custom sharding, log printing, random seed setting, and timer elapsed time printing. #53344, #47670,#47336,#52656,#53831
- Optimize GPU memory release logic in pipeline scheduling, and release intermediate variables and data in advance. #54557, #47199,#47497,#48045,#54672
- Support VPP mode and model saving for pipeline parallel. #54196, #52927,#47801,#45922,#47242
- Grouping sharding parallel
- sharding stage2 parallel supports the quantization function, hybrid parallel training, gradient accumulation, XPU hardware, BF16 low precision computation, optimizer learning rate setting, offload function, and data parallel. #47169,#47535, #46795,#47711,#48310,#46846,#48857,#49196,#49931,#47114,#49767
- Optimize sharding stage2 performance. Support overlap of communication and computation. #46495,#46894
- sharding stage3 supports shared parameters and untrainable parameters. #48695,#48577
- Tensor model parallel
- Optimize tensor model parallel performance to reduce performance impact of stream sharding. #47715,#51617
- Support parameter, optimizer shapes, gradient synchronization. #51428,#53254, #53335,#45803,#46303,#52293
- Optimize tensor model parallel operators such as c_embedding and softmax_with_cross_entropy. #53197,#53547,#53541,#52789,#46491,#52742,#53419
- Launch
- Communication library
- Add custom mixed parallel communication groups, topology information printing, and custom communication topology order. #47021,#54000,#51781
- Remove communication library dependency on Place information #47857
- Add communications library to support GLOO operator. Support send/recv/gather. #52221, #52334,#49084
- Disable reverse computation of communication operator. #47636
- Add communication library static shape check, to help determine whether communication volume is matched. #48256,#48915,#48646
- Support communication python object type, BF16 type, alltoall, reduce, allgather, group call, global gather, broadcast, and scatter communication methods. Support XPU device communications. #51765,#45844,#48059,#48115, #48339,#49252,#49451,#50085,#50701,#48208,#48736,#51762,#52495,#53514,#48232,#49896,#49941,#45584
- Add support for communications between computational streams. #46182,#46023,#46295,#46761,#47481,#47740,#47976,#48163,#48396,#48308,#47110,#53089
- Optimize communication library TCP linking time. #49810,#47184
- Improve semi-automatic parallel for static graphs:
- Add FLOPs computation function for multiple operators, and add computation Cost modelling based on FLOPs. #48083,#47978,#47595,#48083,#48084,#47816
- Improve API ease-of-use. Improve the DistAttr, Process Mesh, Engine API, information printing, and input and output modules. Implement the new Engine cost API, which can be used to theoretically analyze model running time and GPU memory overhead. #47503,#46416,#46554, #46633,#49214,#53848,#46552, #47043, #49665, #52912, #45776, #47263
- Optimize the generality and ease of use of Pass. Support more scenarios, and reduce time spent on Pass pre-analysis. #46519,#47358,#46391, #51035
- Enhance debugging capabilities with distributed randomness control mechanisms and hybrid parallel precision alignment tools. #52903,#49865
- Support automatic sharding of inference generation task networking. Adapt special usage of control flow and conditional block in the generation model. #46771, #54067
- Improve grad_clip to support load balancing in data parallel scenarios. #49510, #49249
- Semi-automatic parallel performance improvement for static graphs:
- Add the Sharding Pass automated communication Fuse and multi-streams communication functions, with throughput performance improved by 26% on two machines for GPT 6.7B model. #48604, #47180,#46180
- Add Recompute optimization strategy tuning function. Select optimal recompute checkpoint settings based on GPU memory and model size. #48608,#47846,#49010
- For the pipeline parallel, add 1F1B scheduling optimization Pass #54260, #45915
- Optimize data parallel. Support optimizations such as converged communication and communication computation Overlap, with performance improved by 5% in GPT 1.3B model. #48092,#45643,#49744, #47578
- Optimize concat performance of the Reshard module. Reduce the number of concat operations in some scenarios. #47809
- Optimize mixed precision, upgrade Pass performance, support BF16 low precision, and adapt automatic hybrid parallel to the while loop control flow. #51285,#51147, #49219, #49079
- Improve function of fully automatic parallel for static graphs:
- Clean up the all list in ps directory, in which API is not exposed #51289
- Clean up cvm operator #48989
- For GPUPS, add support for AFS. #46611
- Degrade PGLBOX2.0 log, fix stuck issue of dense parameter, fix the bug that barrier does not take effect, and add get_epoch_finish python side interface #49946,#50166,#50349
- GPUPs run to switch to specified mode. #51115
- GPUPS is added to benchmark. #49587,#49649
- Fix the GPUPS optimizer selection bug, fix reader reading problem, and fix RPC compilation problem. #47026,#47192,#49878, #46356,#46575,#49389,#46258,#50136
- Add rocksdb compilation method. #46074
- Add compilation support for CUDA 12.0. Fix related unit test. (#49539, #54542)
- Add CUDNN Frontend API compilation support and related unit tests. Use the `WITH_CUDNN_FRONTEND=ON` compilation option to enable it. (#47524, #47612)
- Add mixed precision strategy and optimize precision:
- Add and optimize FP16 and BF16 data type support for more than 200 operators in the framework, including logsumexp, reduce_max, cumprod, sync_batch_norm, compare class OP, etc. Carry out precision optimization and unit test for all FP16 and BF16 operators. Improve the unit test framework function for low-precision operators, to ensure there is no loss of accuracy in the process of large-model training. (#51193, #51114, #45817, #52862, #52919, #52921, #46413, #48205, #54193, #48041, #48121, #46364, #51153, #53023, #53079, #53137, #46212, #50908, #52555, #51582, #47897, #45601, #53522, #52666, #50101, #48315, #50847, #50905, #50906, #50909, #50916, #50917, #50920, #50919, #50904, #50918, #50938, #50858, #50933, #50945, #50936, #51168, #51493, #50924, #50923, #50926, #50925, #50930, #53284, #53286, #53285, #50976, #50915, #50915, #48192, #50993, #50998, #51380, #51137, #51106, #51197, #51159, #51552, #51151, #51005, #51565, #51036, #51185, #51791, #51083, #51694, #51689, #51009, #51051, #51532, #51978, #51903, #51888, #52016, #52035, #52184, #52018, #51787, #51640, #52172, #52193, #51160, #51809, #51678, #52158, #51015, #52240, #52276, #52233, #52220, #52107, #52282, #52311, #52315, #52357, #52256, #51649, #52413, #52369, #51837, #52112, #51819, #52388, #52411, #52521, #51300, #51117, #52380, #52317, #51263, #52668, #52259, #50999, #52407, #52288, #52845, #50953, #52667, #52582, #52426, #51884, #52630, #52136, #52604, #51615, #51275, #52898, #52918, #52572, #52683, #52956, #52963, #52954, #52444, #52314, #52887, #52195, #53100, #52961, #52953, #53111, #53549, #53736, #52920, #53195, #53535, #53876, #53785, #53722, #54285, #54232, #53922, #47277, #50811, #54571, #50129, #50340, #50848, #50849, #50868, #50878, #50929, #50939, #50973, #50913, #51145, #51090, #51098, #51094, #51216, #51736, #51684, #51925, #54030, #50700, #52264, #51069, #51101, #51286, #53582,#49869))
- AMP optimization: Comprehensively upgrade and optimize ease of use, accuracy stability and debuggability of AMP training, to better support acceleration of large model training. In terms of ease of use, unify the API for dynamic and static graphs. Add new conversion interfaces such as model.float(), model.float16() and model.bfloat16(). In terms of accuracy stability, enhance automatic adjustment of the strategy for BF16 type. Optimize blacklist settings. Enhance support of the multi_precision function by optimizer operators Adagrad, Adamax, Adadelta, and RMSProp. In the O2 mode, improve master grad mechanism, add type promotion mechanism and a new parameter for the specific module to use float32 computation to guarantee accuracy. In terms of debuggability, add the paddle.amp.debugging module to provide operator statistics, outlier detection, and accuracy comparison. ( #50132, #50078, #50131, #49705, #52936, #52871, #53289, #53362, #54240, #53768, #48041, #47672, #48843, #49391, #51635, #45541, #53742, #51020, #51063, #52514, #50940, #52936, #53439, #53712, #48238, #52215, #53012, #52918, #54571)
- For GroupNorm operator, add support for NHWC data format. (#47533)
- For index_put operator, add support for mixed data types of bool and int. (#54195)
- Add sparse.is_nan API for determining whether a sparse tensor contains a NaN element. (#51513)
- Fix computation errors in several operators such as trace, roll, dropout_nd, and log_softmax, as well as stack overflow problems and some unit test errors. (#50243, #52012, #53795, #53149, #53654, #51054, #49373, #53038)
- Fix the problem that conv operator exhaustive search does not work in some scenarios. (#47065)
- Fix timeout problem of collective_reduce_scatter and other operators on A100. (#54513)
- Fix the problem of attribute error in FusedLinear unit test. (#50359)
- Fix the OOM problem that may occur when using Profiler. (#46089)
- Further optimize GPU Kernel and eigen implementations of the framework's large number of operators, including max_pool3d, dropout, adaptive_pooling, depthwise_conv2d, transpose, eigh, broadcast class computations, reduce class computations, prelu, logsumexp, and sparse, to achieve better performance in more configuration scenarios. (#45820, #45959, #45934, #46332, #46287, #47233, #48855, #48560, #49419, #49748, #50348, #52401, #51131, #51141, #51479, #51835, #52509, #52482, #52700, #53112, #53659, #53658, #53154, #54071, #53622, #52952, #46046, #46119, #45946, #47212, #47791, #47454, #45230, #48899, #33051, #49040, #48992, #49086, #50808, #46431, #50931, #48056, #46071, #49231, #38660, #50287, #46111, #46997, #45854, #47738, #48635, #50353, #50362, #51934, #54045, #46679, #52093, #52969)
- Provide more fusion implementations and related fusion pass, such as fused_feed_forward, gather-gemm-scatter, matmul + bias, layernorm_shift_partition + element_add, and elementwise class fusion, to further improve performance of models that use the mode. ( #50423, #50091, #50364, #53017, #50755, #50050, #47099, #48848, #49383, #50809, #52361, #52028, #48439, #49009, #51427, #52731, #51805)
In order to guarantee stability and reduce R&D cost of the IR system, we have developed a new IR system for PaddlePaddle. Complete basic data structure definition, operator definition generation, and execution system adaptation. In order to better support higher-order requirements of scientific computing scenarios, complete higher-order adaptation of operators such as silu and cast.
- Complete the definition of IR data structure, including type system and operator definition. Implement execution adaptation with phi kernel. #51112, #51992, #50412, #53557, #53953, #50959, #54250, #54197, #54289, #51636, #52846, #53988, #54143, #54035, #54052, #54340, #54356, #54068, #53894, #53707, #54185, #54031, #54220, #54275, #54281, #54186, #54259, #54124, #54292, #48068, #53978
- Improve the basic pass setup, including basic pass definition, pass registration management. #54023,#54170, #54170, #54308, #54348, #54385
- Improve adaptation of higher-order operators, including modification of the basic modules and adaptation of the silu and cast operators. #52005, #53425, #53417, #53417, #53498, #53171, #53632, #53605, #53746, #53874, #54164, #45888, #46024, #46446, #46960
- Add CINN support for 0D-Tensor. At present, in order to cooperate with the upgrade of the main framework, it is supported by adding pass temporarily. We will replace and upgrade the solution later. (#53382, #53955, #54064, #54118, #54216, #53454)
- Add CINN support for int8/uint8/int16/uint16/bf16 data types. (#50566, #53637)
- Add support for the CINN expand operator. (#46776)
- Add CINN support for PaddleInference. (#45009)
- For CINN compiler, pass skip_gc_vars attribute to CINN subgraph. CINN adds fetch operator for skip_gc_vars. #49471, #49553
- For CINN compiler, conv2d and conv2d_grad do not use cinn operator by default. #51645
- Add build_cinn_pass to BuildStrategy for use in dynamic-to-static (#49496)
- Add unit tests for the reshape operator under the combinator mechanism. (#51276)
- Change version of the main framework binding CINN from fixed commit to develop. (#49775)
- Set default Target parameter for CINN. (#50182)
- Fix the problem of inconsistent operator order after topology sorting during CINN symbolization. (#52556)
- Fix some operator computation errors, accuracy degradation, and unit test related problems. (#53859, #54261, #46801, #53676, #53772)
- Fix the problem of CINN support for float16 type. (#48249)
- Fix the problem in build_cinn_pass. (#46843)
- Fix the problem of no data area due to incorrect GC when CINN is turned on during combinator + dynamic-to-static. (#50116)
- Fix the compiler dropout AMP error, the combinator ResNet error, and the inplace-variable-not-found error. #51688, #52813, #51769
- Optimize reshape related fusion strategy (#53066)
- Optimize performance of BuildCINNPass. (#49696)
- Optimize performance of subgraph detection module. (#45040, #46937)
- Add support for the distributed strategies MP/Sharding/PP/MoE and for recompute on the training side, and for the distributed strategy MP on the inference side. Ascend NPU and Cambricon MLU hardware accessed through CustomDevice automatically inherits all distributed strategies newly added for CustomDevice, without any code changes. #52872, #54384, #53220, #54572, #54573, #54676, #53044, #53719, #53701, #53702, #53703
- Add the API paddle.device.is_compiled_with_custom_device, which lets users check whether the current environment supports the plug-in device backend of a given hardware. #49271
- Add the environment variable CUSTOM_DEVICE_BLACK_LIST, so that blacklisted operators automatically fall back to heterogeneous execution on the CPU. #50409, #50666
- Optimize CustomDevice performance by reducing number of calls to get_device_count interface in runtime. #46963
- For the training side, use the new version of dynamic graph, and add support for the distributed strategies MP/Sharding/PP, the recompute function, and the communication library. For the inference side, add support for the distributed strategy MP and for the XPU FasterTransformer operator acceleration library. #49531, #49815, #48897, #50717, #51082, #49757, #51399, #50329, #48369, #47838, #48076, #47882, #48961, #49043, #49749, #49806, #53427, #48470, #49207, #52296, #51785, #47168, #47445, #50200, #49934, #50792, #52228, #53337, #53389, #53496, #53609, #53697, #53720, #53734, #54172, #46227
- Paddle-TensorRT supports sharing video memory among the TensorRT engines of multiple subgraphs, or among the TensorRT engines of different Predictors, to save video memory. #45842, #47631
- For the C++ API, add APIs to obtain the shape and data type of input Tensors and of output Tensors. For the C API, add SetExecStream, EnableMkldnnInt8, and other APIs that already exist in the C++ API, for service deployment. #49758
- Add the paddle.inference.Predictor.register_output_hook() API, which supports printing the output of each layer during GPU inference for debugging and can be used in control flow models such as While. Note that the API does not support Paddle-TensorRT. #54433, #47050, #54254
- Paddle Inference Predictor API supports paddle::Tensor as input and output, so users can directly reuse the PaddlePaddle dynamics graph for pre-inference and post-inference processing. (#50445)
- Enhance Paddle-TensorRT dynamic shape running ability: the config.enable_tuned_tensorrt_dynamic_shape() API, called without any parameters, builds the TensorRT Engine at runtime, so it is unnecessary to collect shape information before running. To avoid rebuilding at runtime, the first several runs need to cover the minimum and maximum shapes. #52162
- Paddle-TensorRT supports model input in NHWC format. #49633
- Extend the config.Exp_DisableTensorRtOPs API to disable entry into TensorRT by specifying the name of the Tensor variable. #49497
- Enhance GPU mixed-precision inference (non-Paddle-TensorRT scenarios). Config.enable_use_gpu is enhanced so that the precision type can be set. #47993
- Support double type input for inference. #51786
- TensorRT operators do not support the INT64 type, which caused models containing INT64 data to fail at runtime. Paddle-TensorRT has been enhanced to automatically convert such models to run in the INT32 type when the model contains INT64 data. #45547
- Paddle-TensorRT supports more operators into TensorRT inference, including:
- expand_v2, gather_nd, rsqrt, sign, not, onehot, arg_min, temporal_shift, expand_as_v2, setvalue, index_select, round, acosh, square, reduce_max, not_equal, reduce_min, reduce_prod, grid_sampler, elementwise_mod, pad3d, greater_equal, bitwise, cumsum, matmul_v2, reciprocal, where, bmm, take_along_axis, less_than, greater_than, logical_or, logical_xor, logical_and, less_equal, range, reduce_all, reduce_any, fill_any_like, pow
- #47002, #47589, #48223, #48557, #48655, #49113, #51207, #51028, #50341, #51498, #48534, #48684, #49393, #49615, #50934, #50974, #50986, #52000, #51971, #52518, #44918, #48230, #47820, #46877, #48358, #48592, #48697, #53088, #47974, #53462
- Enhance the Paddle-TensorRT mapping of the operators strided_slice, instance_norm, prelu, argmax, cast, nearest_interp_v2, elementwise, and bilinear. #46819, #47998, #48043, #48998, #49675, #47495
- Some Paddle-TensorRT operators (scale, square, sum, swish, expand_as_v2, prelu, gelu, hard_swish, hard_sigmoid, leaky_relu, softmax, stack, clip, cast, flatten_contiguous_range, unary, equal, elementwise_op) support 0-dimensional Tensors. #53660, #53627, #53634, #53714, #53729, #53769, #53506, #53704
- Support compilation for versions earlier than GCC12 + CUDA 12.0. #50106
- Paddle-TensorRT's DeformableConv plugin supports dynamic Shape input. #50698
- For Paddle-TensorRT, add plugin support for lookup_table operator. #46613
- Add config.enable_low_precision_io() API to support low-precision type input in Paddle-TensorRT scenario. #52485
- Paddle-TensorRT's LayerNorm plugin supports FP16 computation. #45043
- Predictor's input data paddle_infer::Tensor supports bool type. #49388
- Paddle-TensorRT's enhanced Convolution implementation uses ConvolutionNd. #47653
- conv2d_fusion operator supports NHWC format. #49047
- Adjust the directory structure related to Phi operators under C++ inference library. #53091
- Support rebuilding TensorRT Engine instead of reporting errors when TensorRT serialization and loading versions do not match. #50775
- Optimize the log messages printed by the Paddle-TensorRT runtime. #50181
- Support elementwise 0-dimensional Tensor inputs for oneDNN-based CPU inference. #51656
- Clean up and normalize support for Paddle-TensorRT's FC, matmul, matmul_v2 operators, and unify and upgrade to use TensorRT's IMatrixMultiplyLayer for support. #52222
- Support multiple lookup_table operators in Paddle-TensorRT's Embedding+Eltwise+LayerNorm fusion. #46243, #46230
- Add MoE fusion Phi operator to improve inference performance of MoE model. #48703
- In the scenario of INT8 quantized inference, Paddle-TensorRT plugin can fall back to FP16 computation, instead of FP32 computation. #50554
- Optimize memory and video memory usage during inference. #49051, #49046, #53930
- Optimize Layout and enhance Pass. #52997
- Support caching of operator Shape inferences to improve model inference performance. #48312
- Optimize bias+add+relu fusion using half2 instructions. #49048
- Optimize Concat Kernel for multiple inputs using vectorization operations. #49540
- Implement Convolution, Depthwise Convolution and related fusion operators based on CUTLASS to improve inference speed. #47989, #50603, #51792
- Paddle-TensorRT supports the FlashAttention plugin, to improve inference speed of models such as StableDiffusion. #49438
- Add Transpose+LayerNorm fusion PASS, to improve inference speed of models such as StableDiffusion. #50082
- Add Elementwise+Transpose fusion. #50081
- Optimize Paddle-TensorRT Group Norm plugin implementation. #49160
- Add the use_cuda_graph parameter to the Config.EnableTensorRtEngine() API to enable CUDA Graph, which reduces runtime overhead. Note that the model input shape must remain unchanged during usage. #53406
- Support inplace operation of Reshape, to reduce copying time of the model at runtime. #49146
- Optimize LayerNorm kernel implementation based on oneDNN. #47782
- Support fusion of quantize+transpose and transpose+dequantize based on oneDNN. #49509
- When MKLDNN is turned on in CPU inference, FC-related fusion pass is enabled by default, to improve performance. #45704
- CPU OneDNN inference supports squeeze2 + transpose2 fusion. #47592
- Add ExpRunWithRuntimeConfig API and XpuRuntimeConfig, to allow setting parameters such as external streams and L3 cache during inference. GetExecStream API supports obtaining Kunlun external stream objects. Input and output support Kunlun device memory, to reduce D2H and H2D overheads. #53334, #52466, #53240
- Add multi-encoder, fused_multi_transformer and fusion pass, to improve performance of ERNIE and Transformer class models. #50570, #51346, #50499, #53982, #50759, #51571, #53144, #53306
- Optimize BeamSearch performance. Transform, remove and fuse fine-grained operators such as write_read_array and gather, to improve model performance when beam_size=1. #53130
- Transform multiple stack operators with the same input into unsqueeze operators that support broadcast. Unsqueeze/squeeze supports inplace computation. #52099
- Add support for exporting multi-card inference models for Kunlunxin. #50490
- Add embedding_with_eltwise_add fusion pass and operator phi kernel, to reduce video memory usage and improve inference performance. #50590
- interpolate class operator phi kernel supports FP16. #52358
- argmax operator supports INT32 type output. #51303
- Fix the error of only model file when saving serialized model after turning on mixed-precision inference mode. #52994
- Fix the segmentation fault of instance_norm when scale and bias are empty. #52627
- conv_transpose operator supports FP16. #53626
- Add yolo_box_xpu fusion pass and operator phi kernel, to optimize YOLO model generic substructure. #54163
- Add conv2d_xpu fusion pass and operator phi kernel, and support FP16 inference, to reduce convolution inference time. #52247, #53626
- Add the generic sigmoid_elementmul fusion pass, which fuses into the swish operator so that the conv2d_fusion pass can match, improving YOLO model inference performance. #53580
- Add act_add fusion pass and operator phi kernel to improve inference performance. #53965
- Add fold_interp_outsize fusion pass, to improve inference performance. #54245
- Solve the problem of incorrect results due to duplicate fusion when there is shared weight in FC. #51108, #51039
- Remove op_device attribute where operator is only used for training, to prevent wrong choice of place for training during inference. #51029
- Support saving of optimized models, allowing PASS optimization to be skipped in case of re-inference, to reduce first time inference time. #53696
- Solve the problem of computation error caused by the CPUPlace input of operator Kernel being forced to copy to XPU. #51306
- subblock supports early copying of H2D parameters to improve inference performance. #51876
- Fix scale memory size of the output activation of Kunlunxin 2nd generation chip. #53505
- In new executor Kunlunxin D2D copy, support asynchronous execution. #51876
- Remove concat operator with only one input. #52304
- lookup_table_v2 supports FP16 to remove redundant cast operator. #52888
- Control flow While operator supports caching scope, to reduce overhead of creating new scope every time. #52628
- Scatter newly supports FP16, to remove redundant cast operators and elementwise_mul operators with an input of 1. #52831
- Upgrade of dynamic graph quantization function.
- Add a new API for quantization training of dynamic graph models: paddle.quantization.QAT. Support passing quantization-related parameters through configuration, simplifying the quantization training process and reducing the difficulty of secondary development. (#49398)
- Add a new offline quantization API: paddle.quantization.PTQ. Support exporting the quantized model to a model format supported by inference. (#50107)
- Add STUB operator to simulate actual quantization operation during the training process. (#50510)
- Support loading parameters of an offline quantization model into a quantization training model. Support more operators for quantization, including matmul, scale, and conv1d. #47892, #45911, #48912
- Support hybrid parallel training of static graph quantization training. #52219
- Fix the problem in the process of dynamic graph quantization:
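As an illustration of the configuration-driven paddle.quantization.QAT entry listed above, the following is a minimal sketch rather than a definitive recipe. The choice of LeNet as the model, FakeQuanterWithAbsMaxObserver as the quanter, and moving_rate=0.9 are assumptions made for the example and are not taken from this release note. The helper name build_qat_model is hypothetical, and it returns None when PaddlePaddle 2.5 or later is not installed.

```python
def build_qat_model():
    """Wrap a LeNet with fake quanters for QAT.

    Returns None if PaddlePaddle (2.5 or later) cannot be imported.
    """
    try:
        from paddle.quantization import QAT, QuantConfig
        from paddle.quantization.quanters import FakeQuanterWithAbsMaxObserver
        from paddle.vision.models import LeNet
    except ImportError:
        return None

    # Illustrative assumption: one quanter setting shared by activations and weights.
    quanter = FakeQuanterWithAbsMaxObserver(moving_rate=0.9)
    q_config = QuantConfig(activation=quanter, weight=quanter)

    qat = QAT(q_config)
    model = LeNet()
    model.train()  # QAT is applied to a model in training mode
    # quantize() returns a copy of the model with quantization layers inserted.
    return qat.quantize(model)


quant_model = build_qat_model()
print(quant_model if quant_model is not None else "PaddlePaddle 2.5+ not installed")
```

The returned model can then be trained with an ordinary dynamic graph training loop. All quantization settings live in the QuantConfig object, which is what the release note means by passing quantization-related parameters through configuration.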
Improve efficiency of source code compilation, and promote the setuptools + ninja compilation method to increase development efficiency. In CPU scenarios, full compilation time is reduced by 20 min and compilation speed is increased by 24.52%. In GPU scenarios, full compilation time is reduced by 22 min and compilation speed is increased by 29.31%. To adapt to mainstream development environments, PaddlePaddle supports gcc12 compilation and C++17 in the source code, and adapts to the latest CUDA12. In terms of code quality, compilation warnings have been cleaned up to improve the compilation experience. At the third-party dependency level, we have upgraded the underlying protobuf version to reduce dependencies, cleaned up deprecated attributes of some earlier versions of dependency libraries and old code formats, and removed support for Python 2.x.
- ninja compilation adaptation to improve compilation speed. #52433,#48932,#49420,#48435,#49303,#49448,#49838,#50067,#52796,#50431,#49181,#48867,#48490,#48211,#49499,#53076
- setuptools compilation and package all-in-one adaptation. #48770,#46957,#49583,#47602,#48301,#50800,#42575,#49826,#49002,#51443,#51528,#52621,#52465
- gcc12 support. #52960,#52265,#46546,#52318,#46808,#47466,#52083,#48176,#49423,#49452,#51037,#52007,#52441,#52085,#50817,#52646,#50777,#53288,#54009
- c++17 standard support. #53345,#53892,#54282,#49017,#47635,#54258
- cuda12 support. #52285,#49592,#52232,#52654,#54641
- CodeStyle. #45909,#47772,#48538,#49522,#47264,#49558
- Compilation Warning is removed. #47163,#47216,#47309,#47252,#47341,#47399,#47513,#47558,#47706,#52717,#51203,#51336,#51608,#51633,#46644,#53092,#53185,#53246,#53650,#53683,#53687,#53886,#53689,#53679,#53681,#53532,#47137,#47045,#52186,#52490,#53924,#53938,#53945,#53851,#53847,#53818,#53931
- Support protobuf upgrade. #49875,#48495,#49673,#52499,#51161,#49168
- Support offline compilation of third-party libraries. #54326,#54370,#54335,#54346,#53744,#54319,#53915
- Phi independent compilation header file dependency decoupling. #50456,#47088,#52573,#52651
- Python2.x decommissioning. #48685
- Fix bugs such as null pointer usage, illegal address access, memory out of bounds, divide by 0, and Python IndexError PR49976, PR49993, PR49942, PR49965, PR50000, PR50005, PR49953, PR49995, PR49974, PR50015, PR50010, PR49979, PR49994, PR49977, PR49968, PR49984, PR49958, PR50008, PR51714, PR51847, PR51034, PR51088, PR51091, PR51092, PR49966, PR49656, PR52161, PR49548, PR49546, PR49547, PR49549, PR51850
This release contains contributions from: 1want2sleep, 201716010711, 404988613, 5u13, 6clc, Ackeraa, Aganlengzi, ahahahahahaha, Ainavo, Allen Guo, andyj, Asthestarsfalll, Aurelius84, Ayuan, BellaZYL, Bjmw3, Bo Zhang, bukejiyu, caozhou, carryyu, Ccc, ccrrong, ceci3, chalsliu, Chang Xu, CHANGer, Charles-hit, Chen Weihang, chenjian, Chenxiao Niu, chenxiao120660, chenxujun, Chitsing KUI, cifar10, co63oc, CollaborativeFiltering, csy0225, cxxly, cyber-pioneer, cyberslack_lee, czr-gc, Dandelight, danleifeng, Danyang Zhang, dasen, denglianbin, Difer, dongfangshenzhu, DrowFish19, duanboqiang, duanyanhui, engineer, engineer1109, Epsilon Luoo, feifei-111, Feiyu Chan, Feng Ni, feng_shuai, Fisher, FlyingQianMM, Frank Lin, Galaxy1458, GaoYuYang, gaoziyuan, gem5, GGBond8488, Ghost Screaming, gongenlei, gouzil, Guanghua Yu, Guo Sheng, Guoxia Wang, Hamid Zare, Hanchiao, handiz, Haohongxiang, haosicheng, haozi, Happyd99, heliqi, hellockx, hellolllw, heyanru, hg-1099255210, hh-qiao, hjyp, hong, HongyuJia, houj04, hua-zi, Huang Jiyi, Huang Zhengjie, huangjiyi, huangjun12, Hui Zhang, Huihuang Zheng, Hulek, hwa, HydrogenSulfate, Ikko Eltociear Ashimine, iLeGend, Infinity_lee, Infrared1029, Jacek Czaja, jakpiase, james, jameszhang, Jiabin Yang, jiahongyu, jiangcheng, jiangfan06, Jianghai, jiaqianjing, jingsongliu, JingZhuangzhuang, jjyaoao, joanna.wozna.intel, junxiu777, Jx-qi, JYChen, JZ-LIANG, jzhang533, Kai Song, Kai Xing, Kaipeng Deng, Kang Zhao, kangguangli, Kevin Wu Jiawen , Kim, Kim Yann, knamg, kuizhiqing, lanxianghit, Leding Li, Leo Chen, Leo Guo, levi131, Li Min, Li-fAngyU, Ligoml, lijialin03, lijin23, limingshu, Lin Manhui, LinearTemporalLogic, Linjie Chen, lishicheng1996, Little-chick, littleforest, liu zhengxi, liulinduo, liuruyan, liuzhenhai93, LiYuRio, lj970926, LokeZhou, LoneRanger, lubiu, Lucas, lugimzzz, Lux et Veritas, lxsbupt, LyndonKong, lzy, lzydev, Mahmoud Ashraf, Manan Goel, Maple Xie, Matsumoto Ruko, mayang002, MayYouBeProsperous, megemini, mengziheng, Meteor 
Liu, mhy, mhy-666, Ming-Xu Huang, ming1753, minghaoBD, mjxs, Moqim, Mountagha, Mr.Juice, mrcangye, NetPunk, Netpunk, nihao, niuliling123, Nyakku Shigure, OccupyMars2025, Ouyang Chao, pangengzheng, pangyoki, parap1uie-s, Paulina Gacek, Piotr Paturej, PommesPeter, PPGitub, PPPPzhang, PuQing, Qi Li, Qi Shao, QingshuChen, qipengh, qizhaoaoe, Rayman, RedContritio, RichardWooSJTU, risemeup1, Roc, ronnywang, Ruibiao Chen, Ruibin Cheung, RuohengMa, Ryan, SaltFish11, Sanbu, Scotty, scotty, seemingwang, Shaojie WANG, ShenLiang, shentanyue, Shijie, Shuangchi He, Siming Dai, Sing_chan, sneaxiy, Sonder, sprouteer, Sqhttwl, sunli, superwinner1, supplyout, SylarTiaNII, Sylwester Fraczek, Sławomir Siwek, taixiurong, Tao Luo, Taylor-Layrose, TeFeng Chen, Thomas Young, thunder95, Thunderbrook, Tian, Tian Zheng, tiancaishaonvjituizi, tianshuo78520a, tifa, Tinson Lai, Tomasz Socha, Tony Cao, ucsk, umiswing, ustiniankw, Vegetable dog, Vigi Zhang, Vvsmile, Wang Bojun, Wang Xin, Wang Xinyu, wangfengsheng1999, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, wangshengxiang, wangxiaoning, wangxinxin08, Wangzheee, WangZhen, wangzhen38, wasupandceacar, wawltor, Wei Shengyu, Weilong Wu, weishengying, Wen Sun, wenbin, wentao yu, wenzhe.wang, westfish, whisky-12, whs, Wilber, will-jl944, winter-wang, Winters Montagne, WJJ1995, wuhuachaocoding, wuyefeilin, wz1qqx, XiangGao, xiaoguoguo626807, xiaohemaikoo, xiaoluomi, xiaoting, xiaoxiaohehe001, Xiaoxu Chen, xiaoyuanzi914, Xinger, Xinyu Chen, xiongkun, xjmxyt, xu98bin, xysheng-baidu, yangguohao, yangjianfengo1, YangQun, YangZhou, yeliang2258, YepKong, Yichen Zhang, yikaikkk, Yiqun Liu, yjphhw, ykkk2333, Young-Flash, yu wentao, Yuang Liu, Yuanle Liu, YuanRisheng, yuchen202, yuehuayingxueluo, YuhangLi, Yulong Ao, YUNSHEN XIE, yunyaoXYY, YuRonan, zachary sun, ZeKai Zhou, Zenghui Yuan, zengshao0622, Zero Rains, Zhan Rongrui, Zhang Jun, Zhang Na, Zhang Ting, Zhang Zheng, zhangbo9674, ZhangDY-6483, zhangkaihuo, zhangxin81, zhangyikun02, 
zhangyingying520, zhangyuqin1998, zhaocaibei123, zhaoyingli, Zhen Wang, Zheng-Bicheng, Zhenghai Zhang, Zheng_Bicheng, zhenyun, Zhibao Li, zhiboniu, Zhong Hui, Zhou Wei, ZhouMengLei1999, zhoutianzi666, zhouzj, zhupengyang, zhurou603, zhuyipin, zhwesky2010, ziyoujiyi, zlsh80826, Zman, zmxdream, zqw_1997, Zuza Gawrysiak, zxcd, zyfncg, ZZK, zzk0, Ding Yi, Fu Jianhan, Liu Ge Gu Tou, Lu Lin, Zhou Zhouzhou, Jiang Yongyong, Xue Zhawu, Zhang Chunqiao, Zhang Zhenghai, Ning Meng Wei, Wang Mingdong, Shi Xiaowei, Chao Ji Ma Niu, Chen Cangye, Qi Ma Xiao Mao