# Performance comparison for IWL clopen TCV simulations
## Overview
This report analyzes the performance of three TCV (Tokamak à Configuration Variable) gyrokinetic simulations with varying grid resolutions and computational resources. The simulations use the clopen (closed+open) configuration with both sheath and twist-and-shift boundary conditions, as well as adaptive sources. Each case is run for 200 time steps on NERSC Perlmutter GPU nodes with gkylzero commit 600a631.

Moving from the low-resolution case (1 node, 4 GPUs) to the high-resolution case (4 nodes, 16 GPUs) results in a parallel efficiency reduction of 13%.
Field solves and boundary condition enforcement emerge as the primary bottlenecks. These components grow from ~25% of total runtime at low resolution to over 40% at high resolution. The Forward Euler step remains the most computationally intensive, but its relative cost decreases with resolution, dropping from 64% to 49% of total runtime.
The collision terms remain consistently dominant within the Forward Euler computation (50-55% across all resolutions), while collision moments become relatively less expensive at higher resolutions (15.2% → 6.6%), indicating good scaling for this component.
## Simulation Configurations
Configuration | Grid Resolution | Nodes | GPUs | Cells per GPU |
---|---|---|---|---|
Low-res | 24×16×12×12×8 | 1 | 4 | ~100k |
Medium-res | 48×32×16×12×8 | 2 | 8 | ~300k |
High-res | 96×64×16×12×8 | 4 | 16 | ~600k |
Note: Each Perlmutter GPU node contains 4 NVIDIA A100-PCIE-40GB GPUs
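The cells-per-GPU column follows directly from the phase-space grid dimensions and the GPU count (4 GPUs per node). A minimal sketch of that arithmetic, assuming the full 3x2v grid is distributed evenly across all GPUs:

```python
# Approximate phase-space cells per GPU, assuming an even decomposition of
# the full 3x2v grid (presumably x, y, z, vpar, mu) across all GPUs.
configs = {
    "Low-res":    ((24, 16, 12, 12, 8), 1 * 4),   # 1 node  x 4 GPUs
    "Medium-res": ((48, 32, 16, 12, 8), 2 * 4),   # 2 nodes x 4 GPUs
    "High-res":   ((96, 64, 16, 12, 8), 4 * 4),   # 4 nodes x 4 GPUs
}

for name, (dims, n_gpus) in configs.items():
    total_cells = 1
    for n in dims:
        total_cells *= n
    print(f"{name}: {total_cells / n_gpus / 1e3:.0f}k cells per GPU")
# Prints ~111k, ~295k, and ~590k, matching the ~100k/~300k/~600k figures above.
```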
## Total Runtime Comparison
Configuration | Total Time Loop (sec) | Forward Euler Calls | RK Stage-2 Failures | Parallel efficiency (cells-per-GPU increase / runtime increase) |
---|---|---|---|---|
Low-res | 30.1 | 602 | 1 | 1.0 |
Medium-res | 87.0 | 656 | 28 | 0.96 |
High-res | 206.5 | 608 | 4 | 0.87 |
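Reading the last column as a weak-scaling parallel efficiency and using the nominal cells-per-GPU values from the configuration table, the high-resolution entry follows from

$$
\frac{\text{cells-per-GPU increase}}{\text{runtime increase}} \approx \frac{\sim 600\text{k} / \sim 100\text{k}}{206.5\,\text{s} / 30.1\,\text{s}} \approx \frac{6}{6.86} \approx 0.87,
$$

which corresponds to the 13% parallel efficiency reduction quoted in the overview.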
## Detailed Performance Breakdown
### Forward Euler Computation
The Forward Euler step is the dominant computational component:
Configuration | Forward Euler Time (sec) | % of Total Loop | Time per Call (ms) |
---|---|---|---|
Low-res | 19.4 | 64.4% | 32.2 |
Medium-res | 52.5 | 60.4% | 80.1 |
High-res | 101.2 | 49.0% | 166.4 |
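The per-call cost is simply the Forward Euler time divided by the number of Forward Euler calls from the runtime table; for the high-resolution case,

$$
\frac{101.2\,\text{s}}{608\ \text{calls}} \approx 166.4\,\text{ms per call}.
$$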
### Major Computational Components (% of Forward Euler time)
Component | Low-res | Medium-res | High-res |
---|---|---|---|
Collision terms | 50.8% | 55.3% | 52.8% |
Collisionless terms | 21.0% | 22.5% | 22.7% |
Collision moments | 15.2% | 9.1% | 6.6% |
Sources | 11.9% | 11.4% | 13.5% |
Boundary fluxes | 6.0% | 5.8% | 8.7% |
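To put the percentages in context, here is a small sketch converting the high-resolution shares back into absolute wall-clock time, assuming they are fractions of the 101.2 s Forward Euler total:

```python
# Convert the high-res component shares (% of Forward Euler time) into
# approximate wall-clock seconds, assuming the 101.2 s Forward Euler total.
fwd_euler_total_s = 101.2
components = {
    "Collision terms":     52.8,
    "Collisionless terms": 22.7,
    "Collision moments":    6.6,
    "Sources":             13.5,
    "Boundary fluxes":      8.7,
}
for name, pct in components.items():
    print(f"{name:20s} ~{fwd_euler_total_s * pct / 100:5.1f} s")
# Collision terms alone account for roughly 53 s, i.e. about a quarter
# of the 206.5 s total time loop at high resolution.
```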
### Field Solve Performance
The Poisson field solve grows faster than the per-GPU workload as resolution increases:
Configuration | Field Solve Time (sec) | % of Total Loop | Scaling Factor (time vs low-res) |
---|---|---|---|
Low-res | 4.5 | 14.8% | 1.0x |
Medium-res | 13.7 | 15.8% | 3.1x |
High-res | 42.9 | 20.8% | 9.5x |
The field solve scales poorly, growing from 14.8% to 20.8% of total runtime as resolution increases.
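For a rough weak-scaling reference, the per-GPU workload grows by roughly 5-6x from the low- to the high-resolution case, while the field solve time grows by

$$
\frac{42.9\,\text{s}}{4.5\,\text{s}} \approx 9.5\times,
$$

i.e. noticeably faster than the workload itself.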
### Boundary Conditions
Boundary condition enforcement scales even worse, increasing from 10.3% to 20.4% of total runtime (a 13.6x time increase from low to high resolution):
Configuration | BC Time (sec) | % of Total Loop | Scaling Factor (time vs low-res) |
---|---|---|---|
Low-res | 3.1 | 10.3% | 1.0x |
Medium-res | 12.5 | 14.4% | 4.0x |
High-res | 42.1 | 20.4% | 13.6x |
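Taken together, the field solve and boundary condition timings reproduce the bottleneck claim from the overview; a quick check:

```python
# Combined field-solve + boundary-condition share of the total time loop,
# using the per-configuration timings from the two tables above.
timings = {                    # (field solve s, BC s, total loop s)
    "Low-res":    (4.5,  3.1,  30.1),
    "Medium-res": (13.7, 12.5, 87.0),
    "High-res":   (42.9, 42.1, 206.5),
}
for name, (field_s, bc_s, total_s) in timings.items():
    share = 100.0 * (field_s + bc_s) / total_s
    print(f"{name}: {share:.1f}% of total loop")
# Low-res: ~25%, Medium-res: ~30%, High-res: ~41%, consistent with the
# "~25% to over 40%" statement in the overview.
```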
### I/O Performance
Data output times scale with both grid size and data volume:
Configuration | Total I/O (sec) | Species Diagnostics Write (sec, % of I/O) | Distribution Function (f) Write (sec, % of I/O) |
---|---|---|---|
Low-res | 1.0 | 0.57 (56%) | 0.37 (36%) |
Medium-res | 2.6 | 1.42 (55%) | 1.01 (39%) |
High-res | 6.5 | 3.56 (55%) | 2.67 (41%) |
## Source files
- TCV_NT_3x2v_IWL_adapt_src_24x16x12x12x8_200steps_1nodes_perlmutter.txt
- TCV_NT_3x2v_IWL_adapt_src_48x32x16x12x8_200steps_2nodes_perlmutter.txt
- TCV_NT_3x2v_IWL_adapt_src_96x64x16x12x8_200steps_4nodes_perlmutter.txt
- TCV_NT_3x2v_IWL_adapt_src.txt