# Performance comparison for IWL clopen TCV simulations

## Overview

This report analyzes the performance of three TCV (Tokamak à Configuration Variable) gyrokinetic simulations with varying grid resolutions and computational resources. The simulations use the clopen (closed + open field-line regions) configuration with both sheath and twist-and-shift boundary conditions, as well as adaptive sources. Each case is run for 200 time steps on NERSC Perlmutter GPU nodes with gkylzero commit 600a631.

Moving from the low-resolution case (1 node, 4 GPUs) to the high-resolution case (4 nodes, 16 GPUs) reduces parallel efficiency by 13% (from 1.0 to 0.87; see the runtime table below).

Field solves and boundary condition enforcement emerge as the primary bottlenecks: together they grow from ~25% of total runtime at low resolution to over 40% at high resolution. The Forward Euler step remains the single most expensive component, but its relative cost drops from 64% to 49% of total runtime as resolution increases.

The collision terms remain consistently dominant within the Forward Euler computation (50-55% across all resolutions), while collision moments become relatively less expensive at higher resolutions (15.2% → 6.6%), indicating good scaling for this component.

## Simulation Configurations

| Configuration | Grid Resolution | Nodes | GPUs | Cells per GPU |
|---|---|---|---|---|
| Low-res | 24×16×12×12×8 | 1 | 4 | ~100k |
| Medium-res | 48×32×16×12×8 | 2 | 8 | ~300k |
| High-res | 96×64×16×12×8 | 4 | 16 | ~600k |

Note: each Perlmutter GPU node contains 4 NVIDIA A100-PCIE-40GB GPUs.
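The cells-per-GPU column follows directly from the grid dimensions (the table rounds to one significant figure); a quick Python check:

```python
# Verify the cells-per-GPU column from the grid dimensions (x, y, z, vpar, mu).
import math

configs = {
    "Low-res":    ((24, 16, 12, 12, 8), 4),   # (grid, number of GPUs)
    "Medium-res": ((48, 32, 16, 12, 8), 8),
    "High-res":   ((96, 64, 16, 12, 8), 16),
}

for name, (grid, gpus) in configs.items():
    cells = math.prod(grid)
    print(f"{name}: {cells} cells total, ~{cells / gpus / 1e3:.0f}k per GPU")
# Low-res:    442368 cells total, ~111k per GPU
# Medium-res: 2359296 cells total, ~295k per GPU
# High-res:   9437184 cells total, ~590k per GPU
```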

## Total Runtime Comparison

| Configuration | Total Time Loop (sec) | Forward Euler Calls | RK Stage-2 Failures | Parallel efficiency (cells-per-GPU increase / time increase) |
|---|---|---|---|---|
| Low-res | 30.1 | 602 | 1 | 1.0 |
| Medium-res | 87.0 | 656 | 28 | 0.96 |
| High-res | 206.5 | 608 | 4 | 0.87 |
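Spelled out, one plausible reading of the efficiency column (the exact normalization, e.g. whether times are taken per Forward Euler call, may differ) is the weak-scaling ratio

$$
\eta = \frac{(N_{\mathrm{cells}}/N_{\mathrm{GPU}}) \,/\, (N_{\mathrm{cells}}/N_{\mathrm{GPU}})_{\mathrm{low}}}{T \,/\, T_{\mathrm{low}}}
$$

where $T$ is the total time-loop wall time. $\eta = 1$ means runtime grows exactly in proportion to the per-GPU workload; the high-res value of 0.87 corresponds to the 13% efficiency loss quoted in the overview.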

## Detailed Performance Breakdown

### Forward Euler Computation

The Forward Euler step is the dominant computational component:

| Configuration | Forward Euler Time (sec) | % of Total Loop | Time per Call (ms) |
|---|---|---|---|
| Low-res | 19.4 | 64.4% | 32.2 |
| Medium-res | 52.5 | 60.4% | 80.1 |
| High-res | 101.2 | 49.0% | 166.4 |
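The per-call column is just the Forward Euler time divided by the call count from the runtime table above:

```python
# Time per call = Forward Euler time / number of Forward Euler calls.
cases = {
    "Low-res":    (19.4, 602),   # (Forward Euler time in sec, calls)
    "Medium-res": (52.5, 656),
    "High-res":   (101.2, 608),
}
for name, (t, calls) in cases.items():
    print(f"{name}: {1e3 * t / calls:.1f} ms per call")
# Reproduces the table to within rounding of the input times
# (32.2, 80.0, 166.4 ms vs. the table's 32.2, 80.1, 166.4 ms).
```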

#### Major Computational Components (% of Forward Euler time)

| Component | Low-res | Medium-res | High-res |
|---|---|---|---|
| Collision terms | 50.8% | 55.3% | 52.8% |
| Collisionless terms | 21.0% | 22.5% | 22.7% |
| Collision moments | 15.2% | 9.1% | 6.6% |
| Sources | 11.9% | 11.4% | 13.5% |
| Boundary fluxes | 6.0% | 5.8% | 8.7% |

### Field Solve Performance

The cost of the Poisson field solve grows faster than the per-GPU workload as resolution increases:

| Configuration | Field Solve Time (sec) | % of Total Loop | Scaling Factor |
|---|---|---|---|
| Low-res | 4.5 | 14.8% | 1.0x |
| Medium-res | 13.7 | 15.8% | 3.1x |
| High-res | 42.9 | 20.8% | 9.5x |

The field solver scales poorly, growing from 15% to 21% of total runtime as resolution increases.

### Boundary Conditions

Boundary condition enforcement scales even worse, with the largest scaling factor of any component:

| Configuration | BC Time (sec) | % of Total Loop | Scaling Factor |
|---|---|---|---|
| Low-res | 3.1 | 10.3% | 1.0x |
| Medium-res | 12.5 | 14.4% | 4.0x |
| High-res | 42.1 | 20.4% | 13.6x |
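The scaling factors in this table and in the field-solve table are each component's time normalized to its low-res value. For reference, the per-GPU workload grows only ~5.3x from low to high resolution, so both components scale well above ideal:

```python
# Scaling factor = component time / low-res component time.
field_solve = {"Low-res": 4.5, "Medium-res": 13.7, "High-res": 42.9}  # seconds
bcs         = {"Low-res": 3.1, "Medium-res": 12.5, "High-res": 42.1}  # seconds

for label, times in (("Field solve", field_solve), ("BCs", bcs)):
    base = times["Low-res"]
    print(label, {k: f"{v / base:.1f}x" for k, v in times.items()})
# Field solve {'Low-res': '1.0x', 'Medium-res': '3.0x', 'High-res': '9.5x'}
# BCs         {'Low-res': '1.0x', 'Medium-res': '4.0x', 'High-res': '13.6x'}
# (3.0x vs. the table's 3.1x reflects rounding of the reported times.)
```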

### I/O Performance

Data output times grow with grid size and the corresponding output volume:

| Configuration | Total I/O (sec) | Species Diag Write (sec) | f Write (sec) |
|---|---|---|---|
| Low-res | 1.0 | 0.57 (56%) | 0.37 (36%) |
| Medium-res | 2.6 | 1.42 (55%) | 1.01 (39%) |
| High-res | 6.5 | 3.56 (55%) | 2.67 (41%) |

## Source files

- TCV_NT_3x2v_IWL_adapt_src_24x16x12x12x8_200steps_1nodes_perlmutter.txt
- TCV_NT_3x2v_IWL_adapt_src_48x32x16x12x8_200steps_2nodes_perlmutter.txt
- TCV_NT_3x2v_IWL_adapt_src_96x64x16x12x8_200steps_4nodes_perlmutter.txt
- TCV_NT_3x2v_IWL_adapt_src.txt
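For reference, a minimal sketch of how timings like the ones tabulated above can be pulled out of these logs. The line format assumed here (`<component> took <X> secs`) is hypothetical; adjust the regex to whatever the actual gkylzero output looks like.

```python
# Sketch: pull component timings out of the run logs listed above.
# ASSUMPTION: each log contains lines like "Field solve took 4.5 secs";
# this format is hypothetical, so adapt the regex to the real output.
import re
import sys

PATTERN = re.compile(r"^(?P<name>[A-Za-z][\w ./-]*?) took (?P<secs>[0-9.eE+-]+) sec",
                     re.MULTILINE)

def parse_timings(path: str) -> dict[str, float]:
    """Return {component name: seconds} for every matching line in the log."""
    with open(path) as f:
        return {m["name"].strip(): float(m["secs"]) for m in PATTERN.finditer(f.read())}

if __name__ == "__main__":
    for path in sys.argv[1:]:  # e.g. the per-node .txt logs above
        print(path)
        for name, secs in sorted(parse_timings(path).items(), key=lambda kv: -kv[1]):
            print(f"  {name}: {secs:.2f} s")
```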