# Performance comparison for IWL clopen TCV simulations

## Overview

This report analyzes the performance of three TCV (Tokamak à Configuration Variable) gyrokinetic simulations with varying grid resolutions and computational resources. The simulations use the clopen (closed + open field-line regions) configuration with both sheath and twist-and-shift boundary conditions, as well as adaptive sources. Each case is run for 200 time steps on NERSC Perlmutter GPU nodes with gkylzero commit 600a631.

Moving from the low-resolution case (1 node, 4 GPUs) to the high-resolution case (4 nodes, 16 GPUs) reduces parallel efficiency by 13% (from 1.0 to 0.87; see the runtime table below).

Field solves and boundary condition enforcement emerge as the primary bottlenecks: together they grow from ~25% of total runtime at low resolution to over 40% at high resolution. The Forward Euler step remains the single most expensive component, but its relative cost drops from 64% to 49% of total runtime as resolution increases.

The collision terms remain consistently dominant within the Forward Euler computation (50-55% across all resolutions), while collision moments become relatively less expensive at higher resolutions (15.2% → 6.6%), indicating good scaling for this component.

## Simulation Configurations

| Configuration | Grid Resolution | Nodes | GPUs | Cells per GPU |
|---|---|---|---|---|
| Low-res | 24×16×12×12×8 | 1 | 4 | ~100k |
| Medium-res | 48×32×16×12×8 | 2 | 8 | ~300k |
| High-res | 96×64×16×12×8 | 4 | 16 | ~600k |

Note: each Perlmutter GPU node contains 4 NVIDIA A100-PCIE-40GB GPUs.
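The cells-per-GPU column follows directly from the grid dimensions (the table rounds to one significant figure); a quick Python check:

```python
# Verify the cells-per-GPU column from the grid dimensions (x, y, z, vpar, mu).
import math

configs = {
    "Low-res":    ((24, 16, 12, 12, 8), 4),   # (grid, number of GPUs)
    "Medium-res": ((48, 32, 16, 12, 8), 8),
    "High-res":   ((96, 64, 16, 12, 8), 16),
}

for name, (grid, gpus) in configs.items():
    cells = math.prod(grid)
    print(f"{name}: {cells} cells total, ~{cells / gpus / 1e3:.0f}k per GPU")
# Low-res:    442368 cells total, ~111k per GPU
# Medium-res: 2359296 cells total, ~295k per GPU
# High-res:   9437184 cells total, ~590k per GPU
```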

## Total Runtime Comparison

| Configuration | Total Time Loop (sec) | Forward Euler Calls | RK Stage-2 Failures | Parallel efficiency (cells-per-GPU increase / time increase) |
|---|---|---|---|---|
| Low-res | 30.1 | 602 | 1 | 1.0 |
| Medium-res | 87.0 | 656 | 28 | 0.96 |
| High-res | 206.5 | 608 | 4 | 0.87 |
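Spelled out, one plausible reading of the efficiency column (the exact normalization, e.g. whether times are taken per Forward Euler call, may differ) is the weak-scaling ratio

$$
\eta = \frac{(N_{\mathrm{cells}}/N_{\mathrm{GPU}}) \,/\, (N_{\mathrm{cells}}/N_{\mathrm{GPU}})_{\mathrm{low}}}{T \,/\, T_{\mathrm{low}}}
$$

where $T$ is the total time-loop wall time. $\eta = 1$ means runtime grows exactly in proportion to the per-GPU workload; the high-res value of 0.87 corresponds to the 13% efficiency loss quoted in the overview.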

## Detailed Performance Breakdown

### Forward Euler Computation

The Forward Euler step is the dominant computational component:

| Configuration | Forward Euler Time (sec) | % of Total Loop | Time per Call (ms) |
|---|---|---|---|
| Low-res | 19.4 | 64.4% | 32.2 |
| Medium-res | 52.5 | 60.4% | 80.1 |
| High-res | 101.2 | 49.0% | 166.4 |
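The per-call column is just the Forward Euler time divided by the call count from the runtime table above:

```python
# Time per call = Forward Euler time / number of Forward Euler calls.
cases = {
    "Low-res":    (19.4, 602),   # (Forward Euler time in sec, calls)
    "Medium-res": (52.5, 656),
    "High-res":   (101.2, 608),
}
for name, (t, calls) in cases.items():
    print(f"{name}: {1e3 * t / calls:.1f} ms per call")
# Reproduces the table to within rounding of the input times
# (32.2, 80.0, 166.4 ms vs. the table's 32.2, 80.1, 166.4 ms).
```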

#### Major Computational Components (% of Forward Euler time)

| Component | Low-res | Medium-res | High-res |
|---|---|---|---|
| Collision terms | 50.8% | 55.3% | 52.8% |
| Collisionless terms | 21.0% | 22.5% | 22.7% |
| Collision moments | 15.2% | 9.1% | 6.6% |
| Sources | 11.9% | 11.4% | 13.5% |
| Boundary fluxes | 6.0% | 5.8% | 8.7% |

### Field Solve Performance

The cost of the Poisson field solve grows faster than the per-GPU workload as resolution increases:

| Configuration | Field Solve Time (sec) | % of Total Loop | Scaling Factor |
|---|---|---|---|
| Low-res | 4.5 | 14.8% | 1.0x |
| Medium-res | 13.7 | 15.8% | 3.1x |
| High-res | 42.9 | 20.8% | 9.5x |

The field solver scales poorly, growing from 15% to 21% of total runtime as resolution increases.

### Boundary Conditions

Boundary condition enforcement scales even worse, with the largest scaling factor of any component:

| Configuration | BC Time (sec) | % of Total Loop | Scaling Factor |
|---|---|---|---|
| Low-res | 3.1 | 10.3% | 1.0x |
| Medium-res | 12.5 | 14.4% | 4.0x |
| High-res | 42.1 | 20.4% | 13.6x |
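The scaling factors in this table and in the field-solve table are each component's time normalized to its low-res value. For reference, the per-GPU workload grows only ~5.3x from low to high resolution, so both components scale well above ideal:

```python
# Scaling factor = component time / low-res component time.
field_solve = {"Low-res": 4.5, "Medium-res": 13.7, "High-res": 42.9}  # seconds
bcs         = {"Low-res": 3.1, "Medium-res": 12.5, "High-res": 42.1}  # seconds

for label, times in (("Field solve", field_solve), ("BCs", bcs)):
    base = times["Low-res"]
    print(label, {k: f"{v / base:.1f}x" for k, v in times.items()})
# Field solve {'Low-res': '1.0x', 'Medium-res': '3.0x', 'High-res': '9.5x'}
# BCs         {'Low-res': '1.0x', 'Medium-res': '4.0x', 'High-res': '13.6x'}
# (3.0x vs. the table's 3.1x reflects rounding of the reported times.)
```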

### I/O Performance

Data output times grow with grid size and the corresponding output volume:

| Configuration | Total I/O (sec) | Species Diag Write (sec) | f Write (sec) |
|---|---|---|---|
| Low-res | 1.0 | 0.57 (56%) | 0.37 (36%) |
| Medium-res | 2.6 | 1.42 (55%) | 1.01 (39%) |
| High-res | 6.5 | 3.56 (55%) | 2.67 (41%) |

## Source files

- TCV_NT_3x2v_IWL_adapt_src_24x16x12x12x8_200steps_1nodes_perlmutter.txt
- TCV_NT_3x2v_IWL_adapt_src_48x32x16x12x8_200steps_2nodes_perlmutter.txt
- TCV_NT_3x2v_IWL_adapt_src_96x64x16x12x8_200steps_4nodes_perlmutter.txt
- TCV_NT_3x2v_IWL_adapt_src.txt
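For reference, a minimal sketch of how timings like the ones tabulated above can be pulled out of these logs. The line format assumed here (`<component> took <X> secs`) is hypothetical; adjust the regex to whatever the actual gkylzero output looks like.

```python
# Sketch: pull component timings out of the run logs listed above.
# ASSUMPTION: each log contains lines like "Field solve took 4.5 secs";
# this format is hypothetical, so adapt the regex to the real output.
import re
import sys

PATTERN = re.compile(r"^(?P<name>[A-Za-z][\w ./-]*?) took (?P<secs>[0-9.eE+-]+) sec",
                     re.MULTILINE)

def parse_timings(path: str) -> dict[str, float]:
    """Return {component name: seconds} for every matching line in the log."""
    with open(path) as f:
        return {m["name"].strip(): float(m["secs"]) for m in PATTERN.finditer(f.read())}

if __name__ == "__main__":
    for path in sys.argv[1:]:  # e.g. the per-node .txt logs above
        print(path)
        for name, secs in sorted(parse_timings(path).items(), key=lambda kv: -kv[1]):
            print(f"  {name}: {secs:.2f} s")
```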