How to determine the most efficient device

Execute benchmarks of your simulation on a variety of device configurations, then compare the results to determine which is the most efficient. Your simulation model, parameters, system size, and available hardware all impact the resulting performance. When benchmarking, make sure that all GPU kernels have completed autotuning and that the memory caches have been warmed up before measuring performance.

For example:

import hoomd
import argparse

kT = 1.2

# Parse command line arguments.
parser = argparse.ArgumentParser()
parser.add_argument('--device', default='CPU')
parser.add_argument('--replicate', default=1, type=int)
parser.add_argument('--steps', default=10_000, type=int)
args = parser.parse_args()

# Create WCA MD simulation
device = getattr(hoomd.device, args.device)()
simulation = hoomd.Simulation(device=device, seed=1)
simulation.create_state_from_gsd(filename='spheres.gsd')
simulation.state.replicate(
    nx=args.replicate,
    ny=args.replicate,
    nz=args.replicate,
)
simulation.state.thermalize_particle_momenta(filter=hoomd.filter.All(), kT=kT)

cell = hoomd.md.nlist.Cell(buffer=0.2)
lj = hoomd.md.pair.LJ(nlist=cell)
lj.params[('A', 'A')] = dict(sigma=1, epsilon=1)
lj.r_cut[('A', 'A')] = 2**(1 / 6)

constant_volume = hoomd.md.methods.ConstantVolume(
    filter=hoomd.filter.All(),
    thermostat=hoomd.md.methods.thermostats.Bussi(kT=kT))

simulation.operations.integrator = hoomd.md.Integrator(
    dt=0.001, methods=[constant_volume], forces=[lj])

# Wait until GPU kernel parameter autotuning is complete.
if args.device == 'GPU':
    simulation.run(100)
    while not simulation.operations.is_tuning_complete:
        simulation.run(100)

# Warm up memory caches and pre-computed quantities.
simulation.run(args.steps)

# Run the benchmark and print the performance.
simulation.run(args.steps)
device.notice(f'TPS: {simulation.tps:0.5g}')

Example Results (N=2048)

On AMD EPYC 7742 (PSC Bridges-2) and NVIDIA A100 (NCSA Delta), this script reports ($ mpirun -n $P python3 determine-the-most-efficient-device.py --device $PROCESSOR):

| Processor | P  | TPS   |
|-----------|----|-------|
| CPU       | 1  | 2699  |
| CPU       | 2  | 4868  |
| CPU       | 4  | 8043  |
| CPU       | 8  | 12585 |
| CPU       | 16 | 18168 |
| CPU       | 32 | 22394 |
| CPU       | 64 | 25031 |
| GPU       | 1  | 15955 |

The optimal device selection depends on the metric. When the metric is wall clock time alone, choose the configuration with the highest benchmark performance. When the metric is cost, choose based on the efficiency of each device configuration.

One cost metric is compute time. Most HPC resources assign cost in CPU core hours; some also assign an effective cost to GPU hours. When they do not, use the ratio of available GPU hours to CPU core hours as a substitute. This example assigns a relative cost of 64 CPU core hours to 1 GPU hour. The efficiency is:

\[\begin{split}\eta = \begin{cases} \frac{S_\mathrm{P\ CPUs}}{S_\mathrm{1\ CPU}} \cdot \frac{1}{P} & \mathrm{CPU} \\ \frac{S_\mathrm{P\ GPUs}}{S_\mathrm{1\ CPU}} \cdot \frac{1}{64 P} & \mathrm{GPU} \\ \end{cases}\end{split}\]

where \(S\) is the relevant performance metric.
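As a sketch, the efficiency formula above can be evaluated directly on the TPS values from the N=2048 table, under this example's assumption that 1 GPU hour costs 64 CPU core hours:

```python
# Benchmark results for N = 2048 (TPS values from the table above).
cpu_tps = {1: 2699, 2: 4868, 4: 8043, 8: 12585, 16: 18168, 32: 22394, 64: 25031}
gpu_tps = {1: 15955}

# Relative cost assumption: 1 GPU hour == 64 CPU core hours.
GPU_COST = 64

s_1_cpu = cpu_tps[1]


def cpu_efficiency(p):
    """eta = (S_P / S_1) * (1 / P) for a run on P CPU cores."""
    return cpu_tps[p] / s_1_cpu / p


def gpu_efficiency(p):
    """eta = (S_P / S_1) * (1 / (64 P)) for a run on P GPUs."""
    return gpu_tps[p] / s_1_cpu / (GPU_COST * p)


for p in cpu_tps:
    print(f'CPU P={p:3d}: eta = {cpu_efficiency(p):.3f}')
for p in gpu_tps:
    print(f'GPU P={p:3d}: eta = {gpu_efficiency(p):.3f}')
```

For P=8 this evaluates to eta of roughly 0.58, consistent with the eta of about 0.6 discussed below, and the single GPU comes in well under every CPU configuration.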

Figure: Performance and efficiency of 2048 particle WCA simulations.

With 2048 particles in this example, the CPU is always more efficient than the GPU and the CPU is faster than the GPU when \(P \ge 16\). Therefore, the CPU is always the optimal choice. Choose a number of ranks \(P\) depending on project needs and budgets. Larger values of \(P\) will provide results with lower latency at the cost of more CPU core hours. In this example, \(P=8\) (\(\eta \sim 0.6\)) is a middle ground providing a significant reduction in time to solution at a moderate extra cost in CPU core hours.

Example Results (N=131,072)

The results are very different with 131,072 particles ($ mpirun -n $P python3 determine-the-most-efficient-device.py --device $PROCESSOR --replicate=4):

| Processor | P   | TPS    |
|-----------|-----|--------|
| CPU       | 1   | 36.072 |
| CPU       | 2   | 61.988 |
| CPU       | 4   | 143.25 |
| CPU       | 8   | 281.35 |
| CPU       | 16  | 502.48 |
| CPU       | 32  | 910.58 |
| CPU       | 64  | 1451.5 |
| CPU       | 128 | 2216.1 |
| CPU       | 256 | 2706.8 |
| GPU       | 1   | 7276.5 |

Figure: Performance and efficiency of 131,072 particle WCA simulations.

At this system size, the GPU is always both faster and more efficient than the CPU.

Compare the two examples and notice that the GPU's TPS is only cut in half when the system size increases by a factor of 64. This signals that the smaller system was not able to utilize all of the parallel processing units on the GPU.
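A quick back-of-the-envelope check makes this concrete: compare throughput in particle time steps per second (N times TPS) rather than raw TPS, using the single-GPU rows from the two tables above:

```python
# Single-GPU benchmark results from the two tables above.
small_n, small_tps = 2_048, 15955.0
large_n, large_tps = 131_072, 7276.5

# Raw TPS roughly halves when the system grows 64x...
print(f'TPS ratio: {large_tps / small_tps:.2f}')  # about 0.46

# ...but throughput in particle time steps per second rises ~29x,
# showing the small system left most of the GPU idle.
small_pps = small_n * small_tps
large_pps = large_n * large_tps
print(f'Particle-steps/s ratio: {large_pps / small_pps:.1f}')
```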

Note

Use trial moves per second (hoomd.hpmc.integrate.HPMCIntegrator.mps) as the performance metric when benchmarking HPMC simulations.