How to determine the most efficient device

Execute benchmarks of your simulation on a variety of device configurations, then compare the results to determine which is the most efficient. Your simulation model, parameters, system size, and available hardware all impact the resulting performance. When benchmarking, make sure that all GPU kernels have completed autotuning and that the memory caches have been warmed up before measuring performance.

For example:

import hoomd
import argparse

kT = 1.2

# Parse command line arguments.
parser = argparse.ArgumentParser()
parser.add_argument('--device', default='CPU')
parser.add_argument('--replicate', default=1, type=int)
parser.add_argument('--steps', default=10_000, type=int)
args = parser.parse_args()

# Create WCA MD simulation
device = getattr(hoomd.device, args.device)()
simulation = hoomd.Simulation(device=device, seed=1)
simulation.create_state_from_gsd(filename='spheres.gsd')
simulation.state.replicate(
    nx=args.replicate,
    ny=args.replicate,
    nz=args.replicate,
)
simulation.state.thermalize_particle_momenta(filter=hoomd.filter.All(), kT=kT)

cell = hoomd.md.nlist.Cell(buffer=0.2)
lj = hoomd.md.pair.LJ(nlist=cell)
lj.params[('A', 'A')] = dict(sigma=1, epsilon=1)
lj.r_cut[('A', 'A')] = 2**(1 / 6)

constant_volume = hoomd.md.methods.ConstantVolume(
    filter=hoomd.filter.All(),
    thermostat=hoomd.md.methods.thermostats.Bussi(kT=kT))

simulation.operations.integrator = hoomd.md.Integrator(
    dt=0.001, methods=[constant_volume], forces=[lj])

# Wait until GPU kernel parameter autotuning is complete.
if args.device == 'GPU':
    simulation.run(100)
    while not simulation.operations.is_tuning_complete:
        simulation.run(100)

# Warm up memory caches and pre-computed quantities.
simulation.run(args.steps)

# Run the benchmark and print the performance.
simulation.run(args.steps)
device.notice(f'TPS: {simulation.tps:0.5g}')

Example Results (N=2048)

On AMD EPYC 7742 (PSC Bridges-2) and NVIDIA A100 (NCSA Delta), this script reports ($ mpirun -n $P python3 determine-the-most-efficient-device.py --device $PROCESSOR):

| Processor | P  | TPS   |
|-----------|----|-------|
| CPU       | 1  | 2699  |
| CPU       | 2  | 4868  |
| CPU       | 4  | 8043  |
| CPU       | 8  | 12585 |
| CPU       | 16 | 18168 |
| CPU       | 32 | 22394 |
| CPU       | 64 | 25031 |
| GPU       | 1  | 15955 |

The optimal device selection depends on the metric. When the metric is wall clock time alone, choose the configuration with the highest benchmark performance. When the metric is cost, choose based on the efficiency of each device configuration.

One cost metric is compute time. Most HPC resources assign cost in CPU core hours; some also assign an effective cost to GPU hours. When they do not, use the ratio of available GPU hours to CPU core hours as a substitute. This example assigns a relative cost of 64 CPU core hours to 1 GPU hour. The efficiency is:

\[\begin{split}\eta = \begin{cases} \frac{S_\mathrm{P\ CPUs}}{S_\mathrm{1\ CPU}} \cdot \frac{1}{P} & \mathrm{CPU} \\ \frac{S_\mathrm{P\ GPUs}}{S_\mathrm{1\ CPU}} \cdot \frac{1}{64 P} & \mathrm{GPU} \\ \end{cases}\end{split}\]

where \(S\) is the relevant performance metric.
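As a sketch, the efficiency formula above can be evaluated directly on the TPS values from the N=2048 table, under this example's assumption that 1 GPU hour costs 64 CPU core hours:

```python
# Benchmark results for N = 2048 (TPS values from the table above).
cpu_tps = {1: 2699, 2: 4868, 4: 8043, 8: 12585, 16: 18168, 32: 22394, 64: 25031}
gpu_tps = {1: 15955}

# Relative cost assumption: 1 GPU hour == 64 CPU core hours.
GPU_COST = 64

s_1_cpu = cpu_tps[1]


def cpu_efficiency(p):
    """eta = (S_P / S_1) * (1 / P) for a run on P CPU cores."""
    return cpu_tps[p] / s_1_cpu / p


def gpu_efficiency(p):
    """eta = (S_P / S_1) * (1 / (64 P)) for a run on P GPUs."""
    return gpu_tps[p] / s_1_cpu / (GPU_COST * p)


for p in cpu_tps:
    print(f'CPU P={p:3d}: eta = {cpu_efficiency(p):.3f}')
for p in gpu_tps:
    print(f'GPU P={p:3d}: eta = {gpu_efficiency(p):.3f}')
```

For P=8 this evaluates to eta of roughly 0.58, consistent with the eta of about 0.6 discussed below, and the single GPU comes in well under every CPU configuration.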

Figure: Performance and efficiency of 2048 particle WCA simulations.

With 2048 particles in this example, the CPU is always more efficient than the GPU and the CPU is faster than the GPU when \(P \ge 16\). Therefore, the CPU is always the optimal choice. Choose a number of ranks \(P\) depending on project needs and budgets. Larger values of \(P\) will provide results with lower latency at the cost of more CPU core hours. In this example, \(P=8\) (\(\eta \sim 0.6\)) is a middle ground providing a significant reduction in time to solution at a moderate extra cost in CPU core hours.

Example Results (N=131,072)

The results are very different with 131,072 particles ($ mpirun -n $P python3 determine-the-most-efficient-device.py --device $PROCESSOR --replicate=4):

| Processor | P   | TPS    |
|-----------|-----|--------|
| CPU       | 1   | 36.072 |
| CPU       | 2   | 61.988 |
| CPU       | 4   | 143.25 |
| CPU       | 8   | 281.35 |
| CPU       | 16  | 502.48 |
| CPU       | 32  | 910.58 |
| CPU       | 64  | 1451.5 |
| CPU       | 128 | 2216.1 |
| CPU       | 256 | 2706.8 |
| GPU       | 1   | 7276.5 |

Figure: Performance and efficiency of 131,072 particle WCA simulations.

At this system size, the GPU is always both faster and more efficient than the CPU.

Compare the two examples and notice that the GPU's TPS is only cut in half when the system size increases by a factor of 64. This signals that the smaller system was not able to utilize all of the parallel processing units on the GPU.
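A quick back-of-the-envelope check makes this concrete: compare throughput in particle time steps per second (N times TPS) rather than raw TPS, using the single-GPU rows from the two tables above:

```python
# Single-GPU benchmark results from the two tables above.
small_n, small_tps = 2_048, 15955.0
large_n, large_tps = 131_072, 7276.5

# Raw TPS roughly halves when the system grows 64x...
print(f'TPS ratio: {large_tps / small_tps:.2f}')  # about 0.46

# ...but throughput in particle time steps per second rises ~29x,
# showing the small system left most of the GPU idle.
small_pps = small_n * small_tps
large_pps = large_n * large_tps
print(f'Particle-steps/s ratio: {large_pps / small_pps:.1f}')
```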

Note

Use trial moves per second (hoomd.hpmc.integrate.HPMCIntegrator.mps) as the performance metric when benchmarking HPMC simulations.