How to determine the most efficient device
Execute benchmarks of your simulation on a variety of device configurations, then compare the results to determine which is the most efficient. Your simulation model, parameters, system size, and available hardware all impact the resulting performance. When benchmarking, make sure that all GPU kernels have completed autotuning and that the memory caches have been warmed up before measuring performance.
For example:
```python
import argparse

import hoomd

kT = 1.2

# Parse command line arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--device", default="CPU")
parser.add_argument("--replicate", default=1, type=int)
parser.add_argument("--steps", default=10_000, type=int)
args = parser.parse_args()

# Create WCA MD simulation.
device = getattr(hoomd.device, args.device)()
simulation = hoomd.Simulation(device=device, seed=1)
simulation.create_state_from_gsd(filename="spheres.gsd")
simulation.state.replicate(
    nx=args.replicate,
    ny=args.replicate,
    nz=args.replicate,
)
simulation.state.thermalize_particle_momenta(filter=hoomd.filter.All(), kT=kT)

cell = hoomd.md.nlist.Cell(buffer=0.2)
lj = hoomd.md.pair.LJ(nlist=cell)
lj.params[("A", "A")] = dict(sigma=1, epsilon=1)
lj.r_cut[("A", "A")] = 2 ** (1 / 6)

constant_volume = hoomd.md.methods.ConstantVolume(
    filter=hoomd.filter.All(), thermostat=hoomd.md.methods.thermostats.Bussi(kT=kT)
)
simulation.operations.integrator = hoomd.md.Integrator(
    dt=0.001, methods=[constant_volume], forces=[lj]
)

# Wait until GPU kernel parameter autotuning is complete.
if args.device == "GPU":
    simulation.run(100)
    while not simulation.operations.is_tuning_complete:
        simulation.run(100)

# Warm up memory caches and pre-computed quantities.
simulation.run(args.steps)

# Run the benchmark and print the performance.
simulation.run(args.steps)
device.notice(f"TPS: {simulation.tps:0.5g}")
```
Example Results (N=2048)
On AMD EPYC 7742 (PSC Bridges-2) and NVIDIA A100 (NCSA Delta), this script reports
(`$ mpirun -n $P python3 determine-the-most-efficient-device.py --device $PROCESSOR`):
| Processor | P | TPS |
|---|---|---|
| CPU | 1 | 2699 |
| CPU | 2 | 4868 |
| CPU | 4 | 8043 |
| CPU | 8 | 12585 |
| CPU | 16 | 18168 |
| CPU | 32 | 22394 |
| CPU | 64 | 25031 |
| GPU | 1 | 15955 |
The optimal device selection depends on the metric. When the metric is wall clock time alone, choose the configuration with the highest measured performance. When the metric is cost, choose based on the efficiency of each device configuration.
One cost metric is compute time. Most HPC resources assign a cost by CPU core hours. Some HPC resources may assign an effective cost to GPUs. When this is not the case, use the ratio of available GPU hours to CPU core hours as a substitute. This example will assign a relative cost of 1 GPU hour to 64 CPU core hours. The efficiency is:
\[
\eta = \frac{S}{C \cdot S_{P=1}},
\]
where \(S\) is the relevant performance metric of the benchmarked configuration, \(C\) is its cost in CPU core hours per wall clock hour (\(C = P\) for the CPU runs and \(C = 64\) for the GPU here), and \(S_{P=1}\) is the performance of a single CPU core.
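As a concrete check, the efficiency values can be computed directly from the N=2048 table above (a minimal sketch in plain Python; the TPS values and the 64 core hour exchange rate are copied from this document):

```python
def efficiency(tps, cost, tps_reference):
    """Performance per unit cost, normalized to one CPU core."""
    return tps / (cost * tps_reference)


# TPS values from the N=2048 benchmark table.
s_cpu_1 = 2699  # CPU, P=1

# A CPU run with P=8 ranks costs 8 core hours per wall clock hour.
print(f"CPU, P=8: eta = {efficiency(12585, 8, s_cpu_1):.2f}")

# One GPU hour is assigned a relative cost of 64 CPU core hours.
print(f"GPU:      eta = {efficiency(15955, 64, s_cpu_1):.2f}")
```

With these inputs the P=8 CPU run gives \(\eta \approx 0.58\) while the GPU gives \(\eta \approx 0.09\), consistent with the conclusion that the CPU is more efficient at this system size.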
With 2048 particles in this example, the CPU is always more efficient than the GPU and the CPU is faster than the GPU when \(P \ge 16\). Therefore, the CPU is always the optimal choice. Choose a number of ranks \(P\) depending on project needs and budgets. Larger values of \(P\) will provide results with lower latency at the cost of more CPU core hours. In this example, \(P=8\) (\(\eta \sim 0.6\)) is a middle ground providing a significant reduction in time to solution at a moderate extra cost in CPU core hours.
Example Results (N=131,072)
The results are very different with 131,072 particles
(`$ mpirun -n $P python3 determine-the-most-efficient-device.py --device $PROCESSOR --replicate=4`):
| Processor | P | TPS |
|---|---|---|
| CPU | 1 | 36.072 |
| CPU | 2 | 61.988 |
| CPU | 4 | 143.25 |
| CPU | 8 | 281.35 |
| CPU | 16 | 502.48 |
| CPU | 32 | 910.58 |
| CPU | 64 | 1451.5 |
| CPU | 128 | 2216.1 |
| CPU | 256 | 2706.8 |
| GPU | 1 | 7276.5 |
At this system size, the GPU is both faster and more efficient than the CPU at every rank count.
Compare the two examples: the TPS achieved by the GPU drops by only about half when the system size increases by a factor of 64. This signals that the smaller system was not able to utilize all of the parallel processing units on the GPU.
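This can be made concrete with plain arithmetic on the reported numbers: comparing particle-steps per second (TPS times particle count) shows how much more total work the GPU performs at the larger size.

```python
# GPU TPS from the two benchmark tables above.
tps_small = 15955  # N = 2048
tps_large = 7276.5  # N = 131,072 (64x more particles)

# Effective work rate in particle-steps per second.
rate_small = 2048 * tps_small
rate_large = 131_072 * tps_large

print(f"TPS dropped by {tps_small / tps_large:.1f}x")
print(f"Work rate grew by {rate_large / rate_small:.0f}x")
```

The GPU sustains roughly 29 times more particle-steps per second on the larger system even though the TPS falls by only about 2.2 times, which is the signature of an underutilized device at N=2048.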
Note
Use trial moves per second (hoomd.hpmc.integrate.HPMCIntegrator.mps) as the performance metric when benchmarking HPMC simulations.