Domain Decomposition¶

Overview¶

Questions¶

What is a MPI rank?
How does HOOMD-blue divide the simulation among the ranks?
What limitations prevent parallel execution?
How should I structure my scripts?

Objectives¶

Demonstrate how MPI ranks are assigned to processes.
Explain how HOOMD-blue divides the simulation State with a domain decomposition and how operations execute only on the local particles.
Demonstrate the minimum domain size.
Discuss how particles are placed in domains and how this can lead to uneven load balancing.
Emphasize that scripts are a single program that can execute in serial or parallel.
Show how to avoid deadlock when using the HOOMD-blue API.

Ranks and Processes¶

When you call mpirun -n 4 python3 script.py, mpirun launches 4 separate instances of python all executing script.py at the same time. For example, a script that prints a message will repeat the output:

[1]:

%pycat hello_world.py

print('Hello, world')

[2]:

!mpirun -n 4 python3 hello_world.py

Hello, world
Hello, world
Hello, world
Hello, world

MPI launches n separate processes. These may or may not be on the same node in a HPC cluster, depending on how you request resources in your job script. Each process launched this way is called a rank and is given a rank index. In HOOMD-blue, the Communicator class (created by default with Device) gives you access to the rank index.

[3]:

%pycat hello_hoomd.py

import os

import hoomd

device = hoomd.device.CPU()
rank = device.communicator.rank
pid = os.getpid()
print(f'Hello HOOMD-blue rank {rank} from process id {pid}')

[4]:

!mpirun -n 4 python3 hello_hoomd.py

Hello HOOMD-blue rank 2 from process id 965308
Hello HOOMD-blue rank 3 from process id 965309
Hello HOOMD-blue rank 0 from process id 965306
Hello HOOMD-blue rank 1 from process id 965307

os.getpid is a Python method that returns the process id, a number assigned to every executing process by the operating system.

Domain Decomposition¶

When you create the State object in an MPI simulation on more than 1 rank, HOOMD-blue splits the simulation box into k x l x m domains. The product of k, l and m is equal to the number of ranks you execute. Choose n values that factorize given the constraints of your HPC system, such as the number of cores per node. The domains are defined by planes that split the box. By default, the planes are evenly spaced and chosen to minimize the surface area between the domains.

[5]:

%pycat domain_decomposition.py

import hoomd

# Initialize the system.
device = hoomd.device.CPU()
sim = hoomd.Simulation(device=device)
sim.create_state_from_gsd(filename='random.gsd')

# Print the domain decomposition.
domain_decomposition = sim.state.domain_decomposition
device.notice(f'domain_decomposition={domain_decomposition}')

# Print the location of the split planes.
split_fractions = sim.state.domain_decomposition_split_fractions
device.notice(f'split_fractions={split_fractions}')

# Print the number of particles on each rank.
with sim.state.cpu_local_snapshot as snap:
    N = len(snap.particles.position)
    print(f'{N} particles on rank {device.communicator.rank}')

[6]:

!mpirun -n 4 python3 domain_decomposition.py

notice(2): Using domain decomposition: n_x = 1 n_y = 2 n_z = 2.
1749 particles on rank 2
1695 particles on rank 1
domain_decomposition=(1, 2, 2)
1707 particles on rank 3
split_fractions=([], [0.5], [0.5])
1761 particles on rank 0

For example, this script chooses a 1 x 2 x 2 decomposition with the split planes in the center of the box when launched with 4 ranks. domain_decomposition_split_fractions reports relative values between 0 and 1, so in this case a hypothetical 10 x 10 x 10 box would have split planes at y=0 and z=0 creating 4 domains.

Each rank is assigned one of these domains and stores the particles located inside it. The operations execute on the particles local to each rank. When the density of the system is uniform, each rank has approximately the same number of particles (as in the example above). This is what allows the parallel simulations to run with faster performance: the same operation is being run on fewer particles so it takes less time.

However, when the density of the system is not uniform the default split planes lead to an uneven load balancing with a much greater number of particles on one rank compared to the others. The performance of the overall simulation is limited by that of the slowest rank. In the extreme case, imagine all the particles in the lower left of a very large box. In this 1 x 2 x 2 domain decomposition, all particles would be on one rank and the parallel simulation would take just as much time to execute as one rank alone.

Some computations, such as pair forces in MD or hard particle overlap checks in HPMC, need to compute interactions with particles from a neighboring domain. This establishes a lower limit on the domain size. Given an interaction range r_interaction (for MD, this is the sum of the largest pair potential r_cut and the neighbor list buffer), each x,y,z dimension of the domain must be larger than 2 * r_interaction. HOOMD-blue raises an exception when this is violated. For example, here is the Lennard-Jones script run on the random.gsd file before replicating to a larger size:

[7]:

%pycat lj_domain_error.py

import hoomd

device = hoomd.device.CPU()
sim = hoomd.Simulation(device=device, seed=1)
sim.create_state_from_gsd(
    filename='../01-Introducing-Molecular-Dynamics/random.gsd')

integrator = hoomd.md.Integrator(dt=0.005)
cell = hoomd.md.nlist.Cell(buffer=0.4)
lj = hoomd.md.pair.LJ(nlist=cell)
lj.params[('A', 'A')] = dict(epsilon=1, sigma=1)
lj.r_cut[('A', 'A')] = 2.5
integrator.forces.append(lj)
nvt = hoomd.md.methods.ConstantVolume(
    filter=hoomd.filter.All(),
    thermostat=hoomd.md.methods.thermostats.Bussi(kT=1.5))
integrator.methods.append(nvt)
sim.operations.integrator = integrator
sim.run(0)

[8]:

!mpirun -n 2 python3 lj_domain_error.py

notice(2): Using domain decomposition: n_x = 1 n_y = 1 n_z = 2.
Traceback (most recent call last):
  File "03-Parallel-Simulations-With-MPI/lj_domain_error.py", line 19, in <module>
Traceback (most recent call last):
  File "03-Parallel-Simulations-With-MPI/lj_domain_error.py", line 19, in <module>
    simulation.run(0)
  File "/home/joaander/build/hoomd/hoomd/simulation.py", line 462, in run
    simulation.run(0)
  File "/home/joaander/build/hoomd/hoomd/simulation.py", line 462, in run
    self._cpp_sys.run(steps_int, write_at_start)
RuntimeError: Communication error -
Simulation box too small for domain decomposition.
r_ghost_max: 2.9
d.z/2: 2.275

    self._cpp_sys.run(steps_int, write_at_start)
RuntimeError: Communication error -
Simulation box too small for domain decomposition.
r_ghost_max: 2.9
d.z/2: 2.275

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[cheme-hodges:965676] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[cheme-hodges:965676] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Single Program¶

HOOMD-blue scripts must be written as a single program. All ranks must load the same input file, define the same operations with the same parameters and triggers, and run the same number of time steps. HOOMD-blue requires this because it splits the system into smaller domains, one assigned to each rank, and executes the same operations on each domain.

While there are many processes executing the same Python script in parallel, they are not independent. The ranks send messages back and forth as needed to combine the decomposed parts of the simulation into a whole. If your script does not follow the single program requirement, it is likely at least one rank will deadlock while it waits for a message to be sent from another rank that will never be sent. A deadlock means that the execution continues while waiting for a condition that will never be true.

While you must create all HOOMD-blue operations, access properties, and call methods on all ranks, this may not be the case for other libraries used in your script. For example, calling print on all ranks results in duplicated output. The same would occur when using open() to open and write to a file. In cases like these, place your code in a if device.communicator.rank == 0: check so that it only runs once on rank 0. For example, this script prints the total kinetic energy of the system only once:

[9]:

%pycat lj_kinetic_energy.py

import hoomd

# Initialize the simulation.
device = hoomd.device.CPU()
sim = hoomd.Simulation(device=device, seed=1)
sim.create_state_from_gsd(filename='random.gsd')

# Set the operations for a Lennard-Jones particle simulation.
integrator = hoomd.md.Integrator(dt=0.005)
cell = hoomd.md.nlist.Cell(buffer=0.4)
lj = hoomd.md.pair.LJ(nlist=cell)
lj.params[('A', 'A')] = dict(epsilon=1, sigma=1)
lj.r_cut[('A', 'A')] = 2.5
integrator.forces.append(lj)
nvt = hoomd.md.methods.ConstantVolume(
    filter=hoomd.filter.All(),
    thermostat=hoomd.md.methods.thermostats.Bussi(kT=1.5))
integrator.methods.append(nvt)
sim.operations.integrator = integrator

# Instantiate a ThermodyanmicQuantities object to compute kinetic energy.
thermodynamic_properties = hoomd.md.compute.ThermodynamicQuantities(
    filter=hoomd.filter.All())
sim.operations.computes.append(thermodynamic_properties)

# Run the simulation.
sim.run(1000)

# Access the system kinetic energy on all ranks.
kinetic_energy = thermodynamic_properties.kinetic_energy

# Print the kinetic energy only on rank 0.
if device.communicator.rank == 0:
    print(kinetic_energy)

[10]:

!mpirun -n 4 python3 lj_kinetic_energy.py

notice(2): Using domain decomposition: n_x = 1 n_y = 2 n_z = 2.
15406.242760147874

The pattern used here is important:

kinetic_energy = thermodynamic_properties.kinetic_energy
if device.communicator.rank == 0:
    print(kinetic_energy)

The property thermodynamic_properties.kinetic_energy is accessed on all ranks, but printed only on rank 0. You must use this same pattern any time you access operation’s properties or call their methods, not just when calling print.

To see why this is important, try the following code in an interactive job:

if device.communicator.rank == 0:
    print(thermodynamic_properties.kinetic_energy)

When you execute this, you will find that it prints nothing and the execution continues indefinitely (press Ctrl-C to stop it).

Each rank stores only a fraction of the total particles in the system. ThermodynamicProperties first computes the kinetic energy from the particles local to each rank, then communicates between the ranks to sum the total system kinetic energy. When you access kinetic_energy only on rank 0, rank 0 sums the local kinetic energy and then deadlocks while it waits for messages from the other ranks. The messages will never arrive because none of the other ranks access the kinetic_energy property, so they do not compute the kinetic energy on their local particles, nor do they communicate with the other ranks.

So, be careful using if device.communicator.rank == 0:. HOOMD-blue has a rich Python API, but any property access or method call on a HOOMD-blue object may result in a MPI communication that will deadlock when inside this condition.

Scripts using if device.communicator.rank == 0: are compatible with serial execution where rank is always 0.

This demonstration uses print as an example for pedagogical purposes. Note that you can use device.notice(f'{thermodynamic_properties.kinetic_energy}')) to print messages as well. In this case, notice checks the rank index itself and only prints on one rank.

Summary¶

In this section, you have seen how MPI ranks run as independent processes, learned how HOOMD splits particles across domains and why HOOMD-blue scripts need to execute all operations identically on all ranks, and identified how to to print output only once in MPI simulations without causing deadlock. The next section of this tutorial shows you how to access the system configuration in MPI simulations.