Continuing Simulations¶
Overview¶
Questions¶
How do I continue running a simulation?
Objectives¶
Explain why you may want to continue running a simulation, for example due to wall time limits on cluster jobs.
Describe what you need to consider when writing a workflow step that can continue.
Demonstrate how to append to trajectory files, write needed data to a restart file, and limit the simulation run to a given wall time.
Boilerplate Code¶
[1]:
import math
import flow
import hoomd
Workflow Steps From the Previous Section¶
The code in the next block collects the workflow steps from the previous tutorial section to define the whole workflow.
[2]:
def create_simulation(job):
    cpu = hoomd.device.CPU()
    simulation = hoomd.Simulation(device=cpu, seed=job.statepoint.seed)
    mc = hoomd.hpmc.integrate.ConvexPolyhedron()
    mc.shape["octahedron"] = dict(
        vertices=[
            (-0.5, 0, 0),
            (0.5, 0, 0),
            (0, -0.5, 0),
            (0, 0.5, 0),
            (0, 0, -0.5),
            (0, 0, 0.5),
        ]
    )
    simulation.operations.integrator = mc

    return simulation


class Project(flow.FlowProject):
    pass


@Project.pre.true("initialized")
@Project.post.true("randomized")
@Project.operation
def randomize(job):
    simulation = create_simulation(job)
    simulation.create_state_from_gsd(filename=job.fn("lattice.gsd"))
    simulation.run(10e3)
    hoomd.write.GSD.write(
        state=simulation.state, mode="xb", filename=job.fn("random.gsd")
    )
    job.document["randomized"] = True


@Project.pre.after(randomize)
@Project.post.true("compressed_step")
@Project.operation
def compress(job):
    simulation = create_simulation(job)
    simulation.create_state_from_gsd(filename=job.fn("random.gsd"))

    a = math.sqrt(2) / 2
    V_particle = 1 / 3 * math.sqrt(2) * a**3

    initial_box = simulation.state.box
    final_box = hoomd.Box.from_box(initial_box)
    final_box.volume = (
        simulation.state.N_particles * V_particle / job.statepoint.volume_fraction
    )
    compress = hoomd.hpmc.update.QuickCompress(
        trigger=hoomd.trigger.Periodic(10), target_box=final_box
    )
    simulation.operations.updaters.append(compress)

    periodic = hoomd.trigger.Periodic(10)
    tune = hoomd.hpmc.tune.MoveSize.scale_solver(
        moves=["a", "d"],
        target=0.2,
        trigger=periodic,
        max_translation_move=0.2,
        max_rotation_move=0.2,
    )
    simulation.operations.tuners.append(tune)

    while not compress.complete and simulation.timestep < 1e6:
        simulation.run(1000)

    if not compress.complete:
        message = "Compression failed to complete."
        raise RuntimeError(message)

    hoomd.write.GSD.write(
        state=simulation.state, mode="xb", filename=job.fn("compressed.gsd")
    )

    job.document["compressed_step"] = simulation.timestep
Motivation¶
Let’s say your workflow’s equilibration step takes 96 hours to complete and your HPC resource limits wall times to 24 hours. What do you do?
One solution is to write the equilibration step so that it can continue where it left off. When you execute the workflow, each incomplete signac job will move toward completing the step’s post condition. After several rounds of submissions, all signac jobs will be complete.
This section of the tutorial teaches you how to write a workflow step that can limit its run time and continue. The next section will teach you how to run workflow steps in cluster jobs on HPC resources.
Considerations¶
You must carefully design your workflow step so that it can continue from where it left off:

- Write the current state of the system to a GSD file and dynamic parameters to the job document (or another appropriate storage location). Perform this write in a finally: block to ensure that the state is written even when an exception is thrown.
- Use the saved state when continuing the workflow step.
- Open output files in append mode so that the final file includes output from the first and all continued executions.
- Use absolute time step values for triggers so they run consistently before and after continuing the workflow step.
- Check the elapsed wall time in a loop and stop executing before the cluster job's wall time limit. Provide some buffer to write the simulation state and exit cleanly.
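The considerations above can be condensed into a small, simulation-agnostic sketch. This pure-Python example needs neither HOOMD-blue nor signac; continuable_step, WALLTIME_LIMIT, and the JSON restart file are hypothetical stand-ins for the tutorial's restart.gsd and job document, used only to illustrate the restore / chunked-run / finally-save pattern:

```python
import json
import pathlib
import time

WALLTIME_LIMIT = 5.0  # seconds; leave a buffer before the real cluster limit
CHUNK = 1_000  # timesteps to advance per loop iteration
END_STEP = 10_000  # target timestep for this workflow step


def continuable_step(workdir: pathlib.Path) -> int:
    """Advance a fake 'simulation' toward END_STEP, saving state for restarts."""
    restart = workdir / "restart.json"

    # 1. Restore saved state if a previous execution wrote it.
    if restart.is_file():
        state = json.loads(restart.read_text())
    else:
        state = {"timestep": 0}

    start = time.monotonic()
    last_chunk_time = 0.0
    try:
        while state["timestep"] < END_STEP:
            t0 = time.monotonic()
            # Stand-in for simulation.run(...): advance by one chunk.
            state["timestep"] += min(CHUNK, END_STEP - state["timestep"])
            last_chunk_time = time.monotonic() - t0

            # 2. Stop early if the next chunk would likely exceed the limit,
            #    using the previous chunk's duration as the estimate.
            if time.monotonic() - start + last_chunk_time >= WALLTIME_LIMIT:
                break
    finally:
        # 3. Always save the state, even when an exception is raised.
        restart.write_text(json.dumps(state))

    return state["timestep"]
```

Executing continuable_step repeatedly makes progress each time and picks up from the saved state, which is the same behavior the HOOMD-blue equilibrate operation below implements with restart.gsd and the job document.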
Here is the equilibration code from the Introducing HOOMD-blue tutorial as a signac-flow operation that can continue:
[3]:
N_EQUIL_STEPS = 200000  # Number of timesteps to run during equilibration.
HOOMD_RUN_WALLTIME_LIMIT = 30  # Time in seconds at which to stop the operation.


@Project.pre.after(compress)  # Execute after compress completes.
# Complete after N_EQUIL_STEPS made by this workflow step.
@Project.post(
    lambda job: job.document.get("timestep", 0) - job.document["compressed_step"]
    >= N_EQUIL_STEPS
)
@Project.operation
def equilibrate(job):
    end_step = job.document["compressed_step"] + N_EQUIL_STEPS

    simulation = create_simulation(job)

    # Restore the tuned move size parameters from a previous execution.
    simulation.operations.integrator.a = job.document.get("a", {})
    simulation.operations.integrator.d = job.document.get("d", {})

    if job.isfile("restart.gsd"):
        # Read the final system configuration from a previous execution.
        simulation.create_state_from_gsd(filename=job.fn("restart.gsd"))
    else:
        # Or read `compressed.gsd` for the first execution of equilibrate.
        simulation.create_state_from_gsd(filename=job.fn("compressed.gsd"))

    # Write `trajectory.gsd` in append mode.
    gsd_writer = hoomd.write.GSD(
        filename=job.fn("trajectory.gsd"),
        trigger=hoomd.trigger.Periodic(10_000),
        mode="ab",
    )
    simulation.operations.writers.append(gsd_writer)

    # Tune move sizes for the first 5000 steps of the equilibration step.
    tune = hoomd.hpmc.tune.MoveSize.scale_solver(
        moves=["a", "d"],
        target=0.2,
        trigger=hoomd.trigger.And(
            [
                hoomd.trigger.Periodic(100),
                hoomd.trigger.Before(job.document["compressed_step"] + 5_000),
            ]
        ),
    )
    simulation.operations.tuners.append(tune)

    try:
        # Loop until the simulation reaches the target timestep.
        while simulation.timestep < end_step:
            # Run the simulation in chunks of 10,000 time steps.
            simulation.run(min(10_000, end_step - simulation.timestep))

            # End the workflow step early if the next run would exceed the
            # allotted walltime. Use the walltime of the current run as
            # an estimate for the next.
            if (
                simulation.device.communicator.walltime + simulation.walltime
                >= HOOMD_RUN_WALLTIME_LIMIT
            ):
                break
    finally:
        # Write the state of the system to `restart.gsd`.
        hoomd.write.GSD.write(
            state=simulation.state, mode="wb", filename=job.fn("restart.gsd")
        )

        # Store the current timestep and tuned trial move sizes.
        job.document["timestep"] = simulation.timestep
        job.document["a"] = simulation.operations.integrator.a.to_base()
        job.document["d"] = simulation.operations.integrator.d.to_base()

        walltime = simulation.device.communicator.walltime
        simulation.device.notice(
            f"{job.id} ended on step {simulation.timestep} after {walltime} seconds"
        )
When this workflow step is executed, it stores the trial move sizes a and d and the current timestep in the job document, as well as the state of the simulation in restart.gsd. It reads these when starting again to continue from where the previous execution stopped. This is a large code block; see the comments for more details on how this workflow step can continue from where it stopped.
To limit the execution time, it splits the total simulation length into chunks and executes them in a loop. After each loop iteration, it checks whether the next call to run is likely to exceed the given time limit. simulation.device.communicator.walltime gives the elapsed time from the start of the workflow step's execution and is identical on all MPI ranks; using another source of time might lead to deadlocks. As a pedagogical example, this tutorial sets a 30 second wall time limit and uses 10,000 timestep chunks. In practice, you will likely set limits of hours to days and use larger chunks of 100,000 or 1,000,000 steps, depending on your simulation's performance. Set the chunk size large enough to avoid the small overhead of each call to run, while still breaking the complete execution into many chunks.
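One rough rule of thumb for picking a chunk size (this heuristic and the function choose_chunk_size are not part of the tutorial or any library; they are a hypothetical sketch) is to measure your simulation's performance in timesteps per second and size each chunk so that a healthy number of chunks fit within the wall time limit:

```python
def choose_chunk_size(steps_per_second, walltime_limit, min_chunks=10):
    """Suggest a chunk size (in timesteps) so that roughly `min_chunks`
    chunks fit within `walltime_limit` seconds at the measured performance.

    steps_per_second: measured simulation performance (timesteps / second).
    walltime_limit: wall time budget for the workflow step, in seconds.
    min_chunks: desired minimum number of chunks within the budget, so the
        early-exit check runs often enough to stop before the limit.
    """
    return max(1, int(steps_per_second * walltime_limit / min_chunks))
```

For example, a simulation running at 350 timesteps per second with a 24 hour (86,400 second) budget and 20 chunks would use chunks of about 1.5 million steps, which keeps the per-run overhead negligible while still checking the elapsed wall time many times per execution.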
The equilibrate step is ready to execute:
[4]:
project = Project()
project.print_status(
overview=False, detailed=True, hide_progress=True, parameters=["volume_fraction"]
)
Detailed View:
job id operation/group volume_fraction labels
-------------------------------- ----------------- ----------------- --------
59363805e6f46a715bc154b38dffc4e4 equilibrate [U] 0.6
972b10bd6b308f65f0bc3a06db58cf9d equilibrate [U] 0.4
c1a59a95a0e8b4526b28cf12aa0a689e equilibrate [U] 0.5
[U]:unknown [R]:registered [I]:inactive [S]:submitted [H]:held [Q]:queued [A]:active [E]:error [GR]:group_registered [GI]:group_inactive [GS]:group_submitted [GH]:group_held [GQ]:group_queued [GA]:group_active [GE]:group_error
Execute it:
[5]:
project.run()
972b10bd6b308f65f0bc3a06db58cf9d ended on step 42000 after 28.671408 seconds
59363805e6f46a715bc154b38dffc4e4 ended on step 33000 after 28.103241 seconds
c1a59a95a0e8b4526b28cf12aa0a689e ended on step 32000 after 23.221612 seconds
The equilibrate step executed for less than HOOMD_RUN_WALLTIME_LIMIT seconds for each of the signac jobs in the dataspace. In a production environment, you would run the project repeatedly until it completes.
See that the equilibrate step produced the trajectory.gsd file and the 'a', 'd', and 'timestep' items in the job document:
[6]:
!ls workspace/*
workspace/59363805e6f46a715bc154b38dffc4e4:
compressed.gsd restart.gsd trajectory.gsd
lattice.gsd signac_job_document.json
random.gsd signac_statepoint.json
workspace/972b10bd6b308f65f0bc3a06db58cf9d:
compressed.gsd restart.gsd trajectory.gsd
lattice.gsd signac_job_document.json
random.gsd signac_statepoint.json
workspace/c1a59a95a0e8b4526b28cf12aa0a689e:
compressed.gsd restart.gsd trajectory.gsd
lattice.gsd signac_job_document.json
random.gsd signac_statepoint.json
[7]:
job = project.open_job(dict(N_particles=128, volume_fraction=0.6, seed=20))
print(job.document)
{'initialized': True, 'randomized': True, 'compressed_step': 13000, 'timestep': 33000, 'a': {'octahedron': 0.043724190744099806}, 'd': {'octahedron': 0.024415899029095592}}
Summary¶
In this section of the tutorial, you defined the workflow step to equilibrate the hard particle simulation. It stores the dynamic parameters and the state of the system needed to continue execution when it is executed again. Now the directory for each simulation contains trajectory.gsd, which will be ready for analysis once the workflow has executed to completion.
The next section in this tutorial will show you how to implement this workflow on the command line and submit cluster jobs that effectively use dense nodes.
This tutorial only teaches the basics of signac-flow. Read the signac-flow documentation to learn more.