Continuing Simulations¶
Overview¶
Questions¶
How do I run a single simulation over multiple submissions?
Objectives¶
Explain why you may want to continue running a simulation, for example to work within wall time limits for cluster jobs.
Show how to write an action that can continue.
Demonstrate how to append to trajectory files, write needed data to a restart file and limit the simulation run to a given wall time.
Boilerplate code¶
[1]:
import signac
Motivation¶
Let’s say your workflow’s equilibration action takes 96 hours to complete and your HPC system limits individual job wall times to 24 hours. How can you complete your simulations?
One solution is to write the equilibration action so that it can continue from where it left off. Every time you submit, each incomplete directory will move toward completion. After several rounds, all directories will be complete.
This section of the tutorial teaches you how to write an action that can limit its run time and continue running where it left off. The next section will teach you how to run workflow steps in cluster jobs on HPC systems.
Implementation¶
You must carefully design your action so that it can continue from where it left off. It must (see the sketch after this list):
Exit cleanly before the job scheduler kills it.
Write the current state of the system to a GSD file and dynamic parameters to the job document (or other appropriate storage location).
Use the saved state when continuing the action on the same directory.
Open output files in append mode so that the final file includes output from the first and all continued executions.
Use absolute time step values for triggers so that they run consistently before and after continuing the action.
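Before looking at the full HOOMD-blue code, here is a schematic sketch of this pattern in plain Python. Every name and value in it (restart.json, log.txt, the step counts, and the time limit) is a placeholder for illustration, not part of the tutorial's workflow:

import json
import os
import time

END_STEP = 100_000  # absolute final timestep (placeholder)
CHUNK = 10_000  # steps advanced per chunk (placeholder)
TIME_LIMIT_SECONDS = 60  # time budget for this execution (placeholder)


def continuable_action(directory):
    start = time.monotonic()
    restart = os.path.join(directory, "restart.json")

    # Use the saved state when it exists; otherwise start fresh.
    if os.path.exists(restart):
        with open(restart) as f:
            state = json.load(f)
    else:
        state = {"step": 0}

    # Open output in append mode so the file accumulates results from the
    # first and all continued executions.
    with open(os.path.join(directory, "log.txt"), "a") as log:
        while state["step"] < END_STEP:
            state["step"] += CHUNK  # stand-in for running one simulation chunk
            log.write(f"reached step {state['step']}\n")
            # Exit cleanly before the scheduler would kill the job.
            if time.monotonic() - start >= TIME_LIMIT_SECONDS:
                break

    # Save the current state so the next execution can continue from it.
    with open(restart, "w") as f:
        json.dump(state, f)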
Here is the equilibration code from the Introducing HOOMD-blue tutorial, modified to work as a row action that can continue:
[3]:
%pycat equilibrate.py
import os

import hoomd
from create_simulation import create_simulation

# Set simulation parameters.
N_EQUILIBRATION_STEPS = 200000

# Row passes the job's partition configuration and walltime via environment
# variables.
RANKS_PER_PARTITION = int(os.environ.get("ACTION_PROCESSES_PER_DIRECTORY", "1"))
CLUSTER_JOB_WALLTIME_MINUTES = int(os.environ.get("ACTION_WALLTIME_IN_MINUTES", "60"))

# Allow up to 10 minutes for Python to launch and files to be written at the end.
# You may need to increase this buffer time on HPC systems with slow filesystems.
HOOMD_RUN_WALLTIME_LIMIT_SECONDS = CLUSTER_JOB_WALLTIME_MINUTES * 60 - 600


def equilibrate(*jobs):
    # Execute N_PARTITIONS job(s) in parallel on N_PARTITIONS * RANKS_PER_PARTITION
    # MPI ranks.
    communicator = hoomd.communicator.Communicator(
        ranks_per_partition=RANKS_PER_PARTITION
    )
    job = jobs[communicator.partition]

    simulation = create_simulation(job, communicator)

    # Determine the final timestep of the equilibration process. Use the recorded
    # value of `compressed_step` so that `end_step` is set consistently when
    # continuing.
    end_step = job.document["compressed_step"] + N_EQUILIBRATION_STEPS

    # Restore tuned trial moves from a previous execution of equilibrate.
    simulation.operations.integrator.a = job.document.get("a", {})
    simulation.operations.integrator.d = job.document.get("d", {})

    # Continue from a previous `restart.gsd` or start from the end of the
    # *compress* action.
    if job.isfile("restart.gsd"):
        simulation.create_state_from_gsd(filename=job.fn("restart.gsd"))
    else:
        simulation.create_state_from_gsd(filename=job.fn("compressed.gsd"))

    # Append trajectory frames to `trajectory.gsd.in_progress`.
    gsd_writer = hoomd.write.GSD(
        filename=job.fn("trajectory.gsd.in_progress"),
        trigger=hoomd.trigger.Periodic(10_000),
        mode="ab",
    )
    simulation.operations.writers.append(gsd_writer)

    # Tune trial move sizes during the first 5,000 steps of the equilibration.
    tune = hoomd.hpmc.tune.MoveSize.scale_solver(
        moves=["a", "d"],
        target=0.2,
        trigger=hoomd.trigger.And(
            [
                hoomd.trigger.Periodic(100),
                hoomd.trigger.Before(job.document["compressed_step"] + 5_000),
            ]
        ),
    )
    simulation.operations.tuners.append(tune)

    # Run the simulation in chunks. After each call to `run` completes, continue
    # running only if the next run is expected to complete within the allotted
    # time.
    while simulation.timestep < end_step:
        simulation.run(min(10_000, end_step - simulation.timestep))
        if (
            simulation.device.communicator.walltime + simulation.walltime
            >= HOOMD_RUN_WALLTIME_LIMIT_SECONDS
        ):
            break

    # Save the current state of the simulation to the filesystem.
    hoomd.write.GSD.write(
        state=simulation.state, mode="wb", filename=job.fn("restart.gsd")
    )
    job.document["a"] = simulation.operations.integrator.a.to_base()
    job.document["d"] = simulation.operations.integrator.d.to_base()

    walltime = simulation.device.communicator.walltime
    simulation.device.notice(
        f"{job.id} ended on step {simulation.timestep} after {walltime} seconds"
    )

    # Row will mark the equilibrate action complete when the file `trajectory.gsd`
    # exists. To accomplish this, rename `trajectory.gsd.in_progress` when the
    # completion condition is met.
    if simulation.timestep >= end_step:
        os.rename(job.fn("trajectory.gsd.in_progress"), job.fn("trajectory.gsd"))
When this action is executed, it stores the trial move sizes a and d in the job document as well as the state of the simulation in restart.gsd. It reads these when starting again to continue from where the previous execution stopped. This is a large code block; see the comments for more details on how this action can continue from where it stopped.
To limit the execution time, it splits the total simulation length into chunks and executes them in a loop. After each loop iteration, it checks whether the next call to run is likely to exceed the given time limit. simulation.device.communicator.walltime gives the elapsed time from the start of the action's execution, and is identical on all MPI ranks (using another source of time will lead to deadlocks).
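For example, if every MPI rank measured elapsed time independently with Python's time module, small differences between ranks could cause some ranks to break out of the loop while others call run again; the ranks would then wait on each other indefinitely.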
If you noticed that equilibrate doesn't loop over jobs like randomize and compress did, that is very observant! The next section of the tutorial will explain how to use partitions.
Here is the workflow.toml that describes the equilibrate action (along with the previously covered actions):
[4]:
with open("workflow.toml", "w") as workflow:
    workflow.write("""
[default.action]
command = "python project.py --action $ACTION_NAME {directories}"

[[action]]
name = "randomize"
products = ["random.gsd"]
resources.walltime.per_directory = "00:05:00"

[[action]]
name = "compress"
previous_actions = ["randomize"]
products = ["compressed.gsd"]
resources.walltime.per_directory = "00:10:00"

[[action]]
name = "equilibrate"
previous_actions = ["compress"]
products = ["trajectory.gsd"]

[action.group]
maximum_size = 1

[action.resources]
walltime.per_submission = "00:11:00"
""")
As a pedagogical example, this tutorial sets an 11-minute wall time limit (of which equilibrate allows up to 10 minutes of overhead) and executes the run in 10,000 timestep chunks. In practice, you will likely set limits from hours to days and use larger chunks of 100,000 or 1,000,000 steps, depending on your simulation's performance. You should set the chunk size large enough to avoid the overhead from each call to run while at the same time breaking the complete execution into many chunks.
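For example, if your simulation advances roughly 1,000 timesteps per second (a made-up rate; measure your own), each 100,000 step chunk takes about 100 seconds, so a 24 hour wall time limit leaves room for hundreds of chunks while keeping the per-call overhead negligible.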
This tutorial section sets action.group.maximum_size = 1 so that row will execute equilibrate on one directory at a time interactively. The next section will explain how to configure groups on HPC systems.
The equilibrate step is eligible to execute:
[5]:
! row show status
Action Completed Submitted Eligible Waiting Remaining cost
randomize 3 0 0 0
compress 3 0 0 0
equilibrate 0 0 3 0 1 CPU-hours
Submit one directory (on a workstation, this executes interactively):
[6]:
! row submit -n 1
Submitting 1 job that may cost up to 0 CPU-hours.
[1/1] Submitting action 'equilibrate' on directory 59363805e6f46a715bc154b38dffc4e4 (0ms).
59363805e6f46a715bc154b38dffc4e4 ended on step 73000 after 52.133713 seconds
As expected, the equilibrate step executed for less than 60 seconds. The wall time set in the row action is 11 minutes. equilibrate.py reads this via the ACTION_WALLTIME_IN_MINUTES environment variable and then subtracts 10 minutes to allow for the time it takes Python to launch and to write output files after run completes.
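Concretely: 11 minutes is 660 seconds, so HOOMD_RUN_WALLTIME_LIMIT_SECONDS is 660 - 600 = 60 seconds here. In the run above, each 10,000 step chunk took roughly 9 seconds, so after about 52 seconds the next chunk was not expected to finish within the 60 second budget and the loop exited.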
The equilibrate step produced the trajectory.gsd.in_progress file and added the 'a' and 'd' keys to the job document:
[7]:
! ls workspace/*
workspace/59363805e6f46a715bc154b38dffc4e4:
compressed.gsd signac_job_document.json
lattice.gsd signac_statepoint.json
random.gsd trajectory.gsd.in_progress
restart.gsd
workspace/972b10bd6b308f65f0bc3a06db58cf9d:
compressed.gsd random.gsd signac_statepoint.json
lattice.gsd signac_job_document.json
workspace/c1a59a95a0e8b4526b28cf12aa0a689e:
compressed.gsd random.gsd signac_statepoint.json
lattice.gsd signac_job_document.json
[8]:
project = signac.init_project()
job = project.open_job(dict(N_particles=128, volume_fraction=0.6, seed=20))
print(job.document)
{'compressed_step': 13000, 'a': {'octahedron': 0.04611356829439997}, 'd': {'octahedron': 0.025170334040645358}}
Check the workflow status:
[9]:
! row show status
Action Completed Submitted Eligible Waiting Remaining cost
randomize 3 0 0 0
compress 3 0 0 0
equilibrate 0 0 3 0 1 CPU-hours
There are still 3 equilibrate actions eligible to execute. Submitting again would continue the first from where it left off. Eventually, a submission will reach the completion condition, which renames trajectory.gsd.in_progress to trajectory.gsd, and row will mark the equilibrate action complete for that directory.
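On a workstation you could simply run row submit again by hand after each round. If you prefer to script the resubmission, here is a minimal sketch (the three rounds are arbitrary, and the assumption that row submit exits normally when nothing is eligible should be checked against your row version):

import subprocess

# Sketch: resubmit a few rounds so each incomplete directory advances
# toward completion. Long equilibrations may need many more rounds.
# Assumption: `row submit` exits with status 0 when nothing is eligible.
for _ in range(3):
    subprocess.run(["row", "submit", "-n", "1"], check=True)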
Summary¶
In this section of the tutorial, you defined an action to equilibrate a hard particle simulation. It stores the dynamic parameters and the state of the system needed to continue execution in later submissions. After you submit these actions multiple times, the directory for each simulation will contain trajectory.gsd and be ready for analysis.
The next section in this tutorial will show you how to submit cluster jobs that effectively use dense nodes on HPC systems.
This tutorial only teaches the basics of row. Read the row documentation to learn more.