Restartable jobs¶
Overview¶
The ideal restartable job is a single job script that can be resubmitted over and over again to the job queue system. Each time the job starts, it picks up where it left off the last time and continues running until it is done. You can put all the logic necessary to do this in the hoomd python script itself, keeping the submission script simple:
# job.sh
mpirun hoomd run.py
With a properly configured python script, qsub job.sh is all that is necessary to submit the first run, continue a previous job that exited cleanly, and continue one that was prematurely killed.
Elements of a restartable script¶
A restartable job script needs to:
- Choose between an initial condition and the restart file when initializing.
- Write a restart file periodically.
- Set a phase for all analysis, dump, and update commands.
- Use hoomd.run_upto() to skip over time steps that were run in previous job submissions.
- Use only commands that are restart capable.
- Optionally ensure that jobs cleanly exit before the job walltime limit.
Choose the appropriate initialization file¶
Let’s assume that the initial condition for the simulation is in init.gsd, and restart.gsd is saved periodically as the job runs. A single hoomd.init.read_gsd() command will load the restart file if it exists; otherwise it will load the initial file. It is easiest to think about dump files, temperature ramps, etc. if init.gsd is at time step 0:
init.read_gsd(filename='init.gsd', restart='restart.gsd')
If you generate your initial configuration in python, you will need to add some logic to read restart.gsd if it exists, or generate the configuration if it does not. The details are workflow dependent; one possible approach is sketched below.
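For example, assuming the restart file is named restart.gsd as above, you might check for it explicitly (the create_lattice call here is only a placeholder for your own generation code):
import os

if os.path.exists('restart.gsd'):
    system = init.read_gsd(filename='restart.gsd')
else:
    # Placeholder: replace this with your own initial configuration generation.
    system = init.create_lattice(unitcell=lattice.sc(a=1.5), n=10)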
Write restart files¶
You cannot predict when a hardware failure will cause your job to fail, so you need to save restart files at regular intervals as your run progresses. If you do not manage wall time to ensure clean job exits, you will also need to write restart files at a fast rate so that little work is lost when the queue kills the job.
First, you need to select a restart period. The compute center you run on may offer a tool to help you determine an optimal restart period in minutes. A good starting point is to write a restart file every hour. Based on performance benchmarks, select a restart period in time steps:
dump.gsd(filename="restart.gsd", group=group.all(), truncate=True, period=10000, phase=0)
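For example, if your benchmarks report the simulation speed in time steps per second (the number below is purely illustrative), you can convert an hourly restart interval into a period in time steps:
# Illustrative benchmark result: ~2800 time steps per second on this node count.
steps_per_second = 2800
restart_period = steps_per_second * 3600  # roughly one hour of wall time between restart files

dump.gsd(filename="restart.gsd", group=group.all(), truncate=True, period=restart_period, phase=0)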
Use the phase option¶
Set a phase >= 0 for all analysis routines, file dumps, and updaters you use with period > 1 (the default is phase=0). With phase >= 0, these routines will continue to run in a restarted job on the correct time steps, as if the job had never been restarted.
Do not use phase=-1, as these routines would then start running immediately when a restarted job begins:
dump.dcd(filename="trajectory.dcd", period=1e6, phase=0)
analyze.log(filename='temperature.log', quantities=['temperature'], period=5000, phase=0)
zeroer = update.zero_momentum(period=1e6, phase=0)
Use run_upto¶
hoomd.run_upto() runs the simulation up to time step n. Use it in restartable jobs so that they run a given total number of steps, independent of the number of submissions needed to reach that count:
run_upto(100e6)
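To see the difference from a plain run(), consider a hypothetical second submission in which restart.gsd left the system at time step 40,000,000:
# Hypothetical second submission: the restart file placed the system at step 40,000,000.
print(get_step())  # -> 40000000
run_upto(100e6)    # runs only the remaining 60e6 steps
# run(100e6) would run 100e6 additional steps, which is not what a restartable job wants.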
Use restart capable commands¶
Most commands in hoomd that output to files are capable of appending to the end of a file so that restarted jobs continue adding data to the file as if the job had never been restarted.
However, not all features in hoomd are capable of restarting, and some are not even capable of appending to files. Check the documentation for each individual command you use to determine whether it is compatible with restartable jobs. For those that are restart capable, do not set overwrite=True, or each time the job restarts it will erase the file and start writing a new one.
Some analysis routines in HOOMD-blue store internal state and may require a period that is commensurate with the restart period. Again, see the documentation of the individual command to find out whether this is the case.
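For instance, analyze.log appends to its file across submissions with the default overwrite=False; overwrite=True is shown below only as a counterexample:
# Appends to the existing file when the job is restarted (the default behavior).
analyze.log(filename='temperature.log', quantities=['temperature'], period=5000, phase=0)

# Counterexample: do NOT do this in a restartable job; the file is erased on every submission.
# analyze.log(filename='temperature.log', quantities=['temperature'], period=5000, phase=0, overwrite=True)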
Cleanly exit before the walltime limit¶
Job queues will kill your job when it reaches the walltime limit. HOOMD can stop your run before that happens and give your job time to exit cleanly. Set the environment variable HOOMD_WALLTIME_STOP to enable this. Any hoomd.run() or hoomd.run_upto() command will exit before the specified time is reached: HOOMD monitors run performance and tries to ensure that the run ends before HOOMD_WALLTIME_STOP.
Set the variable to a unix epoch time. For example, in a job script that should run for 12 hours, set HOOMD_WALLTIME_STOP to 12 hours from now, minus 10 minutes to allow for job cleanup:
# job.sh
export HOOMD_WALLTIME_STOP=$((`date +%s` + 12 * 3600 - 10 * 60))
mpirun hoomd run.py
When using HOOMD_WALLTIME_STOP, hoomd.run() will throw the exception WalltimeLimitReached when it exits due to the walltime limit. Catch this exception so that your job can exit cleanly. Also, make sure to write out a final restart file at the end of your job so you have the final system state to continue from. Set the run's limit_multiple to the restart period so that any analyzers that must run commensurate with the restart file have a chance to run. If you don’t use any such commands, you can omit limit_multiple and the run will be free to end on any time step:
gsd_restart = dump.gsd(filename="restart.gsd", group=group.all(), truncate=True, period=10000, phase=0)
try:
    run_upto(1e6, limit_multiple=10000)
    # Perform additional actions here that should only be done after the job has completed all time steps.
except WalltimeLimitReached:
    # Perform actions here that need to be done each time you run into the wall clock limit, or just pass.
    pass

gsd_restart.write_restart()
# Perform additional job cleanup actions here. These will be executed each time the job ends due to reaching the
# walltime limit AND when the job completes all of its time steps.
Examples¶
Simple example¶
Here is a simple example that puts all of these elements together:
# job.sh
export HOOMD_WALLTIME_STOP=$((`date +%s` + 12 * 3600 - 10 * 60))
mpirun hoomd run.py
# run.py
from hoomd import *
from hoomd import md
context.initialize()
init.read_gsd(filename='init.gsd', restart='restart.gsd')
lj = md.pair.lj(r_cut=2.5)
lj.pair_coeff.set('A', 'A', epsilon=1.0, sigma=1.0)
md.integrate.mode_standard(dt=0.005)
md.integrate.nvt(group=group.all(), T=1.2, tau=0.5)
gsd_restart = dump.gsd(filename="restart.gsd", group=group.all(), truncate=True, period=10000, phase=0)
dump.dcd(filename="trajectory.dcd", period=1e5, phase=0)
analyze.log(filename='temperature.log', quantities=['temperature'], period=5000, phase=0)
try:
    run_upto(1e6, limit_multiple=10000)
except WalltimeLimitReached:
    pass
gsd_restart.write_restart()
Temperature ramp¶
Runs often have temperature ramps. These are trivial to make restartable using a variant. Just be sure to set the zero=0 option so that the ramp starts at time step 0 and does not restart from the beginning every time the job is submitted. The only change needed from the previous simple example is to use the variant in integrate.nvt():
T_variant = variant.linear_interp(points=[(0, 2.0), (2e5, 0.5)], zero=0)
md.integrate.nvt(group=group.all(), T=T_variant, tau=0.5)
Multiple stage jobs¶
Not all ramps or staged job protocols can be expressed as variants. However, it is easy to implement multi-stage jobs using run_upto and HOOMD_WALLTIME_STOP. Here is an example of a more complex job that involves multiple stages:
# run.py
from hoomd import *
from hoomd import md
context.initialize()
init.read_gsd(filename='init.gsd', restart='restart.gsd')
lj = md.pair.lj(r_cut=2.5)
lj.pair_coeff.set('A', 'A', epsilon=1.0, sigma=1.0)
md.integrate.mode_standard(dt=0.005)
gsd_restart = dump.gsd(filename="restart.gsd", group=group.all(), truncate=True, period=10000, phase=0)
try:
    # randomize at high temperature
    nvt = md.integrate.nvt(group=group.all(), T=5.0, tau=0.5)
    run_upto(1e6, limit_multiple=10000)

    # equilibrate
    nvt.set_params(T=1.0)
    run_upto(2e6, limit_multiple=10000)

    # switch to nve and start saving data for the production run
    nvt.disable()
    md.integrate.nve(group=group.all())
    dump.dcd(filename="trajectory.dcd", period=1e5, phase=0)
    analyze.log(filename='temperature.log', quantities=['temperature'], period=5000, phase=0)
    run_upto(12e6)
except WalltimeLimitReached:
    pass
gsd_restart.write_restart()
And here is another example that changes interaction parameters:
try:
    for i in range(1, 11):
        lj.pair_coeff.set('A', 'A', epsilon=0.1 * i)
        run_upto(1e6 * i)
except WalltimeLimitReached:
    pass
Multiple hoomd invocations¶
HOOMD_WALLTIME_STOP is an environment variable set once at the start of a job script, so you can launch hoomd scripts multiple times from within a job script and each of those individual runs will exit cleanly when it reaches the walltime. You need to take care that you don’t start any new scripts once the first one exits due to a walltime limit. The BASH script logic necessary to implement this behavior is workflow dependent and left as an exercise to the reader.