Continuing Simulations

Overview

Questions

How do I continue running a simulation?

Objectives

Explain why you may want to continue running a simulation, such as wall time limits for cluster jobs.
Describe what you need to consider when writing a workflow step that can continue.
Demonstrate how to append to trajectory files, write needed data to a restart file and limit the simulation run to a given wall time.

Boilerplate code

[1]:

import math

import flow
import hoomd
import signac

Workflow steps from the previous section

The code in the next block collects the workflow steps from the previous tutorial section to define the whole workflow.

[2]:

def create_simulation(job):
    cpu = hoomd.device.CPU()
    sim = hoomd.Simulation(device=cpu, seed=job.statepoint.seed)
    mc = hoomd.hpmc.integrate.ConvexPolyhedron()
    mc.shape['octahedron'] = dict(vertices=[
        (-0.5, 0, 0),
        (0.5, 0, 0),
        (0, -0.5, 0),
        (0, 0.5, 0),
        (0, 0, -0.5),
        (0, 0, 0.5),
    ])
    sim.operations.integrator = mc

    return sim


class Project(flow.FlowProject):
    pass


@Project.operation
@Project.pre.true('initialized')
@Project.post.true('randomized')
def randomize(job):
    sim = create_simulation(job)
    sim.create_state_from_gsd(filename=job.fn('lattice.gsd'))
    sim.run(10e3)
    hoomd.write.GSD.write(state=sim.state,
                          mode='xb',
                          filename=job.fn('random.gsd'))
    job.document['randomized'] = True


@Project.operation
@Project.pre.after(randomize)
@Project.post.true('compressed_step')
def compress(job):
    sim = create_simulation(job)
    sim.create_state_from_gsd(filename=job.fn('random.gsd'))

    a = math.sqrt(2) / 2
    V_particle = 1 / 3 * math.sqrt(2) * a**3

    initial_box = sim.state.box
    final_box = hoomd.Box.from_box(initial_box)
    final_box.volume = (sim.state.N_particles * V_particle
                        / job.statepoint.volume_fraction)
    compress = hoomd.hpmc.update.QuickCompress(
        trigger=hoomd.trigger.Periodic(10), target_box=final_box)
    sim.operations.updaters.append(compress)

    periodic = hoomd.trigger.Periodic(10)
    tune = hoomd.hpmc.tune.MoveSize.scale_solver(moves=['a', 'd'],
                                                 target=0.2,
                                                 trigger=periodic,
                                                 max_translation_move=0.2,
                                                 max_rotation_move=0.2)
    sim.operations.tuners.append(tune)

    while not compress.complete and sim.timestep < 1e6:
        sim.run(1000)

    if not compress.complete:
        raise RuntimeError("Compression failed to complete")

    hoomd.write.GSD.write(state=sim.state,
                          mode='xb',
                          filename=job.fn('compressed.gsd'))
    job.document['compressed_step'] = sim.timestep

Motivation

Let’s say your workflow’s equilibration step takes 96 hours to complete and your HPC resource limits wall times to 24 hours. What do you do?

One solution is to write the equilibration step so that it can continue where it left off. When you execute the workflow, each incomplete signac job will move toward completing the step’s post condition. After several rounds of submissions, all signac jobs will be complete.

This section of the tutorial teaches you how to write a workflow step that can limit its run time and continue. The next section will cover how to effectively run workflow steps in cluster jobs on HPC resources.

Considerations

You must carefully design your workflow step so that it can continue from where it left off:

* Write the current state of the system to a GSD file and dynamic parameters to the job document (or other appropriate storage location).
* Perform this write in a finally: block to ensure that it is written even when an exception is thrown.
* Use the saved state when continuing the workflow step.
* Open output files in append mode so that the final file includes output from the first and all continued executions.
* Use absolute time step values for triggers so they run consistently before and after continuing the workflow step.
* Check the elapsed wall time in a loop and stop executing before the cluster job’s wall time limit. Provide some buffer to write the simulation state and exit cleanly.
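The considerations above can be sketched without HOOMD-blue at all. The following is a minimal illustration, not the tutorial's implementation: a counter stands in for the simulation, restart.json stands in for both restart.gsd and the job document, and run_step, log.txt and the injectable clock argument are hypothetical names chosen for this example. It uses a simple elapsed-time check rather than the run-time estimate used later in this section.

```python
import itertools
import json
import os
import tempfile
import time


def run_step(directory, total_steps, chunk, walltime_limit,
             clock=time.monotonic):
    """Advance a counter toward total_steps, resuming from restart.json."""
    restart_fn = os.path.join(directory, 'restart.json')
    log_fn = os.path.join(directory, 'log.txt')

    # Use the saved state when continuing, or start from zero.
    step = 0
    if os.path.isfile(restart_fn):
        with open(restart_fn) as f:
            step = json.load(f)['step']

    start = clock()
    try:
        # Append so that output from earlier executions is preserved.
        with open(log_fn, 'a') as log:
            while step < total_steps:
                step += min(chunk, total_steps - step)  # Stand-in for work.
                log.write(f'{step}\n')

                # Stop once the elapsed time reaches the limit.
                if clock() - start >= walltime_limit:
                    break
    finally:
        # Save the state even when an exception interrupts the loop.
        with open(restart_fn, 'w') as f:
            json.dump({'step': step}, f)

    return step


# Demonstration with a fake clock that advances one unit per call, so each
# execution performs two chunks before reaching a limit of 2 units.
with tempfile.TemporaryDirectory() as d:
    results = [
        run_step(d, 50, 10, 2, clock=itertools.count().__next__)
        for _ in range(3)
    ]
    with open(os.path.join(d, 'log.txt')) as f:
        logged = f.read().split()
```

Three executions reach steps 20, 40 and 50, and log.txt contains one line per chunk from all three executions because the file is opened in append mode.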

Here is the equilibration code from the Introducing HOOMD-blue tutorial as a signac-flow operation that can continue:

[3]:

N_EQUIL_STEPS = 200000  # Number of timesteps to run during equilibration.
HOOMD_RUN_WALLTIME_LIMIT = 30  # Time in seconds at which to stop the operation.


@Project.operation
@Project.pre.after(compress)  # Execute after compress completes.
# Complete after N_EQUIL_STEPS made by this workflow step.
@Project.post(lambda job: job.document.get('timestep', 0) - job.document[
    'compressed_step'] >= N_EQUIL_STEPS)
def equilibrate(job):
    end_step = job.document['compressed_step'] + N_EQUIL_STEPS

    sim = create_simulation(job)

    # Restore the tuned move size parameters from a previous execution.
    sim.operations.integrator.a = job.document.get('a', {})
    sim.operations.integrator.d = job.document.get('d', {})

    if job.isfile('restart.gsd'):
        # Read the final system configuration from a previous execution.
        sim.create_state_from_gsd(filename=job.fn('restart.gsd'))
    else:
        # Or read `compressed.gsd` for the first execution of equilibrate.
        sim.create_state_from_gsd(filename=job.fn('compressed.gsd'))

    # Write `trajectory.gsd` in append mode.
    gsd_writer = hoomd.write.GSD(filename=job.fn('trajectory.gsd'),
                                 trigger=hoomd.trigger.Periodic(10_000),
                                 mode='ab')
    sim.operations.writers.append(gsd_writer)

    # Tune the trial move sizes for the first 5,000 steps of equilibration.
    tune = hoomd.hpmc.tune.MoveSize.scale_solver(
        moves=['a', 'd'],
        target=0.2,
        trigger=hoomd.trigger.And([
            hoomd.trigger.Periodic(100),
            hoomd.trigger.Before(job.document['compressed_step'] + 5_000)
        ]))
    sim.operations.tuners.append(tune)

    try:
        # Loop until the simulation reaches the target timestep.
        while sim.timestep < end_step:
            # Run the simulation in chunks of 10,000 time steps.
            sim.run(min(10_000, end_step - sim.timestep))

            # End the workflow step early if the next run would exceed the
            # allotted walltime. Use the walltime of the current run as
            # an estimate for the next.
            if (sim.device.communicator.walltime + sim.walltime >=
                    HOOMD_RUN_WALLTIME_LIMIT):
                break
    finally:
        # Write the state of the system to `restart.gsd`.
        hoomd.write.GSD.write(state=sim.state,
                              mode='wb',
                              filename=job.fn('restart.gsd'))

        # Store the current timestep and tuned trial move sizes.
        job.document['timestep'] = sim.timestep
        job.document['a'] = sim.operations.integrator.a.to_base()
        job.document['d'] = sim.operations.integrator.d.to_base()

        if sim.device.communicator.rank == 0:
            walltime = sim.device.communicator.walltime
            print(f'{job.id} ended on step {sim.timestep} '
                  f'after {walltime} seconds')

When this workflow step is executed, it stores the trial move sizes a, d and the current timestep in the job document as well as the state of the simulation in restart.gsd. It reads these when starting again to continue from where the previous execution stopped. This is a large code block; see the comments for more details on how this workflow step can continue from where it stopped.
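The tuner in the code above is limited by an absolute step, compressed_step + 5,000, rather than by a count of steps since the current execution began. The sketch below shows why that matters, using plain Python functions as stand-ins for the triggers. It assumes that a periodic trigger fires on multiples of its period and that a "before" trigger fires for steps strictly less than its argument. tune_relative is a hypothetical alternative included only for contrast, and the step values are illustrative.

```python
COMPRESSED_STEP = 13_000  # Illustrative value, as stored in a job document.
TUNE_WINDOW = 5_000
PERIOD = 100


def tune_absolute(timestep):
    """Periodic AND before an absolute step (stand-in for the trigger)."""
    return timestep % PERIOD == 0 and timestep < COMPRESSED_STEP + TUNE_WINDOW


def tune_relative(timestep, execution_start):
    """Hypothetical alternative: measure the window from this execution."""
    return (timestep % PERIOD == 0
            and timestep - execution_start < TUNE_WINDOW)


# First execution: tuning is active inside the window and stops after it.
first = [tune_absolute(13_000), tune_absolute(17_900), tune_absolute(18_000)]

# Continued execution that starts at step 33,000.
continued_absolute = tune_absolute(33_000)
continued_relative = tune_relative(33_000, execution_start=33_000)
```

With the absolute window, tuning stays off in the continued execution. The relative window would switch tuning back on every time the step continues, so the move sizes would keep changing during equilibration.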

To limit the execution time, it splits the total simulation length into chunks and executes them in a loop. After each loop iteration, it checks whether the next call to run is likely to exceed the given time limit. sim.device.communicator.walltime gives the elapsed time from the start of the workflow step’s execution, and is identical on all MPI ranks. Using another source of time might lead to deadlocks. As a pedagogical example, this tutorial sets a 30 second wall time limit and uses 10,000 timestep chunks. In practice, you will likely set limits from hours to days and use larger chunks of 100,000 or 1,000,000 steps, depending on your simulation’s performance. You should set the chunk size large enough to avoid the small overhead from each call to run while at the same time breaking the complete execution into many chunks.
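The stopping rule and the chunk size trade-off can be explored with a toy model. The sketch below assumes every chunk takes the same wall time, which real simulations only approximate. should_stop and chunks_before_stop are hypothetical helper names, and the chunk durations are made-up values rather than measurements.

```python
def should_stop(elapsed, last_chunk, limit):
    """True when one more chunk of similar duration would reach the limit."""
    return elapsed + last_chunk >= limit


def chunks_before_stop(chunk_seconds, limit):
    """Count the chunks executed before the check ends the loop."""
    elapsed = 0.0
    n = 0
    while True:
        elapsed += chunk_seconds  # Stand-in for one call to run.
        n += 1
        if should_stop(elapsed, chunk_seconds, limit):
            return n


# Illustrative chunk durations against a 30 second limit.
long_chunks = chunks_before_stop(13.0, 30.0)
short_chunks = chunks_before_stop(4.0, 30.0)
```

In this model, 13 second chunks stop after 2 chunks (26 seconds used) while 4 second chunks stop after 7 chunks (28 seconds used). Shorter chunks leave less of the allocation idle, at the cost of more calls to run.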

The equilibrate step is ready to execute:

[4]:

project = Project()
project.print_status(overview=False,
                     detailed=True,
                     parameters=['volume_fraction'])


Detailed View:

job id                            operation          volume_fraction  labels
--------------------------------  ---------------  -----------------  --------
59363805e6f46a715bc154b38dffc4e4  equilibrate [U]                0.6
972b10bd6b308f65f0bc3a06db58cf9d  equilibrate [U]                0.4
c1a59a95a0e8b4526b28cf12aa0a689e  equilibrate [U]                0.5

[U]:unknown [R]:registered [I]:inactive [S]:submitted [H]:held [Q]:queued [A]:active [E]:error

Execute it:

[5]:

project.run()

972b10bd6b308f65f0bc3a06db58cf9d ended on step 42000 after 29.11949 seconds
59363805e6f46a715bc154b38dffc4e4 ended on step 33000 after 27.763527 seconds
c1a59a95a0e8b4526b28cf12aa0a689e ended on step 32000 after 23.5945 seconds

The equilibrate step executed for less than HOOMD_RUN_WALLTIME_LIMIT seconds for each of the signac jobs in the dataspace. In a production environment, you would run the project repeatedly until it completes.

See that the equilibrate step produced the trajectory.gsd file and the 'a', 'd', and 'timestep' items in the job document:

[6]:

!ls workspace/*

workspace/59363805e6f46a715bc154b38dffc4e4:
compressed.gsd           restart.gsd              trajectory.gsd
lattice.gsd              signac_job_document.json
random.gsd               signac_statepoint.json

workspace/972b10bd6b308f65f0bc3a06db58cf9d:
compressed.gsd           restart.gsd              trajectory.gsd
lattice.gsd              signac_job_document.json
random.gsd               signac_statepoint.json

workspace/c1a59a95a0e8b4526b28cf12aa0a689e:
compressed.gsd           restart.gsd              trajectory.gsd
lattice.gsd              signac_job_document.json
random.gsd               signac_statepoint.json

[7]:

job = project.open_job(dict(N_particles=128, volume_fraction=0.6, seed=20))
print(job.document)

{'initialized': True, 'randomized': True, 'compressed_step': 13000, 'timestep': 33000, 'a': {'octahedron': 0.04564840324176478}, 'd': {'octahedron': 0.02567136340037109}}
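The printed document is enough to work out how far this signac job is from satisfying the post condition of equilibrate. The sketch below copies the relevant values into a plain dict instead of opening the job, and the estimate of further executions assumes that each one advances about as many steps as the first. That is an assumption made here for illustration, not something the tutorial guarantees.

```python
import math

N_EQUIL_STEPS = 200000

# Values copied from the job document printed above.
document = {'compressed_step': 13000, 'timestep': 33000}


def equilibrated(document):
    """Same test as the post condition of the equilibrate step."""
    return (document.get('timestep', 0) - document['compressed_step']
            >= N_EQUIL_STEPS)


completed = document['timestep'] - document['compressed_step']
remaining = N_EQUIL_STEPS - completed

# Rough estimate, assuming later executions progress at the same rate.
more_executions = math.ceil(remaining / completed)

# Before equilibrate has run there is no 'timestep' key, so the
# condition evaluates to False rather than raising an exception.
fresh = equilibrated({'compressed_step': 13000})
```

Here 20,000 of the 200,000 equilibration steps are complete, so the post condition is not yet met and roughly 9 more executions at the 30 second limit would be needed.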

Summary

In this section of the tutorial, you defined the workflow step to equilibrate the hard particle simulation. It stores the dynamic parameters and the state of the system needed to continue when executed again. Now, the directory for each simulation contains trajectory.gsd, which will be ready for analysis once the step has executed to completion.

The next section in this tutorial will show you how to implement this workflow on the command line and submit cluster jobs that effectively use dense nodes.

This tutorial only teaches the basics of signac-flow. Read the signac-flow documentation to learn more.