<aside>

This functionality requires adding valid payment information.

</aside>

Modal supports running a training job across several coordinated H100 containers. Each container can saturate the H100 GPUs available on its host (a.k.a. node) and communicate with peer containers that do the same. By scaling a training job from a single GPU to 16 GPUs, you can achieve a nearly 16x reduction in training time.

Cluster compute capability

Modal H100 clusters provide:

The rest of this guide will walk you through how the Modal client library enables multi-node training and integrates with torchrun.

@clustered

Unlike standard Modal serverless containers, containers in a multi-node training job must be able to:

  1. Perform fast, direct network communication with each other
  2. Be scheduled together, all-or-nothing, at the same time

The @clustered decorator enables this behavior.

@app.function(
    gpu=modal.gpu.H100(count=8),
    timeout=60 * 60 * 24,
    retries=modal.Retries(initial_delay=0.0, max_retries=10),
)
@modal.experimental.clustered(size=4)
def train_model(train_args):
    cluster_info = modal.experimental.get_cluster_info()
    
    container_rank = cluster_info.rank
    main_addr = cluster_info.container_ips[0]
    world_size = len(cluster_info.container_ips)

    run(
        train_args,
        world_size,
        container_rank,
        main_addr,
        nprocs_per_node=8,  # matches the per-container GPU count
    )

Applying this decorator under @app.function modifies the Function so that remote calls to it are serviced by a multi-node container group. This configuration creates a group of four containers, each with eight H100 GPUs, for a total of 32 GPUs.
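Inside the clustered Function, the values returned by `get_cluster_info()` map directly onto torchrun's rendezvous flags. The sketch below shows one way to build the launch arguments; the helper name `torchrun_args`, the script name `train.py`, and the port are illustrative assumptions, not part of Modal's API:

```python
# Hypothetical helper: translate Modal cluster info into torchrun arguments.
# rank, main_addr, and n_nodes would come from
# modal.experimental.get_cluster_info() inside the clustered Function.

def torchrun_args(rank: int, main_addr: str, n_nodes: int,
                  nproc_per_node: int = 8) -> list[str]:
    """Build the argument list for torchrun (torch.distributed.run)."""
    return [
        "torchrun",
        f"--nnodes={n_nodes}",
        f"--nproc-per-node={nproc_per_node}",  # one process per GPU
        f"--node-rank={rank}",                 # this container's rank
        f"--master-addr={main_addr}",          # rank-0 container's IP
        "--master-port=29500",                 # torchrun's default port
        "train.py",
    ]

# Container rank 0 in a 4-node cluster:
print(torchrun_args(rank=0, main_addr="10.0.0.1", n_nodes=4))
```

Every container in the group runs the same Function body, so each one computes its own `--node-rank` from `cluster_info.rank` while agreeing on the rank-0 container's IP as the rendezvous address.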

Scheduling