<aside>

This functionality requires adding valid payment information.

</aside>

Modal supports running a training job across several coordinated H100 containers. Each container can saturate the H100 GPUs available on its host (a.k.a. node) and communicate with peer containers that do the same. By scaling a training job from a single GPU to 16 GPUs, you can achieve a nearly 16x reduction in training time.

Cluster compute capability

Modal H100 clusters provide:

The rest of this guide will walk you through how the Modal client library enables multi-node training and integrates with torchrun.

@clustered

Unlike standard Modal serverless containers, containers in a multi-node training job must be able to:

  1. Perform fast, direct network communication with each other
  2. Be scheduled together, all-or-nothing, at the same time

The @clustered decorator enables this behavior.

@app.function(
    gpu=modal.gpu.H100(count=8),
    timeout=60 * 60 * 24,
    retries=modal.Retries(initial_delay=0.0, max_retries=10),
)
@modal.experimental.clustered(size=4)
def train_model(train_args):
    cluster_info = modal.experimental.get_cluster_info()
    
    container_rank = cluster_info.rank
    main_addr = cluster_info.container_ips[0]
    world_size = len(cluster_info.container_ips)

    run(
        train_args,
        world_size,
        container_rank,
        main_addr,
        nprocs_per_node=8,  # matches the per-container GPU count
    )

Applying this decorator under @app.function modifies the Function so that remote calls to it are serviced by a multi-node container group. This configuration creates a group of four containers, each with eight H100 GPUs, for a total of 32 GPUs.
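Inside the clustered Function, the values returned by `get_cluster_info()` map directly onto torchrun's rendezvous flags. The sketch below shows one way to build the launch arguments; the helper name `torchrun_args`, the script name `train.py`, and the port are illustrative assumptions, not part of Modal's API:

```python
# Hypothetical helper: translate Modal cluster info into torchrun arguments.
# rank, main_addr, and n_nodes would come from
# modal.experimental.get_cluster_info() inside the clustered Function.

def torchrun_args(rank: int, main_addr: str, n_nodes: int,
                  nproc_per_node: int = 8) -> list[str]:
    """Build the argument list for torchrun (torch.distributed.run)."""
    return [
        "torchrun",
        f"--nnodes={n_nodes}",
        f"--nproc-per-node={nproc_per_node}",  # one process per GPU
        f"--node-rank={rank}",                 # this container's rank
        f"--master-addr={main_addr}",          # rank-0 container's IP
        "--master-port=29500",                 # torchrun's default port
        "train.py",
    ]

# Container rank 0 in a 4-node cluster:
print(torchrun_args(rank=0, main_addr="10.0.0.1", n_nodes=4))
```

Every container in the group runs the same Function body, so each one computes its own `--node-rank` from `cluster_info.rank` while agreeing on the rank-0 container's IP as the rendezvous address.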

Scheduling