Torch (NYU HPC) Slurm Training/Eval With Apptainer¶

This repo includes Slurm harness scripts that run training inside the published PhysicsNeMo container via Apptainer.

Prereqs¶

You have access to the torch cluster and can submit Slurm jobs.
apptainer (or singularity) is available on the cluster.
Data is available on the cluster filesystem (no S3/OSN copying in this workflow).
You have a container published to GitHub Container Registry (GHCR) (or you use an existing tag).

Container Build/Publish (GitHub Actions)¶

The container build workflow is .github/workflows/container-physicsnemo.yml.

It builds containers/Dockerfile.physicsnemo-25.11.
It publishes to GHCR on:
push to main, or
workflow_dispatch (manual run).

Recommended for branch work:

Push your code changes to your branch.
Run the workflow manually (Actions UI): Container PhysicsNeMo 25.11 with workflow_dispatch on your branch.
Use the resulting image tag(s):
ghcr.io/<owner>/ocean-emulator-physicsnemo:25.11-<git_sha>
ghcr.io/<owner>/ocean-emulator-physicsnemo:25.11-manual-<branch-name>

On torch, scripts/slurm_apptainer_train.sbatch can pull by:

CONTAINER_HASH=<git_sha> (expands to tag 25.11-<git_sha>), or
CONTAINER_TAG=25.11-manual-..., or
IMAGE_REF=ghcr.io/...:<tag> (takes precedence over the two above)

Training Harness¶

Main script:

scripts/slurm_apptainer_train.sbatch

Typically you will have this repo cloned on a scratch space so you can use this script. But note that the actual training run will use the code and configs baked into the container (under /workspace/src and /workspace/configs). It does not bind-mount your host checkout into the container for training. This keeps runs pinned to a container tag (and avoids accidental drift from host edits). This means host-side config file edits are ignored unless they are baked into a new container image; for quick, run-specific tweaks, use CLI overrides via ARGS.

It expects environment variables:

CONFIG (required): config path inside the container image. Relative paths are resolved under /workspace/, e.g. configs/samudra_om4/train.yaml.
NAME_SUFFIX (required): populates the run name by prepending the current date; you can also set NAME directly if you prefer.
DATA_ROOT (optional): host data path passed to --experiment.data_root (default: /scratch/<current_user>/data/om4_onedeg_v3)
OUTPUT_BASE (optional): host output base dir passed to --experiment.base_output_dir (default: /scratch/<current_user>/runs)
ARGS (optional): extra CLI overrides, e.g. --batch_size=1
NSYS_ARGS (optional): if set, wrap the training launch with nsys profile
WANDB_API_KEY (optional): if set and WANDB_MODE unset, defaults to W&B online
WANDB_MODE (optional): online or disabled (if unset, defaults based on whether WANDB_API_KEY is present)

Key behavior:

Refuses to run if ${OUTPUT_BASE}/$NAME already exists (forces unique run names).
Fails early if either ${DATA_ROOT} or ${OUTPUT_BASE} does not exist, with instructions to set the corresponding env var.
Uses the container venv explicitly (/workspace/.venv/bin/python) to avoid missing deps.
To change training code or YAML configs, rebuild/publish a new container tag and point the harness at it (e.g. via CONTAINER_HASH=<git_sha>).
Caches the pulled SIF under ${REPO_DIR}/.apptainer-images/ by default.
If NSYS_ARGS is set and does not include -o/--output, reports are written under ${OUTPUT_BASE}/${NAME}/nsys/.
Defaults to a 8-hour walltime in scripts/slurm_apptainer_train.sbatch.
Our 1-degree jobs take around 4-6 hours, so this is safe; you should probalby increase it by data size for 1/2-degree (i.e. 4x more data = 4x more time) etc.

Example: 1 Node, 8x RTX6000 on the NYU Torch HPC¶

For Torch RTX6000 nodes, size CPU and memory proportionally to GPUs. If you request all GPUs on a node, also request the node's full CPU and memory.

Current gr102 capacity: - 8 GPUs (rtx6000) - 128 CPUs total - 1,400G memory available via SLURM

So, sizing rule for this node when using our sbatch script which spawns a process per GPU within a task: - --cpus-per-task=16 * <num_gpus> - --mem=175G * <num_gpus>

ie for an 8-GPU run, use --cpus-per-task=128 --mem=1400G.

Partition guidance: - Do not set --partition by default. - Let Slurm place the job unless you have a specific partition requirement.

export CONFIG=configs/samudra_om4/train.yaml
export NAME_SUFFIX=om4_samudra_baseline
export ARGS="--batch_size=1"
# Optional overrides (defaults are /scratch/$USER/data/om4_onedeg_v3 and /scratch/$USER/runs)
# export DATA_ROOT=/scratch/$USER/data/om4_onedeg_v3
# export OUTPUT_BASE=/scratch/$USER/runs

# Container selection (pick one)
export CONTAINER_HASH=<git_sha>
# export CONTAINER_TAG=25.11-manual-<branch>
# export IMAGE_REF=ghcr.io/<owner>/ocean-emulator-physicsnemo:25.11-<git_sha>

sbatch \
  --account=torch_pr_347_courant \
  --nodes=1 \
  --ntasks-per-node=1 \
  --cpus-per-task=128 \
  --mem=1400G \
  --gres=gpu:rtx6000:8 \
  --time=24:00:00 \
  scripts/slurm_apptainer_train.sbatch

To enable profiling for a run, you typically want something like this:

export NSYS_ARGS="--trace=cuda,nvtx,osrt,nccl --sample=cpu --delay=300 --duration=120"

Monitoring¶

After submission:

Slurm stdout: slurm-<jobid>.out in the submission directory (usually the repo root on torch).
Training log: ${OUTPUT_BASE}/${NAME:-$(date +%Y-%m-%d)-${NAME_SUFFIX}}/experiment.log

Useful commands:

squeue -j <jobid> -o '%.18i %.2t %.10M %R'
tail -f slurm-<jobid>.out
tail -f "${OUTPUT_BASE}/${NAME:-$(date +%Y-%m-%d)-${NAME_SUFFIX}}/experiment.log"

Interactive And Batch Checks On Torch¶

Interactive allocations and TTY-driven srun sessions are available on Torch. For quick probes, use a short interactive srun command. For reproducible checks with saved logs, prefer short sbatch jobs and inspect their outputs.

Example interactive hostname probe:

srun \
  --account=torch_pr_347_courant \
  --nodes=1 \
  --ntasks=1 \
  --time=00:02:00 \
  --pty bash -lc 'hostname'

Equivalent batch hostname probe:

sbatch \
  --account=torch_pr_347_courant \
  --nodes=1 \
  --ntasks=1 \
  --time=00:01:00 \
  --output="$HOME/oe-hostname-%j.out" \
  --wrap="/bin/hostname"

sacct -j <jobid> --format=JobID,State,Partition,NodeList%40,Elapsed,ExitCode -n
cat "$HOME/oe-hostname-<jobid>.out"

GPU status inside an allocation:

srun --overlap --jobid=<jobid> -N1 -n1 nvidia-smi

Evaluation Harness¶

Main script:

scripts/slurm_apptainer_eval.sbatch

The eval harness runs one process (single-node, single-GPU by default) inside the PhysicsNeMo container and executes:

python -m ocean_emulators.eval <CONFIG> ...

It expects environment variables:

CONFIG (required): eval config path inside the container image. Relative paths resolve under /workspace/, e.g. configs/samudra_om4/eval.yaml.
NAME_SUFFIX (required): populates the eval run name by prepending the current date; you can also set NAME directly if you prefer.
One checkpoint selector (required):
TARGET_CHECKPOINT: checkpoint path relative to ${OUTPUT_BASE}, or
CKPT_PATH: absolute checkpoint path on host (relative host paths are also accepted).
DATA_ROOT (optional): host data path passed to --experiment.data_root (default: /scratch/<current_user>/data/om4_onedeg_v3)
OUTPUT_BASE (optional): host output base dir passed to --experiment.base_output_dir (default: /scratch/<current_user>/runs)
ARGS (optional): extra CLI overrides
WANDB_API_KEY (optional): if set and WANDB_MODE unset, defaults to W&B online
WANDB_MODE (optional): online or disabled (if unset, defaults based on whether WANDB_API_KEY is present)
BACKEND (optional): eval backend (default cuda)

Key behavior:

Refuses to run if ${OUTPUT_BASE}/$NAME already exists (forces unique run names).
Fails early if either ${DATA_ROOT} or ${OUTPUT_BASE} does not exist, with instructions to set the corresponding env var.
Verifies the checkpoint exists before launching.
Binds checkpoint parent paths automatically when checkpoint files live outside ${DATA_ROOT}/${OUTPUT_BASE}.

Example: 1 Node, 1x RTX6000 Eval¶

export CONFIG=configs/samudra_om4/eval.yaml
export NAME_SUFFIX=om4_samudra_baseline_eval
export TARGET_CHECKPOINT=2026-02-22-om4_samudra_baseline/saved_nets/ema_ckpt.pt
# Optional overrides (defaults are /scratch/$USER/data/om4_onedeg_v3 and /scratch/$USER/runs)
# export DATA_ROOT=/scratch/$USER/data/om4_onedeg_v3
# export OUTPUT_BASE=/scratch/$USER/runs
# export WANDB_MODE=online
# export WANDB_API_KEY=...

# Container selection (pick one)
export CONTAINER_HASH=<git_sha>
# export CONTAINER_TAG=25.11-manual-<branch>
# export IMAGE_REF=ghcr.io/<owner>/ocean-emulator-physicsnemo:25.11-<git_sha>

sbatch \
  --account=torch_pr_347_courant \
  --partition=rtx6000_lzanna \
  --nodes=1 \
  --ntasks-per-node=1 \
  --cpus-per-task=8 \
  --mem=128GB \
  --gres=gpu:rtx6000:1 \
  --time=04:00:00 \
  scripts/slurm_apptainer_eval.sbatch

NCCL Gotcha On RTX6000 Nodes¶

On Torch RTX6000 nodes we observed NCCL hangs for 8-GPU single-node training unless P2P is disabled.

Recommended env vars:

export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

Symptom without the above:

job prints distributed init ... world_size 8 and then stalls
GPUs show high utilization but low memory usage
eventually you may see an NCCL watchdog timeout

Apptainer Caching / Pulling¶

By default the harness will cache pulled SIFs in SIF_DIR, which defaults to ${REPO_DIR}/.apptainer-images. using a unqiue name based on the container you've specified.

You can also point it directly to a SIF_PATH. If the SIF_PATH does not exist, the harness will pull your specified container from GHCR to that path.

Private GHCR Images¶

If the image is private, set:

export GHCR_USERNAME=...
export GHCR_TOKEN=...

The harness maps these to the environment variables Apptainer uses for registry auth.