Megatron-LM#
Megatron-LM is NVIDIA’s framework
for training and fine-tuning large transformer models with tensor, pipeline, and expert
parallelism. Megatron Bridge
sits on top of Megatron-LM and provides a recipe-based interface — instead of passing
dozens of CLI flags, you write a short Python recipe that returns a config object.
Recipes can load HuggingFace pretrained weights
directly via the hf_path parameter, so you can start from any checkpoint without
manual conversion.
In this note, we demonstrate how to use Megatron Bridge to load HuggingFace pretrained weights and train with Megatron-LM. The examples use the scripts in src/megatron.
How to Use Megatron Bridge#
The container image is built with Docker and exported as an
enroot squashfs (.sqsh) file. Enroot is a
lightweight container runtime designed for HPC — it converts Docker images into
unprivileged sandboxes that integrate with SLURM via the
pyxis plugin. When srun.sh runs, it passes
--container-image and --container-mounts to srun, and pyxis handles importing
the .sqsh and launching each task inside the container.
```bash
# Build Docker image and export to enroot sqsh
make build
# This produces megatron-lm+latest.sqsh in the current directory.
# srun.sh picks it up via the SQSH env var (default: ./megatron-lm+latest.sqsh).
```
You can override the image path and container mounts with environment variables:
| Variable | Default | Description |
|---|---|---|
| `SQSH` | `./megatron-lm+latest.sqsh` | Path to enroot image |
|  |  | Container mounts |
|  |  | GPUs per node |
A recipe is a Python file that calls a Megatron Bridge config function and returns a
ConfigContainer. For example,
deepseek_v2_lite_pretrain.py:
```python
from megatron.bridge.recipes.deepseek.deepseek_v2 import (
    deepseek_v2_lite_pretrain_config,
)


def configure(hf_path=None, moe_token_dispatcher_type=None):
    cfg = deepseek_v2_lite_pretrain_config(
        **({"hf_path": hf_path} if hf_path else {}),
        tensor_model_parallel_size=8,
        pipeline_model_parallel_size=1,
        expert_model_parallel_size=2,
        sequence_parallel=True,
        seq_length=4096,
        train_iters=500,
        global_batch_size=64,
        micro_batch_size=1,
        eval_interval=100,
        lr_warmup_iters=50,
        save_interval=0,
    )
    cfg.model.moe_permute_fusion = False
    if moe_token_dispatcher_type == "deepep":
        cfg.model.moe_token_dispatcher_type = "flex"
        cfg.model.moe_flex_dispatcher_backend = "deepep"
        cfg.model.moe_enable_deepep = True
        cfg.model.moe_shared_expert_overlap = False
    return cfg
```
Launch a pretrain job with srun.sh, which wraps srun + pyxis to run inside the
enroot container:
```bash
# 2-node DeepSeek V2 Lite pretrain
salloc -N 2
./srun.sh recipes/deepseek_v2_lite_pretrain.py \
    hf_path=/fsx/models/deepseek-ai/DeepSeek-V2-Lite

# Override config with Hydra-style args
./srun.sh recipes/deepseek_v2_lite_pretrain.py \
    train.train_iters=1000
```
The entrypoint.py script loads the recipe, applies CLI overrides, and calls pretrain().
The srun.sh script sets up EFA environment variables (FI_PROVIDER=efa, NCCL_NET_PLUGIN, etc.)
and maps SLURM_PROCID / SLURM_LOCALID to RANK / LOCAL_RANK so that
Megatron’s distributed init works without torchrun.
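To make the override step concrete, here is a minimal sketch of applying one dotted override such as `train.train_iters=1000` to a nested config object. `apply_override` is a hypothetical helper written for illustration; the real entrypoint.py parses overrides with OmegaConf/Hydra machinery, which also handles type coercion, lists, and validation.

```python
# Hypothetical sketch: apply a dotted "a.b=value" override to a nested config.
# Illustrative only; entrypoint.py's real parsing is done by OmegaConf/Hydra.
def apply_override(cfg, assignment):
    path, _, raw = assignment.partition("=")
    *parents, leaf = path.split(".")
    obj = cfg
    for name in parents:  # walk down to the nested sub-config
        obj = getattr(obj, name)
    # crude scalar coercion for the sketch: int if it looks like one
    value = int(raw) if raw.lstrip("-").isdigit() else raw
    setattr(obj, leaf, value)
    return cfg
```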
How to Enable Nsys Profiling#
Pass --nsys to srun.sh and set the profiling overrides in the recipe:
```bash
./srun.sh --nsys recipes/deepseek_v2_lite_pretrain.py \
    hf_path=/fsx/models/deepseek-ai/DeepSeek-V2-Lite \
    profiling.use_nsys_profiler=true \
    profiling.profile_step_start=10 \
    profiling.profile_step_end=15 \
    profiling.profile_ranks=[0]
```
When --nsys is enabled, srun.sh prepends nsys profile with
--capture-range=cudaProfilerApi so that only the steps between
profile_step_start and profile_step_end are captured. The output
.nsys-rep files are written to nsys-megatron/ inside the container mount.
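The capture window works because the training loop brackets the chosen steps with CUDA profiler start/stop calls, which `--capture-range=cudaProfilerApi` listens for. A minimal sketch of that gating logic, with a string return value standing in for the actual `cudaProfilerStart`/`cudaProfilerStop` calls so the sketch runs anywhere:

```python
# Sketch of step-window gating used with nsys --capture-range=cudaProfilerApi.
# The returned string stands in for the CUDA profiler call the training loop
# would issue at that step (Megatron's hooks issue the real calls via torch).
def profiler_action(step, start_step=10, end_step=15):
    """Return which profiler call, if any, belongs at this training step."""
    if step == start_step:
        return "cudaProfilerStart"  # nsys begins capturing here
    if step == end_step:
        return "cudaProfilerStop"   # nsys stops capturing here
    return None                     # all other steps run unprofiled
```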
On AWS, if you want to monitor EFA network traffic in the Nsys timeline, add
--enable efa_metrics to the nsys command (already included in srun.sh). For a
detailed walkthrough on monitoring EFA with NCCL GIN and Nsys, refer to
Monitoring EFA with NCCL GIN and Nsys.
Why python Instead of torchrun#
The srun.sh script launches python3 entrypoint.py directly rather than using
torchrun. This is intentional. torchrun spawns worker processes via
multiprocessing, and the spawn boundary can interfere with Nsys profiling — the
profiler sometimes fails to capture the cudaProfilerStart and cudaProfilerStop
calls issued by the child process, resulting in empty or incomplete traces.
By running python directly under srun --ntasks-per-node=<GPUS>, each GPU gets
its own process managed by SLURM. The RANK, LOCAL_RANK, and WORLD_SIZE
environment variables are derived from SLURM_PROCID, SLURM_LOCALID, and
SLURM_NTASKS respectively. This avoids the spawn layer entirely, giving Nsys (and
other profilers like VizTracer) a clean, single-process view of each rank.
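The variable mapping described above can be sketched in a few lines. srun.sh does the equivalent with shell `export` statements before launching python3 entrypoint.py; only the variable names come from the text, the helper itself is illustrative.

```python
# Derive torch.distributed's expected environment from SLURM's, as srun.sh
# does in shell before exec'ing python3 entrypoint.py.
SLURM_TO_TORCH = {
    "SLURM_PROCID": "RANK",         # global rank of this task
    "SLURM_LOCALID": "LOCAL_RANK",  # rank within the node (one task per GPU)
    "SLURM_NTASKS": "WORLD_SIZE",   # total tasks across all nodes
}


def export_torch_env(environ):
    """Copy SLURM-provided values to the names Megatron's distributed init reads."""
    for slurm_var, torch_var in SLURM_TO_TORCH.items():
        if slurm_var in environ:
            environ.setdefault(torch_var, environ[slurm_var])
    return environ
```

In practice `environ` would be `os.environ`, populated by SLURM for each task that srun launches.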
Custom Profilers (VizTracer)#
Megatron Bridge’s profiling hooks can be extended to support custom profilers. The
viztracer_plugin.py
shows how to do this by monkey-patching megatron.bridge.training.profiling:
1. Add a new field (e.g. `use_viztracer`) to `ProfilingConfig` via `dataclasses.field` so OmegaConf recognizes it as a valid override.
2. Patch `handle_profiling_step` to start the profiler at `profile_step_start`.
3. Patch `handle_profiling_stop` to stop and save at `profile_step_end`.
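The patching steps above can be sketched as follows. The hook signature here is simplified for illustration (the real `handle_profiling_step` in megatron.bridge.training.profiling takes more arguments, and the plugin patches `handle_profiling_stop` separately), and `profiler` stands in for a VizTracer instance with `start()`, `stop()`, and `save()` methods:

```python
# Simplified sketch of the monkey-patch pattern in viztracer_plugin.py.
# `profiling_module` is the module being patched; `profiler` is any object
# with start()/stop()/save() (e.g. a viztracer.VizTracer instance).
def install(profiling_module, profiler):
    original_step = profiling_module.handle_profiling_step

    def patched_step(config, step):
        if getattr(config, "use_viztracer", False):
            if step == config.profile_step_start:
                profiler.start()   # begin tracing at the window start
            elif step == config.profile_step_end:
                profiler.stop()    # stop and persist at the window end
                profiler.save()
        return original_step(config, step)

    profiling_module.handle_profiling_step = patched_step
```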
The plugin is loaded in entrypoint.py before any Megatron imports:
```python
import viztracer_plugin

viztracer_plugin.install()
```
Then pass the override on the command line:
```bash
./srun.sh recipes/deepseek_v2_lite_pretrain.py \
    hf_path=/fsx/models/deepseek-ai/DeepSeek-V2-Lite \
    profiling.use_viztracer=true \
    profiling.profile_step_start=10 \
    profiling.profile_step_end=15 \
    profiling.profile_ranks=[0]
```
The same pattern works for any profiler — implement start() / stop() / save()
in the patched hooks and register a config field so it can be toggled via Hydra overrides.