NVSHMEM Multi-NIC Support with AWS EFA#

Date: 2026-03-27

Abstract#

NVSHMEM 3.6.5 introduces multi-NIC support for the libfabric transport with round-robin NIC selection—a collaborative effort between NVIDIA and Amazon Annapurna Labs. NVSHMEM has gained significant attention since DeepEP demonstrated that implementing MoE layer dispatch and combine operations with NVSHMEM can substantially improve performance for large language models (LLMs) employing Mixture-of-Experts (MoE) architectures. However, on AWS, DeepEP compatibility was limited because it relies on InfiniBand GPUDirect Async (IBGDA), whereas AWS GPU instances support only the proxy thread transport via libfabric. Furthermore, prior to version 3.6.5, NVSHMEM supported only a single NIC, meaning each GPU could utilize only one EFA NIC for data transmission. This article reviews the new multi-NIC feature and presents experiments exploring the potential performance improvements on AWS.

Introduction#

Mixture-of-Experts (MoE) architectures rely on all-to-all collective communication to dispatch and combine tokens across expert replicas. However, conventional all-to-all implementations often become a bottleneck during LLM training due to limited overlap with computation and suboptimal bandwidth utilization. NVSHMEM addresses this by exposing a device-side API that enables developers to implement custom all-to-all kernels—such as pplx-kernels and DeepEP—with GPU-initiated networking, eliminating costly GPU–CPU context switches.
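The dispatch/combine pattern that these all-to-all kernels implement can be sketched in plain Python: each token is routed to its top-k experts (dispatch), processed, and the per-expert partial results are merged back per token (combine). All names, sizes, and the identity "processing" below are illustrative, not any library's actual API.

```python
# Minimal sketch of MoE dispatch/combine. Dispatch groups tokens by
# destination expert; combine merges per-expert outputs back per token.

def dispatch(token_ids, topk_experts):
    """Group token ids by destination expert (the all-to-all 'dispatch')."""
    per_expert = {}
    for tok, experts in zip(token_ids, topk_experts):
        for e in experts:
            per_expert.setdefault(e, []).append(tok)
    return per_expert

def combine(per_expert_outputs):
    """Merge per-expert partial results back per token (the 'combine')."""
    merged = {}
    for outputs in per_expert_outputs.values():
        for tok, val in outputs:
            merged[tok] = merged.get(tok, 0.0) + val
    return merged

# 4 tokens, each routed to its top-2 experts out of 4.
tokens = [0, 1, 2, 3]
topk = [(0, 1), (1, 2), (0, 3), (2, 3)]
groups = dispatch(tokens, topk)
# Each expert "processes" its tokens; here processing is the identity.
outputs = {e: [(t, 1.0) for t in toks] for e, toks in groups.items()}
print(combine(outputs))  # each token receives contributions from 2 experts
```

In a real MoE layer the dispatch and combine steps are exactly the communication-heavy phases that NVSHMEM-based kernels move onto the GPU.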

Despite these advantages, prior NVSHMEM versions (before 3.6.5) supported only a single NIC per GPU on the libfabric transport. On AWS instances equipped with multiple EFA NICs, this meant that NVSHMEM-based MoE kernels could not fully utilize the available network bandwidth, and in practice did not outperform ordinary all-to-all collective communication.

To evaluate the impact of multi-NIC support, we use the NVSHMEM device all-to-all performance tool to compare single-NIC and multi-NIC configurations. We also benchmark pplx-kernels to assess whether MoE dispatch and combine operations benefit from multi-NIC EFA on the libfabric backend.

NVSHMEM Device All-to-All#

To verify that multi-NIC round-robin is functioning correctly, we use rdmatop to monitor RDMA traffic across all EFA NICs during NVSHMEM benchmarks. The experiments follow the NVSHMEM examples in the rdmatop repository.

We run two benchmarks from the NVSHMEM perftest suite on a Slurm cluster. The first measures point-to-point put bandwidth between one GPU on each of two nodes, and the second measures device-initiated all-to-all latency across all 8 GPUs per node.

Put Bandwidth (Inter-Node, 1 GPU per Node)#

In this experiment, we launch a point-to-point job with one sender and one receiver to verify whether a single GPU can fully utilize all available EFA NICs via round-robin NIC selection.

salloc -N 2 --ntasks-per-node=1 \
  bash examples/nvshmem/nvshmem.sbatch \
  /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw \
  -b 8 -e 128M -f 2 -n 1000 -w 100

All-to-All Latency (Device, All 8 GPUs per Node)#

This experiment uses the NVSHMEM device all-to-all performance tool to determine whether all-to-all communication can fully saturate all EFA NICs when all 8 GPUs per node participate.

NODES=2 # 4, 8, 16
salloc -N ${NODES} bash examples/nvshmem/nvshmem.sbatch \
  /opt/nvshmem/bin/perftest/device/coll/alltoall_latency \
  -b 16 -e 1G -f 2 -n 1000 -s all

pplx-kernels All-to-All#

To assess whether real-world MoE workloads benefit from multi-NIC support, we benchmark pplx-kernels—an NVSHMEM-based implementation of MoE dispatch and combine operations used in production serving systems such as vLLM. The experiment follows the pplx example in the rdmatop repository.

salloc -N 2 bash examples/pplx/pplx.sbatch \
  python3 -m tests.bench_all_to_all

Results#

The following subsections present the results for point-to-point put bandwidth and device all-to-all latency, comparing NVSHMEM 3.5.21 (single NIC) against 3.6.5 (multi-NIC).

Put Bandwidth (Point-to-Point)#

In the point-to-point put bandwidth experiment, NVSHMEM 3.5.21 uses only a single EFA NIC per GPU for both Tx and Rx, as shown by the rdmatop output below. This confirms that, prior to multi-NIC support, each GPU was limited to the bandwidth of one NIC regardless of how many were available on the instance.

https://raw.githubusercontent.com/crazyguitar/pysheeet/blog/nvshmem/docs/_static/appendix/nvshmem/nvshmem-put-3.5.21.gif

With NVSHMEM 3.6.5, the same experiment shows traffic distributed across all 4 EFA NICs via round-robin selection. This allows a single GPU to aggregate bandwidth from multiple NICs, significantly increasing the achievable point-to-point throughput.
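The send-side behavior can be sketched as follows: successive transfers from one GPU cycle through the available NICs, so a single GPU aggregates the bandwidth of all of them. The counter-based selection and NIC count below are illustrative, not NVSHMEM's actual internals.

```python
from collections import Counter

# Sketch of round-robin NIC selection on the Tx side: transfer i goes
# out on NIC (i % NUM_NICS), striping traffic evenly across all NICs.
NUM_NICS = 4

def select_nic(op_index):
    return op_index % NUM_NICS

tx = Counter(select_nic(i) for i in range(1000))
print(tx)  # 250 transfers per NIC: traffic is evenly striped
```

With each NIC carrying an equal share, the achievable point-to-point throughput scales with the number of NICs rather than being capped at one NIC's line rate.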

https://raw.githubusercontent.com/crazyguitar/pysheeet/blog/nvshmem/docs/_static/appendix/nvshmem/nvshmem-put-3.6.5.gif

All-to-All Latency (Device)#

In the device all-to-all experiment, NVSHMEM 3.5.21 shows that each GPU utilizes only a single NIC to transfer data, consistent with the point-to-point results above.

https://raw.githubusercontent.com/crazyguitar/pysheeet/blog/nvshmem/docs/_static/appendix/nvshmem/nvshmem-3.5.21.gif

With NVSHMEM 3.6.5, rdmatop confirms that all Tx NICs carry traffic, demonstrating that multi-NIC round-robin is active during the all-to-all operation. However, we observed that Rx traffic was imbalanced across NICs.

https://raw.githubusercontent.com/crazyguitar/pysheeet/blog/nvshmem/docs/_static/appendix/nvshmem/nvshmem-3.6.5.gif

This imbalance occurs because all PEs compute the same round-robin target NIC at the same time, causing their traffic to converge on a single remote NIC. A fix is proposed in the PR "libfabric: fix multi-NIC RX imbalance for EFA transport". After applying the fix, Rx traffic is balanced across all NICs in the all-to-all experiment.
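The convergence, and the effect of staggering the selection per PE, can be sketched as follows. The per-PE offset mirrors the idea of the proposed fix; NVSHMEM's actual selection logic may differ in detail.

```python
from collections import Counter

NUM_NICS = 4
NUM_PES = 8

# Unfixed: every PE advances the same round-robin schedule, so at step
# t all PEs target remote NIC (t % NUM_NICS) simultaneously.
def target_nic_unfixed(pe, step):
    return step % NUM_NICS

# Fixed (per-PE offset): staggering the schedule by PE id spreads
# simultaneous traffic across all NICs.
def target_nic_fixed(pe, step):
    return (pe + step) % NUM_NICS

step = 0
rx_unfixed = Counter(target_nic_unfixed(pe, step) for pe in range(NUM_PES))
rx_fixed = Counter(target_nic_fixed(pe, step) for pe in range(NUM_PES))
print(rx_unfixed)  # all 8 PEs hit NIC 0 at the same step
print(rx_fixed)    # 2 PEs per NIC: balanced Rx
```

At any given step the unfixed schedule concentrates all incoming traffic on one NIC, which is exactly the Rx hotspot rdmatop reveals.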

https://raw.githubusercontent.com/crazyguitar/pysheeet/blog/nvshmem/docs/_static/appendix/nvshmem/nvshmem-fix.gif

We also compare the all-to-all bandwidth across different node counts for NVSHMEM 3.5.21, NVSHMEM 3.6.5, and NVSHMEM 3.6.5 with the Rx imbalance fix. With the fix applied, throughput is significantly higher than both other configurations. Notably, without the fix, even though NVSHMEM 3.6.5 utilizes multiple NICs for Tx, the Rx imbalance becomes a bottleneck—resulting in throughput that can be lower than single-NIC mode, as observed in the 2-node and 4-node cases.

https://raw.githubusercontent.com/crazyguitar/pysheeet/blog/nvshmem/docs/_static/appendix/nvshmem/nvshmem-all2all-bandwidth.png

pplx-kernels All-to-All#

With NVSHMEM 3.6.5 (without the Rx imbalance fix), all Tx and Rx NICs are active. Although Rx traffic is slightly imbalanced, the MoE dispatch kernel itself may distribute tokens unevenly across experts, contributing to the imbalance.
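This workload-level imbalance is easy to see in isolation: token-to-expert routing follows the gating network, not a uniform schedule, so per-expert (and hence per-destination) traffic is rarely even. The sizes and random gating below are illustrative only.

```python
import random

# Illustration of uneven expert load from token routing: each token is
# assigned to TOP_K distinct experts at random, standing in for a
# gating network. Per-expert token counts typically deviate from the
# uniform share even though routing is unbiased on average.
random.seed(0)
NUM_EXPERTS = 8
NUM_TOKENS = 256
TOP_K = 2

load = [0] * NUM_EXPERTS
for _ in range(NUM_TOKENS):
    for e in random.sample(range(NUM_EXPERTS), TOP_K):
        load[e] += 1

mean = sum(load) / NUM_EXPERTS
print(load, mean)  # per-expert counts scatter around the mean of 64
```

Because dispatch traffic follows these counts, some Rx-side skew is expected from the workload itself, independent of NIC selection.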

https://raw.githubusercontent.com/crazyguitar/pysheeet/blog/nvshmem/docs/_static/appendix/nvshmem/nvshmem-all2all-pplx-original.gif

With NVSHMEM 3.6.5 and the Rx imbalance fix applied, all Tx and Rx NICs remain active, and Rx traffic appears relatively more balanced. However, the MoE dispatch and combine bandwidth shows no significant difference compared to the unfixed version.

https://raw.githubusercontent.com/crazyguitar/pysheeet/blog/nvshmem/docs/_static/appendix/nvshmem/nvshmem-all2all-pplx.gif

Comparing the two configurations, NVSHMEM bandwidth is relatively higher with the Rx imbalance fix applied. However, other factors such as kernel computation may dominate the overall MoE kernel performance, resulting in similar dispatch and combine throughput between the two versions.

https://raw.githubusercontent.com/crazyguitar/pysheeet/blog/nvshmem/docs/_static/appendix/nvshmem/pplx_bandwidth.png

Conclusion#

NVSHMEM 3.6.5 introduces multi-NIC support for the libfabric transport, enabling round-robin NIC selection on AWS instances equipped with multiple EFA NICs. Our experiments confirm that this feature allows both point-to-point and all-to-all NVSHMEM operations to distribute traffic across all available NICs, significantly increasing aggregate bandwidth compared to older NVSHMEM versions.

However, we identified an Rx imbalance issue in the initial 3.6.5 release, where all PEs select the same round-robin target NIC simultaneously, causing Rx traffic to converge on a single NIC. This imbalance can degrade all-to-all throughput to levels below single-NIC mode in certain configurations. A proposed fix addresses this by offsetting the round-robin selection per PE, resulting in balanced Rx traffic and substantially improved all-to-all bandwidth.

For pplx-kernels, multi-NIC support successfully distributes traffic across all EFA NICs. While NVSHMEM-level bandwidth improves with the Rx fix, the end-to-end MoE dispatch and combine throughput shows minimal difference, suggesting that kernel computation rather than network bandwidth is the dominant factor in these workloads.

Overall, multi-NIC support in NVSHMEM 3.6.5 is an important step toward enabling NVSHMEM-based MoE kernels to fully utilize the network capacity of AWS GPU instances, and can potentially improve MoE kernel throughput for large-scale LLM training and serving workloads.

References#