Install and Test NCCL on Ubuntu

Download NCCL

Ubuntu 20.04

  • CUDA 12.8

    https://developer.nvidia.com/downloads/compute/machine-learning/nccl/secure/2.26.2/ubuntu2004/x86_64/nccl-local-repo-ubuntu2004-2.26.2-cuda12.8_1.0-1_amd64.deb/
  • CUDA 12.4

    https://developer.nvidia.com/downloads/compute/machine-learning/nccl/secure/2.26.2/ubuntu2004/x86_64/nccl-local-repo-ubuntu2004-2.26.2-cuda12.4_1.0-1_amd64.deb/

Ubuntu 22.04

  • CUDA 12.8

    https://developer.nvidia.com/downloads/compute/machine-learning/nccl/secure/2.26.2/ubuntu2204/x86_64/nccl-local-repo-ubuntu2204-2.26.2-cuda12.8_1.0-1_amd64.deb/
  • CUDA 12.4

    https://developer.nvidia.com/downloads/compute/machine-learning/nccl/secure/2.26.2/ubuntu2204/x86_64/nccl-local-repo-ubuntu2204-2.26.2-cuda12.4_1.0-1_amd64.deb/
  • CUDA 12.2

    https://developer.nvidia.com/downloads/compute/machine-learning/nccl/secure/2.26.2/ubuntu2204/x86_64/nccl-local-repo-ubuntu2204-2.26.2-cuda12.2_1.0-1_amd64.deb/

Ubuntu 24.04

  • CUDA 12.8

    https://developer.nvidia.com/downloads/compute/machine-learning/nccl/secure/2.26.2/ubuntu2404/x86_64/nccl-local-repo-ubuntu2404-2.26.2-cuda12.8_1.0-1_amd64.deb/

Install NCCL (local installers)

For a local NCCL repository:

sudo dpkg -i nccl-local-repo-<version>.deb
# dpkg prints the exact keyring path to copy so apt trusts the local repo, e.g.:
# sudo cp /var/nccl-local-repo-*/nccl-local-*-keyring.gpg /usr/share/keyrings/
sudo apt update
sudo apt install libnccl2 libnccl-dev
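To sanity-check NCCL from Python, PyTorch can report the NCCL version it links against (note: PyTorch ships its own NCCL build, which may differ from the system libnccl2 installed above). A minimal check, assuming a CUDA-enabled PyTorch install:

import torch

# Reports the NCCL version that PyTorch itself was built with; a quick
# confirmation that the CUDA + NCCL stack is usable from Python.
print(torch.cuda.is_available())   # True if a CUDA device is visible
print(torch.cuda.nccl.version())   # e.g. (2, 26, 2)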

Test NCCL

Install nccl-tests

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests

Since I only use a single node, build without MPI support:

make MPI=0

Test GPU ↔ GPU Communication

Use all_reduce_perf to measure GPU-to-GPU bandwidth and latency.

./build/all_reduce_perf -b 8 -e 512M -f 2 -g 2

Options explained:

  • -b 8 : minimum message size (bytes)
  • -e 512M : maximum message size
  • -f 2 : multiply the message size by 2 each step
  • -g 2 : number of GPUs to use (the default of 1 would exercise only a single GPU)

To test specific GPUs, for example GPU 0 and GPU 1:

CUDA_VISIBLE_DEVICES=0,1 ./build/all_reduce_perf -b 8 -e 512M -f 2 -g 2

This way you can test a PCIe or NVLink connection, depending on which GPUs you select.
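The same collective can also be driven from Python. Below is a minimal sketch of an all_reduce across all visible GPUs using PyTorch's NCCL backend; it assumes PyTorch is installed, and the master address/port are arbitrary local values:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # One process per GPU; the NCCL backend carries the GPU-to-GPU traffic.
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'   # arbitrary free port
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    x = torch.ones(1024 * 1024, device='cuda')  # 4 MB of float32 per rank
    dist.all_reduce(x)                           # defaults to SUM
    torch.cuda.synchronize()
    print(f'rank {rank}: x[0] = {x[0].item()}')  # equals world_size

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

If every rank prints the number of GPUs, each GPU received every other GPU's contribution, so the NCCL links are working.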

Test CPU ↔ GPU Communication

NCCL is mainly for GPU ↔ GPU communication.

For CPU ↔ GPU bandwidth, you can write a simple benchmark using CUDA or PyTorch.

Here’s a simple Python script using PyTorch to test CPU ↔ GPU transfers:

import torch
import time

def benchmark_transfer(size_in_mb):
    """Benchmark CPU to GPU and GPU to CPU transfer speed.

    Args:
        size_in_mb (int): Size of the tensor to transfer, in megabytes.
    """
    # Create a random tensor on the CPU (float32 = 4 bytes per element)
    size = size_in_mb * 1024 * 1024 // 4
    cpu_tensor = torch.randn(size, dtype=torch.float32)

    # Warm-up to avoid cold-start overhead (CUDA context init, allocator)
    gpu_tensor = cpu_tensor.to('cuda')
    _ = gpu_tensor.to('cpu')
    torch.cuda.synchronize()

    # CPU to GPU transfer timing
    start = time.time()
    gpu_tensor = cpu_tensor.to('cuda')
    torch.cuda.synchronize()
    elapsed = time.time() - start
    print(f'CPU -> GPU: {size_in_mb / elapsed:.2f} MB/s')

    # GPU to CPU transfer timing
    start = time.time()
    cpu_tensor2 = gpu_tensor.to('cpu')
    torch.cuda.synchronize()
    elapsed = time.time() - start
    print(f'GPU -> CPU: {size_in_mb / elapsed:.2f} MB/s')

# Example: test with a 100 MB tensor
benchmark_transfer(100)
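One caveat on the numbers above: torch.randn allocates pageable host memory, which usually measures well below the PCIe limit. As a variation, pinning (page-locking) the CPU tensor typically gets much closer to the link's peak:

import torch
import time

size = 100 * 1024 * 1024 // 4  # 100 MB of float32
pinned_tensor = torch.randn(size, dtype=torch.float32).pin_memory()

# Warm-up, then synchronize so the timer sees only the copy itself
_ = pinned_tensor.to('cuda', non_blocking=True)
torch.cuda.synchronize()

start = time.time()
gpu_tensor = pinned_tensor.to('cuda', non_blocking=True)
torch.cuda.synchronize()
print(f'pinned CPU -> GPU: {100 / (time.time() - start):.2f} MB/s')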

To understand how your GPUs are connected, use:

nvidia-smi topo -m

You’ll see output like:

        GPU0    GPU1    GPU2    CPU Affinity
GPU0     X      NV2     PHB     0-19
GPU1    NV2      X      PHB     0-19
GPU2    PHB     PHB      X      0-19

Legend:

  • NV2 : connection traversing a bonded set of 2 NVLink links
  • PHB : connection traversing PCIe and a PCIe host bridge (typically the CPU)

This helps you target the correct GPUs for testing specific links.
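You can also confirm from Python whether a pair of GPUs supports direct peer-to-peer access (over NVLink or PCIe) rather than staging copies through host memory; a small check, again assuming PyTorch is installed:

import torch

# True means device i can directly read/write device j's memory,
# i.e. P2P transfers are possible between the pair.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            print(f'GPU{i} -> GPU{j}: {torch.cuda.can_device_access_peer(i, j)}')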