Install and Test NCCL on Ubuntu

Download NCCL

Ubuntu 20.04

  • CUDA 12.8

    https://developer.nvidia.com/downloads/compute/machine-learning/nccl/secure/2.26.2/ubuntu2004/x86_64/nccl-local-repo-ubuntu2004-2.26.2-cuda12.8_1.0-1_amd64.deb/
  • CUDA 12.4

    https://developer.nvidia.com/downloads/compute/machine-learning/nccl/secure/2.26.2/ubuntu2004/x86_64/nccl-local-repo-ubuntu2004-2.26.2-cuda12.4_1.0-1_amd64.deb/

Ubuntu 22.04

  • CUDA 12.8

    https://developer.nvidia.com/downloads/compute/machine-learning/nccl/secure/2.26.2/ubuntu2204/x86_64/nccl-local-repo-ubuntu2204-2.26.2-cuda12.8_1.0-1_amd64.deb/
  • CUDA 12.4

    https://developer.nvidia.com/downloads/compute/machine-learning/nccl/secure/2.26.2/ubuntu2204/x86_64/nccl-local-repo-ubuntu2204-2.26.2-cuda12.4_1.0-1_amd64.deb/
  • CUDA 12.2

    https://developer.nvidia.com/downloads/compute/machine-learning/nccl/secure/2.26.2/ubuntu2204/x86_64/nccl-local-repo-ubuntu2204-2.26.2-cuda12.2_1.0-1_amd64.deb/

Ubuntu 24.04

  • CUDA 12.8

    https://developer.nvidia.com/downloads/compute/machine-learning/nccl/secure/2.26.2/ubuntu2404/x86_64/nccl-local-repo-ubuntu2404-2.26.2-cuda12.8_1.0-1_amd64.deb/

Install NCCL (local installers)

For a local NCCL repository:

sudo dpkg -i nccl-local-repo-<version>.deb
# dpkg prints the exact keyring path to copy so apt trusts the local repo, e.g.:
# sudo cp /var/nccl-local-repo-*/nccl-local-*-keyring.gpg /usr/share/keyrings/
sudo apt update
sudo apt install libnccl2 libnccl-dev
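To sanity-check NCCL from Python, PyTorch can report the NCCL version it links against (note: PyTorch ships its own NCCL build, which may differ from the system libnccl2 installed above). A minimal check, assuming a CUDA-enabled PyTorch install:

import torch

# Reports the NCCL version that PyTorch itself was built with; a quick
# confirmation that the CUDA + NCCL stack is usable from Python.
print(torch.cuda.is_available())   # True if a CUDA device is visible
print(torch.cuda.nccl.version())   # e.g. (2, 26, 2)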

Test NCCL

Install nccl-tests

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests

Since I only use a single node, build without MPI support:

make MPI=0

Test GPU ↔ GPU Communication

Use all_reduce_perf to measure GPU-to-GPU bandwidth and latency.

./build/all_reduce_perf -b 8 -e 512M -f 2 -g 2

Options explained:

  • -b 8 : minimum message size (bytes)
  • -e 512M : maximum message size
  • -f 2 : multiply the message size by 2 each step
  • -g 2 : number of GPUs to use (the default of 1 would exercise only a single GPU)

To test specific GPUs, for example GPU 0 and GPU 1:

CUDA_VISIBLE_DEVICES=0,1 ./build/all_reduce_perf -b 8 -e 512M -f 2 -g 2

This way you can test a PCIe or NVLink connection, depending on which GPUs you select.
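The same collective can also be driven from Python. Below is a minimal sketch of an all_reduce across all visible GPUs using PyTorch's NCCL backend; it assumes PyTorch is installed, and the master address/port are arbitrary local values:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # One process per GPU; the NCCL backend carries the GPU-to-GPU traffic.
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'   # arbitrary free port
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    x = torch.ones(1024 * 1024, device='cuda')  # 4 MB of float32 per rank
    dist.all_reduce(x)                           # defaults to SUM
    torch.cuda.synchronize()
    print(f'rank {rank}: x[0] = {x[0].item()}')  # equals world_size

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

If every rank prints the number of GPUs, each GPU received every other GPU's contribution, so the NCCL links are working.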

Test CPU ↔ GPU Communication

NCCL is mainly for GPU ↔ GPU communication.

For CPU ↔ GPU bandwidth, you can write a simple benchmark using CUDA or PyTorch.

Here’s a simple Python script using PyTorch to test CPU ↔ GPU transfers:

import torch
import time

def benchmark_transfer(size_in_mb):
    """Benchmark CPU to GPU and GPU to CPU transfer speed.

    Args:
        size_in_mb (int): Size of the tensor to transfer, in megabytes.
    """
    # Create a random tensor on the CPU (float32 = 4 bytes per element)
    size = size_in_mb * 1024 * 1024 // 4
    cpu_tensor = torch.randn(size, dtype=torch.float32)

    # Warm-up to avoid cold-start overhead (CUDA context init, allocator)
    gpu_tensor = cpu_tensor.to('cuda')
    _ = gpu_tensor.to('cpu')
    torch.cuda.synchronize()

    # CPU to GPU transfer timing
    start = time.time()
    gpu_tensor = cpu_tensor.to('cuda')
    torch.cuda.synchronize()
    elapsed = time.time() - start
    print(f'CPU -> GPU: {size_in_mb / elapsed:.2f} MB/s')

    # GPU to CPU transfer timing
    start = time.time()
    cpu_tensor2 = gpu_tensor.to('cpu')
    torch.cuda.synchronize()
    elapsed = time.time() - start
    print(f'GPU -> CPU: {size_in_mb / elapsed:.2f} MB/s')

# Example: test with a 100 MB tensor
benchmark_transfer(100)
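One caveat on the numbers above: torch.randn allocates pageable host memory, which usually measures well below the PCIe limit. As a variation, pinning (page-locking) the CPU tensor typically gets much closer to the link's peak:

import torch
import time

size = 100 * 1024 * 1024 // 4  # 100 MB of float32
pinned_tensor = torch.randn(size, dtype=torch.float32).pin_memory()

# Warm-up, then synchronize so the timer sees only the copy itself
_ = pinned_tensor.to('cuda', non_blocking=True)
torch.cuda.synchronize()

start = time.time()
gpu_tensor = pinned_tensor.to('cuda', non_blocking=True)
torch.cuda.synchronize()
print(f'pinned CPU -> GPU: {100 / (time.time() - start):.2f} MB/s')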

To understand how your GPUs are connected, use:

nvidia-smi topo -m

You’ll see output like:

        GPU0    GPU1    GPU2    CPU Affinity
GPU0     X      NV2     PHB     0-19
GPU1    NV2      X      PHB     0-19
GPU2    PHB     PHB      X      0-19

Legend:

  • NV2 : connection traversing a bonded set of 2 NVLink links
  • PHB : connection traversing PCIe and a PCIe host bridge (typically the CPU)

This helps you target the correct GPUs for testing specific links.
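You can also confirm from Python whether a pair of GPUs supports direct peer-to-peer access (over NVLink or PCIe) rather than staging copies through host memory; a small check, again assuming PyTorch is installed:

import torch

# True means device i can directly read/write device j's memory,
# i.e. P2P transfers are possible between the pair.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            print(f'GPU{i} -> GPU{j}: {torch.cuda.can_device_access_peer(i, j)}')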