NUMA and sub-NUMA at Leonardo

October 4, 2023 · 5 min read

anadynamics developer

Leonardo is a petascale supercomputer located in Bologna, Italy. It was inaugurated in November 2022 and is currently in the top fourth HPC cluster.

Its "Booster" module has 3456 nodes, each with an Ice Lake Xeon of 32 cores and 4 "custom" A100 GPUs. Using our imagination, it should look something like this:


|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA ---------    On   | 00000000:10:00.0 Off |                    0 |
| N/A   43C    P0    63W / 472W |      0MiB / 65536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA ---------    On   | 00000000:52:00.0 Off |                    0 |
| N/A   43C    P0    62W / 470W |      0MiB / 65536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA ---------    On   | 00000000:8A:00.0 Off |                    0 |
| N/A   43C    P0    61W / 455W |      0MiB / 65536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA ---------    On   | 00000000:CB:00.0 Off |                    0 |
| N/A   43C    P0    62W / 454W |      0MiB / 65536MiB |      0%      Default |
|                               |                      |             Disabled |

What does "custom" mean here? Well, the manufacturing process of the silicon die is not perfect and each die may end up with less, or sometimes more, threads than expected. These A100 were selected among those few that ended up with more tensor cores than expected. A bit of snobbery, if you ask me.

Back to the processors. There's only 1 per node, but it's partitioned in 2 in a clustering mode called sub-NUMA. NUMA (Non-Unified Memory Access) it's a design idea to deal with multiple CPU cores and their access to memory.

For us users, it's important to be NUMA-aware, that is, understand that some cores have faster access to certain regions of memory and slower access to other ones. The same applies to the relationship between CPUs and GPUs.

To know your NUMA layout, just run this command::

$ numactl  --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 256924 MB
node 0 free: 249438 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 257999 MB
node 1 free: 256694 MB
node distances:
node   0   1
  0:  10  11
  1:  11  10

This means we have 2 NUMA regions, but you may notice that the distances between the nodes are not that big. That's because there's still just 1 socket, so the penalty for sharing memory won't be that bad.

While cache memory access is important, we also care about GPU-CPU affinity, that is, to which threads the GPUs are closer. Let's look at the next command to answer this::

$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity    NUMA Affinity
GPU0     X      NV4     NV4     NV4     PXB     NODE    NODE    NODE    0-15    0
GPU1    NV4      X      NV4     NV4     NODE    PXB     NODE    NODE    0-15    0
GPU2    NV4     NV4      X      NV4     NODE    NODE    PXB     NODE    0-15    0
GPU3    NV4     NV4     NV4      X      NODE    NODE    NODE    PXB     0-15    0
mlx5_0  PXB     NODE    NODE    NODE     X      NODE    NODE    NODE
mlx5_1  NODE    PXB     NODE    NODE    NODE     X      NODE    NODE
mlx5_2  NODE    NODE    PXB     NODE    NODE    NODE     X      NODE
mlx5_3  NODE    NODE    NODE    PXB     NODE    NODE    NODE     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

There is a lot of information on the output. Notice that the 4 last rows/columns of the matrix refer to the Mellanox adapters and how each of them is directly connected to their corresponding GPU, but have to go through the CPU to access the other ones.

Anyways, right now we only care about the CPU Affinity column. This tells us that all 4 GPUs have a direct link to cores 0-15. Does this mean that cores 16-31 are isolated from all GPUs? Wouldn't that be a very bad design decision?

Actually, what's happening here it's that the sub-NUMA partitioning is messing ip with the NVIDIA dirver. The NVIDIA driver only sees 16 cores for each GPU and the numbering doesn't correspond to the physical reality of those cores. Other systems with, say, 2 sockets and regular NUMA partitioning don't show this behaviour and the NVIDIA driver is able to distinguish between them and inform the affinity correctly. For example, a node with 2 sockets of 16 cores each and 4 GPUs may have a direct connection between GPUs 0 and 1 and cores 0-15 and another one between GPUs 2 and 3 and cores 16-31. Crossing this barrier (eg: using threads 0-15 to launch kernels on GPU 3), will result in increased memory latency. This sub-NUMA feature from Ice Lake Xeons seems to catch the NVIDIA driver off-guard.

The conclusion is that given there's only 1 socket, all 4 GPUs have direct connection to all 32 cores and there won't be slowdowns when attaching different threads to different GPUs, but there will be slowdons if a job tries to use cores from the range 0-15 and the range 16-31 while expecting to share memory between them.

References​

References