Skip to main content

NUMA and sub-NUMA at Leonardo

ยท 5 min read
German P. Barletta

Leonardo is a petascale supercomputer located in Bologna, Italy. It was inaugurated in November 2022 and is currently in the top fourth HPC cluster.

Its "Booster" module has 3456 nodes, each with an Ice Lake Xeon of 32 cores and 4 "custom" A100 GPUs. Using our imagination, it should look something like this:


|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA --------- On | 00000000:10:00.0 Off | 0 |
| N/A 43C P0 63W / 472W | 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA --------- On | 00000000:52:00.0 Off | 0 |
| N/A 43C P0 62W / 470W | 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA --------- On | 00000000:8A:00.0 Off | 0 |
| N/A 43C P0 61W / 455W | 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA --------- On | 00000000:CB:00.0 Off | 0 |
| N/A 43C P0 62W / 454W | 0MiB / 65536MiB | 0% Default |
| | | Disabled |

What does "custom" mean here? Well, the manufacturing process of the silicon die is not perfect and each die may end up with less, or sometimes more, threads than expected. These A100 were selected among those few that ended up with more tensor cores than expected. A bit of snobbery, if you ask me.

Back to the processors. There's only 1 per node, but it's partitioned in 2 in a clustering mode called sub-NUMA. NUMA (Non-Unified Memory Access) it's a design idea to deal with multiple CPU cores and their access to memory.

For us users, it's important to be NUMA-aware, that is, understand that some cores have faster access to certain regions of memory and slower access to other ones. The same applies to the relationship between CPUs and GPUs.

To know your NUMA layout, just run this command::

$ numactl  --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 256924 MB
node 0 free: 249438 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 257999 MB
node 1 free: 256694 MB
node distances:
node 0 1
0: 10 11
1: 11 10

This means we have 2 NUMA regions, but you may notice that the distances between the nodes are not that big. That's because there's still just 1 socket, so the penalty for sharing memory won't be that bad.

While cache memory access is important, we also care about GPU-CPU affinity, that is, to which threads the GPUs are closer. Let's look at the next command to answer this::

$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity NUMA Affinity
GPU0 X NV4 NV4 NV4 PXB NODE NODE NODE 0-15 0
GPU1 NV4 X NV4 NV4 NODE PXB NODE NODE 0-15 0
GPU2 NV4 NV4 X NV4 NODE NODE PXB NODE 0-15 0
GPU3 NV4 NV4 NV4 X NODE NODE NODE PXB 0-15 0
mlx5_0 PXB NODE NODE NODE X NODE NODE NODE
mlx5_1 NODE PXB NODE NODE NODE X NODE NODE
mlx5_2 NODE NODE PXB NODE NODE NODE X NODE
mlx5_3 NODE NODE NODE PXB NODE NODE NODE X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

There is a lot of information on the output. Notice that the 4 last rows/columns of the matrix refer to the Mellanox adapters and how each of them is directly connected to their corresponding GPU, but have to go through the CPU to access the other ones.

Anyways, right now we only care about the CPU Affinity column. This tells us that all 4 GPUs have a direct link to cores 0-15. Does this mean that cores 16-31 are isolated from all GPUs? Wouldn't that be a very bad design decision?

Actually, what's happening here it's that the sub-NUMA partitioning is messing ip with the NVIDIA dirver. The NVIDIA driver only sees 16 cores for each GPU and the numbering doesn't correspond to the physical reality of those cores. Other systems with, say, 2 sockets and regular NUMA partitioning don't show this behaviour and the NVIDIA driver is able to distinguish between them and inform the affinity correctly. For example, a node with 2 sockets of 16 cores each and 4 GPUs may have a direct connection between GPUs 0 and 1 and cores 0-15 and another one between GPUs 2 and 3 and cores 16-31. Crossing this barrier (eg: using threads 0-15 to launch kernels on GPU 3), will result in increased memory latency. This sub-NUMA feature from Ice Lake Xeons seems to catch the NVIDIA driver off-guard.

The conclusion is that given there's only 1 socket, all 4 GPUs have direct connection to all 32 cores and there won't be slowdowns when attaching different threads to different GPUs, but there will be slowdons if a job tries to use cores from the range 0-15 and the range 16-31 while expecting to share memory between them.

References

  1. https://www.cineca.it/en/hot-topics/Leonardo
  2. https://wiki.u-gov.it/confluence/display/SCAIUS/UG3.2%3A+LEONARDO+UserGuide