
· 5 min read
German P. Barletta

Leonardo is a petascale supercomputer located in Bologna, Italy. It was inaugurated in November 2022 and currently ranks fourth among the world's HPC clusters.

Its "Booster" module has 3456 nodes, each with a 32-core Ice Lake Xeon and 4 "custom" A100 GPUs. Using our imagination, it should look something like this:


+-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA ---------      On | 00000000:10:00.0 Off |                    0 |
| N/A   43C    P0    63W / 472W |      0MiB / 65536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA ---------      On | 00000000:52:00.0 Off |                    0 |
| N/A   43C    P0    62W / 470W |      0MiB / 65536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA ---------      On | 00000000:8A:00.0 Off |                    0 |
| N/A   43C    P0    61W / 455W |      0MiB / 65536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA ---------      On | 00000000:CB:00.0 Off |                    0 |
| N/A   43C    P0    62W / 454W |      0MiB / 65536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

What does "custom" mean here? Well, the manufacturing process of the silicon die is not perfect, and each die may end up with fewer, or sometimes more, functional units than expected. These A100s were selected among the few that ended up with more tensor cores than expected. A bit of snobbery, if you ask me.

Back to the processors. There's only 1 per node, but it's partitioned in 2 through a clustering mode called sub-NUMA. NUMA (Non-Uniform Memory Access) is a design approach for dealing with multiple CPU cores and their access to memory.

For us users, it's important to be NUMA-aware, that is, understand that some cores have faster access to certain regions of memory and slower access to other ones. The same applies to the relationship between CPUs and GPUs.

To know your NUMA layout, just run this command:

$ numactl  --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 256924 MB
node 0 free: 249438 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 257999 MB
node 1 free: 256694 MB
node distances:
node   0   1
  0:  10  11
  1:  11  10

This means we have 2 NUMA regions, but you may notice that the distances between the nodes are not that big. That's because there's still just 1 socket, so the penalty for sharing memory won't be that bad.
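If you want to keep a process and its memory allocations on just one of these regions, numactl can also do the pinning. A minimal sketch (my_app is just a placeholder for whatever you're actually running):

# run on the cores of NUMA node 0 and allocate memory there as well
$ numactl --cpunodebind=0 --membind=0 ./my_app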

While memory locality matters for CPU code, we also care about GPU-CPU affinity, that is, which cores each GPU is closest to. Let's look at the next command to answer this:

$ nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity  NUMA Affinity
GPU0     X    NV4   NV4   NV4   PXB     NODE    NODE    NODE    0-15          0
GPU1    NV4    X    NV4   NV4   NODE    PXB     NODE    NODE    0-15          0
GPU2    NV4   NV4    X    NV4   NODE    NODE    PXB     NODE    0-15          0
GPU3    NV4   NV4   NV4    X    NODE    NODE    NODE    PXB     0-15          0
mlx5_0  PXB   NODE  NODE  NODE   X      NODE    NODE    NODE
mlx5_1  NODE  PXB   NODE  NODE  NODE     X      NODE    NODE
mlx5_2  NODE  NODE  PXB   NODE  NODE    NODE     X      NODE
mlx5_3  NODE  NODE  NODE  PXB   NODE    NODE    NODE     X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

There is a lot of information in this output. Notice that the last 4 rows/columns of the matrix refer to the Mellanox adapters: each of them is directly connected to its corresponding GPU, but has to go through the CPU to reach the other ones.

Anyway, right now we only care about the CPU Affinity column. It tells us that all 4 GPUs have a direct link to cores 0-15. Does this mean that cores 16-31 are isolated from all GPUs? Wouldn't that be a very bad design decision?

Actually, what's happening here is that the sub-NUMA partitioning is messing up with the NVIDIA driver. The driver only sees 16 cores for each GPU, and the numbering doesn't correspond to the physical reality of those cores. Other systems with, say, 2 sockets and regular NUMA partitioning don't show this behaviour: there the NVIDIA driver is able to distinguish between the sockets and report the affinity correctly. For example, a node with 2 sockets of 16 cores each and 4 GPUs may have a direct connection between GPUs 0 and 1 and cores 0-15, and another one between GPUs 2 and 3 and cores 16-31. Crossing this barrier (e.g., using threads 0-15 to launch kernels on GPU 3) will result in increased memory latency. This sub-NUMA feature of the Ice Lake Xeons seems to catch the NVIDIA driver off guard.

The conclusion is that, given there's only 1 socket, all 4 GPUs have a direct connection to all 32 cores and there won't be slowdowns when attaching different threads to different GPUs, but there will be slowdowns if a job tries to use cores from both the 0-15 and the 16-31 ranges while expecting to share memory between them.
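If you do want to keep each task inside one sub-NUMA domain, the batch system can do the binding for you. A hypothetical Slurm line (Leonardo's actual partition/account flags omitted, and the exact binding behaviour depends on the site's Slurm configuration):

# 4 tasks per node, 1 GPU and 8 cores each, with every task bound to its cores
$ srun --ntasks-per-node=4 --gpus-per-task=1 --cpus-per-task=8 --cpu-bind=cores ./my_gpu_app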

References

  1. https://www.cineca.it/en/hot-topics/Leonardo
  2. https://wiki.u-gov.it/confluence/display/SCAIUS/UG3.2%3A+LEONARDO+UserGuide

· 5 min read
German P. Barletta

The standard way

The typeid() operator is the standard way of getting runtime type information (RTTI) on an object. It returns a std::type_info with a .name() member function that outputs the type information as a const char*.

Obviously, this "name" is implementation-dependent and needs proper demangling in order to be human-readable. This can be done through a library like Boost's demangle, but the standard library provides another solution.
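For reference, the Boost route looks roughly like this; a minimal sketch using boost::core::demangle (the Widget type is just a stand-in, and the rest of this post takes the standard-library path instead):

#include <boost/core/demangle.hpp>
#include <iostream>
#include <typeinfo>

struct Widget {};

int main() {
    Widget w;
    // typeid(...).name() returns a mangled, implementation-defined string;
    // boost::core::demangle() turns it into something human-readable
    std::cout << boost::core::demangle(typeid(w).name()) << '\n';
}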

std::type_index (added in C++11) is a wrapper class over std::type_info whose main job is to be usable as a key in mapping containers; the idea here is to associate it with a human-readable version of the type that's inside the std::type_info object.

Let's see an example of this word salad:

#include <string>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>

#include <fmt/core.h>

// minimal class definitions assumed for this example: B derives from A
struct A { virtual ~A() = default; };
struct B : A {};

int main() {
    // type_index as the key
    std::unordered_map<std::type_index, std::string> type_names;

    // the user decides how to call each type
    type_names[std::type_index(typeid(int))] = "int";
    type_names[std::type_index(typeid(A))] = "A";
    type_names[std::type_index(typeid(B*))] = "pointer to B";
    type_names[std::type_index(typeid(A*))] = "pointer to A";

    int i = 1;
    A a;
    B *b = nullptr;

    fmt::print("`i` is {}\n", type_names[std::type_index(typeid(i))]);
    fmt::print("`a` is {}\n", type_names[std::type_index(typeid(a))]);
    fmt::print("`b` is {}\n", type_names[std::type_index(typeid(b))]);
    fmt::print("`b` casted to the base class is {}\n",
               type_names[std::type_index(typeid(dynamic_cast<A*>(b)))]);
}

This would give the following output. You can also check it on godbolt:

`i` is int
`a` is A
`b` is pointer to B
`b` casted to the base class is pointer to A

This method bypasses demangling and ensures no mistakes are made when interpreting the object types. The downside, on the other hand, is evident: the user has to define the type names for all classes, even the built-in ones. It's as if every type needed a Python __repr__() member function, but global and much more convoluted.

Another issue/gotcha with the typeid operator is that it ignores cvr qualifiers (const, volatile and reference, &). For example:

std::unordered_map<std::type_index, std::string> type_names;

type_names[std::type_index(typeid(int))] = "int";

int i = 1;
fmt::print("i is {}\n", type_names[std::type_index(typeid(i))]);

int &ref_to_i = i;
fmt::print("i& is not {}\n", type_names[std::type_index(typeid(ref_to_i))]);
fmt::print("i&& is not {}\n", type_names[std::type_index(typeid(std::move(i)))]);

int const const_i = i;
fmt::print("const_i is not {}\n", type_names[std::type_index(typeid(const_i))]);

And the output would be:

`i` is int
`i&` is not int
`i&&` is not int
`const_i` is not int

In this example we've only stored typeid(int) in our type_names map, yet all the other types matched with it. A more extensive example can be seen on this godbolt.

Boost's way

Boost provides the type_index.hpp header, with a drop-in replacement for typeid() that also solves the mangling issues, at least on the platforms I've tried it on.

type_id_runtime() replaces the typeid() operator and, instead of returning a std::type_info, it returns a boost::typeindex::type_index object that has a .pretty_name() member function for demangling. On the other hand, cvr qualifiers are still ignored.

This code:

int i = 1;
fmt::print("`i` is: {}\n",
boost::typeindex::type_id_runtime(i).pretty_name());
int &lref_to_i = i;
fmt::print("`lref_to_i` is: {}\n",
boost::typeindex::type_id_runtime(lref_to_i).pretty_name());
fmt::print("`i` casted to a r-value reference is: {}\n",
boost::typeindex::type_id_runtime(std::move(i)).pretty_name());

will give the following output:

`i` is:  int
`lref_to_i` is: int
`i` casted to a r-value reference is: int

Over on this godbolt you can see how well pretty_name() does with user made data types.

Now, since Boost calls its equivalent of the typeid() operator type_id_runtime(), there must be a compile-time equivalent, right?

Types at compile-time

Boost also provides type information at compile time with type_id<T>(). It works just like type_id_runtime(), but instead of passing the query object as a function argument, you pass its type as a template parameter, usually by wrapping the object in a decltype() call. This makes sure the object's type is known at compile time.

Here's the corresponding godbolt example:

int i = 1;
fmt::print("i is: {}\n",
boost::typeindex::type_id<decltype(i)>().pretty_name());

int &lref_to_i = i;
fmt::print("lref_to_i is not: {}\n",
boost::typeindex::type_id<decltype(lref_to_i)>()
.pretty_name());

fmt::print("`i` casted to an r-value reference is not: {}\n",
boost::typeindex::type_id<decltype(std::move(i))>()
.pretty_name());

And its output:

`i` is:  int
`lref_to_i` is not: int
`i` casted to an r-value reference is not: int

And even better is the fact that we now have access to info on the cvr qualifiers. We can get it with type_id_with_cvr<T>():

See the godbolt example:

int i = 1;
fmt::print("i is: {}\n",
boost::typeindex::type_id_with_cvr<decltype(i)>().pretty_name());

int &lref_to_i = i;
fmt::print("lref_to_i is: {}\n",
boost::typeindex::type_id_with_cvr<decltype(lref_to_i)>()
.pretty_name());

fmt::print("`i` casted to an r-value reference is: {}\n",
boost::typeindex::type_id_with_cvr<decltype(std::move(i))>()
.pretty_name());

And its output:

`i` is:  int
`lref_to_i` is: int&
`i` casted to an r-value reference is: int&&

Internally, type_id_with_cvr<T>() calls the standard typeid() operator, but only after performing some template magic on the type T to determine its cvr qualifiers. Through partial template class specialization sprinkled with some public inheritance, one can determine whether a type has cvr qualifiers or whether it's a pointer. Perhaps that's material for another post.
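Just to give a taste of the idea, here's a hand-rolled sketch (not Boost's actual implementation) that detects const and reference qualifiers through partial specialization:

#include <iostream>

// primary template: assume no qualifiers
template <typename T>
struct cvr_info {
    static constexpr bool is_const = false;
    static constexpr bool is_reference = false;
};

// partial specializations peel the qualifiers off one by one
template <typename T>
struct cvr_info<const T> {
    static constexpr bool is_const = true;
    static constexpr bool is_reference = false;
};

template <typename T>
struct cvr_info<T&> {
    static constexpr bool is_const = cvr_info<T>::is_const;
    static constexpr bool is_reference = true;
};

int main() {
    int i = 1;
    const int &cref = i;
    // decltype(cref) is `const int&`, so both lines print 1
    std::cout << cvr_info<decltype(cref)>::is_const << '\n';
    std::cout << cvr_info<decltype(cref)>::is_reference << '\n';
}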

In the end, type information as a string is not as useful at compile time as it is during runtime, but it's at least a good learning tool. For example, we could use it to inspect types on godbolt, something we couldn't do otherwise, as there's no godbolt debugger (and there shouldn't be one).

Maybe we'll do just that in the next post.

References

  1. https://en.cppreference.com/w/cpp/language/typeid
  2. https://en.cppreference.com/w/cpp/types/type_info
  3. https://en.cppreference.com/w/cpp/types/type_index
  4. https://www.boost.org/doc/libs/1_83_0/doc/html/boost_typeindex/getting_started.html

· 4 min read
German P. Barletta

As we said in our last Singularity post, Singularity is now Apptainer.

We'll now redo our container locuaz.sif, but this time using Apptainer.

Installing Apptainer

First, we get some dependencies that are not usually present on a Linux desktop. On an Ubuntu-based system we do:

apt install fuse2fs squashfuse fuse-overlayfs

Then we download Apptainer from the repo, and install it:

sudo dpkg -i apptainer_<version>_amd64.deb

Notice that for some reason the Singularity and Apptainer packages are incompatible, so you'll have to remove Singularity to install Apptainer. Yeah, the break-up wasn't amicable.

The definition file

This is the first piece of good news: the definition file stays the same!

Building the container

In this case, the apptainer command is just a drop-in replacement for singularity. So for locuaz we do:

sudo apptainer build locuaz.sif locuaz.def 

Signing and verifying your container

In our previous post we used the Sylabs endpoint to store our key so users could verify our signature. This time we'll choose the "open" way to do it.

The steps to generate your key are the same as before: just replace singularity with apptainer and follow the wizard:

apptainer key newpair

After finishing the wizard to create your key, you can sign your image. This is how I signed locuaz:

apptainer sign locuaz.sif
INFO: Signature created and applied to image 'locuaz.sif'

Now, when you created your key, you got a fingerprint. If you missed it, just list your keys:

apptainer key list

Push the public key (identified by its fingerprint) to the keyserver:

apptainer key push <FINGERPRINT>

This'll work because keys.openpgp.org is the default keyserver after installing Apptainer. If you're not sure, list your remotes:

$ apptainer remote list
Cloud Services Endpoints
========================

NAME           URI                  ACTIVE  GLOBAL  EXCLUSIVE  INSECURE
DefaultRemote  cloud.apptainer.org  YES     YES     NO         NO
SylabsCloud    cloud.sylabs.io      NO      YES     NO         NO

Keyservers
==========

URI                       GLOBAL  INSECURE  ORDER
https://keys.openpgp.org  YES     NO        1*

* Active cloud services keyserver

Authenticated Logins
=================================

URI               INSECURE
docker://ghcr.io  NO

After pushing a new key you'll get an email at the address you set when you created the key with apptainer key newpair. It'll offer to publicly link that email with the fingerprint, so users can look you up by email instead of by fingerprint.

They could download the public key from there, but it's much easier to verify on the command line, supplying the openpgp URL. For example, to verify locuaz.sif:

$ apptainer verify --url https://keys.openpgp.org locuaz.sif 
INFO: Verifying image with PGP key material
[LOCAL] Signing entity: Patricio Barletta <pbarletta@gmail.com>
[LOCAL] Fingerprint: 8AD02DE471F2282E508C78973F7A36C74361A111
Objects verified:
ID |GROUP |LINK |TYPE
------------------------------------------------
1 |1 |NONE |Def.FILE
2 |1 |NONE |JSON.Generic
3 |1 |NONE |JSON.Generic
4 |1 |NONE |FS
[REMOTE] Signing entity: Patricio Barletta <pbarletta@gmail.com>
[REMOTE] Fingerprint: 8AD02DE471F2282E508C78973F7A36C74361A111
Objects verified:
ID |GROUP |LINK |TYPE
------------------------------------------------
1 |1 |NONE |Def.FILE
2 |1 |NONE |JSON.Generic
3 |1 |NONE |JSON.Generic
4 |1 |NONE |FS
INFO: Verified signature(s) from image 'locuaz.sif'

Uploading to GitHub packages (ghcr)

Finally, we upload our container to a registry. GitHub Packages is available to everyone, chances are your code is on GitHub already, and having everything in one place is nice.

We first get our Personal Access Token (PAT) from GitHub. GitHub docs were written for docker users, so our command lines will be a bit different. This is how I did it:

apptainer remote login --username pgbarletta docker://ghcr.io

And then pasted my token. Now you should be good to push your container:

apptainer push <APPTAINER-CONTAINER>.sif oras://ghcr.io/<NAMESPACE>/<APPTAINER-CONTAINER>.sif:<VERSION>

This is how it looked in my case:

apptainer push locuaz.sif oras://ghcr.io/pgbarletta/locuaz.sif:0.5.3

As of version 1.2.2, Apptainer shows no progress bar or anything like it, so if it looks like it hung, just have faith.
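For reference, users can then pull the image straight from ghcr and check the signature, the same way we verified it locally. For locuaz that would look something like:

# download the container from GitHub's registry under the name locuaz.sif
apptainer pull locuaz.sif oras://ghcr.io/pgbarletta/locuaz.sif:0.5.3

# and verify it against the openpgp keyserver, as before
apptainer verify --url https://keys.openpgp.org locuaz.sif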

And that's it! You can then go to your packages and link it to its corresponding repo. I'll post again if I find something better but for now this is my chosen protocol.

References

  1. https://github.com/settings/tokens
  2. https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens
  3. https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry
  4. https://apptainer.org/docs/user/main/docker_and_oci.html#github-container-registry

· 2 min read
German P. Barletta

This is probably the last of a series of blog posts on singularity.

We know how to build a simple container with CUDA support and we know how to build a more complex one that also has conda support. We'll now sign our container (locuaz.sif) and upload it for others to download. Honestly, this is the easiest part of all and I'm only writing it down for future reference.

Intro (digression)

Before starting, we have to clarify something: Singularity is no more. In a weird turn of events, a company forked it and called it Singularity CE, while the original project had to rename itself to Apptainer and is now under the umbrella of the Linux Foundation, which I guess protects it from this kind of thing happening again, but, honestly, I have no idea.

As of now, an HPC cluster can be expected to have a singularity module, but I haven't found one with apptainer built in, so we'll stay with Sylabs, at least for now.

The sylabs way

Creating a key to sign containers

In order to sign something you need a key, and Singularity's docs are straightforward in this regard; just do what they say.
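Assuming nothing has changed in those docs, it boils down to creating a key pair:

singularity key newpair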

Creating a sylabs token to verify containers

You'll also want the ability to verify a signature. Creating an account on Sylabs allows you to verify containers from others and upload yours. Sadly, they don't let you create a standalone account; they force you to sign in with GitHub, Google, etc.

Signing and uploading with sylabs

After all of that, it's just:

singularity sign locuaz.sif

and to verify the signature, users will do:

$ singularity verify locuaz.sif 
INFO: Verifying image with PGP key material
[LOCAL] Signing entity: Patricio Barletta <pbarletta@gmail.com>
Objects verified:
ID |GROUP |LINK |TYPE
------------------------------------------------
1 |1 |NONE |Def.FILE
2 |1 |NONE |JSON.Generic
3 |1 |NONE |JSON.Generic
4 |1 |NONE |FS
INFO: Verified signature(s) from image 'locuaz.sif'

Finally, push it to your sylabs library. In my case, that looks like:

singularity push locuaz.sif library://pgbarletta/remote-builds/locuaz-0.5.3
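Users can then grab it with the matching pull command; for this image that would be something like:

singularity pull library://pgbarletta/remote-builds/locuaz-0.5.3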

Sylabs will give you 11 GB of storage for free, so you'll be good for a couple of images. It'd be fun to try and see if an Apptainer image gets accepted; my guess is a big resounding ....

References

  1. https://docs.sylabs.io/guides/latest/user-guide/signNverify.html
  2. https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry

· 4 min read
German P. Barletta

In the previous post we reviewed a simple Singularity definition file for a container that gave support for CUDA development.

We'll now see an example that uses the CUDA toolkit runtime and a conda environment, or more precisely, Mambaforge, which includes mamba as a conda replacement and conda-forge as the default channel.

mamba works almost identically to conda and it's an order of magnitude faster. I think many of us would've dropped conda as a package and virtual-environment manager if it wasn't for mamba's solver.

Writing the definition file

This is the Singularity definition file I used to containerize the locuaz optimization protocol. We'll skip the details about locuaz; suffice it to say that it's an antibody optimization protocol that carries a lot of dependencies: some of them cannot be installed through pip, others cannot be installed through conda, so we end up using both, which is not ideal. This is precisely why I think trying to containerize it is a good challenge.

Let's check the .def definition file:

Bootstrap: docker
From: nvcr.io/nvidia/cuda:11.7.0-runtime-ubuntu22.04

%post
export LC_ALL=C
export DEBIAN_FRONTEND=noninteractive
apt update
apt install -y wget libopenmpi-dev
mkdir /opt/concept
cd /opt/concept
mv ../usr_deps.yaml ./
wget https://github.com/conda-forge/miniforge/releases/download/23.3.1-0/Mambaforge-23.3.1-0-Linux-x86_64.sh
bash Mambaforge-23.3.1-0-Linux-x86_64.sh -p /opt/mambaforge -b
. /opt/mambaforge/bin/activate
mamba env create -f usr_deps.yaml
conda activate concept
pip install locuaz --root-user-action=ignore

%runscript
locuaz

%environment
export LC_ALL=C
source /opt/mambaforge/bin/activate /opt/mambaforge/envs/concept

%files
usr_deps.yaml opt/

Some explanations about the non-obvious lines:

  1. We install wget to download the mambaforge installer script and libopenmpi-dev since locuaz uses MPI to launch multiple GROMACS MD runs.
  2. We run the mambaforge installer with the -b flag to skip the license agreement question and -p to specify the install dir.
  3. . /opt/mambaforge/bin/activate to activate conda and then use the mamba executable to build the environment. The yaml file was included with the container.
  4. And after creating and activating the environment, we install locuaz with --root-user-action=ignore, a flag that was added to pip for the specific case of containerized builds, where we usually are the root user; it silences the pip warning about installing with root privileges.

Now, the major pain point when including mambaforge is the activation of the environment.

You can't run source /root/.bashrc after installing mambaforge since source is not available during the execution of %post. You can't run conda init or mamba init either, since they'll ask you to restart your shell.

The solution is to run the activation script in a way any UNIX shell should support, that is, using the . script syntax. Then, in %environment, we get a full bash interpreter, so we can source the activation script and point it to the folder where our environment resides: source /opt/mambaforge/bin/activate /opt/mambaforge/envs/<your_environment>.

Finally, we build it:

sudo singularity build locuaz.sif locuaz.def
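As a quick sanity check (my own habit, not part of the protocol), you can ask the container which Python it lands on; if %environment did its job, the prefix should point at the concept environment:

# should print /opt/mambaforge/envs/concept
singularity exec locuaz.sif python -c "import sys; print(sys.prefix)"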

Actually running it

There's another obstacle when running from a container, and this is the binding of host directories. Singularity includes some host dirs by default, but if your containerized workflow needs additional access, you'll need to include it with the --bind flag.

In our case, it's GROMACS that needs many additional locations. This is how the singularity command ends up looking on my machine:

singularity exec --nv --bind /usr/local/gromacs,/lib/x86_64-linux-gnu,/usr/local/cuda-12.2/lib64,/etc/alternatives locuaz.sif locuaz daux/config_ligand.yaml 

Notice the --nv flag to be able to run with GPU support, and how we include a plethora of comma-separated directories after the --bind flag. After locuaz.sif, our actual container, we call the locuaz program with a configuration file as argument.

The /usr/local/gromacs directory is where the GROMACS installation resides. The rest of the directories are the locations of the libraries that the gmx binary links to. These will be specific to each machine, but you'll easily find them by checking which libraries the gmx binary links against.

For example, on my machine:

ldd `which gmx`
linux-vdso.so.1 (0x00007ffc8817a000)
libgromacs.so.8 => /usr/local/gromacs/lib/libgromacs.so.8 (0x00007f74c0200000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f74bfe00000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f74c5417000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f74bfa00000)
libcufft.so.11 => /usr/local/cuda-12.2/lib64/libcufft.so.11 (0x00007f74b4c00000)
libopenblas.so.0 => /lib/x86_64-linux-gnu/libopenblas.so.0 (0x00007f74b27b0000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f74c532e000)
libgomp.so.1 => /lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f74c52da000)
libmuparser.so.2 => /usr/local/gromacs/lib/libmuparser.so.2 (0x00007f74c017d000)
/lib64/ld-linux-x86-64.so.2 (0x00007f74c5471000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f74c52d5000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f74c52d0000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f74c0178000)
libgfortran.so.5 => /lib/x86_64-linux-gnu/libgfortran.so.5 (0x00007f74b2400000)

If you wish to make the run line shorter, you can store these locations on the dedicated environment variable $SINGULARITY_BIND:

export SINGULARITY_BIND="/usr/local/gromacs,/lib/x86_64-linux-gnu,/usr/local/cuda-12.2/lib64,/etc/alternatives/"

And then just run:

singularity exec --nv locuaz.sif locuaz daux/config_ligand.yaml

This does make everything a bit uglier for the user and I'd argue that creating a conda environment and running pip is not that much harder, but hey, some people love containers.

Hopefully in the future I'll get to benchmark the containerized version of the protocol, but I don't expect significant slowdowns.

· 2 min read
German P. Barletta

Here's a simple starting file that's based on a CUDA 11 docker image for ubuntu 22.04. It includes the vectorAdd.cu file from cuda-samples/Samples/0_Introduction/vectorAdd and some header files from cuda-samples/Common. These sample files used to come with the CUDA toolkit, but now they have to be downloaded from a repo.

Bootstrap: docker
From: nvcr.io/nvidia/cuda:11.7.0-devel-ubuntu22.04

%post
export DEBIAN_FRONTEND=noninteractive
apt update
cd /opt
nvcc -ccbin g++ -m64 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 vectorAdd.cu -o local_vectorAdd -I./

%runscript
/opt/local_vectorAdd

%environment
export LC_ALL=C

%files
helper_cuda.h /opt/
helper_string.h /opt/
vectorAdd.cu /opt/

Some explanations about the non-obvious lines:

  1. export DEBIAN_FRONTEND=noninteractive is an almost mandatory line in all containers. It prevents the system from expecting user input, which in our case would halt the container build.
  2. export LC_ALL=C: so Perl doesn't complain about localization if we launch the container as a shell.
  3. We're asking nvcc to generate PTX and SASS for all currently supported architectures. If you only care about one machine, you can trim that -gencode list down to its GPU's compute capability (see the command below).
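A quick way to find that compute capability, assuming a reasonably recent nvidia-smi (older versions don't expose the compute_cap query field):

# prints the GPU name and its compute capability, e.g. "..., 8.0" for an A100
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader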

After saving this into the definition file singu.def and building it on my machine:

$ sudo singularity build singu.sif singu.def

I uploaded it onto Leonardo, which is a Red Hat 8.6 system, and ran it:

(base) [pbarlett@lrdn3433 ~]$ singularity run --nv singu.sif 
INFO: Converting SIF file to temporary sandbox...
WARNING: underlay of /etc/localtime required more than 50 (76) bind mounts
WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (387) bind mounts
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
INFO: Cleaning up image...

GLIBC versions between my computer (2.35) and Leonardo's (2.28) also differ, so we can only hope we don't run into any issues later on.

· 5 min read
German P. Barletta

C++ has a history of solving the same problem in different ways, and sometimes the new solutions come to replace, at least partially, older ones that are deemed less efficient, correct, and/or ergonomic. But established solutions still have a few advantages, and one of them is that people have been dealing with them and their drawbacks for longer, so in practice they may end up working better than expected. We'll review one such case.

Templates in C++ allow functions, classes and variables to operate on generic types and even values, with the introduction of non-type template parameters. They are the tool that enables parametric polymorphism in C++. Since they guarantee static computation, they've also been used to pre-compute values during compilation so they're readily available at runtime. For this last application, another tool has since been introduced: the constexpr keyword indicates that an expression can be evaluated at compile time, and consteval takes it one step further by restricting a function to compile-time evaluation only.

Why were constexpr and consteval introduced? Because templates were never designed to perform computation, but to customize behaviour. Using templates for computation was either impossible or demanded complex "template meta-programming" techniques that, taken together, amount to a pseudo-language within the C++ language. constexpr allows compile-time computation with the same syntax as run-time code, plus some constexpr sprinkled on top. Let's see consteval applied to a typical example: calculating an element of the Fibonacci sequence.

Calculating a Fibonacci element

The equations that define the Fibonacci sequence are the following:

F_{n} = F_{n-1} + F_{n-2}
F_{0} = 0, F_{1} = 1

where,

  • F_{i}: value of element i of the Fibonacci sequence.

Given the sub-problem structure of the problem, the calculation of element i demands the calculation of all previous elements, from 0 to i. This makes the calculation of a Fibonacci element an intensive task in its naive form and a classic starting example when teaching Dynamic Programming.

The simplest, most naive implementation in C++ would be:

#include <iostream>

auto fibo(int val) -> int
{
    if (val == 0) return 0;
    if (val == 1) return 1;
    return fibo(val-1) + fibo(val-2);
}

int main() {
    const int val = fibo(38);
    std::cout << val << std::endl;
}

When compiled and run with the UNIX time program, this code takes the following time on my machine:

real    0m0.312s
user 0m0.308s
sys 0m0.004s

Now, with the simple addition of a consteval keyword:

#include <iostream>

consteval auto fibo(int val) -> int
{
    if (val == 0) return 0;
    if (val == 1) return 1;
    return fibo(val-1) + fibo(val-2);
}

int main() {
    const int val = fibo(38);
    std::cout << val << std::endl;
}

we can bring this timing down to:

real    0m0.006s
user 0m0.001s
sys 0m0.005s

since the value of Fibonacci element 38 is precomputed by the time the program is run. The same run-time performance can be achieved with templates, though the code is a bit more contrived:

#include <iostream>

template<int I> struct Fib
{
    static int const val = Fib<I-1>::val + Fib<I-2>::val;
};

template<> struct Fib<0>
{
    static int const val = 0;
};

template<> struct Fib<1>
{
    static int const val = 1;
};

int main() {
    Fib<38> fibo;
    std::cout << fibo.val << std::endl;
}

Using non-type template parameters, the instantiation of the Fib<38> structure launches the instantiation of Fib<37> (which demands Fib<36> and Fib<35>) and Fib<36> (which also demands Fib<35> and Fib<34>), and so forth. This tree of computations gets very wide (on the order of 2^{37} nodes), so one would assume that all these instantiations take a long time, right? Let's time the compilation step now. The run-time and templated versions take about the same:

real    0m0.344s
user 0m0.298s
sys 0m0.045s

while the consteval version:

real    0m8.089s
user 0m7.960s
sys 0m0.122s

What happened? The run-time and consteval versions do basically the same thing, but in very different environments, which makes consteval take a big performance hit: it's almost instant at run time, but it takes a while to compile. The templated version, however, gives the best of both worlds. It's instant at run time and it doesn't take long to compile. Why? Well, because, as the title says, the computation is memoized. The overlapping subproblems of the Fibonacci calculation allow the template instantiation machinery to calculate each of the Fib<i> structs (and their vals) only once, by storing the template instantiations in a data structure (probably a hash map) and then reusing them; a solution we would have to implement ourselves if we wanted to do the same at run time, though there's probably a library that could help us if we're feeling lazy.
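For completeness, here's a minimal sketch of what that hand-rolled run-time memoization could look like (just the obvious cache-in-a-map approach, nothing library-grade):

#include <iostream>
#include <unordered_map>

// same recursion as before, but every result is cached the first time it's
// computed, so each fibo_memo(i) is only ever calculated once
auto fibo_memo(int val) -> long
{
    static std::unordered_map<int, long> cache{{0, 0}, {1, 1}};
    if (auto hit = cache.find(val); hit != cache.end())
        return hit->second;
    const long result = fibo_memo(val - 1) + fibo_memo(val - 2);
    cache[val] = result;
    return result;
}

int main() {
    std::cout << fibo_memo(38) << std::endl;
}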

So, that's it. constexpr and consteval, being relatively new facilities, run in a non-optimized environment, while template instantiation is the complete opposite. Does this mean we should go back to template meta-programming for our compile-time calculations? No. No chance. The syntax of our template solution was awful.

· 3 min read
German P. Barletta

While C++ programmers are used to the language's poor ergonomics, getting the type of a variable in C++ is a surprisingly convoluted process.