







# Formal Methods for GPU Software Development Using Ada SPARK

<u>Leonidas Kosmidis</u>, Dimitris Aspetakis, Matina Maria Trompouki

ESA STAR AO 2-1856/22/NL/GLC/ov

**ESA Technical Officer: Gabor Marosy** 

**Software Product Assurance Workshop 2023** 

### **Outline**

- Introduction and motivation
- Background
  - The GPGPU programming architecture
- Project Contributions
  - Device code verification
  - Buffer overflow detection
- Conclusions



# **Project Objectives**



#### **Objectives**

Develop an open source infrastructure in which GPU code (device code and code communicating with the host CPU interface) annotated with a specification language (Ada SPARK) for properties like pre-conditions, post-conditions, loop-invariants etc. can be used with an automatic proof system to prove the correctness of the code and the absence of runtime errors



# **Project Objectives**



Identify limitations of formal methods for GPU code, so that can be addressed by tool vendors or/and the potential GPU users understand what these tools can and cannot do



# Safety Critical Systems







- Used in avionics, automotive and aerospace industries
- Correct execution is of paramount importance
  - Any malfunction may be dangerous
  - Designed to comply with functional safety standards:
    - Automotive: ISO 26262, Avionics: DO-178C, Aerospace ECSS
- Traditionally rely on very old and simple single core processors
  - Cannot provide the performance required for new advanced functionalities

# Need for Higher Performance in Aerospace Systems

- Airbus: Automatic Taxi, Take-Off and Landing (ATTOL)
- ESA: Φ-Sat-1, OPSAT Al and automatic cloud screening









# Need for Higher Performance in Safety Critical Systems

- Legacy hardware used for safety critical systems cannot provide the required performance
- Embedded Graphics Processing Units (GPUs) are:
  - Designed to comply with safety critical functional safety standards e.g. ISO 26262
  - Very attractive candidate platforms for safety critical systems
  - GPU4S (GPU for Space) project funded by the European Space Agency at BSC shown very promising performance results on space relevant processing, especially on NVIDIA GPUs like NVIDIA Xavier and TX2





# Need for Safe Programming Models

- ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming models, e.g. cubA [1]
  ...are (ASIL-D) needs to programming mod



#### Ada SPARK

- High level programming language
- Appropriate for low-level programming like C
  - Similar performance but safer
  - Strong typing, bound checks in arrays
  - Even programming in Ada protects against common C programming mistakes
- Widely used in safety critical systems, especially in the aerospace domain as well as for security
- SPARK is a safe subset of Ada
- Can be used with Formal methods Tools
  - Prove the absence of runtime errors
  - It can formally verify program specifications



### Ada SPARK — Adoption Levels











Bronze

- Data and flow analysis
- Prevents null dereferences, ensures proper data flow



Silver

 Guarantees absence of runtime errors (including buffer and numerical overflow, division-by-zero)



- Gold
  - Proves key integrity properties (e.g. pre/post-conditions)



- Platinum
  - Full functional proofs of the requirements





#### Ada SPARK — CUDA backend

SPARK

Expanding the boundaries of safe and secure programming.

- On-going collaboration between NVIDIA and AdaCore
  - NVIDIA has adopted SPARK for the development of the secure hypervisor (CPU) for their Embedded GPU platforms
- On-going development of a CUDA Backend for Ada
  - Allows to use Ada instead of CUDA C for programming both the host and the GPU code
  - Currently under closed beta
  - AdaCore donated a license and support for any issues we discovered
- Current version of the tools do not support all language features yet (e.g. shared memory, thread synchronisation)
- The AdaCore CUDA backend is not yet integrated with SPARK tools
  - Figuring out how to use it was part of the project contributions







# The GPGPU Programming Architecture



## **GPUs: Programmer's View**

GPU / Device code
GPU Programming Language



- Single instruction multiple threads
- Threads organised in up-to 3D groups
  - Unique thread identifiers
- Several memory types
- Memory Transfers to/from host CPU
  - DMA accesses over PCIe
  - Explicit memory allocation and transfers
  - Raw pointers

# Example: Vector Addition Kernel in CUDA

```
Device Code
_host__
void vecAddKernel(int* A, int* B, int* C, int n) {
   int i = threadIdx.x + blockDim.x * blockIdx.x

if (i < n) {
        C[i] = A[i] + B[i];
        };
}</pre>
```



# Example: Vector Addition Kernel Launch in CUDA (Host Code)

#### **Host Code**

The ceiling function makes sure that there are enough threads to cover all elements.

```
#define BLOCK_SIZE 16.0
__host__ void vecAdd(int *A, int* B, int* C, int n) {
    int *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, n, * sizeof(int));
   cudaMalloc((void **)&d_B, n, * sizeof(int));
    cudaMalloc((void **)&d_C, n, * sizeof(int));
    dim3 DimBlock(BLOCK_SIZE, 1, 1);
   dim3 DimGrid(ceil(n/BLOCK_SIZE), 1, 1);
    cudaMemcpy(d_A, A, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, n * sizeof(int), cudaMemcpyHostToDevice);
   vecAddKernel<<<DimGrid,DimBlock>>>(d_A, d_B, d_C, n);
    cudaMemcpy(C, d_C, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
```



#### **CUDA Kernel Execution in a Nutshell**

```
#define BLOCK_SIZE 16.0
                                                              __global__ void vecAddKernel(int *A, int *B, int *C, int n) {
__host__ void vecAdd(int *A, int* B, int* C, int n) {
                                                                  int i = blockIdx.x * blockDim.x + threadIdx.x;
    (\ldots)
                                                                  if(i < n)
    dim3 DimBlock(BLOCK_SIZE, 1, 1);
                                                                      C[i] = A[i] + B[i];
    dim3 DimGrid(ceil(n/BLOCK_SIZE), 1, 1);
    (\ldots)
    vecAddKernel<<<DimGrid,DimBlock>>>(d_A, d_B, d_C, n);
    (\ldots)
                                        Blk 0
                                                                                 Blk N-1
                                                             Mk
                                                             RAM
```



# Project Contributions on GPU Software Verification



# Methodology

- Created small examples containing specific classes of GPU programming mistakes
- Examined whether the SPARK formal tools are able to identify them
- Identified how the programmer can guide the tools to detect them
- Developed programming methodologies for Ada SPARK GPU programming
- Developed case studies applying our methodology:
  - Max
  - Histogram
  - Ported GPU4S Bench benchmarks to Ada SPARK GPU, achieving at minimum stone level of SPARK adoption and in several cases bronze level



# Device code verification



## Guarantees from the Typing System

```
type Int_10 is new Integer range -10 .. 10;
type Int_20 is new Integer range -20 .. 20;

type Vector is array (Natural range $\infty$) of Int_10;
type Fat_Vector is array (Natural range $\infty$) of Int_20;
```

#### Using only the default Integer type:



#### **Arithmetic Overflows and Underflows**

# SPARK verification over the silver level guarantees us <u>freedom of runtime errors</u>:

```
if X ≤ A'Last then
    C (X) := Int_20 (A (X) + B (X) + 1);
    C (X) := Int_20 (A (X) + B (X) - 1);
end if;
```

#### Prover output:



# Arithmetic case study from GPU4S Bench

- GR740's FPU does not support denormal numbers as input
- Underflow exception raised under RTEMS

```
for (int i=0; i < size_k; ++i)
{
    #ifdef INT
    kernel[i] = rand() % (NUMBER_BASE * 100);
    #else
    kernel[i] = (bench_t)rand()/(bench_t)(RAND_MAX/NUMBER_BASE);
    #endif
}</pre>
```

Currently working on statically detecting of this issue in Ada SPARK



### Division-by-Zero

Divisions are common in kernel computations.

The possibility of dividing with zero is usually left unchecked.

We might get runtime errors if coverage is not sufficient.

```
if X ≤ A'Last then
    C (X) := A (X) / B (X);
end if;
```

SPARK's control flow analysis is able to guarantee absence of the error:

```
if X ≤ A'Last and then B (X) /= 0 then
    C (X) := A (X) / B (X);
end if;
```



#### **Functional Correctness Guarantees**

```
procedure VectorMax
  (A : Vector_Device_Access; B : Vector_Device_Access;
   C : Vector Device Access)
is
  X : Natural := Cuda_Index (Block_Dim.X, Block_Idx.X, Thread_Idx.X);
   (\dots)
   function Max (A, B : Integer) return Integer with
     Post ⇒ (Max'Result ≥ A and Max'Result ≥ B)
   is
   begin
      if A > B then
         return A;
      else
         return B:
      end if:
   end Max:
begin
  if X ≤ A'Last then
      C(X) := Max(A(X), B(X));
      pragma Assert (C (X) \geqslant A (X) and C (X) \geqslant B (X));
   end if;
end VectorMax;
```

- Preconditions / Postconditions
- Loop Invariants
- Assertions

Assertions can be evaluated at runtime, but in SPARK code they are also used as static proof targets.

A typo in the Max function, like A<B instead of A>B, results in this:



#### Fixed-Point Arithmetic

- Floating point numbers are a source of inaccuracies.
- Accumulated errors result in silent bugs.
- Fixed point types are predictable.
- The prover treats them as scaled integers.

No warnings are reported from the prover on this example:

```
type Grade is delta 0.1 digits 3 range 0.0 .. 10.0;
type Grade_Vector is array (Natural range ♦) of Grade;
procedure AvgGrades
  (A : Grade Vector Device Access; B : Grade Vector Device Access;
  C : Grade_Vector_Device_Access)
is
  X : Integer := Cuda Index (Block Dim.X, Block Idx.X, Thread Idx.X);
  (\dots)
  type Grade_20 is delta 0.1 digits 3 range 0.0 .. 20.0;
  tmp : Grade 20;
begin
  if X ≤ A'Last then
           := Grade 20 (A (X) + B (X));
           := tmp / 2;
     C(X) := Grade(tmp);
  end if:
end AvgGrades;
```

# **Buffer Overflow Detection**



# Detecting buffer overflow errors

- Buffer overflow errors are very common in GPU kernels.
- They can result in (possibly silent) bugs.
- There may be a problem inside the kernel with indexing.
- There might also be a problem with the dimensions we give at the kernel invocation.
- We need a way to detect such errors.



#### **GPU4S** Bench Identified Issues

Incomplete initialisation due to wrong size

```
unsigned int size A = arguments_parameters->size;
38
           unsigned int mem size A = sizeof(bench t) * size A;
39
                bench t* A = (bench t*) malloc(mem size A);
40
               // B output <u>matrix</u>
41
                unsigned int size B = arguments parameters->size + arguments parameters->kernel size - 1;
42
           unsigned int mem size B = sizeof(bench t) * size B;
43
                bench t* h B = (bench t*) malloc(mem size B);
44
                bench t* d B = (bench t*) malloc(mem size B);
45
                        for (int i=0; i<arguments parameters->size; i++){
60
                                #ifdef INT
61
                                A[i] = rand() \% (NUMBER BASE * 100);
62
                        for (int i=0; i<arguments_parameters->size; i++){
69
                                h B[i] = 0;
70
                                d B[i] = 0;
```

#### **GPU4S** Bench Identified Issues

- Out of bounds accesses in kernel execution due to incorrectly specified sizes
- Code runs without a crash or output verification in other platforms under RTEMS and Linux!

```
void execute kernel(GraficObject *device object, unsigned int n, unsigned int m, unsigned int w, unsigned int kernel_size)
        // Start compute timer
    struct timespec start, end;
    clock gettime(CLOCK MONOTONIC RAW, &start);
        const unsigned int kernel_rad = kernel_size / 2;
        const unsigned int output size = n + kernel size - 1;
        for(unsigned int i = 0; i k output size; ++i)
                for (unsigned int j = 0; j < kernel size; ++j)</pre>
                        if (i + (j - kernel size + 1) >= 0 && i + (j - kernel size + 1) < n)
                        device object->d B[i] += device object->kernel[kernel size - j - 1]
                                                                                                device object->d A[i +(j - kernel size + 1) ];
```

## A Programming Pattern for Buffer Overflow Detection

#### Step 01

Construct a wrapper for the CUDA kernel invocation and the data transfers before and after it.

#### Step 02

Add preconditions in the wrapper's specification that dictate invariants among the vectors' ranges and the given block and grid dimensions.



## A Programming Pattern for Buffer Overflow Detection

#### Step 03

Reflect the wrapper's preconditions with Ada-SPARK assumptions in the declaration part of the kernel's body.

```
function Cuda_Index
  (Block Dim, Block Idx, Thread Idx: unsigned) return Natural with
  SPARK Mode ⇒ Off
is
begin
   return Natural (Block_Dim * Block_Idx + Thread_Idx);
end Cuda Index;
procedure VectorAdd
 (A, B : not null Vector_Device_Constant_Access;
     : not null Fat_Vector_Device_Access)
is
   ----- Mirror wrapper's precondition semantics with assumptions -----
  X : Natural := Cuda_Index (Block_Dim.X, Block_Idx.X, Thread_Idx.X);
   pragma Assume (A'First = 0 and B'First = 0 and C'First = 0);
   pragma Assume (A'Last = B'Last and then B'Last = C'Last);
   pragma Assume (A'Last ≤ Integer'Last - 31);
  Max_X: Integer := ((A'Last + 31) / 32) * 32;
   pragma Assume (X in 0 .. Max_X);
begin
  if X ≤ A'Last then
      C(X) := Int_{20}(A(X) + B(X));
  end if;
end VectorAdd;
```



## A Programming Pattern for Buffer Overflow Detection

Should we forget to check for the upper boundary of our vectors, we'll get an error like this:

```
function Cuda_Index
  (Block Dim, Block Idx, Thread Idx: unsigned) return Natural with
  SPARK Mode ⇒ Off
is
begin
   return Natural (Block_Dim * Block_Idx + Thread_Idx);
end Cuda Index;
procedure VectorAdd
 (A, B : not null Vector Device Constant Access;
  C : not null Fat_Vector_Device_Access)
is
   ----- Mirror wrapper's precondition semantics with assumptions -----
  X : Natural := Cuda_Index (Block_Dim.X, Block_Idx.X, Thread_Idx.X);
  pragma Assume (A'First = 0 and B'First = 0 and C'First = 0);
   pragma Assume (A'Last = B'Last and then B'Last = C'Last);
  pragma Assume (A'Last ≤ Integer'Last - 31);
  Max X : Integer := ((A'Last + 31) / 32) * 32;
   pragma Assume (X in 0 .. Max_X);
begin
  if X ≤ A'Last then
     C(X) := Int 20 (A(X) + B(X));
  end if;
end VectorAdd;
```



## **GPU4s Benchmark Suite Port (Use cases)**

Seven benchmarks have been ported first to Ada for CPU and then GPU:

```
matrix_multiplication_bench
convolution_2D_bench
fir_filtering
max_pooling_bench
relu_bench
softmax_bench
correlation_2D
fast_fourier_transform_bench
(int + float version)
```

Even without any specific SPARK verification attempts, the benchmarks reach stone level verification.

For several of them, we also reached bronze adoption level. Gradually we are going to increase the adoption level.



# **Conclusions and Contributions**

- We found a way to run the Ada-SPARK prover on Ada code targeting CUDA
- We developed examples showcasing SPARK's viability on kernel code
- We constructed a pattern for buffer overflow detection across host and device code
- We ported GPU4s benchmark suite to AdaCore's CUDA backend
  - applying our developed methodologies
  - achieving at minimum stone level SPARK adoption level
  - Latent errors found in C/CUDA version do not exist in the Ada SPARK version
  - All our developments are released as open source [1][2]



# **Backup Slides**



# Extending Semantics to Better Reflect the CUDA Model



# **Emulating Kernel Invocation Semantics**

We lack information that could be useful in specification verification.

Our kernel code is going to be run multiple times according to the grid and block dimensions.

The use of Ada-SPARK's *ghost* constructs helps us emulate the missing semantics.

```
Barcelona
Supercomputing
Center
Centro Nacional de Supercomputación
```

```
procedure VectorMax
 (A, B, C : Vector Device Access)
   -- Mirror wrapper's precondition semantics with assumptions
   procedure Max Apply All (A, B : Vector; C : in out Vector) with Ghost,
      (A'First = 0 and B'First = 0 and C'First = 0)
      and then (A'Last = B'Last and A'Last = C'Last),
     Post \Rightarrow (for all I in C'Range \Rightarrow C (I) \geqslant A (I) and C (I) \geqslant B (I))
   begin
      for I in C'Range loop
         C(I) := Integer'Max(A(I), B(I));
         pragma Loop_Invariant (for all Idx in C'First .. I ⇒
                                  C (Idx) = Integer'Max (A (Idx), B (Idx)));
      end loop;
   end Max Apply All;
   C Ghost : Vector := C.all with Ghost;
begin
  if X ≤ A'Last then
      C(X) := Integer'Max(A(X), B(X));
      pragma Assert (C (X) \geqslant A (X) and C (X) \geqslant B (X));
   end if:
   Max Apply All (A.all, B.all, C_Ghost);
   pragma Assert
     (for all X in C_Ghost'Range \Rightarrow C_Ghost (X) \geqslant A (X) and C_Ghost (X) \geqslant B (X))
end VectorMax:
```

# Kernel Patterns Enjoying Proper Verification



#### Safe Kernel Patterns

The last missing information the prover lacks for proper verification are the asynchronous thread execution model semantics.

We identify two common kernel patterns that are inherently safe from data races and synchronization problems:

- A kernel that takes arbitrary read-only inputs, one write-only output vector, and within each thread writes a single and unique output cell.
- A kernel that takes arbitrary read-only inputs, and has either one write-only scalar, or one write-only non-scalar output that gets updated exclusively through atomic operations.

In both cases, the kernel must not make use of the GPU's shared memory.

