Behind the scenes of any Deep Learning application there is little more than billions upon billions of matrix multiplications and other operations requiring a high degree of parallelization.
The piece of hardware that enabled the AI revolution is the GPU, which makes it feasible to train models in a reasonable amount of time.
NVIDIA is the most popular GPU vendor worldwide, and CUDA is the programming model used to work with their devices.
Parallel computation
Matrix transposition
Benchmarking
C
Makefile
CUDA
Explore the possibilities offered by the CUDA programming model by analyzing a case study: matrix transposition.
The application tests CPU-only implementations and compares them with their GPU counterparts, analyzing the system's behaviour in both cases.
After this preliminary task, we'll switch to another problem, convolution on images, and again parallelize the computation on the GPU.
This project was part of the GPU Computing course at the University of Trento, a.y. 2023/2024, during the first year of my Master's degree in AI Systems.
It was my first real experience with parallel computing, beyond some simple multi-process / multi-thread applications I had written in the past.
I also had the pleasure of being taught by professor Flavio Vella, a former NVIDIA employee.
Computing the transpose of a given matrix \(X^{n,m}\) having \(n\) rows and \(m\) columns means nothing more than computing \(t(X^{n,m}) = Y^{m,n}\) such that the first row of
\(X\) becomes the first column of \(Y\), the second row becomes the second column, and so on.
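As a tiny worked example (numbers chosen purely for illustration):
\[
X = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}
\qquad
t(X) = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}
\]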
This might look like a trivial operation, but it has more uses than one might think.
For this case study we'll be dealing with square matrices for simplicity, but the overall reasoning also applies to the rectangular case.
A naive implementation to solve this problem might be the following:
\begin{algorithm}
\caption{Naive matrix transposition}
\begin{algorithmic}
\STATE \textbf{Input:} $X$
\STATE \textbf{Parameters:} matrix-size
\FOR{$i = 0$ to matrix-size}
    \FOR{$j = i + 1$ to matrix-size}
        \STATE swap $X(i,j)$ with $X(j,i)$
    \ENDFOR
\ENDFOR
\STATE return $X$
\end{algorithmic}
\end{algorithm}
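As a concrete rendering of this pseudocode, a minimal C-style sketch could look like the snippet below; the function and variable names are mine (not necessarily those used in the project), and the matrix is assumed to be stored in row-major order.

```cuda
// In-place transpose of a square matrix stored in row-major order.
// n is the number of rows (= columns); names are illustrative only.
void transpose_naive(float *X, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            float tmp    = X[i * n + j];   // element (i, j)
            X[i * n + j] = X[j * n + i];   // element (j, i)
            X[j * n + i] = tmp;            // swap the symmetric pair
        }
    }
}
```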
This is more or less a one-to-one translation of the in-place CPU block-based matrix transposition described above, but implemented with CUDA.
The most important aspect of this solution is the use of shared memory, which is roughly \(100\times\) faster than global memory (on non-cached accesses) and enables
coalesced writes.
Each thread block is in charge of copying two symmetric blocks of \(X\) (one per side w.r.t. the main diagonal) into shared memory, referred to as \(A\) (upper block) and \(B\) (lower block).
Once the copy is completed for both blocks, each thread block proceeds to copy \(A\) and \(B\) back into \(X\), following column-major ordering on
reads and row-major ordering on writes. Notice also that the write addresses of \(A\) and \(B\) in \(X\) are swapped.
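To make the idea concrete, here is a hedged sketch of what such a kernel could look like; the tile size, kernel name, and launch configuration are assumptions of mine and may differ from the project's actual code (see the Report and GitHub page for the real implementation).

```cuda
#define TILE 32   // tile width (assumption: n is a multiple of TILE)

// In-place transpose of a square n x n row-major matrix.
// Each thread block handles the symmetric tile pair (by, bx) and (bx, by);
// launch e.g. with dim3 block(TILE, TILE), dim3 grid(n / TILE, n / TILE).
__global__ void transpose_inplace_shared(float *X, int n)
{
    // +1 padding on the fast dimension avoids shared-memory bank conflicts
    __shared__ float A[TILE][TILE + 1];   // tile above the main diagonal
    __shared__ float B[TILE][TILE + 1];   // symmetric tile below the diagonal

    int bx = blockIdx.x, by = blockIdx.y;
    if (by > bx) return;                  // only upper-triangular tile pairs work
                                          // (diagonal tiles transpose themselves)
    int xa = bx * TILE + threadIdx.x;     // column inside the upper tile
    int ya = by * TILE + threadIdx.y;     // row    inside the upper tile
    int xb = by * TILE + threadIdx.x;     // column inside the lower tile
    int yb = bx * TILE + threadIdx.y;     // row    inside the lower tile

    // Coalesced, row-major reads of both tiles into shared memory.
    A[threadIdx.y][threadIdx.x] = X[ya * n + xa];
    B[threadIdx.y][threadIdx.x] = X[yb * n + xb];
    __syncthreads();

    // Column-major reads from shared memory, row-major (coalesced) writes to X:
    // the transposed lower tile lands where the upper tile was, and vice versa.
    X[ya * n + xa] = B[threadIdx.x][threadIdx.y];
    X[yb * n + xb] = A[threadIdx.x][threadIdx.y];
}
```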
For more info about our results and code, check out the project's Report and GitHub page.
When applying a filter to an \(n\)-dimensional signal (\(n\in \mathbb{N}^+\)), one of the most common and effective operations is the convolution. In the 2-dimensional case, which is the case for images, operations such as blurring, sharpening and edge extraction can be performed using convolutions. Some deep learning architectures also take massive advantage of convolution to automatically extract significant features from images. The wide range of applications of image convolution underscores the need for an efficient implementation of the convolution operation. For a filter \(w\) of size \(n \times m\) and an image \(f\), the convolution operation is defined as: \[ w(x,y) * f(x,y) = \sum_{s = 0}^{m-1} \sum_{t = 0}^{n-1} w(s,t) \cdot f(x-s,y-t) \]
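As a starting point for intuition, a naive CUDA kernel implementing the formula above (one thread per output pixel, zero-padding outside the image) could look like the following sketch; the names, the row-major layout of \(w\), and the border handling are assumptions of mine, and this is not the project's optimized version.

```cuda
// Direct 2D convolution of an H x W image f with an n x m filter w.
// One thread computes one output pixel; out-of-range pixels count as zero.
// Launch e.g. with dim3 block(16, 16), dim3 grid((W + 15) / 16, (H + 15) / 16).
__global__ void conv2d_naive(const float *f, const float *w, float *out,
                             int H, int W, int n, int m)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // output column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // output row
    if (x >= W || y >= H) return;

    float acc = 0.0f;
    for (int s = 0; s < m; s++) {                    // filter columns
        for (int t = 0; t < n; t++) {                // filter rows
            int fx = x - s;                          // f(x - s, y - t), as in the
            int fy = y - t;                          // definition above
            if (fx >= 0 && fy >= 0)                  // zero-padding at the borders
                acc += w[t * m + s] * f[fy * W + fx];
        }
    }
    out[y * W + x] = acc;
}
```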
Our design process went through several incremental optimizations of the convolution operation, based on a divide-and-conquer paradigm.
The first part on matrix transposition was done entirely by myself. The image convolution part was done in collaboration with a colleague of mine: I focused mostly on benchmarking and code base maintenance, but also helped in finding solutions to the problem.