I was starting the Hugging Face course on LLMs, which can be found here https://huggingface.co/learn/llm-course/chapter0/1?fw=pt, and I ran into the classic problem of having to set up a proper environment. I’ll spare you the details of updating my outdated Anaconda distribution, but let’s zoom in on one step: I had to install CUDA to make the Hugging Face models run at a decent speed on my computer. Now, why was this necessary?

Nvidia’s CUDA allows machine-learning libraries to use graphics processing units (GPUs) for parallel execution of computational workloads. Originally, GPUs, as the name suggests, were used primarily for graphics processing. Graphics processing is quite different from the sequential, branching processes that central processing units (CPUs) are designed to handle. A simple example of a sequential process is calculating the Fibonacci numbers with the formula F(n) = F(n-1) + F(n-2), with F(0) = F(1) = 1: each number depends on the two before it, so they cannot be computed in parallel. It turns out there are many operations occurring in a computer that need to happen sequentially like this. Graphics processing, however, is not one of them.
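As a quick illustration of why this is sequential, here is a minimal Fibonacci implementation (using the convention F(0) = F(1) = 1 from above): each loop iteration needs the result of the previous one before it can proceed, so no amount of parallel hardware helps.

```python
def fibonacci(n: int) -> int:
    # Each step depends on the two previous values, so the loop
    # iterations cannot run in parallel: F(n) needs F(n-1) and F(n-2).
    a, b = 1, 1  # F(0) = F(1) = 1
    for _ in range(n):
        a, b = b, a + b
    return a

print([fibonacci(n) for n in range(6)])  # [1, 1, 2, 3, 5, 8]
```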

Graphics computing consists of repeating many similar matrix calculations that are largely independent of each other. To accommodate this, GPUs are optimized for running many similar calculations in parallel. Now, it turns out, training machine-learning models also involves many such matrix calculations that can run in parallel. Hence, using a GPU provides a massive speed boost to machine-learning libraries, and CUDA is what enables this. For an intro to the subject, I’d recommend 3blue1brown’s playlist: https://www.3blue1brown.com/topics/neural-networks.
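To make that independence concrete, here is a small NumPy sketch: every entry of an element-wise matrix operation depends only on the matching entries of the inputs, so nothing stops the hardware from computing all entries at the same time.

```python
import numpy as np

a = np.random.randn(3, 3)
b = np.random.randn(3, 3)

# Sequential view: one entry at a time. Note that no entry
# needs the result of any other entry.
c_loop = np.empty_like(a)
for i in range(3):
    for j in range(3):
        c_loop[i, j] = a[i, j] * b[i, j]

# Vectorized view: the same independent work expressed in one call,
# which a GPU can spread across its many cores.
c_vec = a * b

assert np.allclose(c_loop, c_vec)
```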

I used the script below to get a feeling for how much faster calculations with CUDA are than regular matrix operations. It uses the CuPy package with CUDA 12.8. This package works similarly to NumPy but runs the calculations on the GPU through CUDA to speed them up.

import numpy as np
import cupy as cp
import time

# Create a large array
size = 100_000_000
data = np.random.randn(size).astype(np.float32)
gpu_data = cp.array(data) # Copy the array to GPU memory.

# NumPy calculation (CPU)
start = time.time()
cpu_result = np.sin(data) * np.cos(data) + np.exp(np.clip(data, -5, 5))
cpu_time = time.time() - start

# CuPy warm-up calculation (GPU)
start = time.time()
gpu_result = cp.sin(gpu_data) * cp.cos(gpu_data) + cp.exp(cp.clip(gpu_data, -5, 5))
cp.cuda.Stream.null.synchronize() # Wait until CUDA is done calculating.
gpu_warm_up_time = time.time() - start

# CuPy calculation (GPU)
start = time.time()
gpu_result = cp.sin(gpu_data) * cp.cos(gpu_data) + cp.exp(cp.clip(gpu_data, -5, 5))
cp.cuda.Stream.null.synchronize()
gpu_time = time.time() - start

print(f"Array size: {size} elements")
print(f"CPU (NumPy): {cpu_time:.3f} seconds")
print(f"GPU warm-up (CuPy): {gpu_warm_up_time:.3f} seconds")
print(f"GPU (CuPy): {gpu_time:.3f} seconds")

print(f"Speedup: {cpu_time/gpu_time:.1f}x faster")

Now, for smallish calculations there was not much of an improvement, but for large arrays it speeds things up a lot. So, for 100 million elements, we get:

Array size: 100000000 elements
CPU (NumPy): 1.543 seconds
GPU warm-up (CuPy): 0.024 seconds
GPU (CuPy): 0.024 seconds
Speedup: 65.5x faster

The results differed a lot for me between runs. So, play around a bit to see CUDA's effect on the calculation for yourself.
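One way to play around is to wrap the comparison in a loop over array sizes and watch where the GPU starts to pull ahead. Here is a minimal sketch of that idea; it assumes a working CUDA setup with CuPy installed, and falls back to CPU-only timings if CuPy is missing.

```python
import time

import numpy as np

try:
    import cupy as cp  # GPU path; requires a CUDA installation
except ImportError:
    cp = None

def bench(xp, data):
    """Time the element-wise expression from the script above,
    using either NumPy (xp=np) or CuPy (xp=cp)."""
    start = time.time()
    xp.sin(data) * xp.cos(data) + xp.exp(xp.clip(data, -5, 5))
    if cp is not None and xp is cp:
        cp.cuda.Stream.null.synchronize()  # Wait until CUDA is done.
    return time.time() - start

for size in (1_000, 1_000_000, 100_000_000):
    data = np.random.randn(size).astype(np.float32)
    cpu_t = bench(np, data)
    line = f"size {size:>11,}: CPU {cpu_t:.4f}s"
    if cp is not None:
        gpu_data = cp.array(data)     # Copy to GPU memory.
        bench(cp, gpu_data)           # Warm-up run.
        gpu_t = bench(cp, gpu_data)   # Timed run.
        line += f", GPU {gpu_t:.4f}s, speedup {cpu_t / gpu_t:.1f}x"
    print(line)
```

On my machine the small sizes show little to no benefit, while the largest one reproduces the big speedup from the run shown above.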