    Machine Learning & Research

    Writing Your First GPU Kernel in Python with Numba and CUDA

By Oliver Chambers | August 19, 2025
Image by Author | Ideogram

     

GPUs are great for tasks where you need to perform the same operation across many different pieces of data. This is known as the Single Instruction, Multiple Data (SIMD) approach. Unlike CPUs, which have only a few powerful cores, GPUs have thousands of smaller ones that can run these repetitive operations all at once. You will see this pattern a lot in machine learning, for example when adding or multiplying large vectors, because each calculation is independent. That is the ideal scenario for using GPUs to speed up tasks with parallelism.

NVIDIA created CUDA as a way for developers to write programs that run on the GPU instead of the CPU. It is based on C and lets you write special functions called kernels that can run many operations at the same time. The problem is that writing CUDA in C or C++ isn't exactly beginner-friendly. You have to deal with things like manual memory allocation, thread coordination, and understanding how the GPU works at a low level. This can be overwhelming, especially if you're used to writing code in Python.

This is where Numba can help. It lets you write CUDA kernels in Python, using the LLVM (Low Level Virtual Machine) compiler infrastructure to compile your Python code directly into CUDA-compatible kernels. With just-in-time (JIT) compilation, you annotate your functions with a decorator, and Numba handles everything else for you.

In this article, we'll take a typical example, vector addition, and convert simple CPU code into a CUDA kernel with Numba. Vector addition is an ideal example of parallelism, because the addition at any single index is independent of all other indices. That is the perfect SIMD scenario, so all indices can be added concurrently to complete the vector addition in a single parallel operation.

     

Note that you will need a CUDA GPU to follow this article. You can use Colab's free T4 GPU or a local GPU with the NVIDIA toolkit and NVCC installed.

     

# Setting Up the Environment and Installing Numba

     
Numba is available as a Python package, and you can install it with pip. We will also use numpy for vector operations. Set up the Python environment using the following commands:

python3 -m venv venv
source venv/bin/activate
pip install numba-cuda numpy
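Before moving on, it is worth confirming that Numba can actually see a CUDA device. A quick check, using Numba's own cuda.is_available() and cuda.detect() helpers, might look like this:

from numba import cuda

# Returns True if Numba found a usable CUDA GPU and driver
print(cuda.is_available())

# Prints the detected devices and driver/runtime details
cuda.detect()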

     

    # Vector Addition on the CPU

     
Let's take a simple example of vector addition. For two given vectors, we add the corresponding values at each index to get the final value. We will use numpy to generate random float32 vectors and compute the final output using a for loop.

import numpy as np

N = 10_000_000  # 10 million elements
a = np.random.rand(N).astype(np.float32)
b = np.random.rand(N).astype(np.float32)
c = np.zeros_like(a)  # Output array

def vector_add_cpu(a, b, c):
    """Add two vectors on the CPU"""
    for i in range(len(a)):
        c[i] = a[i] + b[i]

     

    Here’s a breakdown of the code:

• We initialize two vectors, each with 10 million random floating-point numbers
• We also create an empty vector c to store the result
• The vector_add_cpu function simply loops through each index and adds the elements from a and b, storing the result in c

This is a serial operation; each addition happens one after another. While this works fine, it is not the most efficient approach, especially for large datasets. Since each addition is independent of the others, this is a good candidate for parallel execution on a GPU.

In the next section, you will see how to convert this same operation to run on the GPU using Numba. By distributing each element-wise addition across thousands of GPU threads, we can complete the task significantly faster.

     

    # Vector Addition on the GPU with Numba

     
You will now use Numba to define a Python function that can run on CUDA, and execute it from within Python. We are doing the same vector addition operation, but now it can run in parallel for every index of the numpy array, leading to faster execution.

Here is the code for writing the kernel:

from numba import config

# Required for newer CUDA versions to enable the linking tools.
# Prevents CUDA toolkit and NVCC version mismatches.
config.CUDA_ENABLE_PYNVJITLINK = 1

from numba import cuda, float32

@cuda.jit
def vector_add_gpu(a, b, c):
    """Add two vectors using a CUDA kernel"""
    # Thread ID within the current block
    tx = cuda.threadIdx.x
    # Block ID within the grid
    bx = cuda.blockIdx.x
    # Block width (number of threads per block)
    bw = cuda.blockDim.x

    # Calculate the unique thread position
    position = tx + bx * bw

    # Make sure we don't go out of bounds
    if position < len(a):
        c[position] = a[position] + b[position]

def gpu_add(a, b, c):
    # Define the grid and block dimensions
    threads_per_block = 256
    blocks_per_grid = (N + threads_per_block - 1) // threads_per_block

    # Copy data to the device
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_c = cuda.to_device(c)

    # Launch the kernel
    vector_add_gpu[blocks_per_grid, threads_per_block](d_a, d_b, d_c)

    # Copy the result back to the host
    d_c.copy_to_host(c)

def time_gpu():
    c_gpu = np.zeros_like(a)
    gpu_add(a, b, c_gpu)
    return c_gpu

     

    Let’s break down what is going on above.

     

// Understanding the GPU Function

The @cuda.jit decorator tells Numba to treat the following function as a CUDA kernel: a special function that will run in parallel across many threads on the GPU. At runtime, Numba compiles this function to CUDA-compatible code and handles the C-API transpilation for you.

@cuda.jit
def vector_add_gpu(a, b, c):
    ...

     

This function will run on thousands of threads at the same time. But we need a way to figure out which part of the data each thread should work on. That is what the next few lines do:

• tx is the thread's ID within its block
• bx is the block's ID within the grid
• bw is the number of threads in a block

We combine these to calculate a unique position, which tells each thread which element of the arrays it should add. Note that the grid does not always map exactly onto the data: the total number of launched threads is a multiple of the block size, so it can exceed the vector length. For example, with N = 10,000,000 and 256 threads per block we launch 39,063 blocks, i.e. 10,000,128 threads, leaving 128 threads with no valid element. Therefore, we add a guard condition to validate the index before performing the addition. This prevents any out-of-bounds runtime error when accessing the array.

Once we know the unique position, we can add the values just as we did for the CPU implementation. The following line matches the CPU implementation:

    c[position] = a[position] + b[position]
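As a side note, Numba also provides a cuda.grid() helper that computes this absolute thread index for you. An equivalent kernel using it (a sketch, not part of the original listing) could look like:

from numba import cuda

@cuda.jit
def vector_add_gpu_v2(a, b, c):
    # cuda.grid(1) returns tx + bx * bw for a 1D launch
    position = cuda.grid(1)
    if position < len(a):
        c[position] = a[position] + b[position]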

     

    // Launching the Kernel

The gpu_add function sets things up:

• It defines how many threads and blocks to use. You can experiment with different block and thread sizes, and print the corresponding values inside the GPU kernel. This can help you understand how the underlying GPU indexing works.
• It copies the input arrays (a, b, and c) from CPU memory to GPU memory, so the vectors are accessible in GPU RAM (see the variation sketched after this list).
• It runs the GPU kernel with vector_add_gpu[blocks_per_grid, threads_per_block].
• Finally, it copies the result back from the GPU into the c array, so we can access the values on the CPU.
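As a small variation on gpu_add (not part of the article's code), you could skip the host-to-device copy of the zero-filled output array and allocate it directly on the device with cuda.device_array_like, which saves one transfer:

def gpu_add_v2(a, b, c):
    threads_per_block = 256
    blocks_per_grid = (len(a) + threads_per_block - 1) // threads_per_block

    # Copy only the inputs; allocate the output directly on the device
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_c = cuda.device_array_like(a)

    vector_add_gpu[blocks_per_grid, threads_per_block](d_a, d_b, d_c)

    # Copy the result back into the host array c
    d_c.copy_to_host(c)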

     

# Comparing the Implementations and Potential Speedup

     
Now that we have both the CPU and GPU versions of vector addition, it is time to see how they compare. It is important to verify the results and measure the speedup we can get from CUDA parallelism.
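One note before running it: the snippet below calls a time_cpu() helper that was not defined above. A minimal version, mirroring time_gpu() and reusing vector_add_cpu and the arrays a and b from earlier, could be:

def time_cpu():
    c_cpu = np.zeros_like(a)
    vector_add_cpu(a, b, c_cpu)
    return c_cpu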

import timeit

c_cpu = time_cpu()
c_gpu = time_gpu()

print("Results match:", np.allclose(c_cpu, c_gpu))

cpu_time = timeit.timeit("time_cpu()", globals=globals(), number=3) / 3
print(f"CPU implementation: {cpu_time:.6f} seconds")

gpu_time = timeit.timeit("time_gpu()", globals=globals(), number=3) / 3
print(f"GPU implementation: {gpu_time:.6f} seconds")

speedup = cpu_time / gpu_time
print(f"GPU speedup: {speedup:.2f}x")

     

First, we run both implementations and check whether their results match. This is important to confirm that our GPU code is working correctly and that the output is the same as the CPU's.

Next, we use Python's built-in timeit module to measure how long each version takes. We run each function a few times and take the average to get a reliable timing. Note that the first call to time_gpu() above also triggers the JIT compilation of the kernel, so it doubles as a warm-up before the timed runs. Finally, we calculate how many times faster the GPU version is compared to the CPU. You should see a big difference, because the GPU can do many operations at once, whereas the CPU handles them one after another in a loop.

Here is the expected output on an NVIDIA T4 GPU on Colab. Note that the exact speedup can differ based on CUDA versions and the underlying hardware.

Results match: True
    CPU implementation: 4.033822 seconds
    GPU implementation: 0.047736 seconds
    GPU speedup: 84.50x
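A large part of that GPU time is typically spent copying the arrays between host and device rather than in the kernel itself. If you are curious, you can keep the data on the device and time only the kernel launch, synchronizing explicitly since launches are asynchronous (a sketch using the same arrays and kernel as above):

threads_per_block = 256
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block

# One-time transfers, kept out of the timed region
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.device_array_like(a)

def time_kernel_only():
    vector_add_gpu[blocks_per_grid, threads_per_block](d_a, d_b, d_c)
    cuda.synchronize()  # wait for the kernel to finish

kernel_time = timeit.timeit("time_kernel_only()", globals=globals(), number=10) / 10
print(f"Kernel only: {kernel_time:.6f} seconds")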

     

This simple test helps demonstrate the power of GPU acceleration and why it is so useful for tasks involving large amounts of data and parallel work.

     

    # Wrapping Up

     
And that's it. You have now written your first CUDA kernel with Numba, without actually writing any C or CUDA code. Numba provides a simple interface for using the GPU from Python, and it makes it much easier for Python engineers to get started with CUDA programming.

You can now use the same template to write more advanced CUDA algorithms, which are prevalent in machine learning and deep learning. Whenever you find a problem that follows the SIMD paradigm, it is a good idea to use the GPU to improve execution time.

The complete code is available as a Colab notebook that you can access here. Feel free to try it out and make small changes to get a better understanding of how CUDA indexing and execution work internally.
     
     

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the book "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
