
How to Use GPUs in MATLAB: Complete NVIDIA Setup & Optimization Guide

Published 01/28/2026
Updated 02/20/2026
14 min read

MATLAB GPU acceleration is powerful but brittle: one misplaced gather or an array that's too small and the speedup disappears. For the right workloads — large, vectorized array math — you can see 10x to 100x gains. This guide walks through setup, the core patterns, and the traps that erase those gains, so you can get GPU working in MATLAB and know when it will (and won't) help.

If you want GPU acceleration on any hardware without manual device management, skip ahead to the RunMat section.

TL;DR

  • MATLAB GPU acceleration requires the Parallel Computing Toolbox and a CUDA-capable NVIDIA GPU.
  • The core performance pattern is: gpuArray once, run vectorized GPU-enabled operations, then gather once.
  • Most slowdowns come from small arrays, too many tiny kernels, and frequent CPU-GPU transfers.
  • Start with single precision unless your numerics require double.
  • If you want automatic CPU/GPU routing without manual device management, RunMat is a cross-platform alternative.

Prerequisites: what you need for MATLAB GPU acceleration

Before writing any GPU code, you need three things in place: compatible hardware, the right MATLAB toolbox, and a working CUDA driver.

Hardware requirements

MATLAB GPU acceleration requires an NVIDIA GPU with CUDA compute capability 3.5 or higher (recent MATLAB releases raise this minimum, so check the support matrix for your version). This includes most NVIDIA GPUs from the Kepler architecture (2012) onward: GeForce GTX 780+, Tesla K40+, Quadro K5000+, and all RTX-series cards. Check your GPU's compute capability on NVIDIA's CUDA GPUs page.

MATLAB's GPU support is built entirely on CUDA, which rules out AMD, Intel, and Apple Silicon GPUs (including M1/M2/M3/M4 Macs).

Software requirements

  • MATLAB R2023a or newer (recommended). Older versions work but have fewer GPU-enabled functions.
  • Parallel Computing Toolbox — this is a paid add-on (separate from the base MATLAB license). Without it, gpuArray and GPU-enabled functions are not available.
  • NVIDIA CUDA driver — MATLAB bundles a CUDA toolkit, but your system needs a compatible NVIDIA driver. See MathWorks' GPU support requirements for the version matrix.

Verifying GPU detection

Once everything is installed, check that MATLAB can see your GPU:

% How many GPUs does MATLAB see?
gpuDeviceCount
% ans = 1

% Details about the active GPU
gpuDevice
% CUDADevice with properties:
%   Name: 'NVIDIA GeForce RTX 4090'
%   ComputeCapability: '8.9'
%   ...

If gpuDeviceCount returns 0, check these common causes:

  • Driver mismatch. Update your NVIDIA driver to the version required by your MATLAB release.
  • Toolbox not installed. Run ver and confirm "Parallel Computing Toolbox" appears in the list.
  • Wrong GPU. Integrated Intel/AMD graphics won't appear — only discrete NVIDIA GPUs with CUDA support.
  • Multi-GPU systems. Use gpuDevice(n) to select a specific GPU by index.
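On a multi-GPU machine, inspection and selection look roughly like this (the device index 2 is illustrative; gpuDeviceTable requires R2022a or newer):

```matlab
% List all detected CUDA devices without selecting any of them (R2022a+)
gpuDeviceTable

% Select GPU 2 as the active device; later gpuArray work runs there.
% Note: selecting a different device resets that device's memory.
d = gpuDevice(2);
fprintf("Active GPU: %s (compute capability %s)\n", d.Name, d.ComputeCapability);
```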

Your first GPU computation

The core pattern in MATLAB is straightforward: upload data once, compute many steps on GPU, then gather the result once.

Here's the step-by-step:

  1. Put data on the GPU. Convert an existing array with gpuArray(x), or create data directly on the device (e.g. x = gpuArray.rand(N, 1, 'single');) to skip the upload entirely.
  2. Run vectorized operations. Use GPU-enabled functions (sin, .*, mean, etc.); they dispatch to the GPU automatically while x is a gpuArray.
  3. Gather only at the end. Call gather() once when you need a result on the CPU (e.g. for fprintf or further CPU work). Avoid calling gather inside loops.

% MATLAB gpuArray pattern: upload once, compute, gather once
gpurng(0);                               % seed the GPU random stream
x = gpuArray.rand(1e7, 1, 'single');     % generate directly on the device
y = sin(x) .* x + 0.5;
m = mean(y, 'all');
fprintf("m = %.6f\n", gather(m));

Every operation on y and m runs on the GPU because x is a gpuArray. The only CPU-GPU transfer is the final gather(m) — a single scalar.

The most common mistake

The most common performance killer is accidentally forcing a sync and download inside a loop:

% Anti-pattern: sync + download every iteration
x = gpuArray.rand(1e7, 1, 'single');
y = x;
for k = 1:20
    y = sin(y) .* y + 0.5;
    fprintf("step %d: %.6f\n", k, gather(mean(y, 'all')));
end

This isn't "wrong," but it changes the performance profile: you're measuring device synchronization and transfers as much as compute. Move the gather outside the loop to get a fair picture of GPU speed.
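The fix is mechanical: keep every iteration on the device and download one scalar after the loop finishes.

```matlab
% Fixed: no sync or download inside the loop
x = gpuArray.rand(1e7, 1, 'single');
y = x;
for k = 1:20
    y = sin(y) .* y + 0.5;       % stays on the GPU
end
m = gather(mean(y, 'all'));      % single transfer at the end
fprintf("final: %.6f\n", m);
```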


What NVIDIA GPUs accelerate in MATLAB (and what they don't)

NVIDIA GPUs are throughput machines. They're great at applying the same operations across huge arrays: elementwise transforms (sin, exp, .*, ./), reductions (sum, mean, std), and big matrix operations. That's why GPU acceleration tends to shine in image pipelines, Monte Carlo simulation, signal processing, and dense linear algebra. Those workloads naturally operate over millions of values.

Where GPUs lose is when the workload is fragmented: lots of tiny arrays, lots of small kernels, heavy scalar control flow, or frequent CPU-GPU transfers. In those cases, the GPU spends more time being managed than computing.

A quick gut check:

% GPU-shaped: large arrays + vectorized ops
x = rand(5e6, 1, 'single');
y = (x - mean(x)) ./ (std(x) + single(1e-6));
z = sum(sqrt(abs(y)), 'all');

% Often not GPU-shaped: many tiny problems, lots of overhead
acc = single(0);
for i = 1:10000
    a = rand(128, 1, 'single');
    acc = acc + sum(a .* a, 'all');
end

Here's a quick decision tree:

[Decision flowchart: Is the data large (roughly 100K+ elements)? Is the work vectorized array math with few transfers? Yes to both → run it on the GPU. Otherwise → refactor first.]

Use the flowchart above to see where your code sits; if you're in "Refactor first," the fix is usually to batch the work so the GPU sees fewer, larger operations instead of many small ones.


GPU-enabled functions in MATLAB

Not every MATLAB function supports gpuArray inputs. Here are the most commonly used GPU-enabled functions by category:

| Category | Functions | Notes |
| --- | --- | --- |
| Elementwise math | sin, cos, exp, log, sqrt, abs, pow2, sign | Just wrap input in gpuArray; these work transparently |
| Arithmetic | +, -, .*, ./, .^, * (matrix multiply) | Standard operators dispatch to GPU when operands are gpuArray |
| Reductions | sum, mean, std, var, min, max, prod, norm | Use 'all' dimension flag for full-array reductions |
| Linear algebra | mtimes, mldivide (\), eig, svd, lu, qr, chol | Large matrices benefit most; small matrices may be faster on CPU |
| FFT | fft, ifft, fft2, ifft2, fftn | Strong GPU speedups for large transforms |
| Random generation | gpuArray.rand, gpuArray.randn, gpuArray.randi | Generate directly on GPU to avoid an upload |
| Array creation | gpuArray.zeros, gpuArray.ones, gpuArray.eye, gpuArray.linspace | Same — allocate on device directly |
| Logical / indexing | find, sort, logical, any, all, comparison operators | Most work on GPU; find returns a gpuArray of indices |

For the full list, see MathWorks' GPU-enabled functions reference.

The pattern is dense, array-level work: elementwise ops, reductions, linear algebra, FFTs. What's missing is anything that leans on per-element branching or sparse indexing — those paths either aren't implemented for GPU or force a gather. If a function doesn't support gpuArray, MATLAB will either error or silently gather the data to CPU, which can introduce a hidden transfer penalty. When chaining operations, check that every function in the pipeline is GPU-enabled to keep data on the device.
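One way to catch a silent fall-back is to assert device residency between pipeline stages; isgpuarray and underlyingType (both R2020b+) make the check cheap:

```matlab
x = gpuArray.rand(1e6, 1, 'single');
y = sort(abs(x));                          % abs and sort are both GPU-enabled
assert(isgpuarray(y), "pipeline fell back to the CPU");
fprintf("on GPU: %d, element type: %s\n", isgpuarray(y), underlyingType(y));
```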


Performance traps that erase GPU speedups

Most disappointing GPU results come from a small set of patterns. You don't have to become a GPU expert to fix them; you just need to recognize a few shapes.

1) Too many CPU-GPU transfers

Transfers are expensive and they often force synchronization. In MATLAB, that's usually an accidental gather (or a CPU-only function that forces one). Touching intermediate results can pull you back to the host.

Keeping data resident on the GPU — rather than bouncing it back to the CPU — avoids transfer overhead.

% MATLAB anti-pattern: forcing a download mid-pipeline
x = gpuArray.rand(1e7, 1, 'single');
y = sin(x) .* x + 0.5;
y_host = gather(y);          % boundary: download early
m = mean(y_host, 'all');     % now you're on CPU

That early gather(y) forces a sync and copy; everything after it runs on CPU. Defer any gather until you actually need the result on the host.

2) Lots of tiny kernels instead of one big block

Every GPU kernel launch has some fixed overhead; thousands of tiny launches can be slower than one big fused launch. If your program looks like "do a tiny thing 10,000 times," you're often paying more for launch overhead than for compute. The fix is usually batching.

% Overhead-heavy shape: many small problems
acc = single(0);
for i = 1:2000
    x = rand(4096, 1, 'single');
    acc = acc + sum(sin(x) .* x + 0.5, 'all');
end

The fix: batch the work into one large array and do an elementwise chain plus a reduction:

% Better GPU shape: batch the work
X = rand(4096, 2000, 'single');
acc2 = sum(sin(X) .* X + 0.5, 'all');

3) Precision choices (single vs double)

Precision should be driven by two things: what your calculation actually needs, and what your GPU path supports.

What the backends support. Not every GPU abstraction offers full double-precision (FP64). For example, Metal (Apple's GPU API) provides strong FP32 support but no FP64. In MATLAB with an NVIDIA GPU, the Parallel Computing Toolbox supports both single and double; performance on double then depends on the GPU's FP64 capability and memory bandwidth. Consumer GPUs (GeForce) typically have much weaker FP64 throughput than data-center GPUs (Tesla, A100).

What you need. For many workloads, single precision (FP32) is enough — you get plenty of significant digits for a wide range of scientific and numerical tasks. When your problem genuinely requires double (e.g. certain accumulations or legacy requirements), you accept the performance cost.

Practical takeaway: If you're testing whether GPU acceleration is working, start with single unless you have a clear reason to need double.
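To see the single/double gap on your own card, one sketch is to time the same reduction in both precisions with gputimeit; the ratio you get depends entirely on your GPU's FP64 hardware:

```matlab
xd = gpuArray.rand(1e7, 1);          % double precision (MATLAB's default)
xs = single(xd);                     % cast to single, still on the device
t_d = gputimeit(@() sum(sin(xd) .* xd, 'all'));
t_s = gputimeit(@() sum(sin(xs) .* xs, 'all'));
fprintf("double %.4fs, single %.4fs, ratio %.1fx\n", t_d, t_s, t_d / t_s);
```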

4) Hidden sync points (printing, plotting, inspection)

Many workflows accidentally benchmark synchronization. Printing intermediate values, plotting inside loops, or repeatedly checking partial results can turn a smooth GPU pipeline into "compute a little, synchronize, download, repeat." For example, calling fprintf or disp on a gpuArray inside a loop forces a gather each time, so you're measuring transfer and sync cost rather than GPU compute. Move any inspection or logging outside the timed region, or gather once after the loop and then print.
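When you genuinely need per-iteration values, one pattern is to record them into a device-side array and download the whole history once the loop ends:

```matlab
x = gpuArray.rand(1e7, 1, 'single');
y = x;
stats = gpuArray.zeros(20, 1, 'single');   % per-step means, kept on the GPU
for k = 1:20
    y = sin(y) .* y + 0.5;
    stats(k) = mean(y, 'all');             % no gather, no printing here
end
history = gather(stats);                   % one download after the loop
fprintf("step %d: %.6f\n", [1:20; history']);
```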

Benchmarking: how to measure GPU speed without fooling yourself

Good benchmarks do a few boring things consistently:

  • Warm up once (first-run overhead can be large).
  • Run multiple iterations and take a median or mean.
  • Fix dtype and shape so single vs double doesn't skew the comparison.
  • Keep I/O out of the timed region (plotting and printing can dominate).
  • Be explicit about whether you're timing uploads/downloads or just compute.

A clean benchmark shape: allocate big single inputs, run a contiguous chain of elementwise math, reduce at the end, materialize a scalar once.
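Sketched with MATLAB's own helpers (timeit and gputimeit both warm up and repeat internally, and gputimeit handles GPU synchronization):

```matlab
N  = 1e7;
xc = rand(N, 1, 'single');     % CPU input
xg = gpuArray(xc);             % same data on the GPU

f_cpu = @() sum(sqrt(abs(sin(xc) .* xc + 0.5)), 'all');
f_gpu = @() sum(sqrt(abs(sin(xg) .* xg + 0.5)), 'all');

t_cpu = timeit(f_cpu);         % robust CPU timing
t_gpu = gputimeit(f_gpu);      % robust GPU timing (syncs correctly)
fprintf("cpu %.4fs, gpu %.4fs, speedup %.1fx\n", t_cpu, t_gpu, t_cpu / t_gpu);
```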


When vectorization isn't enough: custom CUDA kernels from MATLAB

There are real cases where the fastest approach is a custom kernel: unusual indexing, nonstandard ops, or tight loops that don't map onto GPU-enabled built-ins. MATLAB can support deeper CUDA integration paths, and they can deliver great performance.

But the cost curve changes. You're now managing:

  • a build toolchain (compilers, flags, target architectures),
  • driver/runtime compatibility,
  • deployment environments (developer laptops vs CI vs servers),
  • debugging and profiling at the kernel level.

If you enjoy debugging kernels and managing toolchains, this can be rewarding work. If you don't, it can become a time sink that pulls focus from your actual problem. For those who do want to go this route, MathWorks documents the workflow: Run CUDA or PTX Code on GPU.


Quick checks: "am I actually using the GPU?"

A fast way to sanity-check GPU execution is to compare the same calculation at a size where GPUs should win (millions of elements). Don't obsess over one run; warm-up and overhead are real.

In MATLAB you need explicit gpuArray to run on GPU; the snippet below shows CPU (plain x) vs GPU (gpuArray(x)):

N = 1e7;
x = rand(N, 1, 'single');

% CPU
m_cpu = mean(sin(x) .* x + 0.5, 'all');

% GPU
xg = gpuArray(x);
m_gpu = mean(sin(xg) .* xg + 0.5, 'all');

fprintf("cpu=%.6f gpu=%.6f\n", double(m_cpu), double(gather(m_gpu)));

If the GPU isn't helping, it's usually one of three things: the problem is too small, the code is forcing boundaries, or the computation is dominated by something other than array math (I/O, parsing, plotting, scalar loops).
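If you time by hand with tic/toc instead, remember that GPU operations are asynchronous; call wait(gpuDevice) before toc so the queued work is actually finished when you stop the clock:

```matlab
xg = gpuArray.rand(1e7, 1, 'single');
tic;
m = mean(sin(xg) .* xg + 0.5, 'all');
wait(gpuDevice);               % block until all queued GPU work completes
t = toc;
fprintf("gpu: %.4fs (m = %.6f)\n", t, gather(m));
```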


Beyond NVIDIA: GPU acceleration on any hardware

The sections above cover MATLAB's GPU path — which works well when you have an NVIDIA GPU, the Parallel Computing Toolbox, and you're comfortable managing gpuArray and gather calls. But there are real limitations:

  • If you're on Apple Silicon, AMD, or Intel integrated graphics, MATLAB's GPU path doesn't work.
  • The Parallel Computing Toolbox is a separate license on top of MATLAB.
  • Every script becomes a residency and transfer exercise: gpuArray here, gather there, and you must check that every function in the pipeline is GPU-enabled.

RunMat takes a different approach. It runs MATLAB-syntax code and handles GPU acceleration automatically — no explicit device arrays, no vendor lock-in, and no extra license.

The same computation from earlier, without any device management:

rng(0);
x = rand(1e7, 1, 'single');
y = sin(x) .* x + 0.5;
m = mean(y, 'all');
fprintf("m = %.6f\n", double(m));

Under the hood, RunMat uses fusion — combining multiple array operations into one GPU kernel — to reduce overhead and keep the GPU busy. This happens automatically when the computation is contiguous. For more detail, see the RunMat Fusion guide.

How automatic routing works

RunMat's runtime examines the shape of your computation — array sizes, operation types, data dependencies — and decides per-operation whether to run on CPU or GPU. Large, contiguous elementwise chains get fused into a single GPU kernel. Small or irregular work stays on CPU. You don't need to annotate anything.

This is the same "GPU-shaped" intuition from earlier in this guide, except the runtime applies it for you instead of requiring you to manually wrap arrays in gpuArray.

Cross-platform GPU support

RunMat uses wgpu, a cross-platform implementation of the WebGPU standard, to target multiple GPU backends:

  • Metal on macOS (M1/M2/M3/M4 Apple Silicon)
  • DirectX 12 on Windows
  • Vulkan on Linux
  • WebGPU in the browser (Chrome 113+, Edge 113+, Safari 18+, Firefox 139+)

RunMat targets these backends directly, so you don't need the CUDA toolkit or an NVIDIA card.

Where fusion helps most

RunMat tends to shine on the same workloads that are naturally GPU-shaped in MATLAB: long chains of array math and reductions, big elementwise pipelines, and batched workloads. The code itself usually doesn't need to change.

A good pattern is "math first, inspect last":

x = rand(1e7, 1, 'single');
y = sin(x) .* x + 0.5;
m = mean(y, 'all');
fprintf("m = %.6f\n", double(m));

If a script is slower than expected, the first thing to do is usually structural:

  • Make the arrays larger (or batch multiple problems together)
  • Remove mid-pipeline printing/plotting
  • Ensure your data is in single precision
  • Avoid reshaping the program into thousands of tiny steps

For benchmarks comparing RunMat's fusion engine against MATLAB, PyTorch, and NumPy, see Introducing RunMat Accelerate.

GPU-resident visualization

The "avoid transfers" principle extends to plotting. In most tools, visualizing GPU-computed data means gathering it back to CPU and handing it to a separate rendering system — which introduces exactly the kind of transfer overhead this guide warns against. RunMat's plotting renders directly from GPU memory, with zero copy between the computation and the visualization. The plot is just a few more matrix operations (camera transforms, projections) at the end of the same GPU pipeline that computed the data.

Where to run RunMat

| Environment | GPU path | Best for |
| --- | --- | --- |
| Browser (runmat.com/sandbox) | WebGPU when supported | Try RunMat with no install; smaller/medium workloads |
| Desktop app (coming soon) | Native (Metal / DX12 / Vulkan) | Full IDE + full GPU headroom |
| CLI (runmat run script.m) | Native (Metal / DX12 / Vulkan) | Scripts, benchmarks, CI, max performance |

FAQ: common GPU + MATLAB questions

Why is my GPU slower than my CPU?

Usually the arrays are too small, you're doing many tiny steps, or you're transferring often (e.g. gather or printing in a loop). Batch into larger arrays and call gather only once at the end. See Performance traps.

What GPU do I need for MATLAB?

NVIDIA only, with CUDA compute capability 3.5+ (newer MATLAB releases raise the minimum). See Prerequisites and NVIDIA's CUDA GPUs page.

How much faster is GPU vs CPU for MATLAB?

For large, vectorized workloads expect roughly 10–100x when the code is GPU-shaped; for small or fragmented work, GPU can be slower. Use the decision flowchart in this guide to check fit.

Should I use single or double?

Use what your numerics need. Single (FP32) is faster and uses half the memory; start there unless you need double. See Precision choices in the performance traps section.

Do I need to rewrite everything for GPU?

Not always. If your code is already vectorized array math, the main work is wrapping inputs with gpuArray, keeping data on the device, and batching small operations into larger arrays where you can. See Your first GPU computation and Performance traps.

How do I know if my code is GPU-shaped?

Large arrays (100K+ elements), elementwise or reduction ops, minimal transfers. Use the flowchart; if you have small arrays or frequent gather/printing, refactor toward batching.

Does MATLAB GPU acceleration work on Mac?

Not with the official toolbox (CUDA isn't on Apple Silicon). RunMat uses Metal on macOS for M1/M2/M3/M4. See Beyond NVIDIA.

Can I use GPU without the Parallel Computing Toolbox?

In MATLAB, no — the toolbox is required and is a paid add-on. RunMat includes GPU acceleration by default. See Beyond NVIDIA and free MATLAB alternatives.

What is GPU fusion and why does it matter?

Fusion combines multiple array ops into one GPU kernel instead of one kernel per op, cutting memory traffic and launch overhead. MATLAB fuses only where the toolbox implements it; RunMat does it automatically for contiguous computation. See Beyond NVIDIA.

What's the simplest rule to remember?

Make the work big, make it contiguous, and avoid transfers.
