Correctness & Trust
Every numerical builtin in RunMat traces back to a named module, a pinned version, a stable commit reference, and a reproducible test set.
RunMat has 111 numerical-math builtins, out of a total of 422 core builtins in the runtime.
Numerical math builtins typically include CPU and GPU implementations. CPU implementations forward to "known-correct" reference implementations from established open-source Rust libraries, such as rustfft for FFT, nalgebra for SVD, and LAPACK for eigendecomposition.
Where RunMat implements a builtin itself (for example, its GPU implementations), it is tested against a known-correct reference to a documented floating-point tolerance.
How we measure numerical correctness
Numerical correctness means a mathematical operation produces the same answer as a known-correct reference, within a tolerance chosen from the numerical properties of the operation.
Where available, RunMat uses concrete, well-established open-source library references, such as rustfft for FFT, nalgebra for SVD, and LAPACK for eigendecomposition.
What we do to ensure this:
- Every numerical builtin has a reference implementation and an automated test that compares representative evaluation cases against it, typically including positive, negative, and edge cases.
- Reference implementations are measured to the target numeric type's maximum floating-point precision and documented inline in the test source.
- CI/CD gates releases and verifies that all tests execute and resolve within their tolerance limits. CI/CD tests run on macOS, Windows (Intel CPU + NVIDIA GPU), and Linux (Intel CPU + NVIDIA GPU).
See also: Design Philosophy.
Correctness tiers
RunMat's builtins fall into three correctness tiers, each tested differently, plus a cross-cutting GPU parity layer:
- Crate-backed builtins. When a builtin delegates to a well-established Rust crate (rustfft for FFT, nalgebra for SVD), we inherit that crate's correctness. Our tests confirm RunMat's calling convention, array layout, and output format reproduce the crate's results, not re-prove the crate's internals.
- LAPACK-backed builtins. Optional FFI to platform-native BLAS and LAPACK (lapack 0.19, blas 0.22, Apple Accelerate via accelerate-src on macOS, openblas-src elsewhere). These are the same libraries NumPy, SciPy, and MATLAB's own solvers rely on.
- In-repo solvers. Some factorizations (LU, QR, Cholesky) and small in-place routines are implemented directly in the RunMat runtime. For these, we test against a known-correct reference (an external crate, LAPACK, or hand-derived expected values) to a documented floating-point tolerance.
- GPU paths. Every GPU-accelerated builtin is parity-tested against RunMat's own CPU path. Each parity test picks a tolerance from the numerical properties of its operation: typically `1e-9` to `1e-12` for f64 and `1e-5` to `1e-6` for f32, with relative bounds for operations where absolute magnitudes grow. See the coverage table for the exact `atol` and `rtol` per builtin, or the FFT deep dive for a full GPU-vs-CPU parity walkthrough.
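To make the tolerance check concrete, here is a minimal sketch of a combined atol/rtol comparison of the kind these parity tests apply. This is a hypothetical helper, not RunMat's actual test code; the relative term uses the `max(|reference|, 1)` scaling described in the glossary below.

```rust
// Illustrative sketch (not RunMat's actual helper): assert that `actual` is
// close to `reference` within an absolute bound plus a relative bound that
// scales with max(|reference|, 1).
fn assert_close(actual: f64, reference: f64, atol: f64, rtol: f64) {
    let diff = (actual - reference).abs();
    let bound = atol + rtol * reference.abs().max(1.0);
    assert!(
        diff <= bound,
        "parity failure: |{actual} - {reference}| = {diff:e} exceeds {bound:e}"
    );
}

fn main() {
    // A result that differs from the reference in the last few bits passes...
    assert_close(1.0 + 1e-13, 1.0, 1e-12, 0.0);
    // ...and a large-magnitude result passes under a relative bound.
    assert_close(1.0e6 * (1.0 + 1e-7), 1.0e6, 0.0, 5e-4);
    // A genuinely wrong result, e.g. assert_close(1.001, 1.0, 1e-12, 0.0),
    // would panic and fail the test.
}
```

A real parity test would loop such a check over every element of the GPU and CPU output arrays.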
Validation spans the entire math tree
We'll use the FFT deep dive below as a worked example: it crosses a CPU/GPU boundary against an external reference (rustfft). The same pattern runs across the rest of the math tree:
- RunMat ships 422 total builtins across math, linear algebra, array ops, string, plotting, I/O, and OOP. The 111 builtins where tolerance-based numerical validation applies all live under `crates/runmat-runtime/src/builtins/math/`. String handling, file I/O, figure rendering, and the rest are validated by behavioural tests instead; they don't have an "atol" to document.
- Those 111 numerical-math modules ship with co-located `#[cfg(test)]` suites, totalling 1,635 `#[test]` functions.
- Categories covered include elementary math (`exp`, `log`, `abs`, `sqrt`, `gamma`, hyperbolics), trigonometry (`sin`/`cos`/`tan` and their inverse and hyperbolic variants), reductions (`mean`, `std`, `var`, `median`, `min`, `max`, `prod`, `cumsum`, `diff`, `all`, `any`), rounding (`mod`, `rem`, `fix`, `floor`, `ceil`, `round`), signal processing (`conv`, `conv2`, `filter`, `deconv`, window functions), polynomials (`polyval`, `polyfit`, `polyder`, `polyint`, `roots`), and every linear algebra factorization, solver, and structural query.
- 41 integration-level test files under `crates/runmat-accelerate/tests/` and `crates/runmat-runtime/tests/` exercise cross-cutting paths: GPU-vs-CPU parity, output-pool reuse, fusion-engine correctness, feature-gated BLAS/LAPACK FFI, and statistical properties of the RNG.
Use the coverage table below to find the exact reference, test file, and tolerance for any category you care about.
Coverage table
The table below is a representative list of the reference implementations RunMat is tested against and the tolerance bounds those tests enforce. The atol column is the absolute tolerance; the rtol column is the relative tolerance, scaled by max(|reference|, 1).
| Category | Source | atol | rtol |
|---|---|---|---|
| Elementary math | libm + in-repo; num-complex 0.4 for complex → `elementwise/` module tests + `complex.rs` | `1e-12` (F64), `1e-5` (F32) | — |
| Trigonometry | libm + in-repo → `trigonometry/` module tests | `1e-12` (F64), `1e-5` (F32) | — |
| Rounding / modulo | In-repo → `rounding/` module tests | Exact (int) or `1e-12` | — |
| FFT | rustfft (CPU); staged WGPU kernel (GPU) → `fft_staged.rs` + `fft/` module tests | `1e-9` (F64), `1e-3` (F32) | — |
| Signal | In-repo + GPU provider hooks; num-complex 0.4 for complex → `signal/` module tests | `1e-10` – `1e-12` (exact on int) | — |
| Polynomials | In-repo Horner's method; companion-matrix roots → `poly/` module tests | `1e-10` – `1e-12` | — |
| Reductions | Runtime reduction infra (CPU + GPU) → `reduction_parity.rs` + `reduction/` module tests | `1e-7` (mean), `1e-6` (sum), `1e-10` – `1e-12` (in-module) | — |
| Cumulative reductions | In-repo → `reduction/` module tests | Exact (int); `1e-12` (float) | — |
| Linear solve | nalgebra 0.32 SVD + DMatrix → `linalg/solve/` module tests | `1e-7` (residual), `1e-4` – `1e-5` (GPU/CPU) | — |
| SVD | nalgebra 0.32 `linalg::SVD` → `svd.rs` tests | `1e-10` – `1e-12` | — |
| LU factorization | In-repo partial-pivot solver → `lu.rs` tests | `1e-9` (reconstruction) | — |
| QR factorization | In-repo Householder solver → `qr.rs` tests | `1e-9` (reconstruction) | — |
| Eigendecomposition | LAPACK dgeev/zgeev via lapack 0.19 FFI → `eig.rs` tests | `1e-10` (`assert_matrix_close`) | — |
| Cholesky | LAPACK dpotrf via lapack 0.19 (feature-gated) + in-repo fallback → `chol.rs` tests + GPU tests | `1e-12` (in-module), `1e-6` (GPU) | — |
| Linalg ops (non-factor) | In-repo + nalgebra 0.32 → `linalg/ops/` module tests | `1e-9` – `1e-12` | — |
| Linalg structure | In-repo → `linalg/structure/` module tests | Pattern / boolean (exact) | — |
| Vector algebra | In-repo + WGPU provider → `cross.rs` tests | `1e-9` – `1e-12` | — |
| Matrix multiply (GPU) | WGPU fused kernels → `matmul_residency.rs`, `matmul_epilogue.rs`, `matmul_small_k.rs` | `1e-9` (F64 linear), `5e-5` (nonlinear epilogue), `1e-6` (residency) | `5e-4 * max(\|reference\|, 1)` |
| SYRK (GPU) | WGPU kernel → `syrk.rs` | `1e-9` (F64) | `1e-3 * max(\|reference\|, 1)` |
| Fused reductions (GPU) | WGPU fusion engine → `fused_square_mean_all_parity.rs`, `fused_reduction_sum_square.rs`, `fused_reduction_sum_mul.rs` | `1e-6` | — |
| BLAS / LAPACK (optional) | FFI via blas 0.22, lapack 0.19; accelerate-src (macOS) / openblas-src (Linux) → `blas_lapack.rs` | `1e-10` | — |
| Random number generation | Custom RunMatLCG (in-repo) → `rng.rs` | Statistical bounds on `\|mean\|` and related moments (see test file) | — |
For the full set, see the individual test files in the repo. Builtin tests are co-located with the builtin implementation. Other tests live under crates/*/tests/.
Deep dive: validating FFT
The FFT, IFFT, and their companion operations are core to signal processing and linear algebra.
In RunMat, the FFT is implemented across two backends:
On the CPU, RunMat performs one-dimensional transforms with rustfft and builds fft2, ifft2, fftn, and ifftn by applying those transforms across the requested axes. The rustfft crate is the same crate that underlies much of the Rust numerical ecosystem, with over 15 million downloads on crates.io as of 2026-04-17.
On the GPU, RunMat uses its own WGPU kernel implementation, with specialized paths for power-of-two sizes, radix-3, radix-5, mixed 2/3/5 factorizations, and Bluestein fallback for harder lengths.
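As a rough illustration of how such a length-based dispatch might look, the sketch below (hypothetical, not RunMat's actual kernel-selection code) classifies a transform length by its small prime factors:

```rust
// Hypothetical dispatch sketch: classify an FFT length by its prime factors
// to pick one of the kernel families described above.
fn fft_path(n: usize) -> &'static str {
    if n == 0 {
        return "empty";
    }
    if n.is_power_of_two() {
        return "power-of-two"; // specialized power-of-two staging
    }
    // Strip out factors of 2, 3, and 5.
    let mut m = n;
    for p in [2usize, 3, 5] {
        while m % p == 0 {
            m /= p;
        }
    }
    if m == 1 {
        "mixed 2/3/5" // radix-3, radix-5, and mixed factorizations
    } else {
        "bluestein" // a large prime factor remains: Bluestein fallback
    }
}

fn main() {
    assert_eq!(fft_path(1024), "power-of-two");
    assert_eq!(fft_path(15), "mixed 2/3/5"); // 3 * 5
    assert_eq!(fft_path(7), "bluestein"); // prime length, as in the parity tests
}
```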
The first layer of validation is small closed-form checks. The builtin tests for fft and ifft verify known spectra and known inverses on small inputs, along with MATLAB-style API behavior such as default-dimension selection, zero-padding, truncation, empty lengths, and the 'symmetric' flag. These tests establish that the public surface behaves as intended, not just that two implementations happen to agree.
The second layer is structural validation for higher-dimensional transforms. In RunMat, fft2 is implemented as two sequential one-dimensional transforms, and fftn as repeated one-dimensional transforms over each axis. The corresponding tests verify exactly that decomposition. So the multidimensional correctness claim is not “we trust a separate monolithic N-D FFT kernel”; it is “our N-D builtins are validated as the composition of the 1-D transform we already test.”
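The decomposition claim can be checked independently with naive O(n²) DFTs. The sketch below is self-contained (it is not rustfft and not RunMat's kernels): transforming rows and then columns matches the direct 2-D double-sum definition to floating-point tolerance.

```rust
// Minimal sketch: a 2-D DFT composed from 1-D DFTs equals the direct 2-D DFT.
use std::f64::consts::PI;

type C = (f64, f64); // (re, im)

fn cmul(a: C, b: C) -> C {
    (a.0 * b.0 - a.1 * b.1, a.0 * b.1 + a.1 * b.0)
}

// Naive 1-D DFT: X[k] = sum_j x[j] * exp(-2*pi*i*k*j/n).
fn dft_1d(x: &[C]) -> Vec<C> {
    let n = x.len();
    (0..n)
        .map(|k| {
            x.iter().enumerate().fold((0.0, 0.0), |acc, (j, &v)| {
                let w = -2.0 * PI * (k * j) as f64 / n as f64;
                let t = cmul(v, (w.cos(), w.sin()));
                (acc.0 + t.0, acc.1 + t.1)
            })
        })
        .collect()
}

// 2-D DFT built as row transforms followed by column transforms.
fn dft_2d_composed(x: &[Vec<C>]) -> Vec<Vec<C>> {
    let rows: Vec<Vec<C>> = x.iter().map(|r| dft_1d(r)).collect();
    let (h, w) = (rows.len(), rows[0].len());
    let mut out = vec![vec![(0.0, 0.0); w]; h];
    for c in 0..w {
        let col: Vec<C> = (0..h).map(|r| rows[r][c]).collect();
        for (r, v) in dft_1d(&col).into_iter().enumerate() {
            out[r][c] = v;
        }
    }
    out
}

// 2-D DFT from the direct double-sum definition, used as the reference.
fn dft_2d_direct(x: &[Vec<C>]) -> Vec<Vec<C>> {
    let (h, w) = (x.len(), x[0].len());
    let mut out = vec![vec![(0.0, 0.0); w]; h];
    for k1 in 0..h {
        for k2 in 0..w {
            for n1 in 0..h {
                for n2 in 0..w {
                    let ph = -2.0 * PI
                        * ((k1 * n1) as f64 / h as f64 + (k2 * n2) as f64 / w as f64);
                    let t = cmul(x[n1][n2], (ph.cos(), ph.sin()));
                    out[k1][k2].0 += t.0;
                    out[k1][k2].1 += t.1;
                }
            }
        }
    }
    out
}

fn main() {
    // A small complex-valued 3x4 input.
    let x: Vec<Vec<C>> = (0..3)
        .map(|r| (0..4).map(|c| ((r * 4 + c) as f64, 0.3 * c as f64)).collect())
        .collect();
    let (a, b) = (dft_2d_composed(&x), dft_2d_direct(&x));
    for r in 0..3 {
        for c in 0..4 {
            assert!((a[r][c].0 - b[r][c].0).abs() < 1e-9);
            assert!((a[r][c].1 - b[r][c].1).abs() < 1e-9);
        }
    }
}
```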
The third layer is GPU parity against the host reference. The staged GPU tests in crates/runmat-accelerate/tests/fft_staged.rs run representative inputs through the WGPU kernels and compare the results elementwise against a CPU reference computed with rustfft.
The tolerances are selected by provider precision: 1e-3 for F32 and 1e-9 for F64. Those tests cover both forward parity and FFT-then-IFFT roundtrips, and they include not only power-of-two sizes but also non-power-of-two families such as 9, 25, 15, and 7. At the runtime layer, additional tests compare GPU-backed fft and ifft calls against their CPU equivalents, including a prime-length transform on a non-last dimension.
If the GPU kernel drifts (from a compiler upgrade, a shader rewrite, or an environment variable change), CI fails on the next pull request. If rustfft drifts against its own prior behaviour after a version bump, the same test catches it.
Other GPU-accelerated builtins follow the same pattern with tolerances matched to their own numerical properties (see the coverage table).
Running RunMat's test suite
RunMat is set up to run tests with standard cargo test. A WGPU-compatible GPU is needed for GPU parity tests; CPU-only tests run anywhere.
```shell
# Run the full test suite
cargo test

# All GPU tests in the runmat-accelerate crate
cargo test -p runmat-accelerate --features wgpu

# Run a specific test suite
# Example 1: the staged FFT GPU kernels vs. the rustfft CPU reference
cargo test -p runmat-accelerate --features wgpu --test fft_staged

# Example 2: GPU-vs-CPU fused reduction (x.*x then mean(...,'all')) vs. closed form
cargo test -p runmat-accelerate --features wgpu --test fused_square_mean_all_parity
```
To run the above tests, clone the runmat-org/runmat repository and run the corresponding command from the repository root.
How RunMat compares
| Trait | RunMat | MATLAB | Octave | NumPy / SciPy |
|---|---|---|---|---|
| Numerical backend visible? | Yes; every crate, version, and test is public | No; closed-source binary | Yes | Yes |
| Tolerance documented per builtin? | Yes, on this page and per-function pages | No public methodology | Per-test in source | Per-test in source |
| GPU path validated against CPU? | Yes; every GPU builtin has a parity test | Parallel Computing Toolbox; no public parity methodology | No native GPU | Requires CuPy/JAX with their own validation |
| Parity tests runnable by users? | Yes, via cargo test | No | Yes (Octave CI) | Yes |
| Validation against reference implementation? | rustfft, LAPACK, nalgebra, own CPU | Internal only | Own reference | LAPACK / reference BLAS |
| Open-source runtime? | Yes | No | Yes | Yes |
Bit-identical reproducibility between environments
RunMat builtins are validated to tolerance limits, not to bit-identical output across every environment, because of the limits of IEEE 754 floating-point arithmetic and differences in underlying hardware implementations. This is a well-studied problem in numerical computing. See PyTorch's numerical-accuracy note, NumPy's discussion of reproducibility, and Intel's CNR documentation for more details.
CI commitment
Every new numerical builtin ships with a parity test before it merges. The test must name its reference (external crate, LAPACK routine, analytic solution, or CPU path) and document its tolerance. When the test merges, this page gains a row. We will never silently remove a row. If a validation regresses or a dependency changes, the row updates to reflect that.
External crate versions are pinned in the workspace Cargo.toml. When we bump a version, the corresponding parity test re-runs in CI. If it breaks, the bump doesn't ship.
Glossary
- Tolerance. The maximum allowed difference between two numerical results before a test fails. RunMat tests use two forms: `atol` (absolute tolerance, a fixed floating-point value like `1e-9`) and `rtol` (relative tolerance, a bound that scales with the magnitude of the reference, e.g. `5e-4 * max(|reference|, 1)`). The choice of form depends on whether the operation's absolute answer grows with input size. This is the same vocabulary as NumPy's `numpy.testing.assert_allclose`.
- Parity test. An automated test that computes the same quantity two ways (e.g. GPU and CPU, or RunMat and rustfft) and asserts the results agree within a documented tolerance.
- IEEE 754. The floating-point arithmetic standard implemented in every modern CPU and GPU. It defines f32/f64 representation, rounding modes, and exceptional values.
- LAPACK. Linear Algebra PACKage. The standard library of dense matrix routines underneath NumPy, SciPy, MATLAB, Julia, and most of the rest of scientific computing. RunMat wraps it via FFI.
- BLAS. Basic Linear Algebra Subprograms. The low-level matrix/vector kernels LAPACK is built on. Apple Accelerate and OpenBLAS are two widely-shipped implementations.
- WGPU. The Rust implementation of the WebGPU API. RunMat's GPU path targets WGPU so the same kernels run on Apple Metal, NVIDIA CUDA-compatible drivers, AMD, and any other WGPU-capable device.
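To see why both tolerance forms are needed, consider a result whose magnitude is large: a fixed absolute bound rejects an answer that is relatively excellent, while a relative bound of the documented form accepts it. The numbers below are illustrative, not taken from RunMat's tests.

```rust
// atol vs rtol: a relatively tiny error looks large in absolute terms
// when the result itself is large.
fn main() {
    let reference = 1.0e6_f64;
    let computed = reference * (1.0 + 1.0e-7); // 0.1 ppm relative error
    let diff = (computed - reference).abs(); // ~0.1 in absolute terms

    // A tight absolute bound rejects this result...
    assert!(diff > 1e-9);
    // ...while the documented relative form accepts it.
    assert!(diff <= 5e-4 * reference.abs().max(1.0));
}
```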
FAQ
How does RunMat validate numerical accuracy?
Three mechanisms, stacked:
- Inherited correctness from established libraries. FFTs come from rustfft, SVD from nalgebra, complex arithmetic from num-complex. These crates have their own test suites, their own reference implementations, and millions of downloads in production.
- Optional FFI to platform-native BLAS and LAPACK. Apple Accelerate on macOS, OpenBLAS on Linux. These are the same libraries NumPy, SciPy, and MATLAB's own solvers rely on.
- In-repo solvers validated against external references or RunMat's own CPU path. RunMat ships 422 total builtins; the 111 that live under `crates/runmat-runtime/src/builtins/math/` are the numerical-math subset where tolerance-based validation applies. Those ship with 1,635 co-located `#[test]` functions plus 41 integration-level test files covering GPU-vs-CPU parity, the fusion engine, feature-gated FFI, and RNG statistics.
The coverage table lists the reference implementation, parity test file, and enforced tolerance for each builtin. For one category walked through end-to-end (inputs, reference, tolerance choice, CI hook), see the FFT deep dive.
What reference implementations does RunMat validate against?
Established open-source Rust crates (rustfft, nalgebra, num-complex), platform-native BLAS and LAPACK (Apple Accelerate on macOS, OpenBLAS elsewhere), closed-form or analytic solutions for elementary math and reductions, and RunMat's own CPU paths when validating GPU kernels. Parity tests document their references in each test file.
How do you validate GPU results?
Every GPU-accelerated builtin is parity-tested against RunMat's CPU path. Tolerance is calculated based on the numerical precision limits of its specific operation: absolute bounds for most cases, relative scaling for operations where magnitudes grow (SYRK, large-k F32 matmul). See the coverage table for the exact atol and rtol per builtin. If a GPU path drifts past its test's tolerance, CI fails the build before it ships.
What tolerances does RunMat use and why?
Tolerances are chosen per operation, not by a universal rule. For F64, most tests bound absolute error between 1e-9 and 1e-12: tighter for closed-form operations (elementary math, reductions on well-conditioned inputs), looser for factorizations that accumulate rounding across many operations. For F32, most tests bound absolute error between 1e-5 and 1e-6; operations where absolute magnitudes grow with problem size (SYRK, large-k matmul) use relative bounds like 5e-4 * max(|reference|, 1) instead. The coverage table lists exact atol and rtol per category, and every test file cites its own tolerance inline with a comment explaining the choice.
Why doesn't a == b return true when a and b look equal for floating-point values?
Exact equality on floating-point values is an unreliable way to compare results. IEEE 754 arithmetic produces slightly different last bits depending on operation order, SIMD width, and fused multiply-adds, so two paths that compute the "same" mathematical answer routinely disagree in the 52nd bit. A more reliable check is:
```matlab
abs(a - b) < 1e-9                 % scalars
all(abs(a - b) < 1e-9, 'all')     % arrays (or use an isclose-style helper)
```
This applies regardless of the language or library (NumPy, MATLAB, Julia, PyTorch, etc.). If exact equality is what you need, check whether you're comparing integers stored as floating-point values; cast to int64 first and compare those.
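The standard demonstration, written here in Rust (the same behaviour appears in any IEEE 754 language):

```rust
// Two ways of computing "0.3" disagree in the last bits, so exact equality
// fails while a tolerance comparison passes.
fn main() {
    let a = 0.1_f64 + 0.2_f64; // 0.30000000000000004...
    let b = 0.3_f64;           // 0.29999999999999998...
    assert!(a != b);                  // exact equality fails
    assert!((a - b).abs() < 1e-9);    // tolerance comparison succeeds
}
```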
Are all builtins validated?
Yes, every builtin has a validation suite, but the form differs by category. Of RunMat's 422 total builtins, 111 are numerical-math builtins (under crates/runmat-runtime/src/builtins/math/); all carry co-located unit tests. Separately, RunMat has a suite of GPU-accelerated, LAPACK-wrapped, and cross-cutting reduction builtins that additionally have integration-level parity tests. The remaining 311 builtins handle string, array, plotting, I/O, and OOP; these are validated by behavioural tests. The coverage table above lists every numerical builtin category with its current validation status.
Can I run the validation tests myself?
Yes. Every parity test ships in the public repository and runs with standard cargo test. No MATLAB license, no external data files. A WGPU-compatible GPU is required for GPU parity tests; CPU tests run anywhere. Commands are in the Running RunMat's test suite section above.
What happens when a dependency is updated?
Our Cargo.toml pins each numerical crate to a specific version. When we bump a version, the corresponding parity tests re-run in CI. If tolerances break, the bump does not ship.
Does RunMat use the same random number generator as MATLAB?
No. RunMat's rand and randn use a custom linear congruential generator (RunMatLCG). Sequences are deterministic given a seed, but they will not match MATLAB's Mersenne Twister or NumPy's PCG64. If your workflow depends on reproducing a specific random sequence from another tool, this is a known difference.
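For illustration, here is a generic 64-bit LCG sketch (using Knuth's MMIX constants, not RunMatLCG's actual parameters). It shows why a fixed seed yields a reproducible sequence within one generator, while a different generator family seeded identically would produce an entirely different sequence.

```rust
// Generic linear congruential generator: state' = a * state + c (mod 2^64).
// Constants are Knuth's MMIX multiplier/increment, chosen for illustration.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

fn main() {
    let (mut a, mut b) = (Lcg(42), Lcg(42));
    // Same seed, same sequence: deterministic within one generator...
    assert!((0..5).all(|_| a.next() == b.next()));
    // ...but a Mersenne Twister or PCG64 seeded with 42 would diverge
    // from this sequence immediately.
}
```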
How do I report a bug?
Open an issue at github.com/runmat-org/runmat/issues with a minimal MATLAB-syntax reproduction, the expected output (and its source), and the observed output. If you find a numerical precision bug, please mention it in the issue title.
Report an issue
If a RunMat builtin returns a result you believe is incorrect, please open an issue with:
- A minimal MATLAB-syntax snippet that reproduces the problem.
- The expected output and where that expectation comes from (MATLAB version, a textbook formula, a hand calculation, another tool).
- The observed RunMat output, including the RunMat version (
runmat --version) and GPU backend if relevant.
A numerical regression is a higher-priority bug class than a performance regression. We fix them first.
Last reviewed: 2026-04-17 · Source: docs/CORRECTNESS.md