# Fastest Way To Multiply A Static Sized Vector With A Static Sized Constant Non-square Matrix In C, Producing A 15-element Output Vector

## 31 July 2020 - 1 answer

I have a large `float` matrix, `A`, with 15 rows and about 100000 columns.

Then we have a `uint8_t` vector `X` with 100000 elements. Both `A` and `X` have static sizes and are never going to change in size.

`X` can change its values but `A` remains constant.

So what is the absolute fastest way in C to compute `A*X`, producing a 15-element product vector? Is it better to write something like this, instead of using a for-loop?

```
A(0,0)*X(0) + A(0,1)*X(1) + A(0,2)*X(2) + ... + A(0,n)*X(n)
A(1,0)*X(0) + A(1,1)*X(1) + A(1,2)*X(2) + ... + A(1,n)*X(n)
...
A(m,0)*X(0) + A(m,1)*X(1) + A(m,2)*X(2) + ... + A(m,n)*X(n)
```
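In C, the sums above are just the plain dense matrix-vector product. A minimal baseline sketch (the function name `matvec` and the generic `m`/`n` parameters are illustrative; in the real program `m` and `n` are the compile-time constants 15 and 100000, which lets the compiler specialize and auto-vectorize the inner loop at `-O3`):

```c
#include <stddef.h>
#include <stdint.h>

/* y = A * x, where A is m x n in row-major order.
   Each output element y[i] is the dot product of row i of A with x. */
static void matvec(size_t m, size_t n, const float *A,
                   const uint8_t *x, float *y)
{
    for (size_t i = 0; i < m; i++) {
        float sum = 0.0f;
        for (size_t j = 0; j < n; j++)
            sum += A[i * n + j] * (float)x[j];  /* convert on the fly */
        y[i] = sum;
    }
}
```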

I'm assuming your matrix is dense, not sparse (not mostly `0.0`). Also that it's not mostly composed of `0.0` and `1.0` elements; if it is, turn it into a bitmap you can use for masked sums of the vector.

I assume these are `float` or `double` values, and that you want to run this on a typical modern machine, like x86-64 or AArch64, with SIMD. A loop is probably better because you need the compiler to auto-vectorize for max performance, and that's more likely with a loop than fully unrolled code.

You probably want to use each SIMD chunk of `X[]` with each of 4 or so chunks of `A[][]` data, so `X[]` only has to be loaded into registers 4 times total. Each row of `A[][]` is only read once, so there's no data-reuse possible for `A[][]`.

Cache-blocking could also reduce the number of times `X[]` data has to be fetched into L1d cache to 1 total. But you probably don't want to write a single loop doing 15 sums in parallel. Fetching 15 streams from `A` might be a bad idea, and x86-64 (without AVX512) only has 16 SIMD registers, so unless you carefully manually vectorize, a compiler might spill vector accumulators to the stack and introduce a store-forwarding bottleneck.
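A scalar sketch of that register-blocking idea: process 4 rows of `A` per pass, so each element of `X` is loaded and converted once and feeds 4 accumulators, instead of keeping 15 accumulators live (risking spills) or streaming `X` 15 times. The function name and generic `m`/`n` parameters are illustrative, assuming row-major `A`; a compiler should turn each scalar accumulator into a SIMD vector of partial sums.

```c
#include <stddef.h>
#include <stdint.h>

/* y = A * x with rows processed 4 at a time. */
static void matvec_rowblock(size_t m, size_t n, const float *A,
                            const uint8_t *x, float *y)
{
    size_t i = 0;
    for (; i + 4 <= m; i += 4) {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        const float *r0 = A + (i + 0) * n;
        const float *r1 = A + (i + 1) * n;
        const float *r2 = A + (i + 2) * n;
        const float *r3 = A + (i + 3) * n;
        for (size_t j = 0; j < n; j++) {
            float xj = (float)x[j];   /* one load + convert, four uses */
            s0 += r0[j] * xj;
            s1 += r1[j] * xj;
            s2 += r2[j] * xj;
            s3 += r3[j] * xj;
        }
        y[i] = s0; y[i + 1] = s1; y[i + 2] = s2; y[i + 3] = s3;
    }
    for (; i < m; i++) {              /* leftover rows: 15 = 3*4 + 3 */
        float sum = 0.0f;
        for (size_t j = 0; j < n; j++)
            sum += A[i * n + j] * (float)x[j];
        y[i] = sum;
    }
}
```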

## Do not fully unroll: code-cache misses would hurt much more than loop overhead

Compilers typically don't re-roll straight-line code back into a loop, even when that would be better, so fully unrolled source would produce a huge block of asm with no branches. Instead of reusing the same loop body from the L1 instruction cache (or uop cache or a loop buffer), the CPU would have to fetch code from memory, costing bandwidth that competes with bandwidth for data.

In practice you should call a BLAS library function: it will be heavily optimized with SIMD for the specific CPU in the system it's installed on.

Or not: according to the last update, `X` is `uint8_t X[]`. I doubt there are BLAS libraries that can convert on the fly from `uint8_t` to `float`, but that's probably what you want, to save memory bandwidth, instead of separately converting to a temporary `float` vector. Although doing that conversion and then calling a good BLAS function would still be better than badly-vectorized code, if your compiler doesn't do a good job with your pure C loop.

Unrolling so each SIMD vector of `X` data is reused more times is extra valuable here, because it amortizes the `uint8_t`-to-`float` conversion overhead over more rows. So probably unroll by 8.
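A middle-ground sketch of amortizing the conversion: convert one small chunk of `X` to `float` while it's hot, then reuse that converted chunk for all rows before moving on. The converted chunk stays in L1d (unlike a full 100000-element temporary `float` copy of `X`), and each conversion is reused `m` times. `matvec_chunked`, `CHUNK`, and the generic `m`/`n` parameters are illustrative names, assuming row-major `A`:

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK 256   /* chunk of x converted at a time; fits easily in L1d */

static void matvec_chunked(size_t m, size_t n, const float *A,
                           const uint8_t *x, float *y)
{
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0f;

    float xf[CHUNK];
    for (size_t j0 = 0; j0 < n; j0 += CHUNK) {
        size_t len = (n - j0 < CHUNK) ? n - j0 : CHUNK;
        for (size_t j = 0; j < len; j++)
            xf[j] = (float)x[j0 + j];          /* convert once per chunk */
        for (size_t i = 0; i < m; i++) {       /* reuse the chunk m times */
            const float *row = A + i * n + j0;
            float sum = y[i];
            for (size_t j = 0; j < len; j++)
                sum += row[j] * xf[j];
            y[i] = sum;
        }
    }
}
```

The inner `i` loop could itself be row-blocked as sketched earlier; the two transformations are complementary.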