Fastest Way To Multiply A Static Sized Vector With A Static Sized Constant Non-square Matrix In C, Producing A 15-element Output Vector

I have a large float matrix, A, about 100000 in column length and 15 in row size.

Then we have a uint8_t vector X that has row size 100000. Both A and X have static sizes and never ever going to change in size.

X can change its values but A remains constant.

So what is the most absolute fastest way in C to compute A*X, producing a 15-element product vector? Is it a good way to write something like this, instead of using a for-loop?

 A(0,0)*X(0) + A(0,1)*X(1) + A(0,2)*X(2) + ... +  A(0,n)*X(n)
 A(1,0)*X(0) + A(1,1)*X(1) + A(1,2)*X(2) + ... +  A(1,n)*X(n)
 A(m,0)*X(0) + A(m,1)*X(1) + A(m,2)*X(2) + ... +  A(m,n)*X(n)


I'm assuming your matrix is dense, not sparse. (Not mostly 0.0). Also that it's not mostly composed of 0.0 and 1.0 elements; if that's the case turn it into a bitmap you can use for masked sums of the vector.

I assume these are float or double values, and that you want to run this on a typical modern machine, like x86-64 or AArch64, with SIMD. A loop is probably better because you need the compiler to auto-vectorize for max performance, and that's more likely with a loop than fully unrolled code.

You probably want to use each SIMD chunk of X[] with each of 4 or so chunks of A[][] data, so X[] only has to be loaded into registers 4 times total. Each row of A[][] is only read once, so there's no data-reuse possible for A[][].

Cache-blocking could also reduce the number of times X[] data has to be fetched into L1d cache to 1 total. But you probably don't want to write a single loop doing 15 sums in parallel. Fetching 15 streams from A might be a bad idea, and x86-64 (without AVX512) only has 16 SIMD registers, so unless you carefully manually vectorize, a compiler might spill vector accumulators to the stack and introduce a store-forwarding bottleneck.

Do not fully unroll: code-cache misses would hurt much more than loop overhead.

Compilers typically don't roll straight-line code back up into a loop, even when that would be better. So you'd get a huge block of asm with no branches. Instead of reusing the same loop body from the L1 instruction cache (or uop cache or a loop buffer), the CPU would have to fetch code from memory, costing bandwidth that competes with bandwidth for data.

In practice you should call a BLAS library function: it will be heavily optimized with SIMD for the specific CPU in the system its installed on.

Or not, according to the last update, X is uint8_t X[]. I doubt there are BLAS libraries that can convert on the fly from uint8_t to float, but that's probably what you want to save memory bandwidth, instead of separately converting to a tmp float vector. Although doing that + calling a good BLAS function would still be better than badly-vectorized code if your compiler doesn't do a good job with your pure C loop.

Unrolling to use each SIMD-vector of X data more times will be extra good, to amortize the conversion overhead over more times. Like probably unroll by 8.