
Let Compiler Vectorize Our Code!

TL;DR: Modern compilers have very strong auto-vectorization capabilities. So just write plain loops and let the compiler optimize them (with an appropriate optimization level)!

Why We Need to Vectorize Our Code

In many cases, we need to perform the same operations on a huge amount of data. For example, if I need to calculate a single value of the probability density function of the Cauchy distribution, I will write:

The PDF of the Cauchy distribution
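The original shows the formula as an image; transcribed here in LaTeX for reference:

```latex
f(x;\, x_0, \gamma) = \frac{1}{\pi \gamma \left[ 1 + \left( \dfrac{x - x_0}{\gamma} \right)^2 \right]}
```

where x_0 is the location parameter and γ is the scale parameter.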

That works fine for a single x:
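The single-value function is shown as an image in the original; a minimal sketch of what dcauchy_single might look like (the name matches the call inside dcauchy_inplace; the exact body is my assumption):

```cpp
// PDF of the Cauchy distribution for a single x:
// f(x; x0, gamma) = 1 / (pi * gamma * (1 + ((x - x0) / gamma)^2))
float dcauchy_single(float x, float location, float scale) {
    float z = (x - location) / scale;
    return 1.0f / (3.14159265358979f * scale * (1.0f + z * z));
}
```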

But what if I want to plot the PDF on an interval? A naive method is to write a loop:
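The loop appears only as an image here; it is the dcauchy_inplace that the article later repeats in full. For completeness (the body of dcauchy_single is my sketch):

```cpp
#include <stddef.h>

// Cauchy PDF for one x (body is my sketch)
float dcauchy_single(float x, float location, float scale) {
    float z = (x - location) / scale;
    return 1.0f / (3.14159265358979f * scale * (1.0f + z * z));
}

// Apply the PDF to every element of the array, in place
void dcauchy_inplace(
    float* vs,
    size_t n,
    float location,
    float scale
) {
    for (size_t i = 0; i < n; ++i) {
        vs[i] = dcauchy_single(vs[i], location, scale);
    }
}
```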

Then in the main program: (it’s just some glue)

int main() {
    constexpr size_t n = 1000;
    constexpr float lower = -10;
    constexpr float by = 0.02;
    constexpr float location = 0;
    constexpr float scale = 1;
    float vs[n];
    // generate one thousand "x"
    frange(vs, lower, by, n);
    // calculate the function values at multiple x
    dcauchy_inplace(vs, n, location, scale);
    // output
    for (size_t i = 0; i < n; ++i) {
        printf("%f ", vs[i]);
    }
}
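frange itself is not shown in the article; a minimal sketch inferred from the call site frange(vs, lower, by, n) — fill vs with n values starting at lower, stepping by by:

```cpp
#include <stddef.h>

// Fill vs[0..n) with lower, lower + by, lower + 2*by, ...
void frange(float* vs, float lower, float by, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        vs[i] = lower + by * (float)i;
    }
}
```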

This is a SISD (single instruction stream, single data stream) strategy: the loop iterates over the array, taking one element and calculating its value at a time. The calculation itself is always the same (the same formula); only the data changes in each iteration.

There is certainly a better strategy: SIMD (single instruction, multiple data). In each iteration, we do not take and calculate a single element, but multiple elements, simultaneously!

What Is Vectorization

“Vectorization” means performing operations on a vector (a one-dimensional array) instead of a single element. Suppose there is a data type vec4f that consists of four float and supports arithmetic operations; we can then vectorize our code:

void dcauchy_inplace_vec4f(
    vec4f* vs,
    size_t n,
    float location,
    float scale
) {
    for (size_t i = 0; i < n; ++i) {
        vs[i] = dcauchy_vec4f(vs[i], location, scale);
    }
}

Here, in each iteration, we take a vec4f (four float) and calculate the value of the PDF. Four x in, four values out, in a single iteration. Theoretically, we will achieve 4x performance!
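vec4f is hypothetical, but it is easy to sketch one to make the idea concrete (the operator overloads and the body of dcauchy_vec4f are my assumptions; a real implementation would use intrinsics or rely on the compiler to turn the component-wise loops into SIMD):

```cpp
// A toy 4-wide float vector
struct vec4f {
    float v[4];
};

static vec4f operator-(vec4f a, float b) {
    for (int i = 0; i < 4; ++i) a.v[i] -= b;
    return a;
}

static vec4f operator/(vec4f a, float b) {
    for (int i = 0; i < 4; ++i) a.v[i] /= b;
    return a;
}

// Cauchy PDF, four x at a time (body is my sketch)
vec4f dcauchy_vec4f(vec4f x, float location, float scale) {
    vec4f z = (x - location) / scale;
    vec4f r;
    for (int i = 0; i < 4; ++i) {
        r.v[i] = 1.0f / (3.14159265358979f * scale * (1.0f + z.v[i] * z.v[i]));
    }
    return r;
}
```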

We may wonder why in the real world there is no vec4f. In fact, there is! The floating-point unit of an x86 CPU actually uses wider registers: 128 bit xmm registers, the equivalent of four float or two double. Even when we operate on a single float, it is put into this wide register, with the rest of the space unused. That is also one reason why C/C++ uses 64 bit double by default rather than 32 bit float: the registers are wide enough.

Vector units have been added to x86 CPUs since the last century, along with new instruction sets. MMX features 64 bit mm registers for SIMD with integers. SSE adds 128 bit xmm registers for SIMD with both integers and floating-point numbers and, meanwhile, replaces the old x87 instructions. Then comes AVX, which adds 256 bit ymm registers. AVX2, introduced in Intel's Haswell, extends most integer instructions to 256 bits. AVX-512 adds 512 bit zmm registers.

We can use these wider registers directly with assembly, or with __m128 and __m256 and intrinsic functions. But this is the modern era: all we need is a modern compiler!
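For a taste of what the intrinsic route looks like, here is a small SSE example (the function name is mine; the AVX __m256 versions with _mm256_* intrinsics work the same way on eight float):

```cpp
#include <immintrin.h>

// Add four float at once using a 128 bit xmm register.
float sum4_demo() {
    // _mm_set_ps takes elements from highest to lowest, so this is {1, 2, 3, 4}
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set1_ps(10.0f);   // {10, 10, 10, 10}
    __m128 c = _mm_add_ps(a, b);     // {11, 12, 13, 14} in one instruction
    float out[4];
    _mm_storeu_ps(out, c);
    return out[0] + out[1] + out[2] + out[3];
}
```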

Let the Compiler Help!

Modern compilers feature the capability of auto-vectorization. We just need to tweak some options! In this section, I use gcc 10.2 as the compiler. Compiler Explorer offers online assembly inspection.

We will use exactly the same code:

void dcauchy_inplace(
    float* vs,
    size_t n,
    float location,
    float scale
) {
    for (size_t i = 0; i < n; ++i) {
        vs[i] = dcauchy_single(vs[i], location, scale);
    }
}

First, tell gcc to enable auto-vectorization with -ftree-vectorize.

-ftree-vectorize is not included in -O2, but it is turned on by -O3 and -Ofast. I recommend not using -ftree-vectorize explicitly but using -Ofast, because -Ofast enables other useful optimizations like -ffast-math, which lets gcc use a faster floating-point model that does not conform to IEEE 754. -Ofast may be too aggressive for common code, so it's better to put the arithmetic part of the code in a separate compilation unit with its own optimization options.

Because not all CPUs have the same instruction sets, and gcc falls back to a generic x86_64 target by default, we need to tell gcc which CPU we are targeting with -march:

Here I use -march=haswell to ask gcc to target Intel's Haswell and later CPUs. Haswell came out in 2013, and we can hardly find an older CPU still in use, so I think it's safe to target it. Moreover, Haswell introduced AVX2. In this way, gcc will generate a binary that uses the AVX2 instruction set (with 256 bit ymm registers).

(You can use -mprefer-vector-width=512 -march=icelake-client to generate AVX-512 instructions.)

Finally we can add -fopt-info-vec to see what is optimized.

We're done! Compiling with these options, gcc will automatically vectorize our unmodified code using AVX2. Here is the assembly generated by gcc:

gcc inlines dcauchy_single into dcauchy_inplace, uses AVX2 instructions to vectorize the code, and unrolls 6 iterations. The core part of the assembly is:

This is one iteration. We can see a lot of instructions starting with “v” — these are VEX-encoded instructions used by AVX. The instructions operate on ymm registers, which are 256 bits wide. In each iteration, %rax (the current position) is increased by 32 bytes: 8 float are taken from memory and calculated simultaneously.

But the length of the array is not always a multiple of 8. So gcc processes the leftover elements (if any) after the loop using a 4+1+1+1 scheme: first using the 128 bit xmm registers to process 4 float, then one element at a time.
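In scalar terms, the emitted code behaves roughly like this sketch (the 8-wide and 4-wide steps are, of course, single SIMD iterations in the real assembly; the function only counts iterations to show the shape):

```cpp
#include <stddef.h>

// Rough shape of gcc's 8-wide main loop with a 4+1+1+1 epilogue.
size_t vectorized_shape(size_t n) {
    size_t iterations = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) ++iterations;  // ymm: 8 float per iteration
    if (i + 4 <= n) { i += 4; ++iterations; } // xmm: one 4-float step
    for (; i < n; ++i) ++iterations;          // scalar tail: 1 float each
    return iterations;
}
```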

We can understand the control flow better with a graph:

AVX-SSE Transition Penalties

We may notice vzeroupper in the assembly. This instruction avoids AVX-SSE transition penalties. AVX instructions are VEX-encoded, but the old SSE instructions are not. AVX instructions zero the unused upper bits of the registers, while SSE instructions have no knowledge of the upper bits. So if AVX and SSE instructions are mixed, the CPU has to save and restore the upper bits of the registers during the transitions, which causes performance penalties. vzeroupper zeroes the upper bits to avoid these penalties, so gcc generates a vzeroupper after AVX instructions.

Memory Alignment

If we benchmark our AVX2-optimized code, we may be surprised to find that it is no faster than SSE2-optimized code. Why? The ymm registers used by AVX2 are twice as wide as the xmm registers in SSE!

This is because the memory that the AVX2 instructions operate on is unaligned, which causes a performance penalty.

The ymm registers are 256 bits wide. To achieve maximum performance, the memory address should be aligned to 32 bytes. By default, memory allocated with malloc is only guaranteed to be aligned to 16 bytes on glibc. So we need to explicitly allocate 32-byte-aligned memory.

In C11, we can use aligned_alloc. On Windows, we can use _aligned_malloc.
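For example (the helper name is mine; note that aligned_alloc requires the size to be a multiple of the alignment):

```cpp
#include <stdlib.h>

// Allocate n float aligned to 32 bytes, suitable for 256 bit loads.
float* alloc_floats_32(size_t n) {
    // Round the byte count up to a multiple of the 32-byte alignment,
    // as aligned_alloc requires.
    size_t bytes = ((n * sizeof(float) + 31) / 32) * 32;
    return (float*)aligned_alloc(32, bytes);
}
```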

Do we need to add __attribute__ ((__aligned__(32))) to the parameter type for an aligned float array?

typedef float afloat __attribute__ ((__aligned__(32)));

void dcauchy_inplace(
    afloat* vs,
    size_t n,
    float location,
    float scale
)

In fact, it is not necessary — the assembly generated is the same.

__restrict__

When a function accepts multiple pointers as arguments, the compiler cannot tell whether these pointers point to the same memory, so some optimizations are suppressed. Adding __restrict__ tells the compiler that they are independent of each other.
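For example (a sketch; the function and parameter names are mine):

```cpp
#include <stddef.h>

// Without __restrict__, the compiler must assume out may alias a or b,
// which can block vectorization of this loop.
void add(
    const float* __restrict__ a,
    const float* __restrict__ b,
    float* __restrict__ out,
    size_t n
) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```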

STL Algorithm

gcc can vectorize STL algorithms, regardless of whether the execution policy std::execution::unseq is specified or not.

The above code using STL algorithms will still be vectorized.
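The STL version is shown as an image in the original; presumably it is a std::transform over the array, something like this sketch (dcauchy_single's body is my assumption, as before):

```cpp
#include <algorithm>
#include <cstddef>

// Cauchy PDF for one x (body is my sketch)
float dcauchy_single(float x, float location, float scale) {
    float z = (x - location) / scale;
    return 1.0f / (3.14159265358979f * scale * (1.0f + z * z));
}

// The same in-place loop, expressed as an STL algorithm
void dcauchy_inplace(float* vs, size_t n, float location, float scale) {
    std::transform(vs, vs + n, vs, [=](float x) {
        return dcauchy_single(x, location, scale);
    });
}
```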

Math Functions

We often use math functions in <math.h>. Suppose now we need to plot the normal distribution:

void dnorm_inplace(
    float* vs,
    size_t n,
    float location,
    float scale
) {
    for (size_t i = 0; i < n; ++i) {
        vs[i] = dnorm_single(vs[i], location, scale);
    }
}
The PDF of the normal distribution
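dnorm_single is again shown as an image; a sketch using expf (the exact body is my assumption):

```cpp
#include <math.h>

// PDF of the normal distribution for a single x:
// f(x; mu, sigma) = exp(-((x - mu) / sigma)^2 / 2) / (sigma * sqrt(2 * pi))
float dnorm_single(float x, float location, float scale) {
    float z = (x - location) / scale;
    return expf(-0.5f * z * z) / (scale * 2.50662827463f); // sqrt(2 * pi)
}
```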

In the code, we call expf, which is a math function in <math.h>. expf accepts only a single argument at a time. How can it be vectorized?

gcc generates:

The core part is:

The assembly calls _ZGVdN8v_expf. It is an AVX2-optimized expf from glibc's libmvec, taking ymm0 as its argument and processing 8 float simultaneously. With the help of glibc, gcc can vectorize code that uses math functions.

clang 11.0 doesn't use _ZGVdN8v_expf and generates a much slower binary. Because there is no such function in msvcrt, gcc on MinGW cannot vectorize this code.

