Let Compiler Vectorize Our Code!

Misaki Kasumi
11 min readNov 1, 2020

TL;DR: Modern compilers have very strong auto-vectorization capabilities. So just write plain loops and let the compiler optimize them (with an appropriate optimization level)!

Why We Need to Vectorize Our Code

In many cases, we need to perform the same operations on a huge amount of data. For example, if I need to calculate a single value of the probability density function of the Cauchy distribution, I will write:

// M_1_PI (1/π) comes from <math.h>
float dcauchy_single(float x, float location, float scale) {
    const float fct = M_1_PI / scale;
    const float frc = (x - location) / scale;
    const float y = fct / (1.0f + frc * frc);
    return y;
}
The PDF of the Cauchy distribution
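For reference, this is the formula the function implements (in LaTeX notation), with x_0 = location and γ = scale:

f(x; x_0, \gamma) = \frac{1}{\pi\gamma\left[1 + \left(\frac{x - x_0}{\gamma}\right)^2\right]}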

That works fine for a single x:

dcauchy_single(0, 0, 1);
// x = 0, x0 = 0, γ = 1; the result should be about 0.3183 (1/π)

But what if I want to plot the PDF on an interval? A naive method is to write a loop:

void dcauchy_inplace(
    float* vs,
    size_t n,
    float location,
    float scale
) {
    for (size_t i = 0; i < n; ++i) {
        vs[i] = dcauchy_single(vs[i], location, scale);
    }
}

Then, in the main program (it’s just some glue):

#include <cstdio>

// generate a sequence
void frange(float* dest, float lower, float by, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dest[i] = lower + i * by;
    }
}

int main() {
    constexpr size_t n = 1000;
    constexpr float lower = -10;
    constexpr float by = 0.02;
    constexpr float location = 0;
    constexpr float scale = 1;
    float vs[n];
    // generate one thousand "x"
    frange(vs, lower, by, n);
    // calculate the function values at multiple x
    dcauchy_inplace(vs, n, location, scale);
    // output
    for (size_t i = 0; i < n; ++i) {
        printf("%f ", vs[i]);
    }
}

This is a SISD (single instruction stream, single data stream) strategy: the loop iterates over the array, taking one element and calculating its value at a time. The calculation itself never changes (it is the same formula); only the data changes in each iteration.

There is certainly a better strategy: SIMD (single instruction, multiple data). In each iteration, we do not take and calculate a single element, but multiple elements, simultaneously!

What Is Vectorization

“Vectorization” means performing operations on a vector (a one-dimensional array) instead of on a single element. Suppose there is a data type vec4f that consists of four floats and supports arithmetic operations; we can then vectorize our code:

vec4f dcauchy_vec4f(vec4f x, float location, float scale) {
    const float fct = M_1_PI / scale;
    const vec4f frc = (x - location) / scale;
    const vec4f y = fct / (1.0f + frc * frc);
    return y;
}

void dcauchy_inplace_vec4f(
    vec4f* vs,
    size_t n,
    float location,
    float scale
) {
    for (size_t i = 0; i < n; ++i) {
        vs[i] = dcauchy_vec4f(vs[i], location, scale);
    }
}

Here, in each iteration, we take a vec4f (four floats) and calculate the values of the PDF. Four x in, four values out, in a single iteration. Theoretically, we will achieve 4x performance!

We may wonder why there is no vec4f in the real world. In fact, there is! The floating-point unit of an x86 CPU actually uses wide registers — 128 bits (xmm), the equivalent of four floats or two doubles. Even when we operate on a single float, it is put into this wide register, with the rest of the space unused. That is also one reason why C/C++ defaults to 64-bit double rather than 32-bit float — the registers are wide enough.

Vector units have been added to x86 CPUs since the last century, along with new instruction sets. MMX features 64-bit mm registers for SIMD with integers. The SSE family adds 128-bit xmm registers for SIMD with both floating-point numbers and integers and, at the same time, replaces the old x87 instructions. Then comes AVX, which adds 256-bit ymm registers; AVX2, introduced with Intel’s Haswell, extends most integer operations to 256 bits. AVX-512 adds 512-bit zmm registers.

We can use these wider registers directly, with assembly or with __m128/__m256 and intrinsic functions. But this is the modern era — all we really need is a modern compiler!
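To appreciate what the compiler saves us, here is a minimal sketch of the same Cauchy kernel written by hand with SSE intrinsics. The function name is mine, and it assumes n is a multiple of 4; a real version would also need a scalar tail.

#include <immintrin.h>
#include <math.h>
#include <stddef.h>

// Hand-vectorized Cauchy PDF using 128-bit SSE intrinsics (a sketch).
void dcauchy_inplace_sse(float* vs, size_t n, float location, float scale) {
    const __m128 loc = _mm_set1_ps(location);
    const __m128 scl = _mm_set1_ps(scale);
    const __m128 fct = _mm_set1_ps((float)(M_1_PI / scale));
    const __m128 one = _mm_set1_ps(1.0f);
    for (size_t i = 0; i < n; i += 4) {                  // 4 floats per step
        __m128 x   = _mm_loadu_ps(vs + i);
        __m128 frc = _mm_div_ps(_mm_sub_ps(x, loc), scl);
        __m128 y   = _mm_div_ps(fct, _mm_add_ps(one, _mm_mul_ps(frc, frc)));
        _mm_storeu_ps(vs + i, y);
    }
}

It works, but it is verbose, tied to one instruction set, and easy to get wrong — which is exactly why we want the compiler to do it for us.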

Let the Compiler Help!

Modern compilers are capable of auto-vectorization; we just need to tweak some options! In this section, I use gcc 10.2 as the compiler, and Compiler Explorer for online assembly inspection.

We will use exactly the same code:

float dcauchy_single(float x, float location, float scale) {
    const float fct = M_1_PI / scale;
    const float frc = (x - location) / scale;
    const float y = fct / (1.0f + frc * frc);
    return y;
}

void dcauchy_inplace(
    float* vs,
    size_t n,
    float location,
    float scale
) {
    for (size_t i = 0; i < n; ++i) {
        vs[i] = dcauchy_single(vs[i], location, scale);
    }
}

First, tell gcc to enable auto-vectorization:

-ftree-vectorize

-ftree-vectorize is not included in -O2, but it is turned on by -O3 and -Ofast. I recommend not using -ftree-vectorize explicitly, but using -Ofast instead, because -Ofast enables other useful optimizations such as -ffast-math, which lets gcc use a faster floating-point model that does not conform to IEEE 754. -Ofast may be too aggressive for ordinary code, so it is better to put the arithmetic part of the code in a separate compilation unit with its own optimization options, as in the example below.
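For example (the file names here are only illustrative), the number-crunching translation unit can get the aggressive flags while the rest of the program keeps a conservative -O2:

g++ -O2    -c main.cpp
g++ -Ofast -c math_kernels.cpp
g++ main.o math_kernels.o -o plot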

Because not all CPUs support the same instruction sets, and gcc falls back to a generic x86-64 target by default, we need to tell gcc which CPU we are targeting:

-march=haswell

Here I ask gcc to target Intel’s Haswell and later CPUs. Haswell came out in 2013, so it is a reasonably safe baseline today. And Haswell supports AVX2 with its 256-bit ymm registers, so gcc will generate a binary that uses the AVX2 instruction set.

(You can use -mprefer-vector-width=512 -march=icelake-client to generate AVX-512 instructions.)

Finally we can add

-fopt-info

to see what is optimized.

We’re done! With these options, gcc automatically vectorizes our unmodified code using AVX2. Here is the assembly it generates:

dcauchy_inplace(float*, unsigned long, float, float):
vmovaps %xmm0, %xmm3
testq %rsi, %rsi
je .L12
vmovsd .LC0(%rip), %xmm5
vcvtss2sd %xmm1, %xmm1, %xmm0
vmovss .LC1(%rip), %xmm4
leaq -1(%rsi), %rcx
vdivsd %xmm0, %xmm5, %xmm5
vdivss %xmm1, %xmm4, %xmm2
vcvtsd2ss %xmm5, %xmm5, %xmm5
cmpq $6, %rcx
jbe .L10
movq %rsi, %rdx
vbroadcastss %xmm3, %ymm9
vbroadcastss %xmm2, %ymm8
movq %rdi, %rax
shrq $3, %rdx
vmovaps .LC2(%rip), %ymm6
vbroadcastss %xmm5, %ymm7
salq $5, %rdx
addq %rdi, %rdx
.L5:
vmovups (%rax), %ymm1
addq $32, %rax
vsubps %ymm9, %ymm1, %ymm0
vmulps %ymm8, %ymm0, %ymm0
vfmadd132ps %ymm0, %ymm6, %ymm0
vrcpps %ymm0, %ymm1
vmulps %ymm0, %ymm1, %ymm0
vmulps %ymm0, %ymm1, %ymm0
vaddps %ymm1, %ymm1, %ymm1
vsubps %ymm0, %ymm1, %ymm0
vmulps %ymm0, %ymm7, %ymm0
vmovups %ymm0, -32(%rax)
cmpq %rax, %rdx
jne .L5
movq %rsi, %rax
andq $-8, %rax
testb $7, %sil
je .L14
vzeroupper
.L3:
movq %rsi, %r8
subq %rax, %rcx
subq %rax, %r8
cmpq $2, %rcx
jbe .L8
leaq (%rdi,%rax,4), %rdx
vshufps $0, %xmm3, %xmm3, %xmm0
vshufps $0, %xmm2, %xmm2, %xmm1
vshufps $0, %xmm5, %xmm5, %xmm6
vmovups (%rdx), %xmm7
vsubps %xmm0, %xmm7, %xmm0
vmulps %xmm1, %xmm0, %xmm0
vfmadd213ps .LC3(%rip), %xmm0, %xmm0
vrcpps %xmm0, %xmm1
vmulps %xmm0, %xmm1, %xmm0
vmulps %xmm0, %xmm1, %xmm0
vaddps %xmm1, %xmm1, %xmm1
vsubps %xmm0, %xmm1, %xmm0
vmulps %xmm0, %xmm6, %xmm0
vmovups %xmm0, (%rdx)
movq %r8, %rdx
andq $-4, %rdx
addq %rdx, %rax
cmpq %rdx, %r8
je .L12
.L8:
leaq (%rdi,%rax,4), %rdx
vmovss (%rdx), %xmm0
vsubss %xmm3, %xmm0, %xmm0
vmulss %xmm2, %xmm0, %xmm0
vfmadd132ss %xmm0, %xmm4, %xmm0
vdivss %xmm0, %xmm5, %xmm0
vmovss %xmm0, (%rdx)
leaq 1(%rax), %rdx
cmpq %rdx, %rsi
jbe .L12
leaq (%rdi,%rdx,4), %rdx
addq $2, %rax
vmovss (%rdx), %xmm0
vsubss %xmm3, %xmm0, %xmm0
vmulss %xmm2, %xmm0, %xmm0
vfmadd132ss %xmm0, %xmm4, %xmm0
vdivss %xmm0, %xmm5, %xmm0
vmovss %xmm0, (%rdx)
cmpq %rax, %rsi
jbe .L12
leaq (%rdi,%rax,4), %rax
vmovss (%rax), %xmm0
vsubss %xmm3, %xmm0, %xmm3
vmulss %xmm2, %xmm3, %xmm2
vfmadd132ss %xmm2, %xmm4, %xmm2
vdivss %xmm2, %xmm5, %xmm5
vmovss %xmm5, (%rax)
.L12:
ret
.L14:
vzeroupper
ret
.L10:
xorl %eax, %eax
jmp .L3

gcc inlines dcauchy_single into dcauchy_inplace and uses AVX2 instructions to vectorize the loop, processing eight floats per iteration. The core part of the assembly is:

.L5:
vmovups (%rax), %ymm1
addq $32, %rax
vsubps %ymm9, %ymm1, %ymm0
vmulps %ymm8, %ymm0, %ymm0
vfmadd132ps %ymm0, %ymm6, %ymm0
vrcpps %ymm0, %ymm1
vmulps %ymm0, %ymm1, %ymm0
vmulps %ymm0, %ymm1, %ymm0
vaddps %ymm1, %ymm1, %ymm1
vsubps %ymm0, %ymm1, %ymm0
vmulps %ymm0, %ymm7, %ymm0
vmovups %ymm0, -32(%rax)
cmpq %rax, %rdx
jne .L5

This is one iteration. We can see a lot of instructions starting with “v” — these are VEX-encoded instructions used by AVX. They operate on the ymm registers, which are 256 bits wide. In each iteration, %rax (the current position) is advanced by 32 bytes: eight floats are loaded from memory and processed simultaneously.

But the length of the array is not always a multiple of 8, so gcc processes the remaining elements (if any) after the loop using a 4+1+1+1 scheme: it first uses the 128-bit xmm registers to process four floats, then handles the rest one by one.

We can understand the control flow better with the rough sketch below.
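In C++ terms, the generated control flow looks roughly like this — a hand-written sketch of the branch structure read off the assembly above, not real compiler output:

// Rough shape of the generated control flow (bodies elided)
void dcauchy_inplace_shape(float* vs, size_t n, float location, float scale) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        // .L5: main AVX2 loop — 8 floats (one ymm register) per iteration
    }
    if (n - i >= 4) {
        // one 128-bit (xmm) iteration for a remainder of 4..7 elements
        i += 4;
    }
    for (; i < n; ++i) {
        // .L8 and the following blocks: at most 3 leftover elements, one by one
    }
    (void)vs; (void)location; (void)scale;  // unused here because bodies are elided
}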

AVX-SSE Transition Penalties

We may notice vzeroupper in the assembly. This instruction avoids AVX-SSE transition penalties. AVX instructions are VEX-encoded, but the old SSE instructions are not. AVX instructions zero the unused upper bits of the registers, while SSE has no knowledge of those upper bits. So if AVX and SSE instructions are mixed, the CPU has to save and restore the upper bits of the registers during the transitions, which causes performance penalties. vzeroupper zeroes the upper bits to avoid this, which is why gcc emits a vzeroupper after blocks of AVX instructions.
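This matters mainly when writing AVX intrinsics by hand; as a side note of mine (not something from the compiler output above), the same instruction can be emitted explicitly before calling into code that may still use legacy SSE:

#include <immintrin.h>

void leave_avx_region() {
    // Emits vzeroupper: clears the upper 128 bits of all ymm registers
    // so that following legacy SSE code pays no transition penalty.
    _mm256_zeroupper();
}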

Memory Alignment

If we benchmark our AVX2-optimized code, we may be surprised to find that it is not faster than the SSE2-optimized code. Why? The ymm registers used by AVX2 are twice as wide as the xmm registers used by SSE!

This is because the memory that the AVX2 instructions operate on is not suitably aligned, which causes a performance penalty.

The ymm registers used by AVX2 are 256 bits wide. To achieve maximum performance, the memory address should be aligned to 32 bytes. By default, memory allocated by glibc’s malloc is only guaranteed to be aligned to 16 bytes, so we need to explicitly allocate 32-byte-aligned memory.

In C11, we can use aligned_alloc. On Windows, we can use _aligned_malloc.

#include <cassert>
#include <cstdlib>

float* aligned_float_alloc(size_t n) {
#ifdef _ISOC11_SOURCE
    // aligned_alloc wants the size to be a multiple of the alignment
    void* p = aligned_alloc(32, (n * sizeof(float) + 31) / 32 * 32);
#else
    void* p = _aligned_malloc(n * sizeof(float), 32);  // from <malloc.h> on Windows
#endif
    assert(p);
    return reinterpret_cast<float*>(p);
}
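A minimal usage sketch based on the earlier main (my adaptation, not from the original article); note that aligned_alloc pairs with free, while _aligned_malloc pairs with _aligned_free:

int main() {
    constexpr size_t n = 1000;
    float* vs = aligned_float_alloc(n);   // 32-byte aligned, instead of float vs[n]
    frange(vs, -10.0f, 0.02f, n);
    dcauchy_inplace(vs, n, 0.0f, 1.0f);
    // ... print or plot vs ...
#ifdef _ISOC11_SOURCE
    free(vs);
#else
    _aligned_free(vs);
#endif
}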

Do we need to add __attribute__ ((__aligned__(32))) to the argument type for an aligned float array?

typedef float afloat __attribute__ ((__aligned__(32)));

void dcauchy_inplace(
    afloat* vs,
    size_t n,
    float location,
    float scale
)

In fact, it is not necessary — the assembly generated is the same.

__restrict__

When a function accepts multiple pointers as arguments, the compiler cannot tell whether those pointers refer to the same memory, so some optimizations are suppressed. Adding __restrict__ tells the compiler that they do not alias each other.

void zip(
    float* __restrict__ dest,
    float* __restrict__ a,
    float* __restrict__ b,
    size_t n
) {
    for (size_t i = 0; i < n; ++i) {
        dest[i * 2] = a[i];
    }
    for (size_t i = 0; i < n; ++i) {
        dest[i * 2 + 1] = b[i];
    }
}

STL Algorithm

gcc can also vectorize STL algorithms, regardless of whether the execution policy std::execution::unseq is specified.

#include <algorithm>

void dcauchy_inplace(
    float* vs,
    size_t n,
    float location,
    float scale
) {
    std::for_each_n(vs, n, [=](float& v) {
        v = dcauchy_single(v, location, scale);
    });
}

The above code using STL algorithms will still be vectorized.
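If you want to state the intent explicitly, the standard algorithms also accept an execution policy (std::execution::unseq is a C++20 addition, and depending on the standard library version, <execution> may bring extra dependencies such as TBB). A sketch:

#include <algorithm>
#include <execution>

void dcauchy_inplace_unseq(float* vs, size_t n, float location, float scale) {
    // unseq: the library may vectorize, but must not spawn extra threads
    std::for_each_n(std::execution::unseq, vs, n, [=](float& v) {
        v = dcauchy_single(v, location, scale);
    });
}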

Math Functions

We often use math functions in <math.h>. Suppose now we need to plot the normal distribution:

float dnorm_single(float x, float location, float scale) {
    const float fct = 1 / sqrtf(2 * M_PI);
    const float frc = (x - location) / scale;
    return fct / scale * powf(M_E, -0.5f * frc * frc);
}

void dnorm_inplace(
    float* vs,
    size_t n,
    float location,
    float scale
) {
    for (size_t i = 0; i < n; ++i) {
        vs[i] = dnorm_single(vs[i], location, scale);
    }
}
The PDF of the normal distribution
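For reference, the formula being implemented (in LaTeX notation), with μ = location and σ = scale:

f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)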

In this code we call powf, a math function from <math.h>. powf operates on one scalar value at a time — how can it be vectorized?

gcc generates:

dnorm_inplace(float*, unsigned long, float, float):
leaq 8(%rsp), %r10
andq $-32, %rsp
pushq -8(%r10)
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %r10
pushq %rbx
addq $-128, %rsp
vmovss %xmm0, -148(%rbp)
testq %rsi, %rsi
je .L12
vmovaps %xmm0, %xmm7
leaq -1(%rsi), %r14
movq %rdi, %r13
movq %rsi, %r12
vmovss .LC0(%rip), %xmm0
vdivss %xmm1, %xmm0, %xmm6
vmovss .LC1(%rip), %xmm0
vdivss %xmm1, %xmm0, %xmm2
vmovss %xmm6, -152(%rbp)
vmovss %xmm2, -156(%rbp)
cmpq $6, %r14
jbe .L10
vbroadcastss %xmm7, %ymm7
movq %rsi, %r15
movq %rdi, %rbx
vmovaps %ymm7, -80(%rbp)
shrq $3, %r15
vbroadcastss %xmm6, %ymm7
vmovaps %ymm7, -112(%rbp)
salq $5, %r15
vbroadcastss %xmm2, %ymm7
vmovaps %ymm7, -144(%rbp)
addq %rdi, %r15
.L5:
vmovups (%rbx), %ymm2
vsubps -80(%rbp), %ymm2, %ymm0
addq $32, %rbx
vmulps -112(%rbp), %ymm0, %ymm0
vmulps %ymm0, %ymm0, %ymm0
vmulps .LC2(%rip), %ymm0, %ymm0
call _ZGVdN8v_expf
vmulps -144(%rbp), %ymm0, %ymm0
vmovups %ymm0, -32(%rbx)
cmpq %r15, %rbx
jne .L5
movq %r12, %rbx
andq $-8, %rbx
testb $7, %r12b
je .L15
vzeroupper
.L3:
movq %r12, %r15
subq %rbx, %r14
subq %rbx, %r15
cmpq $2, %r14
jbe .L8
vbroadcastss -148(%rbp), %xmm0
leaq 0(%r13,%rbx,4), %r14
vbroadcastss -152(%rbp), %xmm1
vmovups (%r14), %xmm6
vsubps %xmm0, %xmm6, %xmm0
vmulps %xmm1, %xmm0, %xmm0
vmulps %xmm0, %xmm0, %xmm0
vmulps .LC3(%rip), %xmm0, %xmm0
call _ZGVbN4v_expf
movq %r15, %rax
vbroadcastss -156(%rbp), %xmm1
andq $-4, %rax
vmulps %xmm1, %xmm0, %xmm0
addq %rax, %rbx
vmovups %xmm0, (%r14)
cmpq %rax, %r15
je .L12
.L8:
leaq 0(%r13,%rbx,4), %r14
vmovss (%r14), %xmm0
vsubss -148(%rbp), %xmm0, %xmm0
vmulss -152(%rbp), %xmm0, %xmm0
vmulss %xmm0, %xmm0, %xmm0
vmulss .LC4(%rip), %xmm0, %xmm0
vmulss .LC5(%rip), %xmm0, %xmm0
call expf
vmulss -156(%rbp), %xmm0, %xmm0
leaq 1(%rbx), %rax
vmovss %xmm0, (%r14)
cmpq %rax, %r12
jbe .L12
leaq 0(%r13,%rax,4), %r14
addq $2, %rbx
vmovss (%r14), %xmm0
vsubss -148(%rbp), %xmm0, %xmm0
vmulss -152(%rbp), %xmm0, %xmm0
vmulss %xmm0, %xmm0, %xmm0
vmulss .LC4(%rip), %xmm0, %xmm0
vmulss .LC5(%rip), %xmm0, %xmm0
call expf
vmulss -156(%rbp), %xmm0, %xmm0
vmovss %xmm0, (%r14)
cmpq %rbx, %r12
jbe .L12
leaq 0(%r13,%rbx,4), %rbx
vmovss (%rbx), %xmm0
vsubss -148(%rbp), %xmm0, %xmm0
vmulss -152(%rbp), %xmm0, %xmm0
vmulss %xmm0, %xmm0, %xmm0
vmulss .LC4(%rip), %xmm0, %xmm0
vmulss .LC5(%rip), %xmm0, %xmm0
call expf
vmulss -156(%rbp), %xmm0, %xmm0
vmovss %xmm0, (%rbx)
.L12:
subq $-128, %rsp
popq %rbx
popq %r10
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
leaq -8(%r10), %rsp
ret
.L15:
vzeroupper
jmp .L12
.L10:
xorl %ebx, %ebx
jmp .L3

The core part is:

.L5:
vmovups (%rbx), %ymm2
vsubps -80(%rbp), %ymm2, %ymm0
addq $32, %rbx
vmulps -112(%rbp), %ymm0, %ymm0
vmulps %ymm0, %ymm0, %ymm0
vmulps .LC2(%rip), %ymm0, %ymm0
call _ZGVdN8v_expf
vmulps -144(%rbp), %ymm0, %ymm0
vmovups %ymm0, -32(%rbx)
cmpq %r15, %rbx
jne .L5

The assembly calls _ZGVdN8v_expf. This is an AVX2-optimized expf from glibc’s vector math library (libmvec): it takes its argument in ymm0 and processes eight floats simultaneously. (gcc has also rewritten powf(M_E, x) as expf(x).) With the help of glibc, gcc can vectorize code that uses math functions.

clang 11.0 does not use _ZGVdN8v_expf and generates a much slower binary. And because msvcrt has no such function, gcc on MinGW cannot vectorize this code either.
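If a toolchain lacks such a vectorized math library, one small tweak worth trying (my suggestion, not part of the original code) is to write the exponential with expf directly instead of powf(M_E, ...) — this is exactly the rewrite gcc performed above, and spelling it out can make the vectorizer’s job easier:

float dnorm_single(float x, float location, float scale) {
    const float fct = 1.0f / sqrtf(2.0f * (float)M_PI);
    const float frc = (x - location) / scale;
    return fct / scale * expf(-0.5f * frc * frc);   // expf instead of powf(M_E, ...)
}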
