Let Compiler Vectorize Our Code!

Misaki Kasumi
11 min readNov 1, 2020

TL;DR: Modern compilers have very strong auto-vectorization capabilities. So just write plain loops and let the compiler optimize them (with an appropriate optimization level)!

Why We Need to Vectorize Our Code

In many cases, we need to perform the same operations on a huge amount of data. For example, if I need to calculate a single value of the probability density function of the Cauchy distribution, I will write:

// M_1_PI (1/π) comes from <math.h>
float dcauchy_single(float x, float location, float scale) {
    const float fct = M_1_PI / scale;
    const float frc = (x - location) / scale;
    const float y = fct / (1.0f + frc * frc);
    return y;
}
The PDF of the Cauchy distribution
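For reference, this is the formula the function implements (in LaTeX notation), with x_0 = location and γ = scale:

f(x; x_0, \gamma) = \frac{1}{\pi\gamma\left[1 + \left(\frac{x - x_0}{\gamma}\right)^2\right]}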

That works fine for a single x:

dcauchy_single(0, 0, 1);
// x = 0, x0 = 0, γ = 1; the result should be about 0.3183 (1/π)

But what if I want to plot the PDF on an interval? A naive method is to write a loop:

void dcauchy_inplace(
    float* vs,
    size_t n,
    float location,
    float scale
) {
    for (size_t i = 0; i < n; ++i) {
        vs[i] = dcauchy_single(vs[i], location, scale);
    }
}

Then, in the main program (it’s just some glue):

#include <cstdio>

// generate a sequence
void frange(float* dest, float lower, float by, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dest[i] = lower + i * by;
    }
}

int main() {
    constexpr size_t n = 1000;
    constexpr float lower = -10;
    constexpr float by = 0.02;
    constexpr float location = 0;
    constexpr float scale = 1;
    float vs[n];
    // generate one thousand "x"
    frange(vs, lower, by, n);
    // calculate the function values at multiple x
    dcauchy_inplace(vs, n, location, scale);
    // output
    for (size_t i = 0; i < n; ++i) {
        printf("%f ", vs[i]);
    }
}

This is a SISD (single instruction stream, single data stream) strategy: the loop iterates over the array, taking one element and calculating its value at a time. The calculation itself never changes (it is the same formula); only the data changes in each iteration.

There is certainly a better strategy: SIMD (single instruction, multiple data). In each iteration, we do not take and calculate a single element, but multiple elements, simultaneously!

What Is Vectorization

“Vectorization” means performing operations on a vector (a one-dimensional array) instead of on a single element. Suppose there is a data type vec4f that consists of four floats and supports arithmetic operations; we can then vectorize our code:

vec4f dcauchy_vec4f(vec4f x, float location, float scale) {
    const float fct = M_1_PI / scale;
    const vec4f frc = (x - location) / scale;
    const vec4f y = fct / (1.0f + frc * frc);
    return y;
}

void dcauchy_inplace_vec4f(
    vec4f* vs,
    size_t n,
    float location,
    float scale
) {
    for (size_t i = 0; i < n; ++i) {
        vs[i] = dcauchy_vec4f(vs[i], location, scale);
    }
}

Here, in each iteration, we take a vec4f (four floats) and calculate the values of the PDF. Four x in, four values out, in a single iteration. Theoretically, we will achieve 4x performance!

We may wonder why there is no vec4f in the real world. In fact, there is! The floating-point unit of an x86 CPU actually uses wide registers — 128 bits (xmm), the equivalent of four floats or two doubles. Even when we operate on a single float, it is put into this wide register, with the rest of the space unused. That is also one reason why C/C++ defaults to 64-bit double rather than 32-bit float — the registers are wide enough.

Vector units have been added to x86 CPUs since the last century, along with new instruction sets. MMX features 64-bit mm registers for SIMD with integers. The SSE family adds 128-bit xmm registers for SIMD with both floating-point numbers and integers and, at the same time, replaces the old x87 instructions. Then comes AVX, which adds 256-bit ymm registers; AVX2, introduced with Intel’s Haswell, extends most integer operations to 256 bits. AVX-512 adds 512-bit zmm registers.

We can use these wider registers directly, with assembly or with __m128/__m256 and intrinsic functions. But this is the modern era — all we really need is a modern compiler!
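To appreciate what the compiler saves us, here is a minimal sketch of the same Cauchy kernel written by hand with SSE intrinsics. The function name is mine, and it assumes n is a multiple of 4; a real version would also need a scalar tail.

#include <immintrin.h>
#include <math.h>
#include <stddef.h>

// Hand-vectorized Cauchy PDF using 128-bit SSE intrinsics (a sketch).
void dcauchy_inplace_sse(float* vs, size_t n, float location, float scale) {
    const __m128 loc = _mm_set1_ps(location);
    const __m128 scl = _mm_set1_ps(scale);
    const __m128 fct = _mm_set1_ps((float)(M_1_PI / scale));
    const __m128 one = _mm_set1_ps(1.0f);
    for (size_t i = 0; i < n; i += 4) {                  // 4 floats per step
        __m128 x   = _mm_loadu_ps(vs + i);
        __m128 frc = _mm_div_ps(_mm_sub_ps(x, loc), scl);
        __m128 y   = _mm_div_ps(fct, _mm_add_ps(one, _mm_mul_ps(frc, frc)));
        _mm_storeu_ps(vs + i, y);
    }
}

It works, but it is verbose, tied to one instruction set, and easy to get wrong — which is exactly why we want the compiler to do it for us.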

Let the Compiler Help!

Modern compilers are capable of auto-vectorization; we just need to tweak some options! In this section, I use gcc 10.2 as the compiler, and Compiler Explorer for online assembly inspection.

We will use exactly the same code:

float dcauchy_single(float x, float location, float scale) {
    const float fct = M_1_PI / scale;
    const float frc = (x - location) / scale;
    const float y = fct / (1.0f + frc * frc);
    return y;
}

void dcauchy_inplace(
    float* vs,
    size_t n,
    float location,
    float scale
) {
    for (size_t i = 0; i < n; ++i) {
        vs[i] = dcauchy_single(vs[i], location, scale);
    }
}

First, tell gcc to enable auto-vectorization:

-ftree-vectorize

-ftree-vectorize is not included in -O2, but it is turned on by -O3 and -Ofast. I recommend not using -ftree-vectorize explicitly, but using -Ofast instead, because -Ofast enables other useful optimizations such as -ffast-math, which lets gcc use a faster floating-point model that does not conform to IEEE 754. -Ofast may be too aggressive for ordinary code, so it is better to put the arithmetic part of the code in a separate compilation unit with its own optimization options, as in the example below.
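For example (the file names here are only illustrative), the number-crunching translation unit can get the aggressive flags while the rest of the program keeps a conservative -O2:

g++ -O2    -c main.cpp
g++ -Ofast -c math_kernels.cpp
g++ main.o math_kernels.o -o plot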

Because not all CPUs support the same instruction sets, and gcc falls back to a generic x86-64 target by default, we need to tell gcc which CPU we are targeting:

-march=haswell

Here I ask gcc to target Intel’s Haswell and later CPUs. Haswell came out in 2013, so it is a reasonably safe baseline today. And Haswell supports AVX2 with its 256-bit ymm registers, so gcc will generate a binary that uses the AVX2 instruction set.

(You can use -mprefer-vector-width=512 -march=icelake-client to generate AVX-512 instructions.)

Finally we can add

-fopt-info

to see what is optimized.

We’re done! With these options, gcc automatically vectorizes our unmodified code using AVX2. Here is the assembly it generates:

dcauchy_inplace(float*, unsigned long, float, float):
vmovaps %xmm0, %xmm3
testq %rsi, %rsi
je .L12
vmovsd .LC0(%rip), %xmm5
vcvtss2sd %xmm1, %xmm1, %xmm0
vmovss .LC1(%rip), %xmm4
leaq -1(%rsi), %rcx
vdivsd %xmm0, %xmm5, %xmm5
vdivss %xmm1, %xmm4, %xmm2
vcvtsd2ss %xmm5, %xmm5, %xmm5
cmpq $6, %rcx
jbe .L10
movq %rsi, %rdx
vbroadcastss %xmm3, %ymm9
vbroadcastss %xmm2, %ymm8
movq %rdi, %rax
shrq $3, %rdx
vmovaps .LC2(%rip), %ymm6
vbroadcastss %xmm5, %ymm7
salq $5, %rdx
addq %rdi, %rdx
.L5:
vmovups (%rax), %ymm1
addq $32, %rax
vsubps %ymm9, %ymm1, %ymm0
vmulps %ymm8, %ymm0, %ymm0
vfmadd132ps %ymm0, %ymm6, %ymm0
vrcpps %ymm0, %ymm1
vmulps %ymm0, %ymm1, %ymm0
vmulps %ymm0, %ymm1, %ymm0
vaddps %ymm1, %ymm1, %ymm1
vsubps %ymm0, %ymm1, %ymm0
vmulps %ymm0, %ymm7, %ymm0
vmovups %ymm0, -32(%rax)
cmpq %rax, %rdx
jne .L5
movq %rsi, %rax
andq $-8, %rax
testb $7, %sil
je .L14
vzeroupper
.L3:
movq %rsi, %r8
subq %rax, %rcx
subq %rax, %r8
cmpq $2, %rcx
jbe .L8
leaq (%rdi,%rax,4), %rdx
vshufps $0, %xmm3, %xmm3, %xmm0
vshufps $0, %xmm2, %xmm2, %xmm1
vshufps $0, %xmm5, %xmm5, %xmm6
vmovups (%rdx), %xmm7
vsubps %xmm0, %xmm7, %xmm0
vmulps %xmm1, %xmm0, %xmm0
vfmadd213ps .LC3(%rip), %xmm0, %xmm0
vrcpps %xmm0, %xmm1
vmulps %xmm0, %xmm1, %xmm0
vmulps %xmm0, %xmm1, %xmm0
vaddps %xmm1, %xmm1, %xmm1
vsubps %xmm0, %xmm1, %xmm0
vmulps %xmm0, %xmm6, %xmm0
vmovups %xmm0, (%rdx)
movq %r8, %rdx
andq $-4, %rdx
addq %rdx, %rax
cmpq %rdx, %r8
je .L12
.L8:
leaq (%rdi,%rax,4), %rdx
vmovss (%rdx), %xmm0
vsubss %xmm3, %xmm0, %xmm0
vmulss %xmm2, %xmm0, %xmm0
vfmadd132ss %xmm0, %xmm4, %xmm0
vdivss %xmm0, %xmm5, %xmm0
vmovss %xmm0, (%rdx)
leaq 1(%rax), %rdx
cmpq %rdx, %rsi
jbe .L12
leaq (%rdi,%rdx,4), %rdx
addq $2, %rax
vmovss (%rdx), %xmm0
vsubss %xmm3, %xmm0, %xmm0
vmulss %xmm2, %xmm0, %xmm0
vfmadd132ss %xmm0, %xmm4, %xmm0
vdivss %xmm0, %xmm5, %xmm0
vmovss %xmm0, (%rdx)
cmpq %rax, %rsi
jbe .L12
leaq (%rdi,%rax,4), %rax
vmovss (%rax), %xmm0
vsubss %xmm3, %xmm0, %xmm3
vmulss %xmm2, %xmm3, %xmm2
vfmadd132ss %xmm2, %xmm4, %xmm2
vdivss %xmm2, %xmm5, %xmm5
vmovss %xmm5, (%rax)
.L12:
ret
.L14:
vzeroupper
ret
.L10:
xorl %eax, %eax
jmp .L3

gcc inlines dcauchy_single into dcauchy_inplace and uses AVX2 instructions to vectorize the loop, processing eight floats per iteration. The core part of the assembly is:

.L5:
vmovups (%rax), %ymm1
addq $32, %rax
vsubps %ymm9, %ymm1, %ymm0
vmulps %ymm8, %ymm0, %ymm0
vfmadd132ps %ymm0, %ymm6, %ymm0
vrcpps %ymm0, %ymm1
vmulps %ymm0, %ymm1, %ymm0
vmulps %ymm0, %ymm1, %ymm0
vaddps %ymm1, %ymm1, %ymm1
vsubps %ymm0, %ymm1, %ymm0
vmulps %ymm0, %ymm7, %ymm0
vmovups %ymm0, -32(%rax)
cmpq %rax, %rdx
jne .L5

This is one iteration. We can see a lot of instructions starting with “v” — these are VEX-encoded instructions used by AVX. They operate on the ymm registers, which are 256 bits wide. In each iteration, %rax (the current position) is advanced by 32 bytes: eight floats are loaded from memory and processed simultaneously.

But the length of the array is not always a multiple of 8, so gcc processes the remaining elements (if any) after the loop using a 4+1+1+1 scheme: it first uses the 128-bit xmm registers to process four floats, then handles the rest one by one.

We can understand the control flow better with the rough sketch below.
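In C++ terms, the generated control flow looks roughly like this — a hand-written sketch of the branch structure read off the assembly above, not real compiler output:

// Rough shape of the generated control flow (bodies elided)
void dcauchy_inplace_shape(float* vs, size_t n, float location, float scale) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        // .L5: main AVX2 loop — 8 floats (one ymm register) per iteration
    }
    if (n - i >= 4) {
        // one 128-bit (xmm) iteration for a remainder of 4..7 elements
        i += 4;
    }
    for (; i < n; ++i) {
        // .L8 and the following blocks: at most 3 leftover elements, one by one
    }
    (void)vs; (void)location; (void)scale;  // unused here because bodies are elided
}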

AVX-SSE Transition Penalties

We may notice vzeroupper in the assembly. This instruction avoids AVX-SSE transition penalties. AVX instructions are VEX-encoded, but the old SSE instructions are not. AVX instructions zero the unused upper bits of the registers, while SSE has no knowledge of those upper bits. So if AVX and SSE instructions are mixed, the CPU has to save and restore the upper bits of the registers during the transitions, which causes performance penalties. vzeroupper zeroes the upper bits to avoid this, which is why gcc emits a vzeroupper after blocks of AVX instructions.
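This matters mainly when writing AVX intrinsics by hand; as a side note of mine (not something from the compiler output above), the same instruction can be emitted explicitly before calling into code that may still use legacy SSE:

#include <immintrin.h>

void leave_avx_region() {
    // Emits vzeroupper: clears the upper 128 bits of all ymm registers
    // so that following legacy SSE code pays no transition penalty.
    _mm256_zeroupper();
}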

Memory Alignment

If we benchmark our AVX2-optimized code, we may be surprised to find that it is not faster than the SSE2-optimized code. Why? The ymm registers used by AVX2 are twice as wide as the xmm registers used by SSE!

This is because the memory that the AVX2 instructions operate on is not suitably aligned, which causes a performance penalty.

The ymm registers used by AVX2 are 256 bits wide. To achieve maximum performance, the memory address should be aligned to 32 bytes. By default, memory allocated by glibc’s malloc is only guaranteed to be aligned to 16 bytes, so we need to explicitly allocate 32-byte-aligned memory.

In C11, we can use aligned_alloc. On Windows, we can use _aligned_malloc.

#include <cassert>
#include <cstdlib>

float* aligned_float_alloc(size_t n) {
#ifdef _ISOC11_SOURCE
    // aligned_alloc wants the size to be a multiple of the alignment
    void* p = aligned_alloc(32, (n * sizeof(float) + 31) / 32 * 32);
#else
    void* p = _aligned_malloc(n * sizeof(float), 32);  // from <malloc.h> on Windows
#endif
    assert(p);
    return reinterpret_cast<float*>(p);
}
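A minimal usage sketch based on the earlier main (my adaptation, not from the original article); note that aligned_alloc pairs with free, while _aligned_malloc pairs with _aligned_free:

int main() {
    constexpr size_t n = 1000;
    float* vs = aligned_float_alloc(n);   // 32-byte aligned, instead of float vs[n]
    frange(vs, -10.0f, 0.02f, n);
    dcauchy_inplace(vs, n, 0.0f, 1.0f);
    // ... print or plot vs ...
#ifdef _ISOC11_SOURCE
    free(vs);
#else
    _aligned_free(vs);
#endif
}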

Do we need to add __attribute__ ((__aligned__(32))) to the argument type for an aligned float array?

typedef float afloat __attribute__ ((__aligned__(32)));

void dcauchy_inplace(
    afloat* vs,
    size_t n,
    float location,
    float scale
)

In fact, it is not necessary — the assembly generated is the same.

__restrict__

When a function accepts multiple pointers as arguments, the compiler cannot tell whether those pointers refer to the same memory, so some optimizations are suppressed. Adding __restrict__ tells the compiler that they do not alias each other.

void zip(
    float* __restrict__ dest,
    float* __restrict__ a,
    float* __restrict__ b,
    size_t n
) {
    for (size_t i = 0; i < n; ++i) {
        dest[i * 2] = a[i];
    }
    for (size_t i = 0; i < n; ++i) {
        dest[i * 2 + 1] = b[i];
    }
}

STL Algorithm

gcc can also vectorize STL algorithms, regardless of whether the execution policy std::execution::unseq is specified.

#include <algorithm>

void dcauchy_inplace(
    float* vs,
    size_t n,
    float location,
    float scale
) {
    std::for_each_n(vs, n, [=](float& v) {
        v = dcauchy_single(v, location, scale);
    });
}

The above code using STL algorithms will still be vectorized.
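If you want to state the intent explicitly, the standard algorithms also accept an execution policy (std::execution::unseq is a C++20 addition, and depending on the standard library version, <execution> may bring extra dependencies such as TBB). A sketch:

#include <algorithm>
#include <execution>

void dcauchy_inplace_unseq(float* vs, size_t n, float location, float scale) {
    // unseq: the library may vectorize, but must not spawn extra threads
    std::for_each_n(std::execution::unseq, vs, n, [=](float& v) {
        v = dcauchy_single(v, location, scale);
    });
}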

Math Functions

We often use math functions in <math.h>. Suppose now we need to plot the normal distribution:

float dnorm_single(float x, float location, float scale) {
    const float fct = 1 / sqrtf(2 * M_PI);
    const float frc = (x - location) / scale;
    return fct / scale * powf(M_E, -0.5f * frc * frc);
}

void dnorm_inplace(
    float* vs,
    size_t n,
    float location,
    float scale
) {
    for (size_t i = 0; i < n; ++i) {
        vs[i] = dnorm_single(vs[i], location, scale);
    }
}
The PDF of the normal distribution
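For reference, the formula being implemented (in LaTeX notation), with μ = location and σ = scale:

f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)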

In this code we call powf, a math function from <math.h>. powf operates on one scalar value at a time — how can it be vectorized?

gcc generates:

dnorm_inplace(float*, unsigned long, float, float):
leaq 8(%rsp), %r10
andq $-32, %rsp
pushq -8(%r10)
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %r10
pushq %rbx
addq $-128, %rsp
vmovss %xmm0, -148(%rbp)
testq %rsi, %rsi
je .L12
vmovaps %xmm0, %xmm7
leaq -1(%rsi), %r14
movq %rdi, %r13
movq %rsi, %r12
vmovss .LC0(%rip), %xmm0
vdivss %xmm1, %xmm0, %xmm6
vmovss .LC1(%rip), %xmm0
vdivss %xmm1, %xmm0, %xmm2
vmovss %xmm6, -152(%rbp)
vmovss %xmm2, -156(%rbp)
cmpq $6, %r14
jbe .L10
vbroadcastss %xmm7, %ymm7
movq %rsi, %r15
movq %rdi, %rbx
vmovaps %ymm7, -80(%rbp)
shrq $3, %r15
vbroadcastss %xmm6, %ymm7
vmovaps %ymm7, -112(%rbp)
salq $5, %r15
vbroadcastss %xmm2, %ymm7
vmovaps %ymm7, -144(%rbp)
addq %rdi, %r15
.L5:
vmovups (%rbx), %ymm2
vsubps -80(%rbp), %ymm2, %ymm0
addq $32, %rbx
vmulps -112(%rbp), %ymm0, %ymm0
vmulps %ymm0, %ymm0, %ymm0
vmulps .LC2(%rip), %ymm0, %ymm0
call _ZGVdN8v_expf
vmulps -144(%rbp), %ymm0, %ymm0
vmovups %ymm0, -32(%rbx)
cmpq %r15, %rbx
jne .L5
movq %r12, %rbx
andq $-8, %rbx
testb $7, %r12b
je .L15
vzeroupper
.L3:
movq %r12, %r15
subq %rbx, %r14
subq %rbx, %r15
cmpq $2, %r14
jbe .L8
vbroadcastss -148(%rbp), %xmm0
leaq 0(%r13,%rbx,4), %r14
vbroadcastss -152(%rbp), %xmm1
vmovups (%r14), %xmm6
vsubps %xmm0, %xmm6, %xmm0
vmulps %xmm1, %xmm0, %xmm0
vmulps %xmm0, %xmm0, %xmm0
vmulps .LC3(%rip), %xmm0, %xmm0
call _ZGVbN4v_expf
movq %r15, %rax
vbroadcastss -156(%rbp), %xmm1
andq $-4, %rax
vmulps %xmm1, %xmm0, %xmm0
addq %rax, %rbx
vmovups %xmm0, (%r14)
cmpq %rax, %r15
je .L12
.L8:
leaq 0(%r13,%rbx,4), %r14
vmovss (%r14), %xmm0
vsubss -148(%rbp), %xmm0, %xmm0
vmulss -152(%rbp), %xmm0, %xmm0
vmulss %xmm0, %xmm0, %xmm0
vmulss .LC4(%rip), %xmm0, %xmm0
vmulss .LC5(%rip), %xmm0, %xmm0
call expf
vmulss -156(%rbp), %xmm0, %xmm0
leaq 1(%rbx), %rax
vmovss %xmm0, (%r14)
cmpq %rax, %r12
jbe .L12
leaq 0(%r13,%rax,4), %r14
addq $2, %rbx
vmovss (%r14), %xmm0
vsubss -148(%rbp), %xmm0, %xmm0
vmulss -152(%rbp), %xmm0, %xmm0
vmulss %xmm0, %xmm0, %xmm0
vmulss .LC4(%rip), %xmm0, %xmm0
vmulss .LC5(%rip), %xmm0, %xmm0
call expf
vmulss -156(%rbp), %xmm0, %xmm0
vmovss %xmm0, (%r14)
cmpq %rbx, %r12
jbe .L12
leaq 0(%r13,%rbx,4), %rbx
vmovss (%rbx), %xmm0
vsubss -148(%rbp), %xmm0, %xmm0
vmulss -152(%rbp), %xmm0, %xmm0
vmulss %xmm0, %xmm0, %xmm0
vmulss .LC4(%rip), %xmm0, %xmm0
vmulss .LC5(%rip), %xmm0, %xmm0
call expf
vmulss -156(%rbp), %xmm0, %xmm0
vmovss %xmm0, (%rbx)
.L12:
subq $-128, %rsp
popq %rbx
popq %r10
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
leaq -8(%r10), %rsp
ret
.L15:
vzeroupper
jmp .L12
.L10:
xorl %ebx, %ebx
jmp .L3

The core part is:

.L5:
vmovups (%rbx), %ymm2
vsubps -80(%rbp), %ymm2, %ymm0
addq $32, %rbx
vmulps -112(%rbp), %ymm0, %ymm0
vmulps %ymm0, %ymm0, %ymm0
vmulps .LC2(%rip), %ymm0, %ymm0
call _ZGVdN8v_expf
vmulps -144(%rbp), %ymm0, %ymm0
vmovups %ymm0, -32(%rbx)
cmpq %r15, %rbx
jne .L5

The assembly calls _ZGVdN8v_expf. This is an AVX2-optimized expf from glibc’s vector math library (libmvec): it takes its argument in ymm0 and processes eight floats simultaneously. (gcc has also rewritten powf(M_E, x) as expf(x).) With the help of glibc, gcc can vectorize code that uses math functions.

clang 11.0 does not use _ZGVdN8v_expf and generates a much slower binary. And because msvcrt has no such function, gcc on MinGW cannot vectorize this code either.
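If a toolchain lacks such a vectorized math library, one small tweak worth trying (my suggestion, not part of the original code) is to write the exponential with expf directly instead of powf(M_E, ...) — this is exactly the rewrite gcc performed above, and spelling it out can make the vectorizer’s job easier:

float dnorm_single(float x, float location, float scale) {
    const float fct = 1.0f / sqrtf(2.0f * (float)M_PI);
    const float frc = (x - location) / scale;
    return fct / scale * expf(-0.5f * frc * frc);   // expf instead of powf(M_E, ...)
}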
