With the launch of Apple M1, there is a growing trend of porting code originally written for x86 machines to ARM. For C code, porting can be easy if the code relies only on well-defined behavior: you just need to recompile it. But things get complicated if your code relies on undefined behavior, because the actual result may differ between the two architectures.
A few days ago I was porting otfcc, an optimized OpenType builder and inspector, to the ARM platform. I expected that a simple recompilation would be enough and everything would run out of the box. Instead, I ran into segmentation faults and malformed output. After about a day of debugging, I found that these errors were caused by a small difference in wrapping behavior between x86 and ARM! On x86, when a negative floating-point number is cast to an unsigned integer, it wraps around. …
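The C standard leaves the result undefined when a floating-point value whose integral part does not fit the target unsigned type is converted to it, which is why the outcome can vary between architectures. As a minimal sketch (not necessarily the fix used in otfcc), one portable way to get wrap-around is to convert to a signed integer first and then to the unsigned type. The helper name `wrap_to_u16` is hypothetical, and the sketch assumes the input fits in `int32_t` after truncation.

```c
#include <stdint.h>

/* Hypothetical helper (not from otfcc): convert a double to uint16_t with
 * wrap-around that does not depend on the CPU.
 * Assumption: after truncation toward zero, the value fits in int32_t;
 * otherwise the first cast would itself be undefined. */
uint16_t wrap_to_u16(double value) {
    /* Float -> signed integer: well-defined while the value is in range. */
    int32_t as_signed = (int32_t)value;
    /* Signed -> unsigned: well-defined, reduced modulo 65536. */
    return (uint16_t)as_signed;
}
```

With this helper, `wrap_to_u16(-1.0)` yields 65535 on both x86 and ARM, because each of the two conversions is well-defined on its own.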
TL;DR: Modern compilers are very good at auto-vectorization. So just write loops and let the compiler optimize them (with an appropriate optimization level)!
In many cases, we need to perform the same operation on a huge amount of data. For example, to calculate a single value of the probability density function of the Cauchy distribution, I would write:
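#include <math.h> /* M_1_PI (1/pi) is a POSIX extension, not standard C */

/* Probability density of the Cauchy distribution at a single point x. */
float dcauchy_single(float x, float location, float scale) {
    const float fct = M_1_PI / scale;             /* 1 / (pi * scale) */
    const float frc = (x - location) / scale;     /* standardized distance */
    const float y = fct / (1.0f + frc * frc);
    return y;
}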
That works fine for a single…
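To illustrate the TL;DR, here is a minimal sketch of the same computation written as a plain loop over an array, which is the shape compilers can usually auto-vectorize. The name `dcauchy_array`, the local constant `INV_PI_F`, and the use of `restrict` are assumptions made for this example, not taken from the original code.

```c
#include <stddef.h>

/* 1/pi as a float literal, defined locally so the sketch does not depend
 * on the POSIX-only M_1_PI macro. */
#define INV_PI_F 0.31830988618379067154f

/* Hypothetical array version: writes the Cauchy density of x[i] to out[i].
 * `restrict` promises the compiler that x and out do not overlap, which
 * removes one common obstacle to vectorization. */
void dcauchy_array(const float *restrict x, float *restrict out, size_t n,
                   float location, float scale) {
    const float fct = INV_PI_F / scale;  /* loop-invariant, computed once */
    for (size_t i = 0; i < n; ++i) {
        const float frc = (x[i] - location) / scale;
        out[i] = fct / (1.0f + frc * frc);
    }
}
```

Whether the loop is actually vectorized depends on the compiler and flags. Building with `-O3` and asking for a report (for example `-fopt-info-vec` on GCC or `-Rpass=loop-vectorize` on Clang) is one way to check.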