**Preamble for planet.freedesktop.org**

Sorry about the poor formatting on planet.freedesktop.org; it seems it and BlogSpot don't quite get along, so you won't see any color highlights. It looks much better (and is easier to read) on my actual blog page, honest!

Updated version includes float-to-int optimization and comments; sorry if this bumps this rather long post to the top again; this is not my intention. planet.freedesktop.org admins: is there some way to disable bumping when a post is updated? (Perhaps selectively, in case the bump is important. e.g. updated dates for an event.)

This analysis was performed using a modified version of Chris Lomont's inverse square-root testing code. The accompanying publication is worth reading before looking at any of this data.

I've started looking into whether there would be any performance difference in a few optimized math functions should **-fstrict-aliasing** be enabled. I did not believe strict-aliasing would have much of an effect on these optimized functions (and it turns out I was correct), but the benefit is seen when compiling other code which includes these inline functions.

Without strict-aliasing compatibility, including the header file containing the incompatible functions/macros taints the entire file, meaning you cannot use **-fstrict-aliasing** where it may be helpful for your general code.

Here are the results for the standard *1.0 / sqrt(x)* frequently used in graphics engines. Even though today's renderers typically use carefully crafted SIMD functions for the critical path, this is still useful for quickly normalizing vectors in game code, etc.

The Lomont version of the function is a tiny bit faster and a tiny bit more accurate, but nothing to write home about.

Clearly this micro-optimization is an excellent choice for x86 and x86_64. Don't try it on ARM; there it's *far* slower than just taking the hit on *1.0 / sqrt(x)*.

I don't know whether this optimization could be modified for ARM; any assembly experts out there?

Timing Exact function

1752 ms used for 100000000 passes, avg 1.752e-05 ms

Timing Carmack function

463 ms used for 100000000 passes, avg 4.63e-06 ms

Timing Carmack function (strict-aliasing)

455 ms used for 100000000 passes, avg 4.55e-06 ms

Timing Lomont function

453 ms used for 100000000 passes, avg 4.53e-06 ms

Timing Lomont function (strict-aliasing)

455 ms used for 100000000 passes, avg 4.55e-06 ms

The absolute value function is mostly used for comparisons (e.g. *fabs(y - x) > epsilon*) and in some other specialized functions: finding on which side of a plane an AABB resides, its distance from said plane, the AABB radius, etc. Therefore it's useful to optimize this function where possible...

However, apparently it's quite a bit faster to just call libc's *fabsf* function! I saw this originally in the Quake 3 Arena source code, so maybe things were different with the compilers and hardware of the time.

Timing Exact fabsf function

268 ms used for 100000000 passes, avg 2.68e-06 ms

Timing Bit-Masking fabsf function

304 ms used for 100000000 passes, avg 3.04e-06 ms

Timing Bit-Masking fabsf function (strict-aliasing)

305 ms used for 100000000 passes, avg 3.05e-06 ms

These macros/functions are used when you want to know the sign of a float (i.e. is the value positive or negative?) without performing any comparison (for performance reasons). It seems that the strict-aliasing versions perform about identically to the macros.

Timing Exact float sign bit not set function

327 ms used for 100000000 passes, avg 3.27e-06 ms

Timing FLOATSIGNBITNOTSET macro

313 ms used for 100000000 passes, avg 3.13e-06 ms

Timing Bit-Masking float sign bit not set function (strict-aliasing)

312 ms used for 100000000 passes, avg 3.12e-06 ms

Timing Exact float sign bit set function

342 ms used for 100000000 passes, avg 3.42e-06 ms

Timing FLOATSIGNBITSET macro

305 ms used for 100000000 passes, avg 3.05e-06 ms

Timing Bit-Masking float sign bit set function (strict-aliasing)

305 ms used for 100000000 passes, avg 3.05e-06 ms

Don't use "*d = (int) f*" if you want fast code. *fld* and *fistp* work nicely on x86 and x86_64.

Timing Exact float-to-int function

1252 ms used for 100000000 passes, avg 1.252e-05 ms

Timing Fast float-to-int function

336 ms used for 100000000 passes, avg 3.36e-06 ms

Done. By Chris Lomont 2003. Modified by Oliver McFadden 2011

These measurements were taken on my laptop with an *Intel(R) Core(TM)2 Duo CPU P9500 @ 2.53GHz* processor, and the test program was compiled with *gcc version 4.4.5 (Debian 4.4.5-6)*.

Whether this makes any huge difference in frames-per-second is debatable; really, I had a bit of time and was bored. :-) Anyway, I wouldn't claim anything until testing under real-world conditions.

It does look like the bit-masking *fabs* can be thrown away, though, and the fast float-to-int is a major win (although beware of possible rounding differences).

For x86 (SSE), the fast inverse square-root approximation can be calculated using the RSQRTSS instruction.

For ARM (NEON), this is achieved using the VRSQRTE instruction (initial approximation) followed by VRSQRTS (Newton-Raphson iteration).


@Serge: Yeah, I'm aware of rsqrtss for x86, I'm just not sure it's any faster than the Carmack/Lomont method. I'll add a test case.

Thanks for the tip on ARM; I'll definitely give that a try and see how it does on ioquake3 for N900, or if I'm lazy, just recompile the test code. :)

Well, it does look like the SSE rsqrtss is clearly a lot faster at first glance... However, I'm suspicious enough to run it through the error-checking code. I believe rsqrtss doesn't handle zero values as correctly as the other functions do; I would have to double-check this, though.

Timing Exact function

1678 ms used for 100000000 passes, avg 1.678e-05 ms

Timing Carmack function

454 ms used for 100000000 passes, avg 4.54e-06 ms

Timing Carmack function (strict-aliasing)

460 ms used for 100000000 passes, avg 4.6e-06 ms

Timing Lomont function

452 ms used for 100000000 passes, avg 4.52e-06 ms

Timing Lomont function (strict-aliasing)

451 ms used for 100000000 passes, avg 4.51e-06 ms

Timing rsqrtss function

290 ms used for 100000000 passes, avg 2.9e-06 ms

Checking errors. This may take a long time

Carmack error: max 1.56138e+16 rel max 0.00175228 avg 1.54397e-05

Lomont error : max 1.56046e+16 rel max 0.00175124 avg 1.54397e-05

rsqrtss error: max 9.22337e+18 rel max 1 avg 0.00790514

I posted the whole results again, because it seems that enabling -msse (for the rsqrtss function) does affect the other functions slightly. Or maybe Firefox was eating a bit less of my CPU this time (PC was idle for the test, but with programs running.)

So, to be fair, the complete results are posted above.

RSQRTSS is basically only a replacement for the initial approximation part:

```c
int i = *(int*)&x;
i = 0x5f3759df - (i >> 1);
x = *(float*)&i;
```

But at least one extra Newton step is still needed after that in order to get reasonable precision.

Just as an example, this is the code generated by the Intel compiler (as we can see, it uses the RSQRTSS instruction automatically):

```
$ cat rsqrt-test.c
#include <math.h>

float rsqrt(float f)
{
    return 1.0f / sqrt(f);
}

$ icc -O3 -c rsqrt-test.c
$ objdump -Mintel -d rsqrt-test.o

0000000000000000 <rsqrt>:
   0: f3 0f 52 c8             rsqrtss xmm1,xmm0
   4: f3 0f 59 c1             mulss  xmm0,xmm1
   8: f3 0f 59 c1             mulss  xmm0,xmm1
   c: f3 0f 5c 05 00 00 00    subss  xmm0,DWORD PTR [rip+0x0]  # 14
  13: 00
  14: f3 0f 59 c8             mulss  xmm1,xmm0
  18: f3 0f 59 0d 00 00 00    mulss  xmm1,DWORD PTR [rip+0x0]  # 20
  1f: 00
  20: 0f 28 c1                movaps xmm0,xmm1
  23: c3                      ret
```

The ARM ISA is a little bit better because it also has a dedicated instruction (VRSQRTS) for the Newton step.

Oh, yes. You're absolutely right, here are the new results with just one Newton-Raphson iteration (same as the other functions.)

Timing Exact function

1708 ms used for 100000000 passes, avg 1.708e-05 ms

Timing Carmack function

454 ms used for 100000000 passes, avg 4.54e-06 ms

Timing Carmack function (strict-aliasing)

465 ms used for 100000000 passes, avg 4.65e-06 ms

Timing Lomont function

457 ms used for 100000000 passes, avg 4.57e-06 ms

Timing Lomont function (strict-aliasing)

454 ms used for 100000000 passes, avg 4.54e-06 ms

Timing rsqrtss function

449 ms used for 100000000 passes, avg 4.49e-06 ms

Yes, it's slightly faster, but even with the Newton-Raphson iteration added we still have a higher relative error than the Lomont/Carmack functions:

Checking errors. This may take a long time

Carmack error : max 1.56138e+16 rel max 0.00175228 avg 1.54397e-05

Lomont error : max 1.56046e+16 rel max 0.00175124 avg 1.54397e-05

rsqrtss error : max 1.48724e+12 rel max 2.0068e-07 avg 1.88473e-09

Of course, you could do another Newton-Raphson iteration (or even more) but we're not talking much difference at all between the rsqrtss and Lomont methods. Good discussion, though! :-)

I definitely need to check into ARM assembly further.

I think 2.0068e-07 is actually smaller than 0.00175124

Anyway, that's interesting stuff. Thanks for bringing it up.

Yes, you're right; rsqrtss with a single Newton-Raphson iteration does seem to be the superior method here. Thanks for pointing it out; I had somehow missed it.

I'll update my math library to prefer it where __SSE__ is defined (-msse passed to the compiler).