Comments on Oliver's Blog: Math function micro-optimization (newest first)

Oliver McFadden (2011-02-10 14:14)

Yes, you're right; rsqrtss with a single Newton-Raphson iteration does seem to be the superior method here. Thanks for pointing it out; I had somehow missed it.

I'll update my math library to prefer it where __SSE__ is defined (i.e. -msse passed to the compiler).

Unknown (2011-02-10 10:59)

I think 2.0068e-07 is actually smaller than 0.00175124.

Anyway, that's interesting stuff. Thanks for bringing it up.

Oliver McFadden (2011-02-09 23:07)

Oh, yes.
You're absolutely right; here are the new results with just one Newton-Raphson iteration (the same as the other functions):

Timing Exact function
1708 ms used for 100000000 passes, avg 1.708e-05 ms
Timing Carmack function
454 ms used for 100000000 passes, avg 4.54e-06 ms
Timing Carmack function (strict-aliasing)
465 ms used for 100000000 passes, avg 4.65e-06 ms
Timing Lomont function
457 ms used for 100000000 passes, avg 4.57e-06 ms
Timing Lomont function (strict-aliasing)
454 ms used for 100000000 passes, avg 4.54e-06 ms
Timing rsqrtss function
449 ms used for 100000000 passes, avg 4.49e-06 ms

Yes, it's slightly faster, but even with the Newton-Raphson iteration added we still have a higher relative error than the Lomont/Carmack functions:

Checking errors. This may take a long time
Carmack error : max 1.56138e+16 rel max 0.00175228 avg 1.54397e-05
Lomont error  : max 1.56046e+16 rel max 0.00175124 avg 1.54397e-05
rsqrtss error : max 1.48724e+12 rel max 2.0068e-07 avg 1.88473e-09

Of course, you could do another Newton-Raphson iteration (or even more), but we're not talking much difference at all between the rsqrtss and Lomont methods. Good discussion, though!
:-)

I definitely need to look into ARM assembly further.

Unknown (2011-02-09 22:29)

RSQRTSS is basically only a replacement for the initial-approximation part:

    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1);
    x = *(float*)&i;

But at least one extra Newton step is still needed after that in order to get reasonable precision.

Just as an example, this is the code generated by the Intel compiler (as we can see, it uses the RSQRTSS instruction automatically):

    $ cat rsqrt-test.c
    #include <math.h>

    float rsqrt(float f)
    {
        return 1.0f / sqrt(f);
    }
    $ icc -O3 -c rsqrt-test.c
    $ objdump -Mintel -d rsqrt-test.o

    0000000000000000 <rsqrt>:
       0: f3 0f 52 c8           rsqrtss xmm1,xmm0
       4: f3 0f 59 c1           mulss  xmm0,xmm1
       8: f3 0f 59 c1           mulss  xmm0,xmm1
       c: f3 0f 5c 05 00 00 00  subss  xmm0,DWORD PTR [rip+0x0]  # 14 <rsqrt+0x14>
      13: 00
      14: f3 0f 59 c8           mulss  xmm1,xmm0
      18: f3 0f 59 0d 00 00 00  mulss  xmm1,DWORD PTR [rip+0x0]  # 20 <rsqrt+0x20>
      1f: 00
      20: 0f 28 c1              movaps xmm0,xmm1
      23: c3                    ret

The ARM ISA is a bit nicer here, because it also has a dedicated instruction for the Newton step.

Oliver McFadden (2011-02-09 22:04)

I posted the whole results again, because it seems that enabling -msse (for the rsqrtss function) does affect the other functions slightly.
Or maybe Firefox was eating a bit less of my CPU this time (the PC was idle for the test, but with programs running).

So, to be fair, the complete results are posted above.

Oliver McFadden (2011-02-09 22:01)

Well, at first glance it does look like the SSE rsqrtss is clearly a lot faster... However, I'm suspicious enough to run it through the error-checking code. I believe rsqrtss doesn't handle zero values correctly, unlike the other functions; I would have to double-check this, though.

Timing Exact function
1678 ms used for 100000000 passes, avg 1.678e-05 ms
Timing Carmack function
454 ms used for 100000000 passes, avg 4.54e-06 ms
Timing Carmack function (strict-aliasing)
460 ms used for 100000000 passes, avg 4.6e-06 ms
Timing Lomont function
452 ms used for 100000000 passes, avg 4.52e-06 ms
Timing Lomont function (strict-aliasing)
451 ms used for 100000000 passes, avg 4.51e-06 ms
Timing rsqrtss function
290 ms used for 100000000 passes, avg 2.9e-06 ms

Checking errors. This may take a long time
Carmack error: max 1.56138e+16 rel max 0.00175228 avg 1.54397e-05
Lomont error : max 1.56046e+16 rel max 0.00175124 avg 1.54397e-05
rsqrtss error: max 9.22337e+18 rel max 1 avg 0.00790514

Oliver McFadden (2011-02-09 21:33)

@Serge: Yeah, I'm aware of rsqrtss for x86; I'm just not sure it's any faster than the Carmack/Lomont method.
I'll add a test case.

Thanks for the tip on ARM; I'll definitely give that a try and see how it does in ioquake3 on the N900, or if I'm lazy, just recompile the test code. :)

Oliver McFadden (2011-02-09 21:29)

Timing Exact float-to-int function
1252 ms used for 100000000 passes, avg 1.252e-05 ms
Timing Fast float-to-int function
336 ms used for 100000000 passes, avg 3.36e-06 ms

Don't use "d = (int) f" if you want fast code, either. fld and fistp work nicely on x86 and x86_64.

Unknown (2011-02-09 21:17)

For x86 (SSE), a fast inverse square-root approximation can be calculated using the RSQRTSS instruction.

For ARM (NEON), the same is achieved using the VRSQRTE instruction (initial approximation) followed by VRSQRTS instructions (Newton-Raphson iterations).