Comments on Oliver's Blog: Math function micro-optimization (newest first)

Oliver McFadden (2011-02-10 14:14)

Yes, you're right; rsqrtss with a single Newton-Raphson iteration does seem to be the superior method here. Thanks for pointing it out; I had somehow missed it.

I'll update my math library to prefer it where __SSE__ is defined (i.e. -msse passed to the compiler).

Unknown (2011-02-10 10:59)

I think 2.0068e-07 is actually smaller than 0.00175124.

Anyway, that's interesting stuff. Thanks for bringing it up.

Oliver McFadden (2011-02-09 23:07)

Oh, yes.
You're absolutely right; here are the new results with just one Newton-Raphson iteration (the same as the other functions):

Timing Exact function
1708 ms used for 100000000 passes, avg 1.708e-05 ms
Timing Carmack function
454 ms used for 100000000 passes, avg 4.54e-06 ms
Timing Carmack function (strict-aliasing)
465 ms used for 100000000 passes, avg 4.65e-06 ms
Timing Lomont function
457 ms used for 100000000 passes, avg 4.57e-06 ms
Timing Lomont function (strict-aliasing)
454 ms used for 100000000 passes, avg 4.54e-06 ms
Timing rsqrtss function
449 ms used for 100000000 passes, avg 4.49e-06 ms

Yes, it's slightly faster, but even with the Newton-Raphson iteration added we still have a higher relative error than the Lomont/Carmack functions:

Checking errors. This may take a long time
Carmack error : max 1.56138e+16 rel max 0.00175228 avg 1.54397e-05
Lomont error  : max 1.56046e+16 rel max 0.00175124 avg 1.54397e-05
rsqrtss error : max 1.48724e+12 rel max 2.0068e-07 avg 1.88473e-09

Of course, you could do another Newton-Raphson iteration (or even more), but we're not talking much difference at all between the rsqrtss and Lomont methods. Good discussion, though!
:-)

I definitely need to look into ARM assembly further.

Unknown (2011-02-09 22:29)

RSQRTSS is basically only a replacement for the initial-approximation part:

    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1);
    x = *(float*)&i;

But at least one extra Newton step is still needed after that in order to get reasonable precision.

Just as an example, this is the code generated by the Intel compiler (as we can see, it uses the RSQRTSS instruction automatically):

    $ cat rsqrt-test.c
    #include <math.h>

    float rsqrt(float f)
    {
        return 1.0f / sqrt(f);
    }
    $ icc -O3 -c rsqrt-test.c
    $ objdump -Mintel -d rsqrt-test.o

    0000000000000000 <rsqrt>:
       0: f3 0f 52 c8           rsqrtss xmm1,xmm0
       4: f3 0f 59 c1           mulss  xmm0,xmm1
       8: f3 0f 59 c1           mulss  xmm0,xmm1
       c: f3 0f 5c 05 00 00 00  subss  xmm0,DWORD PTR [rip+0x0]  # 14 <rsqrt+0x14>
      13: 00
      14: f3 0f 59 c8           mulss  xmm1,xmm0
      18: f3 0f 59 0d 00 00 00  mulss  xmm1,DWORD PTR [rip+0x0]  # 20 <rsqrt+0x20>
      1f: 00
      20: 0f 28 c1              movaps xmm0,xmm1
      23: c3                    ret

The ARM ISA is a bit nicer here, because it also has a dedicated instruction for the Newton step.

Oliver McFadden (2011-02-09 22:04)

I posted the whole results again, because it seems that enabling -msse (for the rsqrtss function) does affect the other functions slightly.
Or maybe Firefox was eating a bit less of my CPU this time (the PC was idle for the test, but with programs running).

So, to be fair, the complete results are posted above.

Oliver McFadden (2011-02-09 22:01)

Well, at first glance it does look like the SSE rsqrtss is clearly a lot faster... However, I'm suspicious enough to run it through the error-checking code. I believe rsqrtss doesn't handle zero values correctly, unlike the other functions; I would have to double-check this, though.

Timing Exact function
1678 ms used for 100000000 passes, avg 1.678e-05 ms
Timing Carmack function
454 ms used for 100000000 passes, avg 4.54e-06 ms
Timing Carmack function (strict-aliasing)
460 ms used for 100000000 passes, avg 4.6e-06 ms
Timing Lomont function
452 ms used for 100000000 passes, avg 4.52e-06 ms
Timing Lomont function (strict-aliasing)
451 ms used for 100000000 passes, avg 4.51e-06 ms
Timing rsqrtss function
290 ms used for 100000000 passes, avg 2.9e-06 ms

Checking errors. This may take a long time
Carmack error: max 1.56138e+16 rel max 0.00175228 avg 1.54397e-05
Lomont error : max 1.56046e+16 rel max 0.00175124 avg 1.54397e-05
rsqrtss error: max 9.22337e+18 rel max 1 avg 0.00790514

Oliver McFadden (2011-02-09 21:33)

@Serge: Yeah, I'm aware of rsqrtss for x86; I'm just not sure it's any faster than the Carmack/Lomont method.
I'll add a test case.

Thanks for the tip on ARM; I'll definitely give that a try and see how it does in ioquake3 on the N900, or if I'm lazy, just recompile the test code. :)

Oliver McFadden (2011-02-09 21:29)

Timing Exact float-to-int function
1252 ms used for 100000000 passes, avg 1.252e-05 ms
Timing Fast float-to-int function
336 ms used for 100000000 passes, avg 3.36e-06 ms

Don't use "d = (int) f" if you want fast code, either. fld and fistp work nicely on x86 and x86_64.

Unknown (2011-02-09 21:17)

For x86 (SSE), a fast inverse square-root approximation can be calculated using the RSQRTSS instruction.

For ARM (NEON), the same is achieved using the VRSQRTE instruction (initial approximation) followed by VRSQRTS instructions (Newton-Raphson iterations).