Showing posts with label Analysis. Show all posts

Sunday, February 20, 2011

New home for Revenge (Radeon Reverse Engineering Tool)

Thanks to Marek Olšák for keeping a backup copy of my Git repository online! The hard drive containing much of the personal code I had on people.freedesktop.org (until those directories were lost) is halfway around the world.

Revenge now has a new home on http://gitorious.org/omcfadde/revenge.

I have bumped the version to 2.0.0, which introduces some minor configure.ac fixes: mostly PKG_CHECK_MODULES for libpci, sdl, and zlib. I have also updated email addresses and revenge.sh for non-developers.

Honestly, I do not expect this code to get much interest now that we have documentation from AMD, but it's useful for historical/nostalgic reasons.

If I were to do it over again today, I would start with the kernel MMIO tracer (which would deal with the fglrx kernel module), then extend this to handle dumping MMIO access from a userspace process too. The kernel is the perfect place to do so, and would be far more reliable than userspace.

If you have any questions or bug reports, feel free to ask them here and I will try to provide you with timely answers/fixes.

Saturday, February 19, 2011

Math function micro-optimization...

Preamble for planet.freedesktop.org

Sorry about the poor formatting on planet.freedesktop.org; it seems it and BlogSpot don't quite get along, so you won't see any color highlights. It looks much better (and is easier to read) on my actual blog page, honest!

Updated version includes float-to-int optimization and comments; sorry if this bumps this rather long post to the top again; this is not my intention. planet.freedesktop.org admins: is there some way to disable bumping when a post is updated? (Perhaps selectively, in case the bump is important. e.g. updated dates for an event.)


This analysis was performed using a modified version of Chris Lomont's inverse square-root testing code. The accompanying publication is worth reading before looking at any of this data.

I've started looking into whether there would be any performance difference in a few optimized math functions should -fstrict-aliasing be enabled. I did not believe strict-aliasing would have much of an effect on these optimized functions (and it turns out I was correct) but the benefit is seen when compiling other code which includes these inline functions.

Without strict-aliasing compatibility, including the header file containing the incompatible functions/macros taints the entire file, meaning you cannot use -fstrict-aliasing where it may be helpful for your general code.
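To make the tainting problem concrete, here is a minimal sketch, assuming 32-bit IEEE 754 floats. The name float_bits is made up for illustration and is not taken from the test program. The commented-out line is the classic pointer-cast idiom that breaks the aliasing rules; the memcpy form expresses the same reinterpretation in a way the compiler is allowed to reason about.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * Assumes float is a 32-bit IEEE 754 value.
 *
 * The aliasing-unsafe idiom often found in older headers looks like:
 *
 *     return *(uint32_t *)&f;    // reads a float object as uint32_t
 *
 * Any file including such a header has to be built without
 * -fstrict-aliasing to be safe. Copying the object representation
 * instead is well defined. */
static inline uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* copy the 4 bytes of f into u */
    return u;
}
```

For example, float_bits(1.0f) yields 0x3F800000. Compilers generally turn a fixed-size memcpy like this into a plain register move, so the safe form should not cost anything extra, although that is worth confirming in the generated assembly for your compiler.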



Here are the results for the standard 1.0 / sqrt(x) frequently used in graphics engines. Even though today's renderers typically use carefully crafted SIMD functions for the critical path, this is still useful for quickly normalizing vectors in game code, etc.

The Lomont version of the function is a tiny bit faster and a tiny bit more accurate, but nothing to write home about.

Clearly this micro-optimization is an excellent win on x86 and x86_64: roughly 3.8 times faster than the exact version in the timings below. Don't try it on ARM; it's far slower than just taking the hit on 1.0 / sqrt(x).

I don't know whether this optimization could be modified for ARM; any assembly experts out there?
Timing Exact function
1752 ms used for 100000000 passes, avg 1.752e-05 ms
Timing Carmack function
463 ms used for 100000000 passes, avg 4.63e-06 ms
Timing Carmack function (strict-aliasing)
455 ms used for 100000000 passes, avg 4.55e-06 ms
Timing Lomont function
453 ms used for 100000000 passes, avg 4.53e-06 ms
Timing Lomont function (strict-aliasing)
455 ms used for 100000000 passes, avg 4.55e-06 ms
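For readers who have not seen it, the function being timed above has roughly the following shape. This is a sketch of the widely published form, not the exact code from the test program: the name fast_rsqrt and the magic parameter are illustrative, 32-bit IEEE 754 floats and x > 0 are assumed, and memcpy is used so the function stays valid under -fstrict-aliasing.

```c
#include <stdint.h>
#include <string.h>

/* Initial-guess constants: the one found in the Quake 3 Arena source and
 * the one proposed in Chris Lomont's paper. The macro names are made up. */
#define RSQRT_MAGIC_Q3A    0x5f3759dfu
#define RSQRT_MAGIC_LOMONT 0x5f375a86u

/* Approximates 1.0f / sqrtf(x) for x > 0 (illustrative sketch). */
static inline float fast_rsqrt(float x, uint32_t magic)
{
    const float half_x = 0.5f * x;
    uint32_t i;
    float y;

    memcpy(&i, &x, sizeof i);    /* view the float's bits as an integer */
    i = magic - (i >> 1);        /* cheap initial guess for 1/sqrt(x) */
    memcpy(&y, &i, sizeof y);    /* back to float */

    return y * (1.5f - half_x * y * y);   /* one Newton-Raphson step */
}
```

With a single Newton-Raphson step the result is only an approximation: fast_rsqrt(4.0f, RSQRT_MAGIC_Q3A) lands close to, but not exactly on, 0.5. That is usually good enough for normalizing vectors in game code.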

The absolute value function is mostly used for comparisons (e.g. fabs(y - x) > epsilon) and some other specialized functions: finding on which side of a plane an AABB resides, its distance from said plane, the AABB radius, etc. Therefore it's useful to optimize this function where possible...
Timing Exact fabsf function
268 ms used for 100000000 passes, avg 2.68e-06 ms
Timing Bit-Masking fabsf function
304 ms used for 100000000 passes, avg 3.04e-06 ms
Timing Bit-Masking fabsf function (strict-aliasing)
305 ms used for 100000000 passes, avg 3.05e-06 ms
However, apparently it's quite a bit faster to just call libc's fabsf function! I saw this originally in the Quake 3 Arena source code, so maybe things were different with the compilers and hardware of the time.
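The bit-masking variant simply clears the sign bit of the float. A minimal aliasing-safe sketch, assuming 32-bit IEEE 754 floats (fabsf_bitmask is an illustrative name, not the one used in the test program):

```c
#include <stdint.h>
#include <string.h>

/* Absolute value by clearing the IEEE 754 sign bit (illustrative sketch). */
static inline float fabsf_bitmask(float f)
{
    uint32_t u;

    memcpy(&u, &f, sizeof u);
    u &= 0x7FFFFFFFu;            /* bit 31 is the sign; keep the rest */
    memcpy(&f, &u, sizeof f);
    return f;
}
```

One plausible explanation for the timings above, which I have not verified against the generated assembly, is that the compiler already expands fabsf inline into a single masking instruction, so the hand-written version has nothing left to win.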

These macros/functions are used when you want to know the sign of a float (i.e. whether the value is positive or negative) without performing any comparison, for performance reasons. It seems that the strict-aliasing versions perform about identically to the macros.
Timing Exact float sign bit not set function
327 ms used for 100000000 passes, avg 3.27e-06 ms
Timing FLOATSIGNBITNOTSET macro
313 ms used for 100000000 passes, avg 3.13e-06 ms
Timing Bit-Masking float sign bit not set function (strict-aliasing)
312 ms used for 100000000 passes, avg 3.12e-06 ms

Timing Exact float sign bit set function
342 ms used for 100000000 passes, avg 3.42e-06 ms
Timing FLOATSIGNBITSET macro
305 ms used for 100000000 passes, avg 3.05e-06 ms
Timing Bit-Masking float sign bit set function (strict-aliasing)
305 ms used for 100000000 passes, avg 3.05e-06 ms
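The strict-aliasing-safe function versions can be sketched as below. All of the names here are illustrative, and 32-bit IEEE 754 floats are assumed; the original macros are not reproduced.

```c
#include <stdint.h>
#include <string.h>

/* Returns the raw 32-bit pattern of f (hypothetical helper). */
static inline uint32_t sign_word(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* 1 if the sign bit is set, 0 otherwise; no floating-point compare. */
static inline int float_sign_bit_set(float f)
{
    return (int)(sign_word(f) >> 31);
}

/* 1 if the sign bit is clear, 0 otherwise. */
static inline int float_sign_bit_not_set(float f)
{
    return (int)((uint32_t)~sign_word(f) >> 31);
}
```

Note that this tests the sign bit, not the value: float_sign_bit_set(-0.0f) returns 1 even though -0.0f < 0.0f is false, so the result can differ from a comparison for negative zero.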

Don't use "d = (int) f" if you want fast code. fld and fistp work nicely on x86 and x86_64.
Timing Exact float-to-int function
1252 ms used for 100000000 passes, avg 1.252e-05 ms
Timing Fast float-to-int function
336 ms used for 100000000 passes, avg 3.36e-06 ms

Done. By Chris Lomont 2003. Modified by Oliver McFadden 2011
These measurements were taken on my laptop with an Intel(R) Core(TM)2 Duo CPU P9500 @ 2.53GHz processor, and the test program was compiled with gcc version 4.4.5 (Debian 4.4.5-6).



Whether this makes any huge difference in frames-per-second is debatable; really, I had a bit of time and was bored. :-) Anyway, I wouldn't say anything until testing under real-world conditions.

It does look like the bit-masking fabs can be thrown away, though, and the fast float-to-int is a major win (although beware of possible rounding differences).
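To show what the rounding difference looks like, here is a small sketch. It does not use fld/fistp; it swaps in a different, portable trick (adding 1.5 * 2^52 to a double and reading the low 32 bits) that, like fistp in the default rounding mode, rounds to nearest instead of truncating. The function name is made up, and the sketch assumes IEEE 754 doubles, the default rounding mode, double arithmetic evaluated in double precision (e.g. SSE2 on x86_64), and |f| below 2^31. No claim is made here about its speed.

```c
#include <stdint.h>
#include <string.h>

/* Round-to-nearest float-to-int via the "magic number" trick.
 * Illustrative only; see the assumptions listed in the text. */
static inline int32_t round_float_to_int(float f)
{
    /* After adding 1.5 * 2^52 the value has no fractional bits left,
     * so the low mantissa bits hold the rounded integer. */
    double d = (double)f + 6755399441055744.0;
    uint64_t bits;
    uint32_t lo;
    int32_t result;

    memcpy(&bits, &d, sizeof bits);
    lo = (uint32_t)bits;                  /* low 32 bits, two's complement */
    memcpy(&result, &lo, sizeof result);
    return result;
}
```

A plain cast truncates toward zero, so (int)3.5f is 3 and (int)-1.7f is -1, while round_float_to_int gives 4 and -2 for the same inputs. Ties go to the even neighbour, so 2.5f becomes 2. Code that relies on truncation (array indexing, for instance) needs to account for this before switching.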

Thursday, December 24, 2009

Holiday boredom...

I don't have any plans for the Christmas holidays, and Helsinki is at just the right temperature to make doing anything outside really difficult. Some of the snow has melted and frozen into ice, making it feel a lot like walking on broken glass: physically and emotionally; I stopped counting the number of unintentional acrobatics after about the 5th or 6th time sliding in some way.

I might implement zlib compression for network traffic in my engine, which should only be a few dozen lines... or something a bit more interesting; stencil shadow volumes could still use more optimization, or some SIMD stuff...

There are at least a few dozen minor X server issues discovered by Coverity, but that's getting dangerously close to Nokia work over the holidays. The majority of the issues are error paths anyway, nothing really critical.

In the "being lazy and doing nothing" department, Avatar looks like it could be an interesting movie...

Monday, December 21, 2009

Static analysis For The Win!


I caught on to clang, which is both a compiler and a static analyzer, after it was mentioned on the xorg-devel list. Since then I've been running it on my own code and gradually fixing up errors.

Now, usually such tools are less than impressive (excluding extremely expensive proprietary tools such as Coverity). However, clang has been really useful in eliminating a lot of dead code and unused variables (and the overhead of calling expensive functions to calculate those variables).

I even discovered one major bug in the expression evaluator in my game engine which had gone overlooked for some time; it caused two opcodes to give incorrect results, ultimately affecting the rendering of the shaders.

Clang is currently lacking Inter-Procedural Analysis, but it seems like it's on the to-do shortlist, which will be a very nice feature. Hopefully it won't be long before Coverity's eating its heart out.