Too lazy to check, but I think this is the thread where I railed on gcc.
Was geeking out tonight and found some humorous quotes from the main developer of the x264 encoder pertaining to gcc.
<Dark_Shikari> are there intrinsics one can give gcc to cause it not to do retarded things?
<pengvado> __attribute__((no_******))
<pengvado> I got -fwhole-program working, but it doesn't give any measurable speedup
<saintdev> do you have -fomg-fast-speed working yet?
<pengvado> -fomit-instructions?
<pengvado> gcc fails to optimize it because gcc has always sucked at arrays
<pengvado> that's *the* benefit of fortran
<pengvado> will eventually need a struct mv_t to simplify int16_t[2] manipulations
<pengvado> hmm, no it fails
<pengvado> simple assignment doesn't work either, that also compiles to assignment of member fields
<pengvado> ok, gcc sucks at structs just as much as it sucks at arrays. scratch that idea.
<pengvado> all the standard tools suck. I need to fix copy+paste in uuterm so I can switch from xterm, and I need to fix wildcards in psh so I can switch from bash, and I need to write a C compiler so I can switch from gcc ...
<Dark_Shikari> so why am I losing so much speed?
<Dark_Shikari> wouldn't it compile to the exact same thing?!
<Dark_Shikari> given that its inlined
<pengvado> gcc sucks?
<pengvado> ICE not crash. that is, it reports an error in what really shouldn't be an error condition, then refuses to compile your code.
<pengvado> this case is slightly believable, as -freorder-blocks-and-partition and -fprofile-use both modify sections (hot vs cold functions), and neither is implied by any of the -O settings so it's quite possible that they were never tested together
<pengvado> oops, this doesn't depend on -Os. just those 2
<pengvado> oh well. typical oss courtesy says I should file a bugreport, but I'm really not interested in dealing with gcc people, so I'll just file it under "don't do that"
<Dark_Shikari> WHAT THE HELL I change the flat to zeroes and it segfaults on startup on my machine?!?!!
<Dark_Shikari> In fact I've found this everywhere now, anywhere I use a static array it crashes.
<pengvado> x264 must have been added to the gcc testsuite, so they have a pristine copy to compare against and know when to crash
<Dark_Shikari> did you hear they're adding the ability to inline function pointers in gcc?
<pengvado> of course. because "interesting" optimizations are preferred over optimizations that make programs fast
<Dark_Shikari> is there a measurable cost, in general to always-correctly-predicted branches in inner loops?
<pengvado> depends which side of the loop is inlined
<pengvado> iirc k8 has a minimum cost of 2 cycles for a jump of any kind but not for a non-taken branch
<Dark_Shikari> so in profiled mode, it'll inline the non-lossless. in non-profiled, how do you know which side it will?
<pengvado> flip a coin?
<pengvado> as you noticed, gcc doesn't store array elements in registers
<Dark_Shikari> even extremely simple arrays, like x[2]?
<pengvado> oh, gcc does, but only when it doesn't help
<Dark_Shikari> Is that like Murphy's Law of gcc--it only does useful things when they aren't useful?
<pengvado> e.g. struct mv { int16_t x[2]; } does store x[2] in registers, thus preventing write combining
<Dark_Shikari> is there any good reason why the makefile has -O4 in it?
<pengvado> because 4 is bigger than 3, duh
<Dark_Shikari> it's much slower on gcc 3.4
<Dark_Shikari> 100 -> 130 cycles
<pengvado> anything obvious in the asm?
<pengvado> because all this is really just rerolling the gcc random code generator
<Dark_Shikari> What happened there is exactly equivalent to the following:
<Dark_Shikari> last_nonb = i; i--; cur_nonb = i;
<Dark_Shikari> assert( last_nonb != cur_nonb );
<Dark_Shikari> That assert failed. This is, of course, completely impossible.
<pengvado> We're talking gcc here. It does the impossible every morning before breakfast.