Why use bool32 in stead of bool?

Benjamin Kloster

#2875

February 28, 2015

Ok, so struct size optimization is one reason to use fixed-size bool. I guess the next question is: why bool32, when a bool8 would do an even better job at keeping data structures small? I remember Casey saying something about processors being "happier" with 32 bit data types, so I did a quick test:

    int32_t a = 0;
    uint8_t b8 = true;
    if (b8) {
        a = 1; 
    }
    uint32_t b32 = true;
    if (b32) {
        a = 2; 
    }
    bool b = true;
    if (b) {
        a = 3;
    }

Compiled without optimization, MSVC generates this assembly:

    int a = 0;
000007FEF9A92D2F  mov         dword ptr [a],0  
    uint8_t b8 = true;
000007FEF9A92D37  mov         byte ptr [b8],1  
    if (b8) {
000007FEF9A92D3C  movzx       eax,byte ptr [b8]  
000007FEF9A92D41  test        eax,eax  
000007FEF9A92D43  je          updateGame+45Dh (07FEF9A92D4Dh)  
        a = 1;
000007FEF9A92D45  mov         dword ptr [a],1  
    }
    uint32_t b32 = true;
000007FEF9A92D4D  mov         dword ptr [b32],1  
    if (b32) {
000007FEF9A92D55  cmp         dword ptr [b32],0  
000007FEF9A92D5A  je          updateGame+474h (07FEF9A92D64h)  
        a = 2;
000007FEF9A92D5C  mov         dword ptr [a],2  
    }
    bool b = true;
000007FEF9A92D64  mov         byte ptr [b],1  
    if (b) {
000007FEF9A92D69  movzx       eax,byte ptr [b]  
000007FEF9A92D6E  test        eax,eax  
000007FEF9A92D70  je          updateGame+48Ah (07FEF9A92D7Ah)  
        a = 3;
000007FEF9A92D72  mov         dword ptr [a],3  
    }

First observation: bool and uint8_t behave exactly the same (no surprises there). The difference between bool/uint8_t and uint32_t is in the if-condition:

// bool / uint8_t
    if (b) {
000007FEFA092D6D  movzx       eax,byte ptr [b]  
000007FEFA092D72  test        eax,eax  

// uint32_t
    if (b32) {
000007FEFA092D57  cmp         dword ptr [b32],0

So, 8 bit types need two instructions, compared to a single instruction for the 32 bit type. The Agner Fog tables for Haswell have the following to say about the used instructions:

[table]
[tr]
[td]Instruction[/td]
[td]Micro operations (fused domain)[/td]
[td]Reciprocal throughput[/td]
[/tr]
[tr]
[td]movzx r,m[/td]
[td]1[/td]
[td]0.5[/td]
[/tr]
[tr]
[td]test r,r[/td]
[td]1[/td]
[td]0.25[/td]
[/tr]
[tr]
[td]cmp m,i[/td]
[td]1[/td]
[td]0.5[/td]
[/tr]
[/table]

If I interpret those numbers right (and I might very well not, so correct me if I'm wrong), the 8 bit versions take 2 micro operations and 0.5 + 0.25 = 0.75 clock cycles, while the 32 bit version takes 1 micro operation and 0.5 clock cycles to check the condition.

It seems like bool32 eeks out a victory, performance wise. That may change in situations where using a bool8 instead of a bool32 in a struct brings the struct's size below the size of a cache line (or causes significantly more of them to fit into one cache line), but I don't know how to test this (yet).

Edited by Benjamin Kloster on February 28, 2015, 10:38am

popcorn

#2876

February 28, 2015

mmm assembly

I guess since eax register is 32 bit, it makes it easier for the compiler than a 8 bit.

I really love these low level stuff, it's always nice to see whats under the hood.

Sven

#2877

February 28, 2015

Nimbal

Compiled without optimization, MSVC generates this assembly:

To get meaningful results you really should compile with optimizations for such tests. The compiler might optimize your tests away then, but you can counter that by using function parameters for your test variables.

Benjamin Kloster

#2880

March 1, 2015

5sw

To get meaningful results you really should compile with optimizations for such tests.

To be fair, the only way we'd ever get "meaningful" results is measuring execution time, preferrably of a real-world application, not some artificial example.

I just tried to construct such an artificial example and build it with optimization. The compiler was pretty persistent in optimizing away any kind of comparable assembly instructions. It first inlined my function calls, then, after I disabled that particular optimization, it recognized that the two functions (for bool8 and bool32, respectively) basically did the same thing and just generated one single function. After further obfuscating, I found that now it uses a single jne instruction for the if-clause in both cases, but the calculation of the true / false values was a bit different (seemingly in favor of bool8, but I didn't sum up the clock cycles). I don't see a point in sharing the code here, as it's, in my eyes, even less meaningful than comparing unoptimized assembly due to all the obfuscations necessary to wrangle the compiler into giving us something we can compare.

The bottom line is, bool8 vs bool32 in an optimized build seems to be highly dependent on the context (not very surprising) and probably even more dependent on the platform you are targeting. The C standard doesn't fix sizeof(bool) to be 1 to give implementers the option to choose the most optimal size for their platform. In a way, it's counter productive to force the size of a boolean if you want to port your application across many architectures and maintain comparable performance on all of them.

In any case, I hope everyone here is aware that choosing between bool, bool8 or bool32 to minimize execution time is usually little more than a moot micro-optimization. Even in a debug build without any optimizations enabled, the difference of 0.25 clock cycles per boolean check means that you'd have to do millions upon millions of such checks per frame for it to ever be noticeable. At that point, the size of a boolean is probably the least of your problems.

Casey Muratori

#2881

March 1, 2015

For the record, yes, I usually have bool8, bool16, bool32, bool64. This has nothing to do with anything other than the fact that I like to specify sizes, and I find it nice to have "bool" instead of "uint" or "int" so it reminds me that it is not supposed to be used for anything other than zero / not zero.

Performance wise, it's really not something that you're going to see show up as for as CPU operations are concerned, I wouldn't think. Often times I will do it for much more straightforward reasons such as:

struct bad
{
    bool Foo;
    bool Bar;
};

struct good
{
    bool8 Foo;
    bool8 Bar;
};

where "bad" can't be reliably serialized as a block write, but "good" can be. Similarly:

struct bad
{
    bool8 Bar;
    bool8 Bar;
    bool8 Bar;
    bool8 Bar;
    uint32 Foo;
};

struct good
{
    bool32 Bar;
    bool32 Bar;
    bool32 Bar;
    bool32 Bar;
    uint32 Foo;
};

where "bad" can't be endian-reversed as a block of 32's, but "good" can, just to pick a random example.

Again, this is not about some kind of expected general performance gain, and more about me wanting to know how big things are and where they are going. That's just generally something I find I like to know at all times about all my code. Performance for things like this is not something you can really expect to have generally, you have to start talking about specific circumstances, and usually it's a waste of time to think about bools and abstract performance anyway because if you really cared you'd probably be bit packing or using wide ops, etc.

- Casey