Ok, so struct size optimization is one reason to use a fixed-size bool. I guess the next question is: why bool32, when a bool8 would do an even better job of keeping data structures small? I remember Casey saying something about processors being "happier" with 32-bit data types, so I did a quick test:
    int32_t a = 0;
    uint8_t b8 = true;
    if (b8) {
        a = 1; 
    }
    uint32_t b32 = true;
    if (b32) {
        a = 2; 
    }
    bool b = true;
    if (b) {
        a = 3;
    }
 
 
Compiled without optimization, MSVC generates this assembly:
    int a = 0;
000007FEF9A92D2F  mov         dword ptr [a],0  
    uint8_t b8 = true;
000007FEF9A92D37  mov         byte ptr [b8],1  
    if (b8) {
000007FEF9A92D3C  movzx       eax,byte ptr [b8]  
000007FEF9A92D41  test        eax,eax  
000007FEF9A92D43  je          updateGame+45Dh (07FEF9A92D4Dh)  
        a = 1;
000007FEF9A92D45  mov         dword ptr [a],1  
    }
    uint32_t b32 = true;
000007FEF9A92D4D  mov         dword ptr [b32],1  
    if (b32) {
000007FEF9A92D55  cmp         dword ptr [b32],0  
000007FEF9A92D5A  je          updateGame+474h (07FEF9A92D64h)  
        a = 2;
000007FEF9A92D5C  mov         dword ptr [a],2  
    }
    bool b = true;
000007FEF9A92D64  mov         byte ptr [b],1  
    if (b) {
000007FEF9A92D69  movzx       eax,byte ptr [b]  
000007FEF9A92D6E  test        eax,eax  
000007FEF9A92D70  je          updateGame+48Ah (07FEF9A92D7Ah)  
        a = 3;
000007FEF9A92D72  mov         dword ptr [a],3  
    }
 
 
First observation: bool and uint8_t behave exactly the same (no surprises there). The difference between bool/uint8_t and uint32_t is in the if-condition:
// bool / uint8_t
    if (b) {
000007FEFA092D6D  movzx       eax,byte ptr [b]  
000007FEFA092D72  test        eax,eax  
// uint32_t
    if (b32) {
000007FEFA092D57  cmp         dword ptr [b32],0  
 
 
So the 8-bit types need two instructions for the check, compared to a single instruction for the 32-bit type. Agner Fog's instruction tables list the following for these instructions on Haswell:
[table]
  [tr]
    [td]Instruction[/td]
    [td]Micro operations (fused domain)[/td]
    [td]Reciprocal throughput[/td]
  [/tr]
  [tr]
    [td]movzx r,m[/td]
    [td]1[/td]
    [td]0.5[/td]
  [/tr]
  [tr]
    [td]test r,r[/td]
    [td]1[/td]
    [td]0.25[/td]
  [/tr]
  [tr]
    [td]cmp m,i[/td]
    [td]1[/td]
    [td]0.5[/td]
  [/tr]
[/table]
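If I interpret those numbers right (and I might very well not, so correct me if I'm wrong), the 8-bit versions take 2 micro-operations and 0.5 + 0.25 = 0.75 clock cycles, while the 32-bit version takes 1 micro-operation and 0.5 clock cycles to check the condition. One caveat: reciprocal throughput is the average cost per instruction when many independent instructions of the same kind run back to back, not a latency, so adding the values gives only a rough estimate.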
It seems like bool32 ekes out a victory, performance-wise. That may change in situations where using a bool8 instead of a bool32 in a struct brings the struct's size below the size of a cache line (or causes significantly more of them to fit into one cache line), but I don't know how to test this (yet).
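To make the size side of that trade-off concrete, here is a minimal sketch. It does not measure anything; it only shows how the flag type changes how many structs fit into one cache line. The struct names, their fields, the `perCacheLine` helper and the 64-byte cache line are all assumptions for illustration, not taken from any real codebase.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size bool typedefs, for illustration only.
typedef uint8_t  bool8;
typedef uint32_t bool32;

// Made-up entity with four flags stored as 32-bit values.
// With 4-byte floats this is 2*4 + 4*4 = 24 bytes on typical x86-64 ABIs.
struct EntityBool32 {
    float  x, y;
    bool32 isAlive;
    bool32 isVisible;
    bool32 isSolid;
    bool32 isSelected;
};

// Same entity with 8-bit flags: 2*4 + 4*1 = 12 bytes on typical x86-64 ABIs.
struct EntityBool8 {
    float x, y;
    bool8 isAlive;
    bool8 isVisible;
    bool8 isSolid;
    bool8 isSelected;
};

// Assumed cache line size; 64 bytes is common on x86-64 but not guaranteed.
const std::size_t kAssumedCacheLineSize = 64;

// Number of whole structs of the given size that fit into one cache line.
inline std::size_t perCacheLine(std::size_t structSize) {
    assert(structSize > 0);
    return kAssumedCacheLineSize / structSize;
}
```

Under those assumptions, `perCacheLine(sizeof(EntityBool32))` gives 2 and `perCacheLine(sizeof(EntityBool8))` gives 5, so a loop over an array of these entities would touch fewer cache lines with the 8-bit flags. Whether that outweighs the extra instruction per check is exactly the part that would still need measuring.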