Word Alignment and Variable Size Performance Implications

Draos

#16108

August 28, 2018

Hi, so very often in the episodes, alignment on a 32-bit boundary is brought up, because the CPU is, apparently, more efficient when dealing with 32 bits. I did some research and saw that CPUs are faster at dealing with data aligned to their word size, since that is how many bytes they can read at one time (and is also the register size). However, if that is the case, why are we aligning to 32 bits instead of 64, given most CPUs nowadays are 64-bit?

Also, why do we not use int64 instead of int32, as a standard, if CPUs are meant to be faster at dealing with values aligned to their native word sizes? I guess what I don't get is why int32 is considered fast on x64, but int16 and int8 are slow. If the math needs to be done in 64-bit registers, shouldn't int32 be just as slow? Even if int32 has a faster path for whatever reason, shouldn't int64 still theoretically be the fastest?

EDIT: Apparently, any type smaller than 32 bits is implicitly promoted to int for arithmetic operations, so I guess it makes sense why those are slower. I still don't really get why we strive for 32-bit alignment and use 32-bit variables over 64-bit ones, though.

Edited by Draos on August 28, 2018, 7:15am

Mārtiņš Možeiko

#16109

August 28, 2018

If we are talking only about Intel architecture, then even in 64-bit mode it handles 32-bit operations in a special way: they have a reasonably efficient instruction encoding, and a 32-bit operation does not have to worry about the upper 32 bits of the 64-bit register. For example, a 32-bit add in a 64-bit register leaves the upper half as 0, so zero-extending the result to a 64-bit integer is "free", with no need to mask the upper bits. These features were probably added to the instruction set and architecture specifically for efficiency and compatibility with older 32-bit software.
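A minimal C sketch of that point. The function name is made up for illustration. On x86-64 a compiler will typically turn this into a plain 32-bit add with no extra masking step, because writing a 32-bit register already clears the upper half of the 64-bit register. The exact output depends on the compiler and flags.

```c
#include <stdint.h>

/* Hypothetical helper: add two 32-bit values, then widen to 64 bits.
   The addition wraps modulo 2^32 before the conversion happens. */
uint64_t add32_then_widen(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;   /* 32-bit add; wraps around on overflow */
    return (uint64_t)sum;   /* zero-extension: upper 32 bits are 0 */
}
```

For example, add32_then_widen(0xFFFFFFFFu, 1u) returns 0, not 0x100000000, because the carry out of bit 31 is dropped by the 32-bit add.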

Also, why do we not use int64 instead of int32, as a standard, if CPUs are meant to be faster at dealing with values aligned to their native word sizes?

This is not quite true. Let's take a Skylake CPU. According to Agner Fog's instruction tables, 32-bit integer division has a reciprocal throughput of 6 cycles, but for 64-bit division it is 21-83 cycles. That is quite a difference: div is a more complex instruction, and more bits means more processing. Latency also differs between the two.

Even a simple multiplication with a register operand costs 2 vs 1 in throughput for 32-bit vs 64-bit operations.

int64 also costs more in storage. That means less data fits into the cache, which means lower performance because the CPU has to wait more often for data to be loaded from or stored to memory.

And then you get into SIMD registers. With AVX code, which has 256-bit registers, you can process 8 int32 values at the same time, but only 4 int64 values. That can mean your code runs around 2x slower when processing larger amounts of data.
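The storage and SIMD arguments come down to the same arithmetic: how many elements fit in a fixed-size block. A small sketch, where the helper name and the 64-byte cache line size are assumptions (64 bytes is typical for current x86 CPUs but not guaranteed):

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes in bytes; the cache line size varies between CPUs. */
#define CACHE_LINE_BYTES   64
#define AVX_REGISTER_BYTES 32   /* 256 bits */

/* Hypothetical helper: how many elements of elem_size bytes
   fit into a block of block_bytes bytes. */
size_t elems_per_block(size_t block_bytes, size_t elem_size)
{
    return block_bytes / elem_size;
}
```

With these numbers, elems_per_block(AVX_REGISTER_BYTES, sizeof(int32_t)) is 8 and the same call with sizeof(int64_t) is 4. For a cache line it is 16 vs 8.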

Edited by Mārtiņš Možeiko on August 28, 2018, 7:38am

Draos

#16110

August 28, 2018

I see. Everything you said makes a lot of sense, but then why are the smaller integer data types considered slow in comparison? Is it just a matter of Intel optimizing 32-bit and 64-bit because those are the more common operations? Like, there are special paths for 32-bit and 64-bit, but 8-bit and 16-bit need to actually run as a 32-bit operation?

Mārtiņš Možeiko

#16111

August 28, 2018

I'm not sure who considers smaller integer types slower, or where and why, but my gut feeling is that this has more to do with C semantics than with actual x86 operations. As you said at the beginning, operands of types smaller than int are promoted to int for the operation, and the result is converted back only if it is stored in a smaller type. That means the compiler often needs to generate more code for the operation to be correct. Instead of doing an 8- or 16-bit operation directly, it has to generate code that loads the values into 32-bit registers, performs the operation, and later stores the result back to the 8/16-bit location.
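The promotion rule can be seen directly in C. This is only a sketch, and the function names are made up for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* The expression a + b has type int when both operands are uint8_t,
   so its size is sizeof(int), not sizeof(uint8_t). */
size_t promoted_sum_size(void)
{
    uint8_t a = 200, b = 100;
    return sizeof(a + b);
}

/* The sum is computed in int, so nothing is lost: 200 + 100 gives 300. */
int sum_as_int(uint8_t a, uint8_t b)
{
    return a + b;
}

/* Converting back to 8 bits keeps only the low byte: 300 becomes 44. */
uint8_t sum_as_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}
```

The conversion back to the small type happens only in sum_as_u8, where the result is stored in an 8-bit value.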

Here is a simple example: https://godbolt.org/z/h91xQr
See how the f16 function has more instructions than f32? It still uses the same instruction for the 32-bit division, "idiv esi", but it needs more code to load the 16-bit values into 32-bit registers so that the C semantics are correct.
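The code behind the link is not reproduced in this thread. As a guess at its shape, a pair of functions like the following shows the same kind of difference. Only the names f16 and f32 and the 32-bit idiv come from the post; the function bodies are an assumption.

```c
#include <stdint.h>

/* 16-bit operands: both are promoted to int, divided as 32-bit values,
   and the quotient is converted back to 16 bits on return. */
int16_t f16(int16_t a, int16_t b)
{
    return (int16_t)(a / b);
}

/* 32-bit operands: the division happens directly in the operand type. */
int32_t f32(int32_t a, int32_t b)
{
    return a / b;
}
```

Both functions return the same quotient for values that fit in 16 bits. The difference is only in the extra sign-extension code the compiler may need around the division in f16.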

This depends heavily on how smart the compiler is and how easy your code is to optimize. Often the compiler will skip this kind of manipulation and simply operate directly on 32-bit values if it can prove that the semantics of your code do not change.
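A sketch of when that proof is possible. The function names are hypothetical. In the first function the low 8 bits of the result depend only on the low 8 bits of the inputs, so whatever sits in the upper register bits cannot change the answer. In the second, bit 8 of the intermediate sum ends up in the result, so the compiler has to respect the promotion to int.

```c
#include <stdint.h>

/* Only the low 8 bits of the sum survive the conversion,
   so the width used for the add does not affect the result. */
uint8_t wrap_add_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* The shift moves bit 8 of the int sum into the result,
   so the add must really be done in more than 8 bits. */
uint8_t avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b) >> 1);
}

/* What avg_u8 would give if the sum were truncated to 8 bits first. */
uint8_t avg_u8_truncated(uint8_t a, uint8_t b)
{
    uint8_t sum = (uint8_t)(a + b);
    return (uint8_t)(sum >> 1);
}
```

For inputs 200 and 100, avg_u8 returns 150 while avg_u8_truncated returns 22, so the two are not interchangeable.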

And again, if we look at SIMD registers and your algorithm works well with 8-bit values, then you can process the data with 4x fewer operations than if you were operating on 32-bit values. Very often 4x fewer operations means 4x faster code.
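As an illustration, here is a loop over 8-bit data of the kind this applies to. The function names and the brightening example are assumptions, not something from the episodes. With 256-bit registers a vectorized version could handle 32 of these bytes per operation, compared with 8 elements if the data were stored as int32. Whether the compiler actually vectorizes it depends on the compiler and flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Add two bytes and clamp the result to 255 instead of wrapping. */
uint8_t saturating_add_u8(uint8_t a, uint8_t b)
{
    int sum = a + b;                    /* promoted to int, cannot wrap */
    return (uint8_t)(sum > 255 ? 255 : sum);
}

/* Brighten every 8-bit pixel in place by the same amount. */
void brighten_u8(uint8_t *pixels, size_t count, uint8_t amount)
{
    for (size_t i = 0; i < count; ++i) {
        pixels[i] = saturating_add_u8(pixels[i], amount);
    }
}
```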

If you look at other architectures, for example ARM, then 8- and 16-bit values can only be loaded and stored. All arithmetic operations are either 32 or 64 bits wide; there are simply no instructions for 8/16-bit data types (with a few exceptions). The compiler will generate a lot more code if you operate on 8/16-bit data types than if you operate on 32-bit data types.

Edited by Mārtiņš Možeiko on August 28, 2018, 8:19pm