I'm not sure who considers smaller integer types slower, or why, but my gut feeling is that this has more to do with C semantics than with actual x86 operations. As you said in the beginning, any operand narrower than int is promoted to int before the operation, and converted back only if the result is stored in a narrower type. That means the compiler often has to generate extra code for the operation to be correct: instead of doing an 8- or 16-bit operation directly, it has to load the values into 32-bit registers, do the operation there, and then store the result back to the 8/16-bit location.
Here is a simple example:
https://godbolt.org/z/h91xQr
See how the f16 function has more instructions than f32? It still uses the same opcode for the 32-bit division - "idiv esi" - but it needs extra code to load the 16-bit values into 32-bit registers so that the C semantics stay correct.
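I can't inline the compiler output here, but the C source behind that link is presumably something along these lines (the f16/f32 names come from the comment above; the exact bodies are my reconstruction):

```c
#include <stdint.h>

/* 16-bit version: a and b are first promoted to int, the division is
   done as a 32-bit idiv, and the result is truncated back to 16 bits. */
int16_t f16(int16_t a, int16_t b) { return a / b; }

/* 32-bit version: operands are already int-sized, so no widening or
   narrowing is needed around the idiv. */
int32_t f32(int32_t a, int32_t b) { return a / b; }
```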
How much of this you actually see depends heavily on how smart the compiler is and how easy your code is to optimize - often the compiler will skip this kind of manipulation and simply operate on 32-bit values directly, as long as it can prove the semantics of your code don't change.
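A hypothetical illustration of that: with unsigned arithmetic, wrapping at 8 bits on every step gives the same result as wrapping once at the end, so the compiler is free to keep the accumulator in a full 32-bit register and narrow only the final value:

```c
#include <stdint.h>
#include <stddef.h>

/* The abstract machine says s wraps modulo 256 on every iteration,
   but adding in a 32-bit register and truncating the result once at
   the end produces the same value, so the compiler can (and usually
   will) avoid per-iteration narrowing here. */
uint8_t sum8(const uint8_t *p, size_t n) {
    uint8_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}
```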
And again - if you look at SIMD registers and your algorithm works well with 8-bit values, you can process the data with 4x fewer operations than if you were operating on 32-bit values. Very often 4x fewer operations means 4x faster code.
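As a rough sketch of what that looks like with SSE2 intrinsics (an illustrative loop of my own, not anyone's production code): one 128-bit register holds sixteen bytes but only four 32-bit ints, so a single add instruction covers 4x more elements when the data is 8-bit:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <stddef.h>

void add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* paddb: sixteen 8-bit additions in one instruction
           (the 32-bit equivalent, paddd, would only do four). */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi8(va, vb));
    }
    for (; i < n; i++)  /* scalar tail for the leftover elements */
        dst[i] = (uint8_t)(a[i] + b[i]);
}
```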
If you look at other architectures, for example ARM - it can only load and store 8- or 16-bit values; all arithmetic is either 32-bit or 64-bit wide. There are simply no instructions for 8/16-bit data types (with a few exceptions). The compiler will generate more code when you operate on 8/16-bit data types than when you operate on 32-bit ones.
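For example (my own minimal snippet, not from the thread), something as simple as incrementing a byte in memory typically compiles on AArch64 to a byte load (ldrb), a 32-bit add, and a byte store (strb) - the arithmetic itself is always done at register width:

```c
#include <stdint.h>

/* Only the memory accesses are 8-bit wide here; the actual addition
   happens in a full-width register on a typical ARM target. */
void inc_byte(uint8_t *p) {
    *p = (uint8_t)(*p + 1);
}
```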