Yes, it's the other way around. Instead of the compiler making unoptimized builds slow, it does really hard work on optimized builds to make them fast. Doing that hard work takes time and memory.
Here's an example of calculating new_position = position + speed * time with a Vector2 type.
Compare the unoptimized build:
https://godbolt.org/g/zXfDjL
vs optimized build:
https://godbolt.org/g/fq6njm
(it's clang rather than MSVC, but the idea is the same).
Look at how the compiler optimized the code - only three instructions plus a ret.
But the unoptimized build has many more. The compiler didn't create those instructions out of nowhere; it's the other way around. For any code, the compiler starts with a list of unoptimized instructions produced by directly translating the C code (usually these are high-level pseudo-assembly instructions before they get converted to real x86 instructions, but the idea is the same). It treats each float as an independent variable, and each variable goes on the stack exactly as it is specified in the C code. After generating the unoptimized instructions, the compiler goes over them and tries to figure out what can be simplified, what is redundant, and what can be removed. That takes time and effort, which is why optimized builds are slower to compile (sometimes significantly). But as a result you get everything nice and compact.
When Casey says "code is more or less aligned to how the hardware works", he means the optimized assembly code (and the data structures, but that's a different story). C code by itself never maps directly to the hardware. You need to know and think about what the compiler does and what the output code will look like.