Typically a language abstracts away the CPU it is being compiled for. So you can write fairly complex operations (array[index]->member->x = 0) without worrying about what machine instructions will be generated. Usually this is good, but sometimes it is bad: sometimes you simply cannot express in C (or whatever language) the thing you want the compiler to generate. One good example is the rotate instruction. Most modern architectures (x86, ARM) can rotate a 32-bit or 64-bit register by a constant amount. But how can you express this in C? For shifting it is easy, you simply write "x >> 1". But there is no C operator for rotate.
You could write (x >> n) | (x << (32-n)), but then you depend on the compiler's optimizer to recognize the pattern and turn it into a single rotate instruction. (As written, the expression is also undefined for n = 0, because it shifts a 32-bit value by 32.) And in a debug build it would still generate several operations, because debug builds are normally not optimized.
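A minimal sketch of that shift-and-or idiom as a helper function. The name rotr32 is made up for illustration. Masking the count keeps both shifts in the range 0..31, so n = 0 is well defined. Whether a given compiler turns this into a single rotate instruction still depends on its optimizer.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (hypothetical helper name). */
static inline uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31;                                  /* keep the count in 0..31 */
    return (x >> n) | (x << ((32 - n) & 31)); /* for n == 0 both shifts are by 0 */
}
```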
One way to fix this problem is to use inline assembler. You explicitly write the instruction that does what you want, figure out how to tell the inline assembler to use your variables, and you're good. Unfortunately there are some issues with this. The compiler will often have a hard time optimizing around your inline assembly fragment, because it treats the assembly block as a "black box": it cannot optimize it, for example by rearranging instructions.
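For comparison, this is roughly what the inline assembler route looks like with GCC or Clang extended asm on x86. Treat it as a sketch under assumptions: the name rotr32_asm is hypothetical, the syntax is specific to GCC-style compilers (MSVC uses a different inline assembler syntax and has none at all for x64), and any other target falls back to plain C.

```c
#include <stdint.h>

/* Rotate right using the x86 "ror" instruction via GCC-style inline asm.
   rotr32_asm is a hypothetical name used only for this example. */
static inline uint32_t rotr32_asm(uint32_t x, uint8_t n)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* "+r"(x): x is read and written in a general-purpose register.
       "c"(n):  the count must be in ecx, because ror takes it from cl. */
    __asm__("rorl %%cl, %0" : "+r"(x) : "c"(n) : "cc");
    return x;
#else
    /* Fallback for other compilers and architectures: plain C. */
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
#endif
}
```

Even in this small case you have to know the operand constraints and the register the instruction expects, which is the kind of detail an intrinsic hides.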
That's where intrinsics come in. They look like regular functions, but they don't generate a function call. The compiler recognizes these "functions" as your way of telling it: "please generate instruction X at this point". It knows which instruction to generate from the name of the intrinsic. But it can optimize much better, because it understands what the intrinsic does and it sees all the arguments and how they are created and passed.
Back to the rotate example: on x86/x86_64, MSVC provides the _rotl and _rotr intrinsics (in the intrin.h header). If you use them like this:
| unsigned int x = 123;
unsigned int y = _rotr(x, 5);
|
Then the compiler knows that you want a "ror" instruction with rotate count 5, and it will handle the register allocation so that variable "x" is used as the input to the ror instruction and its output ends up in "y".
Again, what's good about this is that the compiler has full visibility of what's going on. In my example above, an optimizing compiler will actually reduce the code to this:
| unsigned int y = 0xd8000003;
|
Because it knows you are rotating a constant value by 5, there's no point in doing this at runtime. That is something it cannot do if you use inline assembly.
The same applies to SSE/AVX intrinsics, or to math.h functions like fabs that compilers treat as intrinsics.
There are other advantages to intrinsics. Many of them are portable between compilers, so you can use the same ones with GCC and MSVC, which again you cannot do with inline assembly. Some intrinsics are also portable between architectures, like GCC's __builtin_bswap32, which swaps the bytes of a 32-bit integer.
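As an illustration of the byte swap case, here is a small sketch of a wrapper that picks a byte swap intrinsic per compiler. The wrapper name swap_bytes32 is hypothetical. It assumes __builtin_bswap32 on GCC and Clang and _byteswap_ulong (declared in stdlib.h) on MSVC, with a plain C fallback for anything else.

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <stdlib.h>   /* declares _byteswap_ulong on MSVC */
#endif

/* Reverse the byte order of a 32-bit integer (hypothetical wrapper name). */
static inline uint32_t swap_bytes32(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap32(x);      /* GCC/Clang builtin, not tied to one CPU */
#elif defined(_MSC_VER)
    return _byteswap_ulong(x);        /* MSVC counterpart */
#else
    /* Plain C fallback for compilers without a known intrinsic. */
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
#endif
}
```

The calling code stays the same on every target, and the compiler is free to pick whatever byte swap instruction the CPU offers.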