_mm_mul_epu32 multiplies only two pairs of 32-bit integers (the low 32 bits of each 64-bit lane) and stores the results as two 64-bit integers. So you would still need two of those, plus some shifting/shuffling, to put all four results back together.
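To make that concrete, here is a rough sketch of what "two multiplies plus shuffling" could look like with plain SSE2. The helper names Mul32x4_SSE2 and Mul32x4 are made up for illustration and are not from the HH code. It keeps only the low 32 bits of each product; with SSE4.1 available, _mm_mullo_epi32 does the same thing in one instruction.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: lane-wise 32-bit multiply using only SSE2.
   Returns the low 32 bits of each of the four products. */
static __m128i Mul32x4_SSE2(__m128i A, __m128i B)
{
    /* 64-bit products of lanes 0 and 2. */
    __m128i Even = _mm_mul_epu32(A, B);

    /* Shift lanes 1 and 3 down into positions 0 and 2, then multiply those. */
    __m128i Odd = _mm_mul_epu32(_mm_srli_si128(A, 4), _mm_srli_si128(B, 4));

    /* Gather the low 32 bits of each 64-bit product into the two lowest lanes. */
    __m128i EvenLo = _mm_shuffle_epi32(Even, _MM_SHUFFLE(0, 0, 2, 0));
    __m128i OddLo = _mm_shuffle_epi32(Odd, _MM_SHUFFLE(0, 0, 2, 0));

    /* Interleave back into the original lane order 0, 1, 2, 3. */
    return _mm_unpacklo_epi32(EvenLo, OddLo);
}

/* Hypothetical convenience wrapper working on plain arrays. */
static void Mul32x4(const uint32_t A[4], const uint32_t B[4], uint32_t Out[4])
{
    __m128i VA = _mm_loadu_si128((const __m128i *)A);
    __m128i VB = _mm_loadu_si128((const __m128i *)B);
    _mm_storeu_si128((__m128i *)Out, Mul32x4_SSE2(VA, VB));
}
```

That is two multiplies, two byte shifts, two shuffles and an unpack for four results, which is why it is not an obvious win over other approaches.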
On Intel, the address addition is free when loading from or storing to memory. Instructions for that typically have this form:
| mov Register, [Base + Scale * Index + Offset]
|
where Register is the destination that receives the result. Base and Index are registers that hold an address or an offset. Scale is 1, 2, 4 or 8, and Offset is an immediate (constant) displacement.
In our case the compiler is probably generating one of these forms:
| mov Register, [Base + Index] // where Index register stores TexturePitch
mov Register, [Base + 4] // because sizeof(uint32)==4
mov Register, [Base + Index + 4] // combination of two above
|
where Base is the TexelPtr pointer. So no separate add instruction is executed before the memory load. The address is computed inside the CPU as part of the load, so there is no need to spend an additional add operation on it (SIMD or no SIMD).
Here's a fragment of the disassembly that proves this:
| ░.00000001`800038E7: 66410F6E10 movd xmm2,[r8]
░.00000001`800038EC: 4863D0 movsxd rdx,eax
░.00000001`800038EF: 4803D6 add rdx,rsi
░.00000001`800038F2: 660F6E02 movd xmm0,[rdx]
░.00000001`800038F6: 660F7ED8 movd eax,xmm3
░.00000001`800038FA: 66440F62C0 punpckldq xmm8,xmm0
░.00000001`800038FF: 4863C8 movsxd rcx,eax
░.00000001`80003902: 660F6E4204 movd xmm0,[rdx][4]
░.00000001`80003907: 4803CE add rcx,rsi
░.00000001`8000390A: 660F6E09 movd xmm1,[rcx]
░.00000001`8000390E: 660F62D1 punpckldq xmm2,xmm1
░.00000001`80003912: 66440F62F8 punpckldq xmm15,xmm0
░.00000001`80003917: 660F6E4904 movd xmm1,[rcx][4]
░.00000001`8000391C: 66420F6E042A movd xmm0,[rdx][r13]
░.00000001`80003922: 66440F62C2 punpckldq xmm8,xmm2
░.00000001`80003927: 66440F62E8 punpckldq xmm13,xmm0
░.00000001`8000392C: 66410F6E5004 movd xmm2,[r8][4]
░.00000001`80003932: 660F62D1 punpckldq xmm2,xmm1
░.00000001`80003936: 66420F6E0C29 movd xmm1,[rcx][r13]
░.00000001`8000393C: 66440F62FA punpckldq xmm15,xmm2
░.00000001`80003941: 66430F6E1428 movd xmm2,[r8][r13]
░.00000001`80003947: 660F62D1 punpckldq xmm2,xmm1
░.00000001`8000394B: 66410F6E4C0D04 movd xmm1,[r13][rcx][4]
░.00000001`80003952: 66440F62EA punpckldq xmm13,xmm2
░.00000001`80003957: 66410F6E441504 movd xmm0,[r13][rdx][4]
░.00000001`8000395E: 66430F6E540504 movd xmm2,[r13][r8][4]
░.00000001`80003965: 66430F6E640D04 movd xmm4,[r13][r9][4]
|
"movd xmm4,[r13][r9][4]" is just another syntax for "movd xmm4,[r13+r9+4]". So r13 contains TexturePitch, and r8/r9/rdx/rcx contain TexelPtr0/1/2/3.
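For reference, the C side of those loads might look roughly like the sketch below. The texel_quad struct and the FetchQuad and LoadU32 helpers are hypothetical names for illustration, not the actual HH code. Each of the four loads has an address of the form Base, Base + 4, Base + Index or Base + Index + 4, so the compiler can fold the addition into the load instead of emitting a separate add.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the 2x2 block of texels used by bilinear filtering. */
typedef struct
{
    uint32_t A, B, C, D;
} texel_quad;

/* Read one 32-bit texel from a byte address; memcpy avoids alignment and
   aliasing problems and normally compiles down to a single load. */
static uint32_t LoadU32(const uint8_t *Ptr)
{
    uint32_t Value;
    memcpy(&Value, Ptr, sizeof(Value));
    return Value;
}

/* Hypothetical sketch: fetch the texel at TexelPtr plus its right, lower and
   lower-right neighbours. TexturePitch is the row stride in bytes. */
static texel_quad FetchQuad(const uint8_t *TexelPtr, intptr_t TexturePitch)
{
    texel_quad Result;
    Result.A = LoadU32(TexelPtr);                                   /* [Base]             */
    Result.B = LoadU32(TexelPtr + sizeof(uint32_t));                /* [Base + 4]         */
    Result.C = LoadU32(TexelPtr + TexturePitch);                    /* [Base + Index]     */
    Result.D = LoadU32(TexelPtr + TexturePitch + sizeof(uint32_t)); /* [Base + Index + 4] */
    return Result;
}
```

Whether a given compiler actually picks these addressing forms depends on the optimization settings, so checking the disassembly as above is still the way to be sure.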
As for anti-aliasing, Casey explained this in one of the Q&As. There will be no need for it in HH, because all graphics are sprites with an alpha channel. Edges are always smooth because the artist makes them smooth by specifying the correct alpha. When the HH game draws a sprite, the edge of the sprite never contains solid pixels; it always has alpha = 0 (fully transparent). Anti-aliasing matters only when you draw hard edges, like polygons in 3D rendering.
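A tiny sketch of why alpha handles the edges, assuming a plain (non-premultiplied) linear blend of one color channel in the 0..1 range. BlendChannel is a hypothetical name and the real HH blend may differ in details.

```c
/* Hypothetical sketch: blend one color channel of a sprite texel over the
   destination. All values are assumed to be in the 0..1 range. */
static float BlendChannel(float Dest, float Source, float SourceAlpha)
{
    /* SourceAlpha = 0 keeps the destination, SourceAlpha = 1 replaces it,
       and anything in between gives a smooth mix. */
    return (1.0f - SourceAlpha) * Dest + SourceAlpha * Source;
}
```

With alpha = 0 at the sprite border the destination pixel is left untouched, and the artist-authored alpha ramp just inside the border produces the smooth transition that anti-aliasing would otherwise have to compute.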