Day 121: Origin.y

I don't know what was happening in the marathon stream, but if you subtract Origin.y from PixelPy without using _mm_sub_ps (handmade_render_group.cpp:578 in that day's source code), the number of cycles reported by IACA actually decreases by almost 7 cycles.

Sorry if it was already reported or is simply not relevant anymore.
It may very well be relevant shortly, but we had some indications that our profiling results had gotten down into the area where we weren't going to get a clear picture of what was really going on until we dealt with memory stalls. So we wanted to put in hyperthreading first, just to make sure we were measuring cycles in a way that would include inputs from two threads at once.

So, hold that thought for another two weeks or so :)

- Casey
Got it, thanks. On a similar note, can the OS "detect" a possible memory stall and switch to a different thread? Or does this method of hiding memory latency only work with hyperthreading? I presume the actual context switch may take longer than waiting for the memory...
Context switching involves many things. Among them: reading the values of all registers, storing them to memory, then loading the other thread's register values from memory. So switching threads while one is waiting on a single memory stall doesn't help; the stall is over long before the switch would finish.

What does help is switching context when a thread waits on I/O (file reads/writes, network sockets, or paging to the swap file), or on threading primitives like events or mutexes.
Thank you, Mārtiņš.
Hmm, this sounds like the problem I wrote about in my topic on IACA markers. Basically, as the code is written, the compiler can reorder the markers and the code surrounding them, so trivial-looking changes can change the ordering, changing which chunk of code actually gets analyzed. I guess we'll pay closer heed to such things after multithreading is working properly.
Yes, at this point I think we're down in the region of unreliable timing, so I think we want to start timing the total time taken to render, which will be more reliable, before we start deciding things like where to put a minus, etc...

- Casey
I think we can save 1 multiply, 1 add, and 1 shift at the end of the function when we unnecessarily compute the destination alpha.

Edited by elle on
Well, that may be true, but we cannot get rid of it until we duplicate the function, because keep in mind that we actually may need destination alpha, for precompositing things (like ground tiles that have holes in them!). So we do need an optimized version with dest alpha.

- Casey
After debugging my vectorized version for a long time, I thought of a few minor things that might lower the number of cycles per pixel by one or two.

_mm_mul_epu32 is an SSE2 intrinsic; I think it could replace the _mm_mullo_epi16/_mm_mulhi_epi16 pair, since the arguments can't be negative where they're used.

Create a vector with _mm_set_ps(3, 2, 1, 0) and a vector for minX, and use these at the start of each row to compute the first "pixelX".
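Something like this is what I have in mind (a rough sketch, not the actual renderer code; MinX is just a stand-in for the real variable):

```c
#include <emmintrin.h>

// Sketch: build the first four pixel X coordinates of a row in SIMD,
// then step the whole vector by 4 each loop iteration instead of
// recomputing per pixel.
static void FirstPixelX(int MinX, float Out[4])
{
    __m128 Offsets = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f); // lanes 0..3 hold 0,1,2,3
    __m128 PixelPx = _mm_add_ps(_mm_set1_ps((float)MinX), Offsets);
    _mm_storeu_ps(Out, PixelPx);
    // per iteration afterwards: PixelPx = _mm_add_ps(PixelPx, _mm_set1_ps(4.0f));
}
```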

[strike]When creating the sample vectors, instead of doing + sizeof(UInt32) and + texturePitch in scalar, do this in SIMD as well.[/strike] I forgot these are memory fetches. I think the prior calculation of the pointers to the texels could be done in SIMD, though.

Might be better, might be worse. But I thought it was worth mentioning.

Are we also going to anti-alias the edges? I thought of an intuitive way to do this, which makes me wonder why it's often so expensive in games in terms of frame rate. I'd think it only adds an extra blend of the pixels on the edge with the pixels of the layer underneath, next to the edge.
Maybe because in 3D it would have to be done for every triangle as opposed to only a couple of large 2D objects?

Edited by elle on
_mm_mul_epu32 only multiplies two 32-bit integers and stores the results as two 64-bit integers. So you would still need to do two of those, plus some shifting/shuffling, to put all 4 lanes back together.
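Roughly something like this would be needed to get all 4 lanes (a sketch, not the actual renderer code):

```c
#include <emmintrin.h>
#include <stdint.h>

// Full 4-lane 32-bit multiply built from _mm_mul_epu32: two multiplies
// plus shuffles to interleave the low 32 bits of each 64-bit product.
static __m128i MulLo32(__m128i A, __m128i B)
{
    __m128i Even = _mm_mul_epu32(A, B);                  // 64-bit products of lanes 0 and 2
    __m128i Odd  = _mm_mul_epu32(_mm_srli_si128(A, 4),
                                 _mm_srli_si128(B, 4));  // 64-bit products of lanes 1 and 3
    // Pack the low 32 bits of each product back into one vector.
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(Even, _MM_SHUFFLE(0, 0, 2, 0)),
                              _mm_shuffle_epi32(Odd,  _MM_SHUFFLE(0, 0, 2, 0)));
}
```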

On Intel, the addition is free when loading or storing memory. Instructions for that typically take this form:

mov Register, [Base + Scale * Index + Offset]

where Register is the destination for the result, Base and Index are registers that contain memory offsets, Scale is 1, 2, 4 or 8, and Offset is an immediate constant.
In our case the compiler is probably generating one of these forms:
mov Register, [Base + Index]     // where Index register stores TexturePitch
mov Register, [Base + 4]         // because sizeof(uint32)==4
mov Register, [Base + Index + 4] // combination of two above

where Base is the TexelPtr pointer. So there is no add opcode happening before this memory load; it's all internal to the CPU, so there is no need to waste an additional add operation (SIMD or no SIMD).
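In C terms the four fetches look roughly like this (TexelPtr and Pitch are stand-ins for the actual names); each addition folds into the addressing mode of the load:

```c
#include <stdint.h>

// Fetch the 2x2 texel neighborhood for bilinear sampling.
// On x64, each addition below folds into the mov's addressing mode:
//   [Base], [Base + 4], [Base + Pitch], [Base + Pitch + 4]
static void Fetch2x2(uint8_t *TexelPtr, intptr_t Pitch, uint32_t Texel[4])
{
    Texel[0] = *(uint32_t *)(TexelPtr);
    Texel[1] = *(uint32_t *)(TexelPtr + sizeof(uint32_t));
    Texel[2] = *(uint32_t *)(TexelPtr + Pitch);
    Texel[3] = *(uint32_t *)(TexelPtr + Pitch + sizeof(uint32_t));
}
```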

Here's a fragment of disassembly that proves this:
░.00000001`800038E7: 66410F6E10                     movd         xmm2,[r8]
░.00000001`800038EC: 4863D0                         movsxd       rdx,eax
░.00000001`800038EF: 4803D6                         add          rdx,rsi
░.00000001`800038F2: 660F6E02                       movd         xmm0,[rdx]
░.00000001`800038F6: 660F7ED8                       movd         eax,xmm3
░.00000001`800038FA: 66440F62C0                     punpckldq    xmm8,xmm0
░.00000001`800038FF: 4863C8                         movsxd       rcx,eax
░.00000001`80003902: 660F6E4204                     movd         xmm0,[rdx][4]
░.00000001`80003907: 4803CE                         add          rcx,rsi
░.00000001`8000390A: 660F6E09                       movd         xmm1,[rcx]
░.00000001`8000390E: 660F62D1                       punpckldq    xmm2,xmm1
░.00000001`80003912: 66440F62F8                     punpckldq    xmm15,xmm0
░.00000001`80003917: 660F6E4904                     movd         xmm1,[rcx][4]
░.00000001`8000391C: 66420F6E042A                   movd         xmm0,[rdx][r13]
░.00000001`80003922: 66440F62C2                     punpckldq    xmm8,xmm2
░.00000001`80003927: 66440F62E8                     punpckldq    xmm13,xmm0
░.00000001`8000392C: 66410F6E5004                   movd         xmm2,[r8][4]
░.00000001`80003932: 660F62D1                       punpckldq    xmm2,xmm1
░.00000001`80003936: 66420F6E0C29                   movd         xmm1,[rcx][r13]
░.00000001`8000393C: 66440F62FA                     punpckldq    xmm15,xmm2
░.00000001`80003941: 66430F6E1428                   movd         xmm2,[r8][r13]
░.00000001`80003947: 660F62D1                       punpckldq    xmm2,xmm1
░.00000001`8000394B: 66410F6E4C0D04                 movd         xmm1,[r13][rcx][4]
░.00000001`80003952: 66440F62EA                     punpckldq    xmm13,xmm2
░.00000001`80003957: 66410F6E441504                 movd         xmm0,[r13][rdx][4]
░.00000001`8000395E: 66430F6E540504                 movd         xmm2,[r13][r8][4]
░.00000001`80003965: 66430F6E640D04                 movd         xmm4,[r13][r9][4]

"movd xmm4,[r13][r9][4]" is just another syntax for "movd xmm4,[r13+r9+4]". So r13 contains TexturePitch and r8/r9/rdx/rcx contain TexelPtr0/1/2/3.

As for anti-aliasing, Casey explained this in one of the Q&As. There will be no need for it in HH, because all the graphics are sprites with an alpha channel, so the edges are always smooth: the artist makes them smooth by specifying the correct alpha. When the HH game draws a sprite, the edge of the sprite never contains solid pixels; it is always alpha = 0 (fully transparent). Anti-aliasing matters only when you draw hard edges, like polygons in 3D rendering.

Edited by Mārtiņš Možeiko on
To clarify on the anti-aliasing:

Since we are always drawing from bitmaps which have alpha-blended edges already, we just need to fetch from those bitmaps sub-pixel-accurate, and we will never actually have problems with edges. The bilinear filtering does the anti-aliasing for us.
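As a sketch of what that sub-pixel fetch boils down to per channel (illustrative names, not the actual renderer code): fx and fy are the fractional parts of the sample position, A/B the top two texels and C/D the bottom two.

```c
// Single-channel bilinear blend. Because the sprite's own alpha already
// fades out at the edge, this blend produces the anti-aliased result
// for free: no separate edge pass is needed.
static float Lerp(float A, float t, float B)
{
    return (1.0f - t)*A + t*B;
}

static float Bilinear(float A, float B, float C, float D, float fx, float fy)
{
    return Lerp(Lerp(A, fx, B), fy, Lerp(C, fx, D));
}
```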

The reason 3D cards have so much trouble doing anti-aliasing is that they have to do it in an order-independent way, and they have to do it with actual hard-edged geometry. They're taking hard-edged triangles and rasterizing them, and they have to make that result come out anti-aliased at the end, after all the triangles have been drawn.

Since the card doesn't even know if something _is_ an edge until after all the triangles have been rasterized, you can see why this is a tricky problem. You in fact can't even do anti-aliasing during rasterization, because you don't know if the thing you're drawing perhaps abuts some other triangle with the same texture and is, in fact, not an edge at all.

So what the 3D cards have to do is use "multisampling", where for every pixel they actually have capacity to store more information than just the pixel data. They can store a "pattern" of samples which allow them to recover more information about what edges passed through the pixel, so at the end of rendering they can do what's called a "resolve" pass to take the samples collected for every pixel and decide what the final pixel color should probably be.
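A toy version of that resolve step might look like this (a simple box filter over the samples of one pixel; names are illustrative, and real hardware uses more sophisticated sample distributions and filters):

```c
#include <stdint.h>

// Toy "resolve": average the N samples stored for one pixel down to a
// single 8-bit channel value (a box filter, the simplest resolve).
static uint8_t ResolveChannel(uint8_t *Samples, int N)
{
    int Sum = 0;
    for(int i = 0; i < N; ++i)
    {
        Sum += Samples[i];
    }
    return (uint8_t)(Sum / N);
}
```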

The finesse here is generally in figuring out how many samples are necessary, what the resolve algorithm should be, what the distribution should look like of the samples geometrically, etc.

- Casey