Measuring Port Usage with IACA | Handmade Hero Episode Guide

0:31Review of last session: cycle counting code

1:20Fabian: Instruction counts combined with throughput numbers are not accurate; they do not properly account for the CPU's ability to overlap different ops

4:00The accurate way is to write a tool that simulates the CPU, as Casey did for the Xbox 360

4:30... or use the Intel Architecture Code Analyzer (IACA)

7:35Overview on how to use IACA with the code

8:08Marking sections with IACA_START and IACA_END

9:20Modifying build.bat to include the iaca directory

10:09For Linux/Unix compatibility, mind your case when including files

11:38Running the IACA command line

13:10Reading the IACA results

15:28Trying to decipher the meanings of the letters in the IACA table

16:58IACA can output graphs?

17:12IACA reports max throughput of 86.60 cycles

17:56There may be some more room for optimization...

18:30IACA is pretty nice!

19:20Adding some macros to turn IACA on/off

19:47Thanks to Fabian for the suggestion
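
The marking step (8:08) and the on/off macros (19:20) might look roughly like the sketch below. It assumes that `iacaMarks.h` from the IACA distribution is on the include path when analysis is enabled, and that the header provides `IACA_START`/`IACA_END` plus the `IACA_VC64_START`/`IACA_VC64_END` variants for 64-bit MSVC. The names `USE_IACA`, `BEGIN_ANALYSIS`, `END_ANALYSIS` and `SumOfSquares` are hypothetical and are not taken from the episode.

```cpp
#include <cstdint>

// Hypothetical switch: build with -DUSE_IACA=1 (or /DUSE_IACA=1) to emit markers.
#ifndef USE_IACA
#define USE_IACA 0
#endif

#if USE_IACA
#include <iacaMarks.h> // assumed to ship with IACA
#if defined(_MSC_VER) && defined(_WIN64)
#define BEGIN_ANALYSIS IACA_VC64_START
#define END_ANALYSIS   IACA_VC64_END
#else
#define BEGIN_ANALYSIS IACA_START
#define END_ANALYSIS   IACA_END
#endif
#else
// Markers compile away to nothing in normal builds.
#define BEGIN_ANALYSIS
#define END_ANALYSIS
#endif

// Stand-in for the hot loop; any loop body can be marked the same way.
static uint32_t SumOfSquares(const uint32_t *Values, int Count)
{
    uint32_t Sum = 0;
    for(int Index = 0; Index < Count; ++Index)
    {
        BEGIN_ANALYSIS; // start marker sits at the top of the loop body
        Sum += Values[Index]*Values[Index];
    }
    END_ANALYSIS; // end marker sits right after the loop
    return Sum;
}
```

Builds with markers enabled are meant for feeding the analyzer rather than for running, so keeping the switch off by default seems like the safer choice.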

20:44Fabian: bilinear and squaring don't need floating point

21:19Move the sRGB->linear conversion after the bilinear

23:25Bake normalization into color

25:25Works fine (not much improvement)

27:52Remove a number of multiply ops by keeping things in 0-255 space (no improvement)

36:47Diff IACA output from the run with the removed multiplies and the one prior

41:30Getting rid of 43 instructions did not improve throughput reported by IACA

42:46Seems to be doing the same number of multiplies either way

43:09Compiler was smart enough to do the transformations?

45:07What other optimizations could we do?

47:19Use _mm_mullo_epi16 / _mm_mulhi_epi16 to do the squaring at a wider SIMD width, prior to the FP conversion?

49:34Blackboard: Mask out A and G, which leaves R and B aligned to the 16-bit SIMD boundaries

52:53Blackboard: _mm_mullo_epi16 vs. _mm_mulhi_epi16

54:36Problem: this will square our alpha as well

55:19We'll have to use another instruction to handle alpha

57:00Bitshifting / masking to pull the components from their 16-bit lanes

58:39Wrong result! It's Q&A, but let's try to debug first...

1:06:35Issue found: Should be masking 16-bit, not 8-bit

1:07:00Better, but still a strange result

1:07:57How to avoid squaring the alpha?

1:09:18Just pull the alpha out prior to the squaring?

1:09:46... that works fine
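
The 16-bit squaring steps above (masking at 49:34, the multiply at 52:53, the 16-bit masks from 1:06:35 and pulling alpha out first at 1:09:18) can be sketched as follows. This is a minimal illustration and not the episode's code: it assumes pixels are packed as 0xAARRGGBB, and the function name `SquareRGBKeepAlpha` and its output arrays are hypothetical.

```cpp
#include <emmintrin.h> // SSE2 integer intrinsics
#include <cstdint>

// Squares R, G and B of four packed pixels (assumed 0xAARRGGBB) and
// returns alpha unsquared. All outputs are one 32-bit value per pixel.
static void SquareRGBKeepAlpha(const uint32_t Pixels[4],
                               uint32_t R2[4], uint32_t G2[4],
                               uint32_t B2[4], uint32_t A[4])
{
    __m128i Sample = _mm_loadu_si128((const __m128i *)Pixels);
    __m128i Mask00FF00FF = _mm_set1_epi32(0x00FF00FF);
    __m128i MaskFFFF = _mm_set1_epi32(0x0000FFFF);

    // Pull alpha out before any squaring happens.
    __m128i Alpha = _mm_srli_epi32(Sample, 24);

    // Masking out A and G leaves R and B alone in their 16-bit lanes.
    __m128i RB = _mm_and_si128(Sample, Mask00FF00FF);
    // Shifting down by 8 first does the same for A and G.
    __m128i AG = _mm_and_si128(_mm_srli_epi32(Sample, 8), Mask00FF00FF);

    // 255*255 = 65025 fits in 16 bits, so the low half of the product is enough.
    __m128i RBSq = _mm_mullo_epi16(RB, RB);
    __m128i AGSq = _mm_mullo_epi16(AG, AG); // the A lane is squared too, then ignored

    // Results are 16 bits wide, so shift and mask by 16 bits, not 8.
    __m128i RSq = _mm_srli_epi32(RBSq, 16);
    __m128i BSq = _mm_and_si128(RBSq, MaskFFFF);
    __m128i GSq = _mm_and_si128(AGSq, MaskFFFF);

    _mm_storeu_si128((__m128i *)R2, RSq);
    _mm_storeu_si128((__m128i *)G2, GSq);
    _mm_storeu_si128((__m128i *)B2, BSq);
    _mm_storeu_si128((__m128i *)A, Alpha);
}
```

In this sketch `_mm_mulhi_epi16` is not needed because the square of an 8-bit value never overflows 16 bits; whether the episode's final code uses it is not something this example claims.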

1:10:10Now let's convert everything to use the 16-bit squaring

1:10:59... around 6 cycles improvement, but small visual problem with the bilinear

1:11:37Found the issue: We were reading only from SampleA

1:11:47Bilinear looks better, but there is still an oddity: green fringing around the hero

1:12:16Found the issue, looks good, but...

1:12:42... we're actually 8 cycles worse now

1:13:14Why? Let's run it through IACA

1:13:44Throughput bottleneck: Inter-iteration? Good question for Fabian

1:13:59Total number of micro-ops has gotten smaller (350 vs. 306 vs. 283), but throughput is worse

1:15:56Could use the same technique when loading the destination, but probably not a good idea

1:17:01Q&A

1:17:30cubercaleb CP stands for Critical path

1:18:13flaturated IACA was showing Port 1 as the bottleneck, so reducing multiplies won't help

1:19:05stelar7 Inter-iteration means that run x of the loop depends on the prior run

1:19:56butwhynot1 Try hoisting out the TexturePitch/TextureMemory (several cycles improvement)
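
The hoisting suggestion can be illustrated with a small sketch. The idea, as generally understood, is that loads through a pointer inside a loop may be repeated if the compiler cannot prove that stores in the loop leave them unchanged, so copying them into locals once can save work. The struct layout and the `CopyRow` function are hypothetical stand-ins, with `Pitch` assumed to be in bytes.

```cpp
#include <cstdint>

// Hypothetical bitmap description; Pitch is assumed to be the row stride in bytes.
struct loaded_bitmap
{
    void *Memory;
    int Pitch;
    int Width;
    int Height;
};

// Copies row Y of the texture into Dest, reading the texture fields once.
static void CopyRow(const loaded_bitmap *Texture, int Y, uint32_t *Dest)
{
    // Hoisted out of the loop: these do not change per pixel.
    const uint8_t *TextureMemory = (const uint8_t *)Texture->Memory;
    int TexturePitch = Texture->Pitch;
    int Width = Texture->Width;

    const uint32_t *Row = (const uint32_t *)(TextureMemory + Y*TexturePitch);
    for(int X = 0; X < Width; ++X)
    {
        // Without the locals, each store to Dest could force Texture->Memory
        // and Texture->Pitch to be reloaded, since Dest might alias Texture.
        Dest[X] = Row[X];
    }
}
```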

1:23:54roflraging How do you support AVX? What about register saving through context switches?

1:25:07mmozeiko Replace sqrt with mul/rsqrt?

1:31:14Some comments on port 1 pressure from Fabian

1:34:19robotchocolatedino How can removing the sqrt help if it's done on the multiply port, not the adder port?
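
The mul/rsqrt replacement suggested at 1:25:07 relies on the identity sqrt(x) = x * (1/sqrt(x)). A minimal sketch with SSE intrinsics follows; the function names are hypothetical, and the approximate version trades accuracy for speed because `_mm_rsqrt_ps` is only an estimate (roughly 12 bits of precision).

```cpp
#include <xmmintrin.h> // SSE float intrinsics

// Full-precision square root of four floats.
static __m128 SqrtExact(__m128 X)
{
    return _mm_sqrt_ps(X);
}

// Approximate square root: x * rsqrt(x). Inputs must be > 0 here,
// because rsqrt(0) is +infinity and 0 * infinity yields NaN.
static __m128 SqrtViaRsqrt(__m128 X)
{
    return _mm_mul_ps(X, _mm_rsqrt_ps(X));
}
```

If inputs can be exactly zero, the approximate version needs extra handling (for example clamping to a tiny positive value first), which eats into whatever the replacement saves.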
