Handmade Hero»Episode Guide
Measuring Port Usage with IACA
0:31 Review of last session: cycle counting code
1:20 Fabian: Instruction counts, even with throughput numbers, are not accurate; they do not properly account for the CPU's ability to overlap different ops
4:00 The accurate way is to write a tool to simulate the CPU, as Casey did for the Xbox 360
4:30 ... or use the Intel Architecture Code Analyzer (IACA)
7:35 Overview of how to use IACA with the code
8:08 Marking sections with IACA_START and IACA_END
9:20 Modifying build.bat to include the iaca directory
10:09 For Linux/Unix compatibility, mind your case when including files
11:38 Running the IACA command line
13:10 Reading the IACA results
15:28 Trying to decipher the meanings of the letters in the IACA table
16:58 IACA can output graphs?
17:12 IACA reports max throughput of 86.60 cycles
17:56 There may be some more room for optimization...
18:30 IACA is pretty nice!
19:20 Adding some macros to turn IACA on/off
19:47 Thanks to Fabian for the suggestion
20:44 Fabian: bilinear and squaring don't need floating point
21:19 Move the sRGB->linear conversion after the bilinear
23:25 Bake normalization into color
25:25 Works fine (not much improvement)
27:52 Remove a number of multiply ops by keeping things in 0-255 space (no improvement)
36:47 Diff the IACA output from the run with the removed multiplies against the one prior
41:30 Getting rid of 43 instructions did not improve throughput reported by IACA
42:46 Seems to be doing the same number of multiplies either way
43:09 Compiler was smart enough to do the transformations?
45:07 What other optimizations could we do?
47:19 Use _mm_mul_mulhi_epi16 to do the square operations more wide prior to the FP conversion?
49:34 Blackboard: Mask out A and G, which leaves R and B aligned to the 16-bit SIMD boundaries
52:53 Blackboard: _mm_mullo_epi16 vs. _mm_mulhi_epi16
54:36 Problem: this will square our alpha as well
55:19 We'll have to use another instruction to handle alpha
57:00 Bitshifting / masking to pull the components from their 16-bit lanes
58:39 Wrong result! It's Q&A, but let's try to debug first...
1:06:35 Issue found: Should be masking 16-bit, not 8-bit
1:07:00 Better, but still a strange result
1:07:57 How to avoid squaring the alpha?
1:09:18 Just pull the alpha out prior to the squaring?
1:09:46 ... that works fine
1:10:10 Now let's convert everything to use the 16-bit squaring
1:10:59 ... around 6 cycles of improvement, but a small visual problem with the bilinear
1:11:37 Found the issue: We were reading only from SampleA
1:11:47 Bilinear looks better, but there is still an oddity: green fringing around the hero
1:12:16 Found the issue, looks good, but...
1:12:42 ... we're actually 8 cycles worse now
1:13:14 Why? Let's run it through IACA
1:13:44 Throughput bottleneck: Inter-iteration? Good question for Fabian
1:13:59 Total number of micro-ops has gotten smaller (350 vs. 306 vs. 283), but throughput is worse
1:15:56 Could use the same technique when loading the destination, but probably not a good idea
1:17:01 Q&A
1:17:30 cubercaleb: CP stands for critical path
1:18:13 flaturated: IACA was showing Port 1 as the bottleneck, so reducing multiplies won't help
1:19:05 stelar7: Inter-iteration means that one run of the loop depends on the prior run
1:19:56 butwhynot1: Try hoisting out the TexturePitch/TextureMemory (several cycles improvement)
1:23:54 roflraging: How do you support AVX? What about register saving through context switches?
1:25:07 mmozeiko: Replace sqrt with mul/rsqrt?
1:31:14 Some comments on port 1 pressure from Fabian
1:34:19 robotchocolatedino: How can removing the sqrt help if it's done on the multiply port, not the adder port?
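
The 16-bit squaring idea worked through between 47:19 and 1:10:10 can be sketched roughly as follows. This is not the code written in the episode; it is a minimal illustration that assumes pixels are packed as 0xAARRGGBB in 32-bit lanes, and the function name square_rgb_keep_alpha and its output arrays are hypothetical. It masks R and B into 16-bit lanes, squares them with _mm_mullo_epi16, pulls alpha out before any squaring, and uses a 16-bit (not 8-bit) mask when extracting the squared values.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: squares R, G and B of four packed 0xAARRGGBB pixels
   using 16-bit integer multiplies, and returns alpha unsquared.
   Assumes the 0xAARRGGBB layout; adjust the shifts for other layouts. */
static void square_rgb_keep_alpha(const uint32_t pixels[4],
                                  float r2[4], float g2[4],
                                  float b2[4], float a[4])
{
    __m128i sample  = _mm_loadu_si128((const __m128i *)pixels);
    __m128i mask_rb = _mm_set1_epi32(0x00FF00FF);
    __m128i mask16  = _mm_set1_epi32(0x0000FFFF);

    /* Masking out A and G leaves R and B each alone in a 16-bit lane. */
    __m128i rb = _mm_and_si128(sample, mask_rb);
    /* Shifting right by 8 first does the same for A and G. */
    __m128i ag = _mm_and_si128(_mm_srli_epi32(sample, 8), mask_rb);

    /* Pull alpha out before squaring so it is never squared. */
    __m128i alpha = _mm_srli_epi32(sample, 24);

    /* 255 * 255 = 65025 fits in 16 bits, so the low half of each
       16-bit product is the full square. */
    __m128i rb_sq = _mm_mullo_epi16(rb, rb);
    __m128i ag_sq = _mm_mullo_epi16(ag, ag);

    /* Squared values occupy a full 16 bits, so shift or mask by 16 bits. */
    __m128i r_sq = _mm_srli_epi32(rb_sq, 16);
    __m128i b_sq = _mm_and_si128(rb_sq, mask16);
    __m128i g_sq = _mm_and_si128(ag_sq, mask16); /* drops the squared alpha */

    /* Convert to floating point only after the integer squaring. */
    _mm_storeu_ps(r2, _mm_cvtepi32_ps(r_sq));
    _mm_storeu_ps(g2, _mm_cvtepi32_ps(g_sq));
    _mm_storeu_ps(b2, _mm_cvtepi32_ps(b_sq));
    _mm_storeu_ps(a,  _mm_cvtepi32_ps(alpha));
}
```

Masking the squared lanes with 0xFF instead of 0xFFFF would truncate the result, which matches the kind of bug noted at 1:06:35. Whether this is faster than squaring in floating point depends on port pressure, as the IACA runs in the episode suggest.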