Understanding Audio Fill - Lesson 9

Several things confused me about the audio code. I couldn't find answers to them anywhere. It's mainly about the use of a sample index.

Here is what I understand: there is a single buffer, which has a play cursor. I'm going to copy mrmixer's diagrams from https://hero.handmade.network/for...4-why_use_a_2_second_sound_buffer

|-----------------|
       ^      


We also have a Running Sample Index, which tracks where we are in the audio we want to play. This can be mapped onto the buffer by modding with the buffer size.

|-----------------|
       ^      *


We also have a Target Cursor which goes beyond the sample index by our latency amount.

|-----------------|
   o   ^      *


In the code we fill from the sample index up to the play cursor.

|xxx----------xxxx|
   o   ^      *


If I've understood correctly, I have many questions. Any answers to any of them would be appreciated.

1. The MSDN documentation says you should not write between the play cursor and the write cursor, but we don't even look at the write cursor. Does it not actually matter?
2. Why use a sample index at all - shouldn't we instead start at the play/write cursor and fill the latency amount past there?
3. Again, why use the sample index - it doesn't track time. It arbitrarily tracks where we are in a sine wave, and we increment that value without considering how much time has passed.

I suppose that's my main gripe. I'm struggling to understand why this seems so strange to me... Am I making sense? Where does this sample index idea come from? Everything so far was explained meticulously, but this was just bizarre: we have a value that keeps increasing, and we use it to decide where in the buffer to write... I went to see how SDL does it and they don't do it like this at all; they use a callback which just asks you to write past the write cursor... Is this an old way of doing it?

Sorry for the rant!


It's been a long time, so I might be wrong. I'm using the code from day 009 (copied below), assuming that's what lesson 9 meant.

The running sample index isn't "where we are in the audio we want to play". It tracks the last place we wrote in the audio buffer (more precisely, 1 past the last "valid" sample in the buffer, which is also the place we need to write next if we don't want a "hole" in the sound samples). We use the running sample index to compute the ByteToLock value, which is the first byte where we need to write new samples.

The target cursor doesn't "go beyond the sample index by our latency amount". It uses the play cursor, not the sample index. It represents 1 past the last sample we want to write. It is meant to be as close as possible to the play cursor to reduce latency.

We don't fill from the running sample index to the play cursor; we fill from the running sample index (ByteToLock) to the target cursor.
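
As a worked example of the wrap-around case handled by the day 009 code quoted at the end of this post (the numbers are made up): say SecondaryBufferSize is 96000 bytes, ByteToLock is 90000 and TargetCursor is 6000. ByteToLock > TargetCursor, so the region to fill wraps around the end of the buffer:

BytesToWrite = (96000 - 90000); // 6000 bytes up to the end of the buffer...
BytesToWrite += 6000;           // ...plus 6000 bytes from the start: 12000 in total.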

1. The MSDN documentation says you should not write between the play cursor and the write cursor, but we don't even look at the write cursor. Does it not actually matter?

- You should not write between the play and write cursors. What happens if you do is "undefined". The new audio might be played if you're lucky, but most likely it won't be.

You could see DirectSound (this is my mental model, and probably not entirely correct) as having 2 buffers: one that is the actual buffer on the sound card, and one that is a working buffer in main memory (the secondary buffer). When we write things in our buffer (the secondary buffer), DirectSound will at some point copy some of its values to the primary buffer that will actually be played on the sound card. The copy region will be between the play and write cursors. So if you write in that region, you don't know whether the bytes you write have already been copied to the sound card.

- Casey might be writing between the play and write cursors (we would have to compare the target cursor to the write cursor to know; see the sketch after these points). It's not guaranteed to work. The number of samples the play and write cursors move each frame is less than the latency (write_cursor - play_cursor), so Casey is trying to write as close as possible to the play cursor (to minimize latency) even if that's before the write cursor; but he also writes enough samples to keep the sound continuous in case of frame drops, which adds latency because he doesn't overwrite previously written samples.

- DirectSound is an old API; it has been emulated on top of WASAPI since Windows Vista. You can find a version of the Handmade Hero audio code implemented directly in WASAPI in this thread: Day 19 - Audio Latency.
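
If you wanted to check whether the fill actually starts inside the play/write region, a rough diagnostic like this could work (it is not in Casey's day 009 code, just a sketch using the variables from the code quoted below, placed right after the ByteToLock and TargetCursor computations). It unwraps the positions relative to the play cursor so the comparison survives the circular buffer wrapping:

DWORD UnwrappedWriteCursor = WriteCursor;
if(UnwrappedWriteCursor < PlayCursor)
{
    UnwrappedWriteCursor += SoundOutput.SecondaryBufferSize;
}
DWORD UnwrappedByteToLock = ByteToLock;
if(UnwrappedByteToLock < PlayCursor)
{
    UnwrappedByteToLock += SoundOutput.SecondaryBufferSize;
}
if(UnwrappedByteToLock < UnwrappedWriteCursor)
{
    // The first byte we are about to fill is still inside the
    // play -> write region that MSDN says not to touch.
    OutputDebugStringA("Fill region starts before the write cursor\n");
}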

2. Why use a sample index at all - shouldn't we instead start at the play/write cursor and fill the latency amount past there?

The running sample index marks the last place we wrote (1 past the last sample written). The write cursor might be before that point, meaning we wrote more samples in the previous frame than the write cursor has moved since. So we still need the running sample index.

3. Again, why use the sample index - it doesn't track time. It arbitrarily tracks where we are in a sine wave, and we increment that value without considering how much time has passed.

A sample is a unit of time. If the buffer is sampled at 48000Hz, a sample is 1/48000 of a second. We don't use the sample index to track the sine wave; we use the tSine value (in day 9), which is incremented after writing each sample.
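
To make that concrete, here is roughly what the per-sample loop inside Win32FillSoundBuffer does (paraphrased from memory, so the details may not match Casey's day 9 code exactly; int16 and Pi32 are the series' usual typedef and pi constant, and Region/RegionSize stand for one of the regions returned by the buffer Lock call):

// SampleOut points at the start of one locked region of the secondary buffer.
DWORD RegionSampleCount = RegionSize/SoundOutput->BytesPerSample;
int16 *SampleOut = (int16 *)Region;
for(DWORD SampleIndex = 0; SampleIndex < RegionSampleCount; ++SampleIndex)
{
    // tSine is what tracks the sine wave...
    int16 SampleValue = (int16)(sinf(SoundOutput->tSine)*SoundOutput->ToneVolume);
    *SampleOut++ = SampleValue; // left channel
    *SampleOut++ = SampleValue; // right channel

    SoundOutput->tSine += 2.0f*Pi32*1.0f/(float)SoundOutput->WavePeriod;
    // ...while RunningSampleIndex just counts samples written, one per 1/48000 of a second.
    ++SoundOutput->RunningSampleIndex;
}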

The following threads might contain additional information:
Day 20: The Tragedy of Cases
[Day 019] - Possible solution to the 3-frame latency

/* From day 009 */
// NOTE(casey): DirectSound output test
DWORD PlayCursor;
DWORD WriteCursor;
if(SUCCEEDED(GlobalSecondaryBuffer->GetCurrentPosition(&PlayCursor, &WriteCursor)))
{
    DWORD ByteToLock = ((SoundOutput.RunningSampleIndex*SoundOutput.BytesPerSample) %
                        SoundOutput.SecondaryBufferSize);
    
    DWORD TargetCursor =
        ((PlayCursor +
          (SoundOutput.LatencySampleCount*SoundOutput.BytesPerSample)) %
         SoundOutput.SecondaryBufferSize);
    DWORD BytesToWrite;
    // TODO(casey): Change this to using a lower latency offset from the playcursor
    // when we actually start having sound effects.
    if(ByteToLock > TargetCursor)
    {
        BytesToWrite = (SoundOutput.SecondaryBufferSize - ByteToLock);
        BytesToWrite += TargetCursor;
    }
    else
    {
        BytesToWrite = TargetCursor - ByteToLock;
    }
    
    Win32FillSoundBuffer(&SoundOutput, ByteToLock, BytesToWrite);
}
Thanks for your considered response, mrmixer.

I see now that the sample index makes sure we are writing a continuous sound to the output. I guess the point of it is because we are writing a sine wave? So we are basically filling the buffer with a sine wave. So the sample index has nothing to do with "audio sample" as in "this is how you would _sample_ the audio you need to send to the buffer". Rather it's tracking specifically how to write a function like a sine wave... so it could be called FunctionSampleIndex...

Obviously I need to keep watching to see what Casey does when you need to play fixed audio like WAVs and sequence them.

The whole play/write cursor thing is annoying, but as you mentioned I didn't hear any pops in my audio, so I guess what we're doing is fine. Going forward I would probably be more careful in my own code base, if those concepts even exist in newer APIs like WASAPI.

Also thanks for the links you sent.

All the best.
A sample here is the value we use to represent the audio intensity (volume) at a point in time. It's more or less the equivalent of a pixel in an image.

Sound is continuous, which means it would need an infinite number of values to be represented exactly. So to reduce the amount of data needed, we say "we will measure the value of the sound at a regular interval and store that". That regular interval is the sampling rate (48000Hz in our case). A sample is the value we choose to represent the sound intensity of an interval. More information: Sampling (signal processing).
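
A tiny standalone sketch of that idea (the 440Hz tone and the amplitude are just illustrative values): sample N corresponds to time N/48000 seconds, and each sample stores the value of the continuous signal measured at that instant.

#include <math.h>
#include <stdint.h>

#define SAMPLES_PER_SECOND 48000

int16_t Samples[SAMPLES_PER_SECOND]; // one second of mono audio

void FillOneSecond(void)
{
    for(int SampleIndex = 0; SampleIndex < SAMPLES_PER_SECOND; ++SampleIndex)
    {
        float t = (float)SampleIndex/(float)SAMPLES_PER_SECOND; // time in seconds
        // Measure a 440Hz sine wave at that instant and store it as a 16-bit sample.
        Samples[SampleIndex] = (int16_t)(sinf(2.0f*3.14159265f*440.0f*t)*3000.0f);
    }
}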

There is a theorem that states that to represent a frequency you need a sampling rate of at least twice that frequency (the Nyquist–Shannon sampling theorem). Humans can hear sounds between roughly 20Hz and 20000Hz, so we need a sampling rate of at least 40000Hz.

This video is also interesting: https://xiph.org/video/vid2.shtml

Very interesting video. I've done some work in function approximation but haven't looked at how it applies to sound.

A lot of what I was confused about seems to be dealt with in the later videos in the Handmade Hero series.

Thanks again for the discussion, though.