Integrating gameGetSoundSamples() with CoreAudio

Hey guys!

I am trying to implement sound output using the OS X CoreAudio APIs. The problem is that CoreAudio wants to pull PCM data from me via a so-called render callback, and it calls this callback from another thread. I could just call gameGetSoundSamples() from my render callback, but I think that's a bad idea because I don't want gameGetSoundSamples() to have to worry about thread safety.

What I want to do is pull data from the game via gameGetSoundSamples() in my main loop and somehow make that data accessible to my render callback. The only solution I have in mind is to implement a ring buffer, like DirectSound does. My main loop can write to this buffer when it gets new data from the game, and my render callback can read from it when CoreAudio asks for new data. But then I need to make the ring buffer operations thread-safe somehow.

What other options do I have? And if I really do want this DirectSound-style ring buffer, how can I make it thread-safe? Apple's reference pages say I can't just use locks, because the render callback must return immediately without blocking on anything. Maybe I should use some API for atomic operations?
So I've actually implemented both of the methods you mention for getting the sound from the game into Core Audio. Currently I'm calling gameGetSoundSamples() from the render callback because it just ended up being easier, and there are no threading concerns in the current implementation because nothing else touches the game's sound sample data.

Roughly speaking, I have:
// Inside the render callback: pull the samples straight from the game.
if (globalRunning && gameIsInitialized) // "game is initialized" made explicit
{
    gameSound->frameCount = inNumberFrames;
    gameCode.getSoundSamples(nullptr, gameMemory, gameSound);
    memcpy(ioData->mBuffers[0].mData, gameSound->samples,
           inNumberFrames * bytesPerFrame);
}
else
{
    // Not running yet (or shutting down): output silence and tell Core Audio.
    memset(ioData->mBuffers[0].mData, 0, inNumberFrames * bytesPerFrame);
    *ioActionFlags |= kAudioUnitRenderAction_OutputIsSilence;
}

return noErr;


I used a ring buffer when I first set up my OS X layer, and it works as well, but you need to handle a few issues. In terms of threading, as long as the read and write pointers are only accessed from their respective read and write methods, the only potentially shared data is bytesFree (or bytesWritten, or whatever you call it). You can protect that with OSAtomicAdd32Barrier(). Other than that, you don't need locks as long as there is only one thread writing and one thread reading.
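Roughly, the shape of it is something like this (a minimal sketch, not my actual code; the RingBuffer/ringBufferWrite/ringBufferRead names and the byte-at-a-time copies are just for illustration):

#include <libkern/OSAtomic.h>
#include <stdint.h>

typedef struct
{
    uint8_t *data;
    int32_t capacity;           // total size in bytes
    int32_t writeIndex;         // only touched by the writer (main loop)
    int32_t readIndex;          // only touched by the reader (render callback)
    volatile int32_t bytesFree; // the one shared value; updated atomically
} RingBuffer;

// Writer side, called from the main loop.
static int32_t ringBufferWrite(RingBuffer *rb, const uint8_t *src, int32_t count)
{
    if (count > rb->bytesFree) count = rb->bytesFree; // don't overrun the reader
    for (int32_t i = 0; i < count; ++i)
    {
        rb->data[rb->writeIndex] = src[i];
        rb->writeIndex = (rb->writeIndex + 1) % rb->capacity;
    }
    OSAtomicAdd32Barrier(-count, &rb->bytesFree);
    return count;
}

// Reader side, called from the render callback. Never blocks.
static int32_t ringBufferRead(RingBuffer *rb, uint8_t *dst, int32_t count)
{
    int32_t available = rb->capacity - rb->bytesFree;
    if (count > available) count = available; // underrun: caller zeroes the rest
    for (int32_t i = 0; i < count; ++i)
    {
        dst[i] = rb->data[rb->readIndex];
        rb->readIndex = (rb->readIndex + 1) % rb->capacity;
    }
    OSAtomicAdd32Barrier(count, &rb->bytesFree);
    return count;
}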

However, the bigger issue is handling the inevitable case where reading outpaces writing (or vice versa), which will eventually cause the read/write pointers to cross. So as a temporary hack I adjusted the number of samples the game produced whenever the read/write pointers got too close. E.g. if bytes were being consumed faster than I was writing them, I wrote 1.5x as many samples per frame. Obviously not a long-term solution, and it can introduce some latency, but it worked. :lol:
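The hack in the main loop looked roughly like this (again just a sketch; lowWaterMark and samplesPerFrame are made-up names, and ringBufferWrite is from the sketch above):

// Main loop: ask the game for extra frames when the buffer is running low.
int32_t bytesBuffered = ringBuffer.capacity - ringBuffer.bytesFree;
int32_t framesToRequest = samplesPerFrame;
if (bytesBuffered < lowWaterMark)
{
    framesToRequest = (samplesPerFrame * 3) / 2; // write 1.5x as many samples
}
gameSound->frameCount = framesToRequest;
gameCode.getSoundSamples(nullptr, gameMemory, gameSound);
ringBufferWrite(&ringBuffer, (uint8_t *)gameSound->samples,
                framesToRequest * bytesPerFrame);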

EDIT: As for other options, I also tried Audio Queue Services, which lets you specify the size of the buffers it enqueues (as well as the number of buffers to maintain). This seemed to work fine, but in the spirit of HmH I wanted to keep things as low-level as possible.
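For reference, the Audio Queue version looks roughly like this (a sketch from memory, untested; streamDesc is an AudioStreamBasicDescription for your output format, and bufferByteSize and the fill logic are placeholders):

#include <AudioToolbox/AudioToolbox.h>

static void outputCallback(void *userData, AudioQueueRef queue,
                           AudioQueueBufferRef buffer)
{
    // Fill buffer->mAudioData with up to mAudioDataBytesCapacity bytes of
    // samples here, then hand the buffer back to the queue.
    buffer->mAudioDataByteSize = buffer->mAudioDataBytesCapacity;
    AudioQueueEnqueueBuffer(queue, buffer, 0, NULL);
}

// Setup:
AudioQueueRef queue;
AudioQueueNewOutput(&streamDesc, outputCallback, soundOutput,
                    NULL, NULL, 0, &queue);
for (int i = 0; i < 3; ++i) // the number of buffers to maintain
{
    AudioQueueBufferRef buffer;
    AudioQueueAllocateBuffer(queue, bufferByteSize, &buffer);
    outputCallback(soundOutput, queue, buffer); // prime the buffer
}
AudioQueueStart(queue, NULL);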

Edited by Flyingsand on
Flyingsand, thanks a lot. Could you please elaborate on using OSAtomic: when do I need it and when don't I? For example, if I am only reading a real32 variable from one thread and writing to it from another thread without any locks, what bad things could happen to me? Is it possible to read a garbage value somehow?
vbo
Flyingsand, thanks a lot. Could you please elaborate on using OSAtomic: when do I need it and when don't I? For example, if I am only reading a real32 variable from one thread and writing to it from another thread without any locks, what bad things could happen to me? Is it possible to read a garbage value somehow?


Sure. The OSAtomic functions give you thread safety for operations like add or increment by ensuring a consistent memory access order. That way, shared memory is properly synchronized between threads, and it's cheaper than locks. As an example, in my case I had a variable bytesFree in my ring buffer that was decremented whenever writing to the buffer (by the number of bytes written) and incremented whenever reading from it. That way, if the value of bytesFree ever dropped below 0 or grew greater than the capacity of the ring buffer, I knew that the read/write pointers had crossed.

Since that was memory shared between threads, I used OSAtomicAdd32Barrier() to increment/decrement the value during reads/writes. In this particular case you wouldn't get a garbage value out if the memory wasn't protected this way, but you could get the wrong value. I.e. a write may have just occurred and be in the process of updating bytesFree when another thread asks for the value and sees the previous value instead of what is supposed to be the currently updated value.
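So the crossing check is essentially this (a sketch; the OSAtomicAdd32Barrier(0, ...) call is just one way to get a barriered read, since adding 0 returns the current value):

int32_t bytesFree = OSAtomicAdd32Barrier(0, &ringBuffer.bytesFree);
if (bytesFree < 0 || bytesFree > ringBuffer.capacity)
{
    // The read/write pointers have crossed; resync or reset the buffer.
}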
Flyingsand, thanks for your help. I managed to implement a DirectSound-style WriteCursor/PlayCursor API for my main loop, using OSAtomic to deal with the multithreading. My latency is not very good (around 20 ms), but for now I am OK with it.

Just a quick check: is it OK to convert from signed 16-bit PCM (used by the platform-independent code) to float PCM (which is Apple's default encoding) like this:

float AppleSample = (float)CaseySample / 32768.0;

Or am I misunderstanding something simple?

Edited by Vadim Borodin on
The maximum positive value is 32767, not 32768, so you are scaling positive values down a tiny bit. But that is almost undetectable, so what you are doing is OK for games.

If you want to be very correct you need something like this:
float fSample = (float)iSample / (iSample >= 0 ? 32767.0f : 32768.0f);


There is an interesting table about how different audio libraries do this (the Int to Float column). Some divide by 32768, some divide by 32767 (then you need clamping in the negative case), and some divide by 32767 in the positive case and 32768 in the negative case. So there are options :)
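E.g. the divide-by-32767 variant needs the clamp because only one input value maps below -1.0:

float f = (float)iSample / 32767.0f;
if (f < -1.0f) f = -1.0f; // only iSample == -32768 hits this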

And if you want something faster, you can use SSE instructions for this. Here is some SSE2 code:
int count = ...;
int16_t* src = ...;
float* dst = ...;

__m128 kMul = _mm_set_ps1(1.0f / 32768.0f);
__m128i kZero = _mm_setzero_si128();

while (count >= 8)
{
    // Load 8 16-bit samples (unaligned load).
    __m128i i = _mm_loadu_si128((__m128i*)src);

    // Sign-extend 8 x int16 to 2 x 4 x int32: the compare yields 0xFFFF for
    // negative samples, which becomes the high half of each int32.
    __m128i sign = _mm_cmpgt_epi16(kZero, i);
    __m128i i0 = _mm_unpacklo_epi16(i, sign);
    __m128i i1 = _mm_unpackhi_epi16(i, sign);

    // Convert to float and scale into [-1, 1).
    __m128 f0 = _mm_mul_ps(_mm_cvtepi32_ps(i0), kMul);
    __m128 f1 = _mm_mul_ps(_mm_cvtepi32_ps(i1), kMul);

    _mm_storeu_ps(dst + 0, f0);
    _mm_storeu_ps(dst + 4, f1);

    count -= 8;
    src += 8;
    dst += 8;
}

// Scalar loop for the remaining 0-7 samples.
while (count-- > 0)
{
    *dst++ = *src++ * (1.0f / 32768.0f);
}

Just put #include <emmintrin.h> somewhere before this code.

This code converts 8 samples per loop iteration, and then converts the rest one at a time if count is not a multiple of 8.

Change "_mm_loadu_si128" to _mm_load_si128" and "_mm_storeu_ps" to "_mm_store_ps" if you can guarantee that src or dst buffers are 16-byte aligned.

Edited by Mārtiņš Možeiko on
vbo

Just a quick check: is it OK to convert from signed 16-bit PCM (used by the platform-independent code) to float PCM (which is Apple's default encoding) like this:

float AppleSample = (float)CaseySample / 32768.0;

Or am I misunderstanding something simple?


When you set up Core Audio you can just set the stream format to the same one Casey uses in the game. Core Audio will handle any conversion automatically if needed (including sample rate and bit depth).

AudioStreamBasicDescription streamDesc = {0};
streamDesc.mSampleRate = soundOutput->sampleRate;
streamDesc.mFormatID = kAudioFormatLinearPCM;
streamDesc.mFormatFlags = kAudioFormatFlagIsSignedInteger|kAudioFormatFlagIsPacked;
streamDesc.mChannelsPerFrame = 2;
streamDesc.mFramesPerPacket = 1; // Must be 1 for uncompressed audio.
streamDesc.mBytesPerPacket = soundOutput->bytesPerFrame;
streamDesc.mBytesPerFrame = soundOutput->bytesPerFrame;
streamDesc.mBitsPerChannel = (soundOutput->bytesPerFrame / streamDesc.mChannelsPerFrame) * 8;


Then you set it like this:
AudioUnitSetProperty(soundOutput->outputUnit, // This is the output Audio Unit.
                     kAudioUnitProperty_StreamFormat,
                     kAudioUnitScope_Input,
                     0, &streamDesc, sizeof(streamDesc));


This also has the advantage of making the audio buffers interleaved instead of non-interleaved (which is the default for Core Audio), which makes it easier and more efficient to copy the samples from the game into the Core Audio buffers.
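And for completeness, the render callback itself gets attached to the output unit in the same way (a sketch; renderCallback is whatever your AURenderCallback function is called):

AURenderCallbackStruct callbackStruct;
callbackStruct.inputProc = renderCallback;
callbackStruct.inputProcRefCon = soundOutput;
AudioUnitSetProperty(soundOutput->outputUnit,
                     kAudioUnitProperty_SetRenderCallback,
                     kAudioUnitScope_Input,
                     0, &callbackStruct, sizeof(callbackStruct));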