I'm dealing with something in Media Foundation that I need some help with. Basically I have to create my own video decoder that uses only the Source Reader. I have successfully decoded the video stream using the H.264 decoder, as well as two audio streams: one is the SFX stream and the other is the language stream. The problem is that I have to synchronize them. I read some articles saying the video should sync to the audio, because the response time of the human ear is much faster than that of the eye (reportedly somewhere around 5 ms). So basically what I did is: right before I want to sample a new video frame, I take the presentation time of the current video sample and of the other two audio samples and compare them like this:

// Presentation times, in 100-nanosecond units.
LONGLONG llTimestamp = 0;
LONGLONG audioTimestamp = 0;
LONGLONG voTimestamp = 0;
BOOL readNewSample = TRUE;

if (m_pOutputVideoSample && SUCCEEDED(m_pOutputVideoSample->GetSampleTime(&llTimestamp))
    && m_pOutputAudioSample[Sfx] && SUCCEEDED(m_pOutputAudioSample[Sfx]->GetSampleTime(&audioTimestamp))
    && m_pOutputAudioSample[Lang] && SUCCEEDED(m_pOutputAudioSample[Lang]->GetSampleTime(&voTimestamp)))
{
    if (llTimestamp > audioTimestamp || llTimestamp > voTimestamp)
    {
        // The current video sample has not expired yet (it is still ahead
        // of at least one audio stream), so don't read a new one for now.
        readNewSample = FALSE;
    }
}

if (readNewSample)
{
    // Retrieve a sample from the source reader.
    IMFSample* pOutputSample = NULL;
    DWORD streamIndex = 0;
    DWORD dwStreamFlags = 0;

    HRESULT hr = m_pReader->ReadSample(
        (DWORD)MF_SOURCE_READER_FIRST_VIDEO_STREAM,    // Stream index.
        0,                                             // Flags.
        &streamIndex,                                  // Receives the actual stream index.
        &dwStreamFlags,                                // Receives status flags.
        &llTimestamp,                                  // Receives the time stamp.
        &pOutputSample                                 // Receives the sample or NULL. If non-NULL, the caller must release it.
        );

    ...
    ...

}
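
One refinement I've been considering on top of this (not something in the code above, just an idea): GetSampleTime reports presentation time in 100-nanosecond units, so a strict > comparison can hold the video back even when the streams are only microseconds apart. Here is a minimal sketch with a small tolerance window; the 5 ms threshold and the helper name are my own assumptions:

// Sample times from IMFSample::GetSampleTime are in 100-ns units,
// so 1 ms == 10,000 units. The 5 ms tolerance is an assumed value.
const LONGLONG kSyncToleranceHns = 5 * 10000LL;

BOOL VideoIsAhead(LONGLONG videoTs, LONGLONG audioTs)
{
    // Hold back the video only when it leads the audio by more than the tolerance.
    return (videoTs - audioTs) > kSyncToleranceHns;
}

// Usage in the gate above:
// readNewSample = !(VideoIsAhead(llTimestamp, audioTimestamp)
//                || VideoIsAhead(llTimestamp, voTimestamp));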


Basically what it checks for is: if the presentation time of either current audio frame is earlier than the video frame's, we don't sample a new video frame. And this works well; the two audio streams seem to be synchronized perfectly with the video. So my question is: is this the correct way of doing it? I suspect this is not how it should be done, because Media Foundation has the Media Session, which supports synchronized playback of streams using a presentation clock (the Media Session is still kinda cloudy to me), but the platform I'm writing for does not support it. Maybe I have to actually use the presentation clock manually and synchronize them myself? Sorry that the question isn't really relevant to Handmade Hero, but I think this is the best place to get an answer on this.
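
For reference, here is my rough understanding of what using the presentation clock manually might look like, based on the documented MFCreatePresentationClock and MFCreateSystemTimeSource APIs. This is only a sketch under those assumptions; I haven't verified it on my platform, and error handling and Release() calls are omitted:

#include <mfapi.h>
#include <mfidl.h>
#pragma comment(lib, "mfplat.lib")
#pragma comment(lib, "mf.lib")
#pragma comment(lib, "mfuuid.lib")

void RunWithPresentationClock()
{
    MFStartup(MF_VERSION);                    // assumes MF is initialized once at startup

    IMFPresentationClock*      pClock      = NULL;
    IMFPresentationTimeSource* pTimeSource = NULL;

    MFCreatePresentationClock(&pClock);
    MFCreateSystemTimeSource(&pTimeSource);   // time source backed by the system clock
    pClock->SetTimeSource(pTimeSource);
    pClock->Start(0);                         // start at presentation time 0

    // Per sample: present it only once the clock has reached its
    // presentation time (both values are in 100-ns units).
    MFTIME hnsNow = 0;
    pClock->GetTime(&hnsNow);
    // if (sampleTimestamp <= hnsNow) { /* render/deliver the sample */ }
}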