Lag in Win32BeginRecordingInput

On day 25 we ran into a lag in Win32BeginRecordingInput. I'll try to explain why the lag happens in the latest implementation, when CopyMemory is called.

Summary: accessing virtual memory pages that do not exist in physical memory takes time ;-)

CopyMemory copies the memory from State->GameMemoryBlock to State->ReplayBuffer->MemoryBlock. Let's take a look at memory usage at the point when CopyMemory is called. We have allocated GameMemoryBlock with VirtualAlloc at the beginning of the program, and we have asked Windows to commit this memory. We have obtained the ReplayBuffer memory block by creating a file mapping.

Each of the two blocks is 1GB in size, but we have not really accessed most of that memory (I'm ignoring the game_state here, because it is really tiny compared to 1GB).

If we take a look at the memory related statistics in Task Manager, we can see that our private working set is only 6MB. On the other hand, the commit size is around 1GB. The private working set represents the memory that is currently backed by physical memory and not shared with other processes (one example of memory that is shared with other processes is code that resides in a DLL that is loaded at the same virtual address in different processes). The private working set is low because we have not touched most of the memory that was reserved/committed.

When CopyMemory finishes, our private working set jumps to 1GB. Why? We have been reading bytes from GameMemoryBlock, memory that was obtained from VirtualAlloc. As we access this memory, Windows must ensure that it is backed by physical memory: when we access a page for the first time, Windows must allocate a zeroed-out physical page and map it in. So there is some additional work that must be done when reading this memory for the first time.
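A minimal sketch of that first-touch behaviour, not the actual Handmade Hero code. The helper names (AllocateCommitted, FreeCommitted, CountNonZeroAfterFirstTouch) are made up for illustration, and the page size is passed in by the caller rather than queried. The non-Windows branch uses calloc only so that the sketch also builds elsewhere; it is not a claim about how other systems hand out pages.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(_WIN32)
#include <windows.h>
// Reserve and commit in one call, the same way GameMemoryBlock is described above.
static void *AllocateCommitted(size_t Size)
{
    return VirtualAlloc(0, Size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
}
static void FreeCommitted(void *Memory)
{
    VirtualFree(Memory, 0, MEM_RELEASE);
}
#else
#include <stdlib.h>
// Assumption: a plain zeroed allocation stands in for VirtualAlloc off Windows.
static void *AllocateCommitted(size_t Size)
{
    return calloc(1, Size);
}
static void FreeCommitted(void *Memory)
{
    free(Memory);
}
#endif

// Reads one byte from every page of a freshly allocated block.
// Each read is the first access to its page, so this loop is where the
// page faults happen. Returns how many of those bytes were non-zero,
// or -1 if the allocation failed.
static long CountNonZeroAfterFirstTouch(size_t Size, size_t PageSize)
{
    volatile uint8_t *Memory = (volatile uint8_t *)AllocateCommitted(Size);
    long NonZero = 0;
    size_t Offset;

    if (!Memory || PageSize == 0)
    {
        if (Memory) FreeCommitted((void *)Memory);
        return -1;
    }

    for (Offset = 0; Offset < Size; Offset += PageSize)
    {
        if (Memory[Offset] != 0)
        {
            ++NonZero;
        }
    }

    FreeCommitted((void *)Memory);
    return NonZero;
}
```

Calling CountNonZeroAfterFirstTouch with a 1GB size and a 4096 byte page size while watching the private working set in Task Manager should show the same kind of jump that CopyMemory causes; every byte read comes back as zero.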

On the other side of the equation, we are writing to ReplayBuffer memory that is backed by a memory mapped file, and this takes time too (although the actual disk writes are delayed by Windows). Windows optimizes writes to memory mapped files: although we write 1GB of data, only 512KB will reside in our working set after the copy is complete (you can verify this with VMMap).

Let's try CopyMemory again! The second call is faster, because the physical pages for GameMemoryBlock have already been allocated (most of them are zero). However, writing to the memory mapped file will still cause page faults, because not all of its pages are in memory.

After the second pass, all pages (from GameMemoryBlock and ReplayBuffer) will be in memory, so if we call CopyMemory for the third time, no page faults are generated and the third copy is the fastest one.
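The three passes can be timed with something like the sketch below. It is a simplification under stated assumptions: both blocks are plain committed memory (the real destination is a file mapping, which is not reproduced here), memcpy stands in for CopyMemory, clock() is only a coarse timer, and the names SketchAllocate, SketchFree and TimeThreeCopies are hypothetical. The non-Windows branch exists only so the sketch builds elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#if defined(_WIN32)
#include <windows.h>
static void *SketchAllocate(size_t Size)
{
    return VirtualAlloc(0, Size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
}
static void SketchFree(void *Memory)
{
    if (Memory) VirtualFree(Memory, 0, MEM_RELEASE);
}
#else
#include <stdlib.h>
// Assumption: a zeroed heap allocation stands in for VirtualAlloc off Windows.
static void *SketchAllocate(size_t Size)
{
    return calloc(1, Size);
}
static void SketchFree(void *Memory)
{
    free(Memory);
}
#endif

// Copies Source to Dest three times and stores the duration of each pass
// (in seconds) in Seconds. Returns 1 on success, 0 if setup failed.
static int TimeThreeCopies(size_t Size, double Seconds[3])
{
    uint8_t *Source = (uint8_t *)SketchAllocate(Size);
    uint8_t *Dest = (uint8_t *)SketchAllocate(Size);
    int Result = 0;

    if (Source && Dest && Size > 0)
    {
        volatile uint8_t Sink = 0;
        int Pass;
        for (Pass = 0; Pass < 3; ++Pass)
        {
            clock_t Start = clock();
            memcpy(Dest, Source, Size);
            clock_t End = clock();

            // Read the result back so the copy cannot be optimized away.
            Sink = Dest[Size - 1];
            Seconds[Pass] = (double)(End - Start) / CLOCKS_PER_SEC;
        }
        (void)Sink;
        Result = 1;
    }

    SketchFree(Source);
    SketchFree(Dest);
    return Result;
}
```

With a 1GB size the first pass is the one expected to pay for the first-touch page faults on both blocks. Because the destination here is not file backed, this sketch does not show the extra page faults that the real second pass still takes on the memory mapped file.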

The memory usage profile can be analyzed with the Windows Performance Toolkit (wpa.exe, xperf and friends). Under the hood, these tools use the Worst API Ever Made (Event Tracing for Windows). Luckily, we can just use the GUI provided by Microsoft ;-)

Take a look at the picture below. The first graph shows CPU usage. I have inserted a 3 second sleep before each CopyMemory so that the graphs are easier to read. The first spike in CPU usage is the program startup; all the others are from CopyMemory. We can see that the third call to CopyMemory is the fastest one, and that it does not trigger any page faults (second and third graphs). The last graph shows that disk activity happens asynchronously (in the background): it is not aligned with CPU usage.

Here is the link to the image (it looks like embedding attachments does not work in this forum):
https://drive.google.com/file/d/0...VcE5qQmdLbFBMR1E/view?usp=sharing


We might be able to affect the way Windows handles the memory by using flags such as FILE_ATTRIBUTE_TEMPORARY or SEC_COMMIT. This is left as an exercise for the reader ;-)

Edited by matra on