Inconsistent update-render timings in Windows

Flyingsand

#12344

July 4, 2017

I recently finished writing the first pass of the Windows platform layer for a game side-project of mine. Most of the fundamental pieces are in place and working (audio, input, multithreading, SIMD). Once I got it up and running and started examining it with the debug profiling system I put in place, I noticed a wildly inconsistent frame timing in debug mode. I mean on the order of like 2.7ms -> 18.6ms! And that's just 7 frames apart! (See screenshots below).
Screen 1
Screen 2

As can be seen in the screenshots, it's not just one function that's to blame -- all of them have a significantly higher cycle count. This suggests, to me at least, that it might be a threading issue of some sort (since this particular timing measurement is within the game's update and render function, SwapBuffers or the timing of the main loop shouldn't have an effect on it). Testing it with multithreading off, I got update-render times of around 5 - 7ms (still in debug mode). In release mode, I'm seeing fairly stable timings of around 1.3 - 1.7ms with multithreading on.

Does anyone know what might be causing this huge variation in timing in debug mode with multithreading? Or have any tips on how to investigate it? I've implemented multithreading much the same way as Casey has on Handmade Hero, with the exception that I'm doing it by scanlines instead of by tiles because that worked better in my case. I've also implemented both SSE 4.2 and AVX 2 SIMD in the renderer (and yes, it's a software renderer for now).

Also, I've noticed none of this at all on the Mac version (I do most of my dev work on my iMac). There, I get a consistent 1.5 - 2.2ms update-render time in debug mode, and 0.4 - 0.6 ms in release mode (both with multithreading on). My iMac is a more powerful machine than my Windows laptop, but still.. the huge discrepancy took me a bit by surprise. Might this be a Windows quirk of some kind? I'm running Windows 10 with Visual Studio 2015.

Edited by Flyingsand on July 4, 2017, 12:05am

Mārtiņš Možeiko

#12345

July 4, 2017

Have you tried running code outside of debugger (Ctrl+F5), is timing still inconsistent?

Edited by Mārtiņš Možeiko on July 4, 2017, 12:13am

Flyingsand

#12346

July 4, 2017

mmozeiko
Have you tried running code outside of debugger (Ctrl+F5), is timing still inconsistent?

I hadn't tried that, no. I have run it a few times now though, and it seems a little more stable. The timings seem to be sitting around 7 - 9ms pretty consistently running without the debugger. So I take it that there is some extra overhead in debugging multithreaded code in MSVC? That does make sense as there is extra work to be done keeping track of all the threads, but I'm still surprised there is such a discrepancy between MSVC on Windows and Xcode on Mac.

The timing, even though it's a little more stable running without the debugger, still seems quite high compared to the debug build on my Mac machine (~2ms) for just clearing the screen, drawing 5 quads, and a spinning triangle. :/ And this is with AVX 2 (all loads and stores are aligned).

Of course, this is the debug build, so it's not a huge deal that it's a little wonky as long as the release build is adequate. It would just be nice to have reasonably stable timings in debug mode so I don't have to keep switching back and forth..

Mārtiņš Možeiko

#12349

July 4, 2017

Can you try omitting blitting part to screen? Just render to memory buffer and don't blit. This is to see what is the slow part - GDI or actual rendering code generated by compiler. If it is GDI that is slow then OSX probably does use OpenGL or similar GPU friendly way to get image to screen. If speed difference stays the same, then it a compiler issue and there's little you can do. In such case maybe you can try to switch to clang compiler, latest MSVC has reasonable integration with it including some debugging support.

Flyingsand

#12361

July 5, 2017

mmozeiko
Can you try omitting blitting part to screen? Just render to memory buffer and don't blit. This is to see what is the slow part - GDI or actual rendering code generated by compiler. If it is GDI that is slow then OSX probably does use OpenGL or similar GPU friendly way to get image to screen. If speed difference stays the same, then it a compiler issue and there's little you can do. In such case maybe you can try to switch to clang compiler, latest MSVC has reasonable integration with it including some debugging support.

I did do this actually when I first noticed the inconsistent timings. I commented out the blit and SwapBuffers, and the variation in the update-render timings remained, as did the relatively high timing values. And while the renderer itself is all software for now as I mentioned, the final bitmap is drawn to the screen using OpenGL (compatibility profile since I'm just doing a single bitmap for now) not GDI or GDI+. So the main game loop in pseudocode is:

while(1) {
    // Record start time
    
    // Process input

    game.update_and_render(...);

    render_to_display(&buffer);
    SwapBuffers(hdc);

    // Record end time, and sleep if needed
}

And just for the sake of clarity, the inconsistent timings are within

1	update_and_render

, not the timings in the main loop, so they should be independent of the main loop's sleep, etc.

So yeah, it seems that maybe it is a compiler issue.. I think that will be interesting to see what difference it makes using Clang for the Windows build. Or I might try an older version of MSVC. I don't recall that we ever encountered this issue in Handmade Hero, and Casey is using VS 2013. Although he does still build with optimizations always on for the renderer if I'm not mistaken, so I guess that's another option.

Flyingsand

#12391

July 7, 2017

Ok, so I've run some tests in both VS 2015 and VS2013 on the fill rect (AVX2) function that was taking a big chunk of the update-render time. I wanted to include Clang as well, but I haven't gotten it working yet. I downloaded the Clang update in VS2015, but that gave me all manner of crazy errors.. So it looks like maybe I'll have to download and build it manually as it says here on LLVM's site, but that's a bit too much work for my liking right now, so I'm going to hold off on that. In any case, my findings indicate that it is clearly a VS2015 problem.

I isolated the fill rect function and ran it 100 times in a loop, and did this 3 times in each VS2015 and VS2013. Here is a graph of the results:
Graph

Wow! Pretty obvious that VS2013 produces much more optimal code in debug mode. (And yes, all runs were done in debug mode). That's nuts that there is such a difference between these two versions. I guess I'll be back-migrating my project to VS2013. Or maybe it's worth checking out VS2017 to see if anything has improved there.

For reference, here is the code I used for the timing:

#include "fs_lib.h"
#include "fs_rect.h"
#include "fs_vector.h"

#include <immintrin.h>
#include <Windows.h>

#define ALIGN __declspec(align(16))

typedef fsVec<fsr32, 2> fsv2f;
typedef fsRect<fsi32, 2> fsr2i;
typedef fsVec<fsr32, 4> fsColorf;

struct fsOffscreenBuffer {
	fsu32 *buffer;
	fsu32 width;
	fsu32 height;
	fsu32 pitch;
};

struct fsRenderContext {
	fsOffscreenBuffer *buffer;
	fsr2i clipRect;
	fsu32 scanlineTotal;
};


static void
_fix_minY(fsi32 *minY, fsu32 scanline, fsu32 scanlineTotal) {
	fsi32 val = *minY;
	if (val % scanlineTotal == 0) {
		*minY += scanline;
	} else {
		fsi32 scanlineY = (val / scanlineTotal) * scanlineTotal + scanline;
		*minY = (scanlineY < val ? scanlineY + scanlineTotal : scanlineY);
	}
}

static void
_mm256_get_clip_masks_aligned(fsi32 *minX, fsi32 *maxX, __m256i *startMask, __m256i *endMask) {
	fsi32 x0 = *minX;
	fsi32 x1 = *maxX;
	fsi32 width = x1 - x0;

	if (!(x1 & 7)) {
		*endMask = _mm256_set1_epi32(-1);
	} else {
		fsu32 mask = x1 & 7;
		ALIGN fsu32 clipMask[] = {
			(mask > 0 ? 0xFFFFFFFF : 0x0),
			(mask > 1 ? 0xFFFFFFFF : 0x0),
			(mask > 2 ? 0xFFFFFFFF : 0x0),
			(mask > 3 ? 0xFFFFFFFF : 0x0),
			(mask > 4 ? 0xFFFFFFFF : 0x0),
			(mask > 5 ? 0xFFFFFFFF : 0x0),
			(mask > 6 ? 0xFFFFFFFF : 0x0),
			(mask > 7 ? 0xFFFFFFFF : 0x0)
		};
		*maxX = (*maxX & ~7) + 8;
		*endMask = _mm256_load_si256((__m256i *)clipMask);
	}

	if (!(x0 & 7)) {
		if (width < 8) {
			*startMask = *endMask;
		} else {
			*startMask = _mm256_set1_epi32(-1);
		}
	} else {
		fsu32 clip = x0 & 7;
		ALIGN fsu32 clipMask[] = {
			(clip > 0 ? 0x0 : 0xFFFFFFFF),
			(clip > 1 ? 0x0 : 0xFFFFFFFF),
			(clip > 2 ? 0x0 : 0xFFFFFFFF),
			(clip > 3 ? 0x0 : 0xFFFFFFFF),
			(clip > 4 ? 0x0 : 0xFFFFFFFF),
			(clip > 5 ? 0x0 : 0xFFFFFFFF),
			(clip > 6 ? 0x0 : 0xFFFFFFFF),
			(clip > 7 ? 0x0 : 0xFFFFFFFF)
		};
		*minX &= ~7;
		*startMask = _mm256_load_si256((__m256i *)clipMask);
	}
}

static void
_fill_rect_avx2(fsr2i rect, fsColorf color, fsRenderContext *context, fsu32 scanline) {
	fsr2i fillRect = fs_rect_intersect(rect, context->clipRect);
	if (!fs_rect_has_area(fillRect)) {
		return;
	}

	fsi32 minX = fs_rect_left_edge(fillRect);
	fsi32 minY = fs_rect_top_edge(fillRect);
	fsi32 maxX = fs_rect_right_edge(fillRect);
	fsi32 maxY = fs_rect_bottom_edge(fillRect);

	fsu32 scanlineTotal = context->scanlineTotal;
	_fix_minY(&minY, scanline, scanlineTotal);

	fsr32 invSrcAlpha = 1.f - color.a;
	color *= 255.f;

	__m256i kMaskFF = _mm256_set1_epi32(0xFF);
	__m256 colorR_8 = _mm256_set1_ps(color.r);
	__m256 colorG_8 = _mm256_set1_ps(color.g);
	__m256 colorB_8 = _mm256_set1_ps(color.b);
	__m256 colorA_8 = _mm256_set1_ps(color.a);
	__m256 invSrcAlpha_8 = _mm256_set1_ps(invSrcAlpha);

	__m256i rowStartClipMask_8, rowEndClipMask_8;
	_mm256_get_clip_masks_aligned(&minX, &maxX, &rowStartClipMask_8, &rowEndClipMask_8);

	fsOffscreenBuffer *buffer = context->buffer;
	for (fsi32 y = minY; y < maxY; y += scanlineTotal) {
		fsu32 *pixel = buffer->buffer + (intptr_t)(y * buffer->width + minX);
		__m256i writeMask_8 = rowStartClipMask_8;
		for (fsi32 x = minX; x < maxX; x += 8) {
			__m256i pixel_8 = _mm256_load_si256((__m256i *)pixel);
			__m256i pixelR_8 = _mm256_and_si256(pixel_8, kMaskFF);
			__m256i pixelG_8 = _mm256_and_si256(_mm256_srli_epi32(pixel_8, 8), kMaskFF);
			__m256i pixelB_8 = _mm256_and_si256(_mm256_srli_epi32(pixel_8, 16), kMaskFF);

			__m256 blendedR_8 = _mm256_add_ps(_mm256_mul_ps(_mm256_cvtepi32_ps(pixelR_8), invSrcAlpha_8), colorR_8);
			__m256 blendedG_8 = _mm256_add_ps(_mm256_mul_ps(_mm256_cvtepi32_ps(pixelG_8), invSrcAlpha_8), colorG_8);
			__m256 blendedB_8 = _mm256_add_ps(_mm256_mul_ps(_mm256_cvtepi32_ps(pixelB_8), invSrcAlpha_8), colorB_8);
			__m256 blendedA_8 = colorA_8;

			__m256i blendedRG_8 = _mm256_or_si256(_mm256_cvtps_epi32(blendedR_8), _mm256_slli_epi32(_mm256_cvtps_epi32(blendedG_8), 8));
			__m256i blendedBA_8 = _mm256_or_si256(_mm256_slli_epi32(_mm256_cvtps_epi32(blendedB_8), 16), _mm256_slli_epi32(_mm256_cvtps_epi32(blendedA_8), 24));

			__m256i out_8 = _mm256_or_si256(blendedRG_8, blendedBA_8);
			__m256i maskedOut_8 = _mm256_or_si256(_mm256_and_si256(writeMask_8, out_8), _mm256_andnot_si256(writeMask_8, pixel_8));
			_mm256_store_si256((__m256i *)pixel, maskedOut_8);

			pixel += 8;
			if (x + 16 < maxX) {
				writeMask_8 = _mm256_set1_epi32(-1);
			} else {
				writeMask_8 = rowEndClipMask_8;
			}
		}
	}
}

LARGE_INTEGER g_timebaseFreq;

uint64_t 
get_absolute_time() {
	LARGE_INTEGER time;
	QueryPerformanceCounter(&time);
	return time.QuadPart;
}

float 
get_seconds_elapsed(uint64_t absStart, uint64_t absEnd) {
	uint64_t elapsed = absEnd - absStart;
	return ((float)elapsed / (float)g_timebaseFreq.QuadPart);
}


int main(int argc, const char * argv[])
{
	QueryPerformanceFrequency(&g_timebaseFreq);

	fsr2i rect = fs_rect_create(0, 0, 1240, 780);
	fsColorf color = fs_vcreate<fsr32, 4>(0.f, 0.4f, 0.8f, 1.f);

	fsOffscreenBuffer buffer;
	buffer.buffer = (fsu32 *)malloc(sizeof(fsu32) * rect.size.w * rect.size.h);
	buffer.width = rect.size.w;
	buffer.height = rect.size.h;

	fsRenderContext renderContext;
	renderContext.buffer = &buffer;
	renderContext.clipRect = rect;
	renderContext.scanlineTotal = 1;

	int it = 0;
	while (it < 100) {
		uint64_t startTime = get_absolute_time();

		_fill_rect_avx2(rect, color, &renderContext, 0);

		uint64_t endTime = get_absolute_time();
		float elapsed = get_seconds_elapsed(startTime, endTime);
		printf("%d, %f\n", it, elapsed * 1000.f);

		it++;
	}

	free(buffer.buffer);

	return 0;
}

Mārtiņš Možeiko

#12392

July 7, 2017

And this is only for debug build? For release with optimizations performance is good, right?

Flyingsand
So it looks like maybe I'll have to download and build it manually as it says here on LLVM's site

You don't need to build it yourself, you can simply take windows binaries from here: http://releases.llvm.org/download.html

Btw, if you are using AVX2 and have Haswell+ CPU then you can use FMA instructions. With that you can replace add(mul(a, b), c) with fma(a, b, c):

From

1	__m256 blendedR_8 = _mm256_add_ps(_mm256_mul_ps(_mm256_cvtepi32_ps(pixelR_8), invSrcAlpha_8), colorR_8);

to

1	__m256 blendedR_8 = _mm256_fmadd_ps(_mm256_cvtepi32_ps(pixelR_8), invSrcAlpha_8, colorR_8);

I don't know if that will help for you, but for code I've been writing this gives significant performance boost.

Another potential optimization - if you don't plan to go sRGB route, then do the blending directly on bytes, don't convert to floats. Then instead of 8 pixels you'll be able to process 32 pixels at once. At that point maybe it would make sense to switch to 4x8 tiles instead of doing 1x32.

Flyingsand

#12396

July 8, 2017

mmozeiko
And this is only for debug build? For release with optimizations performance is good, right?

Performance is certainly better, and there is much less fluctuation, but VS2013 still consistently beats VS2015 in release mode. Here is a graph of the same three runs for each version of VS with the same number of iterations:
Graph

mmozeiko
Flyingsand
So it looks like maybe I'll have to download and build it manually as it says here on LLVM's site

You don't need to build it yourself, you can simply take windows binaries from here: http://releases.llvm.org/download.html

Oh, good stuff! I'll look into that this weekend. Thanks!

mmozeiko
Btw, if you are using AVX2 and have Haswell+ CPU then you can use FMA instructions. With that you can replace add(mul(a, b), c) with fma(a, b, c):

From
1
__m256 blendedR_8 = _mm256_add_ps(_mm256_mul_ps(_mm256_cvtepi32_ps(pixelR_8), invSrcAlpha_8), colorR_8);
to
1
__m256 blendedR_8 = _mm256_fmadd_ps(_mm256_cvtepi32_ps(pixelR_8), invSrcAlpha_8, colorR_8);
I don't know if that will help for you, but for code I've been writing this gives significant performance boost.

Another potential optimization - if you don't plan to go sRGB route, then do the blending directly on bytes, don't convert to floats. Then instead of 8 pixels you'll be able to process 32 pixels at once. At that point maybe it would make sense to switch to 4x8 tiles instead of doing 1x32.

Good point on the FMA instruction. I do have that on my machine, and it makes a noticeable difference. Using FMA, the performance improves in VS2015 by shaving off roughly 1.0 - 1.5 ms in both debug and release, but the fluctuation in timings is still there in debug (and still exhibits more fluctuation than I would like in release, but it is more acceptable). On VS2013 it's looking pretty good with FMA -- getting consistently sub 1.0 ms timings.

I haven't decided yet whether to go with sRGB, but if I do, another great tip. Thanks!

Ultimately, I think going with VS2013 is the better choice at this point. I don't use a lot of advanced C++ features (mostly some judicious use of templates and lambdas), so I can deal with the lack of support in an older version of VS.

Flyingsand

#12408

July 9, 2017

So I've downloaded both Clang and Visual Studio 2017, and ran the same tests (for Clang I used VS2015). For the sake of completeness, I've added them to the existing graphs:
Debug graph
Release graph

That spike in the release graph for VS2017 has to be some kind of fluke.. (It was just over 22 ms!) Generally speaking, VS2017 seems to be even a bit worse than VS2015! Anyway, it's not looking too good for anything post-VS2013.. :(