Guide - How to avoid C/C++ runtime on Windows

Oops, I gave you the wrong intrinsic. _mm_cvtsd_si32 rounds to the nearest integer instead of truncating like a C cast does. You should use _mm_cvttsd_si32, which truncates the same way a C cast does. That will probably fix the precision issues, if that was the problem.

Anyway, if the SSE intrinsics don't do what you want, just write the cast in pure C first. If you don't know how, take a look at the compiler-rt source file from the LLVM project I linked to; it should be pretty obvious how it works. Step through it with a debugger to see the values.
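For reference, a minimal sketch of what such a hand-written truncating conversion could look like (in the spirit of compiler-rt's __fixunsdfsi; assumes x is non-negative, finite, and in range - no error handling):

static unsigned int DoubleToU32_PureC(double x)
{
    // Reinterpret the IEEE-754 bits of the double (union type-punning, as compiler-rt does in C).
    union { double d; unsigned long long u; } v;
    v.d = x;

    int exponent = (int)((v.u >> 52) & 0x7FF) - 1023;                         // unbiased exponent
    unsigned long long mantissa = (v.u & 0xFFFFFFFFFFFFFULL) | (1ULL << 52);  // restore implicit leading 1

    if (exponent < 0)
        return 0;                                             // |x| < 1 truncates to 0
    if (exponent <= 52)
        return (unsigned int)(mantissa >> (52 - exponent));   // drop the fractional bits
    return (unsigned int)(mantissa << (exponent - 52));       // very large values; overflow not handled
}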

As for wchar_t - if you are using only the Windows API and only need to run on Windows, then using wchar_t is maybe OK. But once you want to go cross-platform, using wchar_t is just very wrong in my opinion. UTF-8 is used by default in most, if not all, APIs on Linux and OS X, so the only odd OS here is Windows. That's one of the reasons I'm saying UTF-8 makes sense (and there are more, of course). 1 OS vs 2 OSes ;)

But the more serious argument against wchar_t is that people assume each Unicode character is exactly one wchar_t element - that's why they say you should prefer wchar_t over UTF-8. Sure, if you deal only with English and other European languages, that assumption holds. But universally it is not correct. Once your code needs to deal with arbitrary Unicode (for example, Chinese characters outside the Basic Multilingual Plane), it will break if it assumes that one character equals one wchar_t element. Even your String class has incorrect code in many places for this reason (the Left, Right, and Mid members). For full UTF-16 support, a Unicode character can take up to two wchar_t elements, but your code will cut such characters in half, producing an invalid UTF-16 string. If you repeat such cutting and concatenating operations many times, you will end up with garbage in your string. That leads to rendered garbage, crashes, or other security issues. For example:
https://www.cvedetails.com/cve/CVE-2015-5380
https://www.cvedetails.com/cve/CVE-2012-2135
So why use wchar_t and have to support multi-wchar_t characters, if you can use UTF-8 from the start? Using UTF-8 also lets your code be exactly the same for plain ASCII and UTF-8 strings.
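To make it concrete, here's a tiny illustration (the specific character is just an example) of one Unicode character taking two wchar_t elements:

#include <stdio.h>

int main(void)
{
    // U+20BB7 is a CJK character outside the Basic Multilingual Plane;
    // in UTF-16 it is the surrogate pair 0xD842 0xDFB7.
    wchar_t text[] = L"\xD842\xDFB7";   // ONE character, TWO wchar_t elements
    printf("wchar_t elements: %u\n", (unsigned)(sizeof(text) / sizeof(text[0]) - 1));  // prints 2
    // A Left(1)-style operation that keeps only text[0] keeps a lone high
    // surrogate, which is not a valid UTF-16 string by itself.
    return 0;
}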

Edited by Mārtiņš Možeiko on

Oops, I gave you the wrong intrinsic. _mm_cvtsd_si32 rounds to the nearest integer instead of truncating like a C cast does. You should use _mm_cvttsd_si32, which truncates the same way a C cast does. That will probably fix the precision issues, if that was the problem.

That fixed it! I guess ‘close’ does only count in horseshoes and hand grenades! :) Maybe now you can cut me some slack for dropping the u in _dtoui3?

I believe with that change it might be good enough. It seems to be working identically to my 64 bit code now, with the limitations imposed by the former's smaller integer register size of course. I need to test a bit yet to be sure, but I think that's it. I can't thank you enough Mmozeiko. You've really helped. First with that _fltused thing – now this.

And you’ve made good points with the encoding thing. My code is only used in Pennsylvania where I live and work. But I do post on C++ forums such as here occasionally, and I’d like to think my code is workable anywhere it is run (China included). So I need to study up further on character encodings with an eye to making some changes.

Now that I'm nearly done with this project of eliminating the C Runtime Library, I'm wondering what limitations it might impose on the things I typically do. As I experiment with it I'll surely find out! Top of my list though are ODBC database access and Microsoft's COM (Component Object Model). I'm big into COM. It's the object model I prefer over the typical way C++ looks at OOP. If I had to hazard a guess, I'd say COM might work. Part of the reason I suspect that is that I recall Microsoft's ATL (Active Template Library) typically eliminated the C Standard Library. At least I think I have that right. I never really used ATL much or liked it very much. I preferred to do COM in the raw without all that weird science.

In terms of ODBC I’m less sure it will work loading the odbc32.lib. I’m guessing it might have dependencies on the C Std Lib. Any thoughts on this Mmozeiko?

Funny, in my brief excursions into Linux where I experimented some with Xlib, Motif (lesstif), and GTK, I seem to recall dealing with four byte characters in one or more of those above mentioned technologies. My understanding was that two byte characters were designed to accommodate all the languages on Earth. I figured the extra two bytes were to accommodate languages such as Klingon and Romulan when we eventually encounter them once we have a starship Enterprise. :)

Edited by Fred Harris on
Cool, I'm glad you got it working!

Using COM will work fine. COM objects are implemented in separate DLLs, and you don't control what those use - they might use the C runtime, or no runtime at all. That is all fine. And you don't need the C runtime to access COM objects. Simply speaking, a COM object is just a vtable you get a pointer to; the calling side doesn't care about the implementation - you call function pointers into whatever implementation they have. No C/C++ runtime is involved on your side.
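For example, a minimal sketch of the calling side (using the shell's ShellLink object purely as an arbitrary example; link ole32.lib and uuid.lib):

#include <windows.h>
#include <shlobj.h>   // CLSID_ShellLink, IID_IShellLinkW

void UseComWithoutCrt(void)
{
    CoInitializeEx(NULL, COINIT_APARTMENTTHREADED);

    IShellLinkW* link = NULL;
    HRESULT hr = CoCreateInstance(CLSID_ShellLink, NULL, CLSCTX_INPROC_SERVER,
                                  IID_IShellLinkW, (void**)&link);
    if (SUCCEEDED(hr))
    {
        // Each call here compiles to loading a function pointer from the object's
        // vtable and calling it - whatever runtime the implementing DLL uses is its business.
        link->SetPath(L"C:\\Windows\\notepad.exe");
        link->Release();
    }

    CoUninitialize();
}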

Edited by Mārtiņš Možeiko on
Thanks. Good info. Sounds like it will depend on whether the COM dll makes calls on the C runtime. That raises some interesting questions. One of the most challenging projects I ever undertook was to create an ActiveX Grid Control for use in my projects. I used PowerBASIC for that, and coded it first with the Windows Custom Control architecture. After having gotten that to work I morphed it into a COM object which supported my custom IGrid interface as well as IConnectionPointContainer and IConnectionPoint. When I finally finished it the executable size was about 49k, and with UPX it compacted down to 22k. I'm guessing it's likely the smallest grid control anywhere.

A couple years later I redid it in C++. Mostly I wanted a 64 bit version, and like I previously mentioned, PowerBASIC just does 32 bit. I agonized a bit over whether to code it in C or C++, or rather to always compile as C++ but use C idioms. I finally decided to use C++ because I couldn't live without my String class, particularly my Parse function, which you have. Only later did it occur to me that I could take that Parse code out of the class and save some binary size. Anyway, it ended up about 85 k or something like that, and UPX'ed down to about 43 k.

With this functionality of removing the C Runtime it would be interesting to see if I could reduce the code size even further. I'm excited about this!

Edited by Fred Harris on
Fantastic information mmozeiko. Thanks much.
OK Martins, so I've been trying to beef up my knowledge of character encodings with regard to your comments about the use I've made of the wchar_t type. You are recommending I just use UTF-8, and failing that, the 32 bit character type. Well, let's start with UTF-8. None of the articles I've scanned tell exactly how to actually USE UTF-8. As far as I know, in Windows using C or C++, this is an exhaustive list of the variable types I have available for string handling if I want library support...

char
wchar_t 


Of course, there are piles of type redefinitions of the above, and even wchar_t boils down to a 16-bit type on Windows (a typedef of unsigned short in C). I can find no mention of an actual UTF-8 data type. Or does one simply just use the char data type?

Getting down to even more specifics, how would I translate Dennis Ritchie's famous Hello, World! program to UTF-8, or is it already UTF-8 just by using the char data type...

#include <stdio.h>

int main(void)
{
 char szBuffer[]="Hello, World!";
 printf("%s\n",szBuffer);
 return 0;
}


If my surmises above are correct, i.e., just use the char data type and make no use of wchar_t, then I'm guessing you are against the whole tchar.h macro setup Microsoft has, for example, where _tcslen gets transmogrified into strlen through the mysterious alchemy of tchar.h if _UNICODE isn't defined, and into wcslen if it is, etc.?

Am I reading you correctly on this?

Edited by Fred Harris on
Yes, the Windows API doesn't support UTF-8. You'll need to convert to wide chars for every call. It's not that big of a deal.
char* utf8string = ...;
wchar_t wide[256]; // whatever max size you want, or get the size first and allocate on heap/pool
// pass -1 as the source length so the null terminator is converted too and 'wide' ends up terminated
MultiByteToWideChar(CP_UTF8, 0, utf8string, -1, wide, ArrayCount(wide)); // ArrayCount = sizeof(wide)/sizeof(wide[0])
WindowsApiFunction(wide, ...);

Of course you can write your own utf8-to-utf16 string converter, it's pretty trivial.
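Something along these lines, for example - a minimal sketch that assumes the input is already valid UTF-8 (no error handling):

// Minimal UTF-8 to UTF-16 converter sketch: assumes valid input, no error checking.
// Returns the number of wchar_t elements written (not counting the terminator).
static int Utf8ToUtf16(const char* src, wchar_t* dst, int dstMax)
{
    const unsigned char* s = (const unsigned char*)src;
    int out = 0;
    while (*s && out < dstMax - 1)
    {
        unsigned int cp;
        if (*s < 0x80)      { cp = *s++; }                  // 1-byte sequence (ASCII)
        else if (*s < 0xE0) { cp = (*s++ & 0x1F) << 6;
                              cp |= (*s++ & 0x3F); }        // 2-byte sequence
        else if (*s < 0xF0) { cp = (*s++ & 0x0F) << 12;
                              cp |= (*s++ & 0x3F) << 6;
                              cp |= (*s++ & 0x3F); }        // 3-byte sequence
        else                { cp = (*s++ & 0x07) << 18;
                              cp |= (*s++ & 0x3F) << 12;
                              cp |= (*s++ & 0x3F) << 6;
                              cp |= (*s++ & 0x3F); }        // 4-byte sequence
        if (cp < 0x10000)
        {
            dst[out++] = (wchar_t)cp;                       // fits in a single UTF-16 unit
        }
        else if (out < dstMax - 2)
        {
            cp -= 0x10000;                                  // encode as a surrogate pair
            dst[out++] = (wchar_t)(0xD800 + (cp >> 10));
            dst[out++] = (wchar_t)(0xDC00 + (cp & 0x3FF));
        }
        else
        {
            break;                                          // not enough room for the pair
        }
    }
    dst[out] = 0;
    return out;
}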

As for your hello world program - any string made only of bytes in the ASCII range (0-127) is a valid UTF-8 string. Your example string contains only such ASCII characters, so it is already a UTF-8 string. That's the beauty of UTF-8: your program will properly use and output a UTF-8 string as-is.

Because UTF-8 is a multi-byte encoding, you simply use any byte-sized type for storing the bytes in an array. char is fine; unsigned char is fine. char is better, because it allows you to use plain ASCII string literals directly.

All the tchar stuff is total nonsense. Why would you want to switch back to the ANSI encoding? Every modern Windows since Windows NT uses Unicode internally, so by using the A functions you are just making it perform conversions to UTF-16 anyway. So there is no reason to use the A functions if you want Unicode. Just use the W functions and drop the ANSI stuff. Use Unicode with UTF-8 encoding everywhere.

Of course you'll need to write a couple of string functions, like a character-counting strlen, but some of the standard functions can be used as-is (like strcat, strcmp, strcpy) because they don't care about encoding - they simply copy or compare bytes, which is fine for UTF-8. Here's an example of how to write an optimized strlen-style function for UTF-8: http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
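The simple, unoptimized version of that idea is just to skip the 10xxxxxx continuation bytes:

#include <stddef.h>

// Counts Unicode code points in a null-terminated UTF-8 string (assumes valid input).
// Continuation bytes look like 10xxxxxx, so count every byte that is NOT one of those.
static size_t Utf8CodepointCount(const char* s)
{
    size_t count = 0;
    for (; *s; ++s)
    {
        if (((unsigned char)*s & 0xC0) != 0x80)
        {
            ++count;
        }
    }
    return count;
}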
Hello Martins!

It's me again. I'm stuck on another floating point issue in 32 bit x86 while eliminating the C Runtime. And yes, I'm still working on it. It's been three months now.

Do you know anything about _ftol2 and _ftol2_sse? I'm getting unresolved externals on those in compiling my ActiveX Grid Control dll. I'm assuming they are abbreviations for float to long. Odd thing is, nowhere in my code are there any four byte floats declared or used.

Here's the deal. My ActiveX Grid Control compiles/links/runs perfectly in x64. But in x86 its giving me those linker errors. The code is heavy with #ifdef Debug conditionals where I output debugging information to a log file, so I know exactly right down to the very statement what's causing the problem and errors. There really isn't much floating point math in the grid. Originally when I wrote the code there wasn't any. But couple years after that I got turned on to the necessity of writing "High DPI Aware" code, so that if the user changed Display or Display Resolution settings in Control Panel, my screens wouldn't look like s***. So to code that I needed about a dozen more lines of code and I have a situation where a double gets multiplied by an int and the result stored in a Windows DWORD. The double is a DPI scaling factor which commonly takes on values of 1.0, 1.25, or 1.5. I suppose other values are possible, but I believe those are the only values I’ve ever seen in playing with those settings in Control Panel on my specific laptop. The thought occurred to me that perhaps because the values are so simple and easily expressible in a four byte float that might be why the compiler is using _ftol2 instead of _dtoui which we’ve previously dealt with. This is actually the block of code from the grid dll where the doubles are declared and initialized…

// DPI Handling
double dpiX, dpiY;
double rxRatio, ryRatio;
HDC hDC = GetDC(NULL);
dpiX = GetDeviceCaps(hDC, LOGPIXELSX);
dpiY = GetDeviceCaps(hDC, LOGPIXELSY);
rxRatio = (dpiX / 96);
ryRatio = (dpiY / 96);


It's those rxRatio/ryRatio variables that take on values such as 1.0, 1.25, 1.5, etc. The way I use them is that they need to be multiplied against everything in an app that specifies the size of anything, such as the x, y, cx, and cy arguments in CreateWindowEx() calls to create and position objects. For example, say you wanted a top level window at 75, 75 on the desktop that was 320 pixels wide and 300 pixels high…

hWnd=CreateWindowEx(0, szClassName, szClassName, WS_OVERLAPPEDWINDOW, 75, 75, 320, 300, HWND_DESKTOP, 0, hIns, 0);


What you would do after obtaining the above DPI values would be to multiply all those numbers by rxRatio or ryRatio, as the case may be. However, I came up with a better solution that just uses a macro to do that as follows…

#define SizX(x)      x * rxRatio
#define SizY(y)      y * ryRatio

…so the above CreateWindowEx() call becomes even simpler…

hWnd=CreateWindowEx(0, szClassName, szClassName, WS_OVERLAPPEDWINDOW, SizX(75), SizY(75), SizX(320), SizY(300), HWND_DESKTOP, 0, hIns, 0);


That’s what’s failing in the 32 bit builds and generating the linker errors. I have been trying to solve it using the techniques you showed me about a month or so ago when I was fighting with that _dtoui3 thingie. If you recall, to solve that problem I created this function…

#ifdef _M_IX86

#include <emmintrin.h>  // _mm_set_sd, _mm_cvttsd_si32

unsigned int __cdecl DoubleToU32(double x)
{
 return (unsigned int)_mm_cvttsd_si32(_mm_set_sd(x));
}

#endif


…and its use in 32 bit builds was as follows in peeling off digits of a double to convert to a character string…

while(k<=17)
{
  if(k == i)
     k++;
  *(p1+k)=48+(char)n;
  x=x*10;
  #ifdef _M_IX86
     n=DoubleToU32(x);
  #else   
     n = (size_t)x;
  #endif   
  x = x-n;
  k++;
}


So first thing I did was to try to use code like that in ftol2 and ftol2_sse implementations wrapped in extern “C”’s to see if that would link….

extern "C" long __cdecl _ftol2_sse(double x)
{
 return _mm_cvtsd_si32(_mm_set_sd(x));
}


extern "C" long __cdecl _ftol2(double x)
{
 return _mm_cvtsd_si32(_mm_set_sd(x));
}  


Note I used the versions that round instead of truncate. That solved the unresolved externals and the code built. But it doesn’t work. Doesn’t crash or anything; just doesn’t work. The result of the multiplication ends up being zero. I looked around the long list of compiler intrinsics to see if I could find anything better, and I did, so I tried this….

extern "C" long _ftol2(float x)
{
 return _mm_cvtss_si32(_mm_set_ss(x));
} 

extern "C" long _ftol2_sse(float x)
{
 return _mm_cvtss_si32(_mm_set_ss(x));
}


That didn’t work either. I’m some confused about what’s going on because the dll code that won’t work seems similar to exe code that does work. When I saw what was happening I decided to make a small exe test program to see if, in 32 bit, one could declare a double and assign a number to it, multiply that by an int, and assign the result to a DWORD…

// cl StrTst5.cpp /O1 /Os /GS- /link TCLib.lib kernel32.lib 
#include <windows.h>
#include "stdlib.h"
#include "stdio.h"

extern "C" int   _fltused=1;
#define SizX(x)  x * rxRatio

int main()
{
 double rxRatio    = 1.25;
 int    iColWidths = 110;
 DWORD  pColWidths;

 pColWidths = SizX(iColWidths);
 printf("pColWidths = %u\n",pColWidths); 
 getchar();
 
 return 0;
}


...and that works perfectly. The result is 137...

C:\Code\VStudio\Grids\x86>cl StrTst5.cpp /O1 /Os /GS- /link TCLib.lib kernel32.lib
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.21022.08 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

StrTst5.cpp
Microsoft (R) Incremental Linker Version 9.00.21022.08
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:StrTst5.exe
TCLib.lib
kernel32.lib
StrTst5.obj

C:\Code\VStudio\Grids\x86>StrTst5
pColWidths = 137


Now admittedly, the code in the dll is a lot more complicated, but it's unclear to me how that would make any difference, as the fundamental operation taking place is no different than above. What I mean by 'complicated' in the dll is that it has to handle 'instance data' and multiple instantiations of grids correctly, so everything is done through dynamic memory allocations and pointers. The actual statement from the grid code that's failing in 32 bit is this...

pGridData2->pColWidths[i] = SizX(strFieldData[0].iVal());  //  <<< Line Causing Problems!!!


The variable pGridData2 is a pointer to a GridData object, which is dynamically allocated for each grid, and it's where I hang pointers to all the grid's private 'instance' data. One of the members is a pointer to another memory block where I store all the column widths specified by the user when the grid is instantiated. These are modifiable at run time by the user through dragging the column dividers with the mouse. That's the GridData->pColWidths[] member, which is typed as DWORDs. The SizX() I've already described. The strFieldData[0].iVal() term is a member function call on my String class where I'm extracting the column widths from the grid setup string passed in by the user/client and converting them to ints. So yes, that's complicated, but fundamentally no different than a multiplication of a double by an int with the rounded result going to a DWORD. And that appears to be where _ftol2 and _ftol2_sse enter the picture somehow. What do you think Martins? Any ideas how I might solve this?

By the way, by removing the High DPI Aware code, which amounts to little more than I've shown above, the grid builds and runs fine - just like its x64 counterpart.
The _ftol2 and _ftol2_sse functions have their own non-standard calling convention: the argument is passed in ST(0) and the return value comes back in EDX:EAX. You cannot implement them with just C code; you need to write assembly. I wrote how to implement these two functions in the first post of this topic. Also read the warning below that code to understand the limitations of that implementation. You will most likely want more correct code, with either the SSE2 cvttps2dq (or similar) instruction or a few more x87 FPU instructions. Check what the SDL library does: https://hg.libsdl.org/SDL/file/80...1b90/src/stdlib/SDL_stdlib.c#l320
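For illustration, an x87-only sketch in the spirit of the SDL version (untested here; it temporarily switches the FPU rounding mode to truncation so fistp behaves like a C cast):

#ifdef _M_IX86

// Sketch of _ftol2_sse honoring the custom contract: the value arrives in ST(0)
// and the truncated 64-bit result is returned in EDX:EAX. The FPU control word
// is temporarily set to round-toward-zero so fistp truncates like a C cast.
extern "C" __declspec(naked) void _ftol2_sse()
{
    __asm
    {
        sub    esp, 12
        fnstcw word ptr [esp + 8]        // save the current FPU control word
        mov    ax, word ptr [esp + 8]
        or     ax, 0C00h                 // rounding control = truncate toward zero
        mov    word ptr [esp + 10], ax
        fldcw  word ptr [esp + 10]
        fistp  qword ptr [esp]           // pop ST(0), store as a 64-bit integer
        fldcw  word ptr [esp + 8]        // restore the original control word
        mov    eax, dword ptr [esp]      // low half of the result
        mov    edx, dword ptr [esp + 4]  // high half of the result
        add    esp, 12
        ret
    }
}

// _ftol2 uses the same contract, so it can simply share the implementation.
extern "C" __declspec(naked) void _ftol2()
{
    __asm { jmp _ftol2_sse }
}

#endif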

It's much better to avoid C/C++-style or implicit casts in code, to avoid generating implicit dependencies like this. Simply create your own casting functions (DoubleToU32, FloatToU32 and others) and use them.
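For example, companions to the DoubleToU32 shown earlier could look like this (the names are just suggestions):

#include <emmintrin.h>  // _mm_set_ss/_mm_set_sd, _mm_cvttss_si32/_mm_cvttsd_si32

// Truncating conversions, like a C cast, without pulling in _ftol2/_dtoui3.
// Note: like DoubleToU32, the unsigned versions only cover values that fit in
// the signed 32-bit range, since the underlying SSE conversion is signed.
static inline int          DoubleToS32(double x) { return _mm_cvttsd_si32(_mm_set_sd(x)); }
static inline int          FloatToS32(float x)   { return _mm_cvttss_si32(_mm_set_ss(x)); }
static inline unsigned int FloatToU32(float x)   { return (unsigned int)_mm_cvttss_si32(_mm_set_ss(x)); }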

Edited by Mārtiņš Možeiko on
Oh! Sorry Martins! I feel dumb. You covered this early on and I forgot. My answers are right there. At the time I read it several months ago I was having other problems, which you covered, and I forgot about your coverage of the _ftol2 issue.
Are there any public domain math libraries that one could use? Or at least ones with reasonable licences for gamedev?
The one problem I see with avoiding the c runtime is replacing the math library functions.
Especially these ones: sin, cos, atan2, sqrt.

Also do you write your own optimized memcpy, memcmp etc when you don't use the c runtime, or do you just make the compiler emit intrinsics for them?
sin, cos and atan2 are not really necessary for game code. If you have to get the angle with atan2 from a cos and sin you will probably be sending the angle (possibly slightly modified) back through to sin or cos a bit later. You can probably optimize that roundtrip out using some trig identities.

Either way sin, cos, invsqrt (= 1/sqrt(x) from which you can get the sqrt: sqrt(x) = x*invsqrt(x)) and even memcpy have machine instructions for them. So the general solution is to use assembly or intrinsics to invoke them.
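For example, a tiny sketch of the invsqrt route with SSE intrinsics (rsqrtss is an approximation, roughly 12 bits of precision, so refine it or use sqrtss if you need exact results):

#include <xmmintrin.h>  // SSE intrinsics

// Approximate sqrt via the rsqrtss instruction and the identity sqrt(x) = x * (1/sqrt(x)).
static inline float ApproxSqrt(float x)
{
    float inv;
    _mm_store_ss(&inv, _mm_rsqrt_ss(_mm_set_ss(x)));  // ~12-bit accurate 1/sqrt(x)
    return x * inv;                                   // note: x == 0 yields NaN here (0 * inf)
}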

Sometimes sin/cos are needed. For example, to construct a rotation matrix from user input (in-game editor or something).

@Cranky: for sin/cos I suggest looking at the http://gruntthepeon.free.fr/ssemath/ page. It has a very permissive license (zlib), and it has SSE and NEON optimized sin/cos/exp/log implementations with very good accuracy and performance. For sqrt there is an SSE intrinsic that generates a single instruction - _mm_sqrt_ss (or _mm_sqrt_ps for 4x floats). For atan2 you can check out this code: https://github.com/michael-quinlan/ut-sse/blob/master/sse/sseMath.h (MIT license). It uses a bunch of helper functions, but you can extract the raw SSE intrinsics for the atan2 code.

As for memcpy - usually you just write a for loop, like Casey does, and copy the needed data manually. That way the compiler can optimize the copy better than a generic memcpy; if it sees that the length is 8 bytes, it can generate just one mov instruction on x86_64.
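For example, something as simple as this (the compiler will happily unroll or vectorize it when the size is known):

#include <stdint.h>
#include <stddef.h>

// Plain byte-copy loop for non-overlapping buffers; with a compile-time-known
// size the optimizer typically reduces this to a few mov instructions.
static inline void CopyBytes(void* dstv, const void* srcv, size_t size)
{
    uint8_t* dst = (uint8_t*)dstv;
    const uint8_t* src = (const uint8_t*)srcv;
    for (size_t i = 0; i < size; ++i)
    {
        dst[i] = src[i];
    }
}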

For a generic whatever-amount-of-bytes memcpy you can use architecture-specific stuff. For example, on the Intel architecture you can use the simple rep movsb instruction:
inline void CopyMemory(uint8_t* dst, const uint8_t* src, size_t size)
{
    assert(src >= dst + size || src + size <= dst); // only for non-overlapping ranges
    __movsb(dst, src, size);
}

For discussion about how to implement memcpy and benchmarks for different implementations see this topic: https://hero.handmade.network/forums/code-discussion/t/157
rep movsb on modern CPUs is not so bad. CPUs have special optimizations for it.

@ratchetfreak: while the x86 CPU has x87 FPU instructions for sin/cos/atan2, you really shouldn't use them. SSE/SSE2 will give you better performance. And in x86_64 code it avoids transferring values from SSE registers to the x87 FPU and back again.

Edited by Mārtiņš Možeiko on
I was wondering what you thought of an alternative I found. If you change the entry function with the /ENTRY linker option, avoiding the CRT entry function, and also don't use the CRT anywhere else, the only function you depend on from the CRT in an optimised build is memset. Then you can statically link in only that function. This way you totally sidestep all the intrinsic functions and any other compiler functionality having to be copied or reimplemented. After all, we want these functions anyway, so I don't see a problem with getting them normally. It would probably be fine to use other simple functions like the math ones too. As far as I can tell this works as you would want it to, but I'm not an expert.
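Roughly this kind of setup, for example (the build switches and names here are just an illustration):

// Hypothetical build line: cl app.cpp /O1 /GS- /link /SUBSYSTEM:WINDOWS /ENTRY:AppEntry kernel32.lib user32.lib
#include <windows.h>

// Custom entry point named via /ENTRY - no CRT startup code runs before it.
void __stdcall AppEntry()
{
    MessageBoxW(NULL, L"Running without the CRT startup", L"Demo", MB_OK);
    ExitProcess(0);  // exit explicitly; there is no CRT to return to
}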
Yes, that is also an option you can use to avoid the CRT startup functionality. But the point of this topic was to avoid the whole C runtime, because some people want to see (or write) all the code that runs in their program and not rely on unknown code inserted by the compiler.