Guide - How to avoid C/C++ runtime on Windows

sin, cos and atan2 are not really necessary for game code. If you have to get the angle with atan2 from a cos and sin you will probably be sending the angle (possibly slightly modified) back through to sin or cos a bit later. You can probably optimize that roundtrip out using some trig identities.
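
A minimal sketch of what optimizing out that roundtrip can look like, assuming you already hold the cos and sin of an angle and want the cos and sin of a modified angle. The `cos_sin` type and the function names are hypothetical, made up for this illustration:

```c
// Hypothetical helper type: the cos and sin of some angle, kept together.
typedef struct { float c, s; } cos_sin;

// cos/sin of (a + b) from the angle-sum identities, without atan2, sin or cos:
//   cos(a+b) = cos(a)cos(b) - sin(a)sin(b)
//   sin(a+b) = sin(a)cos(b) + cos(a)sin(b)
static cos_sin cos_sin_add(cos_sin a, cos_sin b)
{
    cos_sin r = { a.c * b.c - a.s * b.s, a.s * b.c + a.c * b.s };
    return r;
}

// cos/sin of (-a): cos is even, sin is odd.
static cos_sin cos_sin_neg(cos_sin a)
{
    cos_sin r = { a.c, -a.s };
    return r;
}
```

If the offset angle is a constant, its cos and sin can be precomputed once, so no trig function is called at runtime at all.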

Either way sin, cos, invsqrt (= 1/sqrt(x) from which you can get the sqrt: sqrt(x) = x*invsqrt(x)) and even memcpy have machine instructions for them. So the general solution is to use assembly or intrinsics to invoke them.

Sometimes sin/cos are needed, for example to construct a rotation matrix from user input (an in-game editor or something).

@Cranky: for sin/cos I suggest looking at the http://gruntthepeon.free.fr/ssemath/ page. It has a very permissive license (zlib), and it has SSE and NEON optimized sin/cos/exp/log implementations with very good accuracy and performance. For sqrt there is an SSE intrinsic that generates a single instruction: _mm_sqrt_ss (or _mm_sqrt_ps for 4 floats at once). For atan2 you can check out this code: https://github.com/michael-quinlan/ut-sse/blob/master/sse/sseMath.h (MIT license). It uses a bunch of helper functions, but you can extract the raw SSE intrinsics for the atan2 code.
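
For the sqrt part, a small sketch of how the intrinsics can be wrapped for scalar floats. The wrapper names `sqrt_sse` and `rsqrt_sse_approx` are hypothetical; the intrinsics themselves come from xmmintrin.h, which contains no runtime functions:

```c
#include <xmmintrin.h> // SSE intrinsics, compile-time only

// Full-precision square root: compiles down to a single sqrtss.
static float sqrt_sse(float x)
{
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}

// Fast approximate 1/sqrt(x) via rsqrtss. This is only an approximation
// (roughly 12 bits of precision), so refine it or use sqrt_sse if you
// need an exact result.
static float rsqrt_sse_approx(float x)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}
```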

As for memcpy - usually you just write a for loop, like Casey does, and copy the needed data manually. This way the compiler can optimize the copy better than a generic memcpy. If it sees that the length is 8 bytes, it can generate just one mov instruction on x86_64.
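
As a sketch of such a manual copy (the `v2` type and `copy_v2` are hypothetical names for illustration), the size is a compile-time constant, so the compiler is free to replace the loop with a single 8-byte move:

```c
#include <stddef.h> // size_t, compile-time only

typedef struct { float x, y; } v2; // hypothetical 8-byte struct

// Byte-by-byte copy with a length known at compile time.
static inline void copy_v2(v2 *dst, const v2 *src)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    for (size_t i = 0; i < sizeof(v2); i++)
    {
        d[i] = s[i];
    }
}
```

Whether the loop really collapses to one mov depends on the compiler and optimization level, so check the generated assembly if it matters.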

For a generic whatever-amount-of-bytes memcpy you can use architecture-specific instructions. For example, on Intel architecture you can use the simple rep movsb instruction:
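#include <intrin.h>  // __movsb: MSVC intrinsic that emits rep movsb
#include <stdint.h>  // uint8_t
#include <stddef.h>  // size_t

// Note: windows.h defines CopyMemory as a macro, so rename this if you include it.
inline void CopyMemory(uint8_t* dst, const uint8_t* src, size_t size)
{
    assert(src >= dst + size || src + size <= dst); // only for non-overlapping ranges (use your own assert macro without the CRT)
    __movsb(dst, src, size);
}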

For discussion about how to implement memcpy and benchmarks for different implementations see this topic: https://hero.handmade.network/forums/code-discussion/t/157
rep movsb on modern CPUs is not so bad. CPUs have special optimizations for it.

@ratchetfreak: while x86 CPUs have x87 FPU instructions for sin/cos/atan2, you really shouldn't use them. SSE/SSE2 code will give you better performance, and in x86_64 code it avoids transferring values from SSE registers to x87 FPU registers and back again.

Edited by Mārtiņš Možeiko on
I was wondering what you thought of an alternative I found. If you change the entry function with the /ENTRY linker option, avoiding the CRT entry function and also not using the CRT anywhere else, the only function you depend on from the CRT DLLs in an optimised build is memset. Then you can statically link in only that function. This way you totally sidestep having to copy or reimplement the intrinsic functions and any other compiler functionality. After all, we want these functions anyway, so I don't see a problem with getting them normally. It would probably be fine to use other simple functions like the math ones too. As far as I can tell this works as you would want it to, but I'm not an expert.
Yes, that is also an option for avoiding the CRT startup functionality. But the point of this topic was to avoid the whole C runtime, because some people want to see (or write) all the code that runs in their program and not rely on unknown code inserted by the compiler.
I'm not sure if this is useful to anyone, but I hacked together the equivalent code for Linux (might also work on macOS, haven't tested yet).

!!UPDATE! THERE'S A REASON NOT TO USE INT $0x80 ON x64!
main.c:
//this is a wrapper around the Linux exit syscall.
//!! If you want to save an instruction, make status a long,
//    replace the middle two lines of the asm block with
//    "movq %0, %%rdi\n\t", and use the compiler option "-fno-builtin"
__attribute__((noreturn)) void exit(int status)
{
    //exit() is normally implemented in glibc,
    //  but since we're not using that, we have to
    //  do the syscall manually.
    asm("movq $60, %%rax\n\t"
      "xorq %%rdi, %%rdi\n\t"
      "movl %0, %%edi\n\t"
      "syscall" :: "r" (status) : "rax", "rdi");
    //Using int $0x80 here causes Linux to use the older x86 syscall table
    //  which, notably, has exit at eax==1, and passes the first param in ebx.
    //syscall will use the newer x64 table and API.

    //This is a compiler intrinsic that does nothing but
    //  indicate that execution will never reach a certain area.
    //  Mostly required to avoid warnings about a noreturn function returning.
    __builtin_unreachable();
}

//This is where we will tell the linker to start executing our code.
__attribute__((noreturn)) void start(void)
{
    //... Our code goes here.

    //All done, call our exit syscall wrapper.
    exit(0);
}


Compile this using:
gcc -o nortlinux -nostdlib -Wl,-estart main.c


The resulting executable, when run, will do nothing and immediately exit.

What are the implications of building Linux applications without glibc?

Well, the big one is that no syscall wrappers are provided.
See, unlike Windows, Linux doesn't have a kernel library separate from its standard C library. glibc implements both Linux's syscall interface as well as the C stdlib. Avoiding glibc means rolling your own syscall wrappers. Fortunately, unlike Windows, Linux's syscalls are all very well documented.
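
As an illustration of rolling your own wrapper, here is a minimal sketch for the write syscall on x86-64 Linux only. The name `sys_write` is hypothetical; the syscall number (1) and the register assignments follow the x86-64 Linux syscall convention:

```c
// Raw write(2) wrapper for x86-64 Linux, no glibc involved.
// Returns the number of bytes written, or a negative errno value on failure
// (the raw syscall does not set errno; that is glibc's job).
static long sys_write(int fd, const void *buf, unsigned long count)
{
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)                  // result comes back in rax
        : "a"(1L),                   // rax = syscall number (write)
          "D"((long)fd),             // rdi = first argument
          "S"(buf),                  // rsi = second argument
          "d"(count)                 // rdx = third argument
        : "rcx", "r11", "memory");   // syscall clobbers rcx and r11
    return ret;
}
```

For example, `sys_write(1, "hello\n", 6)` writes to stdout. Other syscalls follow the same pattern with a different number in rax and further arguments in r10, r8 and r9.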

Allocating large arrays seems to be no problem, floating-point seems to work without extra code, but zero-initializing large arrays does create an "undefined reference to 'memset'" linker error. Using the memset implementation from the original post mostly solves the problem, though (at least in pure C mode), size_t is not defined by default -- use unsigned long or typedef it.
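
For reference, a byte-wise memset can be sketched as below. This is not the implementation from the original post, only a simple stand-in. It is named `nocrt_memset` (a hypothetical name) so it can be tried alongside a normal libc; to satisfy the linker error above it would have to be named memset. It takes size_t from stddef.h, which is a compile-time-only header:

```c
#include <stddef.h> // size_t; contains no runtime functions

// Fills count bytes at dst with the low byte of value and returns dst,
// matching the contract of the standard memset.
void *nocrt_memset(void *dst, int value, size_t count)
{
    unsigned char *p = (unsigned char *)dst;
    while (count--)
    {
        *p++ = (unsigned char)value;
    }
    return dst;
}
```

One thing to verify on your own setup: at higher optimization levels GCC may recognize such a loop and turn it back into a call to memset, which would recurse if the function itself is named memset. The flag usually cited for disabling that is -fno-tree-loop-distribute-patterns.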

I would think it's safe to assume that if you use this method with C++, all the 'features' it provides that are backed by the runtime -- exceptions, new/delete, RTTI, global object ctors/dtors, pure virtuals -- would not be available.

UPDATE: macOS has so far stymied my efforts to dodge the startup code.
For Linux, it seems that there are two other compiler options you can use: "-nodefaultlibs" and "-nostartfiles", but documentation on GCC seems to suggest "-nostdlib" implies both. Docs also suggest that linking with libgcc.a may be required if it starts complaining about routines GCC includes.

Edited by Spicy Wolf on Reason: Fixed big problem with mixing Linux syscall tables.
Spicy Wolf
I'm not sure if this is useful to anyone, but I hacked together the equivalent code for Linux (might also work on macOS, haven't tested yet).
<snip>

Awesome sauce! I'm sure this will come in handy to a few people. :)

Also, welcome.
So, another macOS update.

I managed to beat clang into accepting nostdlib by also using "-static" to avoid being forced to use libSystem.dylib and crt1.o, but I didn't save that many bytes: 4184 vs. 4288. I've been investigating using NASM to directly control the code being executed, but the version included with my system (or maybe the build tools? I'm not sure) only builds for 32-bit, not 64-bit like I'm interested in. I'll have to get a newer version and try that out.

Some other useful information: the 64-bit syscall interface for macOS uses the same syscall instruction and the same argument registers as Linux (rdi, rsi, rdx, r10, r8, and r9, in order), but the syscall numbers are not the Linux ones. macOS uses the BSD syscall numbers with 0x2000000 added, so exit, which is 60 on x86-64 Linux, is BSD syscall 1 and becomes 0x2000001 on macOS.

...

Okay, so I just realized something kind of big. DON'T USE INT $0x80 IN LINUX X64! It redirects to the old x86 syscall table, which has a totally different API. I'll be updating my other post regarding this, but the syscall numbers and registers to use are totally different between the two.

EDIT 12/26/2016: I just compared against the minimal Windows executable I made, and that's 3,072 bytes, and there's probably switches I could use to get that down, or some sort of tool similar to GNU strip, but it's not that much smaller than the minimal macOS executable above.

Edited by Spicy Wolf on Reason: New information, avoiding double post.
Do snprintf and swprintf (the C standard versions, not the Windows ones) exist in msvcrt?
I am not sure if I am failing at importing them or if I have the wrong names (as Windows usually has similar names and adds underscores and whatnot).
MSVC before VS2015 has _snprintf which is not compatible with C99 snprintf. It has some differences/incompleteness.
Starting with VS2015 msvc C runtime has C99 compliant snprintf function.

Basically any underscore prefixed function in MSVC C runtime has some non-standard behavior. That's why the underscore - so somebody doesn't accidentally call it and wonder why behavior is different.
Ah, thanks for clarifying. I wasn't sure what it meant, so I kind of thought the underscore was used to explicitly call the original (C compliant) function.

Well, that explains why I can't obtain the function. Is there any equivalent to it in Windows itself?
If you need sprintf functionality without the default MSVC runtime, you can choose from several options:

1) wsprintfA function from user32.dll. It has some limitations - max output is 1024 chars, no float support (and probably no support for the more advanced formatters).
2) _snprintf from msvcrt.dll. Either link dynamically to it, or use GetProcAddress. This snprintf is pre-C99, so there are no formatters like %zu, but otherwise it works fine. The msvcrt.dll file is present on all Windows machines; it is basically the C runtime used by Windows internal components. GCC on Windows (MinGW) uses it as its C runtime library.
3) stb_sprintf - it doesn't depend on CRT functions.
4) c99-snprintf - also a good single-file sprintf.

Edited by Mārtiņš Možeiko on
stb_sprintf includes stdarg.h and stdlib.h for va_args & friends. This will make it use the CRT, right?
va_arg is actually in stdarg.h, not stdlib.h. That's a bug in stb_sprintf.

And it is ok to use stdarg.h even if you are not linking to CRT. Because stdarg.h doesn't contain runtime functions. It contains only compile time compiler-specific functionality. Just like intrin.h contains intrinsics to generate special CPU instructions (like __rdtsc) instead of calling real functions.

It is also OK to use stdint.h (provides typedefs for integer types), inttypes.h (for printf macros like PRIu64), stddef.h (for size_t and NULL), limits.h (for various macros like INT_MAX) and float.h (for macros like FLT_MAX).
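
To illustrate the stdarg.h point, here is a minimal sketch of a variadic function that needs nothing from the CRT at link time. The function `sum_ints` is a hypothetical example; va_start, va_arg and va_end are expanded by the compiler and do not turn into library calls:

```c
#include <stdarg.h> // va_list, va_start, va_arg, va_end (compile-time only)

// Sums `count` int arguments passed after the count itself.
static int sum_ints(int count, ...)
{
    va_list args;
    va_start(args, count);

    int total = 0;
    for (int i = 0; i < count; i++)
    {
        total += va_arg(args, int);
    }

    va_end(args);
    return total;
}
```

A printf replacement uses the same mechanism, just with a format string deciding which type to pull with va_arg each time.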

Edited by Mārtiņš Možeiko on
I knew about stdint.h, inttypes.h, stddef.h and limits.h. I've never looked into stdarg.h, so thanks for the info! I'll take a look myself when I get a chance!

stdlib.h definitely needs linking to CRT, that was the main point of the question. Knowing that, if I ever need a non-CRT printf implementation I'll absolutely check this one out!

Thanks!
As aidtopia noted, replacing the memset and memcpy intrinsics doesn't seem to combine with Whole Program Optimization (/GL). You can still use /GL by putting the replacement functions in their own static library that is built without it. That way Whole Program Optimization still applies to the rest of the project.