Guide - How to avoid C/C++ runtime on Windows

I'm not sure if this is useful to anyone, but I hacked together the equivalent code for Linux (might also work on macOS, haven't tested yet).

!!UPDATE! THERE'S A REASON NOT TO USE INT $0x80 ON x64!
main.c:
//this is a wrapper around the Linux exit syscall.
//!! If you want to save an instruction, make status a long,
//    replace the middle two lines of the asm block with
//    "movq %0, %%rdi\n\t", and use the compiler option "-fno-builtin"
__attribute__((noreturn)) void exit(int status)
{
    //exit() is normally implemented in glibc,
    //  but since we're not using that, we have to
    //  do the syscall manually.
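    //The syscall instruction itself overwrites rcx and r11,
    //  so they are listed as clobbers along with rax and rdi.
    asm volatile("movq $60, %%rax\n\t"
      "xorq %%rdi, %%rdi\n\t"
      "movl %0, %%edi\n\t"
      "syscall" :: "r" (status) : "rax", "rdi", "rcx", "r11", "memory");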
    //Using int $0x80 here causes Linux to use the older x86 syscall table
    //  which, notably, has exit at eax==1, and passes the first param in ebx.
    //syscall will use the newer x64 table and API.

    //This is a compiler intrinsic that does nothing but
    //  indicate that execution will never reach a certain area.
    //  Mostly required to avoid warnings about a noreturn function returning.
    __builtin_unreachable();
}

//This is where we will tell the linker to start executing our code.
__attribute__((noreturn)) void start(void)
{
    //... Our code goes here.

    //All done, call our exit syscall wrapper.
    exit(0);
}


Compile this using:
gcc -o nortlinux -nostdlib -Wl,-estart main.c


The resulting executable, when run, will do nothing and immediately exit.

What are the implications of building Linux applications without glibc?

Well, the big one is that no syscall wrappers are provided.
See, unlike Windows, Linux doesn't have a kernel library separate from its standard C library. glibc provides both the wrappers for Linux's syscall interface and the C stdlib. Avoiding glibc means rolling your own syscall wrappers. Fortunately, unlike Windows, Linux's syscalls are all very well documented.
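
As a rough illustration of what "rolling your own" can look like, here is a minimal sketch of a raw syscall wrapper. It assumes Linux on x86-64 and GCC/Clang inline asm. The names raw_syscall3 and sys_write are made up for this example and are not from the post above. The syscall numbers used (1 for write, 39 for getpid) come from the x86-64 syscall table.

```c
/* Hypothetical helper: issue a Linux x86-64 syscall with up to three arguments.
   The number goes in rax, the arguments in rdi, rsi, rdx; the result comes back in rax.
   The syscall instruction overwrites rcx and r11, hence the clobber list. */
static long raw_syscall3(long number, long a1, long a2, long a3)
{
    long result;
    __asm__ volatile("syscall"
                     : "=a"(result)
                     : "a"(number), "D"(a1), "S"(a2), "d"(a3)
                     : "rcx", "r11", "memory");
    return result;
}

/* Hypothetical wrapper around write (syscall 1 on Linux x86-64).
   Returns the number of bytes written, or a negative errno value on failure. */
static long sys_write(int fd, const void *buf, unsigned long len)
{
    return raw_syscall3(1, (long)fd, (long)buf, (long)len);
}
```

With something like this in place, start() could call sys_write(1, "hello\n", 6) before calling exit(0). Syscalls with more arguments would need r10, r8 and r9 as well.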

Allocating large arrays seems to be no problem, and floating-point seems to work without extra code, but zero-initializing large arrays does create an "undefined reference to 'memset'" linker error. Using the memset implementation from the original post mostly solves the problem. Note that, at least in pure C mode, size_t is not defined unless you include a header such as stddef.h, so either include that, use unsigned long, or typedef it yourself.
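
For reference, a byte-by-byte fill along these lines is enough to satisfy the linker. This is only a sketch, not the implementation from the original post. The names nocrt_memset and usize are made up here. In a real -nostdlib build the function would have to be named memset so the compiler-generated calls resolve to it. The volatile pointer is an assumption-driven precaution: it stops the optimizer from turning the loop back into a call to memset.

```c
/* Assumption: unsigned long stands in for size_t on 64-bit Linux. */
typedef unsigned long usize;

/* Hypothetical freestanding fill routine with the same contract as memset:
   writes `count` copies of (unsigned char)value to dest and returns dest. */
void *nocrt_memset(void *dest, int value, usize count)
{
    /* volatile keeps the compiler from replacing this loop with a memset call */
    volatile unsigned char *p = (volatile unsigned char *)dest;
    while (count--) {
        *p++ = (unsigned char)value;
    }
    return dest;
}
```

This is deliberately simple and slow. A tuned version would write whole words at a time.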

I would think it's safe to assume that if you use this method with C++, all the 'features' it provides that are backed by the runtime -- exceptions, new/delete, RTTI, global object ctors/dtors, pure virtuals -- would not be available.

UPDATE: macOS has so far stymied my efforts to dodge the startup code.
For Linux, it seems that there are two other compiler options you can use: "-nodefaultlibs" and "-nostartfiles", but documentation on GCC seems to suggest "-nostdlib" implies both. Docs also suggest that linking with libgcc.a may be required if it starts complaining about routines GCC includes.

Edited by Spicy Wolf on Reason: Fixed big problem with mixing Linux syscall tables.
Spicy Wolf
I'm not sure if this is useful to anyone, but I hacked together the equivalent code for Linux (might also work on macOS, haven't tested yet).
<snip>

Awesome sauce! I'm sure this will come in handy to a few people. :)

Also, welcome.
So, another macOS update.

I managed to beat clang into accepting nostdlib by also using "-static" to avoid being forced to use libSystem.dylib and crt1.o, but I didn't save that many bytes: 4184 vs. 4288. I've been investigating using NASM to directly control the code being executed, but the version included with my system (or maybe the build tools? I'm not sure) only builds for 32-bit, not 64-bit like I'm interested in. I'll have to get a newer version and try that out.

Some other useful information: the 64-bit syscall interface for macOS uses the same instruction and registers as Linux, but the syscall numbers are different. They are the BSD syscall numbers plus 0x2000000 (e.g. exit, which is 60 on Linux, is BSD syscall 1 and therefore 0x2000001 on macOS). The parameters are passed, in order, in rdi, rsi, rdx, r10, r8, and r9.

...

Okay, so I just realized something kind of big. DON'T USE INT $0x80 IN LINUX X64! It redirects to the old x86 syscall table, which has a totally different API. I'll be updating my other post regarding this, but the syscall numbers and registers to use are totally different between the two.

EDIT 12/26/2016: I just compared against the minimal Windows executable I made, and that's 3,072 bytes. There are probably switches I could use to get that down, or some sort of tool similar to GNU strip, but it's not that much smaller than the minimal macOS executable above.

Edited by Spicy Wolf on Reason: New information, avoiding double post.
Do snprintf and swprintf (the C standard versions, not the Windows ones) exist in msvcrt?
I am not sure if I am failing to import them or if I have the wrong names (Windows usually has similar names and adds underscores and whatnot).
MSVC before VS2015 has _snprintf, which is not compatible with C99 snprintf. It has some differences and is incomplete.
Starting with VS2015, the MSVC C runtime has a C99-compliant snprintf function.

Basically any underscore prefixed function in MSVC C runtime has some non-standard behavior. That's why the underscore - so somebody doesn't accidentally call it and wonder why behavior is different.
Ah, thanks for clarifying. I wasn't sure what it meant, so I kinda thought the underscore was used to explicitly call the original (C-compliant) function.

Well, that explains why I can't obtain the function. Is there any equivalent to it in Windows itself?
If you need sprintf functionality without the default MSVC runtime, you can choose from multiple options:

1) wsprintfA function from user32.dll. It has some limitations: max output is 1024 chars, and there is no float support (and probably none for the more advanced formatters).
2) _snprintf from msvcrt.dll. Either link dynamically to it, or use GetProcAddress. This snprintf is pre-C99, so no formatters like %zu, but otherwise it works fine. The msvcrt.dll file will be present on all Windows machines; it is basically the C runtime that is used by Windows internal components. GCC on Windows (MinGW) uses it as its C runtime library.
3) stb_sprintf - it doesn't depend on CRT functions.
4) c99-snprintf - also a good single-file sprintf.

Edited by Mārtiņš Možeiko on
stb_sprintf includes stdarg.h and stdlib.h for va_args & friends. This will make it use the CRT, right?
va_arg is actually in stdarg.h, not stdlib.h. That's a bug in stb_sprintf.

And it is OK to use stdarg.h even if you are not linking to the CRT, because stdarg.h doesn't contain runtime functions. It contains only compile-time, compiler-specific functionality, just like intrin.h contains intrinsics that generate special CPU instructions (like __rdtsc) instead of calling real functions.

It is also OK to use stdint.h (provides typedefs for int types), inttypes.h (for printf macros like PRIu64), stddef.h (for size_t and NULL), limits.h (for various macros like INT_MAX) and float.h (for macros like FLT_MAX).
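
To make the stdarg.h point concrete, here is a small sketch of a variadic function that uses nothing but the va_* macros, so it pulls in no runtime functions. The name sum_ints is made up for this example.

```c
#include <stdarg.h>

/* Hypothetical helper: adds up `count` int arguments.
   va_start, va_arg and va_end are expanded by the compiler itself,
   so nothing here needs to be resolved against the C runtime. */
static int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;
    int i;

    va_start(args, count);
    for (i = 0; i < count; ++i) {
        total += va_arg(args, int);
    }
    va_end(args);
    return total;
}
```

A printf replacement walks its argument list in the same way, except that it picks the type for each va_arg from the format string.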

Edited by Mārtiņš Možeiko on
I knew about stdint.h, inttypes.h, stddef.h and limits.h. I've never looked into stdarg.h, so thanks for the info! I'll take a look myself when I get a chance!

stdlib.h definitely needs linking to CRT, that was the main point of the question. Knowing that, if I ever need a non-CRT printf implementation I'll absolutely check this one out!

Thanks!
As aidtopia noted, replacing the memset and memcpy intrinsics doesn't seem to combine with Whole Program Optimization (/GL). It is still possible to use it, by putting the replacement intrinsics in their own static library. That way you can still apply Whole Program Optimization to the rest of the project.
Right, that is true. Although if you are using handmade-like build, where everything is single-translation unit, then /GL doesn't really matter. There are no optimizations across translation-units, because there is only one translation unit. So you won't gain anything from link-time optimizations. Everything can be optimized during compilation with /O2 or /Ox /Ot.
Greetings everyone, I know that it's quite an old thread but, I've noticed it just a few days ago and since then I'm trying as hard as I can to implement the casting functions by myself with inline assembly. I think I'm pretty much done with the _ftoui3 (float to unsigned int casting) function, even though I haven't done any exception checking and I don't know how efficient it is, it seems to be working quite nicely. You can judge for yourself:

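__declspec(naked) void _ftoui3() {
    __asm {
        push ebp
        mov ebp, esp

        ; cvttss2si eax, xmm0 ; This instruction also truncates but stores the value as a signed integer, just like fisttp :(

        sub esp, 4
        movd dword ptr [ebp-4], xmm0  ; spill the float (this code expects it in xmm0) to the stack
        fld dword ptr [ebp-4]         ; load it onto the x87 stack
        fabs                          ; note: a negative input is converted as its absolute value

        sub esp, 4
        fisttp dword ptr [ebp-8]      ; truncating store as a signed 32-bit integer
        mov eax, [ebp-8]

        add esp, 8

        pop ebp
        ret
    }
}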


However, I got stuck on the _dtoui3 (double to unsigned int casting) function. You see, the problem is that although there are indeed instructions that can truncate floating point numbers successfully, they all share a common unfortunate drawback: They all store the truncated value as a signed integer, which will lead to some data loss when a value stored in a double exceeds the limit of a signed integer. I've scanned the whole Internet trying to find different instructions or different methods to tackle this problem but to no avail.
If anyone of you could share your wisdom with me and shed some light on this minor topic, I would appreciate it very much. I'm certain that there is a way since the CRT managed to implement it somehow, but the disassembly of this function looks very intimidating so I can't figure out anything out of it.
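
One common way around the signed-only conversions is to handle the top half of the unsigned range separately: values below 2^31 are converted directly, and values from 2^31 up have 2^31 subtracted first, get converted as signed, and then have the top bit added back. Below is a minimal sketch of that idea in plain C rather than inline assembly. The name double_to_u32 is made up, and this is not a claim about how the CRT's _dtoui3 does it. Each cast to int32_t stands for one of the signed truncating instructions (cvttsd2si or fisttp). Inputs that are NaN or outside the 32-bit range are not handled.

```c
#include <stdint.h>

/* Hypothetical helper: double -> uint32 using only signed truncating conversions. */
static uint32_t double_to_u32(double value)
{
    const double two_pow_31 = 2147483648.0;

    if (value < two_pow_31) {
        /* Fits in a signed 32-bit integer, so a signed conversion is enough. */
        return (uint32_t)(int32_t)value;
    }
    /* Shift into the signed range, convert, then add the top bit back. */
    return (uint32_t)(int32_t)(value - two_pow_31) + 0x80000000u;
}
```

Another option along the same lines is to store a 64-bit signed integer (fisttp also has a qword form) and keep only the low 32 bits, since every unsigned 32-bit value fits in a signed 64-bit one.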
I found this instruction, but it seems to be AVX512. https://www.felixcloutier.com/x86/VCVTTSD2USI.html
SedatedSnail

I found this instruction, but it seems to be AVX512. https://www.felixcloutier.com/x86/VCVTTSD2USI.html


Oh man, that's so close!!!
Unfortunately, there aren't many people yet (myself included) who got the necessary processor to support this instruction set :(