my libc vs system's libc Insane difference

bewwys

#15618

June 24, 2018

Hi everyone,

I'm coding a C standard library for school.

When I use the bench toolkit to compare my library to the system's library the differences are just completely insanes.

here a picture showing the difference.

To see if it was my implementation that cause the perf difference I used the code source from the libgcc. But still the performance are just completely différents between the system's library and the libgcc's.

I really don't understand...

Edited by bewwys on June 25, 2018, 12:24am

Mārtiņš Možeiko

#15619

June 25, 2018

Can you show your implementation source?

glibc provides optimized string functions. That includes SSE or NEON or whatever else is available on your CPU. Are you using SIMD for these functions?

Here's an implementation of memchr in glibc for x86_64 architecture: https://sourceware.org/git/?p=gli...sdeps/x86_64/memchr.S;hb=HEAD#l29

Even "plain C" implementation is optimized with bunch of tricks (comparing 4 bytes at a time): https://sourceware.org/git/?p=gli...;a=blob;f=string/memchr.c;hb=HEAD

Here's a good article about using SSE4.2 string functions for strcmp, strlen and strstr functions: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 You can do a lot of cool tricks with pcmpistri instruction.

Edited by Mārtiņš Možeiko on June 25, 2018, 12:05am

bewwys

#15620

June 25, 2018

mmozeiko

Can you show your implementation source?
pcmpistri instruction.

Because because libgcc dosent have memchr here is my

void *ft_memchr(const void *ptr, int value, int num)
{
    int i;

    i = 0;
    while (i < num)
    {
        if (*((char *)ptr + i) == value)
            return ((void *)((char *)ptr + i));
        i++;
    }
    return (0);
}

As you can see I don't use anything like SIMD in fact I know that SIMD exist because I started to read Computer System A programmer Perspective. But I don't Know yet how to use it.

It's look like really interesting but also so complicated xD.

Aside: Does SIMD need to be used everywhere and should be a common practice or it's specific to certains projects ?

And thank you just have made my day :)

Edited by bewwys on June 25, 2018, 12:23am

Mārtiņš Možeiko

#15621

June 25, 2018

Because because libgcc dosent have memchr here is my

I assume you mean glibc. And glibc has memchr. Here it is: https://www.gnu.org/software/libc...earch-Functions.html#index-memchr

SIMD can be used where you need to process a lot of data with same algorithm. That could be drawing pixels, searching or sorting data, or some mathematical calculations. In your case you need to process byte array and perform exactly same operation, which means SIMD will do great. Typical x86 CPUs have 128-bit wide registers, some even 256-bit (and there are 512-bit ones coming very soon). In case of 256-bit registers you could process 256/8 = 32 bytes in one iteration. That means your code would run almost 32x faster.

Check out Handmade Hero videos where Casey started to use SSE instructions:
https://hero.handmade.network/episode/code/day115
https://hero.handmade.network/episode/code/day116/

Handmade Ray is also a worth to checkout, how SSE and AVX instructions were introduced there: https://www.youtube.com/watch?v=dpvrPYdTkPw

Edited by Mārtiņš Možeiko on June 25, 2018, 1:28am

ratchetfreak

#15628

June 25, 2018

If you enable optimizations in your compiler it will used SIMD when it can find opportunities to do so. Often this is in loops where each iteration is independent of the previous one.

For memchr this isn't the case because of the return in the loop. This means the compiler has to account for the edge case that num is a lie and the memory address directly after the match is invalid and causes a segfault.

You could explicitly read values N at a time to make it clear to the compiler that it's allowed.

But once you head into those kind of changes take very special care that things stay valid and it is actually an improvement.

Edited by ratchetfreak on June 25, 2018, 10:50am

bewwys

#15644

June 28, 2018

Thank you ratchetfreak,

ratchetfreak
Often this is in loops where each iteration is independent of the previous one.

Where it can be the case ? Do you have an example ?

ratchetfreak

You could explicitly read values N at a time to make it clear to the compiler that it's allowed.

How can I do that ? Do I just need to replace the return value by a variable and return it after the loop to enable the compiler optimization to take place ?

Thank you :)

ratchetfreak

#15646

June 28, 2018

The most basic example is something like

size_t count;
int *A = //...;
int *B = //...;

for(int i = 0; i < count; i++){
    A[i] += B[i]+constant;
}

each iteration through the loop does not affect what happens in the other iterations, order of memory accesses can be shifted around so the compiler could do a partial unroll and then transform the function into:

size_t count;
int *A = //...;
int *B = //...;
int i;
for(i = 0; i+3 < count; i+=4){
    int A0 = A[i+0];
    int A1 = A[i+1];
    int A2 = A[i+2];
    int A3 = A[i+3];

    int B0 = B[i+0];
    int B1 = B[i+1];
    int B2 = B[i+2];
    int B3 = B[i+3];

    A0 += B0 + constant;
    A1 += B1 + constant;
    A2 += B2 + constant;
    A3 += B3 + constant;

    A[i+0] = A0;
    A[i+1] = A1;
    A[i+2] = A2;
    A[i+3] = A3;
}

//do the last few
for(; i < count; i++){
    A[i] += B[i] + constant;
}

The first for loop would actually be implemented using simd

You can do the same transformation:

void *ft_memchr(const void *ptr, int value, int num)
{
    int i;
    char* cptr = (char*)ptr;
    for(i = 0; i+3 < num; i+=4){
        char val0 = cptr[i+0];
        char val1 = cptr[i+1];
        char val2 = cptr[i+2];
        char val3 = cptr[i+3];

        if(val0 == value) 
            return &cptr[i+0];
        if(val1 == value) 
            return &cptr[i+1];
        if(val2 == value) 
            return &cptr[i+2];
        if(val3 == value) 
            return &cptr[i+3];
    }
    for(;i<num; i++){
        if(cptr[i] == value) 
            return &cptr[i];
    }
    return (0);
}

There are ways to implement the cascaded if check with simd.

Doing manual unroll gets pretty verbose once you do go really wide and the surface area for copy-paste-edit errors gets larger. So it is really important to ensure that you check for correctness and make sure it actually helps performance.

Mārtiņš Možeiko

#15648

June 28, 2018

And next step is to load 4 bytes with one uint32_t load (or 8 with uint64_t) and do only one comparison. Yes, only one comparison for each 4 or 8 bytes. That's what glibc generic C code for memchr does: https://sourceware.org/git/?p=gli...ob;f=string/memchr.c;hb=HEAD#l100

bewwys

#15653

June 29, 2018

Just by theses few steps I've learned a lot. But to be honest with you it's a little bit overwhelming for me guys. I'm reading a book call computer system a programmer's perspective and I'm currently at the first chapter.

Thank you guys :)