Mārtiņš Možeiko
2559 posts / 2 projects
Memory bandwidth + implementing memcpy
Edited by Mārtiņš Možeiko on
After watching Day 25 I want to comment on the memory bandwidth topic. I seriously doubt any code will reach the 32 GB/s figure Casey looked up online. That number is the maximum memory bandwidth the CPU supports; real bandwidth will be lower. To test this, I wrote a small test application that does memcpy from one memory buffer to another. And I am not seeing the max number from the spec sheet (on my laptop that is 25.6 GB/s for the i7-4750HQ CPU). To see whether we can write a better memcpy, I also implemented the copy with SSE and AVX instructions - still not getting that number.
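
To give an idea of the methodology, here is a minimal sketch of such a benchmark harness, assuming a 64 MiB buffer copied repeatedly and timed with clock_gettime. The buffer size and repeat count here are illustrative, not the exact values behind the numbers below; the full source I actually used is linked at the end of this post.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

// Minimal benchmark sketch: copy SIZE bytes REPEAT times, report GiB/s.
#define SIZE (64 * 1024 * 1024)
#define REPEAT 32

int main(void)
{
    uint8_t* src = malloc(SIZE);
    uint8_t* dst = malloc(SIZE);
    memset(src, 1, SIZE); // touch the pages so they are mapped before timing
    memset(dst, 2, SIZE);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < REPEAT; i++)
    {
        memcpy(dst, src, SIZE);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    printf("memcpy = %.2f GiB/s\n",
           (double)SIZE * REPEAT / (1024.0 * 1024.0 * 1024.0) / seconds);

    free(src);
    free(dst);
    return 0;
}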

So the numbers for my laptop with an i7-4750HQ CPU (16 GB of memory, two 8 GB modules), compiled with "clang -O2", are the following.

Simple memcpy gives me 6.51 GiB/s.

Using SSE2 with the following code gives me 3.97 GiB/s. That means memcpy is better optimized than naive SSE2.
// dst and src must be 16-byte aligned
// size must be multiple of 16*8 = 128 bytes
static void CopyWithSSE(uint8_t* dst, uint8_t* src, size_t size)
{
    size_t stride = 8 * sizeof(__m128);
    while (size)
    {
        __m128 a = _mm_load_ps((float*)(src + 0*sizeof(__m128)));
        __m128 b = _mm_load_ps((float*)(src + 1*sizeof(__m128)));
        __m128 c = _mm_load_ps((float*)(src + 2*sizeof(__m128)));
        __m128 d = _mm_load_ps((float*)(src + 3*sizeof(__m128)));
        __m128 e = _mm_load_ps((float*)(src + 4*sizeof(__m128)));
        __m128 f = _mm_load_ps((float*)(src + 5*sizeof(__m128)));
        __m128 g = _mm_load_ps((float*)(src + 6*sizeof(__m128)));
        __m128 h = _mm_load_ps((float*)(src + 7*sizeof(__m128)));
        _mm_store_ps((float*)(dst + 0*sizeof(__m128)), a);
        _mm_store_ps((float*)(dst + 1*sizeof(__m128)), b);
        _mm_store_ps((float*)(dst + 2*sizeof(__m128)), c);
        _mm_store_ps((float*)(dst + 3*sizeof(__m128)), d);
        _mm_store_ps((float*)(dst + 4*sizeof(__m128)), e);
        _mm_store_ps((float*)(dst + 5*sizeof(__m128)), f);
        _mm_store_ps((float*)(dst + 6*sizeof(__m128)), g);
        _mm_store_ps((float*)(dst + 7*sizeof(__m128)), h);

        size -= stride;
        src += stride;
        dst += stride;
    }
}


I tried with fewer registers to see if the code can be smaller. That still gives 3.97 GiB/s.
// dst and src must be 16-byte aligned
// size must be multiple of 16*2 = 32 bytes
static void CopyWithSSESmall(uint8_t* dst, uint8_t* src, size_t size)
{
    size_t stride = 2 * sizeof(__m128);
    while (size)
    {
        __m128 a = _mm_load_ps((float*)(src + 0*sizeof(__m128)));
        __m128 b = _mm_load_ps((float*)(src + 1*sizeof(__m128)));
        _mm_store_ps((float*)(dst + 0*sizeof(__m128)), a);
        _mm_store_ps((float*)(dst + 1*sizeof(__m128)), b);

        size -= stride;
        src += stride;
        dst += stride;
    }
}


Then I tried the store instruction that doesn't pollute the cache (_mm_stream_ps). That gives 6.67 GiB/s - very close to memcpy, so I'm guessing the C runtime on Linux uses this instruction.
// dst and src must be 16-byte aligned
// size must be multiple of 16*2 = 32 bytes
static void CopyWithSSENoCache(uint8_t* dst, uint8_t* src, size_t size)
{
    size_t stride = 2 * sizeof(__m128);
    while (size)
    {
        __m128 a = _mm_load_ps((float*)(src + 0*sizeof(__m128)));
        __m128 b = _mm_load_ps((float*)(src + 1*sizeof(__m128)));
        _mm_stream_ps((float*)(dst + 0*sizeof(__m128)), a);
        _mm_stream_ps((float*)(dst + 1*sizeof(__m128)), b);

        size -= stride;
        src += stride;
        dst += stride;
    }
}
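
One caveat worth noting about the streaming stores above: non-temporal stores are weakly ordered, so if the copied buffer is handed off to another thread right away, the copy should end with a store fence. A sketch of the same loop with the fence added:

#include <stdint.h>
#include <xmmintrin.h>

// Same copy as above, plus a store fence so the weakly-ordered
// non-temporal stores are globally visible when the function returns.
static void CopyWithSSENoCacheFenced(uint8_t* dst, uint8_t* src, size_t size)
{
    size_t stride = 2 * sizeof(__m128);
    while (size)
    {
        __m128 a = _mm_load_ps((float*)(src + 0*sizeof(__m128)));
        __m128 b = _mm_load_ps((float*)(src + 1*sizeof(__m128)));
        _mm_stream_ps((float*)(dst + 0*sizeof(__m128)), a);
        _mm_stream_ps((float*)(dst + 1*sizeof(__m128)), b);

        size -= stride;
        src += stride;
        dst += stride;
    }
    _mm_sfence(); // order the streaming stores before any later stores
}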


I tried using prefetch instructions, but that did not give any reasonable speedup. I'm guessing modern CPUs can predict linear memory access pretty efficiently and prefetch automatically.
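
For reference, a sketch of the kind of prefetching loop I mean (the 512-byte prefetch distance is an arbitrary illustrative choice, not a tuned value):

#include <stdint.h>
#include <xmmintrin.h>

// Sketch: same SSE copy as above, with a software prefetch a fixed
// distance ahead. Prefetching past the end of the buffer is harmless,
// since prefetch hints never fault.
static void CopyWithSSEPrefetch(uint8_t* dst, uint8_t* src, size_t size)
{
    size_t stride = 2 * sizeof(__m128);
    while (size)
    {
        _mm_prefetch((const char*)(src + 512), _MM_HINT_NTA);
        __m128 a = _mm_load_ps((float*)(src + 0*sizeof(__m128)));
        __m128 b = _mm_load_ps((float*)(src + 1*sizeof(__m128)));
        _mm_store_ps((float*)(dst + 0*sizeof(__m128)), a);
        _mm_store_ps((float*)(dst + 1*sizeof(__m128)), b);

        size -= stride;
        src += stride;
        dst += stride;
    }
}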

Then I tried using AVX instructions. This gives 4.00 GiB/s - not better than SSE.
// dst and src must be 32-byte aligned
// size must be multiple of 32*16 = 512 bytes
static void CopyWithAVX(uint8_t* dst, uint8_t* src, size_t size)
{
    size_t stride = 16 * sizeof(__m256i);
    while (size)
    {
        __m256i a = _mm256_load_si256((__m256i*)src + 0);
        __m256i b = _mm256_load_si256((__m256i*)src + 1);
        __m256i c = _mm256_load_si256((__m256i*)src + 2);
        __m256i d = _mm256_load_si256((__m256i*)src + 3);
        __m256i e = _mm256_load_si256((__m256i*)src + 4);
        __m256i f = _mm256_load_si256((__m256i*)src + 5);
        __m256i g = _mm256_load_si256((__m256i*)src + 6);
        __m256i h = _mm256_load_si256((__m256i*)src + 7);
        __m256i i = _mm256_load_si256((__m256i*)src + 8);
        __m256i j = _mm256_load_si256((__m256i*)src + 9);
        __m256i k = _mm256_load_si256((__m256i*)src + 10);
        __m256i l = _mm256_load_si256((__m256i*)src + 11);
        __m256i m = _mm256_load_si256((__m256i*)src + 12);
        __m256i n = _mm256_load_si256((__m256i*)src + 13);
        __m256i o = _mm256_load_si256((__m256i*)src + 14);
        __m256i p = _mm256_load_si256((__m256i*)src + 15);
        _mm256_store_si256((__m256i*)dst + 0, a);
        _mm256_store_si256((__m256i*)dst + 1, b);
        _mm256_store_si256((__m256i*)dst + 2, c);
        _mm256_store_si256((__m256i*)dst + 3, d);
        _mm256_store_si256((__m256i*)dst + 4, e);
        _mm256_store_si256((__m256i*)dst + 5, f);
        _mm256_store_si256((__m256i*)dst + 6, g);
        _mm256_store_si256((__m256i*)dst + 7, h);
        _mm256_store_si256((__m256i*)dst + 8, i);
        _mm256_store_si256((__m256i*)dst + 9, j);
        _mm256_store_si256((__m256i*)dst + 10, k);
        _mm256_store_si256((__m256i*)dst + 11, l);
        _mm256_store_si256((__m256i*)dst + 12, m);
        _mm256_store_si256((__m256i*)dst + 13, n);
        _mm256_store_si256((__m256i*)dst + 14, o);
        _mm256_store_si256((__m256i*)dst + 15, p);

        size -= stride;
        src += stride;
        dst += stride;
    }
}


Let's see if reducing the register count helps, or at least doesn't make everything worse. It doesn't help: I'm getting 3.99 GiB/s for this.
// dst and src must be 32-byte aligned
// size must be multiple of 32*2 = 64 bytes
static void CopyWithAVXSmall(uint8_t* dst, uint8_t* src, size_t size)
{
    size_t stride = 2 * sizeof(__m256i);
    while (size)
    {
        __m256i a = _mm256_load_si256((__m256i*)src + 0);
        __m256i b = _mm256_load_si256((__m256i*)src + 1);
        _mm256_store_si256((__m256i*)dst + 0, a);
        _mm256_store_si256((__m256i*)dst + 1, b);

        size -= stride;
        src += stride;
        dst += stride;
    }
}


Using the store instruction that doesn't pollute the cache helps. Now I'm getting 6.64 GiB/s - the same speed as with SSE.
// dst and src must be 32-byte aligned
// size must be multiple of 32*2 = 64 bytes
static void CopyWithAVXNoCache(uint8_t* dst, uint8_t* src, size_t size)
{
    size_t stride = 2 * sizeof(__m256i);
    while (size)
    {
        __m256i a = _mm256_load_si256((__m256i*)src + 0);
        __m256i b = _mm256_load_si256((__m256i*)src + 1);
        _mm256_stream_si256((__m256i*)dst + 0, a);
        _mm256_stream_si256((__m256i*)dst + 1, b);

        size -= stride;
        src += stride;
        dst += stride;
    }
}


Then I tried a couple of crazy things.
First I tried the "rep movsb", "rep movsl" and "rep movsq" instructions. These are typically not recommended on modern CPUs, but I was surprised to find that they give better speed than just using SSE instructions - ~5.5 GiB/s for all three variants. So for smaller moves, using "rep movsb" is OK in my opinion. I also tried unaligned addresses (not a multiple of 16) - still OK, around 5.5 GiB/s. I'm guessing modern CPUs recognize the "rep movsX" instructions as a special case and do "the right thing" automatically.

// copies size bytes
static void __movsb(void* dst, const void* src, size_t size)
{
    __asm__ __volatile__("rep movsb" : "+D"(dst), "+S"(src), "+c"(size) : : "memory");
}

// copies size dwords, i.e. size * 4 bytes
static void __movsd(void* dst, const void* src, size_t size)
{
    __asm__ __volatile__("rep movsl" : "+D"(dst), "+S"(src), "+c"(size) : : "memory");
}

// copies size qwords, i.e. size * 8 bytes
static void __movsq(void* dst, const void* src, size_t size)
{
    __asm__ __volatile__("rep movsq" : "+D"(dst), "+S"(src), "+c"(size) : : "memory");
}
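
Note the count semantics above: __movsb takes a byte count, while the dword/qword variants take element counts. A hedged wrapper sketch (the helper name is mine, not from the benchmark) for copying an arbitrary byte count:

#include <stdint.h>

// Hypothetical wrapper: bulk copy with rep movsq, then a rep movsb
// tail for the remaining 0-7 bytes.
static void CopyWithRepMovs(void* dst, const void* src, size_t size)
{
    size_t qwords = size / 8;
    __movsq(dst, src, qwords);
    __movsb((uint8_t*)dst + qwords * 8, (const uint8_t*)src + qwords * 8, size % 8);
}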


Then I went with completely crazy stuff - copying in parallel with two threads:
const size_t kThreadCount = 2;

struct ThreadWorkData
{
    uint8_t* src;
    uint8_t* dst;
    size_t size;

    volatile bool RunThread;
};

static ThreadWorkData ThreadData[kThreadCount];
static volatile long ThreadsReady;

static void* ThreadProc(void* Arg)
{
    size_t ThreadIndex = (size_t)Arg;
    ThreadWorkData* MyData = &ThreadData[ThreadIndex];
    for (;;)
    {
        while (!MyData->RunThread)
        {
        }
        CopyWithSSENoCache(MyData->dst, MyData->src, MyData->size);
        __sync_add_and_fetch(&ThreadsReady, 1);
        MyData->RunThread = false;
    } 
    return 0;
}

static void SetupThreads()
{
    for (size_t i=0; i<kThreadCount; i++)
    {
        pthread_t thread;
        pthread_create(&thread, 0, ThreadProc, (void*)i);
    }
}

// dst and src must be 32-byte aligned
// size must be multiple of 32*2*kThreadCount = 64*kThreadCount bytes
static void CopyWithThreads(uint8_t* dst, uint8_t* src, size_t size)
{
    size_t size1 = size / kThreadCount;

    ThreadsReady = 0;
    for (size_t i=0; i<kThreadCount; i++)
    {
        ThreadData[i].dst = dst;
        ThreadData[i].src = src;
        ThreadData[i].size = size1;
        ThreadData[i].RunThread = true;

        dst += size1;
        src += size1;
    }

    while (ThreadsReady != kThreadCount)
    {
    }
}

This gives 7.7 GiB/s, so a bit better than the other solutions. Increasing the thread count to 4 (my CPU is quad-core) doesn't help; the speed stays the same, presumably because the copy is limited by memory bandwidth, not by the CPU.

So in summary, here are speeds for my laptop with i7-4750HQ, compiled with clang, running under Linux:
memcpy = 6.51 GiB/s
CopyWithSSE = 3.97 GiB/s
CopyWithSSESmall = 3.97 GiB/s
CopyWithSSENoCache = 6.67 GiB/s
CopyWithAVX = 4.00 GiB/s
CopyWithAVXSmall = 3.99 GiB/s
CopyWithAVXNoCache = 6.64 GiB/s
CopyWithRepMovsb = 5.69 GiB/s
CopyWithRepMovsd = 5.22 GiB/s
CopyWithRepMovsq = 5.19 GiB/s
CopyWithRepMovsbUnaligned = 5.11 GiB/s
CopyWithThreads = 7.70 GiB/s


Numbers on my desktop with i5-750 (no AVX instruction set), compiled with Visual Studio 2013 (x64):
memcpy = 3.46 GiB/s
CopyWithSSE = 3.48 GiB/s
CopyWithSSESmall = 3.43 GiB/s
CopyWithSSENoCache = 4.79 GiB/s
CopyWithRepMovsb = 4.08 GiB/s
CopyWithRepMovsd = 4.11 GiB/s
CopyWithRepMovsq = 4.01 GiB/s
CopyWithRepMovsbUnaligned = 3.93 GiB/s
CopyWithThreads = 4.44 GiB/s

Only memcpy is less efficient (apparently it doesn't use the SSE store instructions that don't pollute the cache).

My conclusion from all this: if you want to implement a fast memcpy, don't bother with SSE on modern CPUs. Just use the "classic" rep movsb instruction. MSVC has an intrinsic for that (__movsb); on GCC/clang it is pretty trivial to implement (see the code above). And don't expect the numbers to be close to the max bandwidth figures you see in spec sheets.

Also, about CopyMemory - it is actually a #define that ends up calling memcpy. In VS2013, CopyMemory (and its friends) are defined in minwinbase.h like this:
#define MoveMemory RtlMoveMemory
#define CopyMemory RtlCopyMemory
#define FillMemory RtlFillMemory
#define ZeroMemory RtlZeroMemory


The RtlXyz names are macros defined in the winnt.h file:
#define RtlEqualMemory(Destination,Source,Length) (!memcmp((Destination),(Source),(Length)))
#define RtlMoveMemory(Destination,Source,Length) memmove((Destination),(Source),(Length))
#define RtlCopyMemory(Destination,Source,Length) memcpy((Destination),(Source),(Length))
#define RtlFillMemory(Destination,Length,Fill) memset((Destination),(Fill),(Length))
#define RtlZeroMemory(Destination,Length) memset((Destination),0,(Length))


So there is no "Windows" function for copying memory. And because we are implementing everything ourselves, and Casey said he won't use the standard library for anything, it is not "fair" to use CopyMemory :) We need to implement it ourselves. My suggestion:
void CopyMemory(void* dst, const void* src, size_t size)
{
#ifdef _MSC_VER
    __movsb(dst, src, size);
#elif defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("rep movsb" : "+D"(dst), "+S"(src), "+c"(size) : : "memory");
#else
    #error TODO: implement for other architectures
#endif
}


All my source code is available here: MemSpeed.cpp. It can be compiled with MSVC2013, clang and gcc.
Livet Ersomen Strøm
163 posts
Memory bandwidth + implementing memcpy
Nice and interesting post.

Can you test the Linux build against the Win build on the same hardware, please?

I am interested because I am on Linux now, since about when the Windows 10 preview came along. I frankly find the new Windows Metro style obscene and extremely ugly. It's like: wtf? Every time I see it, I perceive a worse experience than the last time. It's like taking a cutting tool to your eyeballs! It has "bad taste" written all over it. Who the fuck would consider writing a new API for a freaking TILING engine? Hahha. That's what it is: TILING 2.0. It's ugly as hell, and it's annoying ;)

But on the other hand I find Linux to feel very slow and unresponsive. Also frequent strange crashes and hangups - not of the OS, but of the apps I am using: Firefox, System Monitor, copying and so on. It's also idling at like 5-10% CPU on every core, which I find terrible. Right now it's using 10% on core 3 for doing absolutely nothing. And I would never consider using this platform for serious coding, especially of games or other performant code. And I am frankly in a bit of shock when I hear of other people doing it. What have I missed?

Windows is blazing fast if done right, and the MSDN papers have proven me wrong so many times that I now trust them completely. Every time I suspected there was something wrong with the API, it always turned out that there was not, and that I was just misunderstanding something. I don't know how many times I realized that the real asshole was me. I would gauge MSDN has a failure rate of about 0.1% or something. You can hardly get better information than that.

While some of the higher-level features don't seem all that right to me, the core OS is just about as fast and well done as it could be, on average. And after 30+ years, what would you expect? But Linux seems as sluggish and slow as every time I tried it in the last 15 years. And now that Ubuntu is bundled with spyware and a disabled firewall, with Google's and Amazon's hands all over it, the net result is the feeling that it's the worst of two worlds. It's slow, and it will become a launchpad for marketing powers to flourish. I don't like that. And moving distros doesn't help, as they will be where the users are. I do not like that at all.

However, upon seeing that my win32 applications run very well on the Wine platform - way better than Linux itself, and close to win32 - I guess I am, to some extent, still considering it as an option for the future.

Sorry for ranting in your thread.

I'd be very interested to read the things you have to say about memory speeds.
Mārtiņš Možeiko
2559 posts / 2 projects
Memory bandwidth + implementing memcpy
Edited by Mārtiņš Možeiko on
Kladdehelvete
Can you test the Linux build against the Win build on the same hardware, please?

Sorry, I don't have Windows on the same hardware as Linux. And I think it's pointless to do that: these functions are pure memory operations, so there won't be a difference in speed, because nothing depends on the OS (as long as the memory really is in physical memory, not in the pagefile).

The only thing such a test will show is the difference between compilers (MSVC vs clang/gcc). I can test that if you want. For the clang-built executable running on my i7-4750HQ laptop, see my numbers above. Running the Visual C++ compiled executable under Wine gives the following numbers:
[mmozeiko@dev ~]$ WINEARCH=win64 WINEPREFIX=~/.wine64 wine ./MemSpeed.exe
memcpy = 4.48 GiB/s
CopyWithSSE = 4.48 GiB/s
CopyWithSSESmall = 4.48 GiB/s
CopyWithSSENoCache = 7.67 GiB/s
CopyWithAVX = 4.51 GiB/s
CopyWithAVXSmall = 4.49 GiB/s
CopyWithAVXNoCache = 7.64 GiB/s
CopyWithRepMovsb = 5.85 GiB/s
CopyWithRepMovsd = 5.60 GiB/s
CopyWithRepMovsq = 5.57 GiB/s
CopyWithRepMovsbUnaligned = 5.50 GiB/s
CopyWithThreads = 7.55 GiB/s


As you can see, the numbers are pretty close to the native Linux ones - except memcpy, which makes sense: I made MSVC link the C runtime statically, so Wine executes the memcpy from Microsoft Visual C/C++ instead of calling glibc's memcpy.

But on the other hand I find Linux to feel very slow and unresponsive.
In my experience it is the other way around: Linux is much more responsive for me - in I/O, process creation, threading, etc. In my work I often need to compile the whole of llvm/clang - a very large C++ project. On the same hardware, doing that under Windows using Visual C++ takes 2x or 3x longer than using clang/gcc on Linux. Using clang/gcc on Windows is still slower than on Linux. My guess would be that this is because Windows simply doesn't optimize such low-level stuff anymore; they only change high-level stuff (UI & Metro) nowadays. But Linux does (example). But let's leave this for a different thread.
Filip
24 posts
Memory bandwidth + implementing memcpy
FYI, I did a naïve port of your code to Mac OS X and did a run on my 2012 Retina MacBook Pro. I get some interesting results:

memcpy = 7.44 GiB/s
CopyWithSSE = 3.92 GiB/s
CopyWithSSESmall = 3.75 GiB/s
CopyWithSSENoCache = 0.34 GiB/s
CopyWithAVX = 4.09 GiB/s
CopyWithAVXSmall = 4.08 GiB/s
CopyWithAVXNoCache = 5.87 GiB/s
CopyWithRepMovsb = 7.59 GiB/s
CopyWithRepMovsd = 7.17 GiB/s
CopyWithRepMovsq = 7.25 GiB/s
CopyWithRepMovsbUnaligned = 6.90 GiB/s
CopyWithThreads = 0.67 GiB/s


And btw, the "port" was only:
1. link with -mavx2
2. add clock_gettime impl found on stack overflow:
#ifdef __MACH__
#include <sys/time.h>
#define CLOCK_REALTIME 0 
#define CLOCK_MONOTONIC 0 
//clock_gettime is not implemented on OSX
int clock_gettime(int /*clk_id*/, struct timespec* t) {
    struct timeval now;
    int rv = gettimeofday(&now, NULL);
    if (rv) return rv;
    t->tv_sec  = now.tv_sec;
    t->tv_nsec = now.tv_usec * 1000;
    return 0;
}
#endif
Mārtiņš Možeiko
2559 posts / 2 projects
Memory bandwidth + implementing memcpy
Edited by Mārtiņš Možeiko on
Can you check the assembly code for the CopyWithSSENoCache function? Something isn't right there.
Also, you need only the "-mavx" compiler flag. I'm not using AVX2 instructions, just AVX.

For timing on OSX I would use the functions from the <mach/mach_time.h> header. mach_absolute_time() returns ticks as a uint64_t, and with mach_timebase_info(...) you can get how many ticks are in a second. These functions are very similar to QueryPerformanceCounter and QueryPerformanceFrequency on Windows.
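
A sketch of how that would look, converting elapsed ticks to nanoseconds (my usual pattern; not tested here):

#include <mach/mach_time.h>
#include <stdint.h>

// Elapsed nanoseconds between two mach_absolute_time() readings.
// numer/denom from mach_timebase_info convert ticks to nanoseconds.
static uint64_t ElapsedNs(uint64_t start, uint64_t end)
{
    mach_timebase_info_data_t info;
    mach_timebase_info(&info);
    return (end - start) * info.numer / info.denom;
}

// usage:
//   uint64_t t0 = mach_absolute_time();
//   /* ... copy ... */
//   uint64_t t1 = mach_absolute_time();
//   uint64_t ns = ElapsedNs(t0, t1);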
Filip
24 posts
Memory bandwidth + implementing memcpy
Hrm, I'm sorry - I just forgot the optimization flag.
Now the results seem more plausible:
memcpy = 7.54 GiB/s
CopyWithSSE = 5.57 GiB/s
CopyWithSSESmall = 5.55 GiB/s
CopyWithSSENoCache = 7.88 GiB/s
CopyWithAVX = 5.51 GiB/s
CopyWithAVXSmall = 5.53 GiB/s
CopyWithAVXNoCache = 7.75 GiB/s
CopyWithRepMovsb = 7.30 GiB/s
CopyWithRepMovsd = 7.10 GiB/s
CopyWithRepMovsq = 7.29 GiB/s
CopyWithRepMovsbUnaligned = 6.77 GiB/s
CopyWithThreads = 8.62 GiB/s
Andrew Bromage
183 posts / 1 project
Research engineer, resident maths nerd (Erdős number 3).
Memory bandwidth + implementing memcpy
If your stdlib is any good, then it probably selects an algorithm based on the size, the architecture, the degree of overlap (if it's memmove), and whether the source and destination are both aligned or not. That last point is important on SSE; it's the difference between movaps and movups.

As several people discovered, rep movsb isn't as bad as it's often alleged to be. So much so that even modern memcpy/memmove implementations just go ahead and use it if the block isn't very big, or to do the initial and final unaligned parts.

For larger blocks, modern stdlibs will often detect the CPU at startup time and use a copy routine tuned for that CPU.
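
A toy sketch of that kind of size-and-alignment dispatch (the 256 KiB threshold is invented for illustration - real libraries tune it per CPU; CopyWithSSENoCache is the function from earlier in this thread):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Toy dispatcher in the spirit described above.
static void DispatchingCopy(void* dst, void* src, size_t size)
{
    int bigAndAligned = size >= 256 * 1024
                     && (((uintptr_t)dst | (uintptr_t)src) & 15) == 0
                     && (size & 31) == 0; // CopyWithSSENoCache needs multiples of 32
    if (bigAndAligned)
    {
        CopyWithSSENoCache((uint8_t*)dst, (uint8_t*)src, size); // streaming stores
    }
    else
    {
        memcpy(dst, src, size); // small or unaligned: plain copy / rep movsb territory
    }
}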
Dejan
25 posts
Memory bandwidth + implementing memcpy
The latest Intel recommendations (they seem to agree with the numbers everyone has posted):

"Beginning with processors based on Intel microarchitecture code name Ivy Bridge, REP string operation using MOVSB and STOSB can provide both flexible and high-performance REP string operations for software in common situations like memory copy and set operations."

and

"For processors supporting enhanced REP MOVSB/STOSB, implementing memcpy with REP MOVSB will provide even more compact benefits in code size and better throughput than using the combination of REP MOVSD+B. For processors based on Intel microarchitecture code name Ivy Bridge, implementing memcpy using ERMSB might not reach the same level of throughput as using 256-bit or 128-bit AVX alternatives, depending on length and alignment factors."
ben
5 posts
Memory bandwidth + implementing memcpy
Yeah, these things are very microarchitecture-dependent. I believe the optimal way of doing memcpy varies depending on whether you are on Sandy Bridge, Ivy Bridge or Haswell.
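
For what it's worth, the ERMSB feature from the Intel quote above is advertised through CPUID (leaf 7, sub-leaf 0, EBX bit 9), so code can pick a strategy per CPU. A sketch of the check using GCC/clang's <cpuid.h> (my assumption; MSVC would use __cpuidex instead):

#include <cpuid.h>
#include <stdbool.h>

// True if the CPU reports Enhanced REP MOVSB/STOSB
// (CPUID.(EAX=7,ECX=0):EBX bit 9).
static bool HasERMSB(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
    {
        return false;
    }
    return (ebx & (1u << 9)) != 0;
}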
Simon Anciaux
1337 posts
Memory bandwidth + implementing memcpy
Hey, mmozeiko, I have some questions about the CopyWithSSENoCache function.

// dst and src must be 16-byte aligned
// size must be multiple of 16*2 = 32 bytes
static void CopyWithSSENoCache(uint8_t* dst, uint8_t* src, size_t size)
{
    size_t stride = 2 * sizeof(__m128);
    while (size)
    {
        __m128 a = _mm_load_ps((float*)(src + 0*sizeof(__m128)));
        __m128 b = _mm_load_ps((float*)(src + 1*sizeof(__m128)));
        _mm_stream_ps((float*)(dst + 0*sizeof(__m128)), a);
        _mm_stream_ps((float*)(dst + 1*sizeof(__m128)), b);

        size -= stride;
        src += stride;
        dst += stride;
    }
}


I don't know much about SSE so I'm asking in case there are details I'm not aware of.

Why do you use two loads and two streams? Couldn't we use only one, and not have the requirement of the size being a multiple of 32? Is it because the CPU can issue two of these on different ports at the same time? I tested it with only 1 load and 1 stream and it seems to work with not much performance difference (on my CPU, which is quite old now - a 2009 Lynnfield i7 860).

Is there a reason for not using load_si128 and stream_si128 instead (load_si128 has "better" throughput)? I tried it and it seems there is no difference. I also tried the stream_load_si128 SSE 4.1 instruction, but it didn't seem to matter at all.

static void CopyWithSSENoCache_single(uint8_t* dst, uint8_t* src, size_t size)
{
    size_t stride = sizeof(__m128);
    while (size)
    {
        __m128 a = _mm_load_ps((float*)(src + 0*sizeof(__m128)));
        _mm_stream_ps((float*)(dst + 0*sizeof(__m128)), a);
        
        size -= stride;
        src += stride;
        dst += stride;
    }
}

static void CopyWithSSENoCache_load_si128_stream_ps(uint8_t* dst, uint8_t* src, size_t size)
{
    size_t stride = 2 * sizeof(__m128);
    while (size)
    {
        __m128i a = _mm_load_si128((__m128i*)(src + 0*sizeof(__m128)));
        __m128i b = _mm_load_si128((__m128i*)(src + 1*sizeof(__m128)));
        _mm_stream_ps((float*)(dst + 0*sizeof(__m128)), *(__m128*)&a);
        _mm_stream_ps((float*)(dst + 1*sizeof(__m128)), *(__m128*)&b);
        
        size -= stride;
        src += stride;
        dst += stride;
    }
}

static void CopyWithSSENoCache_load_si128_stream_si128(uint8_t* dst, uint8_t* src, size_t size)
{
    size_t stride = 2 * sizeof(__m128);
    while (size)
    {
        __m128i a = _mm_load_si128((__m128i*)(src + 0*sizeof(__m128)));
        __m128i b = _mm_load_si128((__m128i*)(src + 1*sizeof(__m128)));
        _mm_stream_si128((__m128i*)(dst + 0*sizeof(__m128i)), a);
        _mm_stream_si128((__m128i*)(dst + 1*sizeof(__m128i)), b);
        
        size -= stride;
        src += stride;
        dst += stride;
    }
}

static void CopyWithSSENoCache_stream_load_si128_stream_si128(uint8_t* dst, uint8_t* src, size_t size)
{
    size_t stride = 2 * sizeof(__m128);
    while (size)
    {
        /* SSE 4.1 */
        __m128i a = _mm_stream_load_si128((__m128i*)(src + 0*sizeof(__m128)));
        __m128i b = _mm_stream_load_si128((__m128i*)(src + 1*sizeof(__m128)));
        _mm_stream_si128((__m128i*)(dst + 0*sizeof(__m128i)), a);
        _mm_stream_si128((__m128i*)(dst + 1*sizeof(__m128i)), b);
        
        size -= stride;
        src += stride;
        dst += stride;
    }
}


/* Different runs give different results, but all SSENoCache functions seem equivalent. */
memcpy = 3.81 GiB/s
CopyWithSSE = 3.86 GiB/s
CopyWithSSESmall = 3.81 GiB/s
CopyWithSSENoCache = 5.14 GiB/s
CopyWithSSENoCache_single = 5.18 GiB/s
CopyWithSSENoCache_load_si128_stream_ps = 5.13 GiB/s
CopyWithSSENoCache_load_si128_stream_si128 = 5.17 GiB/s
CopyWithSSENoCache_stream_load_si128_stream_si128 = 5.13 GiB/s
CopyWithRepMovsb = 4.33 GiB/s
CopyWithRepMovsd = 4.36 GiB/s
CopyWithRepMovsq = 4.46 GiB/s
CopyWithRepMovsbUnaligned = 4.36 GiB/s
CopyWithThreads = 5.06 GiB/s
Mārtiņš Možeiko
2559 posts / 2 projects
Memory bandwidth + implementing memcpy
Edited by Mārtiņš Možeiko on
Intuition usually tells me to put two or more of the same kind of SSE instruction next to each other, because usually there is other code around that pipelines well with these instructions. If you have only one instruction, then often there is some kind of pipeline bottleneck and it does not run as fast as expected. I am pretty sure one pair of load+store instructions in a loop was slower than the two pairs I wrote here. But I don't remember much about what I measured 5 years ago here :)

No idea about the differences between integer and float loads. I actually did not know there were any throughput differences between them. My assumption was that a load is a load, no matter how the data will be used - integer or float. I never bothered looking up the exact numbers...
Simon Anciaux
1337 posts
Memory bandwidth + implementing memcpy
Thanks.