Memory bandwidth + implementing memcpy

I don't know how DDR memory works internally with all those memory and bus rates; you'll need to read the DDR spec for that.

If I have 2 sticks of DDR4-2666 RAM, does it double the transfer rate?

Right, with double the sticks the bandwidth goes up 2x. But because memcpy does both a read and a write, that divides the bandwidth in half, so you're back to 21328 MB/s.
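
For context, those figures come from simple arithmetic. A small sketch that just spells it out, assuming DDR4-2666 in dual channel (8 bytes per transfer per channel):

```c
// Back-of-the-envelope arithmetic for the numbers above (assumed DDR4-2666,
// dual channel): each channel does 2666 million transfers/s of 8 bytes each.
#include <stdio.h>

int main(void)
{
    long long per_channel  = 2666LL * 8;        // 21328 MB/s per channel
    long long dual_channel = per_channel * 2;   // 42656 MB/s with two sticks
    long long memcpy_rate  = dual_channel / 2;  // memcpy reads + writes -> ~21328 MB/s copied

    printf("per channel:  %lld MB/s\n", per_channel);
    printf("dual channel: %lld MB/s\n", dual_channel);
    printf("memcpy rate:  %lld MB/s\n", memcpy_rate);
    return 0;
}
```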

But if 4 cores start touching memory simultaneously, and the total bandwidth is still 40 GB/s, is each core's perceived bandwidth roughly 10 GB/s?

Correct.

If all I'm saying is true, then why does using two threads to memcpy give the best result overall?

My guess would be that the second thread hides minor stalls in the pipeline of the first thread. A single thread almost never maxes out the bandwidth, because it needs to do some other things too. But once you start adding too many threads, there are other inefficiencies in cache management or other parts of the CPU implementation when talking to memory.
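
As a rough illustration (not the exact benchmark code from this thread), here is a minimal sketch of splitting one large copy across two threads with pthreads; the 50/50 split and the helper names are made up for the example, and real code would size the split to the machine:

```c
// Minimal sketch: copy the first half on the calling thread while a second
// thread copies the other half. Build with -pthread.
#include <string.h>
#include <pthread.h>

typedef struct
{
    void *dst;
    const void *src;
    size_t size;
} CopyJob;

static void *copy_worker(void *arg)
{
    CopyJob *job = (CopyJob *)arg;
    memcpy(job->dst, job->src, job->size);
    return NULL;
}

void memcpy_two_threads(void *dst, const void *src, size_t size)
{
    size_t half = size / 2;
    CopyJob jobs[2] = {
        { dst, src, half },
        { (char *)dst + half, (const char *)src + half, size - half },
    };

    pthread_t thread;
    pthread_create(&thread, NULL, copy_worker, &jobs[1]); // second half in background
    copy_worker(&jobs[0]);                                // first half on this thread
    pthread_join(thread, NULL);
}
```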

When those ops finish you can issue new ones while processing the current ones. So I guess the same applies to memory IO?

Yes, that is right. This works the same way with regular instructions. For example, if you have a very slow operation, like a divide, then you can issue a bunch of other simpler instructions in the "background" and get them computed for free. It's just the natural way the CPU operates.
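
As a toy illustration (the function and variable names are made up for the example): the additions below don't depend on the divide, so an out-of-order CPU can work on them while the slow division is still in flight.

```c
// Toy example of instruction-level parallelism: the loop's additions are
// independent of the division, so the CPU can execute them while the
// (many-cycle) divide is still completing.
double overlap_example(double a, double b, const int *values, int count)
{
    double quotient = a / b;   // slow operation

    int sum = 0;
    for (int i = 0; i < count; i++)
    {
        sum += values[i];      // independent "background" work
    }

    return quotient + (double)sum;
}
```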

This is a nice article that talks about how much compute Zen5 can do vs memory access: http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardown/

A loop that streams data from memory must do at least 340 AVX512 instructions for every 512-bit load from memory to not bottleneck on memory bandwidth.


Replying to longtran2904 (#30402)

Right, with double the sticks the bandwidth goes up 2x. But because memcpy does both a read and a write, that divides the bandwidth in half, so you're back to 21328 MB/s.

This sounds too good to be true. Rather than 2x8GB, why don't I just use 4x4GB? Then wouldn't my memory bandwidth increase by a factor of 4? And if that is the case, why don't hardware vendors, when designing a 16GB RAM stick, split that stick into 4 sections, each with its own control chip? Remember that the price of a 16GB stick is roughly equal to 2x8GB or 4x4GB. Is it because they don't have enough physical space?

One thing I wonder, though, is what contributes more to my 2x bandwidth over yours. Is it because I'm using a newer CPU (8700 vs 4750), or is it more the difference between DDR3 and DDR4? If someone uses a newer CPU than me but still the same DDR4 RAM, do you think they would see as big a difference?

Regarding executing other instructions in the background while waiting for some pending memory IO, is there a reliable way in C for me to signal that I want an address loaded without actually doing any work with it? Something like a dummy *ptr; that doesn't get optimized out by the compiler.


Replying to mmozeiko (#30403)

It depends on the CPU & chipset. Most desktop CPUs support only up to dual channel mode. So only a 2x bandwidth increase, even if you use 4 sticks. This is nothing new; computers have been doing this for the last 20 years or so. See this whitepaper from Kingston from 2003: https://web.archive.org/web/20110929024052/http://www.kingston.com/newtech/MKF_520DDRwhitepaper.pdf

You can think of it as "RAID-0" but for memory. There's not much else going on there: the memory controller simply splits transfers in half and does them in parallel to each memory module. The CPU & motherboard vendors just need to design for it - they need extra physical space on the chip & board to route wires to the memory modules.

Some Intel and server AMD CPUs have triple or quad channels. Then the sticks should be inserted in multiples of three or four, and the bandwidth increases 3x or 4x: https://en.wikipedia.org/wiki/Multi-channel_memory_architecture#Triple-channel_architecture

Asking the CPU to get memory without using it just yet is called prefetching. The CPU does it automatically - that's called the hardware prefetcher. It tries to figure out the patterns in your memory accesses and loads them ahead of time into the faster caches. You can also put prefetch instructions in your code to explicitly prefetch memory locations you want - that's called software prefetching. Sometimes it helps, sometimes it won't; it depends on your code. If you prefetch too much, then you're wasting bandwidth and/or cache space, and your code may run much slower. You can get a lot of information about prefetching in Agner Fog's manuals or the AMD & Intel optimization manuals.
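
To answer the C question above: with GCC/Clang the usual way is the __builtin_prefetch builtin (MSVC has _mm_prefetch in <xmmintrin.h>). A minimal sketch, with the prefetch distance of 64 elements picked arbitrarily for illustration, not tuned:

```c
#include <stddef.h>
#include <stdint.h>

uint64_t sum_with_prefetch(const uint64_t *data, size_t count)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++)
    {
        // Ask the CPU to start pulling a future element's cache line into
        // cache now, so it is (hopefully) already there when the loop
        // reaches it. The "+ 64" distance is a guess, not a tuned value.
        if (i + 64 < count)
        {
            __builtin_prefetch(&data[i + 64], /*rw=*/0, /*locality=*/3);
        }
        sum += data[i];
    }
    return sum;
}
```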


Replying to longtran2904 (#30405)

Some Intel and server AMD CPUs have triple or quad channels. Then the sticks should be inserted in multiples of three or four, and the bandwidth increases 3x or 4x

So I was asking why vendors don't design a single RAM stick, in a single RAM slot, that increases the bandwidth 2x/3x/4x, rather than needing 2x/3x/4x the RAM slots. Is there any weird server CPU or motherboard that does this?

I guess your answer is that, in theory, you could do it, but it also depends on the CPU and motherboard vendors, not the RAM vendors alone. In practice, after weighing the cost-benefit analysis, most vendors choose not to.

Regarding dual channel mode, the only way to reach that bandwidth when running the benchmark is for the OS to allocate half of the memory pages on one RAM stick and the other half on the other, right? How can you make sure, or nudge the OS, to do the right thing?


Replying to mmozeiko (#30406)

They have been doing that all the time: DDR -> DDR2 -> DDR3 -> DDR4 -> DDR5. Each new generation increases bandwidth over the previous one. As you say, that's not just a memory stick vendor problem; everybody needs to agree & standardize on this - CPU vendor, motherboard vendor, memory stick vendor. How to connect it all physically together, how to make the protocol work at higher clock rates, agree on voltage, etc.

Regarding dual channel mode, the only way to reach that bandwidth when running the benchmark is for the OS to allocate half of the memory pages on one RAM stick and the other half on the other, right?

No, it's all transparent to the OS. The memory addresses don't matter; the memory controller takes care of that. As long as you do bulk reads and writes (which the CPU does for you at cache line granularity) it will be double speed. I don't know the exact details, but my assumption would be that every 16 bytes are split in half - 8 and 8 - and each half goes to a separate channel, because 8-byte transfers are what the memory controller uses to talk to memory.


Replying to longtran2904 (#30408)

They have been doing that all the time: DDR -> DDR2 -> DDR3 -> DDR4 -> DDR5. Each new generation increases bandwidth over the previous one.

Nice point, I hadn't thought about it that way.

The memory controller takes care of that. As long as you do bulk reads and writes (which the CPU does for you at cache line granularity) it will be double speed. I don't know the exact details, but my assumption would be that every 16 bytes are split in half - 8 and 8 - and each half goes to a separate channel, because 8-byte transfers are what the memory controller uses to talk to memory.

Are you talking about the "memory controller" in the CPU, the motherboard, or the OS? Are you saying the CPU automatically does this, not the OS?

The memory addresses don't matter.

I don't understand why not. As you said, you must have two separate sticks to use dual channel mode, meaning that if all your allocation is on one stick, the CPU can't access both sticks simultaneously.


Replying to mmozeiko (#30409)

In the old days the memory controller used to be part of the chipset - it was called the "northbridge". Nowadays it is integrated into the CPU.

You cannot "allocate on one stick". Memory controller assigns addresses to sticks automatically. So if you have two sticks it just puts all memory like this: S0 S1 S0 S1 S0 S1 ... where each Sx is 8 bytes. So when you read 32 bytes, it will read [S0 S1 S0 S1] - 4 pieces that are automatically split between memory modules (this is just my assumption how it splits. It may split in larger chunks than 8 bytes, I don't know exactly). And as you know you cannot really read anything smaller than 64 bytes. Even if you read one byte, the CPU transfers whole cache line from memory - all 64 bytes. So as long as this number is greater than 16, you will get double of bandwidth.


Replying to longtran2904 (#30410)

The memory controller assigns addresses to sticks automatically.

Ah, ok. I always thought this was done by the OS. This clears a lot of things up. I guess my question is: these assigned addresses are physical, not virtual, and the OS still needs to assign virtual ones, right?

Super dumb question (in practice you would never do this), but what happens if I turn my computer on with just one stick, then plug in another while it's running? Does the CPU just not recognize that one, remap all the addresses, or start splitting addresses in half now?


Replying to mmozeiko (#30411)

The OS does not see this mapping; it's all transparent to it. Even the CPU does not see this mapping - it's an internal detail of the memory controller. The CPU just says "write addresses N to M with these bytes", and the controller does it.

On consumer motherboards I would assume hotplugging of memory is not supported. The BIOS or OS probably won't recognize the newly plugged memory.

I think some servers may support it, but I don't know more details about them. My assumption would be that you need to plug them in or out in pairs, otherwise dual channel mode would not work.


Replying to longtran2904 (#30412)

The OS does not see this mapping; it's all transparent to it. Even the CPU does not see this mapping - it's an internal detail of the memory controller. The CPU just says "write addresses N to M with these bytes", and the controller does it.

So in reality you have virtual addresses -> physical addresses -> "real" addresses, and the mapping is done by app -> OS -> CPU -> memory controller?


Replying to mmozeiko (#30413)