Memory alignment for atomic operations

MSDN states that the parameters for the InterlockedCompareExchange and InterlockedIncrement functions must be aligned on a 32-bit boundary:

The parameters for this function must be aligned on a 32-bit boundary; otherwise, the function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems. See _aligned_malloc.

I see that currently all of the volatile variables are allocated on the stack. Does the compiler take care of the alignment? Does it treat volatile variables differently? Would one need to take special care if those variables were allocated on the heap?
Yes, for variables allocated on the stack the compiler aligns them automatically.

It gets tricky with stuff on the heap. VirtualAlloc will return memory that begins on a page boundary (typically 4KB). malloc will return memory that is typically aligned to 8 bytes (because that is the size of the largest basic C/C++ type, double). C++ new probably has the same alignment rules as malloc.

So if you cast the pointer returned from VirtualAlloc or malloc to one monolithic structure type (or an array of one type), all the members of that structure will also be properly aligned. But if you use the result as a pointer to bytes and partition the memory based on the sizeof of each individual type, then you need to take care of alignment manually.
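For example, something like this (AlignPointer is just a made-up helper here, not a Windows API function) if you want to carve an aligned 32-bit counter out of a raw byte block:

#include <stdint.h>
#include <stdlib.h>

// round a pointer up to the next multiple of alignment (alignment must be a power of two)
static void *AlignPointer(void *pointer, size_t alignment)
{
  uintptr_t value = (uintptr_t)pointer;
  value = (value + (alignment - 1)) & ~(uintptr_t)(alignment - 1);
  return (void *)value;
}

int main()
{
  // malloc's result itself is aligned, but an arbitrary byte offset into it may not be
  char *memory = (char *)malloc(64);
  if (!memory) return 1;

  volatile long *counter = (volatile long *)AlignPointer(memory + 3, 4);
  *counter = 0; // this pointer is now 4-byte aligned, safe for InterlockedIncrement & co.

  free(memory);
  return 0;
}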

Then there is _aligned_malloc, which will take care of alignment to arbitrary boundaries for you:

https://msdn.microsoft.com/en-us/library/8z34s9c6.aspx

That is, it will make sure the returned pointer points to properly aligned memory. You still need to manually align any data you subsequently place at later addresses within the allocated memory to whatever boundary it requires.
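For example (if I remember the CRT API right):

#include <malloc.h>  // _aligned_malloc / _aligned_free live here in the MSVC CRT
#include <stdio.h>

int main()
{
  // ask for 64 bytes aligned to a 16-byte boundary
  void *block = _aligned_malloc(64, 16);
  if (!block) return 1;

  printf("%p\n", block); // always a multiple of 16

  // memory from _aligned_malloc must be released with _aligned_free
  _aligned_free(block);
  return 0;
}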

There's really no such thing as a non-32-bit-aligned 32-bit value on the stack unless you really go out of your way to make one :) On the stack, things will always be aligned to their size. So a 32-bit value will always be 32-bit aligned unless you did something special to break that.

I don't think I've ever seen a case personally where I got a non-32-bit aligned 32-bit value _by accident_.
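Just for illustration, about the only way I know of to get one is to do it deliberately, e.g. by forming a pointer at an odd offset into a byte buffer (hypothetical snippet - don't actually dereference through a pointer like this):

#include <stdint.h>
#include <stdio.h>

int main()
{
  // a raw byte buffer on the stack; a 32-bit pointer formed at an odd offset
  // into it is deliberately misaligned
  char buffer[16] = {0};
  uint32_t *misaligned = (uint32_t *)(buffer + 1);

  printf("address %p, 4-byte aligned: %d\n",
         (void *)misaligned, ((uintptr_t)misaligned % 4) == 0);
  return 0;
}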

- Casey
mmozeiko
VirtualAlloc will return memory that begins on page boundary (typically 4KB).

I don't have a Windows machine handy, but last time I worked in this area, the page size was 64kB on Windows. Can someone call GetSystemInfo() and check please?

Note that the operating system page size is not the same as the hardware page size. The OS page size is the granularity at which the OS manages memory. This may be a multiple of the CPU page size.

Assuming that my hazy memory is correct, someone might want to ask Raymond Chen why NTOS uses 64k pages. My guess is that either it's more convenient for DOS emulation, or because it was the least common multiple of sensible page sizes on all the interesting architectures at the time. (ARM supports 64k pages.)
My suspicion is that the choice of 4k/64k/2MB (Windows also allows 2MB pages nowadays) as a page size has more to do with keeping the page table small than it does with any architecture-specific concern? But honestly, I have no idea.

- Casey
Pseudonym73
Can someone call GetSystemInfo() and check please?


4096 on Windows 8.1 64-bit (16GB total memory, if that changes anything).
And also 4096 on Windows XP 32-bit (under VirtualBox).

mmozeiko
4096 on Windows 8.1 64-bit (16GB total memory, if that changes anything).
And also 4096 on Windows XP 32-bit (under VirtualBox).

Got it. As noted, it's a vague memory from about 2003, NT 4 or Win 2k. It's also possible that once upon a time, NT shipped with different OS page sizes on workstation and server install, or it might have been custom settings.
Pseudonym, I do not think you are hallucinating here - I think it's just that GetSystemInfo() returns the page granularity, not the allocation granularity. While VirtualProtect() et al all operate on the page granularity, I think VirtualAlloc may well (at least sometimes) operate on its own granularity which is some multiple of the page granularity. Does that make sense?

So I think to actually figure this out you'd have to call VirtualAlloc() a bunch of times and see if it was aligning to 4k or 64k.
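Or, off the top of my head (assuming the SYSTEM_INFO fields are what I think they are), something like this should print both numbers side by side:

#include <windows.h>
#include <stdio.h>

int main()
{
  SYSTEM_INFO info;
  GetSystemInfo(&info);

  // dwPageSize is the page size; dwAllocationGranularity is what
  // VirtualAlloc rounds reservation base addresses to
  printf("page size:              %lu\n", info.dwPageSize);
  printf("allocation granularity: %lu\n", info.dwAllocationGranularity);
  return 0;
}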

- Casey
You are right:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <windows.h>

int main()
{
  for (int i=0; i<100; i++)
  {
    // reserve+commit a random-sized block and check its base address against 64K
    uint64_t addr = (uint64_t)VirtualAlloc(NULL, 123123*(rand()%100+1), MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    printf("0x%016llx %% 64k = %d\n", addr, (int)(addr % (64*1024)));
  }
}


For me that always prints out addresses on a 64K boundary:

0x00000000025e0000 % 64k = 0
0x0000000005740000 % 64k = 0
0x000000000a720000 % 64k = 0
0x00000000020f0000 % 64k = 0
0x000000000d040000 % 64k = 0
0x0000000012280000 % 64k = 0
0x0000000013fe0000 % 64k = 0
0x0000000019cb0000 % 64k = 0
0x000000001e200000 % 64k = 0
0x0000000022c00000 % 64k = 0
0x0000000027860000 % 64k = 0
0x0000000027f70000 % 64k = 0
0x000000002b580000 % 64k = 0
0x00000000315d0000 % 64k = 0
0x00000000336c0000 % 64k = 0
0x0000000037f90000 % 64k = 0

...


Even if I change it to allocate only a very small number of bytes (1, 2, 3, ...).

Yeah. That was my recollection. I'm not sure I know why that happens - it might have something to do with the way the page table works, in terms of look-ups or something. I suspect Mark Russinovich probably has the answer in one of his lectures somewhere... I've probably even heard it, but then forgot it :/

- Casey
That's interesting, maybe even more so the way you investigate things, thanks. :)
Raymond Chen has an answer. Pretty deep legacy stuff...
Yikes...

- Casey
OK, so that's where the vague memory comes from.

The thing I was working on, as I said, was a database server, and so we used page-structured files (e.g. B-trees) extensively. 64k is also the granularity of memory-mapped files on Windows, so that is the optimum page size.
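For example (hypothetical file name, and it assumes the file is at least a few hundred KB so the offset is valid), the offset you hand to MapViewOfFile has to be a multiple of the allocation granularity:

#include <windows.h>
#include <stdio.h>

int main()
{
  SYSTEM_INFO info;
  GetSystemInfo(&info);
  DWORD granularity = info.dwAllocationGranularity; // 64K in practice

  HANDLE file = CreateFileA("pages.db", GENERIC_READ, FILE_SHARE_READ,
                            NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
  if (file == INVALID_HANDLE_VALUE) return 1;

  HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
  if (!mapping) { CloseHandle(file); return 1; }

  // file offsets for MapViewOfFile must be multiples of the allocation
  // granularity, so page offsets in the file get rounded down to a 64K boundary
  DWORD offset = 3 * granularity;
  void *view = MapViewOfFile(mapping, FILE_MAP_READ, 0, offset, 0);
  printf("mapped at %p\n", view);

  if (view) UnmapViewOfFile(view);
  CloseHandle(mapping);
  CloseHandle(file);
  return 0;
}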

Raymond Chen
You don't want to introduce gratuitous differences between architectures because it makes porting code between them harder.

From the point of view of someone trying to write a system which runs on Windows, Linux, and Solaris, that counts as a "gratuitous difference".