template <typename T, typename Allocator>
void fsList<T, Allocator>::enumerate_forward(void (*it)(T))
{
    nodeType *p = _head;
    while (p) {
        it(p->data);
        p = p->next;
    }
}
struct fsPoolAllocator {
    void *memory;
    size_t blockSize;
    fsList<void*> freeList;

    void init(size_t blockSize, size_t count) {
        size_t poolSize = blockSize * count;
        memory = malloc(poolSize);
        this->blockSize = blockSize;

        // Carve the pool into fixed-size blocks and thread them onto the free list.
        void *ptr = memory;
        for (size_t block = 0; block < count; ++block) {
            // Note: this alignment check only makes sense for power-of-two block sizes.
            fsAssert(((intptr_t)ptr & (blockSize - 1)) == 0);
            freeList.push_back(ptr);
            ptr = (uint8_t *)ptr + blockSize;
            fsAssert((ptrdiff_t)ptr <= (ptrdiff_t)((uint8_t *)memory + poolSize));
        }
    }

    void deinit() {
        free(memory);
        memory = nullptr;
    }

    void *allocate(size_t size) {
        void *mem = freeList.front();
        freeList.pop_front();
        return mem;
    }

    void deallocate(void *ptr) {
        freeList.push_back(ptr);
    }
};
mmozeiko
What does your benchmark code look like? Maybe there is some difference between the two approaches?
#define ELEMENT_COUNT 100000

fsList<int, fsPoolAllocator> numbers;

size_t nodeSize = sizeof(fsList<int, fsPoolAllocator>::nodeType);
nodeSize = (nodeSize + 31) & ~(size_t)31; // Round the node size up to a multiple of 32 bytes
printf("Node size: %zu\n", nodeSize);
numbers.allocator.init(nodeSize, ELEMENT_COUNT);

for (int i = 0; i < ELEMENT_COUNT; ++i) {
    numbers.push_back(i);
}

uint64_t cyclesStart = __builtin_readcyclecounter();
numbers.enumerate_forward([](int num) {
    num += 10; // touch each element
});
uint64_t cyclesElapsed = __builtin_readcyclecounter() - cyclesStart;
printf("Elapsed: %llu\n", cyclesElapsed);
ratchetfreak
Also, what's the spread on the timings?
The difference is pretty steep, but if it's dominated by a context switch caused by an interrupt or other OS shenanigans, it basically invalidates your benchmark.
btaylor2401
Those timings look pretty much how I would expect them to.
To begin with, let's ask: what makes something cache-efficient? What's the metric? To be cache-efficient, the data I need next should usually be in the cache before I need it.
There are two ways this happens:
1. Data packed tightly in memory, so that loading one element loads part of another on the same cache line.
2. Prefetching due to predictable access patterns.
Number 2 doesn't apply here: by virtue of using a linked list, and therefore pointers, the CPU has no way of predicting your memory access pattern. Even if you, the programmer, know that the pointer must point within some particular range, the CPU does *not*. That pointer could point anywhere -- even to data not resident in physical memory.
EDIT: To clarify, this matters because the CPU doesn't know where your pointer points until *after* it has been loaded, at which point you're not prefetching, you're just fetching.
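To make the contrast concrete, here's a minimal sketch (hypothetical sum_array/sum_list, not code from this thread): the array loop's next address is a simple function of the index, so the prefetcher can run ahead of the loads, while the list loop can't even begin loading node N+1 until the load of node N completes.

#include <stddef.h>

// Streaming through an array: address = data + i, a predictable stride
// the hardware prefetcher can follow.
int sum_array(const int *data, size_t count) {
    int sum = 0;
    for (size_t i = 0; i < count; ++i) {
        sum += data[i];
    }
    return sum;
}

struct Node { int data; Node *next; };

// Chasing pointers: the address of the next node is the *result* of loading
// the current one, so every cache miss stalls the whole chain.
int sum_list(const Node *p) {
    int sum = 0;
    while (p) {
        sum += p->data;
        p = p->next;
    }
    return sum;
}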
btaylor2401
Okay. Now let's look at the benchmark. You allocate all of the nodes in a single pass, and then iterate over the list, timing the iteration.
There are a couple of confounding issues here. First, by allocating all of the nodes in one pass, malloc will effectively act like a pool allocator (within limits -- if your heap is fragmented you won't get contiguous allocations, but fresh on startup you will.) So both lists have similar allocations. However, your pool allocator aligns to 32 bytes, while malloc does not (iirc, malloc does no alignment at all, though I could be mistaken.)
A cache line on most (all?) x64 processors is 64 bytes. Your node structure is 24 bytes in size. By aligning to 32 bytes, you're fitting two nodes on each cache line. However, the malloc path fits 2 and 2/3 of a node on one cache line (64 / 24 ≈ 2.67 versus 64 / 32 = 2). This is your timing difference -- the malloc path fetches more nodes per cache line read, and so it runs faster. (If you look at the timings, they're very close to the 2.67 / 2 ≈ 1.33x speedup this would imply.)
btaylor2401
Another confounding issue, which you may or may not have run into here, is that since your nodes are tightly packed, the *second* iteration through the list will be much faster if the entire thing fits in cache. Since you allocate the list first, and then iterate over it immediately, you've prewarmed the cache for your benchmark and aren't getting an accurate read on its performance. A simple fix might be to interleave the tests (see the sketch after this list):
- Alloc list A
- Alloc list B
- Test list A
- Test list B
Ensure that you're using large enough lists that they don't fit in cache, and this will mean both tests start with a cold cache.
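A minimal, self-contained sketch of that interleaving (build_list and time_walk are hypothetical helpers; to compare the allocators, you'd build one list with the pool and one with plain malloc):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct Node { int data; Node *next; };

static Node *build_list(int count) {
    Node *head = nullptr;
    for (int i = 0; i < count; ++i) {
        Node *n = (Node *)malloc(sizeof(Node)); // leaks; fine for a throwaway benchmark
        n->data = i;
        n->next = head;
        head = n;
    }
    return head;
}

static uint64_t time_walk(const Node *p) {
    // Same clang cycle counter the benchmark above uses.
    uint64_t start = __builtin_readcyclecounter();
    int sum = 0;
    while (p) { sum += p->data; p = p->next; }
    uint64_t elapsed = __builtin_readcyclecounter() - start;
    printf("sum=%d\n", sum); // keep the walk from being optimized away
    return elapsed;
}

int main() {
    const int count = 1000000; // big enough that one list blows out the cache
    Node *a = build_list(count); // alloc list A
    Node *b = build_list(count); // alloc list B -- evicts A's nodes as it grows
    printf("A: %llu\n", (unsigned long long)time_walk(a)); // test list A (cold)
    printf("B: %llu\n", (unsigned long long)time_walk(b)); // test list B (cold)
    return 0;
}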
Finally, you (should) end up with very similar timings between the two after these issues are resolved. You'll need to use proper statistical methods, not just eyeballing the raw values, to compare the two. (Looking at how the distribution as a whole changes, not just individual runs.)
Flyingsand
Right. And in #1, that's basically the CPU checking the cache first for the data, and if it misses it goes to main RAM?
Flyingsand
Yes, this is more or less what I saw when I examined the list in the debugger without the pool allocator. malloc was essentially behaving as a pool allocator as much as it could, and right at the head of the list most nodes were contiguous. However, they were not packed into 24 bytes; they were also 32-byte aligned, as you can see here in the debug window (the bottom three nodes in particular):
[Image: List debug window]
That's why I later aligned the pool allocator to 32 bytes: I wondered if the compiler had done that because aligned reads are faster.
btaylor2401
> Flyingsand: Right. And in #1, that's basically the CPU checking the cache first for the data, and if it misses it goes to main RAM?
Correct. (Well, it doesn't look for the *data*, it looks for the cache line. Reads don't happen on a finer granularity than that.)

> Flyingsand: Yes, this is more or less what I saw when I examined the list in the debugger without the pool allocator. malloc was essentially behaving as a pool allocator as much as it could, and right at the head of the list most nodes were contiguous. However, they were not packed into 24 bytes; they were also 32-byte aligned, as you can see here in the debug window (the bottom three nodes in particular):
> [Image: List debug window]
> That's why I later aligned the pool allocator to 32 bytes: I wondered if the compiler had done that because aligned reads are faster.
Those actually appear to be 16 byte aligned, not 32. (0x...310 is not 32 byte aligned.) This is probably down to the malloc implementation. (Pool allocators for various small size classes, more general allocator for large allocations, makes freeing cheaper and reduces heap fragmentation.)
btaylor2401
Because your struct is 24 bytes, you're effectively getting 32 byte blocks there, but that's not due to the efficiency of reading it.
Alignment matters for efficiency in two places:
- From memory, is the read aligned to a cache line. (If you have a 32 byte block that straddles a cache line, you read 128 bytes -- not a great use of bandwidth unless you need the rest of that data too.)
- From cache, is the read aligned for the register size. This mostly matters for SSE, which until recently would be slow loading data not aligned to 16 bytes. (Which may be why the OSX allocator appears to prefer 16 byte alignment.)
- (I *think* this is largely irrelevant for the general registers (rax/eax/ax and so on) -- I think they give you the masking "for free", but I'm not entirely sure.)
I'm willing to bet you can get a solid win over the malloc version by going back to packed 24 byte blocks. 8 byte alignment is optimal for anything that isn't SSE, so there's no performance reason *not* to (and plenty in favor, since it means you're not wasting 16 bytes out of each cache line.)
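Concretely, that's a one-line change to the benchmark setup above (a sketch; note the pool's alignment assert assumes a power-of-two block size, so it would have to be removed or relaxed for 24-byte blocks):

// Packed: hand the pool the exact 24-byte node size instead of rounding up to 32.
size_t nodeSize = sizeof(fsList<int, fsPoolAllocator>::nodeType);
numbers.allocator.init(nodeSize, ELEMENT_COUNT);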
btaylor2401
Regarding statistics: there's a whole mess of things you have to do to do this correctly, but you can at the very least get a better picture of what's going on by grabbing a lot of samples (a few hundred) and calculating some rough summary statistics (mean, standard deviation, quartiles). Unfortunately WolframAlpha won't help here, because it doesn't let you give it a large data set, but the calculations are straightforward:
mean = sum(x) / count
stddev = sqrt(mean - (sum(x*x)/count))
median = middle item in list (if count is odd) or average of two middle items (if count is even)
Then the first and third quartiles are the median of the sublists to either side of the median.
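For reference, a minimal self-contained sketch of those calculations (this is not the gist's code; it uses the textbook two-pass form of the standard deviation, and assumes at least a handful of samples):

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int compare_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

// Median of an already-sorted range.
static double median_of(const uint64_t *sorted, size_t count) {
    if (count % 2) return (double)sorted[count / 2];
    return ((double)sorted[count / 2 - 1] + (double)sorted[count / 2]) / 2.0;
}

static void report_stats(uint64_t *samples, size_t count) {
    double sum = 0.0;
    for (size_t i = 0; i < count; ++i) sum += (double)samples[i];
    double mean = sum / (double)count;

    double sqDiff = 0.0;
    for (size_t i = 0; i < count; ++i) {
        double d = (double)samples[i] - mean;
        sqDiff += d * d;
    }
    double stddev = sqrt(sqDiff / (double)count);

    qsort(samples, count, sizeof(uint64_t), compare_u64); // sorts in place
    double median = median_of(samples, count);
    // First/third quartiles: medians of the sublists on either side of the median.
    double q1 = median_of(samples, count / 2);
    double q3 = median_of(samples + (count + 1) / 2, count / 2);

    printf("mean=%.1f stddev=%.1f q1=%.1f median=%.1f q3=%.1f\n",
           mean, stddev, q1, median, q3);
}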
If you're going to keep tweaking this (and you should, there's a *whole* lot you can still do to make this more efficient), go ahead and set up the testing and statistics framework because it will make the rest of the job a lot easier.
EDIT: Some example code for the statistics calculation: https://gist.github.com/ACEfanatic02/74c1e28d4d9777f815705842b97ea11e
cubercaleb
> btaylor2401: However, your pool allocator aligns to 32 bytes, while malloc does not (iirc, malloc does no alignment at all, though I could be mistaken.)
malloc does 16 byte alignment.

> btaylor2401: A cache line on most (all?) x64 processors is 64 bytes. Your node structure is 24 bytes in size. By aligning to 32 bytes, you're fitting two nodes on each cache line. However, the malloc path fits 2 and 2/3 of a node on one cache line. This is your timing difference -- the malloc path fetches more nodes per cache line read, and so it runs faster.
As I said earlier, modern versions of malloc align allocations by 16 bytes (this is so that malloc plays nicely with SSE registers). Thus, this isn't entirely true and the performance loss is probably coming from something else.

> btaylor2401: Finally, you (should) end up with very similar timings between the two after these issues are resolved. You'll need to use proper statistical methods, not just eyeballing the raw values, to compare the two.
If there is one useful thing I learned in my high school stat class it has to be this. Understanding basic statistics really helps when doing benchmarks and can help you get more accurate results.
Flyingsand
I'm a bit stumped now as to what I can do to make the cache-friendly list even friendlier! Making it a singly-linked list does help a little since the node size is reduced by 8 bytes, but it's not significant enough to really matter. Overall the performance of enumerating the list seems very good to me, and with very low overhead now that the free list in the pool allocator is intrusive. I will continue doing some research to see if there are tricks I haven't thought of, because I do want to keep digging to find out more about how to effectively utilize the cache when it matters.
struct ListNode {
    u16 next;
    u16 prev;
    u32 data;
};

struct List {
    ListNode * pool;
    u16 pool_count;
    u16 freelist_head;
};
struct List {
    u16 * next;
    u16 * prev;
    u32 * data;
    u16 count;
    u16 freelist_head;
};
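One thing the index-based layouts above buy: with u16 links into a pool instead of 64-bit pointers, a doubly-linked node shrinks from 24 bytes to 8, so a 64-byte cache line holds 8 nodes instead of 2. A minimal traversal sketch over the first layout (the u16/u32 typedefs and the choice of 0xFFFF as the null index are assumptions, not from the thread):

#include <stdint.h>
typedef uint16_t u16;
typedef uint32_t u32;

struct ListNode { u16 next; u16 prev; u32 data; }; // 8 bytes, as above

struct List {
    ListNode * pool;
    u16 pool_count;
    u16 freelist_head;
};

// Hypothetical traversal, reserving 0xFFFF as the "null" index.
u32 sum_list(const List &list, u16 head) {
    u32 sum = 0;
    for (u16 i = head; i != 0xFFFF; i = list.pool[i].next) {
        sum += list.pool[i].data;
    }
    return sum;
}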
cubercaleb
> btaylor2401:
> mean = sum(x) / count
> stddev = sqrt(mean - (sum(x*x)/count))
> median = middle item in list (if count is odd) or average of two middle items (if count is even)
> Then the first and third quartiles are the median of the sublists to either side of the median.
Your formula for standard deviation is not correct, the proper formula is:
stddev = sqrt(sum((xi - u)^2) / count)
where xi is each of the sample values and u is the mean.
If you really want to get all statistical about it you could use a two sample t-test for difference of means.
btaylor2401
> cubercaleb: Those two formulas are very different and produce different results as far as I can tell, so no they are not the same.
The two formulas for standard deviation are equivalent. The advantage of this formulation is that it can be computed in a single pass, and also allows for incremental update, by storing sum and sum squared.

var = sum((x - u)^2) / n
    = sum((x*x - 2*x*u + u*u)) / n
    = (sum(x*x) - sum(2*x*u) + sum(u*u)) / n
    = (sum(x*x) - 2*u*sum(x) + sum(u*u)) / n
    = (sum(x*x) - 2*u*sum(x) + n*u*u) / n
    = (sum(x*x) - 2*(sum(x)^2)/n + n*(sum(x)/n)^2) / n
    = ((sum(x*x) - 2*(sum(x)^2)/n + sum(x)^2/n)) / n
    = (sum(x*x) - sum(x)^2/n) / n
    = sum(x*x)/n - (sum(x)/n)^2
    = sum(x*x)/n - u*u
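For a concrete check of the equivalence, take the sample x = {1, 2, 3}, so n = 3 and u = 2:

sum((x - u)^2) / n = (1 + 0 + 1) / 3 = 2/3
sum(x*x)/n - (sum(x)/n)^2 = 14/3 - (6/3)^2 = 14/3 - 4 = 2/3

Both forms give the same variance.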
cubercaleb
> btaylor2401:
>     = sum((x*x - 2*x*u + u*u)) / n
>     = (sum(x*x) - sum(2*x*u) + sum(u*u)) / n
Since subtraction is anticommutative I don't think that you can make that jump. But I am not entirely sure.