I'm experimenting with making cache-friendly linked lists, drawing on what I remember reading about EASTL (Electronic Arts' STL replacement), which provides a "fixed" list that carves its nodes out of a single fixed block of memory. This supposedly makes the list more cache-friendly and makes enumerating it faster. So I went with a memory pool managed by a free list, but ended up with some curious results.
These are the cycle counts I measured for enumerating a list of 100,000 elements:
844,012 cycle average with no custom allocator
1,385,378 cycle average with the custom memory pool allocator
(Tested on a late 2013 iMac, quad-core i7 at 3.5 GHz, using clang/LLVM 7.1 with -Os.)
And just to be clear, this is not about the cost of the allocation itself; I'm measuring nothing but enumerating the list from front to back.
Here is the list enumerator:
```cpp
template <typename T, typename Allocator>
void fsList<T, Allocator>::enumerate_forward(void (*it)(T))
{
    nodeType *p = _head;
    while (p) {
        it(p->data);
        p = p->next;
    }
}
```
(Yeah.. templates. :) Still struggling to find a better solution for metaprogramming).
This is the simple pool allocator:
```cpp
struct fsPoolAllocator
{
    void *memory;
    size_t blockSize;
    fsList<void*> freeList;

    void init(size_t blockSize, size_t count)
    {
        size_t poolSize = blockSize * count;
        memory = malloc(poolSize);
        this->blockSize = blockSize;

        void *ptr = memory;
        for (size_t block = 0; block < count; ++block) {
            fsAssert(((intptr_t)ptr & (blockSize - 1)) == 0);
            freeList.push_back(ptr);
            ptr = (uint8_t *)ptr + blockSize;
            fsAssert((ptrdiff_t)ptr <= (ptrdiff_t)((uint8_t *)memory + poolSize));
        }
    }

    void deinit()
    {
        free(memory);
        memory = nullptr;
    }

    void* allocate(size_t size)
    {
        void *mem = freeList.front();
        freeList.pop_front();
        return mem;
    }

    void deallocate(void *ptr)
    {
        freeList.push_back(ptr);
    }
};
```
For each new element in the list, the "allocate" method is called, which just returns the element at the front of the free list (the next free block of memory). For testing, I'm using integers as the data, so each list node is 24 bytes in size. I've also tried rounding that up to 32 bytes so that each block allocated from the memory pool is aligned on a 32-byte boundary, but this had no effect on the speed of the list enumeration.
Furthermore, after pushing all 100,000 elements to the list, no other operation is performed before the enumeration, so there are no gaps in the list due to adding/removing nodes in between. In the debugger, I can see that each node is 32 bytes apart.
Am I approaching this in the wrong way? I'm puzzled as to why there is a measurable and consistent difference between enumerating the list normally and with the pool allocator, which has all of its nodes laid out contiguously in memory.