Cache-friendly lists

Greetings!
I'm experimenting with making cache-friendly linked lists, drawing on what I remember reading about EASTL (Electronic Arts' STL replacement), which has a "fixed" list that allocates its nodes out of a fixed block of memory. This supposedly makes the list more cache-friendly and enumeration faster. So I went with a memory pool managed by a free list, but ended up with some curious results.

These are the cycle counts I measured for enumerating a list of 100,000 elements:
844,012 cycle average with no custom allocator
1,385,378 cycle average with the custom memory pool allocator
(Tested on a late 2013 iMac i7 quad core 3.5 GHz machine, using clang/llvm 7.1 with -Os).

And just to be clear, this is not about the cost of the actual allocation; I'm just timing enumeration of the list from front to back.

Here is the list enumerator:
template <typename T, typename Allocator>
void fsList<T, Allocator>::enumerate_forward(void(*it)(T))
{
    nodeType *p = _head;
    while (p) {
        it(p->data);
        p = p->next;
    }
}


(Yeah.. templates. :) Still struggling to find a better solution for metaprogramming).

This is the simple pool allocator:
struct fsPoolAllocator {
    void *memory;
    size_t blockSize;
    fsList<void*> freeList;
    
    void init(size_t blockSize, size_t count) {
        size_t poolSize = blockSize * count;
        memory = malloc(poolSize);
        this->blockSize = blockSize;
        // Carve the pool into fixed-size blocks and push each one onto the free list.
        void *ptr = memory;
        for (size_t block = 0; block < count; ++block) {
            fsAssert(((intptr_t)ptr & (blockSize - 1)) == 0); // alignment check; assumes blockSize is a power of two
            freeList.push_back(ptr);
            ptr = (uint8_t *)ptr + blockSize;
            fsAssert((ptrdiff_t)ptr <= (ptrdiff_t)((uint8_t *)memory + poolSize)); // never walk past the end of the pool
        }
    }
    
    void deinit() {
        free(memory);
        memory = nullptr;
    }
    
    void* allocate(size_t size) {
        // size is ignored: every block in the pool is blockSize bytes,
        // and the caller is expected not to outgrow the pool.
        void *mem = freeList.front();
        freeList.pop_front();
        return mem;
    }
    
    void deallocate(void *ptr) {
        freeList.push_back(ptr);
    }
};


For each new element in the list, the "allocate" method is called, which just returns the block at the front of the free list (the next free block of memory). For testing, I'm just using integers as the data, so each list node is 24 bytes in size. I've also tried rounding that up to 32 so each block allocated in the memory pool is aligned on a 32-byte boundary, but this didn't have any effect on the speed of the list enumeration.

Furthermore, after pushing all 100,000 elements to the list, no other operation is performed before the enumeration, so there are no gaps in the list due to adding/removing nodes in between. In the debugger, I can see that each node is 32 bytes apart.

Am I approaching this in the wrong way? I'm puzzled as to why there is a measurable and consistent difference between enumerating the list normally and with the pool allocator, which has all its nodes laid out contiguously in memory.

What does your benchmark code look like? Maybe there is some difference between the two approaches?

Also, what's the spread on the timings?

The difference is pretty steep, but if it's dominated by a context switch from an interrupt or other OS shenanigans, that basically invalidates your benchmark.
mmozeiko
What does your benchmark code look like? Maybe there is some difference between the two approaches?



Here is the benchmark code:
#define ELEMENT_COUNT 100000
    fsList<int, fsPoolAllocator> numbers;
    size_t nodeSize = sizeof(fsList<int, fsPoolAllocator>::nodeType);
    nodeSize = (nodeSize + 31) & ~(size_t)31; // Round node size up to a multiple of 32 bytes
    printf("Node size: %zu\n", nodeSize);
    numbers.allocator.init(nodeSize, ELEMENT_COUNT);
    
    for (int i = 0; i < ELEMENT_COUNT; ++i) {
        numbers.push_back(i);
    }
    
    uint64_t cyclesStart = __builtin_readcyclecounter();
    numbers.enumerate_forward([](int num) {
        num += 10;
    });
    uint64_t cyclesElapsed = __builtin_readcyclecounter() - cyclesStart;
    printf("Elapsed: %llu\n", cyclesElapsed);


For benchmarking with the default allocator, the list is just declared as fsList<int> and I comment out the allocator's init call.

ratchetfreak
Also, what's the spread on the timings?

The difference is pretty steep, but if it's dominated by a context switch from an interrupt or other OS shenanigans, that basically invalidates your benchmark.


Here is a sample of six timings that I just ran:

Custom pool allocator --
1,706,203
1,620,620
1,215,040
1,766,276
1,268,936
1,147,148

Normal allocator --
850,960
1,121,608
1,184,688
862,780
974,948
842,036

In only one case was the pool allocator slightly faster. But most of the time it's roughly 1.5 - 2 times slower!
Those timings look pretty much how I would expect them to.

To begin with, let's ask: what makes something cache-efficient? What's the metric? To be cache-efficient, the data I need next should usually be in the cache before I need it.

There are two ways this happens:
1. Data packed tightly in memory, so that loading one element loads part of another on the same cache line.
2. Prefetching due to predictable access patterns.

Number 2 doesn't apply here: by virtue of using a linked list, and therefore pointers, the CPU has no way of predicting your memory access pattern. Even if you, the programmer, know that the pointer must point within some particular range, the CPU does *not*. That pointer could point anywhere -- even to data not resident in physical memory.
EDIT: To clarify, this matters because the CPU doesn't know where your pointer points until *after* it has been loaded, at which point you're not prefetching, you're just fetching.

Okay. Now let's look at the benchmark. You allocate all of the nodes in a single pass, and then iterate over the list, timing the iteration.

There are a couple of confounding issues here. First, by allocating all of the nodes in one pass, malloc will effectively act like a pool allocator (within limits -- if your heap is fragmented you won't get contiguous allocations, but fresh on startup you will.) So both lists have similar allocations. However, your pool allocator aligns to 32 bytes, while malloc does not (iirc, malloc does no alignment at all, though I could be mistaken.)

A cache line on most (all?) x64 processors is 64 bytes. Your node structure is 24 bytes in size. By aligning to 32 bytes, you're fitting two nodes on each cache line. However, the malloc path fits 2 and 2/3 of a node on one cache line. This is your timing difference -- the malloc path fetches more nodes per cache line read, and so it runs faster. (If you look at the timings, they're very close to a 1.3x speedup, as this would imply.)
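
To make that arithmetic concrete, here's a quick throwaway check (the 64 / 24 / 32 numbers are the ones assumed above):

#include <cstdio>

int main()
{
    // Assumed numbers from the discussion: 64-byte cache lines, 24-byte nodes
    // from malloc vs. 32-byte blocks from the pool allocator.
    const double cacheLine = 64.0;
    double nodesPerLineMalloc = cacheLine / 24.0; // ~2.67 nodes per line
    double nodesPerLinePool   = cacheLine / 32.0; // 2.0 nodes per line
    printf("malloc: %.2f nodes/line, pool: %.2f nodes/line, implied ratio: %.2fx\n",
           nodesPerLineMalloc, nodesPerLinePool,
           nodesPerLineMalloc / nodesPerLinePool); // prints ~1.33x
    return 0;
}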

Another confounding issue, which you may or may not have run into here, is that since your nodes are tightly packed, the *second* iteration through the list will be much faster if the entire thing fits in cache. Since you allocate the list first, and then iterate over it immediately, you've prewarmed the cache for your benchmark and aren't getting an accurate read on its performance. A simple fix might be to interleave the tests:
- Alloc list A
- Alloc list B
- Test list A
- Test list B

Ensure that you're using large enough lists that they don't fit in cache, and this will mean both tests start with a cold cache.
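
A rough sketch of that interleaving, assuming the fsList / fsPoolAllocator interfaces shown earlier in the thread (timeEnumeration and runInterleavedTest are just names for this sketch):

#include <cstdio>
#include <cstdint>
#include <cstddef>

template <typename List>
uint64_t timeEnumeration(List &list)
{
    uint64_t start = __builtin_readcyclecounter();
    list.enumerate_forward([](int num) { num += 10; });
    return __builtin_readcyclecounter() - start;
}

void runInterleavedTest(size_t count)
{
    fsList<int, fsPoolAllocator> pooled;
    pooled.allocator.init(sizeof(fsList<int, fsPoolAllocator>::nodeType), count);
    fsList<int> plain;

    for (size_t i = 0; i < count; ++i) pooled.push_back((int)i); // alloc list A
    for (size_t i = 0; i < count; ++i) plain.push_back((int)i);  // alloc list B

    // With both lists allocated before either is timed, neither starts its
    // timed pass already warm in cache (provided count is large enough that
    // one list doesn't fit in cache).
    printf("pool:   %llu cycles\n", (unsigned long long)timeEnumeration(pooled)); // test A
    printf("malloc: %llu cycles\n", (unsigned long long)timeEnumeration(plain));  // test B
}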

Finally, you (should) end up with very similar timings between the two after these issues are resolved. You'll need to use proper statistical methods, not just eyeballing the raw values, to compare the two. (Looking at how the distribution as a whole changes, not just individual runs.)

btaylor2401
Those timings look pretty much how I would expect them to.

To begin with, let's ask: what makes something cache-efficient? What's the metric? To be cache-efficient, the data I need next should usually be in the cache before I need it.

There are two ways this happens:
1. Data packed tightly in memory, so that loading one element loads part of another on the same cache line.
2. Prefetching due to predictable access patterns.

Number 2 doesn't apply here: by virtue of using a linked list, and therefore pointers, the CPU has no way of predicting your memory access pattern. Even if you, the programmer, know that the pointer must point within some particular range, the CPU does *not*. That pointer could point anywhere -- even to data not resident in physical memory.
EDIT: To clarify, this matters because the CPU doesn't know where your pointer points until *after* it has been loaded, at which point you're not prefetching, you're just fetching.


Right. And in #1, that's basically the CPU checking the cache first for the data, and if it misses it goes to main RAM?

btaylor2401

Okay. Now let's look at the benchmark. You allocate all of the nodes in a single pass, and then iterate over the list, timing the iteration.

There are a couple of confounding issues here. First, by allocating all of the nodes in one pass, malloc will effectively act like a pool allocator (within limits -- if your heap is fragmented you won't get contiguous allocations, but fresh on startup you will.) So both lists have similar allocations. However, your pool allocator aligns to 32 bytes, while malloc does not (iirc, malloc does no alignment at all, though I could be mistaken.)

A cache line on most (all?) x64 processors is 64 bytes. Your node structure is 24 bytes in size. By aligning to 32 bytes, you're fitting two nodes on each cache line. However, the malloc path fits 2 and 2/3 of a node on one cache line. This is your timing difference -- the malloc path fetches more nodes per cache line read, and so it runs faster. (If you look at the timings, they're very close to a 1.3x speedup, as this would imply.)


Yes, this is more or less what I saw when I examined the list in the debugger without the pool allocator. malloc was essentially behaving as a pool allocator as much as it could, and right at the head of the list most nodes were contiguous. However, they were not packed into 24 bytes; they were also 32-byte aligned, as you can see here in the debug window (the bottom three nodes in particular):
List debug window
That's why I later aligned the pool allocator to 32 bytes: I wondered if the compiler had done that because aligned reads are faster.

btaylor2401

Another confounding issue, which you may or may not have run into here, is that since your nodes are tightly packed, the *second* iteration through the list will be much faster if the entire thing fits in cache. Since you allocate the list first, and then iterate over it immediately, you've prewarmed the cache for your benchmark and aren't getting an accurate read on its performance. A simple fix might be to interleave the tests:
- Alloc list A
- Alloc list B
- Test list A
- Test list B

Ensure that you're using large enough lists that they don't fit in cache, and this will mean both tests start with a cold cache.

Finally, you (should) end up with very similar timings between the two after these issues are resolved. You'll need to use proper statistical methods, not just eyeballing the raw values, to compare the two. (Looking at how the distribution as a whole changes, not just individual runs.)


Good point. I ran a few other tests with this fix and now the results between the two allocators are getting closer together. As you said, this warrants a better statistical method, but these are just some raw values for now:

Pool allocator --
Elapsed: 1015524 (list 1)
Elapsed: 924764 (list 2)
--
Elapsed: 1241191 (list 1)
Elapsed: 958227 (list 2)
--
Elapsed: 1837579 (list 1)
Elapsed: 1073383 (list 2)

Default allocator (malloc) --
Elapsed: 1298696 (list 1)
Elapsed: 1144624 (list 2)
--
Elapsed: 1553068 (list 1)
Elapsed: 1386119 (list 2)
--
Elapsed: 1396677 (list 1)
Elapsed: 1132846 (list 2)

In fact, the pool allocator seems to be just slightly outdoing malloc now with these preliminary results.
Thanks for the feedback/info. Lots of good things for me to think about! :)

Flyingsand

Right. And in #1, that's basically the CPU checking the cache first for the data, and if it misses it goes to main RAM?

Correct. (Well, it doesn't look for the *data*, it looks for the cache line. Reads don't happen on a finer granularity than that.)

Flyingsand

Yes, this is more or less what I saw when I examined the list in the debugger without the pool allocator. malloc was essentially behaving as a pool allocator as much as it could, and right at the head of the list most nodes were contiguous. However, they were not packed into 24 bytes; they were also 32-byte aligned, as you can see here in the debug window (the bottom three nodes in particular):
List debug window
That's why I later aligned the pool allocator to 32 bytes: I wondered if the compiler had done that because aligned reads are faster.

Those actually appear to be 16-byte aligned, not 32. (0x...310 is not 32-byte aligned.) This is probably down to the malloc implementation. (Pool allocators for various small size classes, a more general allocator for large allocations; this makes freeing cheaper and reduces heap fragmentation.)

Because your struct is 24 bytes, you're effectively getting 32 byte blocks there, but that's not due to the efficiency of reading it.

Alignment matters for efficiency in two places:
- From memory, is the read aligned to a cache line. (If you have a 32 byte block that straddles a cache line, you read 128 bytes -- not a great use of bandwidth unless you need the rest of that data too.)
- From cache, is the read aligned for the register size. This mostly matters for SSE, which until recently would be slow loading data not aligned to 16 bytes. (Which may be why the OSX allocator appears to prefer 16 byte alignment.)
- (I *think* this is largely irrelevant for the general registers (rax/eax/ax and so on) -- I think they give you the masking "for free", but I'm not entirely sure.)

I'm willing to bet you can get a solid win over the malloc version by going back to packed 24 byte blocks. 8 byte alignment is optimal for anything that isn't SSE, so there's no performance reason *not* to (and plenty in favor, since it means you're not wasting 16 bytes out of each cache line.)


Regarding statistics: there's a whole mess of things you have to do to do this correctly, but you can at the very least get a better picture of what's going on by grabbing a lot of samples (a few hundred) and calculating the rough order statistics. (Mean, standard deviation, quartiles.) Unfortunately WolframAlpha won't help here, because it doesn't let you give it a large data set, but the calculations are straightforward:

mean = sum(x) / count
stddev = sqrt(mean - (sum(x*x)/count))
median = middle item in list (if count is odd) or average of two middle items (if count is even)
Then the first and third quartiles are the median of the sublists to either side of the median.

If you're going to keep tweaking this (and you should, there's a *whole* lot you can still do to make this more efficient), go ahead and set up the testing and statistics framework because it will make the rest of the job a lot easier.

EDIT: Some example code for the statistics calculation: https://gist.github.com/ACEfanatic02/74c1e28d4d9777f815705842b97ea11e

btaylor2401
Flyingsand

Right. And in #1, that's basically the CPU checking the cache first for the data, and if it misses it goes to main RAM?

Correct. (Well, it doesn't look for the *data*, it looks for the cache line. Reads don't happen on a finer granularity than that.)

Flyingsand

Yes, this is more or less what I saw when I examined the list in the debugger without the pool allocator. malloc was essentially behaving as a pool allocator as much as it could, and right at the head of the list most nodes were contiguous. However, they were not packed into 24 bytes; they were also 32-byte aligned, as you can see here in the debug window (the bottom three nodes in particular):
List debug window
That's why I later aligned the pool allocator to 32 bytes: I wondered if the compiler had done that because aligned reads are faster.

Those actually appear to be 16-byte aligned, not 32. (0x...310 is not 32-byte aligned.) This is probably down to the malloc implementation. (Pool allocators for various small size classes, a more general allocator for large allocations; this makes freeing cheaper and reduces heap fragmentation.)


Oops, bad wording on my part. Yes, 16-byte aligned, but in 32-byte blocks.

btaylor2401

Because your struct is 24 bytes, you're effectively getting 32 byte blocks there, but that's not due to the efficiency of reading it.

Alignment matters for efficiency in two places:
- From memory, is the read aligned to a cache line. (If you have a 32 byte block that straddles a cache line, you read 128 bytes -- not a great use of bandwidth unless you need the rest of that data too.)
- From cache, is the read aligned for the register size. This mostly matters for SSE, which until recently would be slow loading data not aligned to 16 bytes. (Which may be why the OSX allocator appears to prefer 16 byte alignment.)
- (I *think* this is largely irrelevant for the general registers (rax/eax/ax and so on) -- I think they give you the masking "for free", but I'm not entirely sure.)

I'm willing to bet you can get a solid win over the malloc version by going back to packed 24 byte blocks. 8 byte alignment is optimal for anything that isn't SSE, so there's no performance reason *not* to (and plenty in favor, since it means you're not wasting 16 bytes out of each cache line.)


I went back to 24-byte packed blocks, and that resulted in another slight speed win for the pool allocator! And yes, I believe OS X allocates on 16-byte boundaries by default. I ran into this issue several years ago when writing SSE to optimize some DSP code, and when I ported it to Windows it promptly crashed..

btaylor2401

Regarding statistics: there's a whole mess of things you have to do to do this correctly, but you can at the very least get a better picture of what's going on by grabbing a lot of samples (a few hundred) and calculating the rough order statistics. (Mean, standard deviation, quartiles.) Unfortunately WolframAlpha won't help here, because it doesn't let you give it a large data set, but the calculations are straightforward:

mean = sum(x) / count
stddev = sqrt(mean - (sum(x*x)/count))
median = middle item in list (if count is odd) or average of two middle items (if count is even)
Then the first and third quartiles are the median of the sublists to either side of the median.

If you're going to keep tweaking this (and you should, there's a *whole* lot you can still do to make this more efficient), go ahead and set up the testing and statistics framework because it will make the rest of the job a lot easier.

EDIT: Some example code for the statistics calculation: https://gist.github.com/ACEfanatic02/74c1e28d4d9777f815705842b97ea11e


I'm definitely going to keep tweaking this. I'm very interested to know what I can do to make this more efficient, and *how* much more efficient I can make it. One immediate thing that comes to mind is that when enumerating the list in forward or reverse, one of the links is never used (either next or prev, depending on the direction), so 8 bytes per node are always wasted. If I can get rid of that for enumeration, four 16-byte nodes can fit in a cache line.
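
Something like this is what I have in mind for the forward-only node (just a guess at the layout, assuming a typical 64-bit target):

#include <cstdint>

// An 8-byte next pointer plus a 4-byte payload pads out to 16 bytes,
// so four nodes fit on one 64-byte cache line.
struct fsForwardNode {
    fsForwardNode *next; // 8 bytes
    int32_t        data; // 4 bytes (+ 4 bytes padding)
};
static_assert(sizeof(fsForwardNode) == 16, "expected 16-byte nodes on a 64-bit target");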

And I think a statistics framework has all sorts of applications, so that will be a good thing to have. Thanks!

So here's a little update on my progress so far.

The main bit of optimization I've done is to make the free list for the pool allocator intrusive so that it uses the pool memory to store the links instead of using an external list to manage the free blocks. That speeds up allocating new nodes in the cache-friendly list by quite a bit as well as naturally reducing memory usage.
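
Roughly, the intrusive free list looks like this (a minimal sketch of the idea, not my actual fsPoolAllocator code; names are made up):

#include <cstdint>
#include <cstdlib>

// Each free block stores the pointer to the next free block in its own first
// bytes, so no external list is needed to track the free blocks.
struct fsIntrusivePool {
    void  *memory    = nullptr;
    void  *freeHead  = nullptr; // first free block; links are threaded through the blocks
    size_t blockSize = 0;

    void init(size_t blockSize_, size_t count) {
        blockSize = blockSize_ < sizeof(void *) ? sizeof(void *) : blockSize_;
        memory = malloc(blockSize * count);
        uint8_t *p = (uint8_t *)memory;
        freeHead = p;
        for (size_t i = 0; i + 1 < count; ++i) {
            *(void **)(p + i * blockSize) = p + (i + 1) * blockSize; // link block i to block i+1
        }
        *(void **)(p + (count - 1) * blockSize) = nullptr;           // last block ends the list
    }

    void *allocate() {
        void *block = freeHead;
        if (block) freeHead = *(void **)block; // pop the head of the free list
        return block;
    }

    void deallocate(void *block) {
        *(void **)block = freeHead;            // push the block back onto the free list
        freeHead = block;
    }

    void deinit() { free(memory); memory = freeHead = nullptr; }
};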

Then I set up my statistics framework and did some measurements comparing the list with the default allocator (malloc/free), with my custom pool allocator, and std::list. I also modified the test code to allocate some unrelated memory periodically while building the lists, to better represent the more scattered allocation pattern you would expect during normal execution. i.e. For a 10,000-node list, an unrelated 128-byte malloc runs every 1000 nodes. That way, as was pointed out earlier, malloc doesn't behave like a pool allocator, since the allocations aren't all up front.
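
The node-pushing loop with the periodic unrelated allocations looks roughly like this (just a sketch; pushWithNoise is a made-up name for it):

#include <cstdlib>
#include <vector>

// Push the nodes as before, but every 1000 nodes do an unrelated 128-byte
// malloc so the heap allocator can't hand out one perfectly contiguous run.
template <typename List>
std::vector<void *> pushWithNoise(List &list, int count)
{
    std::vector<void *> noise;
    for (int i = 0; i < count; ++i) {
        list.push_back(i);
        if (i % 1000 == 0)
            noise.push_back(malloc(128)); // unrelated allocation between node batches
    }
    return noise;                         // caller frees these after the timed run
}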

The statistics gathered are based on running the above 100 times, and these are the results I've found after running that test 15 times.

[Graphs: mean, minimum, maximum, median, and standard deviation values]

So clearly, using the pool allocator performs far better in most cases. It's interesting to see the fluctuations in the max values graph compared to the very stable minimum values, but I would guess that this could be the result of operating system events like context switches. It's also worth noting that I purposely had a few other applications running in the background during tests 12 - 15 because I was curious to see how that might affect the outcomes. That accounts for the last couple of spikes in the standard deviation graph for the pool allocator, which exhibits slightly more random behaviour due to thread interruptions, I would think. The means and min values stay fairly consistent for the pool allocator though.

It's also funny to see just how badly std::list performs compared to even my basic list using malloc/free. To be fair, the values measured are cycles, so in the end it's not quite as dramatic as it seems, but still.. :)

I'm a bit stumped now as to what I can do to make the cache-friendly list even friendlier! Making it a singly-linked list does help a little since the node size is reduced by 8 bytes, but it's not significant enough to really matter. Overall the performance of enumerating the list seems very good to me, and with very low overhead now that the free list in the pool allocator is intrusive. I will continue doing some research to see if there are tricks I haven't thought of, because I do want to keep digging to find out more about how to effectively utilize the cache when it matters.

btaylor2401
However, your pool allocator aligns to 32 bytes, while malloc does not (iirc, malloc does no alignment at all, though I could be mistaken.)
malloc does 16 byte alignment
btaylor2401
A cache line on most (all?) x64 processors is 64 bytes. Your node structure is 24 bytes in size. By aligning to 32 bytes, you're fitting two nodes on each cache line. However, the malloc path fits 2 and 2/3 of a node on one cache line. This is your timing difference -- the malloc path fetches more nodes per cache line read, and so it runs faster. (If you look at the timings, they're very close to a 1.3x speedup, as this would imply.)
As I said earlier, modern versions of malloc align allocations to 16 bytes (so that malloc plays nicely with SSE registers). So this isn't entirely true, and the performance difference is probably coming from something else.
btaylor2401
Finally, you (should) end up with very similar timings between the two after these issues are resolved. You'll need to use proper statistical methods, not just eyeballing the raw values, to compare the two. (Looking at how the distribution as a whole changes, not just individual runs.)
If there is one useful thing I learned in my high school stats class, it has to be this. Understanding basic statistics really helps when doing benchmarks and can help you get more accurate results.

btaylor2401
mean = sum(x) / count
stddev = sqrt(mean - (sum(x*x)/count))
median = middle item in list (if count is odd) or average of two middle items (if count is even)
Then the first and third quartiles are the median of the sublists to either side of the median.
Your formula for standard deviation is not correct; the proper formula is:
stddev = sqrt(sum((xi - u)^2) / count)
where the xi are the sample values and u is the mean.
If you really want to get all statistical about it you could use a two sample t-test for difference of means.

Flyingsand
I'm a bit stumped now as to what I can do to make the cache-friendly list even friendlier! Making it a singly-linked list does help a little since the node size is reduced by 8 bytes, but it's not significant enough to really matter. Overall the performance of enumerating the list seems very good to me, and with very low overhead now that the free list in the pool allocator is intrusive. I will continue doing some research to see if there are tricks I haven't thought of, because I do want to keep digging to find out more about how to effectively utilize the cache when it matters.

At this point, what optimizations you make is going to depend on usage. A list holding integers and a list holding 100+ byte structs require different optimizations.

In your current case, with integers, a significant amount of node memory goes to pointers, as you've noticed. However, remember that we're using pooled allocation here; you don't *need* 64 bits of addressing, because all of your nodes come from a single, much smaller block of memory. Pointers are convenient because they have first-class support in the language, but since access is wrapped in an iterator this isn't a dealbreaker.

So let's not use pointers.
struct ListNode {
    u16 next;
    u16 prev;
    u32 data;
};
struct List {
    ListNode * pool;
    u16 pool_count;
    u16 freelist_head;
};

Instead of storing pointers for next and prev, we store array indices into the pool array. Now, this is *not necessarily a safe change to make*. It introduces some new complications. To do a traditional linked list, you need a NULL value to indicate the end of the list. If we use NULL=0, next and prev actually store (index + 1). Alternatively, we can use NULL=0xFFFF, which makes the indexing less error prone but means (!next) is no longer a null check.

How small you can make this node struct depends on your usage, and the largest list you need to support.
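
As a concrete (made-up) example of iterating with the 0xFFFF sentinel -- this repeats the structs for completeness and adds a `head` index for the first live node, which the struct above leaves out:

#include <cstdint>

typedef uint16_t u16;
typedef uint32_t u32;

static const u16 kNullIndex = 0xFFFF; // "null" sentinel for 16-bit indices

struct ListNode {
    u16 next;
    u16 prev;
    u32 data;
};

struct List {
    ListNode *pool;
    u16       pool_count;
    u16       freelist_head;
    u16       head;            // index of the first live node, or kNullIndex
};

template <typename Fn>
void enumerate_forward(const List &list, Fn fn)
{
    // Follow 16-bit indices instead of pointers; every node still lives in
    // the contiguous list.pool array.
    for (u16 i = list.head; i != kNullIndex; i = list.pool[i].next)
        fn(list.pool[i].data);
}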

Another trick is to apply an SoA transform to the list:
struct List {
    u16 * next;
    u16 * prev;
    u32 * data;
    u16 count;
    u16 freelist_head;
};

This may or may not be an improvement. (There's a lot of misconception that SoA always makes things faster; it's a *different* performance profile, not necessarily a better one.)

The result of this change is that the actual iteration doesn't touch the data. If you're always touching the data, this isn't a win (and, for large pools, may actually be a loss since each array is going to fall on different cache lines.) However, for operations that don't need to read the data itself, you may touch much less memory if the data type is large. (For example, freelist operations tend to only touch next and prev, not the data.)

(As an aside, while this is probably a wash for a list, for associative arrays (i.e., hashtables), this is a really, really important optimization -- you only touch the data you need.)
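
For example, a free-list pop in the SoA layout only reads the next array (a sketch with made-up names; kSoANull is the 0xFFFF sentinel again):

#include <cstdint>

typedef uint16_t u16;
typedef uint32_t u32;

static const u16 kSoANull = 0xFFFF;

struct SoAList {
    u16 *next;
    u16 *prev;
    u32 *data;
    u16  count;
    u16  freelist_head;
};

u16 freelist_pop(SoAList &list)
{
    u16 node = list.freelist_head;
    if (node != kSoANull)
        list.freelist_head = list.next[node]; // reads next[] only
    return node;                              // data[node] is never touched here
}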

EDIT
cubercaleb

btaylor2401
mean = sum(x) / count
stddev = sqrt(mean - (sum(x*x)/count))
median = middle item in list (if count is odd) or average of two middle items (if count is even)
Then the first and third quartiles are the median of the sublists to either side of the median.
Your formula for standard deviation is not correct; the proper formula is:
stddev = sqrt(sum((xi - u)^2) / count)
where the xi are the sample values and u is the mean.
If you really want to get all statistical about it you could use a two sample t-test for difference of means.

The two formulas for standard deviation are equivalent. The advantage of this formulation is that it can be computed in a single pass, and also allows for incremental update, by storing sum and sum squared.

Also, while a t-test can tell you if the mean time changes, that's not always the most useful measure for an optimization. Not for a realtime system. You need to verify that a change that reduces the mean does not greatly increase the standard deviation at the same time. (Because if hitting your worst case drops a frame, it doesn't *matter* if the average is good.)

btaylor2401
The two formulas for standard deviation are equivalent. The advantage of this formulation is that it can be computed in a single pass, and also allows for incremental update, by storing sum and sum squared.
Those two formulas are very different and produce different results as far as I can tell, so no, they are not the same.
cubercaleb
btaylor2401
The two formulas for standard deviation are equivalent. The advantage of this formulation is that it can be computed in a single pass, and also allows for incremental update, by storing sum and sum squared.
Those two formulas are very different and produce different results as far as I can tell, so no, they are not the same.

Okay, so I re-derived the formula and you are correct, I got it wrong. Here's the correct derivation:
var = sum((x - u)^2) / n
= sum((x*x - 2*x*u + u*u)) / n
= (sum(x*x) - sum(2*x*u) + sum(u*u)) / n
= (sum(x*x) - 2*u*sum(x) + sum(u*u)) / n
= (sum(x*x) - 2*u*sum(x) + n*u*u) / n
= (sum(x*x) - 2*(sum(x)^2)/n + n*(sum(x)/n)^2) / n
= ((sum(x*x) - 2*(sum(x)^2)/n + sum(x)^2/n)) / n
= (sum(x*x) - sum(x)^2/n) / n
= sum(x*x)/n - (sum(x)/n)^2

(For variance -- standard deviation is just the square root of this.)

Also, since we're talking about samples rather than a population, the final division is by (n-1), not n.

So:
stddev = sqrt((sum(x*x) - sum(x)^2/n) / (n - 1))

Thanks for catching that. I've updated my sample code to match: https://gist.github.com/ACEfanatic02/74c1e28d4d9777f815705842b97ea11e
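
In code, the one-pass version looks roughly like this (a small sketch, not the gist itself):

#include <cmath>
#include <cstddef>

// Accumulate sum and sum-of-squares in a single pass, then apply the
// corrected sample standard deviation formula above.
void computeStats(const double *samples, size_t n, double *outMean, double *outStddev)
{
    double sum = 0.0, sumSq = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum   += samples[i];
        sumSq += samples[i] * samples[i];
    }
    *outMean   = sum / n;
    *outStddev = std::sqrt((sumSq - (sum * sum) / n) / (n - 1));
}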
btaylor2401
= sum((x*x - 2*x*u + u*u)) / n
= (sum(x*x) - sum(2*x*u) + sum(u*u)) / n
Since subtraction is anticommutative I don't think that you can make that jump. But I am not entirely sure.