But what about Radix sort? It is usually lower than O(n*log(n)). It is practically

**O(n)**sort (to be accurate it is O(k*n) where k is digit count used in RadixSort).

**Linear from element count!**It is very simple to implement, and is very branch-prediction friendly. Usual sort algorithms are very unfriendly to branch-prediction, because they compare individual elements and do different stuff depending on branch. For non-trivial amount of elements (thousands of elements) RadixSort is faster than any QuickSort implementation.

Only disadvantage is that it is not in-place sort. It needs temporary memory where to store objects while sorting. But with out fast memory management it is very cheap operation. Typical implementation uses bytes as "digits" for RadixSort. Here's the sample implementation:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 | // creates u32 value from r32 so that sorted u32 values give same order // as original r32 values when sorted // if float is negative, flips all bits // if float is positive, flips only sign bit inline u32 FloatToU32Key(r32 Key) { u32 KeyBits = *((u32*)&Key); u32 Mask = -s32(KeyBits >> 31) | 0x80000000; KeyBits ^= Mask; return KeyBits; } inline u32 GetByteN(u32 Value, u32 ByteIndex) { Assert(ByteIndex < 4); return (Value >> (8*ByteIndex)) & 0xFF; } internal void SortEntries(render_group *RenderGroup) { u32 Count = RenderGroup->PushBufferElementCount; tile_sort_entry *Entries = (tile_sort_entry *)(RenderGroup->PushBufferBase + RenderGroup->SortEntryAt); // allocate memory, so don't call this in a loop, but we know this method is // called once per frame (at least for now) // or another option would be to reserve memory in PushBuffer in render_group // structure next to where original entries are allocated tile_sort_entry *Temp = PushArray(TempArena, Count, tile_sort_entry); tile_sort_entry *Source = Entries; for (u32 DigitIndex=0; DigitIndex<4; DigitIndex++) { u32 Counts[256] = {}; for (u32 Index=0; Index<Count; Index++) { u32 Key = FloatToU32Key(Source[Index].SortKey); u32 Digit = GetByteN(Key, DigitIndex); Counts[Digit]++; } u32 TotalCount = 0; for (u32 Index=0; Index<ArrayCount(Counts); Index++) { u32 CurrentCount = Counts[Index]; Counts[Index] = TotalCount; TotalCount += CurrentCount; } Assert(TotalCount == Count); for (u32 Index=0; Index<Count; Index++) { tile_sort_entry* Entry = Source + Index; u32 Key = FloatToU32Key(Entry->SortKey); u32 Digit = GetByteN(Key, DigitIndex); Temp[Counts[Digit]++] = *Entry; } tile_sort_entry *Swap = Source; Source = Temp; Temp = Swap; } } |

No comparisons at all. Except loop conditions which are very branch-predictable.

Tricky part about FloatToU32Key is explained here: http://stereopsis.com/radix.html

It can be avoided with a bit more code for creating Count array: http://www.codercorner.com/RadixSortRevisited.htm

And if the key is not float but integer, then whole FloatToU32Key function can be avoided and Entry->SortKey can be used directly. Code becomes even simpler.