Mārtiņš Možeiko
2559 posts / 2 projects
Day 229: what about radix sort?
Edited by Mārtiņš Možeiko on
In episode 229 Casey said that we'll probably implement a better sorting method than BubbleSort, something that is O(n*log(n)) rather than O(n*n). So that would be something like MergeSort, HeapSort or QuickSort, which is "the lowest you can possibly do".

But what about Radix sort? Its complexity is usually lower than O(n*log(n)) - it is practically an O(n) sort (to be precise it is O(k*n), where k is the number of digits used by the radix sort). Linear in the element count! It is very simple to implement, and it is very branch-prediction friendly. The usual sort algorithms are very unfriendly to branch prediction, because they compare individual elements and do different things depending on the result of each comparison. For a non-trivial number of elements (thousands), RadixSort is faster than any QuickSort implementation.

The only disadvantage is that it is not an in-place sort. It needs temporary memory in which to store objects while sorting. But with our fast memory management that is a very cheap operation. A typical implementation uses bytes as the "digits" for RadixSort. Here's a sample implementation:
// creates u32 value from r32 so that sorted u32 values give same order
// as original r32 values when sorted
// if float is negative, flips all bits
// if float is positive, flips only sign bit
inline u32 FloatToU32Key(r32 Key)
{
    u32 KeyBits = *((u32*)&Key);
    u32 Mask = -s32(KeyBits >> 31) | 0x80000000;
    KeyBits ^= Mask;
    return KeyBits;
}

inline u32 GetByteN(u32 Value, u32 ByteIndex)
{
    Assert(ByteIndex < 4);
    return (Value >> (8*ByteIndex)) & 0xFF;
}

internal void
SortEntries(render_group *RenderGroup)
{
    u32 Count = RenderGroup->PushBufferElementCount;
    tile_sort_entry *Entries = (tile_sort_entry *)(RenderGroup->PushBufferBase + RenderGroup->SortEntryAt);

    // This allocates memory, so don't call it in a loop; we know this function is
    // called once per frame (at least for now).
    // Another option would be to reserve memory in the PushBuffer in the render_group
    // structure, next to where the original entries are allocated.
    tile_sort_entry *Temp = PushArray(TempArena, Count, tile_sort_entry);
    tile_sort_entry *Source = Entries;

    for (u32 DigitIndex=0; DigitIndex<4; DigitIndex++)
    {
        u32 Counts[256] = {};
        for (u32 Index=0; Index<Count; Index++)
        {
             u32 Key = FloatToU32Key(Source[Index].SortKey);
             u32 Digit = GetByteN(Key, DigitIndex);
             Counts[Digit]++;
        }

        u32 TotalCount = 0;
        for (u32 Index=0; Index<ArrayCount(Counts); Index++)
        {
            u32 CurrentCount = Counts[Index];
            Counts[Index] = TotalCount;
            TotalCount += CurrentCount;
        }
        Assert(TotalCount == Count);

        for (u32 Index=0; Index<Count; Index++)
        {
            tile_sort_entry* Entry = Source + Index;
            u32 Key = FloatToU32Key(Entry->SortKey);
            u32 Digit = GetByteN(Key, DigitIndex);

            Temp[Counts[Digit]++] = *Entry;
        }

        tile_sort_entry *Swap = Source;
        Source = Temp;
        Temp = Swap;
    }
}

No comparisons at all, except the loop conditions, which are very branch-predictable.

The tricky part about FloatToU32Key is explained here: http://stereopsis.com/radix.html
It can be avoided, at the cost of a bit more code when building the Counts array: http://www.codercorner.com/RadixSortRevisited.htm

And if the key is not a float but an integer, the whole FloatToU32Key function can be avoided and Entry->SortKey can be used directly. The code becomes even simpler.
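
For example (a minimal sketch, assuming a hypothetical tile_sort_entry with a u32 SortKey instead of r32), the counting loop would just read the key directly:

for (u32 Index=0; Index<Count; Index++)
{
    // no FloatToU32Key needed, the integer key already sorts correctly byte by byte
    u32 Digit = GetByteN(Source[Index].SortKey, DigitIndex);
    Counts[Digit]++;
}

and the scatter loop changes in the same way.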
Ginger Bill
222 posts / 3 projects
I am ginger thus have no soul.
Day 229: what about radix sort?
This is the first time that I have seen radix sort and I am now very interested.

How does it perform in practice? I notice it uses temporary memory, but a lot of sorting algorithms do anyway.
Mārtiņš Možeiko
2559 posts / 2 projects
Day 229: what about radix sort?
Edited by Mārtiņš Možeiko on
Yeah, radix sort is my favorite sort. If you have integer or float keys, it is very simple to implement and it is very fast in practice. You can also do it for strings, but the implementation gets a bit tricky.

Here's a small benchmark for this radix sort implementation and std::sort (QuickSort with optimizations for small element count): https://gist.github.com/mmozeiko/0bd42648536b164f5dc8

With GCC 5.3.0 it gives the following times (in milliseconds):
Count          Quick     Radix
64              0.00      0.00
128             0.00      0.00
256             0.01      0.00
512             0.02      0.01
1024            0.04      0.01
2048            0.09      0.02
4096            0.19      0.06
8192            0.45      0.09
16384           0.83      0.19
32768           1.75      0.42
65536           3.88      0.85
131072          7.99      1.84
262144         17.24      4.10
524288         36.52      7.15
1048576        76.17     18.17

Faster than quick sort :)
Ginger Bill
222 posts / 3 projects
I am ginger thus have no soul.
Day 229: what about radix sort?
I think this is going to be my next favourite sort. However, if space is a problem, I think I will need to use something else.

Thank you for bringing this up.
Roderic Bos
70 posts
Day 229: what about radix sort?
Don't know if everybody knows this, but this seems like a helpful page for choosing/understanding a sorting algorithm:

http://www.sorting-algorithms.com/
Andrew Bromage
183 posts / 1 project
Research engineer, resident maths nerd (Erdős number 3).
Day 229: what about radix sort?
mmozeiko
In episode 229 Casey said that we'll probably implement a better sorting method than BubbleSort, something that is O(n*log(n)) rather than O(n*n). So that would be something like MergeSort, HeapSort or QuickSort, which is "the lowest you can possibly do".

But what about Radix sort?

This was covered on day 232. One of the things that Casey mentioned during the Q&A on day 232 is that it's been proven that you can't do better than O(n log n) for a comparison-based sort.

The proof is surprisingly simple to understand if you understand just a little bit of information theory. So I'm going to sketch it out here.

Information theory is basically the theory of communicating over a communication channel. For analysing algorithms this is handy, because you can think of your algorithm as communicating with a query operation, and with a little information theory you can work out how much information your algorithm needs to obtain in order to do its job.

The main fact that you need to know is this: To communicate a number between 1 and N, you need to transfer at least log N bits of information. If it helps, for "bits" read "digits". To communicate a number between 1 and 1,000,000, you need to transmit six decimal digits. The base-10 logarithm of 1,000,000 is 6. But we're computer scientists, so all logarithms are base 2 unless otherwise specified.

In the discussion that follows, I'm going to assume that all of the sort keys are distinct, that is, that no two keys in the collection compare equal. Later, I'll mention (but not explain, much less prove) how things change when there are duplicate keys.

First, let's think about binary search. You have a collection of N numbers, and you need to find one of them. That means that you need to discover a number between 1 and N (or 0 and N-1). To discover such a number, you need log N bits of information.

Now suppose you have a query operation which returns a single bit of information. Like, oh, the "less than or equal to" operation, which takes two keys and returns a boolean. That means you need to call that query operator at least log N times. Each time you call it, you get one bit of information, and you need log N bits, so you need to call it log N times.

So the best search algorithm, under those circumstances, must take O(log N) time. Binary search takes O(log N) time. Binary search is therefore (asymptotically) optimal.

Now let's look at sorting. If you have a list of N elements, and the keys are distinct, then there are N! possible orderings. That's the factorial of N. Sorting those elements is equivalent to discovering which permutation the elements are in. That means that you need to discover a number between 1 and N!, which means you need to learn log(N!) bits.

By Stirling's approximation, log(N!) = N log N + O(lower order crap). It follows that any sort algorithm must discover N log N bits of information. If you have a query operator which returns one bit of information (e.g. a comparison), then you must call it O(N log N) times. It follows that any comparison-based sort requires O(N log N) comparisons.
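
If you'd rather not invoke Stirling at all, there is an elementary way to see the N log N term (my addition, not part of the original argument): at least half of the N factors in N! are greater than or equal to N/2, so

log(N!) = log(1) + log(2) + ... + log(N) >= (N/2) * log(N/2)

which is already on the order of N log N.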

Heap sort and merge sort perform O(N log N) comparisons in the worst case, therefore they are (asymptotically) optimal.

QED

The reason why radix sort could do better is that its query operator returns more than one bit of information.

If there are duplicate keys, then things are a little more complex. In the limiting case, if all of the keys in the list are equal, then it should only take O(N) time to discover this and then say "we're done". If you want to take this into account, then it turns out that comparison-based sorting takes O(NH) time, where H is the entropy of the key distribution. You are not expected to understand this.

You can analyse lots of algorithms this way. For example, the "top k" problem which Casey also mentioned on day 232 can be done in O(N log k) comparisons. An algorithm which does this is left as an exercise.

mmozeiko
The usual sort algorithms are very unfriendly to branch prediction, because they compare individual elements and do different things depending on the result of each comparison.

More to the point, those branches are impossible to predict, because the result of the comparison operation is the very information that you are trying to discover!

Having said that, many sort algorithms are better about this than others. Insertion sort, for example, almost always runs faster than bubble sort on real hardware because its basic operation is not "compare and swap if necessary". The basic structure of bubble sort is:

for (i = 0; i < n; ++i) {
    for (j = 0; j + 1 < n; ++j) {
        compare-and-swap-if-necessary(A[j], A[j+1]);
    }
}


The "swap if necessary" operation is a mostly unpredictable branch which is executed O(n^2) times in the worst case. The basic structure of insertion sort, however, is:

for (i = 1; i < n; ++i) {
    tmp = A[i];
    j = i - 1;
    while (j >= 0 && A[j] > tmp) {
        A[j+1] = A[j];
        --j;
    }
    A[j+1] = tmp;
}


That branch is still mostly unpredictable, but it's the termination condition for the inner loop, so it's only mispredicted O(n) times (roughly once per outer iteration) rather than O(n^2). Remember what Casey said about constant factors? A mispredicted branch is expensive, so an algorithm which mispredicts branches O(n) times will typically beat one that mispredicts branches O(n^2) times.

Yeah, I never use bubble sort.

mmozeiko
The only disadvantage is that it is not an in-place sort.

That is not true. The usual in-place variant of radix sort is sometimes called American flag sort.
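
To make that concrete, here is a rough sketch of an American flag sort on raw u32 keys (my own illustration, not code from the thread or from the paper; note that it is unstable and branches far more than the copy-out version above):

static void AmericanFlagSort(u32 *Keys, u32 Count, u32 Shift = 24)
{
    // Histogram of the current byte (most significant byte first).
    u32 Counts[256] = {};
    for (u32 Index = 0; Index < Count; Index++)
    {
        Counts[(Keys[Index] >> Shift) & 0xFF]++;
    }

    // Next[d] is the next free slot in bucket d, End[d] is one past its last slot.
    u32 Next[256];
    u32 End[256];
    u32 Total = 0;
    for (u32 Digit = 0; Digit < 256; Digit++)
    {
        Next[Digit] = Total;
        Total += Counts[Digit];
        End[Digit] = Total;
    }

    // Permute in place: keep swapping the key under the cursor into its destination
    // bucket until the cursor slot holds a key that already belongs to this bucket.
    for (u32 Digit = 0; Digit < 256; Digit++)
    {
        while (Next[Digit] < End[Digit])
        {
            u32 Key = Keys[Next[Digit]];
            u32 Dest = (Key >> Shift) & 0xFF;
            if (Dest == Digit)
            {
                Next[Digit]++;
            }
            else
            {
                Keys[Next[Digit]] = Keys[Next[Dest]];
                Keys[Next[Dest]++] = Key;
            }
        }
    }

    // Recurse into each bucket on the next byte.
    if (Shift > 0)
    {
        u32 Start = 0;
        for (u32 Digit = 0; Digit < 256; Digit++)
        {
            if (Counts[Digit] > 1)
            {
                AmericanFlagSort(Keys + Start, Counts[Digit], Shift - 8);
            }
            Start += Counts[Digit];
        }
    }
}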

Actually, you don't even need extra storage to store the digit counts! This is a very counter-intuitive result which was only settled a couple of years ago using one of the most brilliant pieces of bit hackery I've seen in recent times. Thanks to information theory, you need fewer bits to store an ascending sequence of integers than an unsorted sequence, and this extra space that you gain as the algorithm progresses is precisely equal to the extra space that you need to do radix sort!

They also give impractical algorithms for when you can't modify the sort keys, but essentially this settles the problem from a theoretical point of view: fixed-size integer sorting takes O(n) time with O(1) extra space in the worst case.

mmozeiko
// creates u32 value from r32 so that sorted u32 values give same order
// as original r32 values when sorted
// if float is negative, flips all bits
// if float is positive, flips only sign bit
inline u32 FloatToU32Key(r32 Key)
{
    u32 KeyBits = *((u32*)&Key);
    u32 Mask = -s32(KeyBits >> 31) | 0x80000000;
    KeyBits ^= Mask;
    return KeyBits;
}

I'm pretty sure this doesn't handle Inf or subnormal numbers correctly. Maybe that's not an issue...
Mārtiņš Možeiko
2559 posts / 2 projects
Day 229: what about radix sort?
Edited by Mārtiņš Možeiko on
Yeah, I'm aware of the algorithm to do radix sort in place, but it does a lot more branching, so I felt it would be slower than the trivial radix sort implementation with additional memory. Also, afaik it is an unstable sort, right?

The idea of not using any memory at all is pretty cool. But I feel the implementation complexity will make it slower than a simple 256-element array on the stack.

As for infinity and denormals, I seriously doubt they would ever show up as sort keys in our render buffer. But if you don't like them, it should be pretty straightforward to replace them with plain integer keys: because our layers are discrete, the render command placement on layers can be assigned simple integer values.

Here's the new benchmark with unstable inplace radix sort: https://gist.github.com/mmozeiko/0bd42648536b164f5dc8
Count          Quick     Radix   RadixInplace   RadixInplOpt
64              0.00      0.00      0.00           0.00
128             0.00      0.00      0.01           0.00
256             0.01      0.00      0.02           0.01
512             0.02      0.01      0.05           0.01
1024            0.03      0.01      0.13           0.02
2048            0.07      0.02      0.26           0.04
4096            0.17      0.04      0.45           0.09
8192            0.35      0.07      0.77           0.19
16384           0.77      0.15      1.55           0.42
32768           1.61      0.29      3.33           0.89
65536           3.51      0.59      6.49           1.83
131072          7.32      1.22     10.74           3.28
262144         14.79      2.49     14.99           6.53
524288         29.43      4.89     23.31          12.66
1048576        58.25      9.77     38.81          25.92

While it's still faster than QuickSort for larger element counts, it is slower than the simpler radix sort that uses additional memory. RadixInplOpt is almost the same as RadixInplace, but with a small optimization: when the element count in a bucket is relatively small (for example, less than 64), insertion sort is used instead. It improves performance. Pretty much the same optimization is used in the std::sort algorithm.
Andrew Bromage
183 posts / 1 project
Research engineer, resident maths nerd (Erdős number 3).
Day 229: what about radix sort?
mmozeiko
Yeah, I'm aware of the algorithm to do radix sort in place, but it does a lot more branching, so I felt it would be slower than the trivial radix sort implementation with additional memory. Also, afaik it is an unstable sort, right?

The third algorithm in the paper is in-place, no extra memory, and stable. It's also over three pages full of pseudocode and nobody should ever try to implement it.

mmozeiko
The idea of not using any memory at all is pretty cool. But I feel the implementation complexity will make it slower than a simple 256-element array on the stack.

Of course. Ultimately, that paper is a CS theory paper.

When CS theory papers give algorithms, they are generally existence proofs / galactic algorithms (sometimes both). A galactic algorithm (a term coined by Ken Regan, I believe) is an asymptotically "better" algorithm whose speedup will not be realised unless the size of the input is larger than the number of fundamental particles in this galaxy. An existence proof is merely to show that something is possible.

In this case, it is a proof that it is possible to stable-sort integers in O(n) time with constant extra memory. Since it's literally impossible to do better, the long-standing theoretical question is resolved.

Incidentally, since we're talking about P vs NP, I'd like to talk for a moment about my favourite existence proof algorithm: Tarski's algorithm for quantifier elimination in real closed fields. The existence of this algorithm shows that the theory of the real numbers with field operations is decidable, which is awesome. However, is it a practical algorithm?

Well, the complexity class called EXP or EXPTIME is, basically, O(2^n). DEXPTIME, or 2-EXPTIME, is O(2^(2^n)). You can similarly define 3-EXPTIME, and in general, k-EXPTIME. The union of all of the k-EXPTIMEs for all integers is the complexity class ELEMENTARY.

Tarski's algorithm is in the complexity class NONELEMENTARY. Let that sink in.
Mārtiņš Možeiko
2559 posts / 2 projects
Day 229: what about radix sort?
Edited by Mārtiņš Možeiko on
Pseudonym73
The third algorithm in the paper is in-place, no extra memory, and stable.

Are you sure it's stable? I tried to implement it, but all I get is an unstable sort. Well, I'm not yet 100% sure it's not some bug in my code, so...

It seems FreeBSD also implements radix sort from the same paper. Here's the source: https://svnweb.freebsd.org/base/s...?revision=256281&view=markup#l132
It explicitly has an "Unstable, in-place sort" comment before the r_sort_a function.

As far as I understand the paper, it is unfair to say that algorithm C doesn't use memory. It uses an additional O(log N) memory on the stack.
Andrew Bromage
183 posts / 1 project
Research engineer, resident maths nerd (Erdős number 3).
Day 229: what about radix sort?
mmozeiko
Are you sure it's stable? I tried to implement it, but all I get is an unstable sort. Well, I'm not yet 100% sure it's not some bug in my code, so...

Possible miscommunication. I was referring to the Patrascu paper, Radix Sorting With No Extra Space.
Mārtiņš Možeiko
2559 posts / 2 projects
Day 229: what about radix sort?
OK, yeah I was talking about American Flag Sort.
fod669
3 posts
Day 229: what about radix sort?
Edited by fod669 on
I had a question in the stream today about combining the count steps into one. I was curious so I tried it myself using mmozeiko's testing code.

Here's the modified radix sort code. I just called it RadixSort5n because it makes 5 passes over the list.

static void RadixSort5n(r32 *Entries, r32* Temp, u32 Count)
{
	r32 *Source = Entries;

	u32 Counts[4][256] = {};
	for (u32 Index = 0; Index < Count; Index++)
	{
		u32 Key = FloatToU32Key(Source[Index]);
		Counts[0][GetByteN(Key, 0)]++;
		Counts[1][GetByteN(Key, 1)]++;
		Counts[2][GetByteN(Key, 2)]++;
		Counts[3][GetByteN(Key, 3)]++;
	}

	for (u32 DigitIndex = 0; DigitIndex < 4; DigitIndex++)
	{
		u32 TotalCount = 0;
		for (u32 Index = 0; Index < 256; Index++)
		{
			u32 CurrentCount = Counts[DigitIndex][Index];
			Counts[DigitIndex][Index] = TotalCount;
			TotalCount += CurrentCount;
		}
	}

	for (u32 DigitIndex = 0; DigitIndex < 4; DigitIndex++)
	{
		for (u32 Index = 0; Index < Count; Index++)
		{
			u32 Key = FloatToU32Key(Source[Index]);
			u32 Digit = GetByteN(Key, DigitIndex);

			Temp[Counts[DigitIndex][Digit]++] = Source[Index];
		}

		r32 *Swap = Source;
		Source = Temp;
		Temp = Swap;
	}
}


And here are the results:

Count          Quick     Radix   Radix5n   RadixInplace   RadixInplOpt
64              0.00      0.00      0.00      0.00           0.00
128             0.00      0.00      0.00      0.01           0.00
256             0.01      0.00      0.00      0.03           0.01
512             0.02      0.01      0.01      0.07           0.01
1024            0.04      0.01      0.01      0.17           0.03
2048            0.09      0.03      0.03      0.35           0.06
4096            0.20      0.07      0.05      0.59           0.11
8192            0.43      0.12      0.10      1.05           0.22
16384           0.91      0.24      0.20      2.16           0.49
32768           1.95      0.50      0.37      4.74           1.20
65536           4.08      0.94      0.80     10.20           2.22
131072          8.95      2.04      1.62     15.59           4.08
262144         18.26      4.14      3.30     21.46           7.95
524288         36.02      7.52      6.73     31.38          15.23
1048576        69.79     15.94     13.10     51.89          29.97


So overall it is a slight improvement, especially as the list gets larger, which I suppose is to be expected: the algorithm is linear anyway, and the counting overhead becomes smaller relative to the size of the list.

The only downside really is the cost of the extra 3 arrays.
Mārtiņš Možeiko
2559 posts / 2 projects
Day 229: what about radix sort?
Edited by Mārtiņš Možeiko on
Nice!
If the list gets larger you can improve it even more: do only 4 passes. Basically, process 11 bits at a time; that changes the element count in the Counts array from 256 to 2048, and you'll need only 3 such Counts arrays.

It will use a bit more stack memory, but if you are sorting a million (or whatever) elements, then an additional temporary 3x2048 entries on the stack won't make anything worse.
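
Here's roughly what that could look like on plain u32 keys (a sketch of the idea, not tested code from the gist; for float keys, run them through FloatToU32Key first):

inline u32 GetDigit11(u32 Value, u32 DigitIndex)
{
    // 11-bit digits: digit 0 = bits 0-10, digit 1 = bits 11-21, digit 2 = bits 22-31
    return (Value >> (11*DigitIndex)) & 0x7FF;
}

static void RadixSort4n(u32 *Keys, u32 *Temp, u32 Count)
{
    u32 *Source = Keys;

    // One counting pass builds all three 2048-entry histograms at once, so the
    // data is touched 4 times in total (1 counting pass + 3 distribution passes).
    u32 Counts[3][2048] = {};   // ~24 KB on the stack
    for (u32 Index = 0; Index < Count; Index++)
    {
        u32 Key = Source[Index];
        Counts[0][GetDigit11(Key, 0)]++;
        Counts[1][GetDigit11(Key, 1)]++;
        Counts[2][GetDigit11(Key, 2)]++;
    }

    // Turn each histogram into exclusive prefix sums (bucket start offsets).
    for (u32 DigitIndex = 0; DigitIndex < 3; DigitIndex++)
    {
        u32 TotalCount = 0;
        for (u32 Index = 0; Index < 2048; Index++)
        {
            u32 CurrentCount = Counts[DigitIndex][Index];
            Counts[DigitIndex][Index] = TotalCount;
            TotalCount += CurrentCount;
        }
    }

    // Three stable distribution passes, 11 bits at a time.
    for (u32 DigitIndex = 0; DigitIndex < 3; DigitIndex++)
    {
        for (u32 Index = 0; Index < Count; Index++)
        {
            u32 Digit = GetDigit11(Source[Index], DigitIndex);
            Temp[Counts[DigitIndex][Digit]++] = Source[Index];
        }

        u32 *Swap = Source;
        Source = Temp;
        Temp = Swap;
    }

    // An odd number of distribution passes leaves the result in the Temp buffer; copy it back.
    if (Source != Keys)
    {
        for (u32 Index = 0; Index < Count; Index++)
        {
            Keys[Index] = Source[Index];
        }
    }
}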

Btw, Casey didn't mention this on stream, but to understand radix sort more easily, you can start by looking only at its inner loop. That code implements counting sort: https://en.wikipedia.org/wiki/Counting_sort which is very easy to understand. Radix sort simply applies counting sort multiple times, once for every digit in the key. And because counting sort is a stable sort, the result still comes out fully sorted.
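
A minimal counting sort on byte-sized keys could look like this (my sketch, reusing the u8/u32 style from above); it is exactly one pass of the radix sort, with the "digit" being the whole key, and it leaves the sorted result in Temp:

static void CountingSort(u8 *Keys, u8 *Temp, u32 Count)
{
    // Histogram: how many times each key value occurs.
    u32 Counts[256] = {};
    for (u32 Index = 0; Index < Count; Index++)
    {
        Counts[Keys[Index]]++;
    }

    // Exclusive prefix sum: where each key value's run starts in the output.
    u32 TotalCount = 0;
    for (u32 Index = 0; Index < 256; Index++)
    {
        u32 CurrentCount = Counts[Index];
        Counts[Index] = TotalCount;
        TotalCount += CurrentCount;
    }

    // Scatter in order of appearance, which is what makes the sort stable.
    for (u32 Index = 0; Index < Count; Index++)
    {
        Temp[Counts[Keys[Index]]++] = Keys[Index];
    }
}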
Gianluca Alloisio
33 posts
Day 229: what about radix sort?
I've written a devlog about sorting for libLepre, and mentioned this thread. Thanks guys :) http://liblepre.tumblr.com/post/138434159025/devlog-33-sorting
Dumitru Frunza
24 posts
Apprentice github.com/dfrunza/hoc
Day 229: what about radix sort?

One of the things that Casey mentioned during the Q&A on day 232 is that it's been proven that you can't do better than O(n log n) for a comparison-based sort.

There's another proof of this in the book "Data Structures and Algorithms" by Alfred Aho et al., Section 8.6 "A Lower Bound for Sorting by Comparisons",
which I find to be simple and accessible.

Here's my interpretation of it.

Consider the set of all possible initial orderings (also called 'permutations') of the elements of the input array. The sorting algorithm can be thought of as a process by which this set is progressively "narrowed" down until only one ordering remains - the sorted ordering.

As an illustration, take an array A that contains 3 elements a,b,c in some order, comparable by the '<' (less than) relation. Start with the set of all possible orderings {abc,acb,bac,bca,cab,cba}, then ask a question about the relation between two elements of the array - say, A[0]<A[1]? Split the set in two: one part containing all orderings consistent with the answer 'yes', the other consistent with the answer 'no'. E.g. let A:=abc; if A[0]<A[1] then we know that a<b, so pick the orderings (from the initial set) where a precedes b and make a new set {abc,acb,cab}; if A[0]>=A[1] then b<a, so pick the orderings where b precedes a and make the set {bac,bca,cba}. Proceed in this manner recursively, asking a question and splitting the sets until the resulting sets contain only one ordering.

This process creates a binary tree called a 'decision tree', and the questions asked are specific to the sorting algorithm employed. The running time of the algorithm, in the worst case, is proportional to the longest path in the decision tree. A balanced tree is optimal in this sense, because its longest root-to-leaf path is minimal among all possible tree configurations. Thus the best possible algorithm will produce a balanced decision tree.

In a balanced binary tree, the length of a path from the root to its leaves is log(x), where x is the number of leaves. That's easy to see:
     * ------ 1 node, len(path)=log(1)=0
    / \
  *----*----- 2 nodes, len(path)=log(2)=1
 / \   / \
*---*-*---*-- 4 nodes, len(path)=log(4)=2
...........   x nodes, len(path)=log(x)

Let n be the number of elements in the input array; then the total number of leaves in the decision tree is n! (n factorial), because that's the number of all possible initial orderings of the array, and each leaf in the decision tree contains only one (unique) ordering. The length of the longest path in the decision tree is therefore log(n!).
To see that log(n!) is on the order of n*log(n), bound n! from both sides by powers of n: on one hand n! < n^n, so log(n!) < n*log(n); on the other hand at least half of the factors in n! are >= n/2, so n! >= (n/2)^(n/2) and log(n!) >= (n/2)*log(n/2). So any comparison-based sort needs on the order of n*log(n) comparisons in the worst case.

q.e.d.
q.e.d.