mmozeiko
In episode 229 Casey said that we'll probably implement a better sorting method than BubbleSort. Something that is O(n*log(n)), not O(n*n), so something like MergeSort, HeapSort or QuickSort, which is "the lowest you can possibly do".
But what about Radix sort?
This was covered on day 232. One of the things that Casey mentioned during the Q&A on day 232 is that it's been proven that you can't do better than O(n log n) for a comparison-based sort.
The proof is surprisingly simple to understand if you know just a little bit of information theory. So I'm going to sketch it out here.
Information theory is basically the theory of communicating over a communication channel. This is handy for analysing algorithms, because you can think of your algorithm as communicating with a query operation, and with a little information theory you can work out how much information your algorithm needs to extract to do its job.
The main fact that you need to know is this: To communicate a number between 1 and N, you need to transfer at least log N bits of information. If it helps, for "bits" read "digits". To communicate a number between 1 and 1,000,000, you need to transmit six decimal digits. The base-10 logarithm of 1,000,000 is 6. But we're computer scientists, so all logarithms are base 2 unless otherwise specified.
In the discussion that follows, I'm going to assume that all of the sort keys are distinct, that is, that no two keys in the collection test equal. Later, I'll mention (but not explain, much less prove) how things change when there are duplicate keys.
First, let's think about binary search. You have a collection of N numbers, and you need to find one of them. That means that you need to discover a number between 1 and N (or 0 and N-1). To discover such a number, you need log N bits of information.
Now suppose you have a query operation which returns a single bit of information. Like, oh, the "less than or equal to" operation, which takes two keys and returns a boolean. That means you need to call that query operator
at least log N times. Each time you call it, you get one bit of information, and you need log N bits, so you need to call it log N times.
So the best search algorithm, under those circumstances, must take O(log N) time. Binary search takes O(log N) time. Binary search is therefore (asymptotically) optimal.
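To make that concrete, here is a minimal binary search sketch (mine, not anything from the stream; the names are made up). Each iteration asks one comparison question and halves the candidate range, so it extracts one bit per query and needs about log2(Count) of them:
| #include <stddef.h>

// Finds Key in the sorted array Values[0..Count-1]; returns its index,
// or -1 if it is not there.
ptrdiff_t BinarySearch(const int *Values, size_t Count, int Key)
{
    size_t Lo = 0, Hi = Count;          // candidates live in [Lo, Hi)
    while (Lo < Hi) {
        size_t Mid = Lo + (Hi - Lo) / 2;
        if (Values[Mid] < Key) {
            Lo = Mid + 1;               // one bit learned: Key is in the upper half
        } else {
            Hi = Mid;                   // one bit learned: Key is at Mid or below
        }
    }
    return (Lo < Count && Values[Lo] == Key) ? (ptrdiff_t)Lo : -1;
}
|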
Now let's look at sorting. If you have a list of N elements, and the keys are distinct, then there are N! possible orderings. That's the factorial of N. Sorting those elements is equivalent to discovering which permutation the elements are in. That means that you need to discover a number between 1 and N!, which means you need to learn log(N!) bits.
By
Stirling's approximation, log(N!) = N log N + O(lower order crap). It follows that any sort algorithm must discover N log N bits of information. If you have a query operator which returns one bit of information (e.g. a comparison), then you must call it O(N log N) times. It follows that any comparison-based sort requires O(N log N) comparisons.
Heap sort and merge sort perform O(N log N) comparisons in the worst case, therefore they are (asymptotically) optimal.
QED
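If you want to see those two quantities next to each other, here's a quick back-of-the-envelope check (my own throwaway snippet, not part of the proof) that sums log2(i) to get log2(N!) and prints it alongside N*log2(N):
| #include <math.h>
#include <stdio.h>

int main(void)
{
    // log2(N!) is the number of bits a sort must discover; compare it to N*log2(N).
    for (int N = 16; N <= 65536; N *= 16) {
        double Log2Factorial = 0.0;
        for (int i = 2; i <= N; ++i) {
            Log2Factorial += log2(i);   // log2(N!) = log2(2) + log2(3) + ... + log2(N)
        }
        printf("N=%6d  log2(N!)=%12.1f  N*log2(N)=%12.1f\n",
               N, Log2Factorial, N * log2(N));
    }
    return 0;
}
|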
The reason why radix sort could do better is that its query operator returns more than one bit of information.
If there are duplicate keys, then things are a little more complex. In the limiting case, if all of the keys in the list are equal, then it should only take O(N) time to discover this and then say "we're done". If you want to take this into account, then it turns out that comparison-based sorting takes O(NH) time, where H is the
entropy of the key distribution. You are not expected to understand this.
You can analyse lots of algorithms this way. For example, the "top k" problem which Casey also mentioned on day 232 can be done in O(N log k) comparisons. An algorithm which does this is left as an exercise.
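(If you do the exercise and want something to compare against, here is one common approach, sketched with made-up names and assuming k >= 1 and Count >= k: keep a binary min-heap of the best k elements seen so far, so each of the N elements costs at most O(log k) comparisons.)
| #include <stddef.h>

// After TopK() returns, Heap[0..k-1] holds the k largest of Values[0..Count-1]
// (in no particular order). Heap[0] is always the smallest of the current top k.
static void SiftDown(int *Heap, size_t k, size_t i)
{
    for (;;) {
        size_t Left = 2*i + 1, Right = 2*i + 2, Smallest = i;
        if (Left  < k && Heap[Left]  < Heap[Smallest]) Smallest = Left;
        if (Right < k && Heap[Right] < Heap[Smallest]) Smallest = Right;
        if (Smallest == i) break;
        int Tmp = Heap[i]; Heap[i] = Heap[Smallest]; Heap[Smallest] = Tmp;
        i = Smallest;
    }
}

void TopK(const int *Values, size_t Count, int *Heap, size_t k)
{
    // Seed the heap with the first k values, then heapify them.
    for (size_t i = 0; i < k; ++i) Heap[i] = Values[i];
    for (size_t i = k / 2; i-- > 0; ) SiftDown(Heap, k, i);

    // Every later value either loses to the current k-th best (one comparison)
    // or replaces it and pays O(log k) comparisons to restore the heap.
    for (size_t i = k; i < Count; ++i) {
        if (Values[i] > Heap[0]) {
            Heap[0] = Values[i];
            SiftDown(Heap, k, 0);
        }
    }
}
|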
mmozeiko
Usual sort algorithms are very unfriendly to branch prediction, because they compare individual elements and do different stuff depending on the result of the comparison.
More to the point, those branches are impossible to predict, because the result of the comparison operation is the very information that you are trying to discover!
Having said that, many sort algorithms are better about this than others. Insertion sort, for example, almost always runs faster than bubble sort on real hardware because its basic operation is not "compare and swap if necessary". The basic structure of bubble sort is:
| for (i = 0; i < n; ++i) {
    for (j = 0; j + 1 < n; ++j) {
        compare-and-swap-if-necessary(A[j], A[j+1]);
    }
}
|
The "swap if necessary" operation is a mostly unpredictable branch which is executed O(n^2) times in the worst case. The basic structure of insertion sort, however, is:
| for (i = 1; i < n; ++i) {
tmp = A[i];
j = i - 1;
while (j >= 0 && A[j] > tmp) {
A[j+1] = A[j];
--j;
}
A[j+1] = tmp;
}
|
That branch is still mostly unpredictable, but it's the termination condition for the inner loop, so it only runs O(n) times. Remember what Casey said about constant factors? A mispredicted branch is expensive, so an algorithm which mispredicts branches O(n) times will typically beat one that mispredicts branches O(n^2) times.
Yeah, I never use bubble sort.
mmozeiko
Only disadvantage is that it is not an in-place sort.
That is not true. The usual in-place variant of radix sort is sometimes called
American flag sort.
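For anyone who hasn't seen it, here is roughly what one pass looks like (my sketch with made-up names, not a reference implementation; the recursion into each bucket on the next digit is left out):
| #include <stdint.h>
#include <stddef.h>

// One in-place pass of American flag sort on one byte of each 32-bit key
// (Shift = 24, 16, 8 or 0). Every key gets cycled directly into its bucket.
void AmericanFlagPass(uint32_t *Keys, size_t Count, unsigned Shift)
{
    size_t CountPerDigit[256] = {0};
    size_t BucketStart[256], BucketNext[256];

    // First pass: histogram the digit values.
    for (size_t i = 0; i < Count; ++i) {
        ++CountPerDigit[(Keys[i] >> Shift) & 0xFF];
    }

    // Prefix-sum the counts into bucket start offsets.
    size_t Offset = 0;
    for (int d = 0; d < 256; ++d) {
        BucketStart[d] = BucketNext[d] = Offset;
        Offset += CountPerDigit[d];
    }

    // Second pass: permute in place by swapping each key into its home bucket.
    for (int d = 0; d < 256; ++d) {
        while (BucketNext[d] < BucketStart[d] + CountPerDigit[d]) {
            uint32_t Key = Keys[BucketNext[d]];
            unsigned Digit = (Key >> Shift) & 0xFF;
            while (Digit != (unsigned)d) {
                uint32_t Displaced = Keys[BucketNext[Digit]];
                Keys[BucketNext[Digit]++] = Key;   // Key lands in its home bucket
                Key = Displaced;                   // continue with the key it displaced
                Digit = (Key >> Shift) & 0xFF;
            }
            Keys[BucketNext[d]++] = Key;
        }
    }
}
|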
Actually, you don't even need extra storage to store the digit counts! This is a very counter-intuitive result
which was only settled a couple of years ago using one of the most brilliant pieces of bit hackery I've seen in recent times. Thanks to information theory, you need fewer bits to store an ascending sequence of integers than an unsorted sequence, and this extra space that you gain as the algorithm progresses is precisely equal to the extra space that you need to do radix sort!
The authors also give impractical algorithms for when you can't modify sort keys, but essentially this settles the problem from a theoretical point of view: fixed-size integer sorting takes O(n) time with O(1) extra space in the worst case.
mmozeiko
| // creates u32 value from r32 so that sorted u32 values give same order
// as original r32 values when sorted
// if float is negative, flips all bits
// if float is positive, flips only sign bit
inline u32 FloatToU32Key(r32 Key)
{
u32 KeyBits = *((u32*)&Key);
u32 Mask = -s32(KeyBits >> 31) | 0x80000000;
KeyBits ^= Mask;
return KeyBits;
}
|
I'm pretty sure this doesn't handle Inf or subnormal numbers correctly. Maybe that's not an issue...
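If anyone wants to poke at that, here's the kind of throwaway harness I'd use to check it (my sketch; it uses memcpy instead of the pointer cast, and just prints whether the u32 keys for a few awkward values come out in the same order as the floats):
| #include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

typedef uint32_t u32;
typedef float r32;

// Same mapping as above: negative floats get all bits flipped,
// positive floats get only the sign bit flipped.
static u32 FloatToU32Key(r32 Key)
{
    u32 KeyBits;
    memcpy(&KeyBits, &Key, sizeof(KeyBits));
    u32 Mask = (KeyBits >> 31) ? 0xFFFFFFFFu : 0x80000000u;
    return KeyBits ^ Mask;
}

int main(void)
{
    // 1e-40f is subnormal; the list is already in ascending float order.
    r32 Values[] = { -INFINITY, -1.0f, -1e-40f, 0.0f, 1e-40f, 1.0f, INFINITY };
    int Count = sizeof(Values) / sizeof(Values[0]);
    for (int i = 0; i + 1 < Count; ++i) {
        u32 A = FloatToU32Key(Values[i]);
        u32 B = FloatToU32Key(Values[i + 1]);
        printf("%-12g -> %08x   %-12g -> %08x   key order %s\n",
               (double)Values[i], (unsigned)A, (double)Values[i + 1], (unsigned)B,
               A <= B ? "ok" : "WRONG");
    }
    return 0;
}
|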