I made a comment about doing a merge sort from the ground up in Yesterday's Q&A and thought I would put the code here in a more readable format:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
Entry* Source = First;
Entry* Out

for(u32 RangeSize = 1;
    RangeSize < Count;
    RangeSize *= 2)
{
    for(u32 StartIndex = 0, MidIndex = RangeSize;
        MidIndex < Count;
        StartIndex += RangeSize*2, MidIndex += RangeSize*2)
    {
        entry* Start = Source + StartIndex;
        entry* Mid = Source + MidIndex;
        entry* End = Min(Mid+RangeSize, First+Count);

        entry* Dest = Out + StartIndex;

        //Note(ratchet): this is the exact same merge code as Casey's 
        // hoisted out for laziness reasons
        Merge(Start, Mid, End, Dest);

    }

    // Note(ratchet): We may have skipped the last interval
    u32 RemainingEntries = Count % (RangeSize * 2);
    if(RemainingEntries < RangeSize)
    {
        for(u32 EntryIndex = Count - RemainingEntries ;
            EntryIndex < Count;
            ++Start)
        {
            Out[EntryIndex] = Source[EntryIndex];
        }
    }
    entry* Tmp = Out;
    Out = Source;
    Source = Tmp;
}
//Note(ratchet): we could have gone through the outer loop an odd number of times
// so we may need to copy from the temporary buffer back to the input buffer
if(First != Source)
{
    for(u32 EntryIndex = 0;
        EntryIndex < Count;
        ++Start)
    {
        First[EntryIndex] = Source[EntryIndex];
    }
}


This has a few advantages:

First and foremost is the removal of the (non-tail-optimizable) recursion in favor of the explicit loop. Mergesort is one of those algorithms that look very elegant when expressed in recursive format but has a more efficient form in the iterative format (kinda like fibonacci but with only the O(log n) memory cost + function call overhead).

Second is that ping-ponging the buffers is much easier in this form.

Third is that you can now easily add a pre-pass to sort chunks of size n with a fast insertion sort and start the outer loop at that size n instead of 1. Finding which START_RANGE_SIZE to use and which SimpleSort is for during profiling.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
for(u32 StartIndex = 0;
    StartIndex < Count;
    StartIndex += START_RANGE_SIZE)
{
    Entry* Start = Source + StartIndex;
    Entry* End = Min(Start + RangeSize, First+Count);

    //Note(ratchet): this is the fast sort of your choice optimized for ranges of 
    // small sizes
    SimpleSort(Start, End);

}

for(u32 RangeSize = START_RANGE_SIZE;
    RangeSize < Count;
    RangeSize *= 2)
{
    //...