Recursive vs Non-recursive algorithms

Andrew Chronister

#4161

July 6, 2015

I'm writing some code that seems like it would be simpler to express using a recursive method (it's a read-only tree walk) and I'm wondering if there are significant performance differences between recursive algorithms and non-recursive ones.

I had the misfortune of learning about these initially in the context of Java, which meant such algorithms were highly emphasized even though they suffer significantly from stack overflow problems at high recursion depth.

I don't know what the trade-offs are when writing recursive algorithms in C, and I'd be interested to find out. I'm guessing stack overflow is still an issue, if somewhat harder to achieve, but could be mitigated by allocating a larger stack?

(Now that I think about it, I should probably just write the non-recursive version and do a performance comparison with the hardware clock. Maybe I'll do that and post the results.)

Abner Coimbre

#4163

July 7, 2015

Check out this chapter from author V. Anton Spraul (pages 24-27) for some arguments to think about.

Andrew Chronister

#4164

July 7, 2015

abnercoimbre
Check out this chapter from author V. Anton Spraul (pages 24-27) for some arguments to think about.

Looks like his two major arguments against it are:
1) Repeated function calls result in overhead, presumably since the compiler has to set up a stack frame and jump to the code each time.
2) Recursive algorithms use system stack space instead of (sometimes) user stack space.

I don't think conceptual complexity really applies, since walking a tree is actually quite a bit easier to think about recursively.

But the other two are both good points. I'll have to write a procedural version and do a comparison like I said, and it helps that he has an example there.

Miguel Lechón

#4165

July 7, 2015

When calling a function recursively, take also into account that the whole state of the function (including local variables) is pushed onto the process stack. In case you have one or two big structs laying around and the tree is sufficiently deep (big and unbalanced), that could add up really fast.

Andrew Chronister

#4166

July 7, 2015

Okay, so I did some initial tests and here's what I've got. The actual cycle count isn't too relevant since it's very dependent on the source data, but the comparison is useful:

Recursive version average: 77589.1 cycles
Non-Recursive version average: 95863.8 cycles
(Cycles averaged over 1000 tests)

Looks like non-recursive is taking about 1.2 times as long for this data set. I should note it's a very shallow tree (probably 6 or 7 nodes deep at most), and I'm using a custom built array/stack thing that has not had any kind of performance pass yet. Those both might be contributing here.

Going to run some more tests with a deeper tree and see what happens, and maybe look at optimizing my stack a bit more. Interesting results so far. I'll post my source code when I'm done messing with it.

Edit: I should note that shallow trees are going to be a very common case with what this is being used for. I'm going to have to keep both of the algorithms around until I determine which is faster in the majority of real use cases.

Edit 2: Looks like tree depth isn't (much) of a factor. With a much deeper tree:

Recursive version average: 197385.8 cycles
Non-Recursive version average: 231653.1 cycles
(Cycles averaged over 1000 tests)

Non-Recursive is now only 1.17 times worse, but that's not much of an increase (not even outside of the 0.05 significance range!) I think the probably-inefficient stack is the more likely culprit. So at this point, I guess I'm pitting my data structure against the compiler's built in one. Yay.

Edit 3: Okay, just noticed another confounding factor here. The recursive algorithm is going depth-first, while the non-recursive is going breadth-first. So whatever gains the non-recursive might have had with a deeper tree were probably offset by that.

Edit 4: Well, would you look at that. Wrote a depth-first version and it gets a lot closer:

Recursive version average: 195136.7 cycles
Non-recur breadth-first average: 230491.2 cycles
Non-recur depth-first average: 199895.4 cycles
(Cycles averaged over 1000 tests)

Nonrecursive is now only 1.02 times worse than recursive (still with deep tree)...but still worse. Hmm.

Edit 5: Haha oh my god. Benchmarked against std::stack (used in abner's book, if I'm interpreting the phrase "system stack" correctly) and...

Recursive version average: 170689.4 cycles
Non-recur depth-first average: 763892.1 cycles
(Cycles averaged over 1000 tests)

std::vector is not much better...

Recursive version average: 170301.2 cycles
Non-recur depth-first average: 697988.3 cycles
(Cycles averaged over 1000 tests)

Guess what they say about standard library containers is true. :blink:

Edited by Andrew Chronister on July 7, 2015, 6:04am

Andrew Chronister

#4169

July 7, 2015

My std:: container comparison was popular on twitter, so I figured I better replicate the usage code here so people see there's no funny business.

Using my array<T> container:

node*
TreeSearchSingleAlt2(node TreeRoot, bool32 (*Predicate)(node, void*), void* PredicateArg, int Depth = -1)
{
	node* Result = null;
	
	array<node*> NodeCheckStack = {};
	PushArray<node*>(&NodeCheckStack, &TreeRoot);

	node* TestNode = null;
	while (NodeCheckStack.Length > 0)
	{
		TestNode = PopArray<node*>(&NodeCheckStack);
		if (Predicate(*TestNode, PredicateArg)) { Result = TestNode; break; }

		PushAllArrayIndirected<node>(&NodeCheckStack, TestNode->Children);
	}
	FreeArray(&NodeCheckStack);
	return Result;
}

With std::stack:

node*
TreeSearchSingleAlt4(node TreeRoot, bool32 (*Predicate)(node, void*), void* PredicateArg, int Depth = -1)
{
	node* Result = null;
	
	std::stack<node *> nodes;
	nodes.push(&TreeRoot);

	node* TestNode = null;
	while(nodes.size() > 0)
	{
		TestNode = nodes.top();
		nodes.pop();
		if (Predicate(*TestNode, PredicateArg)) { Result = TestNode; break; }

		for(uint i = 0; i < TestNode->Children.Length; ++i)
		{
			nodes.push(TestNode->Children.Values + i);
		}
	}
	return Result;
}

With std::vector:

node*
TreeSearchSingleAlt3(node TreeRoot, bool32 (*Predicate)(node, void*), void* PredicateArg, int Depth = -1)
{
	node* Result = null;
	
	std::vector<node *> nodes;
	nodes.push_back(&TreeRoot);

	node* TestNode = null;
	while(nodes.size() > 0)
	{
		TestNode = nodes.back();
		nodes.pop_back();
		if (Predicate(*TestNode, PredicateArg)) { Result = TestNode; break; }

		for(uint i = 0; i < TestNode->Children.Length; ++i)
		{
			nodes.push_back(TestNode->Children.Values + i);
		}
	}
	return Result;
}

And for reference, here's the recursive algorithm I'm trying to beat:

node*
TreeSearchSingle(node TreeRoot, bool32 (*Predicate)(node, void*), void* PredicateArg, int Depth = -1)
{
	node* Result = null;
	foreach(node, ChildNode, TreeRoot.Children.Length, TreeRoot.Children.Values)
	{
		if (Predicate(*ChildNode, PredicateArg)) 
		{ 
			Result = ChildNode; 
			break;
		}
		else
		{
			if (Depth == 0) { return null; }
			node* ChildResult = TreeSearchSingle(*ChildNode, Predicate, PredicateArg, Depth - 1);
			if (ChildResult != null) 
			{
				Result = ChildResult;
				break;
			}
		}
	}
	return Result;
}

Edited by Andrew Chronister on July 7, 2015, 5:24am

Miguel Lechón

#4171

July 7, 2015

Is it possible that you're visiting the children in a different order in the non-recursive versions?

I might be missing something, but if this line in your first code:

1	PushAllArrayIndirected<html_node>(&NodeCheckStack, TestNode->Children);

were to push the children from first to last, then this other line:

1	TestNode = PopArray<html_node*>(&NodeCheckStack);

would be loading them from last to first.

If that were the case, and imagining that your tree was composed of a root R and two children A and B, you would be traversing the nodes in the order R, A, B in the recursive version and in the order R, B, A in the non-recursive ones. That could be biasing your benchmarks.

Andrew Chronister

#4172

July 7, 2015

Ah, yeah, I thought of that but wasn't sure it would make much difference. I guess since I'm only using one dataset, it wouldn't be corrected for.

To visit them in the same order I should probably either "pop" the first element off the "stack" in the non-recursive one (inefficient, have to shift the whole thing down), or walk them backwards in the recursive algorithm.

In practice there are going to be places in the code where it will be faster to do one or the other in almost all cases, so I think I should keep variants for both directions (a Forwards and a Backwards function, maybe) of whichever algorithm I decide on.

I'll test that and see if it accounts for the difference.

d7samurai

#4173

July 7, 2015

node and html_node are the same thing?

What does PushAllArrayIndirected look like?

Edited by d7samurai on July 7, 2015, 5:18am

Andrew Chronister

#4174

July 7, 2015

Whoop, thought I sanitized all that :P

PushAllArrayIndirected (what a mouthful) is just this:

template<typename T>
void //?
PushAllArrayIndirected(array<T*>* Array, array<T> Items)
{
	foreach(T, Item, Items.Length, Items.Values)
	{
                // Item is actually a pointer (T*), my foreach macro is bad.
		PushArray<T*>(Array, Item);
	}
}

And foreach is just a wrapper for "for" with a little macro-mush "syntactic sugar".

Edit for explanation: I started out with just PushArray (to distinguish it from, uh, just Push, since its in the global namespace), then I got PushAllArray for pushing a bunch of stuff onto an array. But that doesn't work if the array you're pushing onto is of pointers and the array of items to push on is values (like I had there), so thus PushAllArrayIndirected was born.

Edited by Andrew Chronister on July 7, 2015, 5:26am

Andrew Chronister

#4175

July 7, 2015

New results:

Recursive (Forwards): 198543 cycles
Recursive (Backwards): 202639 cycles
Non-recursive (Backwards): 203652 cycles
(Cycles averaged over 1000 tests)

So backwards/forwards was definitely making a difference, it's really close now. In fact, I doubt I'll be able to get it much faster without resorting to intrinsics or something.

I still haven't profiled memory, but I expect the recursive one will win out unless I can find a way to accurately measure the stack size. Either way, this answers half my original question:

- For this algorithm, recursion makes little difference in performance.
- The main difference is in how the algorithm is expressed in code, and where the stack for the temporary variables is dealt with.

Always nice when you can answer your own questions.

John Lorre

#4176

July 7, 2015

Regarding recursion and worry about stack frames. Modern compilers can optimize this away and use a tail call, there won't be any need to allocate a new stack frame then.

This is something Scheme and Lisp did for a long time, as those languages can be so heavily recursive that you'd quickly blow the stack otherwise.

https://en.wikipedia.org/wiki/Tail_call

m1el

#4178

July 7, 2015

There are algorithms that cannot be optimized "away" using tail call elimination.
For example, tree walk. So tail call elimination won't really help here.

John Lorre

#4179

July 7, 2015

Yes, you are absolutely right that not every recursive solution will use tail calls (the wikipedia article states why and when).
I should have been more clear I suppose, I meant to comment on the statements that recursion will always put pressure on the stack. Which with the right algorithm and a compiler applying tail calls must not be always true.

Andrew Chronister

#4180

July 7, 2015

JohnL
Yes, you are absolutely right that not every recursive solution will use tail calls (the wikipedia article states why and when).
I should have been more clear I suppose, I meant to comment on the statements that recursion will always put pressure on the stack. Which with the right algorithm and a compiler applying tail calls must not be always true.

My understanding is that tail call optimization only works well if the recursive call is one of the last statements in the function. In my case, it was inside a loop over the node's children, and not the last statement in that loop anyway, so I don't think tail call optimization will really help here.

Good to know in the general case, I guess.