Signed vs unsigned integers

There was a question in a recent Q&A about int vs s32 which got me wondering: why does Casey use unsigned integers (u32) for indexes and counts? I remember that in one of the Jai videos Jonathan Blow suggested s64 as a catch-all type for when you don't care about the variable's size. I understand that 64 bits may be too much, but the way I see it you would rarely, if ever, store more than 2 billion things in memory even with s32 indexes. I think signed indexes have several advantages too. One is that you would probably never get into a situation where you need to cast between signed and unsigned types. Another is that you could use negative numbers as markers. For example, you could initialize an index to -1 to indicate that a specific item was not found when doing a simple linear search. What are the advantages of unsigned indexes?
In my own code, I use "unsigned" whenever I know for a fact that a negative number would be wrong in all cases.

In the case of array indexes, for example, you almost _never_ want to index below the first element of an array. It's sort of a free bounds check on one end. Sometimes this does mean you have to cast the result of an index calculation, but all that means is that your code is making you think explicitly about what the index is being used for -- i.e., I'm casting this number, so if it's negative, I should have logic for how to deal with that.

Sometimes, with structures that usually use unsigned indexes, there are good reasons to return negative numbers from functions like indexOf and whatnot. One solution is to just use signed numbers there and accept the range difference; another would be to make an ad-hoc return struct that has a "bool Valid" and a "uint32 Index" or something, if you really care about having the full range.
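For illustration, here is a minimal sketch of that second approach (find_result and IndexOf are names made up for this example):

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
	bool Valid;		// true only if Index holds a real result
	uint32_t Index;		// meaningful only when Valid is true
} find_result;

find_result IndexOf(int *Values, uint32_t Count, int Needle)
{
	find_result Result = {0};
	for (uint32_t i = 0; i < Count; ++i)
	{
		if (Values[i] == Needle)
		{
			Result.Valid = true;
			Result.Index = i;
			break;
		}
	}
	return Result;
}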
If there is a bug in the index calculation code, is there a fundamental difference between that index being negative and it being a very large number? It would still be out of the array's bounds in both cases. You need explicit checks for both signed and unsigned indexes.
The difference is how many comparisons you need to do.
For signed you need to do:
if (value >= 0 && value < max) { ... valid index ... }

For unsigned this is enough:
if (value < max) { ... valid index ... }

Because of how two's complement integers work, negative values reinterpreted as unsigned become large integers (2**31 or greater).
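A small sketch of what that means in practice, assuming 32-bit int and unsigned:

int value = -1;
unsigned max = 32;

// Converted to unsigned, -1 becomes 0xFFFFFFFF (4294967295), so the
// single comparison rejects it just like any other index >= max.
if ((unsigned)value < max)
{
	// never reached for a negative value
}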
If I understand correctly, the bigger problem here is the additional code, since I doubt it would affect performance that much. So I guess that's one argument for unsigned indexes. Are there any others?
I haven't viewed any Jai streams, so there is some speculation in this. What is meant by "you don't care about variable size"? In this context it seems like what is desired is an integral variable, but you don't wish to specify a restriction on the range of values it can have. What you want is a representation of an integral value that is the most natural on the system. "Natural" here means something that the CPU directly supports; most likely a register. This implies that the machine code to handle the value is minimal and fast.

So, on a 64-bit machine, a signed int fits the bill. (I presume that Jonathan feels that 32-bit programs are a small minority today.)

The advantages of such a natural value are that the CPU can load and update it quickly, there are numerous instructions to manipulate it efficiently, it can be passed on the stack easily, it can easily hold the expected range of values, etc. In short, you incur minimal overhead by using it.

In the following, keep in mind that the premise is "you don't care about the size of the variable" but you may care about the range of values it holds.

A disadvantage of a signed value shows up when you do care about the range of valid values it may have, such as an index into an array. You would have to add "special" processing to deal with negative values. It's not illegal, after all, to use a negative value to index into an array:
...
int arr[32];
...
int access(int index)
{
    return arr[index];	// index can be < 0
}

The problem here is that you would be accessing memory outside of the array, which is probably not what was intended. By the same token, if the index is larger than the size of arr then you again are accessing memory you didn't intend to.

So you add checks:
int access(int index)
{
	return (index < 0 || 32 <= index)
		? -1
		: arr[index];
}

This incurs the overhead that you were trying to avoid and has a non-zero run-time cost. It "robs Peter to pay Paul" by saving at the call site but paying in the function. And you have a maintenance point if the array size changes, etc. In addition, the caller has to deal with the -1 value.
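As an aside (my addition, not from the original post), the maintenance point can be softened by deriving the count from the array itself, although the run-time check remains:

#define ARRAY_COUNT(a) (sizeof(a) / sizeof((a)[0]))

int arr[32];

int access(int index)
{
	return (index < 0 || (int)ARRAY_COUNT(arr) <= index)
		? -1
		: arr[index];
}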

An unsigned value can get the compiler and the type system "on your side" by detecting unintended calls:
int access(unsigned index)
{
	return (32 <= index)
		? -1
		: arr[index];
}

void f()
{
	int a = -1;
	int result = access(a);	// implicit int-to-unsigned conversion; the compiler can warn about this
}

This only handles half of the range problem, but at least you are getting some help. If the index is calculated, then the type system can play a larger role:

int calculate_index(/*...*/)
{
	int index = -1;
	//...
	return index;
}

void f()
{
	access(calculate_index(/*...*/));		// warning: implicit conversion from int to unsigned
}

The advantage of using the type system is the time saved catching these kinds of programming errors. (In practice, with C compilers this usually means enabling a warning such as -Wsign-conversion, since the int-to-unsigned conversion itself is implicit and legal.)

OTOH, there can be a cost in conversion from unsigned to signed (unless I'm mistaken). TANSTAAFL. And mixing signed and unsigned types in calculations has problems as well. John Lakos uses string lengths to demonstrate:

unsigned length(const char * s);

char * me  = "me";
char * you = "you";
unsigned diff_you_me = length(you) - length(me);	// 1
unsigned diff_me_you = length(me) - length(you);	// 4 bazillion (the unsigned subtraction wraps around)

You might expect that (you - me) == -(me - you), or the reverse, but not 4 bazillion.
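One way to get the arithmetic you would expect (a sketch I am adding here, assuming the lengths fit comfortably in an int) is to move both operands to signed before subtracting:

int diff_me_you = (int)length(me) - (int)length(you);	// -1, as expected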

It really depends upon the "nature" of the variable and its use. If it is naturally unsigned (negative values have no meaning), then you can enlist the compiler to help you ensure this, but it isn't a complete guarantee.

I would suspect Jonathan's suggestion reflects the idea that, in the majority of cases where the size of the variable is unimportant, so is its sign. (IIRC, he does read this forum, so he can give a more definite answer.) Too much time and effort is spent bounds checking cases that never actually occur. In the absence of a real need, a signed value is the most efficient.

YMMV

- tim
RomulusTFM
The advantages of such a natural value are that the CPU can load and update it quickly, there are numerous instructions to manipulate it efficiently, it can be passed on the stack easily, it can easily hold the expected range of values, etc. In short, you incur minimal overhead by using it.

On 64-bit Intel architecture, using a 32-bit register is pretty much the same as using a 64-bit one in terms of efficiency. It can actually be more efficient to use 32-bit values, because you can store more of them in the cache that way.
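A rough illustration of the cache-footprint point (my example; it just compares the sizes of fixed-width types):

#include <stdint.h>

uint32_t indexes32[1024];	// 4 KB of index data
uint64_t indexes64[1024];	// 8 KB for the same 1024 indexes - twice the cache footprint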

OTOH, there can be a cost in conversion from unsigned to signed (unless I'm mistaken.) TANSTAAFL
There isn't. For the processor, a register holds just bits. Whether it is signed or unsigned doesn't matter, so there is nothing to convert. The only place where a signed or unsigned type matters is in the few operations where the top bits influence the result - shift right, div, some comparisons. Those basically have two opcodes for each operation, one for the signed type and one for the unsigned. Most operations are exactly the same - add, sub, mul, shift left, using as an array index, etc.
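A small example of the operations where signedness does change the result (my sketch, assuming 32-bit int and the usual arithmetic-shift behavior for signed right shifts):

int s = -8;
unsigned u = (unsigned)-8;	// 0xFFFFFFF8

int a = s >> 1;			// typically -4 (arithmetic shift, SAR)
unsigned b = u >> 1;		// 0x7FFFFFFC (logical shift, SHR)

int c = s / 2;			// -4 (IDIV)
unsigned d = u / 2;		// 2147483644 (DIV)

int signed_less   = (s < 1);	// 1 (signed compare uses JL/JG-style jumps)
int unsigned_less = (u < 1u);	// 0 (unsigned compare uses JB/JA-style jumps)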

Edited by Mārtiņš Možeiko on
I thought there was a cost to sign extension. Or is that only when converting to larger sizes?
Yes, only to larger sizes. But the cost is pretty much the same for signed and unsigned types. A cast from a shorter signed type like char or short to int (signed or unsigned - doesn't matter, it's just a bunch of bits) uses the MOVSX instruction. A cast from an unsigned type (unsigned char/short) to int uses the MOVZX instruction. In both cases it is just one instruction.
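In C terms, a small sketch of the difference (my example; it assumes char is signed, which is typical but implementation-defined):

char sc = (char)0x90;		// bit pattern 0x90, value -112 when char is signed
unsigned char uc = 0x90;	// value 144

int a = sc;	// widened with sign extension (MOVSX): -112
int b = uc;	// widened with zero extension (MOVZX): 144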

Edited by Mārtiņš Možeiko on
RomulusTFM
I haven't viewed any Jai streams, so there is some speculation in this. What is meant by "you don't care about variable size"? In this context it seems like what is desired is an integral variable, but you don't wish to specify a restriction on the range of values it can have.

That is correct, I meant an int-like type where you don't want to specify its range. s64 is probably a good fit if you're mostly targeting 64-bit machines.


An unsigned value can get the compiler and the type system "on your side" by detecting any unintended calls
I guess that's a good point, and it's similar to what Andrew mentioned previously. Type checking may catch bugs where you're not expecting a conversion to take place.


I would suspect Jonathan's suggestion reflects the idea that, in the majority of cases where the size of the variable is unimportant, so is its sign. (IIRC, he does read this forum, so he can give a more definite answer.)
I think you can see throughout Jon's videos that he typically uses signed integers for indexes and counts. This is what got me interested in the first place, since it's the opposite of what Casey uses in HMH. I obviously don't know either of them personally and I'm using that just as an example; I don't know what their exact take on it is.

Edited by Marius Adaškevičius on
Just to be clear and above board.

I don't know either Jonathan or Casey (or anyone else in this forum, for that matter). If I gave any other impression, I am sorry. I'm sure they are fine fellows and upstanding citizens. I just seem to recall Casey mentioning that Jon did read the posts and did watch the streams.

- tim
The only reason I brought that up is that I don't want to misrepresent their opinions, since I'm not even citing their exact words here. Sorry for the confusion. It would obviously be interesting to hear their thoughts :)