Rounding functions giving wrong output

When I run the handmade intrinsic functions, I get wrong output for negative numbers.

Example outputs:

1) -15.000000 unsigned rounds to -14
   Should round to -15?

2) -14.099999 unsigned rounds to -13
   Should round to -14?


Code and debug output file are attached below.

A solution to this might be to use:
inline int32
RoundReal32ToInt32(real32 Real32)
{
	if (Real32 >= 0.0f)
	{
		// Positive values: adding 0.5 and truncating toward zero rounds to nearest.
		int32 Result = (int32)(Real32 + 0.5f);
		return(Result);
	}
	else
	{
		// Negative values: subtract 0.5 instead, so that truncation toward zero
		// still lands on the nearest integer (halves round away from zero).
		int32 Result = (int32)(Real32 - 0.5f);
		return(Result);
	}
}


instead of the original function.
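
For a quick sanity check on the inputs from the original post, here is a small stand-alone test (the typedefs are my own stand-ins for the Handmade Hero ones, and the expected results are in the comments):

#include <stdio.h>
#include <stdint.h>

typedef int32_t int32;
typedef float   real32;

static int32
RoundReal32ToInt32(real32 Real32)
{
	if (Real32 >= 0.0f)
	{
		return (int32)(Real32 + 0.5f);
	}
	else
	{
		return (int32)(Real32 - 0.5f);
	}
}

int main(void)
{
	// The two failing cases from the original post:
	printf("%f rounds to %d\n", -15.0, RoundReal32ToInt32(-15.0f));           // expect -15
	printf("%f rounds to %d\n", -14.099999, RoundReal32ToInt32(-14.099999f)); // expect -14
	return 0;
}
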
I think this is one of the few cases in video game programming where unit tests would actually make sense.
Just for the record (it's a TODO in the source code), C99 has the rint family of functions, such as lrintf(), and C++11 has an overloaded std::round() function.

The semantics of the rint family are slightly different, in that they use the prevailing rounding mode, which by default is "round to even":

int32 RoundFloatToInt32(float f) {
    // Round down (floor). A plain cast truncates toward zero, so for
    // negative values with a fractional part we have to go one lower.
    int32 i = (int32)f;
    if (f < 0 && (float)i != f) {
        i -= 1;
    }

    // Get the fractional part, now in [0, 1).
    f -= i;

    // If less than 0.5, round down.
    if (f < 0.5f) {
        return i;
    }

    // If greater than 0.5, round up.
    if (f > 0.5f) {
        return i + 1;
    }

    // If exactly 0.5, round to the nearest
    // even number.
    return (i % 2 == 0) ? i : i + 1;
}


Why would you bother? Consider the following sequence of operations:

float f = 0.5f;
f = round(f) + 0.5f;
f = round(f) - 0.5f;
f = round(f) + 0.5f;
f = round(f) - 0.5f;
f = round(f) + 0.5f;
f = round(f) - 0.5f;


One way to judge a rounding mode is to see what happens when you do things like this.
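
For completeness, here's roughly what using lrintf() looks like (a quick sketch of mine; it needs -lm on gcc/clang, and the default rounding mode is round-to-nearest-even):

#include <stdio.h>
#include <math.h>
#include <fenv.h>

int main(void)
{
    // Default mode: round to nearest, ties to even.
    printf("%ld\n", lrintf(-15.0f));  // -15
    printf("%ld\n", lrintf(-14.1f));  // -14
    printf("%ld\n", lrintf(-14.5f));  // -14 (tie goes to the even integer)
    printf("%ld\n", lrintf(-13.5f));  // -14 (tie goes to the even integer)

    // The prevailing mode can be changed (strictly this wants
    // #pragma STDC FENV_ACCESS ON, which most compilers ignore):
    fesetround(FE_DOWNWARD);
    printf("%ld\n", lrintf(-14.1f));  // -15, i.e. floor
    return 0;
}
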
Round to even is an unbiased round, but that is not what I am talking about. Here is Casey's round function:

[image: diagram of RoundReal32ToInt32 mapping inputs on a slanted line above the number line to integer outputs]

In this image, you place the point whose RoundReal32ToInt32 output you want to see on the slanted line above the number line and follow the slanted line to the arrow. The integer the arrow points at is the output of the function.

Since the original function adds 0.5f and then truncates toward zero, a negative input ends up being rounded toward zero rather than to the nearest integer: for example, -14.099999f + 0.5f = -13.599999f, which truncates to -13 instead of -14, and -15.0f + 0.5f = -14.5f, which truncates to -14 instead of -15. That is why, for negative numbers, we have to subtract 0.5f instead of adding it: truncation toward zero then lands on the correct integer.

Pseudonym73, do you know why round to even is a common rounding mode? It seems kind of strange, but I assume it must have some nice properties that are not obvious to me, either mathematical or practical (maybe ease of hardware implementation).
Uberstedt, round to even is useful when you have a lot of half fractions in your data (3.5, -4.5). By using round to even (or odd), it keeps the average close to the original average (because the halves sometimes lose 0.5 and sometimes gain 0.5).


Example (23 samples):
Exact average = (5x 3.5 + 10x 4 + 8x 4.5) / 23 ≈ 4.07
Rounding up average = (5x 4 + 10x 4 + 8x 5) / 23 ≈ 4.35
Rounding down average = (5x 3 + 10x 4 + 8x 4) / 23 ≈ 3.78
Round to even average = (5x 4 + 10x 4 + 8x 4) / 23 = 4.00
Round to odd average = (5x 3 + 10x 4 + 8x 5) / 23 ≈ 4.13

You'll have to confirm the math but you get the idea.
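
For what it's worth, here is a throwaway snippet to confirm that arithmetic:

#include <stdio.h>

int main(void)
{
    // 5 samples of 3.5, 10 samples of 4.0, 8 samples of 4.5 (23 in total).
    double exact      = (5 * 3.5 + 10 * 4.0 + 8 * 4.5) / 23.0; // ~4.07
    double round_up   = (5 * 4.0 + 10 * 4.0 + 8 * 5.0) / 23.0; // ~4.35
    double round_down = (5 * 3.0 + 10 * 4.0 + 8 * 4.0) / 23.0; // ~3.78
    double round_even = (5 * 4.0 + 10 * 4.0 + 8 * 4.0) / 23.0; // 4.00
    double round_odd  = (5 * 3.0 + 10 * 4.0 + 8 * 5.0) / 23.0; // ~4.13
    printf("exact %.2f, up %.2f, down %.2f, even %.2f, odd %.2f\n",
           exact, round_up, round_down, round_even, round_odd);
    return 0;
}
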
Yeah, sorry for not putting a big disclaimer on these things, but until we actually get to writing these functions for real, you definitely don't want to be using them for anything other than what we're doing with them in the code!!!

- Casey
Uberstedt
Pseudonym73, do you know why round to even is a common rounding mode?

I actually hinted at the answer. Consider the sequence of operations:

float f = 0.5f;
f = round(f) + 0.5f;
f = round(f) - 0.5f;
f = round(f) + 0.5f;
f = round(f) - 0.5f;
f = round(f) + 0.5f;
f = round(f) - 0.5f;


Suppose this is your round function:

int round(float f) {
    return (int)(f + 0.5f);
}


What you'll find is that f steadily increases. When you consider what is happening semantically (you're just adding and subtracting 0.5), this is highly undesirable behaviour. Using round-to-even limits the amount of artificial increase that can happen, keeping the whole sequence more numerically stable.

It seems a bit weird when you're converting floats to integers, but it makes a bit more sense when you consider the kind of rounding that happens in more usual floating point operations, such as adding two floats. If the exact answer doesn't fit in your floating point number format, the result is rounded.
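
For example (my own numbers, not from the video), a float can't represent 2^24 + 1, so even a plain addition has to round:

#include <stdio.h>

int main(void)
{
    float big = 16777216.0f;       // 2^24: the last power of two below which
                                   // every integer is still representable
    printf("%.1f\n", big + 1.0f);  // prints 16777216.0; the exact answer
                                   // 16777217 doesn't fit, and the tie
                                   // rounds to the even neighbour
    return 0;
}
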

There are some numeric methods which behave like the above example, just in the least significant digit of the mantissa. If it's a simulation which runs for a long time, you don't want numbers creeping in one direction just because of the rounding mode.
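
Here is a rough demonstration of that drift (naive_round is my stand-in for the add-0.5-and-truncate version; rintf() uses the default round-to-even mode):

#include <stdio.h>
#include <math.h>

static float naive_round(float f) { return (float)(int)(f + 0.5f); }

int main(void)
{
    float a = 0.5f, b = 0.5f;
    for (int i = 0; i < 10; ++i) {
        a = naive_round(a) + 0.5f;  a = naive_round(a) - 0.5f;
        b = rintf(b) + 0.5f;        b = rintf(b) - 0.5f;
    }
    // a creeps upward by one per iteration; b stays bounded.
    // Prints something like: naive 10.500000, round-to-even -0.500000
    printf("naive %f, round-to-even %f\n", a, b);
    return 0;
}
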
Incidentally, I've been mucking around with SSE2, and this is the simplest and fastest floor function that I could come up with:

#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

inline int32_t
sse2_floor(float value)
{
    // Get a vector of all ones.

    // mmozeiko's version:
    __m128 all_ones = _mm_set1_ps(1.0f);

    // My original version was this. It generates essentially the
    // same code as the above on Clang, but worse code on MSVC.
    // float one = 1.0f;
    // __m128 all_ones = _mm_load1_ps(&one);

    // Load the value into an SSE register. If you need a version which does
    // four floor operations at once, use _mm_load_ps or _mm_loadu_ps here.
    __m128 x = _mm_set1_ps(value);

    // Convert x to int and back to float
    __m128i x_to_int = _mm_cvttps_epi32(x);
    __m128 x_to_int_to_float = _mm_cvtepi32_ps(x_to_int);

    // A result needs adjustment if the round-trip resulted in an
    // increase.
    __m128 adjustment_mask = _mm_cmpgt_ps(x_to_int_to_float, x);

    // The adjustment is to subtract one.
    __m128 adjustment = _mm_and_ps(adjustment_mask, all_ones);
    __m128 x_floor = _mm_sub_ps(x_to_int_to_float, adjustment);

    // Extract the result as an integer.
    return _mm_cvtss_si32(x_floor);

    // If you are doing four floor operations and want the result as
    // a vector of int32s, this is probably more efficient:

    // WARNING: Untested code. May need to coerce __m128i to __m128 or something.
    // __m128i all_ones_int = _mm_setr_epi32(1,1,1,1);
    // __m128 adjustment = _mm_and_ps(adjustment_mask, all_ones_int);
    // return _mm_sub_ps(x_to_int, adjustment);
}
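
For what it's worth, a quick spot check (paste it under the function above; in plain C you may need to make the function static inline for it to link):

#include <stdio.h>

int main(void)
{
    printf("%d %d %d\n",
           (int)sse2_floor(-2.5f),   // -3
           (int)sse2_floor( 2.5f),   //  2
           (int)sse2_floor(-3.0f));  // -3
    return 0;
}
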


While I'm at it, here's the world's best round function. It only works for well-behaved normal floats.

#include <float.h>
#include <stdint.h>

int32_t worlds_best_round(float x) {
    const float float_to_int = 0.75f * (1u << FLT_MANT_DIG);
    return (int32_t)((x + float_to_int) - float_to_int); // Hope that /fp:fast doesn't optimise this away.
}


That constant, float_to_int, is one of the most important constants in all of IEEE-754 bit hackery. I'm certain it will come up again before Handmade Hero is done.
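
To unpack why it works (my reading, not from the stream): 0.75f * (1u << FLT_MANT_DIG) is 12582912.0f, which is 1.5 * 2^23. Adding it to a float of small enough magnitude pushes the sum into a range where consecutive floats are exactly 1.0 apart, so the addition itself rounds the value to an integer using the hardware's round-to-nearest-even, and subtracting the constant back leaves that rounded value. A quick trace (assuming strict IEEE single precision, i.e. no /fp:fast):

#include <float.h>
#include <stdio.h>

int main(void)
{
    const float float_to_int = 0.75f * (1u << FLT_MANT_DIG); // 12582912.0f
    float x = -14.099999f;

    // The exact sum 12582897.900001 is not representable at this magnitude
    // (float spacing here is 1.0), so the add rounds it to 12582898.0f.
    float shifted = x + float_to_int;

    // Prints something like: -14.099999 -> 12582898.0 -> -14.0
    printf("%.6f -> %.1f -> %.1f\n", x, shifted, shifted - float_to_int);
    return 0;
}
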

One final thought, and I'm not having a go at Casey here, given that in the discussion of the floor() function (in Day... 31, I think it was) he quite rightly noted that we don't need full compliance: describing floating point numbers as "real" is a barefaced lie. What you name something is, of course, more contentious than what it is, but does the "real32" typedef bother anyone other than me?
You can use _mm_set1_ps(1.0f) to get the number 1 in all four lanes of an SSE register; there is no need for the two lines (float one = 1.0f and _mm_load1_ps). You can use the same intrinsic to load the "value" argument into the SSE register type, so there is no need to take the address of value.

The same value from _mm_set1_ps(1.0f) will also work in the case of four flooring operations (no need for _mm_setr_epi32).

For Visual Studio, using _mm_set1_ps generates fewer instructions than using _mm_load1_ps/_mm_load_ss.

With _mm_load1_ps and _mm_load_ss:
	movss	xmm1, DWORD PTR __real@3f800000
	movss	DWORD PTR [rsp+8], xmm0
	movss	xmm2, DWORD PTR value$[rsp]
	cvttps2dq xmm0, xmm2
	movss	DWORD PTR one$[rsp], xmm1
	cvtdq2ps xmm1, xmm0
	movss	xmm0, DWORD PTR one$[rsp]
	shufps	xmm0, xmm0, 0
	cmpltps	xmm2, xmm1
	andps	xmm2, xmm0
	subps	xmm1, xmm2
	cvtss2si eax, xmm1
	ret	0


With _mm_set1_ps:
	movss	xmm2, xmm0
	shufps	xmm2, xmm2, 0
	cvttps2dq xmm0, xmm2
	cvtdq2ps xmm1, xmm0
	cmpltps	xmm2, xmm1
	andps	xmm2, XMMWORD PTR __xmm@3f8000003f8000003f8000003f800000
	subps	xmm1, xmm2
	cvtss2si eax, xmm1
	ret	0

Good catch! FYI, I tested it on Clang, which generates the same code. I've edited it to suit.
To get the assembly I posted at the bottom of my post, it should be:
__m128 all_ones = _mm_set1_ps(1.0f); // set not load
__m128 x = _mm_set1_ps(value); // set not load, and pass argument by value

Note that the all_ones variable is now the same as the all_ones_int in your comment (where it has the wrong type, uint32_t).
Fixed.


mmozeiko
Note that the all_ones variable is now the same as the all_ones_int in your comment [...]


It shouldn't be, because _mm_setr_epi32 should load four integers as integers, whereas _mm_set1_ps loads four floats. The point of the "possibly more efficient" method is to avoid the final float-to-int conversion, which means that the adjustment should be done on integers.
But how will the call to _mm_and_ps work? It accepts two arguments of type __m128, but _mm_setr_epi32 returns __m128i, so the code won't compile.

[strike]And the result of the function must be __m128i, not __m128 (which is what _mm_sub_ps returns).[/strike] Ah, I see, you want to return floats, not ints, and continue the calculations with float types (but with floored values).
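
For reference, if you did want the four results back as int32 lanes, the type mismatch goes away if you reinterpret the compare mask with _mm_castps_si128 and do the adjustment with integer ops. Untested sketch:

#include <emmintrin.h>

// Floor four floats at once, returning the results as four int32 lanes.
__m128i sse2_floor4(__m128 x)
{
    __m128i x_to_int = _mm_cvttps_epi32(x);        // truncate toward zero
    __m128  back     = _mm_cvtepi32_ps(x_to_int);  // round-tripped value

    // Lanes where truncation increased the value need to go down by one.
    __m128  mask_ps = _mm_cmpgt_ps(back, x);
    __m128i mask    = _mm_castps_si128(mask_ps);   // reinterpret the bits, no conversion

    // Each mask lane is 0 or 0xFFFFFFFF (-1), so adding it subtracts one
    // exactly where an adjustment is needed.
    return _mm_add_epi32(x_to_int, mask);
}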

Edited by Mārtiņš Možeiko on