How to get correct perspective projection

I'm entering my first foray into the 3d realm and I'm starting by really trying to understand the fundamentals of perspective projection. I've gotten things to a point where it seems I'm able to project a set of hand-picked screen coordinate vertices (6 vertices right now just to keep things simple) to something resembling the side of a cube. I wanted to test how my perspective projection compares to something like the glm::perspective function so I can have some sort of reference to know if what I'm doing is correct. After watching some of Casey's videos this is my current perspective projection code:

               //Units are in pixels. Z coordinate represents the vector's distance from the monitor, in positive z. So 400 pixels in z is farther away than 300 pixels in z
                Array<glm::vec4, 6> squareVerts_screenCoords =
                {
                    glm::vec4{200.0f, 600.0f, 400.0f, 1.0f},
                    glm::vec4{300.0f, 600.0f, 300.0f, 1.0f},
                    glm::vec4{400.0f, 600.0f, 400.0f, 1.0f},
                    
                    glm::vec4{200.0f, 400.0f, 400.0f, 1.0f},
                    glm::vec4{300.0f, 400.0f, 300.0f, 1.0f},
                    glm::vec4{400.0f, 400.0f, 400.0f, 1.0f}
                };

                
                {//Projection transform - my version
                    f32 cameraDistanceFromMonitor_z = 800.0f;
                    
                    for(i32 vertI{}; vertI < 6; ++vertI)
                    {
                        f32 pointDistanceFromCamera_z = cameraDistanceFromMonitor_z + squareVerts_screenCoords[vertI].z;
                        
                        f32 numerator = squareVerts_screenCoords[vertI].x * cameraDistanceFromMonitor_z;
                        squareVerts_screenCoords[vertI].x = numerator / pointDistanceFromCamera_z;
                        
                        numerator = squareVerts_screenCoords[vertI].y * cameraDistanceFromMonitor_z;
                        squareVerts_screenCoords[vertI].y = numerator / pointDistanceFromCamera_z;
                    };
                };


After this code I then do clip space transforms to get things into openGL's clip space and render. After looking at glm's perspective function I noticed its parameters specify a field of view angle (fov) and an aspect ratio (as well as clip space coordinates, which I'm not worried about at the moment). Now from what I understand my 'cameraDistanceFromMonitor' and 'pointDistanceFromCamera' replace the need for a specific fov angle and I could calculate an fov angle from them if I wanted to. My question is: where does the monitor's width/height or aspect ratio work its way into the above equation (assuming I'm correct with my current perspective algorithm)?

In your formula the aspect ratio is 1, because you are multiplying the x and y coordinates by the same number.
After the projection matrix the points are in clip space: -1 .. +1. But because the display is not a square, these -1..+1 values represent different units depending on whether we talk about x or y. The aspect value allows you to squeeze one axis more than the other, thus maintaining the same aspect ratio in clip space as your display screen (or more precisely - the current framebuffer).
The perspective projection always happens onto a projection plane that can be imagined as a rectangle of some size (in world units, not pixels) with the same proportions as the screen, situated at some distance from the camera (again, in world units). The ratio between these two measurements determines the magnitude of the perspective. The exact choice of how big to make the projection plane and how far to put it from the camera is arbitrary - as long as the ratio between them is maintained at the desired amount.

It is convenient to put the projection plane at a distance of 1 world unit away from the camera, such that its size reflects that ratio directly.

Field of view is just a way of getting at such a ratio from an input of an angle. An angle of 90 degrees, for example, represents a ratio of 2:1 between the width of the projection plane and its distance from the camera. In other words, it happens when the plane is situated away from the camera at a distance that is precisely half of its width - a ratio of 1:1 between the distance of the projection plane and half its width, which is the tangent of 45 degrees (half of the 90 degree fov angle).

Combining that with the choice made above to put the projection plane at a distance of 1 from the camera determines its width to be exactly 2 - which, not coincidentally, is what the normalization requirement of the perspective projection matrix is aiming the width to be (a span of 2: between -1 and 1).

That is why if you use a field of view angle of 90 degrees, the scaling factors of X and Y in the perspective projection matrix become 1 (discounting for aspect ratio correction).

Any other field of view angle would require the X and Y coordinates to be scaled by the inverse of that ratio, in order to compensate and bring it back to "effectively" 90 degrees again (a projection plane width of 2).

That is why the X and Y are scaled by the inverse of the ratio in the perspective projection matrix.
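
For example, here is a minimal sketch (my own illustration, not code from this thread) of turning a vertical fov angle and an aspect ratio into those X and Y scale factors:

#include <cmath>

struct ProjectionScale { float x, y; };

// Sketch: the scale factors the perspective matrix applies to X and Y,
// derived from a vertical fov angle and a width/height aspect ratio.
ProjectionScale PerspectiveScale(float fovYRadians, float aspect)
{
    // tan(fov/2) is the ratio of half the projection plane's height to its
    // distance from the camera; the matrix scales by the inverse of that.
    float scaleY = 1.0f / std::tan(fovYRadians * 0.5f); // 1.0 for a 90 degree fov
    float scaleX = scaleY / aspect;                      // extra squeeze for a non-square screen
    return {scaleX, scaleY};
}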

If you're asking about the perspective projection MATRIX - in, say, its OpenGL form - that's a whole different topic:

Overall, the role of the perspective projection matrix is to transform coordinates from view space to clip space. You first need to have full confidence that you understand what each of these spaces means. Clip space is by far the most non-trivial space to wrap your head around, because it is some intermediary space between view space and what is termed NDC space (Normalized Device Coordinates).
NDC space is actually the one that is in the shape of a 2x2x2 cube (going from -1 to 1 on every axis - at least in OpenGL's case).
Clip space is actually NOT that cube (...yet).
It's some bizarrely shaped inverted frustum that you shouldn't even try to imagine. All you really need to know is that in this space all the vertices that were in the original view frustum WILL transform INTO the NDC space cube, IF you divide each of them by their respective W component (a.k.a. the "perspective divide").

The perspective projection's role of converting from view space to this odd clip space involves 3 unrelated operations (a hand-built matrix sketch follows the list):
1. Scaling x and y such that the near and far clipping planes would both transform into a 2x2 square in NDC space (AFTER perspective divide)
2. Scaling z such that the span between the near and far clipping planes in view space, would transform into a span of 2 (between -1 and 1) in NDC space (again, AFTER perspective divide).
3. Storing the original Z component of each vertex position in its respective W component (as a result of the multiplication by the projection matrix), so that the perspective divide can happen from clip space to NDC space.
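
As a rough illustration of how those three operations land in the matrix itself, here is a minimal hand-built sketch (my own, not code from this thread), assuming glm, a right-handed view space and OpenGL's default -1..1 depth range; it should agree with glm::perspective(fovY, aspect, n, f):

#include <cmath>
#include <glm/glm.hpp>

// Sketch: a hand-built perspective matrix performing operations 1-3 above
// (glm is column-major, so m[col][row]); right-handed view space, -1..1 depth.
glm::mat4 BuildPerspective(float fovY, float aspect, float n, float f)
{
    float t = std::tan(fovY * 0.5f);
    glm::mat4 m(0.0f);
    m[0][0] = 1.0f / (aspect * t);        // 1. scale x
    m[1][1] = 1.0f / t;                   // 1. scale y
    m[2][2] = -(f + n) / (f - n);         // 2. scale z so near..far becomes -1..1 after the divide
    m[3][2] = -2.0f * f * n / (f - n);    // 2. the translation part of that z mapping
    m[2][3] = -1.0f;                      // 3. copy -z into w, ready for the perspective divide
    return m;
}
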
If you REALLY want to get the FULL lowdown on the true nature of perspective projection, and aren't afraid of getting your mind bent a bit, read on:

The story of perspective projection is really the story of projective geometry. It's a really important topic, I think - and one that makes understanding the perspective projection and its matrix a lot more intuitive, once you have the right mental model in your head. The projection matrix is doing something to 3D space in 4D projective space. It is easier to imagine this in 2D going to 3D first:

Imagine a projector machine - you put some transparent slide flat on it, and it shoots rays of light through it upwards from below (it then hits a periscope-like mirror to aim it at the wall, but ignore that for now). Now, imagine that the light source is the origin of 3D space, and that your slide with a 2D shape drawn on it sits exactly 1 unit above it (say, in the Z axis, since X and Y are reserved for the 2D space within the slide itself). Now, the story of projective geometry is that you imagine the rays of light emanating from the origin (where the light source is), passing through the 2D vertices of that 2D shape drawn on the slide. In projective geometry terms, it is said that the "homogeneous coordinates" of these 2D vertices ARE those 3D rays(!) In other words, you are free to raise that slide upwards from the projector, as long as you also stretch the slide such that the directions of those rays are maintained. You are always free to bring the slide back to sitting on the projector (at a height of 1 in Z from the origin). In projective geometry terms, this operation is called "re-homogenization". It is done by (wait for it...) "dividing all 3 coordinates of the 3D vectors of the rays by their Z coordinate, bringing Z back to 1..." Sound familiar? It should... That's the origin story of projection.

Now, what perspective projection does is almost that raising of the slide and then lowering it back - but with a twist: it actually raises different vertices on the slide by different amounts(!) and WITHOUT stretching the slide to compensate(!)

This kind of violates the rules of projective geometry. Then, after having done so, you go back to respecting the rules of projective geometry and lower the slide back down in the usual way, by dividing all coordinates by their Z coordinate. This distorts the 2D shape on the slide, such that points that were raised higher are pushed more strongly towards the center when they are lowered back down to the projector (Z of 1).

This is exactly the story of perspective projection in 3D with a 4D matrix, only instead of raising a 2D shape into 3D space, you raise a 3D shape into 4D space, and instead of the Z you had in the 2D/3D story, you now have the 4th dimension W in all 3D vectors.

The perspective projection matrix ends up taking 3D vertices that hover at 1 in the 4th dimension, and raises each one by a different amount - its respective Z coordinate.

What in computer graphics is called the "perspective divide", is really just the standard "re-homogenization" operation of projective geometry.
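
As a tiny sketch of that analogy (my own illustration, assuming glm types), re-homogenization is just a divide by Z:

#include <glm/glm.hpp>

// Sketch: "re-homogenization" in the projector analogy - a 2D point's
// homogeneous coordinates are a 3D ray; dividing by z brings the point
// back down to the z = 1 slide plane.
glm::vec3 Rehomogenize(glm::vec3 ray)
{
    return ray / ray.z; // x and y get pulled toward the center, z returns to 1
}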

If you managed to follow all of this description, congratulations - you now have a deeper understanding of perspective projection than most computer graphics practitioners... :)

After the projection matrix the points are in clip space: -1 .. +1


So I think this was where I was getting a bit confused. I had copied some of Casey's code to do the clip space transformation as a separate step from the perspective foreshortening. So I would take a point in pixel space like {400, 600} and produce a new, perspective point in pixel space {267, 400} first. The full code was this:

Array<glm::vec4, 6> squareVerts_screenCoords =
    {
        glm::vec4{200.0f, 600.0f, 400.0f, 1.0f},
        glm::vec4{300.0f, 600.0f, 300.0f, 1.0f},
        glm::vec4{400.0f, 600.0f, 400.0f, 1.0f},
        
        glm::vec4{200.0f, 400.0f, 400.0f, 1.0f},
        glm::vec4{300.0f, 400.0f, 300.0f, 1.0f},
        glm::vec4{400.0f, 400.0f, 400.0f, 1.0f}
    };
    
    {//Projection transform - my version
        f32 camDistanceFromMonitor_z = 800.0f;
        
        for(i32 vertI{}; vertI < 6; ++vertI)
        {
            f32 pointDistanceFromCamera_z = camDistanceFromMonitor_z + squareVerts_screenCoords[vertI].z;
            
            f32 numerator = squareVerts_screenCoords[vertI].x * camDistanceFromMonitor_z;
            squareVerts_screenCoords[vertI].x = Round(numerator / pointDistanceFromCamera_z);
            
            numerator = squareVerts_screenCoords[vertI].y * camDistanceFromMonitor_z;
            squareVerts_screenCoords[vertI].y = Round(numerator / pointDistanceFromCamera_z);
        };
    };
    
    Array<glm::vec4, 6> squareVerts_openGLClipSpace;
    {//Do openGL clip space transform on verts
        f32 a = 2.0f/windowWidth;
        f32 b = 2.0f/windowHeight;
        
        glm::mat4 clipSpaceTransformMatrix =
        {
            a, 0.0f, 0.0f, 0.0f,
            0.0f, b, 0.0f, 0.0f,
            0.0f, 0.0f, 1.0f, 0.0f,
            -1.0f, -1.0f, 0.0f, 1.0f
        };
        
        squareVerts_openGLClipSpace[0] = clipSpaceTransformMatrix * squareVerts_screenCoords[0];
        squareVerts_openGLClipSpace[1] = clipSpaceTransformMatrix * squareVerts_screenCoords[1];
        squareVerts_openGLClipSpace[2] = clipSpaceTransformMatrix * squareVerts_screenCoords[2];
        squareVerts_openGLClipSpace[3] = clipSpaceTransformMatrix * squareVerts_screenCoords[3];
        squareVerts_openGLClipSpace[4] = clipSpaceTransformMatrix * squareVerts_screenCoords[4];
        squareVerts_openGLClipSpace[5] = clipSpaceTransformMatrix * squareVerts_screenCoords[5];
    };
    
    GLfloat verts[] =
    {
        squareVerts_openGLClipSpace[0].x, squareVerts_openGLClipSpace[0].y, squareVerts_openGLClipSpace[0].z,
        1.0f, 0.0f, 0.0f,
        squareVerts_openGLClipSpace[1].x, squareVerts_openGLClipSpace[1].y, squareVerts_openGLClipSpace[1].z,
        0.0f, 1.0f, 0.0f,
        squareVerts_openGLClipSpace[2].x, squareVerts_openGLClipSpace[2].y, squareVerts_openGLClipSpace[2].z,
        1.0f, 0.0f, 0.0f,
        squareVerts_openGLClipSpace[3].x, squareVerts_openGLClipSpace[3].y, squareVerts_openGLClipSpace[3].z,
        1.0f, 0.0f, 0.0f,
        squareVerts_openGLClipSpace[4].x, squareVerts_openGLClipSpace[4].y, squareVerts_openGLClipSpace[4].z,
        0.0f, 1.0f, 0.0f,
        squareVerts_openGLClipSpace[5].x, squareVerts_openGLClipSpace[5].y, squareVerts_openGLClipSpace[5].z,
        1.0f, 0.0f, 0.0f
    };
    
    GLuint bufferID;
    glGenBuffers(1, &bufferID);
    glBindBuffer(GL_ARRAY_BUFFER, bufferID);
    glBufferData(GL_ARRAY_BUFFER, sizeof(verts), verts, GL_STATIC_DRAW);
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(GLfloat) * 6, 0);
    glEnableVertexAttribArray(1);
    glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(GLfloat) * 6, (char*)(sizeof(GLfloat)*3));
    
    GLushort indicies[] =
    {
        0, 1, 3,  3, 1, 4,  1, 2, 4,  2, 5, 4
    };
    
    GLuint indexBufferID;
    glGenBuffers(1, &indexBufferID);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexBufferID);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indicies), indicies, GL_STATIC_DRAW);
    
    glDisable(GL_TEXTURE_2D);
    //glDrawArrays(GL_TRIANGLES, 0, 3);
    glDrawElements(GL_TRIANGLES, 12, GL_UNSIGNED_SHORT, 0);
    glEnable(GL_TEXTURE_2D);


So it seems the aspect ratio was being taken into account when moving from screen space to clip space with the clipSpaceTransformMatrix stuff:

Array<glm::vec4, 6> squareVerts_openGLClipSpace;
    {//Do openGL clip space transform on verts
        f32 a = 2.0f/windowWidth;
        f32 b = 2.0f/windowHeight;
        
        glm::mat4 clipSpaceTransformMatrix =
        {
            a, 0.0f, 0.0f, 0.0f,
            0.0f, b, 0.0f, 0.0f,
            0.0f, 0.0f, 1.0f, 0.0f,
            -1.0f, -1.0f, 0.0f, 1.0f
        };
        .......................


Though I don't know if conceptualizing perspective transform as a separate step from clip transform is the right way to be thinking about things or if it really matters.

And Arnon, thanks for your thorough explanation. I will be looking over your response for a while since there's a lot in there and will see if I can come up with a good mental model in my own words, which I might post here to see if I'm understanding things correctly.
Arnon
Clip space is actually NOT that cube (...yet).
It's some bizarrely shaped inverted frustum that you shouldn't even try to imagine. All you really need to know is that in this space all the vertices that were in the original view frustum WILL transform INTO the NDC space cube, IF you divide each of them by their respective W component (a.k.a. the "perspective divide").

The perspective projection's role of converting from view space to this odd clip space, involves 3 unrelated operations:
1. Scaling x and y such that the near and far clipping planes would both transform into a 2x2 square in NDC space (AFTER perspective divide)
2. Scaling z such that the span between the near and far clipping planes in view space, would transform into a span of 2 (between -1 and 1) in NDC space (again, AFTER perspective divide).
3. Storing the original Z component of each vertex position in its respective W component (as a result of the multiplication by the projection matrix), so that the perspective divide can happen from clip space to NDC space.


Okay so I think this is where I was/am getting confused a little bit. With the way my mind is working currently, I'm performing the perspective divide as the first step (in my mind as a part of the perspective projection), converting camera space screen coordinates (pixels) into new, perspective screen coordinates (e.g. original = {400px, 600px}, perspective = {267px, 400px}). Then, I'm thinking about taking those new perspective coordinates and using the aspect ratio to come up with some equation to get -1 to 1 coordinates for clip space. And then these are the coordinates I'm passing to openGL.

From what you said it looks like the point of the perspective projection part is NOT to actually take a point and convert it to another point to get the perspective effect, but to get things into this -1 to 1 clip space, which has z set up to be placed in the w component of the point's new clip space position. This is so openGL can then actually perform the perspective divide, which gives the perspective look, during the transition to NDC space.

If what I just summarized makes sense, then my question is:

1.) Is my original way of going about this transform completely incorrect, or is it just another potential way of performing perspective projection, even though with the original way I'm not making full use of what the openGL api is offering and I'm performing the perspective divide myself on the cpu?

First of all, remember that the perspective divide involves a division by different amounts for each vertex position.
That is why you can't represent that with a single matrix(!).
You can try to incorporate that in a matrix, but then you'd have to have a slightly different matrix for each vertex (which nobody does...). And then you can't have that matrix as a "uniform" in OpenGL, because "uniforms" are provided... well, "uniformly" to ALL vertex positions (the same matrix in all vertex shader invocations).

boagz57

From what you said it looks like the point of the perspective projection part is NOT to actually take a point and convert it to another point to get the perspective effect, but to get things into this -1 to 1 clip space, which has z set up to be placed in the w component of the point's new clip space position. This is so openGL can then actually perform the perspective divide, which gives the perspective look, during the transition to NDC space.


That is correct. Except that clip space is NOT really this -1 to 1 cube (yet...) - that's actually NDC space.
Clip space is "whatever it would need to be for it to be convertible to NDC space by a perspective divide".
So, if you consider that what the perspective divide is going to do is "divide each vector by its own W component", then Clip space would have all of the vectors as they would be in NDC space "multiplied" by their own W component (the inverse of that Clip->NDC transformation).
So, because NDC space is -1 to +1, the in-view vertex positions in Clip space would have all their X, Y and Z coordinates between the negative and positive of their own W component (-W -> +W).
Any vertex position in Clip space that has an X, Y or Z component that is either greater than its own W component, or smaller than its own negative W component, would end up landing outside the -1 to +1 box in NDC space (after the perspective divide), and so would be considered outside the view frustum.
So in Clip space, it is pretty trivial to do "frustum culling" and reject any triangles that would end up outside the view frustum - by a simple comparison of their X, Y and Z components with their own W component.
That's the purpose of this whole odd scheme in the first place - being able to cull/reject triangles that are outside the view frustum before the perspective divide (because a division has historically been considered a more expensive computation than a multiplication or an addition/subtraction).
Additionally, for the same reason, it is also possible to Clip triangles (not to be confused with Cull which is outright rejecting), that are partially within the view frustum (say, one vertex is inside, and the other 2 are outside, or vice versa).
Triangles like that must be clipped, forming one or more new smaller triangles, that have all their vertices inside the view frustum.
And again, all of that before the persepctive divide.
This in fact is where Clip space gets its name.
It's for frustum clipping (but also frustum culling).
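
As a small sketch of that per-vertex test (my own, assuming glm types), done before any perspective divide:

#include <glm/glm.hpp>

// Sketch: the clip-space "still inside the view frustum" test described above,
// done per vertex against its own w, before any perspective divide.
bool InsideClipVolume(const glm::vec4 &v)
{
    return -v.w <= v.x && v.x <= v.w &&
           -v.w <= v.y && v.y <= v.w &&
           -v.w <= v.z && v.z <= v.w;
}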

When you're working with OpenGL, all your vertex shaders must output vertices in this Clip space(!) Always(!).
That means NO perspective divide on your part.
Otherwise things just don't work.
OpenGL wants to have the vertices before the perspective divide, so that it can do the frustum culling and clipping itself in hardware, and then apply the perspective divide (again, in hardware - parallelized to the max).

You should NEVER apply the perspective divide yourself, whenever working with ANY graphics API (It's the same for Vulkan, DirectX, and I presume Metal as well).
You only really need to worry yourself about that if/when you're implementing your own rasteriser in software.

boagz57

I'm thinking about taking those new perspective coordinates and using the aspect ratio to come up with some equation to get -1 to 1 coordinates for clip space. And then these are the coordinates I'm passing to openGL.


With a better understanding of what I just said in this post, you should probably re-read my prior comment(s).

The role of the aspect ratio is to squeeze/stretch a rectangle of the screen proportion into a square (NDC space).
Later, after NDC space, the hardware API would re-scale that NDC space back up to some rectangle of the same proportions, just before rasterising triangles. This whole "normalized" device coordinate space (NDC) is so that all the clipping and culling happen in a resolution-independent way (hence the normalization).
Again, whenever you use a graphics API, you don't (and actually shouldn't) concern yourself with any of this...
If you try to apply any of what the graphics API is going to do, it'll just end up being applied twice, producing a wrong result.

The role of the near and far clipping planes, is to define how far into the screen, and how close to the screen, should vertex positions still be considered to be "in view" (within the view frustum).

The role of the fov (field of view) is to define specifically how far to the right and to the left of the camera (in its own space) vertex positions should still be considered within the view frustum. Obviously this determination changes along depth (vertices that are farther away from the camera into the screen can still be considered in view even when they are farther away to the left or right than closer ones).
So the fov ends up determining the maximum distance to the left or right that a vertex can be while still in view, and that maximum is largest for the vertices that are farthest from the camera into the screen (right on the far clipping plane). That's one way to think about it.
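
A tiny sketch of that idea (my own illustration; z here is the positive distance in front of the camera):

#include <cmath>

// Sketch: with a horizontal fov, the farthest a point may sit to the left or
// right of the camera and still be in view grows linearly with its depth.
float MaxVisibleX(float fovXRadians, float z)
{
    return z * std::tan(fovXRadians * 0.5f);
}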

Ahh, okay. Your explanation of why the perspective divide should and does happen within openGL really helped to make things click. I have a better framework for how I should be sending vertices to openGL for a 3d effect now. I will try and implement my own test case of it tomorrow.
Okay, so when I try to get to the nitty gritty and work out the math I'm not quite getting correct answers. I asked a question on stack overflow concerning my issue and some people seem to be confused by my question, so I still might be off in my understanding.

Here's my original question from stack overflow:

So I'm trying to understand the fundamentals of perspective projection for 3D graphics and I'm getting stuck. I'm trying to avoid matrices at the moment to try and make things easier for understanding. This is what I've come up with so far:

First I imagine I have a point coming in with screen (pixel) coordinates of x: 200, y: 600, z: 400. The z amount in this context represents the distance, in pixels, from the projection plane or monitor (this is just how I'm thinking of it). I also have a camera that I'm saying is 800 pixels from the projection plane/monitor (on the back side of the projection plane/monitor), so that acts as the focal length of the camera.

From my understanding, first I find the total z distance of the point 200, 600 by adding its z to the camera's focal length (400 + 800), which gives me a total z distance of 1200. Then, if I wanted to find the projected point of these coordinates I just need to multiply each coordinate (x & y) by (focal_length/z_distance) or 800/1200 which gives me the projected coordinates x: 133, y: 400.

Now, from what I understand, openGL expects me to send my point down in clips space (-1 to 1) so I shouldn't send my pixel values down as 200, 600. I would have to normalize my x and y coordinates to this -1 to 1 space first. So I normalize my x & y values like so:

xClipSpace = (x / (width/2)) - 1;
yClipSpace = (y / (height/2)) - 1;

This gives me normalized values of x: -.6875, y: -.0625. What I'm unsure of is what my Z would need to be if openGL is going to eventually divide these normalized values by it. If I didn't normalize my x and y values then I could just do:

xProj = x / (z_distance / focal_length);
yProj = y / (z_distance / focal_length);

so z could just be (z_distance/focal_length).


Basically, I'm thinking there should be an equation that produces a z value that can be used for the perspective divide by openGL and produces a normalized projected value that will then be blown back out to screen coordinates. So the stripped down process would be something like:

vec3 originalVec = {200.0f, 600.0f, 400.0f};

f32 xClipSpace = (originalVec.x / (width/2.0f)) - 1;
f32 yClipSpace = (originalVec.y / (width/2.0f)) - 1;
f32 zReadyForPerspectiveDivide = originalVec.z; //(in some equation involving focal length, near/far planes maybe???)

vec4 vecForOpenGL {xClipSpace, yClipSpace, z, zReadyForPerspectiveDivide}; //Not actually sure yet what goes into the z component of the vector

//OpenGL operation
f32 xProjectedNDCSpaceCoord = vecForOpenGL.x / vecForOpenGL.w;
f32 yProjectedNDCSpaceCoord = vecForOpenGL.y / vecForOpenGL.w;

//OpenGL viewPort equation that blows back out projectedNDCSpace to screen space
f32 xScreenSpace = (xProjectedNDCSpaceCoord + 1.0f) * (windowWidth / 2.0f);
f32 yScreenSpace = (yProjectedNDCSpaceCoord + 1.0f) * (windowHeight / 2.0f);



Here are some videos you should watch (preferably in this order - the first one is the most important one):
3d Projection
OpenGL - model transform and projection
Model View Projection Matrices
To be clear, OpenGL expects clip space coordinates at the end of the vertex shader, not normalized coordinates. Clip space coordinates (what you put in gl_Position) are coordinates that are between -w and w on all axes when they are visible from the camera. The clipping stage will test each point's components (x, y, z) against the w component to determine if a point is in the view (-w < x < w; -w < y < w; -w < z < w for the point to be visible). The perspective divide, which is executed between the vertex shader and fragment shader, will then divide the coordinates by w to produce the normalized device coordinates (NDC).

So we go from view coordinates to clip space coordinates to normalized device coordinates.
The view space coordinate origin (0,0,0) is the camera position; if you use a right handed system, z is negative in the direction the camera is looking (meaning all visible points will have a negative z value).

Normalized device coordinates are by default left handed, z is positive in the direction the camera is looking.

If you only want to project the point without needing the z information, you only need a few pieces of information:
- the near clip plane distance, i.e. the distance of the plane you are projecting onto;
- the width and height of the plane you are projecting onto. This is often computed using the field of view and the aspect ratio, but it's not required.

The near clip plane distance is arbitrary, I will call it n. But since z is negative in front of the camera (right handed), the projection plane is at -n.
P is the 3d point, P' is the projected point. Pz and P'z are negative values (right handed).

P'x = Px * ( -(-n) / -Pz );
P'y = Py * ( -(-n) / -Pz );
P'z = -n;

There are 2 issues:
- We want Z to change based on the input so that we can determine which point is in front of which other point.
- We know there is a perspective divide step done by the graphics pipeline so we can't put P' directly in gl_Position.

If we needed to produce NDC coordinates

Xndc = P'x / ( width / 2 );
Yndc = P'y / ( height / 2 );
Zndc = ? We will do that later.

But we want clip space coordinates.

Xndc = P'x / ( width / 2 );
Xndc = ( Px * ( n / -Pz ) ) / half_width; /* Replacing P'x by its formula */
Xndc = ( ( Px * n ) / -Pz ) / half_width;
Xndc = ( ( Px * n ) / -Pz ) * ( 1 / half_width );
Xndc = ( Px * n ) / ( -Pz * half_width );
Xndc = ( ( Px * n ) / half_width ) * ( 1 / -Pz );
Xndc = ( ( Px * n ) / half_width ) / -Pz; => We have isolated Pz from the formula.

Xclip = ( ( Px * n ) / half_width );
Yclip = ( ( Py * n ) / half_height );

And we need to put -Pz in the Wclip component. Remember that Pz is negative, so -Pz will produce a positive value.

Now on to Zclip.

We need to define which range of z values we want to map into the NDC range. So we need to define a far plane, a distance from the camera after which the points are considered outside the view. The distance of the far plane, f, is arbitrary. In a right handed system the far plane will be at the -f z coordinate.

We know that:
- valid z values are between -n and -f.
- ndc values are between -1 and 1 (ndc is left handed so -1 is closer to the camera than 1, which is the opposite of the view coordinate system).
- we know that we will be dividing Zclip by -Pz in the perspective divide => Zndc = P'z / -Pz.
- If Pz = -n => Zndc = -1;
- If Pz = -f => Zndc = 1;
- We want to map values linearly so our equation is a linear equation: P'z = Pz * a + b; (Which is good because a and b are the components left for us to use if we used a matrix). Zclip = Pz * a + b;

A linear equation can be solved if we know two points on the line. So let's find two points.

Zndc = Zclip / -Pz => Zclip = Zndc * -Pz

if Pz = -n we know that Zndc = -1 => Zclip = -1 * -( -n ) = -1n = -n;
We can write that as f( -n ) = -n;

if Pz = -f we know that Zndc = 1 => Zclip = 1 * -( -f ) = 1f = f;
f( -f ) = f;

If we use a linear equation notation we can now solve the equation.
f( Pz ) = Pz * a + b;

f( -n ) = a * (-n) + b; /* Replace f(-n) by the known result */
-n = -an + b;
an - n = b;

f( -f ) = a * (-f ) + b; /* Replace f( -f ) by the known result */
f = -af + b;
f - b = -af;
(f - b) / f = -a;
-(( f-b ) / f ) = a;
( -f + b ) / f = a;
-1 + ( b / f ) = a;

Replacing a in "an - n = b" by "-1 + ( b / f )"
( -1 + ( b/f ) ) * n - n = b;
( (b/f) - 1 ) * n - n = b;
( (nb)/f ) - n - n = b;
((nb) / f) - 2n = b;
-2n = b - ( (nb) / f );
-2n = ( (bf) / f ) - ( (nb) / f );
-2n = ( (bf) - (nb) ) / f;
-2n = ( b * (f-n) ) / f;
-2nf = b * (f-n);

(-2nf) / ( f-n ) = b; => We found b.

Replacing b in "-1 + ( b / f ) = a;" with "(-2nf) / (f-n)"

-1 + ( ( (-2nf) / (f-n ) ) / f ) = a;
-1 + ( ( ( -2nf ) / (f-n ) ) * ( 1/f ) ) = a;
-1 + ( ( -2nf ) / ( f²-nf) ) = a;
( ( -2nf ) / ( f²-nf) ) - 1 = a;
( ( -2nf ) / ( f²-nf ) ) - ( ( f² -nf ) / ( f² -nf ) ) = a;
( (-2nf) - (f²-nf) ) / ( f²-nf ) = a;
(-2nf - f² + nf) / ( f²-nf ) = a;
(-nf - f² ) / ( f² -nf ) = a;
( -1 * (nf + f² ) ) / ( f²-nf ) = a;
( -f * ( n + f ) ) / ( f² - nf ) = a;
( -f * ( n + f ) ) / ( f * ( f - n ) ) = a;
( - ( n + f ) ) / ( f - n ) = a;

( -n - f ) / ( f -n ) = a; => We found a.

Zclip = P'z = Pz * ( ( -n - f ) / ( f - n ) ) + ( ( -2nf ) / ( f-n ) );

Xclip = ( ( Px * n ) / half_width );
Yclip = ( ( Py * n ) / half_height );
Zclip = Pz * ( ( -n - f ) / ( f - n ) ) + ( ( -2nf ) / ( f - n ) );
Wclip = -Pz;
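
Put as code, here is a minimal sketch of those formulas (my own illustration, not from the thread; half_width and half_height are the projection plane's half-extents at the near distance, e.g. half_height = n * tan(fovY/2) and half_width = half_height * aspect):

#include <glm/glm.hpp>

// Sketch: the formulas above as a function. Right-handed view space (negative
// z in front of the camera), -1..1 depth range. half_width/half_height are the
// projection plane's half-extents at the near distance n.
glm::vec4 ViewToClip(glm::vec3 P, float n, float f, float half_width, float half_height)
{
    glm::vec4 clip;
    clip.x = (P.x * n) / half_width;
    clip.y = (P.y * n) / half_height;
    clip.z = P.z * ((-n - f) / (f - n)) + ((-2.0f * n * f) / (f - n));
    clip.w = -P.z;
    return clip; // this is what goes into gl_Position; OpenGL does the divide by w
}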

There are probably shorter ways to do the math but I generally need to write each step to get it right. Hopefully I didn't mess up.

What happens to Z is completely independent of anything that happens to X and Y.
It only uses the near and far clipping plane. No focal length, no aspect ratio, no resolution - none of that.

Having them both in the same matrix is just for convenience and performance.

You can think of what happens to Z as separate from what happens to X and Y, in the same way a translation is completely separate from a rotation. You can represent a translation in a separate matrix, then the rotation in another separate matrix, and then combine them into one with a matrix-matrix multiplication.
All the values remain exactly as they were, and in the same slots that they were.
Because they are separate slots with no overlap.

In the same way you could represent what happens to Z in one matrix, and what happens to X and Y in another matrix.
Then combine them.

BTW: You do NOT need the resolution at all(!)
The blow-up back to screen coordinates, happens in hardware by OpenGL.
Everything that happens after you provide coordinates in clip-space, is not in your control (at least not directly).
The aspect-ratio is only used in the projection matrix to squeeze/stretch a rectangle into a square.

I should note that in mrmixer's reply, the terms "width" and "height" (as well as their "half_*" counterparts) are NOT the screen resolution(!). Not sure how clear he was about that.
They are the world-unit dimensions of the projection plane.
That whole derivation assumes you already know how to produce those from a combination of fov angle and aspect ratio.

Then, the whole derivation of what happens to Z, is kind of the "standard" treatment of that topic: A linear-algebraic solution to a system of 2 linear equations.
It is technically correct, though I personally find it the least intuitive of all the alternatives that I've seen for deriving these formulas.

For simplicity and clarity, let's first assume a left-handed coordinate system (positive Z values in front).
We can deal with the other way later. Also, let's ignore the perspective divide for a minute:

What you're actually looking at here, geometrically, is a combination of a scale and a translation.

The scale portion:
It needs to squeeze the 3D space between the near and far clipping planes into a range that spans 2 world units. That is why in the denominators of both solutions you see "(f - n)" - that is the span from the near to the far clipping planes. Now, on top of that, it "typically" also includes a "handedness flip" (converting the coordinate system from a right-handed one to a left-handed one), which is a part that is often not explained properly - because it is "assumed" that the 3D coordinate system that you're starting with before hitting the vertex shader is "right handed". That's an "assumption". It doesn't need to be that at all. It's a "convention" that graphics programmers "tend to follow" when working with OpenGL. If the coordinate system your engine is using before hitting the vertex shader is already left-handed (positive Z values are in front of the camera), then you NEED TO IGNORE that flip and NOT apply it, otherwise you're flipping in the wrong way, and things won't work...
But assuming that it IS right-handed, as the convention implies, the way this handedness flip is achieved is by simply inverting the Z coordinate (multiplying by -1). So, that's an "optional" step within the scale portion of the transformation, on top of the squeeze. However, instead of happening within the scale portion, it actually ends up happening as part of the perspective divide itself: remember the role of "storing the original Z in the W component"? Well, that's accomplished in the perspective projection matrix by sticking a 1 in the W component of the Z axis. Instead of storing Z, we can store -Z, which achieves the same effect after the perspective divide - because instead of dividing by Z, everything would be divided by -Z (which is the same as being multiplied by -1 and then divided by Z). This is achieved by sticking a -1 instead of the 1 noted above.
So, in effect, that whole coordinate-system flip is incorporated into the perspective divide itself, instead of being incorporated into the scaling portion.

The translation portion:
It needs to move that squeezed 3D space "backwards", into the camera, such that the center of that squeezed 3D space is at the origin (the camera's center). This puts the far clipping plane one unit in front of the camera, and the near clipping plane one unit behind it. That's your NDC cube space.

Now, you could technically derive formulas for the scale and translation separately, and then combine them using standard linear algebra. To me this seems much simpler and more intuitive than an algebraic approach of systems of 2 linear equations, and substituting known values, etc.
You could say that the "standard" way is the "pure mathematician"'s way, and what I'm saying would be more of the "applied mathematician"'s way. They both work. They both end up with exactly the same formulas, just through a different approach of getting there. You'll find that kind of duality is very common in computer graphics. There are often multiple approaches to getting at the exact same thing. Often even more than 2.

Conceptually, it's easier to imagine this as first moving everything back by "n", such that the space starts from the near clipping plane at 0, spanning as far as "f - n" in front of the camera, and then doing that squeeze.
Then, squeezing down the 3D space from a span between the near and far clipping planes into a span of 2 is just dividing by that span (normalizing to 0 -> 1) then multiplying by 2 (stretching back up from 0 -> 1 to 0 -> 2).
So that's: 2 / (f - n)
Then, the translation portion is just moving that 0 -> 2 space one unit backwards.

It's much more intuitive (I think) to think of this as these steps:
1. Move back by n
2. Squeeze by f-n
3. Stretch by 2
4. Move back by 1

Finally, let's "un-ignore" the perspective divide:
Because we know that what we got here would be NDC space and NOT Clip space, and we want the transformation to Clip-space - because there is going to be a perspective divide, we need to account for that:
So the perspective divide is going to squeeze all coordinates by their "original Z" coordinate. And so, to compensate for that, we need to take what we got and have it "multiplied" by that "original Z", such that our nice NDC cube is what would be arrived at after the perspective divide. And so, say we have a point that in view space is right on the near clipping plane - that's n world units in front of the camera. In NDC space it would land one unit behind the camera.
In Clip space, it would need to be at whatever distance would have it land at that one unit behind the camera in NDC space "after" a perspective divide, meaning "after" dividing by its original Z of "n". And so, in clip space it would have to be at "-1 * n" such that after the perspective divide, it would be: "(-1 * n) / n = -1".
Similarly, a point that started in view-space right on the far clipping plane, would land at +1 in NDC space, and so would have to be "+1 * f" in Clip space, such that after perspective divide, it would be: "(+1 * f) / f = +1".
That's the compensation for the perspective divide.

And so, the aformentioned steps would need to be corrected for that.
Clip space is NOT between -1 and +1, but instead between -n and +f.

Ok, so in the scale portion, after normalizing the depth span down to 0-1, instead of scaling it back up to 2, we scale it back up to the span of Clip space, between -n and f, which is: "f - (-n) = f + n".
And so the scale portion is: (f + n) / (f - n)

Then, the translation portion, instead of being a move back by one unit, is a move back by n units.

The corrected steps are:
1. Move back by n
2. Squeeze by f-n
3. Stretch by f+n
4. Move back by n

All that algebraic mumbo-jumbo in the standard derivation is just arriving at final formulas that encapsulate all these steps in one go. You could just represent each of the steps as a separate transformation, and combine them in order in the usual way of combining transformations. You'll get the exact same result.
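
For example, a minimal sketch (my own, assuming glm's matrix_transform helpers) of composing those four corrected steps and reading the a and b slots off the result:

#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

// Sketch: composing the four corrected steps with glm helpers (left-handed
// convention, as in this derivation, so no sign flip here).
glm::mat4 ZMapFromSteps(float n, float f)
{
    glm::mat4 I(1.0f);
    glm::mat4 zMap = glm::translate(I, glm::vec3(0.0f, 0.0f, -n))             // 4. move back by n
                   * glm::scale(I, glm::vec3(1.0f, 1.0f, f + n))              // 3. stretch by f+n
                   * glm::scale(I, glm::vec3(1.0f, 1.0f, 1.0f / (f - n)))     // 2. squeeze by f-n
                   * glm::translate(I, glm::vec3(0.0f, 0.0f, -n));            // 1. move back by n
    // zMap[2][2] is a = (f+n)/(f-n); zMap[3][2] is b = -n*(f+n)/(f-n) - n = -2nf/(f-n)
    return zMap;
}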

For completeness:

Step 1: Move back by n:
[ 1 0 0 0  ]
[ 0 1 0 0  ]
[ 0 0 1 -n ]
[ 0 0 0 1  ]


Step 2: Squeeze by f-n:
[ 1 0 0       0 ]
[ 0 1 0       0 ]
[ 0 0 1/(f-n) 0 ]
[ 0 0 0       1 ]


Step 3: Stretch by f+n
[ 1 0 0   0 ]
[ 0 1 0   0 ]
[ 0 0 f+n 0 ]
[ 0 0 0   1 ]


Step 4: Move back by n:
[ 1 0 0 0  ]
[ 0 1 0 0  ]
[ 0 0 1 -n ]
[ 0 0 0 1  ]


Result:
[ 1 0 0 0 ]
[ 0 1 0 0 ]
[ 0 0 a b ]
[ 0 0 0 1 ]


Now we need to find what a and b end up being. Well, that's a very standard matrix-matrix multiplication exercise: given that these matrices act on column vectors, we need to left-multiply (the later step goes on the left). For simplicity, let's first combine Step 2 and 3 into a single scale matrix:
Step 2+3:
[ 1 0 0            0 ]
[ 0 1 0            0 ]
[ 0 0 (f+n)/(f-n)  0 ]
[ 0 0 0            1 ]


Starting with the first matrix of Step 1, we left-multiply it by this matrix:
[ 1 0 0           0 ]     [ 1 0 0 0  ]     [ 1 0 0                         0 ]  
[ 0 1 0           0 ]  *  [ 0 1 0 0  ]  =  [ 0 1 0                         0 ]
[ 0 0 (f+n)/(f-n) 0 ]     [ 0 0 1 -n ]     [ 0 0 (f+n)/(f-n)  -n*(f+n)/(f-n) ] 
[ 0 0 0           1 ]     [ 0 0 0 1  ]     [ 0 0 0                         1 ]


Now, left-multiply the result of that by Step 4:
[ 1 0 0 0  ]   [ 1 0 0                         0 ]   [ 1 0 0                             0 ]  
[ 0 1 0 0  ] * [ 0 1 0                         0 ] = [ 0 1 0                             0 ]
[ 0 0 1 -n ]   [ 0 0 (f+n)/(f-n)  -n*(f+n)/(f-n) ]   [ 0 0 (f+n)/(f-n)  -n*(f+n)/(f-n) - n ]
[ 0 0 0 1  ]   [ 0 0 0                         1 ]   [ 0 0 0                             1 ]


And so:
a = (f + n) / (f - n)
b = -n(f + n)/(f - n) - n

Simplifying b:
-n(f + n)/(f - n) - n( (f - n)/ (f - n) )  //Making common denominator
( -n(f + n) - n(f - n)  ) / (f - n)        //Combining fractions
( -nf - nn - nf  + nn  ) / (f - n)         //Expanding
( -nf - nf  ) / (f - n)                    //Canceling nn and -nn
-2nf / (f - n)                             //Combining numerator


So:
a = (f + n) / (f - n)
b = -2nf / (f - n)


And there you have it.
Arrived at the same exact answer.

a would be a scaling factor to multiply all incoming Z coordinates (from view space).
b would be a translation to be added to the result, after the scaling is applied.
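
As a quick sanity check (my own sketch, with arbitrary fov/aspect/near/far values and glm's default -1..1 clip depth), these match the Z entries of glm::perspective up to the handedness flip discussed earlier:

#include <cstdio>
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

// Sketch: comparing the derived a and b against glm::perspective's Z entries.
// glm assumes a right-handed view space, so its z scale carries the extra
// handedness flip mentioned earlier - hence the minus sign on proj[2][2].
int main()
{
    float n = 0.1f, f = 100.0f; // arbitrary test values
    float a = (f + n) / (f - n);
    float b = -2.0f * n * f / (f - n);

    glm::mat4 proj = glm::perspective(glm::radians(90.0f), 16.0f / 9.0f, n, f);
    std::printf("a = %f, from glm: %f\n", a, -proj[2][2]);
    std::printf("b = %f, from glm: %f\n", b, proj[3][2]);
    return 0;
}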
