The history of OpenGL vertex data

In the beginning there was glVertex. The model was simple: for every vertex the client called a function to set each attribute, and when enough vertices had accumulated to form a primitive, the graphics driver rasterized it and emitted the triangle to the screen.

foreach(mesh: meshes){
    glBegin(GL_TRIANGLES);
    foreach(vertex: mesh.vertices){
        glVertex(vertex.x, vertex.y, vertex.z);
    }
    glEnd();
}


Why this was bad: a function had to be called for every attribute of every vertex. As mesh sizes and memory grew, the loop most applications ran over their mesh data became a serious bottleneck.
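
To make the per-attribute cost concrete, here is a sketch of the same loop with normals and texture coordinates added (the normal and uv fields on the vertex are assumed for illustration); every extra attribute is another function call per vertex:

glBegin(GL_TRIANGLES);
foreach(vertex: mesh.vertices){
    //three calls per vertex now; glVertex is still the one that emits it
    glNormal(vertex.normal.x, vertex.normal.y, vertex.normal.z);
    glTexCoord(vertex.uv.u, vertex.uv.v);
    glVertex(vertex.x, vertex.y, vertex.z);
}
glEnd();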

So the next step was to let the client point the driver at a chunk of memory and specify how to read it. This is glVertexPointer, the fixed-function variant of glVertexAttribPointer.

glEnableClientState(GL_VERTEX_ARRAY);
foreach(mesh: meshes){
    glVertexPointer(3, GL_FLOAT, 0, mesh.vertices.data());
    glDrawArrays(GL_TRIANGLES, 0, mesh.vertices.length());
}


Why this was bad: the client is allowed to change the pointed-to data at any time, which means the driver has to read all of it during the glDraw* call to get a correct, independent copy, even if the client never actually changes it. For glDrawElements there is an extra step: the driver must read every index just to find the upper bound of the vertex data it needs to copy. This got worse once graphics cards had their own memory and became more asynchronous.
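
For illustration, here is the indexed variant in the same pseudocode style (the mesh.indices field is assumed); the driver has to walk the whole index array during the call just to learn how many vertices it must copy:

glEnableClientState(GL_VERTEX_ARRAY);
foreach(mesh: meshes){
    glVertexPointer(3, GL_FLOAT, 0, mesh.vertices.data());
    //the driver scans mesh.indices for the highest index here,
    //then copies that many vertices before it can return
    glDrawElements(GL_TRIANGLES, mesh.indices.length(), GL_UNSIGNED_INT, mesh.indices.data());
}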

Then came VBOs (vertex buffer objects). These let the driver encapsulate the data and track changes to it. The API was jury-rigged onto the existing one by adding binding points like GL_ARRAY_BUFFER that you bind a buffer to; the gl*Pointer calls then refer to the currently bound buffer instead of client memory.

//init
foreach(mesh: meshes){
    glGenBuffers(1, &mesh.vbo);
    glBindBuffer(GL_ARRAY_BUFFER, mesh.vbo);
    mesh_data d = load_mesh_data( mesh.id );
    glBufferData(GL_ARRAY_BUFFER, d.size, d.data, GL_STATIC_DRAW);
    free_mesh_data(d);
}

//draw
glEnableClientState(GL_VERTEX_ARRAY);
foreach(mesh: meshes){
    glBindBuffer(GL_ARRAY_BUFFER, mesh.vbo);
    glVertexPointer(3, GL_FLOAT, 0, NULL);
    glDrawArrays(GL_TRIANGLES, 0, mesh.vertices.length());
}


Why this is bad: every time you want to draw a different mesh you need to call glVertexAttribPointer again for every attribute you use. Some GPUs had also evolved to use software vertex pulling, which means that every time the vertex format changes they need to recompile the vertex shader. (BTW this is also why OpenGL tends to get a bad reputation: the driver needs to patch the program for a lot of state changes that were "free" in older versions.)

So they came up with half a solution: the Vertex Array Object (VAO). All it does is collect the glVertexAttribPointer state so that you can switch between meshes with a single call. In OpenGL 3.0 it contains something like:

struct VAO{
    GLint element_buffer;
    struct{
        bool enabled;
        bool normalized;
        GLint type;
        GLint count;
        GLint stride;
        GLint offset;
        GLint buffer;
    } attributes[MAX_ATTRIBUTES];
};


//init
foreach(mesh: meshes){
    glGenVertexArrays(1, &mesh.vao);
    glBindVertexArray(mesh.vao);
    glGenBuffers(1, &mesh.vbo);
    glBindBuffer(GL_ARRAY_BUFFER, mesh.vbo);
    mesh_data d = load_mesh_data(mesh.id);
    glBufferData(GL_ARRAY_BUFFER, d.size, d.data, GL_STATIC_DRAW);
    free_mesh_data(d);
    glVertexAttribPointer(posLoc, 3, GL_FLOAT, false, 0, NULL);
    glEnableVertexAttribArray(posLoc);
}

//draw
foreach(mesh: meshes){
    glBindVertexArray(mesh.vao);
    glDrawArrays(GL_TRIANGLES, 0, mesh.vertices.length());
}


Why this is not enough: the driver still needs to match the vertex format against its cache of compiled programs. It can still be faster than calling glVertexAttribPointer again for every attribute, depending on how smart the driver is (YMMV™). There is also no way to change just the buffer binding or the offset, in other words the pointer to where the data is.

The latest solution is the separated attribute format (the GL_ARB_vertex_attrib_binding extension, core in OpenGL 4.3). This lets the client change where the data lives orthogonally to the format of the data. The VAO structure is updated to allow this:

struct VAO{
    GLint element_buffer;
    struct{
        bool enabled;
        bool normalized;
        GLint type;
        GLint count;
        GLint offset;
        GLint binding;
    } attributes[MAX_ATTRIBUTES];

    struct{
        GLint stride;
        GLint buffer;
        GLint offset;
    } bindings[MAX_ATTRIBUTES_BINDINGS];
};


//init
glGenVertexArrays(1, &vao);
glBindVertexArray(vao);
glVertexAttribFormat(posLoc, 3, GL_FLOAT, false, 0);
glVertexAttribBinding(posLoc, 0);
glEnableVertexAttribArray(posLoc);

foreach(mesh: meshes){
    glGenBuffers(1, &mesh.vbo);
    glBindBuffer(GL_ARRAY_BUFFER, mesh.vbo);
    mesh_data d = load_mesh_data(mesh.id);
    glBufferData(GL_ARRAY_BUFFER, d.size, d.data, GL_STATIC_DRAW);
    free_mesh_data(d);
}

//draw
glBindVertexArray(vao);

foreach(mesh: meshes){
    glBindVertexBuffer(0, mesh.vbo, 0, sizeof(vertex));
    glDrawArrays(GL_TRIANGLES, 0, mesh.vertices.length());
}
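
To see why the split pays off, here is a sketch that adds a second attribute (assuming a hypothetical interleaved vertex struct with pos and normal members, and a normalLoc attribute location): the format is described once, relative to the start of a vertex, and switching meshes still touches only the one binding point.

//init: describe the format once, offsets are relative to the vertex start
glVertexAttribFormat(posLoc, 3, GL_FLOAT, false, offsetof(vertex, pos));
glVertexAttribFormat(normalLoc, 3, GL_FLOAT, false, offsetof(vertex, normal));
//both attributes pull their data through binding point 0
glVertexAttribBinding(posLoc, 0);
glVertexAttribBinding(normalLoc, 0);
glEnableVertexAttribArray(posLoc);
glEnableVertexAttribArray(normalLoc);

//draw: per mesh, only the buffer behind binding point 0 changes
glBindVertexBuffer(0, mesh.vbo, 0, sizeof(vertex));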

Edited by ratchetfreak
Thanks for posting this. It's very interesting.
It would be nice to have a "programming history lessons" section as part of the HMN education effort.
I haven't watched the last few HmH episodes, so it may have been discussed there, but a while back I asked about VAOs and it was pointed out to me that I probably shouldn't use them. Has anyone tested whether that's still the case?
They probably are not as bad as they were in the beginning. The thing is, when they were introduced they were promised to improve performance. I'm pretty sure that's still not true, or only marginally and in some special cases. So using them now is probably not a big deal anymore; it's OK, but it won't improve performance much. Go with the GL_ARB_vertex_attrib_binding extension, which improves performance for sure.

Edited by Mārtiņš Možeiko
AsafG
Thanks for posting this. It's very interesting.
It would be nice to have a "programming history lessons" section as part of the HMN education effort.


Yeah, knowing the history and the whys of something really helps me understand the whats and hows.

mrmixer
I haven't watched the last few HmH episodes, so it may have been discussed there, but a while back I asked about VAOs and it was pointed out to me that I probably shouldn't use them. Has anyone tested whether that's still the case?


Drivers are now a lot more clever and GPUs have more space, so they'll be much better at caching the patched shaders. And the VAO can contain a reference to the patched shader (if any), so that if the format doesn't change the driver can reuse it.
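
A hypothetical sketch of what that could look like inside a driver, in the same pseudo-struct style as the VAOs above (pure speculation on my part):

struct VAO{
    //...attribute and binding state as before
    GLProgram *patched_program; //cached program patched for this vertex format, if any
};
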
Thanks for the writeup.

Eric Lengyel wrote a detailed blog post showing that VAO usage in typical game scenarios (benchmarked using his own) was slower on all major video chip vendors, by up to 15%. His blog is down, but the Twitter discussion it produced can be found here: https://twitter.com/ericlengyel/status/482724497095946240
Jesse
Thanks for the writeup.

Eric Lengyel wrote a detailed blog post showing that VAO usage in typical game scenarios (benchmarked using his own) was slower on all major video chip vendors, by up to 15%. His blog is down, but the Twitter discussion it produced can be found here: https://twitter.com/ericlengyel/status/482724497095946240


I would like to see those details in that blog post... And I found an archive of it: https://web-beta.archive.org/web/...5094241/the31stgame.com/blog?p=39

Kinda useless as a benchmark though: it doesn't show enough code to figure out how he's using the VAO or what else could be going wrong.

But googling around a bit I see a lot of debate from 2-3 years ago (6 years after they were put into core in 3.0) about whether VAOs are fast or not, yet I found only a single benchmark doing the comparison (I didn't look that hard, though, and its code isn't available either), and that one said they were faster.

Every other mention I found sounded like it was parroting someone else (most likely that blog or the Valve presentation) or was astonishment at people saying they were slow. I can understand the latter set, because even a naive implementation with only obvious optimizations should have only minimal overhead.

Edited by ratchetfreak
Yeah, the blog does leave the reader with a few open questions (many of which were part of the Twitter discussion).

Eric is an expert though. It's not likely he's doing something dumb.

Edited by Jesse