My bitmap rendering is slow

I wanted something like a text buffer on the screen where I could write some text, so I just loaded glyph images and drew them to the screen. Somehow it takes about 23 ms/frame to draw 5000 glyphs, and when I opened the profiler it showed that 75% of the CPU time is spent in the Intel driver (my GPU is an Intel HD Graphics 3000). I use the fixed-function pipeline; here are my calls:

static void render_rectangle_bitmap(PictureBuffer* screen, Vec2 offset, Vec2 dirx, Vec2 diry, Vec2 magnification, Vec4 color)
{
	glColor4f(color.r, color.g, color.b, color.a);

	glEnable(GL_TEXTURE_2D);
	glMatrixMode(GL_TEXTURE);
	glLoadIdentity();

	glBindTexture(GL_TEXTURE_2D, (GLuint)screen->handle);

	Vec2 lower_left  = offset - (dirx*(r32)screen->width*magnification.x + diry*(r32)screen->height*magnification.y)*0.5f;
	Vec2 upper_left  = offset - (dirx*(r32)screen->width*magnification.x - diry*(r32)screen->height*magnification.y)*0.5f;
	Vec2 upper_right = offset + (dirx*(r32)screen->width*magnification.x + diry*(r32)screen->height*magnification.y)*0.5f;
	Vec2 lower_right = offset + (dirx*(r32)screen->width*magnification.x - diry*(r32)screen->height*magnification.y)*0.5f;

	glBegin(GL_TRIANGLES); // <-- 26% of the global time goes here

	glTexCoord2f(0, 1);
	glVertex2f(upper_left.x, upper_left.y);
	glTexCoord2f(1, 1);
	glVertex2f(upper_right.x, upper_right.y);
	glTexCoord2f(1, 0);
	glVertex2f(lower_right.x, lower_right.y);
	glTexCoord2f(0, 1);
	glVertex2f(upper_left.x, upper_left.y);
	glTexCoord2f(1, 0);
	glVertex2f(lower_right.x, lower_right.y);
	glTexCoord2f(0, 0);
	glVertex2f(lower_left.x, lower_left.y);
	glEnd();
	glBindTexture(GL_TEXTURE_2D, 0); // <-- 36% of the global time goes here
}
static void prepare_using_bitmap(PictureBuffer* screen)
{
	glEnable(GL_TEXTURE_2D);
	glMatrixMode(GL_TEXTURE);
	glLoadIdentity();
	glGenTextures(1,(GLuint*)&screen->handle);
	glBindTexture(GL_TEXTURE_2D,(GLuint)screen->handle );

	glTexParameteri(GL_TEXTURE_2D,GL_TEXTURE_MIN_FILTER,GL_NEAREST );
	glTexParameteri(GL_TEXTURE_2D,GL_TEXTURE_MAG_FILTER,GL_NEAREST );
	//this is for the sprite art, this is not for regular art
	glTexParameteri(GL_TEXTURE_2D,GL_TEXTURE_WRAP_S,GL_CLAMP );
	glTexParameteri(GL_TEXTURE_2D,GL_TEXTURE_WRAP_T,GL_CLAMP );

	glTexEnvi(GL_TEXTURE_ENV,GL_TEXTURE_ENV_MODE,GL_MODULATE);
	glTexImage2D(GL_TEXTURE_2D,0,GL_RGBA8,screen->width,screen->height, 0,GL_BGRA_EXT,GL_UNSIGNED_BYTE,screen->picture);

	glBindTexture(GL_TEXTURE_2D,0 );

}
void initOpengl(HWND window)
{
  HDC hdc=GetDC(window);
  PIXELFORMATDESCRIPTOR wanted_format= {};
  wanted_format.nSize=sizeof(PIXELFORMATDESCRIPTOR);
  wanted_format.nVersion=1;
  wanted_format.dwFlags=PFD_SUPPORT_OPENGL|PFD_DRAW_TO_WINDOW|PFD_DOUBLEBUFFER;
  wanted_format.iPixelType=PFD_TYPE_RGBA;
  wanted_format.cColorBits=24;
  wanted_format.cAlphaBits=8;
  int returned_index=ChoosePixelFormat(hdc,&wanted_format);
  PIXELFORMATDESCRIPTOR returned_format= {};
  DescribePixelFormat(hdc, returned_index,sizeof(PIXELFORMATDESCRIPTOR), &returned_format);
  SetPixelFormat(hdc,returned_index,&returned_format);
  HGLRC hglrc = wglCreateContext(hdc);

  Assert(wglMakeCurrent(hdc,hglrc));

  ReleaseDC(window, hdc);
  
}


To be fair, I know the Intel HD is slow, but I think this has more to do with buffering the calls instead of making the driver handle them immediately. Can someone help me?
This is totally possible. Immediate mode (glBegin/glEnd/glVertex) is not the fastest way to get data to the GPU. The driver has to guess what is happening (it doesn't know how many more Begin/End calls will follow), and most likely it makes the wrong choices - sending data to the GPU too often, in portions that are too small.

Look into the vertex array functions - glDrawArrays, glVertexPointer & friends. With these you'll need to buffer all the data in temporary memory, but the number of calls you make to the driver will be drastically smaller.

Something like this:
struct Vertex { uint8 Color[4]; vec2 TexCoord; vec2 Position; };

Vertex* vertices = ...; // allocate memory
for (int i=0; i<count; i++)
{
    vertices[i].Color = ...; // store color as RGBA, R is [0], A is [3]
    vertices[i].TexCoord = ...;
    vertices[i].Position = ...;
}

// enable the client-side arrays (do this once during setup)
glEnableClientState(GL_COLOR_ARRAY);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glEnableClientState(GL_VERTEX_ARRAY);

glColorPointer(4, GL_UNSIGNED_BYTE, sizeof(Vertex), &vertices[0].Color);
glTexCoordPointer(2, GL_FLOAT, sizeof(Vertex), &vertices[0].TexCoord);
glVertexPointer(2, GL_FLOAT, sizeof(Vertex), &vertices[0].Position);
glDrawArrays(GL_TRIANGLES, 0, count);


Only 4 calls to the GL driver per frame to draw as many triangles as you want!
You can reduce the amount of data you need to keep in memory by, for example, storing uint16_t for TexCoord and Position instead of floats.

Btw, measuring CPU time is not a good way to profile the GPU. Because modern drivers are asynchronous, they can stall in very unexpected places - a GL function you wouldn't expect to be expensive can suddenly take a lot of time. Look into GPU-specific profiling tools if you want more information about what's actually happening on the GPU.

Edited by Mārtiņš Možeiko on
When you say it's temporary memory, can I use the stack, or does it need to persist for some time?
Another expensive thing you are doing is the 10k calls to glBindTexture - you might want to try baking your glyphs into a single texture and adjusting the texture coordinates according to which glyph you render. This should drastically reduce the number of state changes the GL driver has to do.

Edited by Bl00drav3n on
Yeah, one important thing I forgot to mention is what Bl00drav3n says - you'll need to bake all the glyphs into one texture for this to work, because you can't switch textures mid-draw with "old" GL. With some newer versions/extensions it's possible to work around that (texture arrays or similar), but really - just bake everything into one texture.

The memory only needs to persist until the end of the glDrawArrays call. glDrawArrays will read this memory and either submit it directly to the GPU or cache it internally for later submission.

The next step would be to use a streaming vertex buffer object (VBO). Then you wouldn't need to keep the memory on your side at all, only on the driver's side:
struct Vertex { uint8 Color[4]; vec2 TexCoord; vec2 Position; };

// setup, do this once
GLuint buffer;
glGenBuffers(1, &buffer);

// when drawing, do this once per frame
glBindBuffer(GL_ARRAY_BUFFER, buffer);
glBufferData(GL_ARRAY_BUFFER, 1024*1024, NULL, GL_DYNAMIC_DRAW); // allocate 1MB of memory, discarding previous data
Vertex* vertices = (Vertex*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);

// fill vertex data
for (int i=0; i<count; i++)
{
    vertices[i].Color = ...; // store color as RGBA, R is [0], A is [3]
    vertices[i].TexCoord = ...;
    vertices[i].Position = ...;
}

// done with filling data
glUnmapBuffer(GL_ARRAY_BUFFER);

// with a buffer bound, the pointer arguments are byte offsets into it
glColorPointer(4, GL_UNSIGNED_BYTE, sizeof(Vertex), (void*)offsetof(Vertex, Color));
glTexCoordPointer(2, GL_FLOAT, sizeof(Vertex), (void*)offsetof(Vertex, TexCoord));
glVertexPointer(2, GL_FLOAT, sizeof(Vertex), (void*)offsetof(Vertex, Position));
glDrawArrays(GL_TRIANGLES, 0, count);


Edited by Mārtiņš Možeiko on
e1211 wrote:
> when you say that it's a temporary memory, can i use the stack or should it persist for some time?

Once you make the actual draw call (glDrawArrays in this case), the driver copies the memory out. (This is part of the speedup, I suspect: the work on your thread boils down to a few memcpys.) So your memory only needs to stay valid until then.

It's also probably worthwhile at that point to go ahead and use the full shader pipeline, with glVertexAttribPointer instead of glColorPointer/glTexCoordPointer/etc. (According to the wiki, at least, you can't use the latter on GL 3.0+ anyway.) This is another chunk of setup, but that path is the one vendors actually care about performance on.
You must use glVertexAttribPointer only on a core profile or forward-compatible context. Using the fixed-function pointer functions on a regular compatibility context is completely fine - that is what Casey uses in Handmade Hero, and the glVertex calls work fine there too.
Thanks for all the replies. I think I will bake the glyphs into a single bitmap and just use that - it sounds like exactly what I need, though I'll postpone it until I have nothing more immediate to do.
Asking this question, I wanted to know whether there was something I could do to make the performance better; now that I know there is, I can rest until I fix it.