Not really, more like drawing all sprites with a single glDrawArrays() call. It's your transforms that are the problem, since they're what generate so many OpenGL calls. This isn't something that can be solved easily, to be honest, so make sure your current performance actually isn't good enough before you try this out, as it makes the code a lot less readable and more error-prone.
The main point of batching is to get a number of OpenGL calls that is independent of the number of actual instances of each different 3D model (or in this case: sprites). Using a constant x draw calls should obviously be faster than y * n calls, where n is the number of sprites: each of the x calls might be more expensive, but the y * n cost is paid per sprite, so it will be slower once you have lots of them. For 3D models this is usually accomplished with instancing, which draws the same model multiple times with a single draw call. There are a few ways of getting the instance-specific data (often just a model matrix to move them to different positions) to the vertex shader so the instances can differ in attributes like position, color, rotation, etc.
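For completeness, here's a minimal sketch of what instancing could look like with LWJGL (quadVbo, instanceVbo and numSprites are placeholder names for whatever your setup uses, and it assumes GL 3.3 for glVertexAttribDivisor):

```java
import static org.lwjgl.opengl.GL11.*;
import static org.lwjgl.opengl.GL15.*;
import static org.lwjgl.opengl.GL20.*;
import static org.lwjgl.opengl.GL31.*;
import static org.lwjgl.opengl.GL33.*;

void drawAllSprites(int quadVbo, int instanceVbo, int numSprites) {
    // Per-vertex data: the 4 corners of a single quad.
    glBindBuffer(GL_ARRAY_BUFFER, quadVbo);
    glVertexAttribPointer(0, 2, GL_FLOAT, false, 0, 0);
    glEnableVertexAttribArray(0);

    // Per-instance data: one vec2 position per sprite.
    glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
    glVertexAttribPointer(1, 2, GL_FLOAT, false, 0, 0);
    glEnableVertexAttribArray(1);
    glVertexAttribDivisor(1, 1); // advance attribute 1 once per instance, not per vertex

    // One call draws every sprite: 4 vertices each, numSprites instances.
    glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, numSprites);
}
```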
I think this can be solved a lot more cleanly with newer OpenGL versions. To be honest, I don't recommend that you do this, but if you THEORETICALLY were to, you could use a technique called point sprites to draw each sprite as a GL_POINT which is then expanded to a quad (triangle strip) in a geometry shader. For some unfathomable reason the built-in point sprites (ARB_point_sprite) are limited to screen-aligned squares with an implementation-dependent maximum size, often as small as 64x64 pixels, so you'll have to do the expansion yourself with a geometry shader.
Now stop again and question if you really need this. If it sounds useful or maybe even just interesting, continue. Otherwise just move along with your game. xD
An easy way to batch all your sprites together into a single draw call is to put them all in the same texture. A texture atlas (having the sprites next to each other in one large texture) can create artifacts with mipmaps, though, as the sprites bleed into each other when the texture is minified. To avoid this you can use a texture array, which is basically an array of 2D textures. You choose which layer to access using a third texture coordinate in your shader (there's a specific sampler type for them). The point of texture arrays is that no sampling is done between the layers, yet each layer can still have mipmaps; with a plain 3D texture, mipmapping would also blend the layers together. All of this is just to avoid having to do multiple texture binds when drawing different sprites, so if you only have 20 different sprites or so you don't have to do it.
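Sampling a texture array from a fragment shader looks roughly like this (GLSL 1.50; spriteSheet and texCoord are names I made up for the sketch):

```glsl
#version 150

uniform sampler2DArray spriteSheet;

in vec3 texCoord; // xy = normal texture coordinates, z = layer index
out vec4 fragColor;

void main() {
    // Filtering only happens within the selected layer, never between layers.
    fragColor = texture(spriteSheet, texCoord);
}
```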
Now for the shader. The first thing you need to figure out is which attributes stay constant across multiple sprites. If you're drawing 50 identical sprites at different positions, the only thing your shader needs per drawn sprite is the position. Size, color, rotation (as a 2D rotation matrix), etc. are all constant for a given draw call, so you can keep those as uniforms. You want as few attributes as possible, while still being able to achieve whatever you want to achieve with as few draw calls as possible. For example, in a very simple 2D particle system the only things that vary between points are position and color. Size is constant and can in that case be either a uniform or just a plain constant if you only have a single type of particle.
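For that particle example, the split between attributes and uniforms would look something like this in the vertex shader (just a sketch with made-up names; gl_PointSize also requires GL_PROGRAM_POINT_SIZE to be enabled in a core profile):

```glsl
#version 150

in vec2 position; // varies per particle -> attribute
in vec4 color;    // varies per particle -> attribute

uniform float size; // same for the whole draw call -> uniform

out vec4 vColor;

void main() {
    vColor = color;
    gl_Position = vec4(position, 0.0, 1.0);
    gl_PointSize = size;
}
```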
Your vertex shader should just pass the vertex attributes through to the geometry shader; the real "magic" happens in there. The geometry shader uses whatever data you give it to output the 4 corners of your sprite, basically expanding your point to a quad. You want to do as little work as possible here, basically just mapping uniforms, constants and attributes to the fragment shader inputs.

Going back to your cube implementation: the only thing that varies per cube is position. I assume you don't have any rotation, so we'll remove that, and you're not using a color either. So your vertex shader takes in a vec2 position and outputs it to gl_Position. Your geometry shader reads this position and outputs 4 unique corners: (x, y), (x + width, y), (x, y + height), (x + width, y + height). Also keep a vec2 uniform holding 2 times the inverse screen size (2.0 / screenSize). To get clip-space positions in the -1 to 1 range you can then do (position * twoOverScreenSize - 1.0), removing the need for a matrix completely. You can also generate texture coordinates for these corners in the geometry shader: (x, y) gets (0, 0), (x + width, y) gets (1, 0), etc. Finally, your fragment shader is pretty much identical to how it is now.
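Here's a sketch of what that vertex + geometry shader pair could look like (GLSL 1.50; spriteSize and twoOverScreenSize are my assumed uniform names, and I'm assuming pixel coordinates with the origin in the bottom-left corner):

```glsl
// --- vertex shader: pure passthrough of the per-sprite position ---
#version 150

in vec2 position; // pixel position, the only per-sprite attribute

void main() {
    gl_Position = vec4(position, 0.0, 1.0);
}
```

```glsl
// --- geometry shader: expands each point to a textured quad ---
#version 150

layout(points) in;
layout(triangle_strip, max_vertices = 4) out;

uniform vec2 spriteSize;        // constant for the whole draw call
uniform vec2 twoOverScreenSize; // 2.0 / screenSize

out vec2 texCoord;

void main() {
    vec2 pos = gl_in[0].gl_Position.xy; // the sprite's pixel position

    // Emit the 4 corners as a triangle strip, mapping pixels to the
    // -1 to 1 range on the fly (no matrix needed).
    gl_Position = vec4(pos * twoOverScreenSize - 1.0, 0.0, 1.0);
    texCoord = vec2(0.0, 0.0);
    EmitVertex();

    gl_Position = vec4((pos + vec2(spriteSize.x, 0.0)) * twoOverScreenSize - 1.0, 0.0, 1.0);
    texCoord = vec2(1.0, 0.0);
    EmitVertex();

    gl_Position = vec4((pos + vec2(0.0, spriteSize.y)) * twoOverScreenSize - 1.0, 0.0, 1.0);
    texCoord = vec2(0.0, 1.0);
    EmitVertex();

    gl_Position = vec4((pos + spriteSize) * twoOverScreenSize - 1.0, 0.0, 1.0);
    texCoord = vec2(1.0, 1.0);
    EmitVertex();

    EndPrimitive();
}
```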
The result is that you've reduced the attribute data from 5 floats to 2 floats per sprite instance, which is 20 bytes -> 8 bytes. The real win, though, is in the number of draw calls. To draw a cube, just add (batch) its position to a large FloatBuffer, then draw all the cubes with a single call to glDrawArrays() at the end. You'll be able to draw as many sprites as you want while the number of draw calls remains constant. In theory this should be a lot faster, but it depends a lot on what your current bottleneck is. I'm 99% sure it's your CPU at the moment; it could be fill-rate limited if you have an old graphics card, but I seriously doubt it. Even so, your sprites are pretty big, and your goal is basically to shift the bottleneck from the CPU to the GPU's fill rate, which with sprites that big isn't very hard.
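A bare-bones batcher along those lines might look like this in LWJGL (everything here is a sketch; SpriteBatch, MAX_SPRITES and so on are names I made up, and I'm assuming the shaders above are bound when flush() is called):

```java
import java.nio.FloatBuffer;
import org.lwjgl.BufferUtils;

import static org.lwjgl.opengl.GL11.*;
import static org.lwjgl.opengl.GL15.*;
import static org.lwjgl.opengl.GL20.*;

public class SpriteBatch {
    private static final int MAX_SPRITES = 10000;

    private final FloatBuffer data = BufferUtils.createFloatBuffer(MAX_SPRITES * 2);
    private final int vbo = glGenBuffers();
    private int spriteCount = 0;

    // Called once per sprite per frame: just stores 2 floats, no GL calls.
    public void draw(float x, float y) {
        data.put(x).put(y);
        spriteCount++;
    }

    // Called once per frame: uploads everything and issues a single draw call.
    public void flush() {
        data.flip();
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, data, GL_STREAM_DRAW);
        glVertexAttribPointer(0, 2, GL_FLOAT, false, 0, 0);
        glEnableVertexAttribArray(0);

        // The geometry shader expands each point to a full sprite.
        glDrawArrays(GL_POINTS, 0, spriteCount);

        data.clear();
        spriteCount = 0;
    }
}
```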
I know that this represents a best-case scenario for point sprites, as the only thing that varies is the position. You'll have to find a balance between the number of draw calls and the amount of attribute data you send, and that balance depends on your definition of "different" when it comes to sprites. You could basically draw every possible sprite in one call using a texture array and lots of attributes per sprite instance, but it might be a lot faster to split it into two draw calls if that lets you eliminate some attributes.
Let's say you want to draw 1000 arbitrarily sized sprites. You have the following attributes: a position (x, y) in floats, and a size (width, height) also in floats. This works fine and is really fast. However, if your sprites only come in 2 different sizes, it might be worth moving the size into a uniform and splitting the drawing into two calls to glDrawArrays() with a call to glUniform in between.
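Something like this (a sketch; spriteSize is my assumed uniform name, and I'm assuming the small sprites were batched into the buffer before the big ones):

```java
// Small sprites first, then big sprites, from the same batched buffer.
int sizeLocation = glGetUniformLocation(program, "spriteSize");

glUniform2f(sizeLocation, 16.0f, 16.0f);
glDrawArrays(GL_POINTS, 0, smallCount);

glUniform2f(sizeLocation, 64.0f, 64.0f);
glDrawArrays(GL_POINTS, smallCount, bigCount);
```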
Note that geometry shaders have horrible performance if used wrongly. Your GPU has several hundred or even thousands of small processors, so it relies on being able to run things in parallel; a few very expensive geometry shader invocations are therefore very bad performance-wise compared to many cheap ones. The same is true for any shader stage. This is basically the source of the criticism of geometry shaders: people expected them to be able to do heavy tessellation, which they were never really intended for, as you might have figured out. Used correctly, geometry shaders are FAST and open up lots of possibilities. They can reduce the bandwidth and computation needed for point sprites, and they can see all the vertices in a primitive, which is important for some algorithms. The performance hit is also exaggerated; a passthrough geometry shader is free on my NVidia card.
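For reference, this is all a passthrough geometry shader does (a sketch in GLSL 1.50); it just re-emits each incoming triangle unchanged. In a real program you'd forward your other outputs too:

```glsl
#version 150

layout(triangles) in;
layout(triangle_strip, max_vertices = 3) out;

void main() {
    // Re-emit the triangle exactly as it came in.
    for (int i = 0; i < 3; i++) {
        gl_Position = gl_in[i].gl_Position;
        EmitVertex();
    }
    EndPrimitive();
}
```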
Now for the last time: question if you really need this. My experience comes from optimizing PARTICLE SYSTEMS, where I basically had millions of pixel-sized particles. In that case, drawing each particle with its own draw call is completely out of the question for more than, say, 5000 particles. However, when batching up particles which only had varying positions and colors, I quickly hit another bottleneck: the speed at which I could fill the ByteBuffer with my particles' vertex data. Minimizing the submitted data by using appropriate variable types was therefore my first step, and using bytes for the color instead of floats gave me a nice improvement. You could do the same with your current code (for no noticeable speed increase though xD), since you're just submitting either 0 or 1, which is clearly nothing a byte can't represent.

Note that you should still keep a vertex stride that is a multiple of 4, as OpenGL can read from the buffer more efficiently when it is. In my case, replacing the 4 RGBA floats with 4 bytes still came out as a multiple of 4 (24 bytes -> 12 bytes per vertex). In your case you'd have 5 bytes, so you'll want to pad with 3 bytes so it totals 8 bytes per vertex.
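In LWJGL, the interleaved layout from my particle case would be set up something like this (a sketch; attribute locations 0 and 1 are assumed):

```java
// Per vertex: 2 position floats (8 bytes) + 4 color bytes (4 bytes) = 12 bytes,
// which is already a multiple of 4, so no padding is needed here.
int stride = 12;
glVertexAttribPointer(0, 2, GL_FLOAT, false, stride, 0);        // position
glVertexAttribPointer(1, 4, GL_UNSIGNED_BYTE, true, stride, 8); // color, normalized to 0..1
glEnableVertexAttribArray(0);
glEnableVertexAttribArray(1);
```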
I then added lots of other optimizations like multithreading the updating, Riven's MappedObject implementation, etc. In the end I had a particle engine capable of 64 FPS with 1,000,000 particles on a dual-core LAPTOP. The funny thing is that my program would become GPU-bottlenecked as soon as the points were bigger than 4 pixels. Basically, if you have sprites in the 200x200 pixel range you will hit a GPU bottleneck much, much sooner, which is why you really shouldn't waste time optimizing something that isn't the bottleneck of your game.
HOLY SHIT I DID IT AGAIN PLEASE DON’T HIT ME T__________T
EDIT: TL;DR: Batch your sprites so that you can draw them in a constant number of draw calls regardless of how many instances of them you have. This can be done with a geometry shader, but it's not always worth implementing if the sprites are big, since drawing will be fill-rate limited anyway. The rest is just specifics on how to implement it with a geometry shader. Also try to keep the amount of data you send each frame as low as possible, as it costs bandwidth but also lots of CPU power to fill the data buffer.
EDIT2: And ra4king is mean to me. T___T