many sprites

I have a 2D game that uses many sprites. My game has 15 characters, each character has ~7 animation sequences in 6 different directions, and each sequence has ~9 frames. (5000+ frames to save you the calculation.)

Currently I’m stripping whitespace and packing the sequences into 1024x1024 images. It takes ~120 images to fit everything. I never split a sequence across images. Just before the first frame of a sequence is needed, I load the image for that sequence if it isn’t already loaded. After the animation, if no active sequences need the image, I unload it.
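The load/unload rule above can be sketched as a small reference count per image; all the names here (SequenceCache, loadImage, unloadImage) are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the load/unload-on-demand scheme: an image stays loaded while at
// least one active sequence needs it. The actual work (decode JPEG, upload
// to OpenGL, delete texture) is stubbed out.
public class SequenceCache {
	// How many active sequences currently need each image.
	private final Map<Integer, Integer> refCounts = new HashMap<>();

	// Called just before the first frame of a sequence is needed.
	public void acquire(int imageId) {
		int count = refCounts.getOrDefault(imageId, 0);
		if (count == 0) loadImage(imageId); // blocking load: the source of the pauses
		refCounts.put(imageId, count + 1);
	}

	// Called when an animation finishes.
	public void release(int imageId) {
		int count = refCounts.getOrDefault(imageId, 0) - 1;
		if (count <= 0) {
			refCounts.remove(imageId);
			unloadImage(imageId); // no active sequence needs it any more
		} else {
			refCounts.put(imageId, count);
		}
	}

	public boolean isLoaded(int imageId) { return refCounts.containsKey(imageId); }

	protected void loadImage(int imageId) { /* decode JPEG, upload texture */ }
	protected void unloadImage(int imageId) { /* delete texture */ }
}
```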

This works OK most of the time, but occasionally fails. Imagine a group of 6 characters all dying at the same time. If the death animations happen to be on different images that aren’t loaded, the game pauses while the 6 images are loaded. The pause is extremely noticeable, even on good desktop hardware.

I’m targeting both the desktop and Android with this game. I haven’t tried it yet, but I’m guessing that even if only one image needs to be loaded, on Android the pause will be noticeable. Certainly if 6 are loaded at once, Android will be brought to its knees.

What do you think would be the best approach for my animation system? Maybe you guys have some sage advice or tricky tricks for this problem. :slight_smile:

I could store one sequence per image. This should reduce the load time, at the (possibly negligible) cost of more video memory since the images are unlikely to be powers of two and this approach guarantees the use of one texture per active sequence. I envision the typical number of different simultaneously active sequences would be ~30, with a max of maybe ~60.

The actual loading of an image is done by this IMGDecoder class. Possibly this could be optimized. Would it be worth doing pure Java JPEG decoding to load the image directly into a ByteBuffer OpenGL can use?

I could decode all ~120 images beforehand, so I just have to read the bytes and hand them to OpenGL. Just the raw bytes would take ~480MB of disk space. Something like deflate could be used to make this significantly smaller, but the decompression overhead would have to be lower than IMGDecoder’s (which probably isn’t hard). For this to be feasible, the resulting temp files would have to be reasonably small, since Android doesn’t have an abundance of storage. On Android I use a lower resolution, so I need only ~30 images instead of ~120.
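For the deflate idea, here is a rough sketch of compressing raw frame bytes with java.util.zip; the buffer size and the BEST_SPEED choice are guesses, not measurements:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch: raw RGBA frames are deflated once at build time and inflated at
// load time, trading the JPEG decode for a (usually cheaper) inflate.
public class RawDeflate {
	public static byte[] compress(byte[] raw) {
		Deflater deflater = new Deflater(Deflater.BEST_SPEED); // favour load time over size
		deflater.setInput(raw);
		deflater.finish();
		ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length / 2);
		byte[] chunk = new byte[8192];
		while (!deflater.finished()) out.write(chunk, 0, deflater.deflate(chunk));
		deflater.end();
		return out.toByteArray();
	}

	// rawLength must be the original uncompressed size (known at build time).
	public static byte[] decompress(byte[] packed, int rawLength) {
		Inflater inflater = new Inflater();
		inflater.setInput(packed);
		byte[] raw = new byte[rawLength];
		try {
			int off = 0;
			while (off < rawLength) off += inflater.inflate(raw, off, rawLength - off);
		} catch (DataFormatException e) {
			throw new RuntimeException(e);
		}
		inflater.end();
		return raw;
	}
}
```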

5670 frames of animation in 480mb of texture space: this isn’t going to work. You’re going to need to adopt a different strategy!

Firstly I’d make those textures much, much smaller. 256x256 perhaps. Maybe less. This means when you need to load one, you’ll only need to load 256KB of data (still probably way too much for Android). Secondly, I’d use a different texture format for them - I’d suggest RGBA4, which is 4 bits per channel, halving the amount of IO again. Thirdly, I’d have a small “working set” cache of these in memory, probably utilising an LRU. Keep, say, 2 of them in memory, and if you need to load another, chuck out the least recently used one. This might give you a small speedup where the scene changes little frame by frame.
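That small working-set cache comes almost for free from LinkedHashMap's access-order mode; this is just a sketch, with an Integer standing in for a GL texture handle:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU working set: on overflow the least recently *accessed* entry is
// dropped. In a real version the evicted entry's GL texture would be deleted.
public class TextureLru extends LinkedHashMap<String, Integer> {
	private final int capacity;

	public TextureLru(int capacity) {
		super(16, 0.75f, true); // true = access order, i.e. LRU iteration
		this.capacity = capacity;
	}

	@Override
	protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
		if (size() > capacity) {
			// glDeleteTextures(eldest.getValue()) would go here
			return true;
		}
		return false;
	}
}
```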

Fourthly, I’d implement a “working set” cache of the images you need to display the current frame as a set of big texture atlases - several 1024x1024 in-memory atlases. You’d have to dynamically add and remove sprites to the atlases, using some caching rule, and obtain your images from the small loaded texture working set. If your images are all of a convenient uniform size this’ll be very easy to implement. What’s less obvious is the best caching rule: LRU might not work too well. “Most Often Used” is maybe a better bet, where the most often used sprites are retained in the atlases, as they just keep getting used over and over. Experimentation may be the key here.
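One possible shape for the “Most Often Used” rule: each sprite gets a use counter, and when the atlas is full the entry with the lowest count is evicted, so the often-used ones survive. The names and the String key are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a most-often-used atlas working set. touch() records a use and,
// if the cache is full, evicts the least-used sprite to make room.
public class MouCache {
	private final Map<String, Integer> useCounts = new HashMap<>();
	private final int capacity;

	public MouCache(int capacity) { this.capacity = capacity; }

	// Record a use; returns the sprite evicted to make room, or null.
	public String touch(String sprite) {
		String evicted = null;
		if (!useCounts.containsKey(sprite) && useCounts.size() >= capacity) {
			for (Map.Entry<String, Integer> e : useCounts.entrySet())
				if (evicted == null || e.getValue() < useCounts.get(evicted))
					evicted = e.getKey(); // least-used entry so far
			useCounts.remove(evicted);
		}
		useCounts.merge(sprite, 1, Integer::sum);
		return evicted;
	}
}
```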

Finally, I’d do this using a batch-and-sort sort of thing. I’d find out precisely which sprite images are required for the next video frame, then sort them all by their known order in the list of 256x256 textures, then do all the loading and atlas additions / removals by following that sorted list sequentially, and then get on with rendering sprites.
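The batch-and-sort step might look something like this sketch, where SpriteRef and its textureIndex are invented names:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Collect the sprites needed for the next frame, then sort them by source
// texture index so all loads/atlas updates for one texture happen together.
public class FrameBatcher {
	public static class SpriteRef {
		public final int textureIndex; // index into the list of 256x256 textures
		public SpriteRef(int textureIndex) { this.textureIndex = textureIndex; }
	}

	// Returns a copy of the list ordered to minimise texture switches.
	public static List<SpriteRef> sortByTexture(List<SpriteRef> needed) {
		List<SpriteRef> sorted = new ArrayList<>(needed);
		sorted.sort(Comparator.comparingInt(sp -> sp.textureIndex));
		return sorted;
	}
}
```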

This all sounds very difficult, and it is :slight_smile: You are dealing with an unusually massive number of sprite images. Also, all of this advice is totally theoretical: this is just how I’d start approaching the problem.

BTW at Puppygames we have nowhere near that number of sprite images :slight_smile: All our sprites fit into about 100mb in our most complex game. We cheat and just load the lot into OpenGL at the start of the game and let the OS take care of swapping them in and out of the disk using virtual RAM - probably something you can’t really do with Android!

Cas :slight_smile:

Look at your images as if they are plain data.

  1. Data must be available at some point, at some time.
  2. Not all data has an equal probability of being needed instantly.
  3. Tag every item in your data with a probability value.
  4. Tag every item in your data with an integer (more on that later).

When you are about to need a data item soon, increase the probability value for that data item.

Create multiple layers that have increasing amounts of latency towards usage. Build it like the memory system in the PC: registers, L1 cache, L2 cache, L3 cache, system RAM, swapfile, disk, network.

Level 6 accessibility: [disk] compressed images (jpeg/png)
Level 5 accessibility: [disk] raw images (RGB/RGBA) using a big mapped ByteBuffer.
Level 4 accessibility: [ram] LRU collection
Level 3 accessibility: [ram] ByteBuffer to update texture
Level 2 accessibility: [vram] texture (often used, often changed)
Level 1 accessibility: [vram] texture (~always used, ~never changed)

Each level has a capacity. You can stream data from level N to level N-1 in different threads, all driven by the ‘probability value’ of each sprite. Of course you add some shortcuts for ‘emergencies’.

So, if you are fighting an enemy, and you can determine he is about to lose in the next few seconds: bump the probability value of the death sequence which will automatically stream in the data (from level to level) as textures before you actually need them.

If a unit spontaneously combusts, and its texture is nowhere near your [vram], find the data (wherever it is, see point #4) and blocking-load it. To prevent this from happening, you have to tweak your probability values.
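The probability tagging might be sketched like this; the Item fields and the idea of a streaming thread promoting one item at a time are my guesses at the scheme, with levels as plain ints (6 = disk … 1 = vram):

```java
import java.util.ArrayList;
import java.util.List;

// Toy version of the probability-driven streaming queue: bumping an item's
// probability moves it up the queue so it gets promoted toward vram sooner.
public class StreamingQueue {
	public static class Item {
		public final int id;      // point #4: the integer tag locating the data
		public double probability;
		public int level = 6;     // starts on disk
		public Item(int id) { this.id = id; }
	}

	private final List<Item> items = new ArrayList<>();

	public void add(Item item) { items.add(item); }

	// Called when you can predict the item will be needed soon.
	public void bump(Item item, double amount) { item.probability += amount; }

	// A streaming thread would repeatedly promote the most probable item one level.
	public Item nextToPromote() {
		Item best = null;
		for (Item it : items)
			if (it.level > 1 && (best == null || it.probability > best.probability))
				best = it;
		return best;
	}
}
```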

In addition to the above, maybe you can also use hardware texture compression (DXT?) to reduce memory usage, no? (Or your own hardware-side compression if possible…)

EDIT: also, you would probably get the best compression results using some kind of “video” encoding for each animation, rather than one big image per animation. You could treat the n frames as a video track, something like imageAnim[n].jpg = imageAnimOriginal[n].jpg - imageAnim[n-1].jpg. As you will always display frame n before frame n+1, this is also good for decoding/streaming.
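A sketch of that delta idea on raw pixel bytes (with lossy JPEG you'd compute deltas on the decoded frames, since the compressed bytes don't subtract cleanly):

```java
// Store frame n as the byte-wise difference from frame n-1, so mostly-static
// animations compress far better. Byte arithmetic wraps mod 256, so
// apply(prev, encode(prev, cur)) reconstructs cur exactly.
public class FrameDelta {
	public static byte[] encode(byte[] prev, byte[] cur) {
		byte[] delta = new byte[cur.length];
		for (int i = 0; i < cur.length; i++)
			delta[i] = (byte)(cur[i] - prev[i]);
		return delta;
	}

	public static byte[] apply(byte[] prev, byte[] delta) {
		byte[] cur = new byte[delta.length];
		for (int i = 0; i < delta.length; i++)
			cur[i] = (byte)(prev[i] + delta[i]);
		return cur;
	}
}
```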

Thanks for the feedback guys. :slight_smile:

I think RGBA4 or DXT are good ideas for reducing the texture memory used. RGBA4 is easiest to implement. Because the sprites are stored on disk as JPEG+alpha, I would have to load the JPEG, apply the alpha, compress to DXT, then hand it to the GL. I played with using DXT for on-disk storage, but since my sprites have lots of transparency I was getting <15% compression.
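Converting decoded 8-bit RGBA to RGBA4 is just truncating each channel to its top 4 bits and packing them into one short per pixel (the GL_UNSIGNED_SHORT_4_4_4_4 layout); a minimal sketch:

```java
// Pack 8-bit RGBA channels into one 16-bit pixel: R in the top nibble,
// then G, B, A. This is the format handed to glTexImage2D with
// GL_UNSIGNED_SHORT_4_4_4_4.
public class Rgba4 {
	public static short pack(int r, int g, int b, int a) {
		return (short)(((r >> 4) << 12) | ((g >> 4) << 8) | ((b >> 4) << 4) | (a >> 4));
	}
}
```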

I will store each sequence in the smallest image file possible to reduce the data that must be loaded at once at runtime. Hopefully this makes load times acceptable.

I’ll give an example that may shed more light on how my sprites are used. Say there are 5 different characters on screen. They will all be doing an “idle” animation. I’ll load 5 textures, one for each animation sequence, and won’t unload them while the character is looping the idle animation. When a character attacks, I unload idle, load “attack”, and display those frames. When the attack animation is complete, I unload attack and load idle.

Because each character action has its own spritesheet, does a large atlas texture buy me much? It would reduce the number of texture binds from one per character to just one or two, at the added cost of the time to pack/write to the atlas after loading a sequence’s spritesheet.

I have the feeling that using Riven’s approach, the majority of sequence lookups will be “emergency”. E.g., if I press attack or walk, that animation is needed right now. Attack, walk, and idle can’t really be loaded all the time, since there could be 15 unique characters on screen at once; with 3 sequences, 6 directions, and ~7 frames each, that is ~1900 frames. I think if my game is going to work with so many frames, it must be possible to load an individual sequence quickly. If handling the emergency loads can be made mostly reasonable, then I think looking ahead to preload sequences makes sense to mitigate load pauses.

I’ve been thinking about the animation system a lot, but haven’t started coding a new one yet. I will very soon, but first I’ve been sidetracked on a related task to optimize the JPEG loading.

I was thinking that using AWT to load JPEGs is inefficient, even if it uses faster native code. It gives me a BufferedImage and I have to copy all the data into a buffer for OpenGL. Google eventually led me to a JPEG decoder in SWT. I’ve ripped it out of SWT (it’s a single ~6000-line class) and changed it to write to a ByteBuffer. I profiled memory allocation and modified the decoder to keep the memory around so subsequent images can be decoded without the allocation overhead. I have a hunch that whatever ImageIO uses (libjpeg?), it allocates new temp and output buffers for each decode.

At this point, on my desktop hardware, my JPEG decoder takes 24ms to decode a 640x480 JPEG to a ByteBuffer while ImageIO takes 36ms for the same image, including copying the pixel data to a ByteBuffer. Strangely enough, every fourth ImageIO decode takes 51ms instead of 36ms.

I look forward to trying the pure Java decoder on Android. I’m hoping avoiding memory allocations will be a big benefit. Android has its own image loading stuff, since AWT/ImageIO isn’t available there.

It appears the SWT decoder is directly ported from Blender, from this C code:
http://download.blender.org/source/chest/blender_2.03_tree/jpeg/
Apparently there are a few DCT implementations supported in the Blender code:


JDCT_ISLOW /* slow but accurate integer algorithm */
JDCT_IFAST /* faster, less accurate integer method */
JDCT_FLOAT /* floating-point: accurate, fast on fast HW */

The SWT port only includes ISLOW. I wonder what sort of speedup we’d see using IFAST? There are a couple of hairy bits I get caught up on when I try porting IFAST. I’ll play with it a bit more, but I’m not sure I’m up to the task. If anyone has some free cycles and thinks this is an interesting problem, here are the resources:

The original code for ISLOW:
http://download.blender.org/source/chest/blender_2.03_tree/jpeg/jidctint.c
The SWT decoder code for ISLOW:
http://www.docjar.com/html/api/org/eclipse/swt/internal/image/JPEGDecoder.java.html#2921
The original code for IFAST:
http://download.blender.org/source/chest/blender_2.03_tree/jpeg/jidctfst.c
My decoder so far:
http://n4te.com/temp/JPEGDecoder.java
Example/benchmark for the decoder:
http://n4te.com/temp/Test.java

Does the JPEG decoding time only include decoding, not disk loading/reading? 36ms seems very, very slow.

If the images compress well, you may be able to keep the whole JPEG images in memory and decode them when required, without re-reading from disk. Say your images average 25KB: 5000 images x 25KB requires around 122MB. But if the decoding time pointed out above is only decoding and not disk reading, then sure, you have to find something faster…

EDIT: I just saw your benchmark source; it seems you are counting disk reading in the bench. You could try benchmarking only the decoding using Toolkit.createImage(byte[…]) (sorry for my old-school loveness…) or such, to measure only the decoding time of the image without the disk reading. I am sure you can decode hundreds of images in less than a second.

It was including reading from disk. You’re right, I may be able to keep the compressed images in memory. All the compressed images are ~40MB for the desktop; for Android it’s < 10MB. I updated Test.java to not include disk IO, but it didn’t affect the times.

If I comment out copying the BufferedImage pixel data to a ByteBuffer, then I see ImageIO taking 23ms to decode. Strangely enough every fourth ImageIO decode takes 38ms.

FWIW, the machine I’m testing this on is an Intel P4 3GHz.

I tried my own bench, don’t know if it is relevant…

/**
 * TestJPG: applet benchmark for Toolkit.createImage JPEG decoding.
 *
 * @version 1.00 2010/2/10
 */
import java.awt.*;
import java.applet.*;
import java.net.*;
import java.io.*;
import java.awt.image.*;

public class TestJPG extends Applet
{
	public static final long serialVersionUID = 1;

	static int nbImage = 100;
	byte spriteData[];
	Image spriteImage[];
	String msg;

	public void init()
	{
		try
		{
			// Read the JPEG bytes once, up front, so disk IO stays out of the timing.
			URL url = new URL(this.getCodeBase() + "TEST.JPG");
			URLConnection uc = url.openConnection();
			int size = uc.getContentLength();
			this.spriteData = new byte[size];
			InputStream is = uc.getInputStream();
			int nb = is.read(this.spriteData);
			if (nb != size) throw new IOException("Error reading");

			this.spriteImage = new Image[nbImage];
			Toolkit t = Toolkit.getDefaultToolkit();

			long start = System.currentTimeMillis();
			for (int n = 0; n < nbImage; n++)
			{
				this.spriteImage[n] = t.createImage(this.spriteData);
				// createImage is lazy: force the decode and wait for it to complete.
				if (!t.prepareImage(this.spriteImage[n], -1, -1, null))
				{
					int flag = 0;
					do
					{
						flag = t.checkImage(this.spriteImage[n], -1, -1, null);
						Thread.yield();
					} while ((flag & ImageObserver.ALLBITS) == 0);
				}
			}
			long end = System.currentTimeMillis();
			this.msg = "decoding " + ((1000 * nbImage) / (end - start)) + " jpg / second";
		}
		catch (Exception e)
		{
			e.printStackTrace();
			this.msg = e.getMessage();
		}

		System.out.println(msg);
		this.repaint();
	}

	public void paint(Graphics g)
	{
		if (this.spriteImage != null)
		{
			g.drawImage(this.spriteImage[0], 0, 0, null);
			g.setColor(Color.red);
			g.drawString(this.msg, 50, 50);
		}
	}
}

[s]I got “1063829 jpg / second” on an Intel Core 2 T7XXXsomething (but indeed the bench uses only one core)

EDIT: error in the code; it was too much, it looked suspicious…[/s]

“decoding 47 jpg / second”

Maybe Toolkit’s createImage just wraps the byte array? I don’t think it is actually doing any decoding.

You’re probably right, it is way too much. I’ll try another benchmark approach.

EDIT: OK, that’s better now, I updated the above post… “decoding 47 jpg / second”, which is still damn slow…

EDIT2: tried with a bigger array to get more accurate results… this is a little better but still too slow: “decoding 58 jpg / second” (note that the JPG is 700x398).

EDIT3: gives “decoding 116 jpg / second” for the 256x256 sprite below (also IMO the bench I’ve posted is inaccurate, but more pessimistic than optimistic).

Improving the bench a little and running it several times gives 130 to 178 jpg / second on my computer, so if you keep all the data in memory it may be interesting to have an object like:

class Sprite
{
	byte jpgData[];  // compressed JPEG bytes, always kept in memory
	Image jpgImage;  // decoded image, null while not displayed
	...
}

and then have two versions, compressed and decoded, so that when you know an image is being displayed you can decode it, and when it hasn’t been displayed for a while you can free its Image.

On my machine I see 69 jpg/sec. If I change your test to use the JPEGDecoder class I posted, I see 77 jpg/sec. The code for that:

JPEGDecoder decoder = new JPEGDecoder();
ByteArrayInputStream jpegInput = new ByteArrayInputStream(spriteData);
long start = System.currentTimeMillis();
for (int n = 0; n < nbImage; n++) {
	jpegInput.reset(); // rewind to the start of the JPEG bytes
	decoder.decode(jpegInput);
}
long end = System.currentTimeMillis();

I’ll try with the JPEGDecoder too and post results. (I really can’t understand why a JPEG decoder, even without native code, takes so much time; surely there are plenty of ways to optimise it.)

I got the following results:

145 jpg/s for the JPEGDecoder
180 jpg/s using Toolkit.createImage

using Java 1.6-18-ea-b04

Strange. I’ll try it on a different machine. You are using the same code you posted?

For Toolkit.createImage to be equivalent to JPEGDecoder, it needs to result in a ByteBuffer that can be passed to OpenGL to create a texture. I think once you access the pixels and copy them into a direct ByteBuffer, it will be much slower.

Ran my test on a faster machine and I see SWT’s decoder with 15ms and ImageIO with 20ms. Every fourth ImageIO decode takes 26ms (weird!).

Same machine with DzzD’s test I saw 217 jpg/sec. DzzD’s test with JPEGDecoder was 166 jpg/sec. The pixel data should be copied to a ByteBuffer before the two can be compared, but I don’t see a way to access the pixels through AWT’s Image or Toolkit classes.

JPEG decoding involves doing Fourier-analysis-style math. Take a look at a JPEG decoder, it is pretty crazy!

use PixelGrabber
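A minimal sketch of that suggestion: PixelGrabber pulls the decoded ARGB ints out of an AWT Image, which can then be repacked into a direct RGBA ByteBuffer for OpenGL (class and method names are made up):

```java
import java.awt.Image;
import java.awt.image.PixelGrabber;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Grab the decoded pixels of an Image as ARGB ints and repack them into a
// direct RGBA ByteBuffer suitable for glTexImage2D.
public class GrabToBuffer {
	public static ByteBuffer grab(Image image, int w, int h) {
		int[] argb = new int[w * h];
		try {
			new PixelGrabber(image, 0, 0, w, h, argb, 0, w).grabPixels();
		} catch (InterruptedException e) {
			throw new RuntimeException(e);
		}
		ByteBuffer buffer = ByteBuffer.allocateDirect(w * h * 4).order(ByteOrder.nativeOrder());
		for (int pixel : argb) {
			buffer.put((byte)(pixel >> 16)); // R
			buffer.put((byte)(pixel >> 8));  // G
			buffer.put((byte)pixel);         // B
			buffer.put((byte)(pixel >> 24)); // A
		}
		buffer.flip();
		return buffer;
	}
}
```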

Make every frame (not every sequence) a data item.

I know the Fourier world pretty well, I ate a lot of it when I was a student, and it really doesn’t explain such slowness; a simple fast DCT requires only a few instructions per entry (here is an example, I don’t know if it is the best one, but it is already not that much:

I suppose the bottleneck is more in the other stuff (memory allocation / pixel format compatibility) than in the JPEG decoding itself; what I mean is that you should be able to tune the JPEGDecoder to get far better results.

I believe JPEG splits the image into 8x8 blocks and does a DCT on each. 640x480 would be 80x60 blocks, or 4800 total DCTs. By your math, that is 134,400 additions and 105,600 multiplies. The method in JPEGDecoder that does the DCT is jpeg_idct_islow, and I have a feeling it uses more operations than you have listed. I am really interested in porting the IFAST DCT algorithm I posted about above.
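For reference, the block count works out as (640/8) x (480/8):

```java
// Number of 8x8 DCT blocks per plane for a given image size (ignoring any
// chroma subsampling, which would reduce the count for the chroma planes).
public class DctCount {
	public static int blocks(int w, int h) { return (w / 8) * (h / 8); }
}
```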

HPROF says the biggest CPU pain points are DCT (jpeg_idct_islow method) and converting YCbCr to RGB (ycc_rgb_convert method).

Ah, I see. This would allow a sequence to be partially loaded. This might be needed if I have problems loading the entire sequence for every active sequence, but is much more complicated. I think I’d like to try a simpler approach and move to something more advanced if that doesn’t pan out.