CottAGE goes LWJGL/JWS

[quote]Yes, that would be really interesting although I wouldn’t know where to start :smiley:
[/quote]
Cool. Can you IM me? I’m on Yahoo as rreyelts.

[quote]I did have a look at a site which explains how to do static binary translation (so the original machine code is translated to java, and removing dead flags and such), and that seems to me the way to speed up CPU emulation the most, but I’m not sure (not knowing much about dynamic recompilation if anything)…
[/quote]
I think static retargeting would probably be ideal, but dynamic recompilation is probably easier - it may also be more amenable to some of the tricks certain programs play.

[quote]When I look at the profiler output, CPU emulation is quite expensive (although it’s only ~5-15% on my PC, but that’s quite a fast PC), yet video emulation is even more expensive in many games… I guess because of lots of array accesses.
[/quote]
Hmm… It sounds like you’re not offloading any work onto the graphics processor. I guess that makes sense for the earlier programs, but I bet that the burden shifts in later systems/programs where you can discern graphics primitives calls.

[quote]Once I knew how to do ‘dynarec’ or static binary translation, I could have a go at emulating a 68000.
[/quote]
You mean because it would run too slowly otherwise?

God bless,
-Toby Reyelts

I sent you a PM.

[quote]I think static retargeting would probably be ideal, but dynamic recompilation is probably easier - it may also be more amenable to some of the tricks certain programs play.
[/quote]
Yeah, I think you’re probably right.

[quote]Hmm… It sounds like you’re not offloading any work onto the graphics processor. I guess that makes sense for the earlier programs, but I bet that the burden shifts in later systems/programs where you can discern graphics primitives calls.
[/quote]
All graphics emulation is done in software on the CPU. Only the final frame is rendered using OpenGL. I think offloading more work would give a big speed gain, although the current method might be more accurate.

Well, look at it this way: the 68k has a huge number of instructions, and it would not make sense to handle every single instruction in a huge switch/case. It might make more sense to do it more programmatically, and use dynamic recompilation to get some performance out of it.
I’ve seen one 68k emulator done in Java (used in an Atari ST emulator), and that one was done by generating a Java source file with huge switch/cases. The source of that generated 68k emulator was about 4 MB…

[quote]All graphics emulation is done in software on the CPU. Only the final frame is rendered using OpenGL. I think offloading more work would give a big speed gain, although the current method might be more accurate.
[/quote]
The big advantage for graphics would be in emulating hardware sprites with OpenGL quads or something. You could also emulate hardware scrolling registers and that sort of thing. Even hardware playfield layers like the Amiga had.

For 68k emulation the trick would be to decompose the instruction word… they have a fairly regular structure, so you could do a much smaller switch on the bits that define the instruction type, then extract the register IDs from the other bits etc…
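To make the "decompose the instruction word" idea concrete, here is a minimal sketch of field extraction for the 68000. The bit positions follow the standard 68000 encoding (top four bits select the operation group; register and mode fields sit in fixed positions for MOVE); the class and method names are illustrative, not from CottAGE.

```java
// Hypothetical sketch: decoding a 68k instruction word by bit fields
// instead of switching on all 65536 possible opcodes.
public final class M68kDecode {
    public static int group(int opcode)   { return (opcode >> 12) & 0xF; } // bits 15-12: operation group
    public static int dstReg(int opcode)  { return (opcode >> 9)  & 0x7; } // bits 11-9: destination register
    public static int dstMode(int opcode) { return (opcode >> 6)  & 0x7; } // bits 8-6: destination mode
    public static int srcMode(int opcode) { return (opcode >> 3)  & 0x7; } // bits 5-3: source mode
    public static int srcReg(int opcode)  { return  opcode        & 0x7; } // bits 2-0: source register

    public static void main(String[] args) {
        // 0x3001 encodes MOVE.W D1,D0: group 3 (move word), source reg 1, dest reg 0
        int op = 0x3001;
        System.out.println(group(op) + " " + srcReg(op) + " " + dstReg(op));
    }
}
```

A small switch on `group()` then handles each instruction family, with the register/mode fields extracted separately - far smaller than one case per opcode.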

[quote]The big advantage for graphics would be in emulating hardware sprites with OpenGL quads or something. You could also emulate hardware scrolling registers and that sort of thing. Even hardware playfield layers like the Amiga had.
[/quote]
Exactly.

[quote]For 68k emulation the trick would be to decompose the instruction word… they have a fairly regular structure, so you could do a much smaller switch on the bits that define the instruction type, then extract the register IDs from the other bits etc…
[/quote]
The Z80 instruction set is a farking nightmare in that regard - no consistency whatsoever from an opcode point of view. Anyway, I’m still not sure why people don’t dispatch via an array of interfaces instead, i.e.:

interface OpcodeHandler {
    public void execute();
}

OpcodeHandler[] opcodeHandlers = new OpcodeHandler[] {
    new Op0Handler(),
    new Op1Handler(),
    new Op2Handler(),
    // ... one handler per opcode
};

public void execute() {
    while ( true ) {
        int opcode = readOp();
        opcodeHandlers[ opcode ].execute();
    }
}

Erik already had to pull all of the inline code out of the switch statements into functions (inlined, it performed worse), so I can’t see how an array dereference + virtual lookup is going to be so much more expensive than a switch. Especially considering that the dynamic recompiler is going to strive to minimize these loop iterations anyway.

God bless,
-Toby Reyelts

[quote]…so I can’t see how an array dereference + virtual lookup is going to be so much more expensive than a switch.
[/quote]
Theoretically, the switch could be implemented internally by the compiler as a lookup anyway. It might even be able to optimize the no-match case so that it overlaps with the time that would have been taken by the array bounds check in your example.

Practically speaking, this is worth profiling in a real-world case on different VMs. And you can of course disassemble the bytecode to see what javac does with it. Maybe it isn’t so easy for the JIT compiler to optimize.

[quote]Theoretically, the switch could be implemented internally by the compiler as a lookup anyway.
[/quote]
I don’t know what you mean. The instruction is tableswitch, and it is indeed a table lookup - it’s part of the VM spec, and the JIT implementation should be straightforward. The point is that you still pay a branch penalty (which is what really hurts modern processors), and you still pay a function call penalty, because you end up outlining all of the code anyway.

One of the greatest benefits of dynarec is that you don’t pay a branch penalty per instruction anymore.

God bless,
-Toby Reyelts

[quote]Erik already had to pull all of the inline code out of the switch statements into functions (inlined, it performed worse)
[/quote]
Yeah, manual inlining was a stupid thing to do, although on the MS VM it did get faster that way, and pulling the inline code out caused a performance hit on that VM.

[quote]For 68k emulation the trick would be to decompose the instruction word… they have a fairly regular structure, so you could do a much smaller switch on the bits that define the instruction type, then extract the register IDs from the other bits etc…
[/quote]
Yes, that is the smarter but slower way to do it. I started a 68k emulator that way once, but at the time I couldn’t get it fast enough on the machine I had then.

[quote]The big advantage for graphics would be in emulating hardware sprites with OpenGL quads or something. You could also emulate hardware scrolling registers and that sort of thing. Even hardware playfield layers like the Amiga had.
[/quote]
Yes, a major speed gain would be possible.
The problem is that when you use filtering, you can see the edges of the sprites/characters, which doesn’t look very good:
i.e. when two sprites or characters are directly next to each other, the pixels on the outer edge of one quad are not filtered with the adjacent pixels from the other quad, so you see a sharp seam between them.
So I think for the best-looking results, I should not render the quads directly, but blit them to another texture first (one texture per video layer). It’s all doable, but not really trivial to rewrite the current renderer that way.
If only bounds check removal were implemented (in the client VM), that would give a huge performance boost as well.

The 1st version of the 6809 emulator was written like that. We went for a switch and it got faster. The performance gain was worth the uglier code.

[quote]If only bounds check removal were implemented (in the client VM), that would give a huge performance boost as well.
[/quote]
Why can’t people run with server? Is this not an option for a webstarted app?

God bless,
-Toby Reyelts

[quote]The 1st version of the 6809 emulator was written like that. We went for a switch and it got faster. The performance gain was worth the uglier code.
[/quote]
Was that before or after you outlined all of your case code? I have a feeling you might not see much difference with all of the code outlined now.

God bless,
-Toby Reyelts

Well, I don’t know much about the VM spec/bytecodes… but I was assuming that a table lookup - by which I meant using the switch value as an index into a table of function pointers - would be a bad idea in some cases, as the switch values could be very sparse and the table would then need to be up to 4 GB in size.

[quote]The point is that you still pay a branch penalty (which is what really hurts modern processors), and you still pay a function call penalty, because you end up outlining all of the code anyway.
[/quote]
So you are saying that there are effectively two branches in the switch case: one for the switch decision and another as part of the function call? (The call being unconditional, so not nearly as bad for the processor’s instruction pipeline.)
So the theory is that the array lookup only has the one branch for the call, which is always taken. If I have that right, I was supposing that the JIT of the tableswitch would end up producing machine code that effectively did the same thing; given the nature of the switch statement, it could be a special case that the JIT compiler was able to convert into the native equivalent of the array of interfaces.

[quote]One of the greatest benefits of dynarec is that you don’t pay a branch penalty per instruction anymore.
[/quote]
Yep, obviously JIT compiling is the proof that it works. The emulator, after all, is nothing more than a VM whose opcodes (bytecode) are the instruction codes of an actual processor. I guess that makes it an NVM (non-virtual machine)?

The scary thing is dealing with the practice of self-modifying code in an environment with dynamic recompilation - something that was fairly common in those days. I guess arcade machine ROM based programs would have less of that, so you are at an advantage there.

[quote]I guess that makes it an NVM (non-virtual machine)?
[/quote]
:slight_smile:
Well, the machine is still virtual as it’s software. The only difference is that the machine also exists for real.

[quote]I guess arcade machine ROM based programs would have less of that, so you are at an advantage there.
[/quote]
Yes, you could simply JIT only the code that is in ROM and always interpret code in RAM.
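The ROM/RAM split Erik describes can be sketched in a few lines. This is an illustrative Java sketch, not CottAGE code: the boundary address, names, and the stand-in interpreter are all assumptions.

```java
// Minimal sketch of "only JIT code in ROM": addresses below a hypothetical
// ROM_END may be replaced by compiled blocks, while anything in RAM is
// always interpreted, because RAM code could be self-modifying.
public final class DispatchSketch {
    static final int ROM_END = 0x8000;              // hypothetical ROM/RAM boundary
    interface CompiledBlock { int run(); }          // runs a translated window, returns next PC
    static final CompiledBlock[] blocks = new CompiledBlock[ROM_END];

    static int step(int pc) {
        if (pc < ROM_END && blocks[pc] != null) {
            return blocks[pc].run();                // ROM address with a translation: run it
        }
        return interpretOne(pc);                    // RAM (or untranslated ROM): interpret
    }

    // Stand-in for the real interpreter: just advances the PC.
    static int interpretOne(int pc) { return pc + 1; }

    public static void main(String[] args) {
        blocks[0x100] = () -> 0x200;                // pretend-compiled block at 0x100
        System.out.println(step(0x100));            // dispatched to the compiled block
        System.out.println(step(0x9000));           // RAM address, falls back to interpreter
    }
}
```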

Dear Erikd,
1943 simply rocks! ;D
We spent some time in my lab to re-play it :wink:
Very cool job 8)

Sounds like it will all get far too complex for a very very minimal gain…

Cas :slight_smile:

Well, with dynarec you get rid of loads of branches, and I think HotSpot will have a much easier job removing dead flags and doing other optimizations. I think it will make a difference, although the difference might not be as big as when doing it in C (static binary translation to C with dead flag removal has been reported to get as much as 25-30x faster than interpretation).
Maybe for the Z80 emulation it might not be too important on recent PCs, since a Z80 doesn’t run at high clock speeds anyway. But once it works, the same technique can be applied to 68k emulation, and there it will matter.
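As a toy illustration of the "dead flag" removal mentioned above (all names and the single-flag model are simplifications, not real Z80 semantics): an interpreter must compute flags after every instruction, while a translated straight-line block can skip flag updates that a later instruction overwrites before anything reads them.

```java
// Toy model of dead flag elimination in a translated block.
public final class DeadFlags {
    static int a, flags;

    // Naive per-instruction semantics: every ADD recomputes the flags.
    static void addNaive(int v) { a = (a + v) & 0xFF; flags = computeFlags(a); }

    // Translated "ADD 1; ADD 2" window: the first flag update is dead
    // (nothing reads it before the second ADD clobbers it), so only the
    // final flag computation survives.
    static void addAdd() { a = (a + 1 + 2) & 0xFF; flags = computeFlags(a); }

    static int computeFlags(int r) { return r == 0 ? 0x40 : 0; } // zero flag only

    public static void main(String[] args) {
        a = 0; addNaive(1); addNaive(2);
        int interpreted = a;
        a = 0; addAdd();
        System.out.println(interpreted == a); // same architectural result, fewer flag computations
    }
}
```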

There are of course more optimizations possible in other areas (mainly rendering), but I’m looking at that as well.

[quote]Sounds like it will all get far too complex for a very very minimal gain…
[/quote]
I don’t think you know what you’re talking about, Cas. You translate X number of impossible-to-optimize interpreter loops into a single optimizable function. That should be a huge win. I could easily see it running 10x faster. As far as it being “far too complex”, I almost have a working prototype ready.

God bless,
-Toby Reyelts

I sure don’t know :slight_smile: But for the 8-bit chips it’s certainly a wasted effort… it ain’t going to run any faster than 60Hz.

Cas :slight_smile:

Well, there are arcade machines which run as many as five Z80s simultaneously. If we can keep the system requirements down, that would be welcome.
And once we (well, Toby, that is :-)) get it working, the same technique can be applied to faster CPUs, where it will surely be important.

Cool. Who can explain the techniques in layman’s terms? I’ve always been massively interested in this stuff. I wrote my first VM over 20 years ago!

Cas :slight_smile:

[quote]Cool. Who can explain the techniques in layman’s terms?
[/quote]
Immediately after the ROM is loaded, I walk the Z80 machine instructions looking for a window of instructions uninterrupted by jumps. Once I have that window established, I generate Java bytecode equivalent to what the interpreter would run for the same sequence of instructions. I then replace the first machine instruction of that window in ROM with a “dynarecTrap” instruction, which has an opcode that falls outside the valid set of Z80 opcodes. The instruction contains a pointer to the Java bytecode I just generated. So, when the interpreter is running, if it sees a dynarecTrap, it runs the corresponding Java bytecode instead of interpreting the series of machine instructions it normally would.
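The dispatch side of this scheme can be sketched like so. This is an illustrative Java sketch of the trap idea, not Toby's actual code: the trap value, the block table, and the stand-in interpreter are assumptions. A real Z80 opcode fits in 0x00-0xFF, so 0x100 is safely outside the valid set.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of dynarecTrap dispatch: the interpreter loop checks for an
// out-of-range opcode planted in ROM and runs the translated block
// instead of interpreting the original window instruction by instruction.
public final class TrapSketch {
    static final int DYNAREC_TRAP = 0x100;          // outside the valid Z80 opcode range
    interface Block { int run(); }                  // translated window, returns next PC
    static final int[] memory = new int[0x10000];
    static final Map<Integer, Block> blocks = new HashMap<>();

    static int step(int pc) {
        int opcode = memory[pc];
        if (opcode == DYNAREC_TRAP) {
            return blocks.get(pc).run();            // jump into generated code
        }
        return interpret(opcode, pc);               // normal interpreter path
    }

    // Stand-in for the real opcode interpreter.
    static int interpret(int opcode, int pc) { return pc + 1; }

    public static void main(String[] args) {
        memory[0x40] = DYNAREC_TRAP;                // window starting at 0x40 was translated
        blocks.put(0x40, () -> 0x80);               // translated block ends the window at 0x80
        System.out.println(step(0x40));
    }
}
```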

This should optimize really well, because it takes a set of instructions that would each normally have required a dynamic branch to execute, and instead places them back to back, where, all of a sudden, you can perform local optimizations like inlining, dead flag detection, redundant instruction collapsing (i.e. multiple updates to the PC or cycle counter), etc…

God bless,
-Toby Reyelts