In case anyone is interested, JumpTables made it into Java6 B55 (should be available wednesday or thursday). If you have large switch statements this should help a bit. Give it shot and report any bugs. Thanks.
If you have switch statements so large that you’re required to optimise them than your code is broken and you should fix it.
Switch statements are jump tables.
All properly written Java code will not see a performance increase with this “feature” in Mustang.
Hm… maybe with a HashMap<Integer, Runnable> ? Better to roll your own hashmap implementation then, to support <int, Runnable> (without autoboxing). You’d safe yourself from a massive jump-table, but the Runnables might slow you don’t a bit. I suspect it aint worth the effort for JEmu2
This approach will lead to 2 performance problems: Bounds checking and indirect function calls through an interface.
I even experimented with an array of interface references so that I can do opcodes[opcode].exec(); but even this is significantly slower than a switch.
I think you’re correct that jumptables have been there all along, at least at a bytecode level. There’s even an instruction dedicated to this (tableswitch). Maybe what Azeem is refering to is that the tableswitch bytecode instruction is now also JIT’ed to a native jumptable?
Correct, the bytecodes generated from a switch statement can either be a tableswitch or a lookupswitch (depending on size, sparseness,etc). The previous implementation would create a binary search tree to calculate which branch of the switch statement to take. The Jumptable implementation creates a table in the constant section of the generated code, and in most cases its just a load address of the base of the table, calculate the index and jump to the new destination. The larger the table, the faster the implementation.
I just gave it a try (Java HotSpot™ Client VM (build 1.6.0-ea-b56, mixed mode):
On JEmu2, with Z80 based emulators I see a nice overal performance boost of almost 6%. Pretty impressive
With one MC68000 based emulator I see an overal performance gain of about 9-10% :D, but in another MC68000 emulator, I see a tiny performance degredation of about 2% :-.
However, the latter emulator spends much less time in large switches (more time is spent in video emulation compared to CPU emulation), so I guess this little loss has to do with something else. I’ll try to put my finger on the cause of this degradation (really hard to tell at this point).
But overall, I’m impressed!
EDIT: BTW, I measured against 1.4.2_03 client VM. I’ll measure against a 1.5 later.
3rd year now.
What does that have to do with jump tables?
I believe they are completely unreasonable even when you want to parse large amounts of data.
Just because I can’t think of something better at this time doesn’t mean a better way doesn’t exist.
[quote]I believe they are completely unreasonable even when you want to parse large amounts of data.
Just because I can’t think of something better at this time doesn’t mean a better way doesn’t exist.
[/quote]
Maybe you should first think of a reason why jump tables are unreasonable before claiming that my code is broken.
As long as you can’t think of something, I believe you are wrong. Remember we’re talking about raw performance here, which is not always the same as pretty code.
[quote]My memory (back since 1.0.x/1.1.x days) is that you could get something like up to 16 int before it switched to the less direct impl?
[/quote]
I remember that when you have a properly ordered switch, you get a tableswitch (even if it’s quite large), otherwise you get a lookup switch.
Thanks! You can play with the flags to change how large the switch statement has to be before Jumptables are used. The flag is -XX:MinJumpTableSize=
The default is at 16, you can try smaller but I’ve found 16 to be about the right size.
while (cyclesLeft > 0) {
int opcode = fetch();
execute(opcode);
}
fetch() just reads an opcode from memory.
Nothing to optimize here, for example:
bus.readByte(PC++);
As I see it, I have 2 options for implementing execute():
1)
switch(opcode) {
case OPCODE_0: opcodeNOP(); break;
case OPCODE_1: opcodeADD(registerA, registerB); break;
etc.
}
A pre populated array with function pointers to implementations of the different opcodes:
opcodes[opcode].exec();
I don’t have to search through the array at all because I know in advance where everything is, so here is nothing to optimize.
The 2nd option is the nice one, but the problem with 2) is bounds checking and indirect calls.
In my experience a switch is much faster.
switch (i) {
case 1:
case 999:
case 12121212:
default:
}
In such cases, binary branches are probably the best solution.
b) Degenerated cases like 2-4 switch targets, speculative path on CPU can cover most of the paths, depending on architecture, while having a jump to address taken from memory location is bound to be 100% stopping any kind of lookahead/pipelining.