JAVA IS SLOW!

Now that I’ve got your attention, I hope you’ll stick around for the real topic.

I developed a fibonacci heap (based on the most venerable text “Introduction to Algorithms”), and it gets what you might normally think of as pretty decent performance. An average insert takes about 4 microseconds and an average extractMin takes about 12 microseconds. Compared to some other (good) Java fibonacci heaps out there, mine is about 25% faster.

The truth, though, is that performance is actually very slow. Our algorithm performs roughly 1 million inserts and extractMins and we expect to be able to do it in under a second. Obviously, that’s not possible with the current timings. I did nearly a line for line translation of my fibonacci heap from Java to C++ (not optimizing for C++), and my C++ fibonacci heap gets performance that is nearly ten times faster than its Java counterpart. Inserts happen in 0.26 microseconds and extractMins take anywhere from 0.6 to 3 microseconds depending upon the size of the heap. (Realistically it’ll be in the 0.6 microsecond range for us, because we’re going to have fairly small heap sizes.)

So the big question in my mind is WHY OH WHY? I’ve pored over and over my Java code, and I just can’t see anything that can be done to improve the performance. The scary thing is that the algorithm actually requires a significant amount of heap allocation and doesn’t do comparatively many array accesses, so it’s actually already playing to Java’s strengths.

I tested this with Sun’s 1.4.1 and 1.4.2 VMs with and without the server options. I’ve tried with different heap sizes and different garbage collection options and the numbers I give above are the best I could get. I compiled my C++ code under VC++ 7.1 with default optimizations.

So, I specifically challenge Jeff or Chris to examine my heap and either a) explain exactly how it can be rewritten to get the same performance, or b) get Sun’s VM team to fix the VM so that we can get the same kind of performance.

Currently, our only option is to ditch Java for C++.

God bless,
-Toby Reyelts

You might want to post the two bits of code (Java/C++) ?

Kev

Yeah, I second that; please post the code! :slight_smile:
Did you profile it? Maybe there’s something you didn’t think of that slows things down in Java.

Erik

[quote]You might want to post the two bits of code (Java/C++) ?
[/quote]

I received permission to post the Java code. Let’s just say that the C++ code looks mostly the same. It goes without saying that my company reserves the copyright to this code, so you should avoid copying it.

A couple of notes: 1) I stripped out the comments - the code was too long for the board. 2) Don’t worry about the asserts. 3) Performance numbers came from running inserts and extractMins a million times and taking the average. Best times are with -server and sufficient heap (i.e. 256M). Tests took place on a 2.4GHz PIV with 512M.

Thanks and God bless,
-Toby Reyelts

import java.util.*;

/**
 * A FibonacciHeap specialized for keys of type int.
 *
 * Copyright SSA Global 12/10/2003
 *
 * @author Toby Reyelts
 *
 */
public class IntFibonacciHeap<V> {

  // Used in max degree calculations
  private static final double PHI = ( 1.0d + Math.sqrt( 5 ) ) / 2.0d;
  private static final double LOG_PHI = Math.log( PHI );

  private Node<V> min; // Both the minimum node and the pointer to the root list
  private int size; // The total number of nodes in the tree

  // Performance statistics
  private double totalInsertTime;
  private double totalDecreaseKeyTime;
  private double totalExtractMinTime;
  private double totalInsertCalls;
  private double totalDecreaseKeyCalls;
  private double totalExtractMinCalls;
  private double largestSize;

  public IntFibonacciHeap() {}

  public static class Node<V> {

    public Node( int key, V value ) {

      assert( key >= 0 );

      this.key = key;
      this.value = value;
      right = this;
      left = this;
    }

    private int degree;
    private Node<V> parent;
    private Node<V> child; // only valid when degree is > 0
    private Node<V> right;
    private Node<V> left;
    private boolean mark;
    private boolean deleted;

    private int key;
    protected V value;

    public int key() {
      return key;
    }

    public V value() {
      return value;
    }

    public V setValue( V value ) {
      V temp = this.value;
      this.value = value;
      return temp;
    }

    public String toString() {
      return key + "";
    }

    public String debugString() {
      return key + " parent: " + parent + " right: " + right + " left: " + left + " degree: " + degree + " child: " + child;
    }
  }

  public void insert( Node<V> node ) {

    assert( ! node.deleted );

    ++totalInsertCalls;

    assert( node != null );
    assertNullParents();

    assert( node.parent == null );
    assert( node.child == null );
    assert( node.right == node );
    assert( node.left == node );
    assert( node.mark == false );
    assert( node.degree == 0 );

    if ( min == null ) {
      min = node;
      ++size;
      return;
    }

    // Insert the new node
    node.left = min;
    node.right = min.right;
    min.right.left = node;
    min.right = node;

    // Update the minimum if necessary
    if ( node.key < min.key ) {
      min = node;
    }

    ++size;
    assertNullParents();
  }

  public Node<V> min() {
    return min;
  }

  public Node<V> extractMin() {

    ++totalExtractMinCalls;

    if ( min == null ) {
      return null;
    }

    assertNullParents();

    Node<V> child = min.child;
    Node<V> currentMin = min;

    if ( min.degree > 0 ) {
      for ( int i = 0; i < min.degree; ++i ) {
        child.parent = null;
        child = child.right;
      }

      min.degree = 0;
      min.child = null;

      Node<V> rightOfMin = min.right;
      Node<V> leftOfChild = child.left;

      min.right = child;
      child.left  = min;
      leftOfChild.right = rightOfMin;
      rightOfMin.left  = leftOfChild;
    }

    if ( size == 1 ) {
      min = null;
    }
    else {
      min.right.left = min.left;
      min.left.right = min.right;
      min = min.right; // pick an arbitrary min - we have to recalculate it anyway
      consolidate();
    }

    --size;

    assertNullParents();

    currentMin.deleted = true;

    return currentMin;
  }

  public int size() {
    assert( size >= 0 );
    return size;
  }

  private void assertNullParents() {
    /*
    int i = 0;

    if ( min == null ) {
      return;
    }

    Node<V> temp = min;

    do {
      assert( temp.parent == null );
      temp = temp.right;
      ++i;
      assert( i <= size );
    }
    while ( temp != min );
    */
  }

  // Link together roots of equal degree until every root in the root list
  // has a distinct degree, then rebuild the root list and recompute min.
  private void consolidate() {

    assert( min != null );
    assert( min.parent == null );
    assertNullParents();
 
    int largestDegree = ( int ) Math.floor( Math.log( size ) / LOG_PHI );
    Node[] nodes = new Node[ largestDegree + 1 ];

    Node<V> x = min;

    while ( true ) {

      Node<V> current = x;
      assert( current.parent == null );

      Node<V> next = x.right;

      x.right.left = x.left;
      x.left.right = x.right;
      x.right = x;
      x.left = x;

      int degree = x.degree;

      while ( true ) {

        Node<V> y = ( Node<V> ) nodes[ degree ];

        if ( y == null ) {
          break;
        }

        assert( y.parent == null );

        if ( x.key > y.key ) {
          Node<V> temp = x;
          x = y;
          y = temp;
        }

        link( y, x );

        assert( x.parent == null );

        nodes[ degree ] = null;
        ++degree;
      }

      assert( x.parent == null );

      nodes[ degree ] = x;

      if ( next == current ) {
        break;
      }

      x = next;
    }

    min = null;
    int length = nodes.length;

    for ( int i = 0; i < length; ++i ) {

      Node<V> node = nodes[ i ];

      if ( node != null ) {

        assert( node.parent == null );

        if ( min == null ) {
          min = node;
          continue;
        }

        node.left = min;
        node.right = min.right;
        min.right.left = node;
        min.right = node;

        if ( node.key < min.key ) {
          min = node;
        }
      }
    }
  }

  // Makes y a child of x and clears y's mark. (y has already been removed
  // from the root list by consolidate.)
  private void link( Node<V> y, Node<V> x ) {

    assert( y != null );
    assert( x != null );
    assert( y.parent == null );
    assert( x.parent == null );

    y.parent = x;

    if ( x.degree == 0 ) {
      x.child = y;
      y.right = y;
      y.left = y;
    }
    else {
      y.left = x.child;
      y.right = x.child.right;
      x.child.right.left = y;
      x.child.right = y;
    }

    ++x.degree;
    y.mark = false;
  }

  public void decreaseKey( Node<V> node, int key ) {

    assert ( ! node.deleted );

    ++totalDecreaseKeyCalls;

    assert( node.key >= key );
    assertNullParents();

    node.key = key;

    Node<V> parent = node.parent;

    if ( parent != null && ( key < parent.key ) ) {
      cut( node, parent );
      cascadingCut( parent );
    }

    if ( key < min.key ) {
      min = node;
    }

    assertNullParents();
  }

  public void delete( Node<V> node ) {

    assert ( ! node.deleted );

    assertNullParents();

    Node<V> parent = node.parent;

    if ( parent != null ) {
      cut( node, parent );
      cascadingCut( parent );
    }

    min = node;

    extractMin();

    assertNullParents();
  }

  // Removes child from parent's child list, adds it to the root list,
  // and clears its mark.
  private void cut( Node<V> child, Node<V> parent ) {

    assertNullParents();

    child.parent = null;
    --parent.degree;
    
    if ( parent.degree == 0 ) {
      parent.child = null;
    }
    else {
      parent.child = child.right;
      child.right.left = child.left;
      child.left.right = child.right;
    }

    child.left = min;
    child.right = min.right;
    min.right.left = child;
    min.right = child;

    child.mark = false;

    assertNullParents();
  }

  // Marks node if it is unmarked; otherwise cuts it from its parent and
  // recurses up the tree.
  private void cascadingCut( Node<V> node ) {

    Node<V> parent = node.parent;
    
    if ( parent == null ) {
      return;
    }

    if ( ! node.mark ) {
      node.mark = true;
    }
    else {
      cut( node, parent );
      cascadingCut( parent );
    }
  }

  public void print( Node<V> node, boolean debug ) {

    if ( node == null ) {
      return;
    }

    Node<V> begin = node;
    Node<V> current = node;

    while ( true ) {
      if ( debug ) {
        System.out.println( current.debugString() );
      }
      else {
        System.out.println( current );
      }
      if ( current.degree > 0 ) {
        print( current.child, debug );
      }
      current = current.right;
      if ( current == begin ) {
        return;
      }
    }
  }

  public void print( boolean debug ) {
    print( min, debug );
  }

  public double totalInsertTime() {
    return totalInsertTime;
  }

  public double totalExtractMinTime() {
    return totalExtractMinTime;
  }

  public double totalDecreaseKeyTime() {
    return totalDecreaseKeyTime;
  }

  public double totalInsertCalls() {
    return totalInsertCalls;
  }

  public double totalExtractMinCalls() {
    return totalExtractMinCalls;
  }

  public double totalDecreaseKeyCalls() {
    return totalDecreaseKeyCalls;
  }

  public double largestSize() {
    return largestSize;
  }

  public static void main( String[] args ) {
    IntFibonacciHeap<String> heap = new IntFibonacciHeap<String>();
    int[] ints = new int[] { 23, 7, 3, 3, 3, 14, 14, 7, 17, 30, 24, 26, 46, 35 };
    Node[] nodes = new Node[ ints.length ];

    for ( int i = 0; i < ints.length; ++i ) {
      Node<String> node = new Node<String>( ints[ i ], "" + i );
      heap.insert( node );
      nodes[ i ] = node;
    }

    while ( heap.size() > 0 ) {
      System.out.println( "Heap min: " + heap.extractMin() );
    }
  }
}

Errr… that would appear to be JDK1.5 code? Which we can’t compile?

Cas :slight_smile:

[quote]Errr… that would appear to be JDK1.5 code? Which we can’t compile?
[/quote]
It’s JDK1.4+ code. You can compile and run this with Sun’s freely available “adding generics” prototype (version 1.2), which is just an add-on to the 1.4 compiler. Alternatively, you can join the CAP program and compile and run this with the 1.5 VM. Finally, you could just take the 60 seconds to get rid of the parameterization. It doesn’t have any effect on the running time, because the heap doesn’t mess with the values.

God bless,
-Toby Reyelts

[quote]Check out the free, open-source, JNI toolkit, Jace - http://jace.reyelts.com/jace
[/quote]
I would, but your dyndns account seems broken as it won’t resolve reyelts.dyndns.org.

Endolf

[quote]I would, but your dyndns account seems broken as it won’t resolve reyelts.dyndns.org
[/quote]

Lol. My server just started suffering spontaneous rebooting fits, so the domain is down for a few days while I put together a new box. The project is also hosted at sourceforge - http://sf.net/projects/jace

God bless,
-Toby Reyelts

[quote]The truth, though, is that performance is actually very slow. Our algorithm performs roughly 1 million inserts and extractMins and we expect to be able to do it in under a second. Obviously, that’s not possible with the current timings. I did nearly a line for line translation of my fibonacci heap from Java to C++ (not optimizing for C++),
[/quote]
Hmm. It may still be code that is better suited to C++; a reworking of the algorithm might make the Java fly.

But before we even go there, let’s ask some basic Java microbenchmarking questions:

(1) Did you run this with the -Xcomp flag to force it to compile all the methods right away?

(2) How long did you run this test for? Even with -Xcomp you are likely, I think, to pay some start-up penalty. When benchmarking on the JDK team we always did the following:
(a) NEVER ran a benchmark that ran for less than about 60 seconds.
(b) ALWAYS ran the benchmark 5 times in succession from within the same Java program.
(c) THREW OUT the entire first run.

Unless you are doing these things it’s quite possible that you aren’t warming up the VM and/or are allowing the one-time cost of compilation to have an unreasonably large effect on your final numbers.
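
For what it’s worth, a harness along the lines below follows those three rules. It is only a rough sketch, not Toby’s actual test: the HeapBenchmark class name, the 60-second cutoff, the heap size, and the synthetic keys are all made up for illustration, and it assumes the parameterization has been stripped so the posted IntFibonacciHeap compiles under 1.4 (raw Node types).


public class HeapBenchmark {

  static final int PASSES = 5;

  public static void main( String[] args ) {
    for ( int pass = 0; pass < PASSES; ++pass ) {
      long start = System.currentTimeMillis();
      long operations = runOnce();
      long elapsed = System.currentTimeMillis() - start;
      // Throw out the first pass: it includes class loading and JIT compilation.
      if ( pass == 0 ) {
        System.out.println( "warm-up: " + elapsed + " ms (ignored)" );
      }
      else {
        System.out.println( "pass " + pass + ": " + elapsed + " ms, "
            + ( elapsed * 1000.0 / operations ) + " us/op" );
      }
    }
  }

  // Keep working for at least 60 seconds and return the number of operations performed.
  static long runOnce() {
    long operations = 0;
    long start = System.currentTimeMillis();
    while ( System.currentTimeMillis() - start < 60 * 1000L ) {
      IntFibonacciHeap heap = new IntFibonacciHeap();
      for ( int i = 0; i < 1000000; ++i ) {
        heap.insert( new IntFibonacciHeap.Node( i & 1023, null ) );
      }
      while ( heap.size() > 0 ) {
        heap.extractMin();
      }
      operations += 2000000; // one million inserts plus one million extractMins
    }
    return operations;
  }
}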

Toby,
How did you measure exactly ?
1 Million times insert/extract or 1 Million times
for ( int i = 0; i < ints.length; ++i )
insert()
while ( heap.size() > 0 )
extract()
?
or something else ?
The results seem to be pretty slow indeed. Did you try without the generics because the generics prototype may be slowing down everything ? There is nothing suspect in the code AFAIK.

Well, following the general guidelines above, I ran the following benchmark:


  public static void benchmark() {
        
        int[] fn = new int[] { 23, 7, 3, 3, 14, 17, 30, 24, 26, 46, 35 };
        int[] ints = new int[1000000]; //{ 23, 7, 3, 3, 3, 14, 14, 7, 17, 30, 24, 26, 46, 35 };
        for (int i = 0; i < ints.length; i ++) {
              ints[i] = fn[Util.random(0, fn.length - 1)];
        }
        long then = Sys.getTime();
        IntFibonacciHeap heap = new IntFibonacciHeap();
        Node[] nodes = new Node[ ints.length ]; 
        
        for ( int i = 0; i < ints.length; ++i ) { 
              Node node = new Node( ints[ i ], String.valueOf(i)); 
              heap.insert( node ); 
              nodes[ i ] = node; 
        } 
        
        Object[] mins = new Object[heap.size()];
        int count = 0;
        while ( heap.size() > 0 ) {
              mins[count ++] = heap.extractMin().value(); 
        }
        
        System.out.println("Time: "+((float)(Sys.getTime() - then) / (float)Sys.getTimerResolution()));
        int sum = 0;
        for (int i = 0; i < mins.length; i ++) {
              sum += Integer.parseInt((String) mins[i]);
        }
        System.out.println("Sum: "+sum);
        
  }

On a dual P3-1GHz using the following parameters:


-server -Xcomp -Xmx128m -Xms128m

Using the LWJGL hi-res timer, I got an average time of 9.7s over a 1min30s run. A vanilla run with just the client VM and no params got me an average of 11.5s.

Is this too slow by a factor of 10 then? What does the C++ one manage?

Cas :slight_smile:

Princec,
You are also measuring the creation of 1,000,000 node objects; you should move the creation of those nodes (and the associated strings) out of the insert() loop.
Also, my understanding is that Toby’s heaps remain small, but yours grows to 1M nodes.
Toby,
Have you tried the profiler (-Xrunhprof:cpu=times) to measure the time spent in your methods ?
With my Athlon 1.4GHz and JVM 1.4.2-beta-b19, I get (no optimization options):

CPU TIME (ms) BEGIN (total = 36706) Thu Dec 11 13:15:37 2003
rank self accum count trace method
1 13.62% 13.62% 1 27 test.IntFibonacciHeap.main
2 11.45% 25.07% 700000 30 java.lang.String.concat
3 9.96% 35.03% 1400000 22 java.lang.String.getChars
4 9.58% 44.61% 700000 3 test.IntFibonacciHeap.insert

That means an insert takes around 5 microseconds on my box.
With a longer test and Princec’s JVM options, it is around 4.8 microseconds.

I included the creation time of the nodes (and of the heap instance itself) because it doesn’t work without them being created, and the C++ version would do something similar (although maybe not on the heap).

I’d like to see the C++ comparison now doing exactly the same thing. I’m certain C++ will be at least twice as fast, but I’d be worried if it’s 10 times as fast :confused:

Cas :slight_smile:

Actually, I do not know whether Toby’s figures include node creation or not. But if you include the creation of the nodes, most of the time is spent there and not in the insert() method.

[quote]Is this too slow by a factor of 10 then? What does the C++ one manage?
[/quote]
Yes. That is too slow by about a factor of 10.

God bless,
-Toby Reyelts

[quote]Actually I do not know if Toby’s figures include node creation or not. But, if you include the creation of the nodes, most of the time is spent in there and not in the insert() method.
[/quote]

  1. He should include node creation. It’s a required part of the insert. Most data structures (like HashMap) hide this inside the insert, but with a fibonacci heap you have to give that Node back to the client anyway, so I just make the allocation the responsibility of the client.

  2. The heap instance creation is just noise compared to a million inserts.

  3. The String.valueOf really shouldn’t be part of that though.

I decided to break down and use an actual commercial profiler to see what it would tell me. JProfiler tells me that, on average, the inserts are taking 4 microseconds, and the extractMins are taking 70 microseconds. (I believe the maximum heap size has changed significantly due to some changes we made to the algorithm.) It also says that the fibonacci heap is responsible for 80% of the running time of the main algorithm.

When I get finished translating the algorithm to C++ (soon), I’ll run some statistics on it and see what kind of results I get.

God bless,
-Toby Reyelts

In my case, the tests show that roughly 75% is due to the node creation, not the insertion per se.
If pre-creating the nodes is an option, then you will see a big increase in speed.
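
For example, princec’s benchmark could be restructured along these lines (just a sketch, using the same raw-type setup as his code; the key pattern and the heap size are arbitrary):


  public static void benchmarkPrecreated() {

    // Pre-create the nodes outside the timed region so that only
    // insert()/extractMin() are measured, not object allocation.
    Node[] nodes = new Node[ 1000000 ];
    for ( int i = 0; i < nodes.length; ++i ) {
      nodes[ i ] = new Node( i % 50, null ); // arbitrary small keys, no String values
    }

    long then = System.currentTimeMillis();

    IntFibonacciHeap heap = new IntFibonacciHeap();
    for ( int i = 0; i < nodes.length; ++i ) {
      heap.insert( nodes[ i ] );
    }
    while ( heap.size() > 0 ) {
      heap.extractMin();
    }

    System.out.println( "Time: " + ( System.currentTimeMillis() - then ) + " ms" );
  }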

I suspect that the fibonacci heap (or at least your implementation of it) is not well suited to Java. Try implementing a basic binary heap instead. The array used in the consolidate method is probably very cheap in C++ (stack allocated), but costs quite a bit more in Java.
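
To make the suggestion concrete, a bare-bones array-based binary min-heap of int keys might look something like the sketch below. It is not a drop-in replacement for the Fibonacci heap (no decreaseKey, no attached values, and the caller must check size() before extracting); it just illustrates how little allocation the array-based structure needs.


public class IntBinaryHeap {

  private int[] keys = new int[ 16 ];
  private int size;

  public void insert( int key ) {
    if ( size == keys.length ) {
      // Grow the backing array when it is full.
      int[] bigger = new int[ keys.length * 2 ];
      System.arraycopy( keys, 0, bigger, 0, size );
      keys = bigger;
    }
    // Sift the new key up towards the root.
    int i = size++;
    while ( i > 0 ) {
      int parent = ( i - 1 ) >> 1;
      if ( keys[ parent ] <= key ) {
        break;
      }
      keys[ i ] = keys[ parent ];
      i = parent;
    }
    keys[ i ] = key;
  }

  // Assumes size() > 0.
  public int extractMin() {
    int min = keys[ 0 ];
    int last = keys[ --size ];
    // Sift the displaced last key down from the root.
    int i = 0;
    while ( true ) {
      int child = 2 * i + 1;
      if ( child >= size ) {
        break;
      }
      if ( child + 1 < size && keys[ child + 1 ] < keys[ child ] ) {
        ++child;
      }
      if ( last <= keys[ child ] ) {
        break;
      }
      keys[ i ] = keys[ child ];
      i = child;
    }
    keys[ i ] = last;
    return min;
  }

  public int size() {
    return size;
  }
}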

Macroscopically, you want to make all your methods public static final.

Microscopically, classes in Java are expensive. If you can come up with a tricky way to store your data in an array and then just manipulate integer indices into that array, you will get very high performance.

Writing fast Java code reminds me of writing fast C code: avoid objects!
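
A sketch of that idea: keep the node fields in parallel int arrays and use an int index wherever you would have used an object reference. The class and method names here are made up for illustration; a Fibonacci heap’s fields (key, left, right, parent, child, degree, mark) could be laid out the same way.


public class IntNodePool {

  private final int[] key;
  private final int[] left;
  private final int[] right;
  private int next; // next unused slot; capacity is fixed at construction

  public IntNodePool( int capacity ) {
    key = new int[ capacity ];
    left = new int[ capacity ];
    right = new int[ capacity ];
  }

  // "Allocate" a node: hand back the next free index, initialized as a
  // one-element circular list.
  public int newNode( int k ) {
    int n = next++;
    key[ n ] = k;
    left[ n ] = n;
    right[ n ] = n;
    return n;
  }

  // Splice node b into the circular list immediately to the right of node a.
  public void insertRight( int a, int b ) {
    right[ b ] = right[ a ];
    left[ b ] = a;
    left[ right[ a ] ] = b;
    right[ a ] = b;
  }

  public int key( int n ) {
    return key[ n ];
  }
}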

Ha! I remember the time when you also had to “avoid methods” for optimal performance, particularly methods in other classes or objects; I had a fractal generator that showed a considerable speed boost when all of its several thousand lines of code were placed in one class. I can’t remember if I had the courage to manually inline the few recursive methods too (I don’t think so; it would have made the code seriously painful to change…).

Thankfully there’s normally no difference now between method calls to the current object and to other objects…