How accurate is this ?

zingbat · June 8, 2004, 7:54pm

I made a little benchmark program, and since I'm a bit of a rookie at this I have some doubts about how to interpret the results.

The program is this:

public class Tests {

public static void main(String args[]) {
    System.out.println("\ntest0 - currentTimeMillis precision");
    test0(); test0();
    System.out.println("\ntest1 - empty cycle");
    test1(); test1(); test1(); test1();
    System.out.println("\ntest2 - local mul");
    test2(); test2(); test2(); test2();
    System.out.println("\ntest3 - static mul");
    test3(); test3(); test3(); test3();
    System.out.println("\ntest4 BC*4 - instance variable access");
    test4(); test4(); test4(); test4();
    System.out.println("\ntest5 - instance variable mul");
    test5(); test5(); test5(); test5();
    System.out.println("\ntest6 BC/20 - instance creation");
    test6(); test6(); test6(); test6();
    test6(); test6(); test6(); test6();
}

// A benchmark cycle
public static int BC = 10000000;

// the precision is P
public static long P = 0;

public static void test0() {
    // reset P
    P = 0;
    // indexes 0 and last are to be discarded later
    int TIME_ARRAY = 7;
    long n[] = new long[TIME_ARRAY];
    n[0] = System.currentTimeMillis();
    for (int i=1; i < TIME_ARRAY; i++) {
        while (n[i-1] == System.currentTimeMillis()) ;
        n[i] = System.currentTimeMillis();
    }
    // Only use indexes from 1 to preceding last index
    // (remember valid indexes go from 0 to TIME_ARRAY-1)
    for (int i=1; i < TIME_ARRAY-1; i++) {
        long x = n[i]-n[i-1];
        P += x;
        System.out.print(x + " ");
    }
    // average it
    System.out.println("\nTotal: " + P);
    P /= TIME_ARRAY-2;
    System.out.println("Precision: " + P);
}

public static void test1() {
    long start = System.currentTimeMillis();
    for (int i=0; i < BC; i++) ;
    long end = System.currentTimeMillis();
    long frame = (end - start);
    System.out.print(frame + "-" + frame/P + " ");
}

public static void test2() {
    int x=0;
    long start = System.currentTimeMillis();
    for (int i=0; i < BC; i++) x = x * 2;
    long end = System.currentTimeMillis();
    long frame = (end - start);
    System.out.print(frame + "-" + frame/P + " ");
}

public static int x;

public static void test3() {
    long start = System.currentTimeMillis();
    for (int i=0; i < BC; i++) x = x * 2;
    long end = System.currentTimeMillis();
    long frame = (end - start);
    System.out.print(frame + "-" + frame/P + " ");
}

public static void test4() {
    long start = System.currentTimeMillis();
    CTest ct = new CTest();
    ct.self = ct;
    Object ref = null;
    for (int i=0; i < BC; i++) ref = ct.self.self.self.self;
    long end = System.currentTimeMillis();
    long frame = (end - start);
    System.out.print(frame + "-" + frame/P + " ");
}

public static void test5() {
    long start = System.currentTimeMillis();
    CTest ct = new CTest();
    for (int i=0; i < BC; i++) ct.x = ct.x * 2;
    long end = System.currentTimeMillis();
    long frame = (end - start);
    System.out.print(frame + "-" + frame/P + " ");
}

public static void test6() {
    long start = System.currentTimeMillis();
    CTest ct = null;
    Object ref = null;
    for (int i=0; i < BC/20; i++) ref = new CTest();
    long end = System.currentTimeMillis();
    long frame = (end - start);
    System.out.print(frame + "-" + frame/P + " ");
}

}

class CTest {
public int x;
public CTest self;
}

And this is the result on a 1 GHz CPU running JDK 1.4.2:

test0 - currentTimeMillis precision
10 10 10 10 10
Total: 50
Precision: 10
10 10 10 10 10
Total: 50
Precision: 10

test1 - empty cycle
40-4 40-4 40-4 30-3
test2 - local mul
60-6 61-6 60-6 70-7
test3 - static mul
50-5 60-6 50-5 60-6
test4 BC*4 - instance variable access
110-11 110-11 121-12 120-12
test5 - instance variable mul
50-5 50-5 50-5 60-6
test6 BC/20 - instance creation
60-6 60-6 40-4 50-5 40-4 50-5 51-5 40-4

According to the results, one multiplication on my machine takes about 5 ns (50 ms / BC), and allocating memory for an object takes 20 times longer, that is, 100 ns (50 ms / (BC/20)).

Is this a correct interpretation?

Note: in values like 40-4 above, 40 is the time in milliseconds taken to perform BC operations and 4 is the result of dividing it by P.

mthornton · June 10, 2004, 8:04am

Lots of scope for getting meaningless results in there. The multiplication by two may be transformed to a shift. Some of the loops might be eliminated altogether. Etc, etc.
Generally your tests are just too small.
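
A small illustration of the first two points, as a sketch rather than anything taken from the original program (the class and method names are made up). For Java ints, x * 2 and x << 1 always give the same value, overflow included, so a JIT is free to swap one for the other. Also, in test2 x starts at 0, so the loop only ever computes 0 * 2:

```java
// Hypothetical illustration; ShiftVersusMul and its methods are not part of the original program.
class ShiftVersusMul {

    // True if multiplying by two and shifting left by one agree for this value.
    static boolean sameResult(int x) {
        return x * 2 == (x << 1);
    }

    // Same loop body as test2, but the final x is returned so it can be inspected.
    static int runTest2Body(int iterations) {
        int x = 0;
        for (int i = 0; i < iterations; i++) x = x * 2;
        return x;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int i = 0; i < samples.length; i++) {
            System.out.println(samples[i] + ": " + sameResult(samples[i]));
        }
        // x never leaves 0, whatever the iteration count.
        System.out.println("x after loop: " + runTest2Body(10000000));
    }
}
```

Since the final value of x is known without running the loop at all, a compiler is not obliged to keep the loop.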

zingbat · June 10, 2004, 10:51am

What do you mean, the tests are too small? Each test runs a loop that iterates 10000000 times, and each loop is usually sampled 4 times to get an average.

I calculated the time a multiplication takes (what I was measuring) by subtracting the time an empty cycle takes from the time a cycle performing multiplications takes:

60 - 40 = 20 millisecs

Which gives 20 millisecs to perform BC multiplications, now dividing this by BC

20000000 nanosecs / BC = 2 nanosecs

Since my Duron runs at 1 GHz, it should manage something like 2 instructions per cycle with pipelining (not sure about this), and each cycle takes about 1 nanosec (1 / 1 GHz).

This is the confusing part: it is said that a Java program takes almost the same time as a C++ program, but 2 ns for a multiplication seems like a lot when in that time it should have performed 4 multiplications.

Is this correct, or is there some obscure side effect that invalidates this calculation? I'm not an expert in this or anything like it. In fact, this is my first benchmarking program.
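
The arithmetic in this post can be written out as a small sketch. The helper names (nanosPerOp, cyclesPerOp) are made up for illustration; the inputs are the numbers from the results posted earlier, and the 1 GHz clock is the figure quoted for the Duron.

```java
// Hypothetical helper; it only restates the calculation from the post.
class PerOpEstimate {

    // (loop time - baseline time) in milliseconds, converted to nanoseconds per iteration.
    static double nanosPerOp(long loopMillis, long baselineMillis, long iterations) {
        return (loopMillis - baselineMillis) * 1000000.0 / iterations;
    }

    // Nanoseconds per operation expressed in clock cycles at the given clock rate.
    static double cyclesPerOp(double nanosPerOp, double clockGHz) {
        return nanosPerOp * clockGHz;
    }

    public static void main(String[] args) {
        long bc = 10000000L;                    // BC from the benchmark
        double mul = nanosPerOp(60, 40, bc);    // test2 minus test1
        double alloc = nanosPerOp(50, 0, bc / 20); // test6, no baseline subtracted
        System.out.println("multiplication: " + mul + " ns, "
                + cyclesPerOp(mul, 1.0) + " cycles at 1 GHz");
        System.out.println("allocation: " + alloc + " ns");
    }
}
```

Note that test0 reported a timer step of 10 ms, so the 20 ms difference is only two timer ticks; the 2 ns figure could easily be off by a nanosecond or so from that alone.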

mthornton · June 10, 2004, 11:11am

It isn’t a matter of the loop repetition count being too small; rather, the content of the loop is too trivial and unrepresentative of real code. You can’t compute the time for individual primitive operations and then add them together to get any meaningful result. (What was the last common processor for which that might have had some value? x386?)

Modern processors can perform a number of operations in parallel, so attempting to time a single multiplication (even if repeated in a loop) just doesn’t represent a real situation. Depending on the JVM in use, it may not even do a multiplication at all, but just replace it with a shift. A more realistic test might be computing the dot product of two vectors. You also need to watch out for pipeline effects: although the CPU may be able to start a multiplication every clock cycle, the result is not available until many cycles later.
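
A dot product test along those lines might look like the sketch below. The class name, vector length and repetition count are arbitrary choices for illustration, and it sticks to System.currentTimeMillis() like the original program. The result of every call is added to a checksum that is printed at the end, so the work cannot simply be discarded.

```java
// Sketch only; names and sizes are arbitrary, not taken from the thread.
class DotProductTiming {

    // Plain dot product of two equal-length vectors.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    public static void main(String[] args) {
        int n = 1000;        // vector length
        int reps = 100000;   // dot products per timed round
        double[] a = new double[n];
        double[] b = new double[n];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0; }

        double checksum = 0.0;
        // Several rounds, so later rounds run after the JIT has had a chance to compile dot().
        for (int round = 0; round < 5; round++) {
            long start = System.currentTimeMillis();
            for (int r = 0; r < reps; r++) checksum += dot(a, b);
            long frame = System.currentTimeMillis() - start;
            double nsPerStep = frame * 1000000.0 / ((double) reps * n);
            System.out.println("round " + round + ": " + frame + " ms, "
                    + nsPerStep + " ns per multiply-add");
        }
        System.out.println("checksum: " + checksum);
    }
}
```

The figure it prints is the cost of one multiply-add inside this particular loop, including loads and loop overhead, not the cost of a multiplication in isolation.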

However, it is far better to test significant fragments of real code (code that performs some useful operation). This way it is easier to interpret the result without having to allow for all the complications mentioned above. So instead of trying to time a multiplication and an addition, time the calculation of an FFT over, say, 512 points.
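
As a rough sketch of what timing a 512-point FFT could look like: the code below uses a textbook in-place radix-2 FFT as the workload. The class name, the repetition count and the input signal are arbitrary choices for illustration, and refilling the input array is included in the measured time.

```java
// Sketch only: a plain radix-2 FFT used as a benchmark workload.
class FftTiming {

    // In-place iterative radix-2 FFT; re.length must be a power of two.
    static void fft(double[] re, double[] im) {
        int n = re.length;
        // Reorder the input into bit-reversed index order.
        for (int i = 1, j = 0; i < n; i++) {
            int bit = n >> 1;
            for (; (j & bit) != 0; bit >>= 1) j ^= bit;
            j ^= bit;
            if (i < j) {
                double t = re[i]; re[i] = re[j]; re[j] = t;
                t = im[i]; im[i] = im[j]; im[j] = t;
            }
        }
        // Butterfly passes over blocks of length 2, 4, ..., n.
        for (int len = 2; len <= n; len <<= 1) {
            double angle = -2.0 * Math.PI / len;
            double stepRe = Math.cos(angle), stepIm = Math.sin(angle);
            for (int start = 0; start < n; start += len) {
                double wRe = 1.0, wIm = 0.0;   // current twiddle factor
                for (int k = 0; k < len / 2; k++) {
                    int a = start + k, b = a + len / 2;
                    double tRe = re[b] * wRe - im[b] * wIm;
                    double tIm = re[b] * wIm + im[b] * wRe;
                    re[b] = re[a] - tRe; im[b] = im[a] - tIm;
                    re[a] += tRe;        im[a] += tIm;
                    double next = wRe * stepRe - wIm * stepIm;
                    wIm = wRe * stepIm + wIm * stepRe;
                    wRe = next;
                }
            }
        }
    }

    public static void main(String[] args) {
        int n = 512, reps = 20000;
        double[] re = new double[n], im = new double[n];
        double checksum = 0.0;
        for (int round = 0; round < 5; round++) {
            long start = System.currentTimeMillis();
            for (int r = 0; r < reps; r++) {
                // Refill the input each time, since the transform overwrites it.
                for (int i = 0; i < n; i++) { re[i] = (i * 7 + r) % 13; im[i] = 0.0; }
                fft(re, im);
                checksum += re[1];   // keep a result live
            }
            long frame = System.currentTimeMillis() - start;
            System.out.println("round " + round + ": " + frame + " ms for " + reps + " FFTs");
        }
        System.out.println("checksum: " + checksum);
    }
}
```

The number to look at is the time per whole FFT in the later rounds; trying to split it back into per-multiplication costs would run into the same problems described above.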

blahblahblahh · June 10, 2004, 11:16am

Click on the “Search” link above and look for the words “micro” and “benchmark” by the user “jeffk”. ;D You will find a detailed analysis and commentary on the use and design of java benchmarks like this, and all the information on why what you are doing is actually a complete waste of time.

(if the search doesn’t work because this software is rubbish, make sure you entered “600” for the number of days to search backwards; if it still doesn’t work, try google and instead of “jeffk” use the terms “cats” and “yabb” which should get you his posts.)

mthornton · June 10, 2004, 12:26pm

These are more nano benchmarks than micro benchmarks and accordingly are even more useless. Current compiler and processor technology is just too complex to be usefully analysed in this way.

zingbat · June 10, 2004, 7:43pm

Uh? Okay, so micro or nano benchmarks are useless for testing complex applications. That's not the point of this thread.

Do you think that the above test is a correct way to test the speed of Java multiplication on my computer, with the software I have installed? I'm not looking for a universal benchmark, I'm just trying to understand the results I obtained. It certainly can't be correct, because it says that one multiplication in Java takes 4 times longer than my computer should potentially need. Even considering multi-tasking, and that HotSpot cannot compete with a C++ program, it's strange to get this result.

I'm interested in a logical explanation an engineer would provide, not a popular opinion. But thanks anyway for trying to help.

Jacko · June 10, 2004, 9:21pm

ok so we all agree that micro benchmarks don't mean anything. BUT, does anyone here really say that Java can compare to C++ when you want balls out pure performance code?

blahblahblahh · June 10, 2004, 9:38pm

[quote]ok so we all agree that micro benchmarks don't mean anything. BUT, does anyone here really say that Java can compare to C++ when you want balls out pure performance code?
[/quote]
Sigh. Again, do the search I described. Of course java is the same performance as C++ - they’re the same damn code! There’s nothing more to it than that.

Well, actually, there are exceptions :).

swpalmer · June 10, 2004, 10:18pm

[quote]Do you think that the above test is a correct way to test the speed of Java multiplication on my computer, with the software I have installed?
[/quote]
NO

[quote]I'm interested in a logical explanation an engineer would provide, not a popular opinion. But thanks anyway for trying to help.
[/quote]
Compilers and processors are smart. Your benchmarks are not actually doing any real work; in some cases the compiler can replace your loop with a NOP. If the computed results don’t actually require the loop to run, then how can you be sure the loop actually does run? How can you be sure that the high level Java statements are really being computed?
For example:

public static void test2() {
    int x=0;
    long start = System.currentTimeMillis();
    for (int i=0; i < BC; i++) x = x * 2;
    long end = System.currentTimeMillis();
    long frame = (end - start);
    System.out.print(frame + "-" + frame/P + " ");
}

The only computation in the loop is on ‘x’, and the result is assigned back to ‘x’, but ‘x’ is never used outside the loop, so the entire loop doesn’t need to be included in the compiled code. Theoretically the result can be computed without running the loop, so even printing the final ‘x’ value doesn’t prove the loop happened.

But most of all, the loop is so trivial that various factors of processor architecture dominate the measurement (assuming the loop does run). For example, the loop easily fits in the processor cache, and branch prediction, together with the cost of mispredicting and stalling the pipeline, behaves in a trivially regular way that doesn’t represent what would happen in a ‘useful’ program.

The only thing this tests is how fast the benchmark runs; it doesn’t mean anything outside of that. In other words, you can’t say “integer multiplication performance is X”; you can only say the performance appears to be X for this very specific loop. Too much is going on that you can’t see.
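
One way to make the loop harder to throw away, shown here only as a sketch (the class name, the multiplier 31 and the seed are arbitrary choices for illustration): start from a value that is not known until run time, make each step depend on the loop counter, and print the final result. This does not guarantee the JIT compiles the loop the way it is written, and it still only measures this one loop.

```java
// Sketch only: a variant of test2 whose result depends on run-time input and is actually used.
class LiveResultLoop {

    // Each step depends on the previous x and on i, and the final x is returned to the caller.
    static int mulLoop(int seed, int iterations) {
        int x = seed;
        for (int i = 0; i < iterations; i++) x = x * 31 + i;
        return x;
    }

    public static void main(String[] args) {
        int iterations = 10000000;   // same count as BC in the original program
        // Seed taken from the clock so it is not a compile-time constant.
        int seed = (int) System.currentTimeMillis() | 1;
        for (int round = 0; round < 4; round++) {
            long start = System.currentTimeMillis();
            int x = mulLoop(seed + round, iterations);
            long frame = System.currentTimeMillis() - start;
            System.out.println(frame + " ms, x = " + x);
        }
    }
}
```

The multiplier is deliberately not a power of two, so the step cannot be turned into a single shift, although a compiler may still rewrite it as a shift and a subtraction.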

mthornton · June 11, 2004, 6:20am

[quote]ok so we all agree that micro benchmarks don't mean anything. BUT, does anyone here really say that Java can compare to C++ when you want balls out pure performance code?
[/quote]
Have a look at this for an example of using Java for high performance code.
http://hoschek.home.cern.ch/hoschek/colt/
While it doesn’t quite match Intel’s hand crafted assembler it remains a very creditable achievement.

Java has a number of libraries which are slower than they could be, and its requirement for array bounds checking puts it at a slight disadvantage in code doing random access to arrays (for loops across arrays the compiler can often hoist the check out of the loop or eliminate it entirely).
On the other hand Java can do inlining in many cases where C++ could not. This is particularly important in genuinely OO code (as opposed to simple procedural C like code). It is hard to match the performance of Java’s synchronization in C++ (and certainly not by naive use of say the Windows synchronization library calls). Java JIT compilers can (and some do) generate different code depending on whether the machine on which it is running has just one or multiple processors.

zingbat · June 11, 2004, 3:31pm

swpalmer:

That was a good explanation; it pointed out some aspects I overlooked. I will check the documentation on how the specific JDK I'm using optimizes the code. This must be something more or less predictable.
