The Thrilla in Manila: JNA vs JNI using OpenCL

I personally would not base my choice of OpenCL bindings on this, but I think we all know this is not really about OpenCL; it is about OpenGL. OpenCL should be discussed in its own thread. It is just a proxy here. Michael, sorry if I am forcing your hand too early or messing up your schedule. Just say if you wish to postpone. As long as you run both sides yourself on the same hardware, you do not have to worry about distributing anything earlier than you had planned.

The JNA side uses OpenCL4Java, which is what JavaCL is built on. We have two names so that we know which layer we are referring to when talking. I am probably the only person who uses OpenCL4Java directly.

I ran this on a Snow Leopard MacBook with 2 GPUs and an Intel® Core™2 Duo CPU P8800 @ 2.66GHz. I tried to lay out some rules, listed right in the source, but they can be improved, so feel free to do so. I am not the referee; my schedule is pretty tight.

When I did a run of 1000 loops (each loop is 5 calls), I got an avg msec per loop of 0.19629501. When I did 1 million, I got 0.018008577. The 1 million run is probably the better measure if you are doing 1000 OpenGL calls just for 1 frame, but play with it.
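To put those per-loop averages in per-call terms, here is a small back-of-the-envelope sketch. The class and method names are made up for illustration, and it assumes that every one of the 5 calls in a loop costs about the same and that an OpenGL call would cost roughly what these OpenCL info queries cost, neither of which has been measured here.

```java
// Back-of-the-envelope conversion of the per-loop averages quoted above.
// CallOverhead and its methods are hypothetical names, not part of any library.
final class CallOverhead {
    static final int CALLS_PER_LOOP = 5; // from the post: each loop is 5 calls

    /** Average microseconds per native call, given average milliseconds per loop. */
    static double microsPerCall(double msPerLoop) {
        return msPerLoop * 1000.0 / CALLS_PER_LOOP;
    }

    /** Milliseconds of call overhead for a frame that issues callsPerFrame calls. */
    static double msPerFrame(double msPerLoop, int callsPerFrame) {
        return microsPerCall(msPerLoop) * callsPerFrame / 1000.0;
    }

    public static void main(String[] args) {
        double[] measured = {0.19629501, 0.018008577}; // 1000-loop run, 1M-loop run
        for (double ms : measured) {
            System.out.printf("%.9f ms/loop -> %.2f us/call, %.2f ms per 1000-call frame%n",
                    ms, microsPerCall(ms), msPerFrame(ms, 1000));
        }
    }
}
```

Under those assumptions the 1M-loop figure works out to roughly 3.6 microseconds per call, so 1000 calls would take about 3.6 ms of a frame.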


`package whatever;

import java.nio.*;
import com.sun.jna.*;
import com.sun.jna.ptr.*;
import com.ochafik.lang.jnaerator.runtime.NativeSize;
import com.ochafik.lang.jnaerator.runtime.NativeSizeByReference;
import com.nativelibs4java.opencl.library.*;

/**
 *  JNA version, using OpenCL4Java (the low-level bindings for JavaCL).  Add this jar to the project to run.
 *  http://nativelibs4java.sourceforge.net/maven/com/nativelibs4java/opencl4java/1.0-SNAPSHOT/opencl4java-1.0-SNAPSHOT-shaded.jar
 *
 *  Goal: test call overhead of JNI vs JNA.  OpenCL has device info calls which are
 *  short in duration.  They DO NOT touch the GPUs.  The type of data returned can be found
 *  by running http://nativelibs4java.sourceforge.net/webstart/OpenCL/HardwareReport.jnlp
 *
 *  Not every possible query is performed, only one of each return type.  Too much work for
 *  all.  Control using LOOP_COUNT.
 *
 *  Turn on the clock only after the platform and device are created.
 *
 *  Rules:
 *     - Platform must be NVidia 195 or 196 if Windows.  Win7 64-bit if possible.
 *     - Do not even bother to create a context or command queue.
 *     - The avg time/loop should be compared on the exact same hardware.  The value itself is
 *       NOT important, only the difference in values.
 *     - MUST "look" at the value, since this could be different.
 *     - MUST include any methods which one would reasonably need to call inside the
 *       loop, e.g. getPointer() methods for JNA.
 *     - Assigning the return code is required, but the actual checking can be commented out.
 */
public class JNIvsJNAviaOpenCL{
static int LOOP_COUNT = 1000000; // 1M
static float NANOS_PER_MILLI = 1000000F;

public static void main(String[] argv){
    // get platform, usually only one, unless mixing NVidia & ATI GPUs
    OpenCLLibrary.cl_platform_id[] platformArray = new OpenCLLibrary.cl_platform_id[1];
    int err = OpenCLLibrary.INSTANCE.clGetPlatformIDs(1, platformArray, null);
    if (err != OpenCLLibrary.CL_SUCCESS)
        throw new RuntimeException("failed to get platform " + err);

    // get any device, the device itself not important, but need to do queries against something
    OpenCLLibrary.cl_device_id[] deviceArray = new OpenCLLibrary.cl_device_id[1];
    err = OpenCLLibrary.INSTANCE.clGetDeviceIDs(platformArray[0], OpenCLLibrary.CL_DEVICE_TYPE_ALL, 1, deviceArray, null);
    if (err != OpenCLLibrary.CL_SUCCESS)
        throw new RuntimeException("failed to get device " + err);

    // assorted vars declared outside the loop
    OpenCLLibrary.cl_device_id dev = deviceArray[0];  // do not want to index dev array every call
    long cummTime = 0L;
    long start;

    NativeSize szInt = new NativeSize(4);  // CL_DEVICE_VENDOR_ID is a cl_uint (4 bytes), matching IntByReference
    IntByReference valInt = new IntByReference();
    int lookedAtInt;

    NativeSize szLong = new NativeSize(8);
    LongByReference valLong = new LongByReference();
    long lookedAtLong;

    NativeSize szSizeT = new NativeSize(8);
    NativeSizeByReference valSizeT = new NativeSizeByReference();
    long lookedAtSizeT;

    NativeSize szString = new NativeSize();
    NativeSizeByReference nCharBuf = new NativeSizeByReference();
    ByteBuffer valStringBuf;
    int length;
    String lookedAtString;

    long force_JVM_to_do = 0;

    for(int i = 0; i < LOOP_COUNT; i++){
        start = System.nanoTime();

        // int based info queries
        err = OpenCLLibrary.INSTANCE.clGetDeviceInfo(dev, OpenCLLibrary.CL_DEVICE_VENDOR_ID, szInt, valInt.getPointer(), null);
        // if (err != OpenCLLibrary.CL_SUCCESS)
        //     throw new RuntimeException("failed int query " + err);
        lookedAtInt = valInt.getValue();

        // long based info queries
        err = OpenCLLibrary.INSTANCE.clGetDeviceInfo(dev, OpenCLLibrary.CL_DEVICE_MAX_MEM_ALLOC_SIZE, szLong, valLong.getPointer(), null);
        // if (err != OpenCLLibrary.CL_SUCCESS)
        //     throw new RuntimeException("failed long query " + err);
        lookedAtLong = valLong.getValue();

        // size_t based info queries
        err = OpenCLLibrary.INSTANCE.clGetDeviceInfo(dev, OpenCLLibrary.CL_DEVICE_IMAGE2D_MAX_WIDTH, szSizeT, valSizeT.getPointer(), null);
        // if (err != OpenCLLibrary.CL_SUCCESS)
        //     throw new RuntimeException("failed size_t query " + err);
        lookedAtSizeT = valSizeT.getValue().longValue();

        // string based info queries  (2 calls, first to find out size; 2nd to get)
        err = OpenCLLibrary.INSTANCE.clGetDeviceInfo(dev, OpenCLLibrary.CL_DRIVER_VERSION, szString, null, nCharBuf);
        // if (err != OpenCLLibrary.CL_SUCCESS)
        //     throw new RuntimeException(ErrorDesc.getErrorDesc(err));

        length = nCharBuf.getValue().intValue();
        szString.setValue(length);
        valStringBuf = NIO_Utils.getByteBuffer(length);

        // call again to get the actual value
        err = OpenCLLibrary.INSTANCE.clGetDeviceInfo(dev, OpenCLLibrary.CL_DRIVER_VERSION, szString, Native.getDirectBufferPointer(valStringBuf), null);
        // if (err != OpenCLLibrary.CL_SUCCESS)
        //     throw new RuntimeException("failed string query " + err);
        // else
        lookedAtString = NIO_Utils.toString(valStringBuf);

        cummTime += System.nanoTime() - start;
        force_JVM_to_do += lookedAtInt - lookedAtLong + lookedAtSizeT - lookedAtString.length();
    }


    System.out.println("Avg ms per loop: " + (cummTime/(LOOP_COUNT * NANOS_PER_MILLI)));
    System.out.println("ignore:  " + force_JVM_to_do);

}

}
`

I like the title :wink: At least there is someone who doesn't treat technology discussions as a religion.

I had to comment out the last few lines of your test since you forgot to include NIO_Utils.
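For anyone who wants to run the original test unmodified, a stand-in for the missing helper could look something like the sketch below. This is a guess, not the original class: only the two method names and how they are called come from the posted code; the direct allocation, the native byte order and the NUL-terminated ASCII decoding are all assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical replacement for the NIO_Utils class that was not posted.
final class NIO_Utils {
    private NIO_Utils() {}

    /** Allocates a direct buffer in native byte order, so native code can write into it. */
    static ByteBuffer getByteBuffer(int nBytes) {
        return ByteBuffer.allocateDirect(nBytes).order(ByteOrder.nativeOrder());
    }

    /** Decodes the bytes before the first NUL (the C string terminator) as ASCII. */
    static String toString(ByteBuffer buf) {
        int end = 0;
        while (end < buf.capacity() && buf.get(end) != 0) {
            end++;
        }
        byte[] bytes = new byte[end];
        for (int i = 0; i < end; i++) {
            bytes[i] = buf.get(i); // absolute reads leave the buffer position untouched
        }
        return new String(bytes, StandardCharsets.US_ASCII);
    }
}
```

Note that allocating a fresh direct buffer on every loop iteration, as the test does, is itself part of what gets timed.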

Well, I used the high-level API since it's already late. (And I had to change the rules, since I am on 64-bit Linux with a GTX 295.)


package whatever;

import com.mbien.opencl.CLContext;
import com.mbien.opencl.CLDevice;
import com.mbien.opencl.CLPlatform;

/**
 * @author mbien
 */
public class JOCLHLBench {

    static int LOOP_COUNT = 1000000; // 1M
    static float NANOS_PER_MILLI = 1000000F;

    public static void main(String[] args) {

        //init
        CLContext context = CLContext.create(CLPlatform.getDefault().listCLDevices()[0]);
        CLDevice device = context.getDevices()[0];

        long cummTime = 0L;
        long start;
        long force_JVM_to_do = 0;

        long lookedAtInt;
        long lookedAtLong;
        long lookedAtSizeT;
        String lookedAtString;

        for(int i = 0; i < LOOP_COUNT; i++){
            start = System.nanoTime();

            // int based info queries
            lookedAtInt = device.getVendorID(); // sorry, but this is a long in my case :)

            // long based info queries
            lookedAtLong = device.getMaxMemAllocSize();

            // size_t based info queries
            lookedAtSizeT = device.getMaxImage2dWidth();

            // string based info queries  (2 calls, hidden in HL API)
            lookedAtString = device.getDriverVersion();

            cummTime += System.nanoTime() - start;
            force_JVM_to_do += lookedAtInt - lookedAtLong + lookedAtSizeT - lookedAtString.length();
        }

        System.out.println("Avg ms per loop: " + (cummTime/(LOOP_COUNT * NANOS_PER_MILLI)));
        System.out.println("ignore:  " + force_JVM_to_do);

        //deinit
        context.release();
    }
    
}


Your code again:


package whatever;

import java.nio.*;
import com.sun.jna.*;
import com.sun.jna.ptr.*;
import com.ochafik.lang.jnaerator.runtime.NativeSize;
import com.ochafik.lang.jnaerator.runtime.NativeSizeByReference;
import com.nativelibs4java.opencl.library.*;

/**
 *  JNA version, using OpenCL4Java (the low-level bindings for JavaCL).  Add this jar to the project to run.
 *  http://nativelibs4java.sourceforge.net/maven/com/nativelibs4java/opencl4java/1.0-SNAPSHOT/opencl4java-1.0-SNAPSHOT-shaded.jar
 *
 *  Goal: test call overhead of JNI vs JNA.  OpenCL has device info calls which are
 *  short in duration.  They DO NOT touch the GPUs.  The type of data returned can be found
 *  by running http://nativelibs4java.sourceforge.net/webstart/OpenCL/HardwareReport.jnlp
 *
 *  Not every possible query is performed, only one of each return type.  Too much work for
 *  all.  Control using LOOP_COUNT.
 *
 *  Turn on the clock only after the platform and device are created.
 *
 *  Rules:
 *     - Platform must be NVidia 195 or 196 if Windows.  Win7 64-bit if possible.
 *     - Do not even bother to create a context or command queue.
 *     - The avg time/loop should be compared on the exact same hardware.  The value itself is
 *       NOT important, only the difference in values.
 *     - MUST "look" at the value, since this could be different.
 *     - MUST include any methods which one would reasonably need to call inside the
 *       loop, e.g. getPointer() methods for JNA.
 *     - Assigning the return code is required, but the actual checking can be commented out.
 *
 */
public class JNIvsJNAviaOpenCL{
    static int LOOP_COUNT = 1000000; // 1M
    static float NANOS_PER_MILLI = 1000000F;

    public static void main(String[] argv){
        // get platform, usually only one, unless mixing NVidia & ATI GPUs
        OpenCLLibrary.cl_platform_id[] platformArray = new OpenCLLibrary.cl_platform_id[1];
        int err = OpenCLLibrary.INSTANCE.clGetPlatformIDs(1, platformArray, null);
        if (err != OpenCLLibrary.CL_SUCCESS)
            throw new RuntimeException("failed to get platform " + err);

        // get any device, the device itself not important, but need to do queries against something
        OpenCLLibrary.cl_device_id[] deviceArray = new OpenCLLibrary.cl_device_id[1];
        err = OpenCLLibrary.INSTANCE.clGetDeviceIDs(platformArray[0], OpenCLLibrary.CL_DEVICE_TYPE_ALL, 1, deviceArray, null);
        if (err != OpenCLLibrary.CL_SUCCESS)
            throw new RuntimeException("failed to get device " + err);

        // assorted vars declared outside the loop
        OpenCLLibrary.cl_device_id dev = deviceArray[0];  // do not want to index dev array every call
        long cummTime = 0L;
        long start;

        NativeSize szInt = new NativeSize(4);  // CL_DEVICE_VENDOR_ID is a cl_uint (4 bytes), matching IntByReference
        IntByReference valInt = new IntByReference();
        int lookedAtInt;

        NativeSize szLong = new NativeSize(8);
        LongByReference valLong = new LongByReference();
        long lookedAtLong;

        NativeSize szSizeT = new NativeSize(8);
        NativeSizeByReference valSizeT = new NativeSizeByReference();
        long lookedAtSizeT;

        NativeSize szString = new NativeSize();
        NativeSizeByReference nCharBuf = new NativeSizeByReference();
        ByteBuffer valStringBuf;
        int length;
        String lookedAtString;

        long force_JVM_to_do = 0;

        for(int i = 0; i < LOOP_COUNT; i++){
            start = System.nanoTime();

            // int based info queries
            err = OpenCLLibrary.INSTANCE.clGetDeviceInfo(dev, OpenCLLibrary.CL_DEVICE_VENDOR_ID, szInt, valInt.getPointer(), null);
//            if (err != OpenCLLibrary.CL_SUCCESS)
//                throw new RuntimeException("failed int query " + err);
            lookedAtInt = valInt.getValue();

            // long based info queries
            err = OpenCLLibrary.INSTANCE.clGetDeviceInfo(dev, OpenCLLibrary.CL_DEVICE_MAX_MEM_ALLOC_SIZE, szLong, valLong.getPointer(), null);
//            if (err != OpenCLLibrary.CL_SUCCESS)
//                throw new RuntimeException("failed long query " + err);
            lookedAtLong = valLong.getValue();

            // size_t based info queries
            err = OpenCLLibrary.INSTANCE.clGetDeviceInfo(dev, OpenCLLibrary.CL_DEVICE_IMAGE2D_MAX_WIDTH, szSizeT, valSizeT.getPointer(), null);
//            if (err != OpenCLLibrary.CL_SUCCESS)
//                throw new RuntimeException("failed size_t query " + err);
            lookedAtSizeT = valSizeT.getValue().longValue();

            // string based info queries  (2 calls, first to find out size; 2nd to get)
            err = OpenCLLibrary.INSTANCE.clGetDeviceInfo(dev, OpenCLLibrary.CL_DRIVER_VERSION, szString, null, nCharBuf);
//            if (err != OpenCLLibrary.CL_SUCCESS)
//                throw new RuntimeException(ErrorDesc.getErrorDesc(err));

            length = nCharBuf.getValue().intValue();
            szString.setValue(length);
//            valStringBuf = NIO_Utils.getByteBuffer(length);
//
//            // call again to get the actual value
//            err = OpenCLLibrary.INSTANCE.clGetDeviceInfo(dev, OpenCLLibrary.CL_DRIVER_VERSION, szString, Native.getDirectBufferPointer(valStringBuf), null);
////            if (err != OpenCLLibrary.CL_SUCCESS)
////                throw new RuntimeException("failed string query " + err);
////            else
//                lookedAtString = NIO_Utils.toString(valStringBuf);

            cummTime += System.nanoTime() - start;
            force_JVM_to_do += lookedAtInt - lookedAtLong + lookedAtSizeT /*- lookedAtString.length()*/;
        }


        System.out.println("Avg ms per loop: " + (cummTime/(LOOP_COUNT * NANOS_PER_MILLI)));
        System.out.println("ignore:  " + force_JVM_to_do);

    }
}

Results:

OpenCL4Java:

Avg ms per loop: 0.022599893
ignore: -234688290000000

JOCL, high level:
Avg ms per loop: 0.012990374
ignore: 5343065332166419968

(The values are different since the JNA version makes one call fewer.)

No guarantees, maybe I forgot something in the hurry… it's already late in Germany.
Thanks for providing the test case!

Hi all,

I’m the author of JavaCL (a.k.a. OpenCL4Java).
This benchmark is actually excellent news for JavaCL’s performance, because… it doesn’t even use the fastest JNA mapping mode!

Indeed, JNA has two mapping modes (see https://jna.dev.java.net/):

  • interface mode (dynamic and slow because it’s reflection-intensive), currently used by OpenCL4Java
  • direct native mode (native methods are directly bound to native function callbacks, pretty much as in JNI).

The direct native mode can be up to 10 times faster than the interface mode, but it has some (overridable) limitations that made me decide against it for the pre-1.0 final version of JavaCL / OpenCL4Java.
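To give a feel for why a reflection-heavy path costs more per call, here is a pure-Java analogy. It is not JNA code and calls nothing native: it only contrasts a plain static method call with the same operation routed through a `java.lang.reflect.Proxy` handler, where each call boxes its arguments and is dispatched by method name. All class and method names are invented for the illustration, and the timing loop is a rough sketch, not a careful benchmark.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Analogy only: dynamic proxy dispatch vs. a direct call. No JNA involved.
final class DispatchAnalogy {
    /** Stand-in for a library interface. */
    interface Lib {
        int add(int a, int b);
    }

    /** "Direct" path: an ordinary static method call. */
    static int addDirect(int a, int b) {
        return a + b;
    }

    /** "Interface" path: every call goes through a handler with boxed arguments. */
    static Lib newProxy() {
        InvocationHandler handler = (proxy, method, args) -> {
            if (method.getName().equals("add")) {
                return (Integer) args[0] + (Integer) args[1];
            }
            throw new UnsupportedOperationException(method.getName());
        };
        return (Lib) Proxy.newProxyInstance(
                Lib.class.getClassLoader(), new Class<?>[] {Lib.class}, handler);
    }

    public static void main(String[] args) {
        Lib lib = newProxy();
        int n = 1_000_000;
        long sink = 0;

        // Warm-up pass so both paths get a chance to be JIT-compiled.
        for (int i = 0; i < n; i++) {
            sink += addDirect(i, 1) + lib.add(i, 1);
        }

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            sink += addDirect(i, 1);
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            sink += lib.add(i, 1);
        }
        long t2 = System.nanoTime();

        System.out.println("direct  ns/call: " + (t1 - t0) / (double) n);
        System.out.println("proxied ns/call: " + (t2 - t1) / (double) n);
        System.out.println("ignore: " + sink);
    }
}
```

How large the gap is will depend on the JVM and hardware, and the ratio seen here says nothing precise about the ratio between the two JNA modes.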

Your post will obviously make me release a “direct-enabled” OpenCL4Java binding sooner, so stay tuned :slight_smile:
Cheers

Olivier

Well, I always knew JNA was going to end up as Joe Frazier in such a raw match-up, but I had assumed those figures were using direct mapping. That’s better performance than I expected from interface mode. While I’d like to see some robust benchmarks between JNI and JNA, the other thing that interests me is seeing some “real” apps benchmarked using the two bindings. I’m more interested in seeing at what point (if any) the extra overhead becomes statistically irrelevant.

@ Olivier - there was some discussion of direct vs interface mapping at the end of the “Catch 22 for JOGL” thread, in case you haven’t seen it. OT - The JNAJack binding I mentioned there (it’s at http://code.google.com/p/java-audio-utils/) used an early version of JNAerator to create most of the low level binding. Thanks, great tool! Must try again and update with direct mapping mode.

Best wishes,

Neil