New VM performance improvements

ajiva · September 1, 2004, 1:58pm

Just wanted to let you guys know that sin, cos, tan, ln, and log10 are all set up in the VM to use the x86 hardware instructions (when possible). This is in addition to square root and pow. This applies on both x86 and AMD64, and there are small speedups on other platforms (from speeding up the calls to these trig and transcendental functions). This pretty much wraps up this sort of work on my end. Does anyone have other suggestions for things that need to be sped up?

Oh, and this is of course post-Tiger, so don’t expect it anytime soon.

pepe · September 1, 2004, 2:15pm

Hello. Thanks for those, it is very much appreciated!
I don’t know if this is your field, or whether it properly answers your question, but shifts and masking (on ints, for channel operations on pixels) turned out to be very slow. In fact, storing and handling pixel values as a packed RGBA int was slower than using four floats, because of those operations. (I’m doing image filtering, and of course I need it to be fast.)
If you could accelerate that as well, I’d be your slave for life.
;D

ajiva · September 1, 2004, 2:34pm

Can you write up a small test case showing the problem? Something that I can look at and try to optimize? Thanks

pepe · September 1, 2004, 3:51pm

Of course.
Here it is. I get a consistent 4.5x speedup going to float. :o

public class FilteringTest
{

      public static void main(String args[])
      {
            FilteringTest ft=new FilteringTest();
            ft.startInt();
            ft.startFloat();
      }
      
      static final int nbPixels = 1024*1024; // change to set a new image size
      
      static final int loopCount = 500; // change to set a different image filtering count. Verification of the pixel will not match text printed. (i'm lazy)
      
      static int t;
      static float tf;


      void startInt()
      {
            
            long debut;
            long fin;
            int loop=0, imgloop;
            int r=0,g=0,b=0,a=0,pixel=0;
            debut=System.currentTimeMillis();
            int array[]=new int[nbPixels];
            fin=System.currentTimeMillis();
            System.out.println("init time:"+(fin-debut)+" ms");
            

            
            imgloop=0;
            for (;imgloop<nbPixels; imgloop++)
            {
                  array[imgloop]=0xffffffff;
            }

            

            System.out.println("test1: int pixels filtering");
            debut=System.currentTimeMillis();
            for (;loop<loopCount; loop++)
            {
                  imgloop=0;
                  for (;imgloop<nbPixels; imgloop++)
                  {
                        pixel=array[imgloop];
                        a=pixel>>>24;
                        r=(pixel>>>16)&0x000000ff;
                        g=(pixel>>>8)&0x000000ff;
                        b=pixel&0x000000ff;

                        t=((r+g+b)/3)-1; // performing a basic non weighted b&w. (-1 is to decrease the result, so rendering can be verified.)
                        
                        array[imgloop]=(t<<24)+(t<<16)+(t<<8)+t;
                  }
            }
            fin=System.currentTimeMillis();
            long test1=(fin-debut);
            System.out.println("Elapsed time: "+test1+" ms");
            int nbpix1=(int) ((nbPixels*loopCount)/((double)test1/1000.f));
            System.out.println(nbpix1+" pixels/second, that is "+(((double)nbpix1/(720*576*25))*100)+"% of real time video filtering.");
            System.out.println("random pixel result for validity of rendering: 0x"+ Integer.toHexString( array[ (int)(Math.random() * nbPixels) ] ));
            System.out.println("result should be:0x0a0a0a0a" +"\n\n");
            array=null;
      }
      

      
      void startFloat()
      {
            
            long debut;
            long fin;
            int loop=0, imgloop;
            float r=0.f,g=0.f,b=0.f,a=0.f;
            debut=System.currentTimeMillis();
            float array[]=new float[nbPixels*4];
            fin=System.currentTimeMillis();
            System.out.println("init time:"+(fin-debut)+" ms");


            imgloop=0;
            for (;imgloop<nbPixels; imgloop++)
            {
                  array[imgloop]=150000.f;
            }


            System.out.println("test2: float pixels filtering");
            debut=System.currentTimeMillis();
            for (;loop<loopCount; loop++)
            {
                  imgloop=0;
                  for (;imgloop < nbPixels ; imgloop+=4) // 4 floats per pixel, so each pass covers nbPixels/4 pixels
                  {
                        a=array[imgloop];
                        r=array[imgloop+1];
                        g=array[imgloop+2];
                        b=array[imgloop+3];

                        tf=((r+g+b)/3.f)-1; // performing a basic non weighted b&w. (-1 is to decrease the result, so validity of rendering can be verified.)

                        array[imgloop]=tf;
                        array[imgloop+1]=tf;
                        array[imgloop+2]=tf;
                        array[imgloop+3]=tf;
                  }
            }
            fin=System.currentTimeMillis();
            long test1=(fin-debut);
            System.out.println("Elapsed time: "+test1+" ms");
            int nbpix1=(int) ((nbPixels*loopCount)/((double)test1/1000.f));
            System.out.println(nbpix1+" pixels/second, that is "+(((double)nbpix1/(720*576*25))*100)+"% of real time video filtering.");
            System.out.println("random pixel result for validity of rendering:"+ array[ (int)(Math.random() * nbPixels) ] );
            System.out.println("result should be:149500" +"\n\n");
            array=null;
      }
}

mthornton · September 1, 2004, 5:01pm

You should really put
t &= 0xFF
before creating the pixel.
In any case using

t = (((r+g+b)*5592406) >>> 24)-1;

is considerably faster than

t=((r+g+b)/3)-1;

although still not as good as the float based code (at least on my Athlon XP 2500+).
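
A minimal sketch of why the multiply-and-shift form above gives the same answer as dividing by 3 in this test: 5592406 is 2^24/3 rounded up, and the unsigned shift by 24 undoes the scaling. The class and method names are illustrative only, and the check covers just the range that a sum of three 8-bit channels can take (0 to 765).

```java
// Illustrative check, not part of the original test case.
final class ReciprocalDivideCheck {
    // 2^24 / 3 rounded up; the constant used in the post above.
    static final int MAGIC = 5592406;

    // Divide a channel sum by 3 using a multiply and an unsigned shift.
    // The product can exceed Integer.MAX_VALUE, but it stays below 2^32
    // for sums up to 765, so >>> still reads the correct top bits.
    static int divideBy3(int sum) {
        return (sum * MAGIC) >>> 24;
    }

    // Returns true if the trick agrees with sum / 3 for every sum in [0, maxSum].
    static boolean agreesUpTo(int maxSum) {
        for (int sum = 0; sum <= maxSum; sum++) {
            if (divideBy3(sum) != sum / 3) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 765 = 255 * 3, the largest possible r+g+b for 8-bit channels.
        System.out.println("matches for 0..765: " + agreesUpTo(765));
    }
}
```

The unsigned shift (>>>) matters here: with a signed shift (>>), the larger products would be read as negative numbers.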

pepe · September 1, 2004, 5:55pm

[quote]You should really put
t &= 0xFF
before creating the pixel.
[/quote]
That’s interesting, but I think it’s unnecessary. As the input can’t be over 255, the result can’t be illegal, that is, over 255.

True, but that kind of optimisation should belong to the compiler, not the coder.

What is your ratio between each?

dranonymous · September 1, 2004, 6:03pm

Mark - Why is the version you presented faster? My guess is that you avoid casting the ints to floats, but that’s speculation.

Dr. A>

mthornton · September 1, 2004, 6:14pm

That -1 means the result can be -1!

While it may be practical for a compiler to replace division by a constant float with multiplication by the reciprocal, there are complications in doing the same thing for integers.

float about 5, vs int about 8 (seconds in both cases). The original int version takes 20.

dranonymous:
My revised int version is faster because multiplication is (usually) significantly faster than division. This is true for both integer and floating point; however, I suspect that in the floating point case the division has been automatically replaced by a multiplication by the reciprocal.
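
To make the -1 case concrete, here is a small sketch (the class and method names are hypothetical) that packs a grey level into an int the way the test case does, with and without the suggested t &= 0xFF. Once t reaches -1, the unmasked sum of shifts no longer yields four equal channel bytes.

```java
// Illustrative only; mirrors the packing expression from the test case.
final class PackGreyCheck {
    // Packing as written in the test case: t is not masked.
    static int packUnmasked(int t) {
        return (t << 24) + (t << 16) + (t << 8) + t;
    }

    // Same packing, but with t limited to its low 8 bits first.
    static int packMasked(int t) {
        t &= 0xFF;
        return (t << 24) | (t << 16) | (t << 8) | t;
    }

    public static void main(String[] args) {
        int t = ((0 + 0 + 0) / 3) - 1; // a black pixel filtered once more gives -1
        System.out.println(Integer.toHexString(packUnmasked(t))); // fefefeff
        System.out.println(Integer.toHexString(packMasked(t)));   // ffffffff
    }
}
```

Masking keeps the four channels consistent, but it wraps -1 around to 255. Clamping to 0 instead would be a different design choice.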

tom · September 1, 2004, 7:02pm

On my computer the integer version is a factor of 1.7 slower using Mark’s modification, which sounds about right as the integer version does twice the amount of work.

[quote]here it is. i get a constant 4.5 speed increase factor going float.
[/quote]
What did you expect?

mthornton · September 1, 2004, 7:15pm

[quote]What did you expect?
[/quote]
Current CPUs have lots of hardware devoted to floating point, so that they can do simultaneous additions and multiplications. On the other hand, there is usually only one shifter, so the integer version probably makes less effective use of the chip (less scope for operations to be performed in parallel).

pepe · September 2, 2004, 2:34am

If I were working with 8-bit storage, that would be true. Nevertheless, 0xFF in an int is 255, not -1; 0xFFFFFFFF would be.

[quote]While it may be practical for a compiler to replace division by a constant float with multiplication by the reciprocal, there are complications in doing the same thing for integers.
[/quote]
Oh, interesting. Why is that?

That’s a nice improvement, I agree. Too nice, in fact. There has to be something that can be done about that division…
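
On the question of why the integer case is harder, one complication is range. The sketch below uses hypothetical names and is an illustration only, not a description of what any particular JIT does. It shows that the 24-bit constant only works while the 32-bit product does not wrap, and that covering every non-negative int needs a wider multiply with a different constant. Negative dividends are a further wrinkle, since Java’s / truncates toward zero while a shift rounds down.

```java
// Illustrative only: range limits of division by 3 via multiply-and-shift.
final class DivideBy3Range {
    // 24-bit scaled reciprocal: correct for small inputs only.
    static int narrow(int n) {
        return (n * 5592406) >>> 24;
    }

    // 64-bit multiply with 0xAAAAAAAB = (2^33 + 1) / 3; valid for any n >= 0.
    static int wide(int n) {
        return (int) ((n * 0xAAAAAAABL) >>> 33);
    }

    // Smallest n in [0, limit] where narrow() disagrees with n / 3, or -1 if none.
    static int firstNarrowMismatch(int limit) {
        for (int n = 0; n <= limit; n++) {
            if (narrow(n) != n / 3) {
                return n;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // 768 * 5592406 is just over 2^32, so the int product wraps there.
        System.out.println("narrow first fails at n = " + firstNarrowMismatch(100000));
        System.out.println("wide(Integer.MAX_VALUE) = " + wide(Integer.MAX_VALUE));
    }
}
```

In the pixel test the inputs never exceed 765, which is why the narrow form is safe there. A compiler that cannot prove such a bound has to use something like the wide form.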

swpalmer · September 2, 2004, 3:00am

[quote] Anyone have other suggestions for things that need to be sped up?
[/quote]
Use of vector instructions (MMX/SSE/SSE2, etc.) for common patterns found in manipulating RGBA ints, as above.

NVaidya · September 2, 2004, 1:02pm

With reference to this document from NIST (dated November 2002!!!)
http://math.nist.gov/javanumerics/reports/jgfnwg-minutes-11-02.html

is FMA, in particular, currently in place in 1.5 (betas)?

mthornton · September 2, 2004, 1:15pm

[quote]is FMA, in particular, currently in place in 1.5 (betas)?
[/quote]
Unfortunately not.

JSR 84, which also proposed supporting FMA, was withdrawn in March 2002, apparently due to difficulties in setting up the expert group.
http://jcp.org/en/jsr/detail?id=84

dranonymous · September 2, 2004, 2:34pm

Pepe - In the int version you shift the alpha value, but then you never did anything with it. Did I miss where you manipulated the value again?

Mark/Pepe - Have you looked at the compiled byte code to see how it differs for those small shifting/masking areas?

Dr. A>

pepe · September 2, 2004, 3:10pm

[quote]Pepe - In the int version you shift the alpha value, but then you never did anything with it. Did I miss where you manipulated the value again?
[/quote]
No. In the first versions, the values were even all copied into temporary variables, then written back. That class is a stripped-down version of another set of tests where I measured how worthwhile it was to put the pixel processing in a method of another class. In that old test, I had to extract all the components and pass them to the filtering method, along with the image array and the write offset. That was a pretty interesting test, because doing so was faster than simply putting all the code in a single loop (server JIT only…).

I would love to, but we can’t have a look at how the JIT compiles bytecode, if that’s what you meant.

dranonymous · September 2, 2004, 4:15pm

I realize you can’t see how the JIT compiled it down to native assembly, but you could see the bytecode produced in the class files and compare them. It would be interesting to see what was going on in each one.

Dr. A>

mthornton · September 2, 2004, 5:40pm

Bytecode is a very direct representation of the Java source; little or no optimisation is done at that point. Essentially all the optimisation is done by the JIT at runtime.

pepe · September 3, 2004, 6:06am

Bytecode (compiled Java source) is very basic. No optimisations are done there, so that the JIT can recognise patterns, which simplifies its work and makes it more efficient.
Assembly (compiled bytecode) is generated by the JIT, and we mere mortals don’t have access to it. That assembly can be very different from what is in the bytecode.

NVaidya · September 4, 2004, 5:04pm

Would this make Pepe feel better…

http://www.javaspecialists.co.za/archive/Issue054b.html

What’s the deal with the % operator these days anyway?