Improving draw performance for a constantly changing image

Hi,

EDIT: TL;DR:
I’m using ffmpeg to decode video, so I get all the benefits of hardware acceleration that it provides.
I get decoded RGB data from ffmpeg for each frame and put it into a Raster bound to a BufferedImage.
I render the BufferedImage. This bit is slow.

I’m implementing a Java-based media player that’s hooked into the ffmpeg media library. I’m simultaneously decoding multiple video streams and displaying them in a number of lightweight Java components across 4 monitors. However, I’m encountering performance issues when rendering the decoded video to screen. The video decoding itself is negligible (approx. 10% CPU use on 1 core), so I’m not going to discuss it. The issue arises in the Swing painting code, which blocks deep within Java2D’s Direct3D code while scaling/drawing the image itself, so I hope someone might be able to help.

I currently receive the image as a series of ints in packed 0rgb format (this is flexible, so I can use whatever colour model and packing gives best performance). The data is written directly into an image raster that I’ve instantiated in the following manner:

    imageBuffer = new DataBufferInt(lineStride * height);
    imageRaster = Raster.createPackedRaster(imageBuffer, width, height, lineStride, new int[] {0xff0000, 0xff00, 0xff}, null);                
    ColorModel colourModel = new DirectColorModel(32, 0xff0000, 0xff00, 0xff);
    currentFrame = new BufferedImage(colourModel, imageRaster, true, null);
    currentFrame.setAccelerationPriority(1.0f);

This code is only called once per video player, and the raster and BufferedImage are re-used for each subsequent frame.
Each time a new frame is decoded, the data is copied into the raster and repaint(30) is called (to trigger an asynchronous Swing repaint).
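
For anyone following along, the setup and per-frame copy described above can be sketched roughly as below. This is only a sketch under assumptions: the class name ReusableFrame and its methods are made up for illustration, and the decoded frame is assumed to arrive as an int[] of packed 0rgb pixels with the same line stride as the raster. As far as I know, grabbing the backing array with DataBufferInt.getData() also stops Java2D from caching the image in video memory, which may not matter much for an image that changes 25 times a second.

```java
import java.awt.image.BufferedImage;
import java.awt.image.ColorModel;
import java.awt.image.DataBufferInt;
import java.awt.image.DirectColorModel;
import java.awt.image.Raster;
import java.awt.image.WritableRaster;

/** Hypothetical holder for one player's reusable frame image (name is illustrative). */
final class ReusableFrame {
    private final int[] pixels;          // backing array shared with the raster
    private final BufferedImage image;   // created once, re-used for every frame

    ReusableFrame(int width, int height, int lineStride) {
        DataBufferInt buffer = new DataBufferInt(lineStride * height);
        WritableRaster raster = Raster.createPackedRaster(
                buffer, width, height, lineStride,
                new int[] {0xff0000, 0xff00, 0xff}, null);
        ColorModel colourModel = new DirectColorModel(32, 0xff0000, 0xff00, 0xff);
        this.image = new BufferedImage(colourModel, raster, true, null);
        this.pixels = buffer.getData();
    }

    /**
     * Copies one decoded frame (packed 0rgb ints) into the shared raster.
     * No locking is done here, so a repaint that runs mid-copy may show a torn frame.
     */
    void update(int[] decoded) {
        System.arraycopy(decoded, 0, pixels, 0, Math.min(decoded.length, pixels.length));
    }

    BufferedImage image() {
        return image;
    }
}
```

The decoder callback would call update(...) and then repaint(30) on the component that draws image().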

I am currently hitting 100% CPU usage on a Core i5 with the following scenario:

Four 1280x1024 monitors are connected to the PC.
Four 720x576 videos are being decoded at 25FPS (this is handled by ffmpeg and is not the contributing factor to the performance issues)
Each monitor is displaying a single JFrame without window decorations, scaled to fit the entire monitor
The JFrame’s paintComponent method is overridden with the paint code as follows:

    graphics.setColor(Color.BLACK);
    graphics.fillRect(0, 0, getWidth(), getHeight());

    graphics.drawImage(currentFrame,
                    0, 0, getWidth(), getHeight(),
                    0, 0, currentFrame.getWidth(), currentFrame.getHeight(),
                    null);

For comparison, if I use four full-screen media players (such as VLC or ffplay), each playing one of the streams, the total CPU usage is around 30%.

I have experimented with various options (such as disabling Direct3D, etc) but without any real gains and mostly losses/unexpected behaviour. I have also considered using full-screen exclusive mode but would like to know whether this is likely to help before I try that route.

Does anyone have any suggestions? I’m happy to elaborate on any areas of the code that may be relevant.

You can at least lose the clear.

Acknowledged. I have removed it and re-tested but it gives no observable improvement to performance.

Have you tried profiling the program while it is running to see which method calls are taking the longest?

The most basic way to do this is with a lot of System.nanoTime() calls and some print statements to discover how long each chunk of code runs. Be careful, this is fairly messy. If you do try this, don’t put timing print statements around a call to a method that itself contains print statements. Depending on the JVM, print statements can be very slow, so timer prints inside a child method make the parent method appear slower than it really is until you remove the child’s print statements.
(from personal experience trying to profile a program this way)
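
One way to sidestep the slow-print problem is to accumulate the timings in memory and print a summary once at the end. A minimal sketch, assuming single-threaded use; SectionTimer and its method names are hypothetical, not from any library:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: accumulates System.nanoTime() measurements, prints only on request. */
final class SectionTimer {
    // section name -> {total elapsed nanoseconds, number of calls}
    private final Map<String, long[]> totals = new LinkedHashMap<>();

    /** Runs the task and adds its elapsed time to the named section (no printing here). */
    void time(String section, Runnable task) {
        long start = System.nanoTime();
        try {
            task.run();
        } finally {
            long elapsed = System.nanoTime() - start;
            long[] entry = totals.computeIfAbsent(section, k -> new long[2]);
            entry[0] += elapsed;
            entry[1]++;
        }
    }

    long calls(String section) {
        long[] entry = totals.get(section);
        return entry == null ? 0 : entry[1];
    }

    long totalNanos(String section) {
        long[] entry = totals.get(section);
        return entry == null ? 0 : entry[0];
    }

    /** Prints one summary line per section; call this once, outside the timed code. */
    void report() {
        totals.forEach((name, e) ->
                System.out.printf("%s: %d calls, %.3f ms total%n", name, e[1], e[0] / 1e6));
    }
}
```

Nested sections still include the time of their children, but at least the prints themselves no longer distort the numbers.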

I would recommend VisualVM for profiling, although simply starting your Java app from the command line with the -Xprof switch can sometimes be helpful.

The -Xprof switch will keep track of which methods are running every ‘tick’ (depending on VM, around 60 ticks per second) and when you close the program you will get a printout to the command line of how many ‘ticks’ each method was running for.

The most powerful option is a profiler like VisualVM. VisualVM can be tricky to get running and navigate through, but once you figure it out, it’s a very useful tool for tracking memory and CPU usage of a Java program.
Here’s a link to VisualVM’s documentation page (there’s a link to download it in the page’s menu bar). There are a number of links on the page including an introduction page and troubleshooting page: http://visualvm.java.net/docindex.html.

The problem is probably “receiving and sending each image”. You should be sending a lot of images to GPU instead of 1 at a time. I have no idea how to implement that, but that is my guess for your problem :)

In an ideal world you use multiple buffers and have the codec directly perform the color conversion into the target…massive reduction in memory motion.
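
To illustrate the "multiple buffers" half of that idea, here is a minimal two-buffer sketch in Java. Everything in it (FrameExchange and its methods) is hypothetical, and it says nothing about getting the codec to convert colours straight into the target, which would depend on the native side.

```java
import java.util.function.Consumer;

/** Hypothetical two-buffer exchange between one decoder thread and one painter thread. */
final class FrameExchange {
    private int[] back;   // the decoder writes the next frame here
    private int[] front;  // the painter reads the last complete frame from here

    FrameExchange(int pixelCount) {
        back = new int[pixelCount];
        front = new int[pixelCount];
    }

    /** Buffer the decoder should fill next; only the decoder thread may write to it. */
    synchronized int[] backBuffer() {
        return back;
    }

    /** Called by the decoder once the back buffer holds a complete frame: swaps the two buffers. */
    synchronized void publish() {
        int[] tmp = front;
        front = back;
        back = tmp;
    }

    /**
     * Runs the reader against the newest complete frame while holding the lock,
     * so publish() cannot swap the buffers in the middle of a read.
     */
    synchronized void readLatest(Consumer<int[]> reader) {
        reader.accept(front);
    }
}
```

The decoder fills backBuffer() and calls publish(); the paint code does its copy or draw inside readLatest(...). The swap avoids copying a whole frame between decoder and painter, at the cost of the decoder briefly blocking in publish() if a read is in progress.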

I did previously state this, but perhaps I didn’t make this clear enough in my original post:

It’s blocking within the D3D blitting part of the java2D code.

I’ve already profiled the application (many times!) and here’s a typical trace (using sampling as full instrumentation slows everything down to a massive crawl):

Switching off D3D does give a small performance improvement:

Switching on openGL stops the application working completely.

Yes - I am in the process of porting the java code directly to C++ and thinking about using the SDL libraries or similar to do the rendering direct in YUV420p encoding. However, I’m definitely not a C++ engineer, so it’s taking a while and has lots of snags - the media decoding alone (let alone the rendering) is hugely complex and not easy to port. Plus it has to run on windows which is just plain horrible for C++ (though minGW is quite nice).

Unfortunately it’s a video stream, and one of the main selling points is that it’s capable of (almost) zero latency. While I can buffer frames, I’m never buffering more than about 10, and often there’s no buffering at all.

If you’re completely nuts you could move the heavy lifting to the GPU. At the far extreme, the CPU would do pretty much only the entropy decoding on the video side. A heck of a lot of work though…unless someone else has already done it (and it depends on the supported codec(s)).

Forgive me if I’m misinterpreting you here, but are you saying I should re-implement all possible video codecs I could encounter to run entirely on the GPU?

Unfortunately the performance of java2D is not great when it comes to blitting rotated images. This might be the case for scaled images too, I can’t remember.
Use the java2d trace option and let us know what the output is. It’s better than using the profiler.
http://www.oracle.com/technetwork/java/javase/java2d-142140.html#gcrus

Most people doing demanding rendering on this forum use openGL. OpenGL is faster but less reliable since it depends on opengl drivers which can be flaky.
When I use java2d it’s usually only for painting in a window or applet that is a fraction of the screen size since that’s the only way I can achieve a reasonable frame rate on slow computers.
Since you’re blitting to 4 monitors, I think you will have to switch to opengl to achieve reasonable performance. I don’t think java2d can be tweaked to achieve 60 fps on 4 monitors, with or without a video card, since many java2D operations (such as image rotation) are not hardware accelerated.

Cheers,
Keith

I have no idea what you’re doing…I’m just tossing out wild duck ideas. In the simplest case you could send the 3 color channel buffers to the GPU and perform color conversion there. I’m not saying this is reasonable in your case.

Output from adding -Dsun.java2d.trace=count:


19 calls to sun.java2d.d3d.D3DSwToTextureBlit::Blit(IntArgb, SrcNoEa, "D3D Texture")
36066 calls to sun.java2d.d3d.D3DMaskFill::MaskFill(AnyColor, SrcOver, "D3D Surface")
5004 calls to D3DFillRect
4255 calls to D3DDrawGlyphs
4503 calls to sun.java2d.d3d.D3DTextureToSurfaceBlit::Blit("D3D Texture", AnyAlpha, "D3D Surface")
31994 calls to sun.java2d.d3d.D3DMaskFill::MaskFill(OpaqueColor, SrcNoEa, "D3D Surface")
941 calls to sun.java2d.d3d.D3DSwToSurfaceScale::ScaledBlit(IntRgb, AnyAlpha, "D3D Surface")
942 calls to sun.java2d.d3d.D3DRTTSurfaceToSurfaceBlit::Blit("D3D Surface (render-to-texture)", AnyAlpha, "D3D Surface")
19 calls to sun.java2d.d3d.D3DSwToSurfaceBlit::Blit(IntArgb, AnyAlpha, "D3D Surface")
83743 total calls to 9 different primitives

With -Dsun.java2d.d3d=false as well:


2718 calls to sun.java2d.loops.MaskFill::FillAAPgram(AnyColor, Src, IntRgb)
1361 calls to sun.java2d.windows.GDIBlitLoops::Blit(IntRgb, SrcNoEa, "GDI")
12420 calls to sun.java2d.loops.MaskBlit::MaskBlit(IntArgb, SrcOver, IntRgb)
22574 calls to sun.java2d.loops.MaskFill::MaskFill(AnyColor, SrcOver, IntRgb)
1359 calls to sun.java2d.loops.ScaledBlit::ScaledBlit(IntRgb, SrcNoEa, IntRgb)
15660 calls to sun.java2d.loops.MaskFill::FillAAPgram(AnyColor, SrcOver, IntRgb)
12420 calls to sun.java2d.loops.Blit$GeneralMaskBlit::Blit(IntArgb, SrcOver, IntRgb)
3 calls to sun.java2d.loops.FillRect::FillRect(AnyColor, SrcNoEa, AnyInt)
4083 calls to sun.java2d.loops.DrawGlyphListAA::DrawGlyphListAA(AnyColor, SrcNoEa, IntRgb)
72598 total calls to 9 different primitives

Hope this helps.

Appreciated. I thought I’d explained the situation quite clearly, but TL;DR:

I’m using ffmpeg to decode video, so I get all the benefits of hardware acceleration that it provides.
I get decoded RGB data from ffmpeg for each frame and put it into a Raster bound to a BufferedImage.
I render the BufferedImage.
Slowness intensifies.

Oh, my previous traces had quite a lot of extra stuff in them (debug overlay text etc) that for simplicity’s sake I’ve commented out. Here it is with literally just the image render:

With D3D enabled:

862 calls to sun.java2d.d3d.D3DSwToSurfaceScale::ScaledBlit(IntRgb, AnyAlpha, "D3D Surface")
16 calls to sun.java2d.d3d.D3DMaskFill::MaskFill(AnyColor, SrcOver, "D3D Surface")
863 calls to sun.java2d.d3d.D3DRTTSurfaceToSurfaceBlit::Blit("D3D Surface (render-to-texture)", AnyAlpha, "D3D Surface")
5 calls to D3DFillRect
1 call to D3DDrawGlyphs
1747 total calls to 5 different primitives

With D3D disabled:

627 calls to sun.java2d.loops.ScaledBlit::ScaledBlit(IntRgb, SrcNoEa, IntRgb)
629 calls to sun.java2d.windows.GDIBlitLoops::Blit(IntRgb, SrcNoEa, "GDI")
31 calls to sun.java2d.loops.MaskFill::MaskFill(AnyColor, SrcOver, IntRgb)
3 calls to sun.java2d.loops.FillRect::FillRect(AnyColor, SrcNoEa, AnyInt)
1290 total calls to 4 different primitives

Please note they weren’t run for exactly the same amounts of time, so numbers may vary.
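
For a sense of scale, the scaled blits in these traces start from an IntRgb source and happen once per displayed frame. A back-of-envelope calculation of the pixel rates, assuming the figures from the first post (four 720x576 streams at 25 FPS, each filling a 1280x1024 monitor, 4 bytes per pixel) and that every decoded frame is actually painted:

```java
/** Back-of-envelope pixel rates for the scenario in this thread (class name is illustrative). */
final class PixelBudget {
    static final int STREAMS = 4;
    static final int FPS = 25;
    static final int BYTES_PER_PIXEL = 4;

    /** Bytes per second of decoded 720x576 source pixels handed to Java2D. */
    static long sourceBytesPerSecond() {
        return 720L * 576 * BYTES_PER_PIXEL * FPS * STREAMS;
    }

    /** Destination pixels per second produced by scaling every frame to 1280x1024. */
    static long scaledPixelsPerSecond() {
        return 1280L * 1024 * FPS * STREAMS;
    }

    public static void main(String[] args) {
        System.out.println(sourceBytesPerSecond());   // prints 165888000
        System.out.println(scaledPixelsPerSecond());  // prints 131072000
    }
}
```

So roughly 166 MB of source pixels and about 131 million scaled destination pixels per second. If the scaling runs in software, that is a lot of per-pixel work for the CPU, which would be consistent with the usage reported above, though the traces alone don’t prove where the time goes.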

Which was done by Riven a while ago iirc:

http://www.java-gaming.org/topics/java-media-player/27100/view.html

Swing is swing. If the issue here seems to be the speed of java2d drawing and scaling an image - there may not be much to be done and the performance will vary a lot between systems.

Java has a hard time playing an animated GIF without frame reduction, let alone converting each frame into pixels and then displaying it. I think you’ll have to definitely look into doing the heavy lifting via the GPU. When it comes to reliable frames, Java can’t be trusted to run perfectly on all platforms. :-\

I’m doing the heavy lifting (the media decoding) in ffmpeg, which makes full use of SSSE3 CPU extensions for doing this kind of thing. All I need Java to do is to display each frame without consuming 100% CPU. Cross-platform isn’t an issue; I will only be running the Java version on windows as it’s tied in to ffmpeg via JNI. I’m working on a C++ port which I’m sure will solve the CPU issue and be cross platform, but that’s some time away yet…

Now that is a nice project.

I’ll probably drop a message or two on there suggesting some improvements, as it would be possible to remove the MJPEG dependency altogether with the right output format.

I think I have some old code based on LWJGL lying around that can draw images to the screen and respond to mouse clicks that I’ll probably refresh my memory on and make something of similar nature. The annoying part is that I’ll have to re-implement all of the Swing components that I overlay onto the video in openGL (these aren’t included in the benchmarking above, they were the first thing I suspected and removed when testing the performance), unless anyone can suggest otherwise?