Cortado works!

Momoko_Fan · April 27, 2010, 3:31am

I have jheora (from cortado) working inside jMonkeyEngine3. It is easy to get started, but synchronization can be a real pain, especially if you’re using multithreading. Oftentimes the audio packets arrive very late in comparison to the video frames and you essentially have no choice but to buffer the video frames or you’ll be deadlocked when trying to synchronize them.
Once I get proper syncing working, it will be possible to use this and it “just works”. You throw in a theora/vorbis video and it plays just like that, even from a web server. There are a ton of easy one-click converters online for all video formats into theora/vorbis. Compression is decent and the format is patent-free; the only downside is that jheora is under the LGPL, which means you gotta include the jar and source for it separately from the app’s jar.
So yeah, hopefully I’ll be able to get the issues out, and then somebody can port it out of jME3 for general public use, or, there’s always the option of everyone switching to jME3.

delt0r · April 27, 2010, 7:39am

Syncing is in general difficult to get right because you don’t control the multiplexing. In some cases players simply fail if the interleaving is too bad, and some have a “cache” option that lets the user set how bad the muxing can get before failure.

What’s the performance like? In my tests HD was still a no-go for jheora, while the C theora player is now doing quite a bit better than x264 for HD (720p) stuff.

ps I am writing a GLSL iDCT for fun, this may make jheora a bit faster if you skip the java iDCT.

princec · April 27, 2010, 9:17am

I’ve been doing syncing here @ work and basically the way it would appear to do it is you drive everything off of the audio, and buffer video frames behind. With the MPEG2 streams I’m dealing with the number of buffers I need is never more than 3 though, so it’s not a big deal for us.

Cas

Momoko_Fan · April 28, 2010, 3:37pm

Yeah you gotta cache more if you want better sync.

It works fine but you gotta queue up more frames for less choppiness. I tried a 720p 30 fps video with a lot of action and it skipped frames in some places. Hopefully a 24 fps video with less moving things will do better.

Lol good luck with that :P. DCT is actually used in a lot of places and in various ways inside a video codec, you might need to make a lot of mods to the jheora to make it work.

[quote]I’ve been doing syncing here @ work and basically the way it would appear to do it is you drive everything off of the audio, and buffer video frames behind. With the MPEG2 streams I’m dealing with the number of buffers I need is never more than 3 though, so it’s not a big deal for us.
[/quote]
Here I am using 5 buffers and syncing to the system clock and it seems fine for the most part. When you try to sync to audio it’s not really accurate because OpenAL queues up data in large pieces rather than in bytes.
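
A minimal sketch of the buffering described in this post, assuming each decoded frame carries a presentation timestamp in milliseconds. `VideoFrame`, `FrameQueue`, and the method names are hypothetical and are not part of jheora or jME3. The decoder thread offers frames until the buffer is full; the render thread asks for the latest frame that is due according to the playback clock and drops any frames that are already late.

```java
import java.util.ArrayDeque;

/** Hypothetical decoded frame: only the presentation time matters for this sketch. */
final class VideoFrame {
    final long ptsMillis;
    VideoFrame(long ptsMillis) { this.ptsMillis = ptsMillis; }
}

/** Hypothetical bounded queue that paces frames against a caller-supplied clock. */
final class FrameQueue {
    private final ArrayDeque<VideoFrame> frames = new ArrayDeque<>();
    private final int capacity;

    FrameQueue(int capacity) { this.capacity = capacity; }

    /** Called by the decoder thread; returns false when the buffer is full. */
    synchronized boolean offer(VideoFrame frame) {
        if (frames.size() >= capacity) return false;
        frames.addLast(frame);
        return true;
    }

    /**
     * Called by the render thread with the elapsed playback time.
     * Returns the latest frame that is due, dropping older due frames,
     * or null if the next frame is still in the future.
     */
    synchronized VideoFrame poll(long playbackMillis) {
        VideoFrame due = null;
        while (!frames.isEmpty() && frames.peekFirst().ptsMillis <= playbackMillis) {
            due = frames.pollFirst();
        }
        return due;
    }

    synchronized int size() { return frames.size(); }
}
```

With a system clock, the render loop would compute `playbackMillis` as `(System.nanoTime() - startNanos) / 1_000_000` on every pass and call `poll`. A larger capacity smooths over slow decodes at the cost of memory and startup latency.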

princec · April 28, 2010, 3:53pm

I’m queueing up audio buffers which each contain the audio for 1 video frame; when I detect OpenAL has processed the audio for a buffer (by polling every millisecond), I tell OpenGL to draw the next video frame, and by the time the buffer swap’s occurred, OpenAL is well underway rendering the sound for that frame. This seems to be working out just great - sound seems to be perfectly synchronized with video.

Cas
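
A rough sketch of the audio-driven scheme described above, under the stated assumption that every audio buffer holds the sound for exactly one video frame. `AudioDrivenPresenter` and its `IntSupplier` are hypothetical stand-ins: in a real player the supplier would presumably be backed by a query of the audio API (for OpenAL, something along the lines of reading `AL_BUFFERS_PROCESSED` and unqueueing), which is not shown here.

```java
import java.util.function.IntSupplier;

/**
 * Hypothetical audio-driven presenter: each audio buffer holds the sound
 * for one video frame, so every finished buffer advances the video by one frame.
 */
final class AudioDrivenPresenter {
    // Stand-in for polling the audio API: buffers finished since the last poll.
    private final IntSupplier newlyProcessedBuffers;
    private long frameIndex = 0; // frame 0 is assumed to be shown when playback starts

    AudioDrivenPresenter(IntSupplier newlyProcessedBuffers) {
        this.newlyProcessedBuffers = newlyProcessedBuffers;
    }

    /**
     * Poll once. Returns the index of the frame to draw now, or -1 if no
     * audio buffer finished since the last poll.
     */
    long pollFrameToDraw() {
        int finished = newlyProcessedBuffers.getAsInt();
        if (finished <= 0) return -1;
        frameIndex += finished; // if several buffers finished, skip ahead to stay in sync
        return frameIndex;
    }
}
```

Because the video only advances when the audio device reports progress, the picture cannot run ahead of the sound; the cost is that frame timing is only as fine-grained as the polling interval.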

Riven · April 28, 2010, 3:58pm

I can imagine it working perfectly, but also a tad slower, like losing a few millis every second in playback speed.

delt0r · April 28, 2010, 4:04pm

[quote]Lol good luck with that :P. DCT is actually used in a lot of places and in various ways inside a video codec, you might need to make a lot of mods to the jheora to make it work.
[/quote]
In theora it’s used in one place (I have read the spec and even hacked jheora a bit a while ago). This is not H.264. It’s a much simpler codec and its decoding performance is quite a bit higher. In particular it doesn’t do much intra-frame prediction like H.264. This makes an iDCT in GLSL theoretically possible. My 4x4 one is working pretty well. But the iDCT was only about 60% of the CPU usage the last time I profiled.

As for syncing, syncing to the sound is what quite a few apps do. The reason is that it only matters that the sound matches the video, and a few milliseconds is not going to make any difference (1 ms per sec is about 3.6 seconds per hour). Even more important, if your sound card clock does not match the PC clock (true for some cards), the voice will have drifted compared to the video, so you would need some way of time-shifting the audio. It’s easier just to match what matters, and use the sound card as the clock.
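
To make the "sound card as the clock" idea and the drift arithmetic concrete, here is a small sketch. `AudioClock` and its methods are hypothetical names; it assumes the player can find out how many audio samples the device has consumed, which the thread does not show.

```java
/** Hypothetical helpers for using the audio device as the master clock. */
final class AudioClock {
    private AudioClock() {}

    /** Playback position in milliseconds, derived from samples the device has consumed. */
    static long positionMillis(long samplesPlayed, int sampleRateHz) {
        return samplesPlayed * 1000L / sampleRateHz;
    }

    /** Index of the video frame that should be on screen at the given audio position. */
    static long frameAt(long positionMillis, int fpsNumerator, int fpsDenominator) {
        return positionMillis * fpsNumerator / (1000L * fpsDenominator);
    }

    /** Accumulated drift in seconds after one hour, for a clock error given in ms per second. */
    static double driftSecondsPerHour(double errorMillisPerSecond) {
        return errorMillisPerSecond * 3600.0 / 1000.0;
    }
}
```

For example, `driftSecondsPerHour(1.0)` reproduces the 3.6 seconds per hour quoted above. Driving `frameAt` from the audio position means any mismatch between the sound card clock and the PC clock changes the overall playback speed very slightly, but never the audio/video alignment.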

princec · April 28, 2010, 5:23pm

It can’t lose any time without gaps in the audio - of which there are of course none. So playback is at precisely the correct rate, which is very accurately controlled by OpenAL’s underlying audio backend.

Cas

Momoko_Fan · April 28, 2010, 5:52pm

I profiled jheora performance on the HD video:

31% - A method called “ReconInterHalfPixel2”
28% - DeBlocking filter
16% - Huffman and in general decoding from file
9% - A method called “ExpandKFBlock”
6% - A method called “ReconInter”
5% - YUV to RGB, OggDemux, audio

Notice that IDCT is not even in this list, it’s only 0.1%…
Btw, the decoding thread is separate from the GL thread in my video player.

[quote]I’m queueing up audio buffers which contain the audio for 1 video frame; when I detect OpenAL has processed the audio for buffer (by polling every millisecond), I tell OpenGL to draw the next video frame, and by the time the buffer swap’s occurred, OpenAL is well underway rendering the sound for that frame. This seems to be working out just great - sound seems to be perfectly synchronized with video.
[/quote]
Ah cool. I don’t think I can do that tho. I have a better idea on how to make the audio clock more accurate.

delt0r · April 28, 2010, 6:14pm

I don’t have my profiling results handy. It was last year so the hardware was not that old. So I don’t really trust your results. I will believe that halfpel and the deblocking filter are high on the list since they really suck up a lot of raw memory bandwidth. But Huffman decoding is really fast even when done badly, while YUV->RGB (720p?) in Java only using 5%? Even that’s surprising (but perhaps not on faster modern CPUs).

What profiling tool did you use, and how long did you collect stats for? Note I added some of my own timing code, since these days I am finding it hard to get an accurate profiler. Even then I replace the “slowdown” areas with a no-op to see if it really does change the timings.
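
The "replace it with a no-op and see" check can be sketched roughly like this. `StageTimer` is a hypothetical helper, not something from jheora or from this thread; it times a task after a warm-up, so that two runs (one with the suspect stage, one with it stubbed out) can be compared.

```java
/** Hypothetical A/B timing harness: time a pipeline with a stage enabled, then with it stubbed out. */
final class StageTimer {
    private StageTimer() {}

    /** Runs the task `iterations` times after a warm-up and returns the elapsed nanoseconds. */
    static long timeNanos(Runnable task, int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) task.run(); // give the JIT a chance to compile hot paths
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) task.run();
        return System.nanoTime() - start;
    }

    /** Fraction of the full run time that disappears when the stage is replaced by a no-op. */
    static double measuredShare(long fullNanos, long stubbedNanos) {
        if (fullNanos <= 0) return 0.0;
        return Math.max(0.0, (double) (fullNanos - stubbedNanos) / fullNanos);
    }
}
```

If a profiler attributes a large share of the time to a method but `measuredShare` comes out near zero when that method is stubbed, the profiler's number is suspect. The comparison is still only an estimate: stubbing a stage changes cache behaviour and may let the JIT eliminate other work.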

Momoko_Fan · April 28, 2010, 8:50pm

I tested it a few times already, the results stay the same. I am using the NetBeans built-in profiler. I recalibrated it and ran the decoder for 100 frames (since it was too slow to run for longer).
Here’s a screenshot of the results:

http://img688.imageshack.us/img688/8951/jheora.png

Riven · April 28, 2010, 9:16pm

There are few profilers as good as VisualVM (which NetBeans uses) in transforming the code in such a way that it alters the performance characteristics completely.

delt0r · April 28, 2010, 9:21pm

Yes, that’s what I thought. The method that’s taking all the time… is the method that calls the iDCT (hence DCT in the name). After scanning the code I bet dollars to cents that the iDCT is really what’s taking a lot of time (and the deblocking filter). Note that both can use OpenGL for big speedups.

The problems I have had with profiling have been with the NetBeans and jvisual profilers. I know they do a bad job, because I can skip calling a method that takes 80% of the time according to some profiling results and it doesn’t speed up at all. Also, I don’t think anything that runs for less than 2 minutes gives a good reflection of server HotSpot performance.

But we will see.

** missed important words

delt0r · April 28, 2010, 10:05pm

Just talked to some of the theora guys on IRC. It was suggested that testing the full Cortado would give pretty messy profiling results, as it’s multithreaded with a bunch of complicated locks. Also, different bit rates would change where they expect the CPU to spend its time. What bit rate source are we talking about here?

Momoko_Fan · April 28, 2010, 11:13pm

Okay so you guys don’t like the NetBeans profiler, can you suggest me another one that is good and doesn’t cost money? I would run the test again on it. I am just using the NetBeans one since it’s easy to use and integrates into my project, and it seemed to work fine for everything I used it for.
I found this post with the profiling results for theora C version and it seems similar to my results:
http://osdir.com/ml/multimedia.ogg.theora.devel/2004-02/msg00078.html

[quote]The method thats taking all the time… is the method that calls the iDCT (Hence DCT in the name)
[/quote]
No, you’re wrong. Here’s the method ReconInterHalfPixel2:

public static final void ReconInterHalfPixel2(short[] ReconPtr, int idx1,
                           short[] RefPtr1, int idx2, short[] RefPtr2, int idx3,
                           short[] ChangePtr, int LineStep ) {
    int coff=0, roff1=idx1, roff2=idx2, roff3=idx3, i;

    for (i = 0; i < 8; i++ ){
      ReconPtr[roff1+0] = clamp255(((RefPtr1[roff2+0] + RefPtr2[roff3+0]) >> 1) + ChangePtr[coff++]);
      ReconPtr[roff1+1] = clamp255(((RefPtr1[roff2+1] + RefPtr2[roff3+1]) >> 1) + ChangePtr[coff++]);
      ReconPtr[roff1+2] = clamp255(((RefPtr1[roff2+2] + RefPtr2[roff3+2]) >> 1) + ChangePtr[coff++]);
      ReconPtr[roff1+3] = clamp255(((RefPtr1[roff2+3] + RefPtr2[roff3+3]) >> 1) + ChangePtr[coff++]);
      ReconPtr[roff1+4] = clamp255(((RefPtr1[roff2+4] + RefPtr2[roff3+4]) >> 1) + ChangePtr[coff++]);
      ReconPtr[roff1+5] = clamp255(((RefPtr1[roff2+5] + RefPtr2[roff3+5]) >> 1) + ChangePtr[coff++]);
      ReconPtr[roff1+6] = clamp255(((RefPtr1[roff2+6] + RefPtr2[roff3+6]) >> 1) + ChangePtr[coff++]);
      ReconPtr[roff1+7] = clamp255(((RefPtr1[roff2+7] + RefPtr2[roff3+7]) >> 1) + ChangePtr[coff++]);
      roff1 += LineStep;
      roff2 += LineStep;
      roff3 += LineStep;
    }
  }

It doesn’t look like a DCT to me.
I found where it was used in ExpandBlock and the comment says this:

[quote]/* Fractional pixel reconstruction. */
/* Note that we only use two pixels per reconstruction even for
the diagonal. */
[/quote]
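
For readers who want to see what the unrolled ReconInterHalfPixel2 listing above computes, here is a loop-based sketch of the same arithmetic. It is not the jheora source: `HalfPixelSketch` and the parameter names are made up, and `clamp255` is not shown in the post, so its behaviour (clamping to 0..255) is an assumption. The reading of `ChangePtr` as the decoded residual is also an assumption.

```java
/** Loop-based sketch of the half-pixel reconstruction shown above; not the jheora source. */
final class HalfPixelSketch {
    private HalfPixelSketch() {}

    /** Assumed behaviour of clamp255 (not shown in the post): clamp to the 8-bit range. */
    static short clamp255(int v) {
        return (short) (v < 0 ? 0 : (v > 255 ? 255 : v));
    }

    /**
     * Reconstructs one 8x8 block: average two reference predictions,
     * then add the per-pixel change value and clamp.
     */
    static void reconInterHalfPixel2(short[] recon, int reconOff,
                                     short[] ref1, int ref1Off,
                                     short[] ref2, int ref2Off,
                                     short[] change, int lineStep) {
        int c = 0; // the change block is stored contiguously, 64 values
        for (int row = 0; row < 8; row++) {
            for (int col = 0; col < 8; col++) {
                int predicted = (ref1[ref1Off + col] + ref2[ref2Off + col]) >> 1;
                recon[reconOff + col] = clamp255(predicted + change[c++]);
            }
            reconOff += lineStep;
            ref1Off += lineStep;
            ref2Off += lineStep;
        }
    }
}
```

Each output pixel needs two reads from reference frames, one read from the change block, and one write, with no transform involved, which is consistent with the point that this method is memory-bound motion compensation rather than a DCT.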

[quote]just talked to some of the theora guys on irc. It was suggested that testing the full Cortado would give pretty messy profiling results as its multi threaded with a bunch of complicated locks.
[/quote]
Okay but I am not using the cortado one; I am not allowed to use it since it’s under the GPL. I am just using the Jheora that comes with it. Also I profiled with the root method being the video decode function, so even if those locks were there, their effects would not be included in the results.

EDIT: Okay I asked my friend to profile an HD 720p video on his mac, using YourKit Java profiler. Here are the results:
YUVConv - 24%
loadFrame - 12%
LoopFilter (deblocking) - 12%
ReconInterHalfPixel2 - 9%
ReconInter - 7%
IDct1 - 3%
IDct10 - 3%
IDctSlow - 3%

He’s on a Mac, and the Java on the Mac is probably not as good at optimizing as the Sun one, which might explain the differences. Also, like you said, maybe it’s the profiler. I don’t have the YourKit profiler but I am gonna get it tomorrow and test this again.

Riven · April 29, 2010, 8:17am

Try passing the -Xprof parameter to the VM

delt0r · April 29, 2010, 8:26am

[quote]Okay so you guys don’t like the NetBeans profiler, can you suggest me another one that is good and doesn’t cost money?
[/quote]
Even with money. No. If you find one, let me know.

To put it simply, we are doing millions and billions of operations per second with a complicated 3-tier cache system + instruction cache + branch prediction + out-of-order execution. Even changing the order changes the performance. Adding profiling code changes the profile. And in Java with conditional compilation this is even worse. Basically, taking the measurement changes the measurement so much that the measurement is simply false. Like I said, the profiler claimed that 80% of the time was spent in a method, yet even with the method commented out the run time didn’t change by more than 5%. IO gets even harder, since slowing everything down with instrumentation code doesn’t slow down the IO. So when profiling, IO looks many times faster relative to everything else than it is in reality.

The best way is to run more than one profiler (I use jvisual/hprof and Xprof), then compare and check. I check by moving the problem around to make some things worse, i.e. if my OpenGL code is fill limited, a higher resolution should make it go a lot slower…

In this case we also have timing loops and locks.

The theora/jheora guys think that at low bit rates the iDCT won’t be a problem because you only have one or 2 nonzero coefficients. But at high bit rates they expect both iDCT and Huffman decoding to hurt. But their profiling results were showing a huge chunk of work in the YUV2RGB path, and that matches experience in both Java and C. In fact the Firefox decoder is apparently using GLSL for the YUV2RGB now.

Note one of the main reasons I expect the iDCT to be high is experience. The second is back-of-the-envelope calculations (bandwidth/FLOPS). iDCT in C (asm in fact) is fast because that’s what MMX was designed for. Java however does not have this, and so this is one area where “java is slow” is in fact true (same goes for YUV2RGB).
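
To give a feel for why the YUV2RGB path is expensive in plain Java, here is a sketch of a scalar per-pixel conversion. `YuvToRgb` is a hypothetical class; the integer coefficients are a widely used BT.601 video-range approximation, and whether jheora uses these exact values is not stated in the thread. 4:2:0 chroma subsampling is assumed.

```java
/** Hypothetical per-pixel YCbCr-to-RGB conversion using common BT.601 integer coefficients. */
final class YuvToRgb {
    private YuvToRgb() {}

    private static int clamp(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

    /** Converts one video-range YCbCr sample to a packed 0xRRGGBB int. */
    static int toRgb(int y, int cb, int cr) {
        int c = y - 16, d = cb - 128, e = cr - 128;
        int r = clamp((298 * c + 409 * e + 128) >> 8);
        int g = clamp((298 * c - 100 * d - 208 * e + 128) >> 8);
        int b = clamp((298 * c + 516 * d + 128) >> 8);
        return (r << 16) | (g << 8) | b;
    }

    /** Converts planar 4:2:0 data (one chroma sample per 2x2 luma block) to packed RGB. */
    static int[] convert420(byte[] yPlane, byte[] cbPlane, byte[] crPlane, int width, int height) {
        int[] rgb = new int[width * height];
        int chromaWidth = (width + 1) / 2;
        for (int row = 0; row < height; row++) {
            for (int col = 0; col < width; col++) {
                int ci = (row / 2) * chromaWidth + (col / 2);
                rgb[row * width + col] = toRgb(yPlane[row * width + col] & 0xFF,
                                               cbPlane[ci] & 0xFF, crPlane[ci] & 0xFF);
            }
        }
        return rgb;
    }
}
```

At 1280x720 the inner loop body runs 921,600 times per frame, each with several multiplies, shifts, and clamps, one pixel at a time. SIMD code in C or a fragment shader processes many pixels per instruction, which is the gap being described here.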

So is this 720p? What bit rate? And without profiling do you get real time? Note the C version uses less than 12% CPU for 720p24 on my 2-year-old system.

Momoko_Fan · April 29, 2010, 7:04pm

Here are the -Xprof results for the Kick-Ass movie trailer, 720p, 30 fps, 2000 bit rate:

run:
started ogg reader
new stream 16625
new stream 3801
found theora video
new stream 23968
found vorbis audio
theora dimension: 1280x720
theora aspect: 27x20
theora framerate: 60x2
ogg reader done
ellapsed: 23502

Flat profile of 23.58 secs (2208 total ticks): main

  Interpreted + native   Method                        
  1.3%    28  +     0    com.fluendo.jheora.Decode.loadAndDecode
  1.1%    25  +     0    com.fluendo.jheora.DCTDecode.UpdateUMVBorder
  0.5%     0  +    10    java.io.FileInputStream.readBytes
  0.4%     8  +     0    com.fluendo.jheora.Decode.decodeMVectors
  0.2%     4  +     0    com.fluendo.jheora.Decode.ExtractToken
  0.1%     3  +     0    com.fluendo.jheora.FrArray.getNextBInit
  0.1%     3  +     0    com.fluendo.jheora.DCTDecode.ExpandBlock
  0.1%     2  +     0    com.fluendo.jheora.Decode.decodeModes
  0.1%     2  +     0    com.fluendo.jheora.Filter.FilterHoriz
  0.1%     2  +     0    com.fluendo.jheora.ExtractMVectorComponentA.extract
  0.1%     2  +     0    com.fluendo.jheora.Recon.ReconInter
  0.0%     1  +     0    java.lang.ClassLoader.defineClass1
  0.0%     1  +     0    com.fluendo.jheora.Decode.unpackAndExpandToken
  0.0%     1  +     0    com.fluendo.jheora.Filter.SetupLoopFilter
  0.0%     1  +     0    com.fluendo.jheora.FrArray.deCodeSBRun
  0.0%     1  +     0    java.awt.color.ColorSpace.getInstance
  0.0%     1  +     0    com.fluendo.jheora.HuffEntry.read
  0.0%     1  +     0    com.fluendo.jheora.FrInit.CalcPixelIndexTable
  0.0%     1  +     0    com.fluendo.jheora.Recon.CopyBlock
  0.0%     1  +     0    com.fluendo.jheora.iDCT.IDctSlow
  0.0%     1  +     0    java.util.jar.JarFile.getEntry
  0.0%     1  +     0    com.fluendo.jheora.Quant.compQuantMatrix
  0.0%     1  +     0    com.fluendo.jheora.FrArray.quadDecodeDisplayFragments
  0.0%     1  +     0    com.fluendo.jheora.Filter.FilterVert
  0.0%     1  +     0    com.fluendo.jheora.Recon.ReconInterHalfPixel2
  5.0%   101  +    10    Total interpreted (including elided)

     Compiled + native   Method                        
 31.8%   703  +     0    com.fluendo.jheora.DCTDecode.ExpandBlock
 19.5%   430  +     0    com.fluendo.jheora.Decode.unpackAndExpandToken
 16.5%   365  +     0    com.fluendo.jheora.Filter.LoopFilter
  9.3%   206  +     0    com.fluendo.jheora.FrArray.quadDecodeDisplayFragments
  3.1%    69  +     0    com.fluendo.jheora.DCTDecode.ExpandKFBlock
  2.7%    59  +     0    com.fluendo.jheora.DCTDecode.CopyNotRecon
  2.3%    51  +     0    com.fluendo.jheora.FrArray.getNextBBit
  2.2%    48  +     0    com.fluendo.jheora.Decode.decodeMVectors
  1.8%    40  +     0    com.fluendo.jheora.DCTDecode.CopyRecon
  1.5%    34  +     0    com.fluendo.jheora.DCTDecode.ReconRefFrames
  1.5%    33  +     0    com.fluendo.jheora.Decode.decodeModes
  1.0%    21  +     0    com.fluendo.jheora.Decode.decodeBlockLevelQi
  0.8%    18  +     0    com.fluendo.jheora.Decode.unPackVideo
  0.0%     0  +     1    com.jcraft.jogg.StreamState.pagein
 94.1%  2077  +     1    Total compiled

         Stub + native   Method                        
  0.9%     0  +    19    java.lang.System.arraycopy
  0.9%     0  +    19    Total stub


Flat profile of 0.00 secs (1 total ticks): DestroyJavaVM

  Thread-local ticks:
100.0%     1             Blocked (of total)


Global summary of 23.61 seconds:
100.0%  2216             Received ticks
  0.2%     5             Received GC ticks
  3.2%    72             Compilation
BUILD SUCCESSFUL (total time: 24 seconds)

Here are the results for 720p Big Buck Bunny, available for download at the Big Buck Bunny site:


run:
started ogg reader
new stream 884871684
found theora video
new stream 1274777508
found vorbis audio
theora dimension: 1280x720
theora aspect: 0x0
theora framerate: 24x1
ogg reader done
ellapsed: 279081

Flat profile of 279.11 secs (23219 total ticks): main

  Interpreted + native   Method                        
  1.8%   413  +     0    com.fluendo.jheora.Decode.loadAndDecode
  1.0%   242  +     0    com.fluendo.jheora.DCTDecode.UpdateUMVBorder
  0.8%     0  +   176    java.io.FileInputStream.readBytes
  0.1%    14  +     0    com.fluendo.jheora.Decode.loadFrame
  0.0%     8  +     0    com.fluendo.jheora.Filter.SetupLoopFilter
  0.0%     8  +     0    com.fluendo.jheora.Decode.decodeMVectors
  0.0%     5  +     0    com.fluendo.jheora.State.decodePacketin
  0.0%     5  +     0    com.jme3.video.OggTheoraPerf.start
  0.0%     5  +     0    com.fluendo.jheora.FrArray.getNextBInit
  0.0%     4  +     0    com.jcraft.jogg.SyncState.pageout
  0.0%     3  +     0    com.fluendo.jheora.State.decodeYUVout
  0.0%     3  +     0    com.fluendo.jheora.DCTDecode.ExpandBlock
  0.0%     2  +     0    com.fluendo.jheora.Filter.FilterHoriz
  0.0%     2  +     0    com.fluendo.jheora.Recon.ReconIntra
  0.0%     2  +     0    com.jcraft.jogg.Page.serialno
  0.0%     1  +     0    java.util.jar.JarVerifier.beginEntry
  0.0%     1  +     0    com.fluendo.jheora.FrInit.InitFragmentInfo
  0.0%     1  +     0    com.fluendo.jheora.DCTDecode.UpdateUMV_HBorders
  0.0%     1  +     0    com.fluendo.jheora.Filter.FilterVert
  0.0%     1  +     0    com.fluendo.jheora.ExtractMVectorComponentA.extract
  0.0%     1  +     0    com.jcraft.jogg.StreamState.init
  0.0%     1  +     0    com.fluendo.jheora.DCTDecode.UpdateUMV_VBorders
  0.0%     1  +     0    com.fluendo.jheora.Filter.SetupBoundingValueArray_Generic
  0.0%     1  +     0    com.fluendo.jheora.Decode.unpackAndExpandToken
  0.0%     1  +     0    com.fluendo.jheora.DCTDecode.CopyRecon
  4.0%   740  +   178    Total interpreted (including elided)

     Compiled + native   Method                        
 35.4%  8223  +     0    com.fluendo.jheora.DCTDecode.ExpandBlock
 26.8%  6216  +     0    com.fluendo.jheora.Decode.unpackAndExpandToken
 12.1%  2806  +     0    com.fluendo.jheora.Filter.LoopFilter
  8.0%  1856  +     0    com.fluendo.jheora.FrArray.quadDecodeDisplayFragments
  2.5%   573  +     0    com.fluendo.jheora.DCTDecode.CopyRecon
  2.3%   545  +     0    com.fluendo.jheora.DCTDecode.CopyNotRecon
  2.1%   494  +     0    com.fluendo.jheora.FrArray.getNextBBit
  1.6%   364  +     0    com.fluendo.jheora.Decode.decodeModes
  1.5%   357  +     0    com.fluendo.jheora.DCTDecode.ExpandKFBlock
  1.5%   345  +     0    com.fluendo.jheora.Decode.decodeMVectors
  1.1%   259  +     0    com.fluendo.jheora.DCTDecode.ReconRefFrames
  0.7%   158  +     0    com.fluendo.jheora.Decode.unPackVideo
  0.1%    23  +     0    com.fluendo.jheora.FrArray.getNextSbBit
  0.0%    11  +     0    com.jcraft.jogg.SyncState.pageseek
  0.0%     5  +     0    com.jme3.video.OggTheoraPerf.start
  0.0%     1  +     0    com.jcraft.jogg.SyncState.pageout
 95.8% 22236  +     0    Total compiled

         Stub + native   Method                        
  0.3%     0  +    64    java.lang.System.arraycopy
  0.3%     0  +    64    Total stub

  Thread-local ticks:
  0.0%     1             Class loader


Flat profile of 0.03 secs (1 total ticks): DestroyJavaVM

  Thread-local ticks:
100.0%     1             Blocked (of total)


Global summary of 279.18 seconds:
100.0% 23227             Received ticks
  0.0%     4             Received GC ticks
  0.4%    94             Compilation
  0.0%     1             Class loader
BUILD SUCCESSFUL (total time: 4 minutes 39 seconds)

Problem with the -Xprof option is that it doesn’t show the sub-trees under ExpandBlock, like ReconInter and IDCT, so you don’t really know where the bottleneck is.

[quote]without profiling do you get real time.
[/quote]
It plays fine for the most part, but in high-action scenes it starts to lag/drop frames. At some point it becomes more stable (guess some code becomes compiled).

princec · April 29, 2010, 7:16pm

Turn off inlining completely and try it again.

Cas