JSquish - DXT Compression Library

Hi everyone, this is the first version of JSquish (~1.82MB).

JSquish is a Java port of the DXT compression library Squish. See this article for details. I will release the full source code after I fix a minor implementation issue.

The above archive contains the library JAR, which is also executable. It’s a tiny LWJGL application that can be used to compare the different DXT compression methods. There are four methods you can cycle through (press ‘C’):

  • Uncompressed
    The original texture
  • Compressed - Driver
    The texture is compressed by the OpenGL driver (using the EXT_texture_compression_s3tc formats)
  • Compressed - Range Fit
    The texture is compressed using the Range Fit method. This is a very fast method, with comparable results to the driver one.
  • Compressed - Cluster Fit
    The texture is compressed using the Cluster Fit method. This method was the real motivation for writing a Java port of the Squish library. It’s an embarrassingly slow method (only useful for offline compression), but gives amazing results compared to the driver implementation.

I urge you to try your own textures (press ‘O’ to load a new image) and see the quality difference of the Cluster Fit method. Try relatively small images first, to get a feeling of how slow it is. ;D
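For anyone unfamiliar with the format: DXT1 packs every 4x4 block of texels into just 8 bytes, so all the quality differences come from how the two endpoint colours and the per-texel indices are chosen. A quick sketch of the layout (plain S3TC facts, not JSquish’s API):


// DXT1 block layout: 8 bytes per 4x4 block of texels.
// Bytes 0-1: endpoint colour 0 as RGB565 (little-endian)
// Bytes 2-3: endpoint colour 1 as RGB565 (little-endian)
// Bytes 4-7: sixteen 2-bit indices selecting colour 0, colour 1 or an interpolant
public final class Dxt1Layout {

    // DXT1 needs 8 bytes per block; dimensions are rounded up to a multiple of 4
    public static int dxt1StorageBytes(int width, int height) {
        int blocksX = (width + 3) / 4;
        int blocksY = (height + 3) / 4;
        return blocksX * blocksY * 8;
    }

    // expands a packed RGB565 endpoint to 8-bit RGB, replicating high bits into the low bits
    public static int[] unpack565(int packed) {
        int r = (packed >> 11) & 0x1F;
        int g = (packed >> 5) & 0x3F;
        int b = packed & 0x1F;
        return new int[] { (r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2) };
    }

    public static void main(String[] args) {
        System.out.println("512x512 DXT1 = " + dxt1StorageBytes(512, 512) + " bytes"); // 131072
    }
}

That fixed 8:1 ratio (for 32-bit RGBA input) is the same for all the compressed modes above; only how the endpoints are chosen differs.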

SCREENSHOTS (click for hi-res)

Uncompressed:
http://www.zdimensions.gr/spasi/public/061026_JSquish/uncompressed_sm.jpg

Compressed - Cluster Fit:
http://www.zdimensions.gr/spasi/public/061026_JSquish/compressed_cluster_sm.jpg

Compressed - Driver:
http://www.zdimensions.gr/spasi/public/061026_JSquish/compressed_driver_sm.jpg

Compressed - Range Fit:
http://www.zdimensions.gr/spasi/public/061026_JSquish/compressed_range_sm.jpg

:smiley: I made the exact same thing some time ago. Time to do some performance comparisons :wink:

It’s supposed to go in the product I’m working on, but I’ve been hesitant due to possible patent issues. Any idea whether squish falls under the s3tc patent or not?

Our public-facing interfaces are almost identical, which isn’t surprising given the common heritage. The only big differences I see are that I javafied it (CLUSTER_FIT instead of kColourClusterFit) and that I use byte[] instead of int[]. I ran a test with the following code, using 100 warmup iterations and 100 benchmark iterations (should I use larger values here?):


BufferedImage image = ImageIO.read(new File("C:\\test.png"));
int width = image.getWidth();
int height = image.getHeight();

int flags = gr.zdimensions.jsquish.Squish.kDxt1 | gr.zdimensions.jsquish.Squish.kColourClusterFit;
int[] input = getImageData(image);
int storageRequirements = gr.zdimensions.jsquish.Squish.getStorageRequirements(width, height, flags);
int[] output = new int[storageRequirements];

for (int i = 0; i < WARMUP_ITERATIONS; i++) {
    gr.zdimensions.jsquish.Squish.compressImage(input, width, height, output, flags);
}

long start = System.currentTimeMillis();
for (int i = 0; i < BENCHMARK_ITERATIONS; i++) {
    gr.zdimensions.jsquish.Squish.compressImage(input, width, height, output, flags);
}
long end = System.currentTimeMillis();

System.out.println("average time = " + (end - start) / BENCHMARK_ITERATIONS + "ms");
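getImageData isn’t shown above; it just flattens the BufferedImage into one array entry per channel, roughly like this (a sketch, the exact layout the library expects may differ):


private static int[] getImageData(BufferedImage image) {
    int width = image.getWidth();
    int height = image.getHeight();
    int[] data = new int[width * height * 4];
    int offset = 0;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int argb = image.getRGB(x, y); // packed as 0xAARRGGBB
            data[offset++] = (argb >> 16) & 0xFF; // R
            data[offset++] = (argb >> 8) & 0xFF;  // G
            data[offset++] = argb & 0xFF;         // B
            data[offset++] = (argb >> 24) & 0xFF; // A
        }
    }
    return data;
}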

The same code was also run against my own implementation giving the following results:
JSquish: 334ms
MySquish: 210ms

Hooray :wink: I’m curious where the large difference comes from though. I would have expected my version to be slower since I used byte[] and have to do a & 0xFF each time I want to use a value.

Heheh! Given the compression quality though, I suspected that someone might have done this already. :wink:

I’m not sure about the patent issue, but there are already two other open-source implementations. I don’t think I have anything to worry about, but the case may be different for a commercial product (if that’s what your product is).

That’s exactly what I plan to do next. I’ve seen this before in several places: using bytes and & 0xFF is much faster than the equivalent with ints. The & overhead is probably tiny compared to the memory/bandwidth gains. This is the minor implementation issue I mentioned in my previous post. I will also javafy the interface a bit.
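The pattern itself is trivial:


// byte[] stores one channel per byte, but Java bytes are signed,
// so every read masks with & 0xFF to recover the unsigned 0..255 value
byte[] rgbaBytes = { (byte) 200, 40, (byte) 255, (byte) 128 };
int red = rgbaBytes[0] & 0xFF;  // 200 - without the mask this would read as -56

// int[] stores the value directly (no masking), but uses 4x the memory per channel
int[] rgbaInts = { 200, 40, 255, 128 };
int red2 = rgbaInts[0];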

Anyway, the bottleneck in Cluster Fit is the solveLeastSquares method, which is called ~500 times per 4x4 block on average. It takes around 97% of the total execution time, but it is plain floating point math and has nothing to do with bytes vs ints. So I can’t see how your implementation runs so much faster than mine. Maybe the test image you tried is too simple? From my tests I’ve found that the algorithm’s performance depends very much on the image contents; it takes a lot of time to find the optimal solution for complex images.
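Roughly, what the method solves: cluster fit fixes an assignment of the 16 texels to the two endpoints and the interpolants, which gives each texel i a weight pair (alpha_i, beta_i), and solveLeastSquares then finds the endpoints a, b that minimise the weighted squared error

\min_{a,b} \; \sum_i w_i \, \lVert \alpha_i a + \beta_i b - x_i \rVert^2

The closed-form solution of the resulting 2x2 normal equations, solved once per colour channel, is that big block of divisions in the method, which is why it is pure floating point math.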

I’ve compared my implementation of solveLeastSquares with yours. The biggest difference is that I inlined all the Vec3 stuff. Profiling showed that construction and manipulation of these objects was a hotspot, so I completely removed that stuff. Since that method is essentially the inner loop of the algorithm it makes a big difference. I’ve attached my cluster fit implementation so you can compare.

There is a LOT of room to optimize that code:

This is what I found:


    // zero where non-determinate
    float aX, aY, aZ;
    float bX, bY, bZ;
    if (beta2_sum == 0f) {
      aX = alphax_sumX / alpha2_sum;
      aY = alphax_sumY / alpha2_sum;
      aZ = alphax_sumZ / alpha2_sum;
      bX = bY = bZ = 0f;
    } else if (alpha2_sum == 0f) {
      aX = aY = aZ = 0f;
      bX = betax_sumX / beta2_sum;
      bY = betax_sumY / beta2_sum;
      bZ = betax_sumZ / beta2_sum;
    } else {
      float factor = (alpha2_sum * beta2_sum - alphabeta_sum * alphabeta_sum);

      aX = ((alphax_sumX * beta2_sum) - (betax_sumX * alphabeta_sum)) / factor;
      aY = ((alphax_sumY * beta2_sum) - (betax_sumY * alphabeta_sum)) / factor;
      aZ = ((alphax_sumZ * beta2_sum) - (betax_sumZ * alphabeta_sum)) / factor;

      bX = ((betax_sumX * alpha2_sum) - (alphax_sumX * alphabeta_sum)) / factor;
      bY = ((betax_sumY * alpha2_sum) - (alphax_sumY * alphabeta_sum)) / factor;
      bZ = ((betax_sumZ * alpha2_sum) - (alphax_sumZ * alphabeta_sum)) / factor;
    }

    // clamp the output to [0, 1]
    aX = Math.max(0, Math.min(1, aX));
    aY = Math.max(0, Math.min(1, aY));
    aZ = Math.max(0, Math.min(1, aZ));
    bX = Math.max(0, Math.min(1, bX));
    bY = Math.max(0, Math.min(1, bY));
    bZ = Math.max(0, Math.min(1, bZ));

    // clamp to the grid
    aX = (int)(GRID_X * aX + 0.5f) / GRID_X;
    aY = (int)(GRID_Y * aY + 0.5f) / GRID_Y;
    aZ = (int)(GRID_Z * aZ + 0.5f) / GRID_Z;
    bX = (int)(GRID_X * bX + 0.5f) / GRID_X;
    bY = (int)(GRID_Y * bY + 0.5f) / GRID_Y;
    bZ = (int)(GRID_Z * bZ + 0.5f) / GRID_Z;

If you see this, all bells should be ringing:


a = b / x;
c = d / x;
e = f / x;
g = h / x;

Replace that with:


float inv_x = 1.0f / x;
a = b * inv_x;
c = d * inv_x;
e = f * inv_x;
g = h * inv_x;

Do that everywhere and you’ll see significantly better performance (if it’s the bottleneck).
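For example, the else branch of the snippet above becomes (same variables, just the six divisions hoisted into one reciprocal):


      float factor = (alpha2_sum * beta2_sum - alphabeta_sum * alphabeta_sum);
      float invFactor = 1.0f / factor; // one division instead of six

      aX = ((alphax_sumX * beta2_sum) - (betax_sumX * alphabeta_sum)) * invFactor;
      aY = ((alphax_sumY * beta2_sum) - (betax_sumY * alphabeta_sum)) * invFactor;
      aZ = ((alphax_sumZ * beta2_sum) - (betax_sumZ * alphabeta_sum)) * invFactor;

      bX = ((betax_sumX * alpha2_sum) - (alphax_sumX * alphabeta_sum)) * invFactor;
      bY = ((betax_sumY * alpha2_sum) - (alphax_sumY * alphabeta_sum)) * invFactor;
      bZ = ((betax_sumZ * alpha2_sum) - (alphax_sumZ * alphabeta_sum)) * invFactor;

The same trick applies to the alpha2_sum and beta2_sum branches, and (if GRID_X/Y/Z are floats) precomputed 1/GRID constants work for the grid clamping too.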

Further, I benchmarked this a while ago, so in the latest VMs this might not be an issue anymore, but this:


    aX = Math.max(0, Math.min(1, aX));

there are 2 assignments and 2 comparisons here in the worst case. The following code makes that either 0 or 1 assignments and 1 or 2 comparisons:


   if (aX < 0.0f) aX = 0.0f; else if (aX > 1.0f) aX = 1.0f;

Let me know the results :slight_smile:

Thanks for the suggestions. These changes gave the following results:
base: 210ms
replace division by multiplication: 209ms
replace Math.min/max with if/else: 176ms

I’ve attached an updated version. My profiler now shows the following hotspots:
ClusterFit#compress4: 46%
ClusterFit#compress3: 21%
ClusterFit#solveLeastSquares: 8%
I don’t really have any ideas on how I could improve compress3/4 though.

Thank you both. After using pepijnve’s code and applying Riven’s (a.k.a. Java Performance God) suggestions, my implementation is now 4 (four) times faster! Apparently inlining made a huge difference. I was expecting HotSpot to do a better job and optimize away most of the method calls when using vectors. :-\

Anyway, I’ve updated the archive with the new version. You can also download the library separately (265kb) if you want to compare it again.

I found the big difference between the two versions. My implementation was based on squish 1.7, while yours is probably based on 1.9. The main difference is that cluster fit now goes through multiple iterations. After updating my implementation, the average time went up to 266ms again. I then ran the test with the updated JSquish and that gave 147ms. :o I profiled your version and didn’t see compress3 popping up anywhere, which seemed kind of strange, so I peeked at ColourFit#compress. Your implementation differs from the original squish version. Squish does


bool isDxt1 = ( ( m_flags & kDxt1 ) != 0 );
if( isDxt1 ) {
  Compress3( block );
  if( !m_colours->IsTransparent() ) {
    Compress4( block );
  }
} else {
  Compress4( block );
}

while you do


bool isDxt1 = ( ( m_flags & kDxt1 ) != 0 );
if( isDxt1 && m_colours->IsTransparent() ) {
  Compress3( block );
} else {
  Compress4( block );
}

which is a subtle difference that might cause a slight loss of quality. Squish uses the best result of compress3 and compress4, while you’re using either one or the other. No idea if this makes a big difference in practice though.
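The ‘best result’ mechanism is roughly this (the names below are illustrative, not the actual code): each of compress3/compress4 compares its error against the best found so far for the block and only writes its output when it improves on it, so running both keeps whichever does better per block.


// Illustrative sketch of the 'keep the best' mechanism (not the actual squish code):
// each compressor only overwrites the output block when it improves on the best error so far.
private float bestError = Float.MAX_VALUE;

void saveIfBetter(byte[] block, byte[] candidate, float error) {
    if (error < bestError) {
        System.arraycopy(candidate, 0, block, 0, candidate.length);
        bestError = error;
    }
}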

Take a look at this thread about it. It’s horrific.

Your:


    aX = aX < 0f ? 0f : (aX > 1f ? 1f : aX);

is not equal to:


   if (aX < 0.0f) aX = 0.0f; else if (aX > 1.0f) aX = 1.0f;

(assignment-count wise)

Tell me the diff, if any - it might save you a few cycles.

@pepijnve:
Indeed, that was a piece of code that really confused me. I did a few tests last night as I was scratching my head and found that compress4 always produces the best result, which is why I removed compress3 for RGB images. Doing 3+4 makes the algorithm 100% slower for no apparent improvement, but I guess it’s too obvious to be a mistake and there must be a reason compress3 is there. I’ll investigate further.

@Riven:
Yeah, I remember your tests. I prefer code simplicity over performance, but thankfully in this case the problem was only in one particular method, so it was no big deal to optimize. In general though, we urgently need escape analysis, stack allocation and related optimizations.

Btw, here’s my solveLeastSquares implementation.

I’d advise you to lose all the Vecs from your algorithm.

You have to consider that the Vecs can be scattered all over the heap, which is kinda slow compared to keeping contiguous blocks of vec data (float[]) in the CPU cache…


		final Vec w = weighted[i];

		alpha2_sum += a * a;
		beta2_sum += b * b;
		alphabeta_sum += a * b;

		alphax_sumX += w.x() * a;
		alphax_sumY += w.y() * a;
		alphax_sumZ += w.z() * a;

Give this a try, even if you have to modify quite a bit of code:


		final float wx = weighted[i];
		final float wy = weighted[i+1];
		final float wz = weighted[i+2]; // requires weighted to be a float[], with the loop index advancing by 3 per vec

		alpha2_sum += a * a;
		beta2_sum += b * b;
		alphabeta_sum += a * b;

		alphax_sumX += wx * a;
		alphax_sumY += wy * a;
		alphax_sumZ += wz * a;

		betax_sumX += wx * b;
		betax_sumY += wy * b;
		betax_sumZ += wz * b;

I’m not saying it will definitely boost performance, but it’s worth a try; it might be 2x faster, or it might not make much difference.

It’s pretty much trial-and-error now.

Will do. Thanks again. Btw, Range Fit can be optimized accordingly.

I’m now wondering how the original Squish implementation compares to ours. It uses hand-coded SSE, but a comparison would still be interesting.

You can’t beat hand-coded SSE (4 values per instruction) in Java. The VM only uses SSE for single values, which is not nearly as efficient.

In the past I did some benchmarks with it through JNI, and even that was ~50% faster than plain float[] math, despite the ~1100ns JNI overhead per call.

As long as your API is run AOT, who cares! (damn, that’s a lame way to defend Java :))

I finished the adjustments, so here’s JSquish in binary and source form. The library and the test application can also be downloaded separately.

What changed:

- I switched everything to byte arrays. No performance difference at all, but at least no memory is wasted.
- I did a few optimizations here and there.
- I changed the public interface to use enums for options and added more entry points with default values (see the sketch below for the general idea).
- I reverted compress to use both compress3 & compress4. I ran some more tests and both affect the final image after all (although with negligible difference).
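The enum and method names below are illustrative only; check the sources for the real signatures:


// Illustrative sketch only: option flags replaced by enums, plus overloads with defaults.
public enum CompressionType { DXT1, DXT3, DXT5 }
public enum CompressionMethod { RANGE_FIT, CLUSTER_FIT }

// convenience entry point: defaults to cluster fit
public static byte[] compressImage(byte[] rgba, int width, int height, CompressionType type) {
    return compressImage(rgba, width, height, type, CompressionMethod.CLUSTER_FIT);
}

public static byte[] compressImage(byte[] rgba, int width, int height,
                                   CompressionType type, CompressionMethod method) {
    byte[] blocks = new byte[getStorageRequirements(width, height, type)];
    // ... compression proper ...
    return blocks;
}

public static int getStorageRequirements(int width, int height, CompressionType type) {
    int blockCount = ((width + 3) / 4) * ((height + 3) / 4);
    return blockCount * (type == CompressionType.DXT1 ? 8 : 16);
}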

I even tried to do gamma correction, but it didn’t work out well. I moved everything to linear space in ColourSet, ran the compression, then moved back to gamma space when writing the compressed block. I was expecting to see a nice, subtle difference, but instead I got a somewhat darker image and minor artifacts. I believe the problem lies with the error computation or the grid clamping, but I couldn’t figure out a way to solve it. Anyway, it’s good enough as it is.
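By linear space I mean the usual gamma-2.2 approximation (not the exact piecewise sRGB curve), roughly:


// Rough gamma-2.2 conversion (an approximation, not exact sRGB):
// 0..255 gamma-space values go to linear floats before compression,
// and back to gamma space when the compressed block is written.
static float toLinear(int gammaValue) {
    return (float) Math.pow(gammaValue / 255.0, 2.2);
}

static int toGamma(float linearValue) {
    return (int) (Math.pow(linearValue, 1.0 / 2.2) * 255.0 + 0.5);
}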

Btw, I forgot to mention that Java 5.0 is required and Java 6.0 highly recommended (up to 20% faster on the client VM).

I’ve been pulling my hair out trying to find the cause of the remaining performance difference between our two implementations (mine was a bit slower :)). Going over both codebases with a fine-toothed comb, I think I found a bug in your implementation of computeWeightedCovariance. You pass in a Matrix instance that is reused. I checked the original squish code, and you definitely should be resetting the matrix values to 0 before you start the accumulation loop; otherwise you’ll be reusing covariance values from a previous calculation. Sadly enough, when I corrected this in JSquish the performance gap only got larger :frowning: I’m kind of stumped as to what could be causing this, especially since my profiler is telling me the exact opposite. The search continues…
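In code terms, something like this is needed at the top of computeWeightedCovariance, before the accumulation loop (the set(row, col, value) call is illustrative; use whatever the Matrix class actually provides):


// Clear the six distinct entries of the reused symmetric 3x3 covariance matrix
// before accumulating, otherwise values from the previously compressed block leak in.
covariance.set(0, 0, 0f); covariance.set(0, 1, 0f); covariance.set(0, 2, 0f);
covariance.set(1, 1, 0f); covariance.set(1, 2, 0f); covariance.set(2, 2, 0f);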

Heh. Awesome ;D

Woohoo! The culprit was a stupid, stupid bug. I had written

axis = bestend - bestend;

instead of

axis = bestend - beststart;

which obviously caused incorrect results, but also caused solveLeastSquares to be called twice as often as in JSquish. This tiny change chopped 80ms off the average time, bringing it down to 100ms. Time for a celebration :smiley:
Thanks Spasi and Riven for all the information. It’s definitely been an interesting learning experience.