[quote=“Spasi,post:120,topic:56271”]
Hmm. Do you have a rough ETA for this? I want to know if it’s worth waiting for your changes or if I should keep going.
[quote=“theagentd,post:121,topic:56271”]
I’ll work on it tomorrow, hoping to push a working prototype by Sunday morning at the latest.
Today I spent half the day tuning jemalloc and trying to get a feel of how its thread caching behaves. Also played a bit with custom arenas. Nothing exciting to report, but I did disable two compile-time features which resulted in a 10-13% performance increase for a tight malloc(4)/free loop. The next nightly build will have the newly configured jemalloc.
The other half, I spent verifying a stupid idea I had, to optimize Java-side thread-local access. It’s so wicked that I’m not even going to share it here, but it is indeed 3-4x faster than ThreadLocal. It’s used internally only in LWJGL and there’ll be a configuration option to enable/disable it (not sure if I want to make it the default yet).
Btw, the reason I worked on that was to get thread-local stack allocations closer in performance to your BufferStack example. The idea is to be able to allocate structs without passing explicit stack objects, then do a static “pop” at the end. That should work fine for most use-cases and you could use an explicit stack for tight, performance-sensitive loops.
I bet it’s indexing into a static array based on the thread ID. =P
EDIT: Another more concrete way of doing that is to have a specific Thread class that implements some kind of ID number (preferably reused when a thread ends).
import java.util.ArrayList;

public class ThreadLocalThread extends Thread {
    private static final ArrayList<Integer> availableIDs;
    static {
        availableIDs = new ArrayList<>(1024);
        for (int i = 0; i < 1024; i++) {
            availableIDs.add(i);
        }
    }

    private final int id;

    public ThreadLocalThread() {
        synchronized (availableIDs) {
            id = availableIDs.remove(availableIDs.size() - 1); // Remove last
        }
    }

    public void run() {
        super.run();
        synchronized (availableIDs) {
            availableIDs.add(id); // Return ID to pool.
        }
    }

    public int getThreadSafeID() {
        return id;
    }
}
My LibStruct library has already solved the problem of blazingly fast thread-locals. No synchronization whatsoever. It works as long as you create fewer than 100_000 threads (or however much memory you’re willing to throw at it).
In LibStruct it is used for pushing and popping the stack on method entry and termination. IIRC it’s at least, say, 100-1000x faster than ThreadLocal, as it replaces ThreadLocal’s HashMap lookup by a plain memory access.
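A minimal sketch of that array-indexed idea (the class name and details here are hypothetical, not the actual LibStruct code), assuming the thread's id is used directly as the index:

```java
// Hypothetical sketch: one slot per thread id, so get/set is a plain
// array access -- no synchronization, no ThreadLocalMap hash lookup.
public class FastThreadLocalSketch<T> {
    // Thread ids are handed out monotonically and never reused, so the
    // table must cover every thread the process will ever create --
    // hence the "fewer than 100_000 threads" caveat.
    private static final int MAX_THREADS = 100_000;

    @SuppressWarnings("unchecked")
    private final T[] slots = (T[]) new Object[MAX_THREADS];

    public void set(T value) {
        slots[(int) Thread.currentThread().getId()] = value;
    }

    public T get() {
        // A plain array load instead of a hash lookup.
        return slots[(int) Thread.currentThread().getId()];
    }

    public static void main(String[] args) {
        FastThreadLocalSketch<String> tl = new FastThreadLocalSketch<>();
        tl.set("stack-0");
        System.out.println(tl.get()); // prints "stack-0"
    }
}
```

The trade-off is memory for speed: the table is sized up front for the worst case, which is what makes the lock-free fast path possible.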
Really? 100-1000x faster? That sounds amazing!
However, isn’t Java’s ThreadLocal lowered to the platform’s thread-local memory support during JIT?
I mean, the Java ThreadLocal class is really just a “proof of concept” and shouldn’t be done in this “manual” way once a Java program gets compiled.
(just wondering)
Maybe it is today, but when I wrote LibStruct, it was dreadfully slow. LibStruct was unusable without this FastThreadLocal class. Just benchmark it on a JIT near you.
Please note that my 3-4x figure was a comparison to a very simple program with a single ThreadLocal. The ThreadLocalMap lookup indeed slows down as you add more TLs, but the quick path is quite fast (~8ns).
The lookup in my code is equivalent to Riven’s, except there’s no array access. It works in any thread and for any number of threads. Btw, in my post I said “thread-local access”, not “ThreadLocal replacement”.
[quote]The ThreadLocalMap lookup indeed slows down as you add more TLs, but the quick path is quite fast (~8ns).
[/quote]
There was only 1 thread-local in LibStruct, but it ground performance to a halt (compared to my FastThreadLocal).
I referred to “but it is indeed 3-4x faster than ThreadLocal”. Given that ThreadLocal is (was?) super-slow, a factor of 3-4 isn’t much.
Why not share your crafty hacky new mechanism, then we can run some benchmarks.
Just finished testing your implementation and it is indeed much faster.
But not because of the lookup speed itself, as I said my implementation is basically the same number of instructions and both are not more than 3-4x faster than ThreadLocal. The big difference is that the JVM is able to detect your code as loop invariant and moves it outside the benchmark loop. That’s how you were able to see 100-1000x speedups. This is admittedly a big advantage for your method. I use Unsafe in my code which basically kills any code motion during JIT.
I’m respectfully asking for permission to add (a modified version of) FTL to LWJGL.
Ehm… which benchmark loop? It’s a little presumptuous to think I would make such a rookie mistake when analyzing performance.
Like I said, maybe the JVM has optimized ThreadLocal performance these days, causing FTL to be only 3-4x faster than TL. But when I was optimizing my demo, the framerate increased significantly, even though I did maybe only a few 100k push/pops per frame while processing tens of millions of triangles, in code that I assume was way too big to fully inline (though I may be wrong there). I actually had quite an elaborate demo (still in the LibStruct repo, called SoftlyLit). In the end I even dropped FTL because it was not fast enough for my purposes (proving the JIT had not optimized it away entirely), opting instead for passing the stack reference as an argument to the methods that needed it. I also had multi-threaded demos that did much more than pushing and popping the stack, so the measured performance jump was observed in a real-world scenario (as far as demos can be considered as such).
Having said all that, you can take the concept of FTL and implement it into LWJGL.
As for credit, you can add @author Riven
to the javadoc. :point:
As for your Unsafe hack, did you read/write in the thread’s native stack directly?
I just updated to the latest LWJGL version, and found some peculiarities.
VkInstance and VkDevice seem to want the create info struct now as well. Does this imply that the constructor itself creates the object for you? If not, what do they need that information for?
The VkSubmitInfo and VkPresentInfoKHR set() functions have changed (as far as I can see). They both require a count for just one of their buffers, and I see no reason why it has been re-added. Code generator bug? VkSubmitInfo requires a waitSemaphoreCount and VkPresentInfoKHR requires a swapchainCount.
See: http://www.java-gaming.org/topics/engaging-the-voyage-to-vulkan/37041/msg/354303/view.html#msg354303
This is all by design.
The problem is that some count fields in VK structs (VkSubpassDescription, VkSubmitInfo, VkPresentInfoKHR and VkPipelineViewportStateCreateInfo) affect multiple buffers.
Geh. Why can’t it just validate that those buffers have identical lengths and use that length?
It’s not possible, because LWJGL cannot access the buffers (specifically their lengths) anymore once they have been set. C does not provide length information for a void*.
Spasi and I have discussed this at great length, trying to evaluate every possible design. This is the best we could come up with so far. But maybe you have a better idea.
The only solution would be to shadow the C struct fields in Java, too.
There are ways to do it, yes. But every way would require severely limiting the use of the Vulkan struct API by requiring that certain fields be set before certain other fields. This would make valid C Vulkan programs difficult to port and would result in a direct port not being valid anymore under LWJGL semantics.
Forgive me for stating the obvious, but implementing this on the Java side is easy as hell. I assume the code generator doesn’t support this which is why you want to avoid it?
[quote=“Riven,post:130,topic:56271”]
FWIW, I tested ThreadLocal from Java 6 and up: ~14ns on Java 6, ~8ns on Java 7, 8, and 9.
FTL and the unsafe implementation get that down to 2-3ns.
[quote=“Riven,post:130,topic:56271”]
Thanks!
[quote=“theagentd,post:132,topic:56271”]
LWJGL uses the apiVersion and ppEnabledExtensionNames fields to build the VKCapabilities objects for VkInstance and VkDevice. Vulkan does not provide a way to query the enabled extensions, so it has to be done like that.
[quote=“theagentd,post:136,topic:56271”]
I would be interested to know what you mean exactly. As KaiHH said, we’ve explored the possible approaches extensively and the current design had the best trade-offs. It’d be great if you have a better idea that we could discuss. It’s easy to implement anything in the generator, but it would have to make sense.
The latest nightly build (3.0.0 #47) includes a stack allocation API. See the org.lwjgl.system.MemoryStack class (warning: WIP + no documentation). Struct classes have been augmented with stack allocation factory methods:
malloc(MemoryStack); // explicit stack
calloc(MemoryStack);
mallocStack(); // thread-local stack
callocStack();
// similarly for struct buffers
Example usage: a) thread-local b) explicit
One possible improvement is to remove the need to call push(). It could do it automatically the first time you allocate after the last pop().
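The auto-push idea can be sketched with a toy allocator (all names here are hypothetical, not the LWJGL MemoryStack API; offsets into a byte[] stand in for real addresses): the first allocation after a pop() implicitly opens a new frame, so callers only have to remember the final pop().

```java
// Toy sketch of an implicit-push stack allocator.
public class AutoPushStack {
    private final byte[] buffer = new byte[64 * 1024];
    private final int[] frames = new int[64]; // saved stack pointers
    private int frameCount = 0;
    private int pointer = 0;          // grows upward in this toy version
    private boolean framePushed = false;

    public int malloc(int size) {
        if (!framePushed) {           // implicit push on first alloc after pop
            frames[frameCount++] = pointer;
            framePushed = true;
        }
        int address = pointer;        // offset into buffer
        pointer += size;
        return address;
    }

    public void pop() {
        pointer = frames[--frameCount];
        framePushed = false;          // next malloc opens a fresh frame
    }

    public static void main(String[] args) {
        AutoPushStack stack = new AutoPushStack();
        int a = stack.malloc(16);     // implicit push, returns offset 0
        int b = stack.malloc(16);     // returns offset 16
        stack.pop();                  // frees both allocations
        System.out.println(a + " " + b); // prints "0 16"
    }
}
```

One wrinkle with this scheme is nesting: without an explicit push(), a callee cannot open its own frame inside the caller's allocations, which may be why it's only proposed as an optional improvement.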
@Spasi: why the asymmetry between nmalloc and ncalloc parameters?
Also, in LibStruct I had cases where num*size overflowed, so I converted malloc/calloc to use longs as inputs.
You also have the assumption that the ‘alignment’ parameter is POT - you might want to enforce that.
[quote=“Riven,post:138,topic:56271”]
The C malloc and calloc functions have the same asymmetry.
[quote=“Riven,post:138,topic:56271”]
Technically possible, but why would anyone try to allocate > 2GB on the stack?
[quote=“Riven,post:138,topic:56271”]
Thanks, will do.
Why not grow the stack towards zero (like the native stack) ?
It makes alignment more natural too:
pntr &= ~(potAlignment-1);
As for >2G allocations: it was a general malloc/calloc function, but indeed, for a stack it makes less sense.
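A quick check of why alignment is more natural when the stack grows towards zero (allocDown is a hypothetical helper, with plain longs standing in for real stack pointers): masking with ~(align - 1) rounds the pointer down to a power-of-two boundary, and rounding down is exactly the direction a downward-growing stack moves in.

```java
// Demonstrates the pntr &= ~(potAlignment-1) trick for a stack that
// grows towards zero.
public class DownwardAlign {
    static long allocDown(long pointer, long size, long potAlignment) {
        long p = pointer - size;        // reserve the bytes first
        return p & ~(potAlignment - 1); // then round down to the boundary
    }

    public static void main(String[] args) {
        // 1000 - 20 = 980, rounded down to the next 16-byte boundary:
        System.out.println(allocDown(1000, 20, 16)); // prints 976
    }
}
```

With an upward-growing stack, the same alignment needs rounding *up* (add align-1, then mask), which costs an extra addition per allocation.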