LWJGL: Random direction thoughts

Rolling out a major version release is always an opportunity to rethink everything. Here's a start on some random thoughts I have WRT LWJGL.

Some term definitions:
java: any JVM language
developer: an LWJGL internals developer
user: a programmer using LWJGL & java (and generally not doing any custom native-side coding)

Choose a baseline compatibility version and nuke everything older than that.

Goals: reduce long-term developer time commitment and reduce noise presented to users.

Arguments against:

  1. It’ll break my program.
    A) No it won’t. Version 2.x will still be available. Upgrade or don’t. Choosing a library isn’t some e-peen achievement system.
  2. I can’t scale my program as well to different targets.
    A) The statistically zero percent of people for whom this is a concern, and who can actually do so, will have zero problem supporting both and loading the appropriate version at runtime.
  3. What about folks that want to use the old-school API as a learning tool?
    A) Version 2.x is possible, but a better solution is to use a mid to high level API instead. Programming to a graphics API basically older than the average JGO member isn’t a fantastic idea.

The thing to remember is that you don’t need to think about today…nor even about when V3 is considered usable, but about when you guess that more users are writing against LWJGL 3.x than 2.x. Personally, 3.0 is probably the furthest back in history I’d think about.

Lose functionality writable in java
There’s no reason for this to be in the main repo unless it’s considered important to base functionality (aka support for using the low level…but not things like look-at matrices) or used internally. One or more companion libraries (hosted by LWJGL or a third party) should cover this functionality.

I have a fair number of other ideas, but spreading them out across time probably is better than one meta-post.

I think that’s basically the entire plan of LWJGL in a nutshell, there.

Cas :slight_smile:

Is there any compatibility issue to be concerned with?

Ie… if i want to support both Mac and Windows will i be able to use the same LWJGL version ?

(May not be an issue but thought i would ask)

Windows, mac and linux all with the same version = yes.

@Cas: Well then: that’s awesome. I fully endorse the basic plan.

Next batch are along the same lines, but a step away from basic GL/AL/CL support and on to some useful things for “user” in “java” (see the definitions above) + LWJGL.

Expose some hardware information

  1. Number of physical processors. Being able to know both the number of logical and physical processors is useful (aka more or less required). Example: determining the number of threads to have active per timeslice.

  2. A limited number of hardware-supported CPU opcode queries. From a user perspective the only ones worth knowing about are those for which both of these hold: there is a java-equivalent method call AND that method call is a JVM intrinsic. If both aren’t true, then there’s nothing in java that the user can do with the information. This list is currently very small (this is based on an early version of 7…I need to recheck the intrinsic list at some point. This old list is here):

leading zero count, trailing zero count, population count and byteswap.

Trivial example:


static boolean isPowerOfTwo(int x)
{
  // Compiler will drop the dead code
  if (CPUInfo.hasBitCount)
    return Integer.bitCount(x) == 1;   // two opcodes if supported...many if not.

  // isolate low bit; if same as input and not zero, then true
  return (x & -x) == x && x != 0;
}

Counter-examples include: SIMD operations (not exposed in java, so no point) and atomic increment (there are method calls, but they are currently not intrinsic).
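For concreteness, the four opcode-backed queries listed above map to standard java methods like these (a minimal sketch; the single-instruction claim only holds on CPUs with LZCNT/TZCNT/POPCNT/BSWAP support and a JIT that intrinsifies the calls):

```java
public class IntrinsicBitOps {
    public static void main(String[] args) {
        int x = 0x00F0;
        // Each call below compiles to one (or two) opcodes on supporting
        // hardware; otherwise the JIT emits a multi-instruction fallback.
        System.out.println(Integer.numberOfLeadingZeros(x));  // leading zero count  -> 24
        System.out.println(Integer.numberOfTrailingZeros(x)); // trailing zero count -> 4
        System.out.println(Integer.bitCount(x));              // population count    -> 4
        System.out.printf("%08X%n", Integer.reverseBytes(x)); // byteswap            -> F0000000
    }
}
```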

Some others might be of interest to developers since they could use that information and they could potentially expose a method via JNI.

  1. Sizes of caches and the lengths of cache lines. Since java is runtime-compiled, this would allow building cache-aware data structures.

(EDIT: my example was checking for LZC where it should have been checking POPCOUNT…doh!)

  1. is already doable with Java using Runtime.getRuntime().availableProcessors();

Not sure how useful 2) & 3) would actually be; they seem a little out of scope for what LWJGL does and better suited to being part of the Java Runtime. Even so, it’s rare that such information will be needed by users.

If you really do need that sort of fine grained information and want to optimise for certain processors you can use something like the following:

System.getenv("PROCESSOR_IDENTIFIER");
System.getenv("PROCESSOR_ARCHITECTURE");
System.getenv("PROCESSOR_ARCHITEW6432");
System.getenv("NUMBER_OF_PROCESSORS");

That identifies the processor; it should be easy from there to match up its spec and then optimise accordingly.
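A hedged sketch of that approach (note these environment variables exist only on Windows, so any real use needs a fallback; the class name and the parsing logic here are illustrative, not an existing API):

```java
public class EnvCpuInfo {
    public static void main(String[] args) {
        // Windows-only variables; both return null on Linux/macOS.
        String id    = System.getenv("PROCESSOR_IDENTIFIER");  // e.g. "Intel64 Family 6 ..."
        String count = System.getenv("NUMBER_OF_PROCESSORS");  // logical processor count

        // Fall back to the portable query when the variables are absent.
        int logical = (count != null) ? Integer.parseInt(count)
                                      : Runtime.getRuntime().availableProcessors();
        boolean looksIntel = id != null && id.contains("Intel");

        System.out.println("logical processors: " + logical + ", Intel: " + looksIntel);
    }
}
```

The fallback is exactly why this is fragile for cross-platform use: outside Windows you only get the logical count back, with no identifier to match against.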

The problem with availableProcessors() is that it doesn’t distinguish between physical and logical cores. Personally I don’t think that’s a problem, but it could be useful to know how many physical cores the computer has.

Yes: availableProcessors is the number of logical cores. Neither piece of information is very usable on its own. If you assume logical == physical and that isn’t the case, then you’re likely to spawn too many threads and pay the very high cost of context switches over and over. If you assume physical == logical*magic_factor when in fact logical == physical, then you’re under-utilizing.

Doing 2 & 3 in pure java is possibly doable (depending on environment variables seems very fragile and non-portable), but it’s a real PITA, as the code needs to be kept up to date with processors (I’m not even sure you can get exact results with this information). On the CPU side this isn’t the case, as the CPUID queries are fixed…a one-time development cost and you’re done forever. If these ‘kinds’ of queries were deemed reasonable for LWJGL to expose, then it could make sense to use an overkill library (again a one-time cost, assuming the library is maintained) and to expose more features as deemed appropriate. Say, thread affinities.

This really isn’t very much work. In my opinion it is in the spirit of LWJGL’s goals: provide users working in java the low-level functionality needed to write multimedia apps without resorting to native code.

Can you even get that info? I mean, with virtualization, hyper-threading, etc., does the number of “physical cores” even have a simple answer?

Is there any case when you would not want to use all available logical cores? I know that hyperthreading can hurt performance in some games, but Java can take a lot of benefit from hyperthreading since it helps massively when you have lots of cache misses which is very common with Java’s memory model.

Well, personally I like to be able to restrict how many CPUs a game uses when I have sims going. Just because I have cores does not mean I want the game/app to use them.

This is however an edge case. I just wish people would always let such settings be manually overridden.

@delt0r: yes, the information is available (I’m not sure what you’re thinking about in terms of the virtualization feature)

@theagentd: yes, you want all logical cores to have an actively running thread for all timeslices to maximize throughput. Knowing the number of physical cores allows you to make a more accurate estimation. So if one piece of the puzzle is more important, then java is currently providing the most important one. Intel claims you can get up to a 30% boost per core with HT…I’ve only ever measured in the 10-15% (if memory serves) range. Having a rough idea how something scales on my N-core HT box gives a pretty reasonable guess about how many threads I should spawn on an M-core box with or without HT.
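A back-of-the-envelope sketch of that estimation (the 15% boost is the figure measured above, not a universal constant, and effectiveCores is a hypothetical helper, not LWJGL API):

```java
public class HtScaling {
    // Estimate throughput in "core equivalents": each physical core with
    // HT behaves like ~1.1-1.3 cores, not like 2.
    static double effectiveCores(int physicalCores, boolean hyperThreaded, double htBoost) {
        return physicalCores * (hyperThreaded ? 1.0 + htBoost : 1.0);
    }

    public static void main(String[] args) {
        // 4 physical cores, HT on, at a measured 15% boost per core:
        System.out.println(effectiveCores(4, true, 0.15));   // 4.6
        // The same box with HT disabled:
        System.out.println(effectiveCores(4, false, 0.15));  // 4.0
    }
}
```

The point of the sketch: 8 logical cores on that box are worth roughly 4.6 cores of compute, so sizing a compute-heavy pool by the logical count alone oversubscribes it.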

I’ve been looking at hwloc. It has a simple C API (easily added to LWJGL), works on basically every OS and CPU out there and has great features. Among many others, you can query:

  • Total physical memory.
  • Number of cache levels, how big each level is, which cores share which level.
  • Number of processing units per core (hyper-threading).
  • Cache-line sizes and cache associativity.

It also provides a cross-platform API to pin threads/processes to CPU cores (affinity).

This might seem too compute-oriented or geared toward server workloads (e.g. hwloc exposes per-socket information, but we’ll never encounter multiple sockets on a gaming machine). I honestly stopped treating LWJGL as a pure gaming library when we added support for OpenCL, so I really think this will be useful functionality to have. Even for games, core information can be quite helpful if secondary threads are used (for physics, sound processing, asset loading, etc). You generally don’t want to mix two “compute-heavy” threads on the same hyper-threaded core; performance will suffer. Also, using a dedicated core for latency-sensitive work can be very beneficial.

Btw, I have not been able to find a library for CPU feature detection that is acceptably cross-platform/maintained/reliable. Only libcpuid comes close, but I’d like something for ARM CPUs as well. Does anyone know of a decent alternative?

[quote]You generally don’t want to mix two “compute-heavy” threads in the same hyper-threaded core, performance will suffer.
[/quote]
This has been our experience to the point where hyper threading is disabled on our clusters.

As for including this or other performance-related features: I can’t see any harm if it’s easy, doesn’t create unreasonable dependencies and fails somewhat gracefully.

Interesting. That seems to point toward supporting thread-affinities as being pretty desirable.

WRT: hwloc. My initial thought was overkill, but it seems to provide reasonable features and is maintained, so if the API is nice and easy and it is simple to integrate…why not? The risk appears to be low.

WRT: a libcpuid-like library which supports Intel & ARM. Seems like an unlikely animal…too few projects would need such a thing. If one exists, it might be harder to find than to roll your own. My brain isn’t pulling up any likely places to look.

WRT: compute + HT. This is just personal curiosity…is it the functional unit(s) conflict that causes the issue?

Some other random thoughts that are of lesser interest from my perspective (and greater risk & dev cost):

  1. Performance counter queries. (pretty arch specific…but that’s fine for the target audience)
  2. CPU local variables (this is like thread local but per CPU instead of thread).

I should note that some of these thoughts are based on:

  1. make LWJGL usable to a wider audience, which widens the pool of potential contributors.
  2. perceived quality. fancy features tend to make things more attractive to people for some strange reason.

While fancy new features are nice, they tend to only be useful for and used by a small niche. Therefore IMO they are probably better suited as extra extensions/utilities or as an addon library rather than part of the core.

IMO the more code and features there are, the more there is that can break and needs maintaining; plus it makes the library less flexible when it comes to things like porting to new platforms, compiling to native code, etc. Do less, but do it well.

From what I can tell, the vast majority of applications just use/need a robust windowing system + OpenGL (including ARB extensions) + OpenAL. The nice thing is that LWJGL3 is designed to be pretty modular (unlike LWJGL2), so it should be pretty easy to build custom versions using the build files.

The hyperthreading thing was in the context of simulations, i.e. fairly optimized code that is vectorized with little branching. In that case HT was overall slower than none. Since just about all code run on the high-performance clusters is like that, most have it disabled. We should note it wasn’t a huge difference, 10-20% or thereabouts.

This is not the case for the “general purpose” clusters. ie where a lot of python scripts and database intensive jobs are run.

For games and desktop apps I have no idea which would work out faster. I would suspect HT to win slightly or to be a tie, since there should be more waiting around for most processes.

I’m not seeing what causes the problem. You have 2 threads: M (for main) and H (for hyperthreaded). M can only issue and retire a fixed number of ops per clock and (as I understand it…unless there’s some balancing scheme of which I’m unaware) should always run at full speed. H jumps in and executes some ops when M isn’t using the execution unit/resource in question…so I can see how H might be starved, but not how total throughput is reduced.

Don’t anyone spend a moment on this…just wondering if anyone knows off the top of their head.

I’ve seen performance boosts of 20-30% in real code, although that code wasn’t very optimized in my case. On my i7, I got 4.9x scaling using 8 threads. Possible, but unlikely with well-written code. Just my 2 cents.