What I did today

There was one other issue with the “render a point-sprite quad and ray/box-intersect the resulting fragments” approach: numerical stability when cubes are far away. The intersection code from the paper uses a ray/plane intersection test and determines whether the point of intersection lies within any of the up to three sides of the cube visible to the ray.
When rendering many cubes close to one another, the intersection test sometimes determines that the top side of the cube was hit (giving a normal color of green in the images below) whereas the adjacent pixel is determined to have hit the front or left/right side of the cube (giving a red or blue color). This is effectively aliasing due to the low framebuffer/pixel resolution at far distances. But worse, sometimes the ray even went straight through the cube without a hit being detected for any side!
Although the effect is exaggerated by these high-contrast colors and diminishes when actually shading the cubes, the error became severely apparent when rendering depth values for occlusion culling. It is unacceptable for a ray to go through a cube and hit another cube a hundred meters behind it, because that higher maximum Z value will propagate through the Hi-Z mip chain and destroy any occlusion benefits (no occlusion will be detected for big occluded objects, which sample their depth values from a high mip level).
To counteract this, I now give each voxel a bitflag at creation time (stored as the w component of the voxel’s GLSL u16vec4 vertex attribute) saying which of its neighbors actually exist. This information is used in the ray/box intersection test to determine whether the ray could actually have hit that side of the box (it couldn’t if there is an adjacent box on that side). In addition, an epsilon value depending on the distance between camera and voxel center effectively gives a margin around each possible side of the cube, so that no ray goes through a voxel anymore and proper depth values are preserved.
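Here is a minimal CPU-side Java sketch of that corrected side test (the real version runs in the fragment shader); the face order, the neighborMask bit layout and the epsilon scale are illustrative assumptions, not the actual shader code:

public class RayBoxSideTest {

    /**
     * Returns the index of the hit face (+X,-X,+Y,-Y,+Z,-Z = 0..5), or -1 for no hit.
     * neighborMask has one bit per face telling whether an adjacent voxel exists there.
     */
    public static int hitFace(float[] ro, float[] rd, float[] center, float half,
                              int neighborMask, float distanceToCamera) {
        // margin grows with distance so that distant voxels cannot be missed "between" faces
        float eps = 1e-4f * distanceToCamera;
        for (int face = 0; face < 6; face++) {
            if ((neighborMask & (1 << face)) != 0)
                continue; // an adjacent voxel covers this side, the ray cannot enter here
            int axis = face / 2;
            float sign = (face % 2 == 0) ? 1.0f : -1.0f;
            if (Math.abs(rd[axis]) < 1e-12f || sign * rd[axis] > 0.0f)
                continue; // parallel to the face, or approaching it from behind
            float t = (center[axis] + sign * half - ro[axis]) / rd[axis];
            if (t < 0.0f)
                continue; // intersection lies behind the ray origin
            // check that the intersection point lies within the (slightly enlarged) face rectangle
            boolean inside = true;
            for (int a = 0; a < 3; a++) {
                if (a == axis) continue;
                float p = ro[a] + t * rd[a];
                if (p < center[a] - half - eps || p > center[a] + half + eps)
                    inside = false;
            }
            if (inside)
                return face;
        }
        return -1;
    }
}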
Image without and with epsilon and side-flag correction:

You are using the last frame’s depth information for your Hi-Z buffer, aren’t you?

No, there’s no temporal coherence or depth reprojection stuff in it right now. I’m building the Hi-Z buffer every frame from the nearest occluders. Figuring out the nearest visible occluders (using the BVH for frustum culling and to find the k-nearest neighbors around the player), rendering them and building the mip chain takes roughly 200µs, so right now it is absolutely okay for me to do it every frame.

Coded a BSP-based dungeon generator for Lethal Running:

Continued the voxel ray tracing experiment by trying stackless (rope-based) kd-tree traversal with a SAH split heuristic on it - I had had this around for a few months, just not used for voxels. It is waaay faster than the BVH with Morton code sorting, though I haven’t tried an “Extended Morton Code” algorithm, which alters the number of bits allocated to each dimension.
Both the BVH and the kd-tree traversal now produce exactly the same image, but where the BVH one takes 1902µs, the kd-tree one takes 1035µs. Apparently, the increased bandwidth for the additional six int16_t’s (the rope node indices) still costs less than the memory fetches saved by not having to backtrack to ancestor nodes anymore.
But the killer argument in favor of stackless (rope-based) traversal is: no need for a ray restart after the primary hit! Normally, whenever starting a new ray (such as a secondary ray) at the primary ray’s hit point, a full tree traversal has to be done first just to figure out in which node the ray actually starts. Storing the 32-bit stack of the primary ray in the BVH traversal is useless here, since those bits are specific to the path the ray took previously and won’t help avoid the ray restart.
This is not so with rope-based stackless traversal, where there is no stack and the algorithm can traverse the tree from any given node in any direction without starting at the root. My plan is therefore the following: rasterize the voxels with point-sprite quads (by far the fastest way to get primary ray hits) and, when rasterizing each voxel, write the index (a uint16_t) of the kd-tree node containing that voxel into the G-buffer. This information is available when building the kd-tree from the voxel list and is simply added as another vertex attribute of the voxel GL_POINTS. Then, when shooting a secondary ray to compute reflection/refraction for e.g. water, and also when sampling irradiance for each voxel, we read the node index as a flat attribute in the fragment shader (or as a voxel attribute in the compute shader) and start right at that node when shooting the secondary ray. No need to traverse down the tree first.
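To illustrate, here is a hedged CPU-side Java sketch of a rope-based traversal loop that can resume at an arbitrary node, e.g. the node index read back from the G-buffer. The node layout, the face order of the ropes and the leaf intersection helper are assumptions made for this sketch, not the actual implementation:

class RopeKdTraversalSketch {

    static final int NO_NODE = -1;

    static class Node {
        boolean leaf;
        int axis;                   // split axis of inner nodes (0, 1 or 2)
        float split;                // split plane position of inner nodes
        int left, right;            // child indices of inner nodes
        float[] min = new float[3], max = new float[3]; // node AABB
        int[] rope = new int[6];    // neighbor across each face (+X,-X,+Y,-Y,+Z,-Z), NO_NODE = scene exit
        int firstVoxel, voxelCount; // leaf payload
    }

    /** Traces a ray starting at 'startNode' (instead of the root); returns the hit voxel index or -1. */
    static int trace(Node[] nodes, int startNode, float[] ro, float[] rd, float tStart) {
        int current = startNode;
        float t = tStart; // ray parameter of the current entry point
        while (current != NO_NODE) {
            Node n = nodes[current];
            // descend to the leaf that contains the current entry point
            while (!n.leaf) {
                float p = ro[n.axis] + t * rd[n.axis];
                current = (p < n.split) ? n.left : n.right;
                n = nodes[current];
            }
            int hit = intersectLeafVoxels(n.firstVoxel, n.voxelCount, ro, rd);
            if (hit >= 0)
                return hit;
            // find the face through which the ray exits this leaf's AABB ...
            float tExit = Float.POSITIVE_INFINITY;
            int exitFace = -1;
            for (int a = 0; a < 3; a++) {
                if (rd[a] == 0.0f) continue;
                float bound = rd[a] > 0.0f ? n.max[a] : n.min[a];
                float tf = (bound - ro[a]) / rd[a];
                if (tf < tExit) {
                    tExit = tf;
                    exitFace = 2 * a + (rd[a] > 0.0f ? 0 : 1);
                }
            }
            if (exitFace < 0)
                return -1; // degenerate ray direction
            // ... and follow the rope attached to that face; no stack, no restart from the root
            t = tExit + 1e-5f; // small nudge to step across the boundary
            current = n.rope[exitFace];
        }
        return -1;
    }

    // placeholder for the actual ray/voxel intersection inside a leaf
    static int intersectLeafVoxels(int first, int count, float[] ro, float[] rd) {
        return -1;
    }
}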
Here are some images of the kd-tree structure and of my attempts to figure out the exit side of a ray in a cube:

EDIT: Some pipeline sketch:

Very nice! And that lets you rasterize voxels cheaply and evaluate shading only for visible fragments - supported by tracing secondary rays cheaply, because you already have the voxel index per fragment. Transparents are also not a problem despite deferred shading, because you can render transparent voxels like solid geometry and, in the shading pass, trace refraction rays against your voxel tree. This is what I did as well with my deferred voxel shading / voxel tracing approach, veeery nice :slight_smile: If I haven’t talked rubbish until now, I have a question: if you trace refraction rays for transparent fragments into your voxel tree, do you evaluate simple direct lighting where you hit?

Played around yesterday with prototypes of an arcade flight model.
(Added some simple drones to have something to shoot at)

WebGL demo “Kill”

And here is a Newtonian flight model using actual physics (way harder to control, but more interesting in a sense).
WebGL demo “Fly”

Seems you have a bug in your prime number code; you should edit line 30.

Got around to properly implementing directional irradiance caching with Hemispherical Harmonics (HSH), using 6 H-basis functions per voxel x side x vertex. The video shows the integration of HSH together with a Fresnel approximation to reduce the amount of reflection towards 0° incident angles (with respect to the face normal):

0P1KSgpxhZw

How is this useful? To have Fresnel-based reflections of the actual incoming directional irradiance. When extending this with chromatic/per-color-channel information, it can be used to implement reflections on glossy materials.
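The post does not say which Fresnel term is used; a common choice for this kind of angle-dependent weighting is Schlick’s approximation, sketched here in Java with an assumed base reflectance f0:

class FresnelSketch {
    /** Schlick's approximation: F = f0 + (1 - f0) * (1 - cosTheta)^5 */
    static float fresnelSchlick(float cosTheta, float f0) {
        // cosTheta = dot(faceNormal, viewDir), clamped to [0, 1]
        float c = Math.max(0.0f, Math.min(1.0f, cosTheta));
        float m = 1.0f - c;
        return f0 + (1.0f - f0) * m * m * m * m * m;
    }
}

At 0° incidence this yields just f0 (little reflection), rising towards full reflectance at grazing angles.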

There is currently only soft hemispherical lighting, and all lighting evaluation is currently done in the off-screen compute shaders which cache irradiance for all voxels. The rendering shader only interpolates the irradiances between the 4 vertices of the hit side.
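That interpolation is presumably a simple bilinear blend over the hit side; a tiny Java sketch (the names and the u/v parameterization are assumptions):

static float interpolateIrradiance(float i00, float i10, float i01, float i11, float u, float v) {
    // u, v = hit point's coordinates within the hit side in [0, 1]; iXY = cached vertex irradiances
    float bottom = i00 * (1.0f - u) + i10 * u;
    float top    = i01 * (1.0f - u) + i11 * u;
    return bottom * (1.0f - v) + top * v;
}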

No, my code is fine. It jumps back to line 30 (and resets X) only if it has found a proper divisor (so that A isn’t prime); it has to reset X in that case. Your code doesn’t work properly, because it doesn’t check all the numbers it should as potential divisors, so it prints out a lot of numbers as prime that actually aren’t.

Today I finished the 3D camera, now on to adding height to the map terrain.

On my way to doing hierarchical occlusion culling on the GPU using compute shaders, guided by the scene’s kd-tree.
Here is a video showing only hierarchical frustum culling:

FNuqvgHlrz8

(notice how the culled areas become larger the farther away the camera is from a particular node)
Also, the frustum used for culling is smaller than the camera’s frustum, to make the effect visible.
So, how does it work:

  • The CPU (Java, that is) initially assembles a list of kd-tree nodes at level N (where N is really small) very quickly and submits it to the compute shader via an SSBO
  • In every pass, the compute shader reads that list, performs culling, and writes the visible nodes into an output SSBO
  • Those two buffer bindings ping-pong between each other
  • An atomic counter buffer is used to record the actual number of written nodes in each step, which becomes the input for the next step (stream compaction with parallel prefix sums is absolute overkill for such a small number of nodes). I’ve also implemented warp-aggregated atomics in GLSL, only to see that the GLSL compiler seemingly does this automatically already (as is mentioned in the article for nvcc)
  • The last pass writes MultiDrawArraysIndirect structs for the voxels in the visible nodes to a separate SSBO (the SSBO containing all voxels is built in such a way that all voxels within any kd-tree node are contiguous, so it is easy to generate a MultiDrawArraysIndirect struct from a kd-tree node alone, since each node stores the index of the first voxel it or its descendants contain along with the number of voxels)
  • This last SSBO is then used as the draw indirect buffer for a glMultiDrawArraysIndirectCount() call, together with the atomic counter buffer holding the number of draw-call structs, which is bound as the indirect parameter buffer (a rough host-side sketch of this loop follows below)
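As a rough illustration of that loop, here is a hedged, LWJGL-style Java sketch; it is not the original code: program and buffer handles, binding points, workgroup size, the number of refinement passes and the counter read-back are all placeholder assumptions.

import static org.lwjgl.opengl.GL46.*;

public class HierarchicalCullingSketch {
    int cullProgram, drawGenProgram;   // compute programs (assumed to be compiled elsewhere)
    int[] nodeBuffers = new int[2];    // ping-pong SSBOs holding kd-tree node lists
    int atomicCounterBuffer;           // counts written nodes / draw structs per pass
    int indirectBuffer;                // receives the MultiDrawArraysIndirect structs
    int voxelVao;                      // VAO over the big contiguous voxel buffer
    int refinementPasses = 4;          // how often the node list is culled/refined

    void cullAndDraw(int initialNodeCount) {
        int count = initialNodeCount;
        int in = 0, out = 1;
        glUseProgram(cullProgram);
        for (int pass = 0; pass < refinementPasses; pass++) {
            resetCounter();
            glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, nodeBuffers[in]);
            glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, nodeBuffers[out]);
            glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 2, atomicCounterBuffer);
            glDispatchCompute((count + 63) / 64, 1, 1); // assumes local_size_x = 64
            glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT | GL_ATOMIC_COUNTER_BARRIER_BIT
                    | GL_BUFFER_UPDATE_BARRIER_BIT);
            // the real code may obtain the new count differently; a plain read-back keeps the sketch simple
            count = readCounter();
            int tmp = in; in = out; out = tmp; // ping-pong the node buffers
        }
        // final pass: emit one MultiDrawArraysIndirect struct per visible node
        resetCounter();
        glUseProgram(drawGenProgram);
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, nodeBuffers[in]);
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, indirectBuffer);
        glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 2, atomicCounterBuffer);
        glDispatchCompute((count + 63) / 64, 1, 1);
        glMemoryBarrier(GL_COMMAND_BARRIER_BIT);
        // draw: indirect structs from 'indirectBuffer', draw count from the atomic counter buffer
        glBindVertexArray(voxelVao);
        glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
        glBindBuffer(GL_PARAMETER_BUFFER, atomicCounterBuffer);
        glMultiDrawArraysIndirectCount(GL_POINTS, 0L, 0L, count, 0);
    }

    void resetCounter() {
        glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, atomicCounterBuffer);
        glBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0L, new int[] { 0 });
    }

    int readCounter() {
        glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, atomicCounterBuffer);
        int[] value = new int[1];
        glGetBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0L, value);
        return value[0];
    }
}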

Here is the compute shader doing the culling:


and here is some portion of the host code driving the culling:

Next up is what will bring the most benefit for this highly fragment-shader- and ROP-bound rendering: combining Hi-Z occlusion culling with the hierarchical frustum culling. This means that Hi-Z occlusion culling will also be done hierarchically, starting with a coarse kd-tree node level and refining nodes only when they are visible. The reason why I am doing frustum culling on the GPU is: Hi-Z culling has* to be done on the GPU, and doing it hierarchically through the kd-tree means fewer nodes need to be tested.

*that’s not entirely true, since there are games out there using a software rasterizer to cull on the CPU

Today was an interesting day, as I witnessed how two parts of the rendering pipeline competed for being the major bottleneck when applying two slightly different voxel rendering techniques.
The first technique was rendering point sprite quads covering the screen-space projection of the voxel, as presented by: http://www.jcgt.org/published/0007/03/04/
The second technique I came up with was to use the geometry shader to compute and generate the convex hull of the projected voxel with the help of this nice paper: https://pdfs.semanticscholar.org/1f59/8266e387cf367702d16acf5a4e02cc72cb99.pdf
While the first technique produces a very low load on vertex transform/primitive assembly, it suffers from many additional fragments being generated for close voxels, where the quad enclosing the screen-space projected voxel contains a large margin/oversize to a) still make it a quad and b) cover the voxel entirely. This produces a higher load on fragment operations (fragment shader doing the final ray/AABB intersection and likely more importantly the ROPs reading/comparing/writing depth and writing color).
Now my idea was to reduce fragment operation costs by reducing the amount of excess fragments the quad produces, by not making it a quad anymore but a perfectly fit convex hull comprising either 4 or 6 vertices.
Having heard many bad stories about how geometry shaders perform, I still gave it a try, and I was positively surprised by an improvement of roughly 21% in total frame time when generating the fragments with the convex hull for close voxels.
Here, the cost of fragment operations was reduced to the point where it was no longer the bottleneck; instead, vertex operations (passing the GL_POINTS to the geometry shader and emitting a single triangle strip of either 4 or 6 vertices there) now were. One could literally see how, for moderately far away voxels where the screen-space quad has little oversize/margin, the quad rendering solution overtook the convex hull geometry shader solution.
The latter, however, is ideal for close voxels. So it’s going to be a hybrid solution in the end.
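One simple way to realize such a hybrid is to route each voxel into one of two batches depending on its distance to the camera; the threshold, the Voxel type and the use of JOML here are assumptions for illustration, not the actual plan:

import java.util.List;
import org.joml.Vector3f;

class HybridVoxelBatchingSketch {

    static class Voxel {
        Vector3f center = new Vector3f();
    }

    /** Close voxels go to the geometry-shader convex-hull path, far ones to the point-sprite quad path. */
    static void partition(List<Voxel> voxels, Vector3f cameraPos, float hullDistance,
                          List<Voxel> quadBatch, List<Voxel> hullBatch) {
        for (Voxel v : voxels) {
            if (v.center.distance(cameraPos) < hullDistance) {
                hullBatch.add(v);  // large on screen: excess quad fragments would dominate
            } else {
                quadBatch.add(v);  // small on screen: cheap point sprite, little oversize
            }
        }
    }
}

The threshold would have to be tuned so that the switch-over matches the point where the quad path overtakes the hull path.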
Here are some images and a video showing the overlap/margin of the point sprite quad rendering and the (missing) overlap of the convex hull rendering:

7TFKwAUZ0qE

Convex hull generation geometry shader: https://gist.github.com/httpdigest/fa7e071b87ca24b0a54fe0821e3c1a60

Not much to show, but I have been modifying the core of the game to integrate YAML configuration instead of hardcoded classes. I managed to reduce the code base by ~1.8k LoC (the solution looks much cleaner now), at the cost of ~500 lines of YAML. The YAML configuration is used as the base when something is set up; for example, a unit is initialized from the following configuration:

name: "archer"
attackDamageMax: 25.0
attackDamageMin: 15.0
attackRangeMax: 30.0
attackRangeMin: 10.0
attackRangeStdDev: 2.0
attackSpeed: 4.0
collidables:
  - !circle { x: 0, y: 0.725, radius: 0.75 }
defense: 0.0
fireableProjectiles:
  - ARROW
lifePoints: 100.0
movementSpeed: 2.0
operators: []
plunderCapacity: 20
price: 50
recruitable: true
recruitmentCooldown: 15.0
stamina: 100.0

Besides being cleaner, this prepares the ground for reloading the configuration while the game is running in debug mode, which in my opinion is super useful when, for example, tuning object hitboxes and balancing the game. I plan to make available both the YAML examples and the console-like widget I have built for the debug mode of my application.
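The post does not say which YAML library is used; as an illustration, here is a hedged sketch of loading such a unit definition with SnakeYAML (1.x-style API). The class names, the subset of mirrored fields and the file handling are placeholders:

import java.io.FileReader;
import java.io.Reader;
import java.util.List;
import org.yaml.snakeyaml.TypeDescription;
import org.yaml.snakeyaml.Yaml;
import org.yaml.snakeyaml.constructor.Constructor;

public class UnitConfigLoader {

    /** Bean mapped from the !circle tag. */
    public static class Circle {
        private double x, y, radius;
        public void setX(double x) { this.x = x; }
        public void setY(double y) { this.y = y; }
        public void setRadius(double radius) { this.radius = radius; }
    }

    /** Only a subset of the unit keys is mirrored here. */
    public static class UnitConfig {
        private String name;
        private double attackDamageMin, attackDamageMax, movementSpeed;
        private List<Circle> collidables;
        public void setName(String name) { this.name = name; }
        public void setAttackDamageMin(double v) { this.attackDamageMin = v; }
        public void setAttackDamageMax(double v) { this.attackDamageMax = v; }
        public void setMovementSpeed(double v) { this.movementSpeed = v; }
        public void setCollidables(List<Circle> collidables) { this.collidables = collidables; }
    }

    public static UnitConfig load(String path) throws Exception {
        Constructor ctor = new Constructor(UnitConfig.class);
        ctor.addTypeDescription(new TypeDescription(Circle.class, "!circle")); // map the custom tag
        ctor.getPropertyUtils().setSkipMissingProperties(true); // keys not mirrored above are skipped
        Yaml yaml = new Yaml(ctor);
        try (Reader in = new FileReader(path)) {
            return yaml.load(in);
        }
    }
}

A debug-mode reload would then simply call load() again on the same (placeholder) path and swap the resulting object in.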

Created a dynamic door / key event system, which basically allows me to alter tiled maps at runtime.
For instance, I can change blocked values from true to false when the player has a certain item.

The class responsible contains methods such as:

public static void setDoorPropertyToBlocked(GameMap map, String id) {
    int max = map.getTiledMap().getTileSetCount();

    for (int i = 0; i < max; i++) {
        int firstgid = map.getTiledMap().getTileSet(i).firstGID;
        int lastgid = map.getTiledMap().getTileSet(i).lastGID;

        for (int a = firstgid; a < lastgid; a++) {
            Properties properties = map.getTiledMap().getTileSet(i).getProperties(a);

            if (properties != null && properties.getProperty("door") != null
                    && properties.getProperty("door").equals(id)) {
                properties.setProperty("blocked", "true");
            }
        }
    }

    reloadDoorEvents(map);
}

This method basically changes the blocked value of a certain door event (matched by an id) from unblocked to blocked.
The reloadDoorEvents method then parses the map again and applies the changes:

private static void reloadDoorEvents(GameMap map) {
    for (int xAxis = 0; xAxis < map.getTiledMap().getWidth(); xAxis++) {
        for (int yAxis = 0; yAxis < map.getTiledMap().getHeight(); yAxis++) {

            int tileID = map.getTiledMap().getTileId(xAxis, yAxis, 0);
            String doorBlocked = map.getTiledMap().getTileProperty(tileID, "blocked", "false");
            String door = map.getTiledMap().getTileProperty(tileID, "door", "false");

            if (!door.equals("false")) {
                map.getBlocked()[xAxis][yAxis] = Boolean.parseBoolean(doorBlocked);

                if (map.getBlocked()[xAxis][yAxis]) {
                    // door is now blocked: add its collision rectangle
                    Rectangle temp = new Rectangle(xAxis * 32, yAxis * 32, 32, 32);
                    map.getBlockedList().add(temp);
                } else {
                    // door is now open: remove its collision rectangle
                    Iterator<Shape> iterator = map.getBlockedList().iterator();
                    while (iterator.hasNext()) {
                        Shape shape = iterator.next();
                        if (shape.getX() / 32 == xAxis && shape.getY() / 32 == yAxis) {
                            iterator.remove();
                        }
                    }
                }
            }
        }
    }
}

This method gets invoked by something I call a key event, which activates as soon as the player’s hitbox touches it:

@Override
public void onActivate() {
    if (InventoryController.getInventory().checkIfItemWithNameInInventory("Dungeon Key")) {
        Item toRemove = InventoryController.getInventory().getItemByNameInInventory("Dungeon Key");
        InventoryController.getInventory().removeFromInventory(toRemove);
        TiledMapModificator.setDoorPropertyToUnblocked(getCurrentMap(), doorID);
        onActionSound.play();
        PopupMessageController.activatePopupMessage("You used a key.", this);
        this.setActive(false);
    } else {
        PopupMessageController.activatePopupMessage("I need a key for that.", this);
    }
}

In the editor it then looks like this:

The blue one represents the door and the purple one the key event.

Got an idea today for how to efficiently determine the set of actually visible voxels, in order to gather/cache irradiance only for those visible voxels. I will write a detailed description later. Here is a video highlighting the visible voxels captured at certain points:

0xMQYkvWGJc

Super nice debugging tool, Kai :smiley: That’s something I always skip because of time pressure. Of course, I pay the price for this later most of the time.

Thanks. Visual debugging is definitely important, and I was just tired of trying to analyze glGetBufferSubData() numbers.
As for the time to do it: up till now this has not become a strong factor for me. I don’t have any abstractions in the code, nor do I factor out common code or parameterize it in any way. Always copy/pasting from a previous demo and applying slight modifications is just so much faster. Also, I try to keep the programs single-file and as small as possible to quickly iterate on different ideas; only the shader sources need additional files. This is why I am really looking forward to Java 12 becoming a thing with (multiline) raw string literals, so that a whole demo/example can indeed be one file (without having to resort to multiple concatenated “blabla\n” strings, of course).

There are only very few cases where the right abstraction doesn’t give you more development speed. Finding the right abstraction that doesn’t make small changes harder is a different topic, though. I used to think just like you, some time ago in my own engine, before I took up the discipline to do some “proper programming” instead of another bunch of hacks. It was so worth it. The more often one does it, the better one gets at it, and the more one realizes how much it is worth :slight_smile:

I have a system that uses a file listener for auto-reloading and runtime-compiling shader code - how would you get this when you have all your shaders as static strings somewhere in your classes? Or don’t you need runtime recompilation on change?

Yes, once you arrive at a point where you know which parts will change and which parts won’t, it is beneficial to build abstractions. But I certainly haven’t reached that point yet; any new technique I discover might make me throw away most of the code anyway. It is only after you decide on a certain “architecture” (rendering pipeline, capabilities, flexibility/pluggability, etc.) for your engine that you can start organizing your code in such a way that changes to moving parts become easy, while changes to static parts (which you anticipate will not change often) may become harder.
I’ve just recently used a very hacky Java Preferences approach for storing and loading presets of window size/position and camera position/orientation, to allow for performance comparisons between code changes.
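For reference, such a preset store can be built on java.util.prefs.Preferences along these lines; the node and key names here are made up, not the ones actually used:

import java.util.prefs.Preferences;

public class PresetStore {
    // per-user preference node; keys below are illustrative placeholders
    private final Preferences prefs = Preferences.userNodeForPackage(PresetStore.class);

    public void save(int width, int height, float camX, float camY, float camZ) {
        prefs.putInt("windowWidth", width);
        prefs.putInt("windowHeight", height);
        prefs.putFloat("camX", camX);
        prefs.putFloat("camY", camY);
        prefs.putFloat("camZ", camZ);
    }

    // defaults are returned if no preset has been stored yet
    public int width()  { return prefs.getInt("windowWidth", 1280); }
    public int height() { return prefs.getInt("windowHeight", 720); }
    public float camX() { return prefs.getFloat("camX", 0f); }
    public float camY() { return prefs.getFloat("camY", 0f); }
    public float camZ() { return prefs.getFloat("camZ", 0f); }
}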