Tasks are assumed to be short-lived - how short?

Tasks are assumed to be short-lived. The documentation says they might be aborted and re-queued if they execute for too long. My question is: how short is short-lived? Is there a way to find out in advance when a task is going to be aborted, so that I can set up a child task and stop executing before the abort happens? A task that runs too long and is re-queued may fail to complete the next time it runs, and because the execution order of tasks is guaranteed on a per-user basis, this could become a problem. So it would be good to know where the deadline is.

Ragosch

I believe a task that is aborted for running too long is discarded, not re-queued.

However, the watchdog timer actually isn't in the EA1 release, so for the moment you don't need to worry about it. In the long run I expect it will be settable, but we need to walk through the logic on that… (At the moment I'm leaning towards a kind of handshake: the developer sets the timeout he would like in his deployment descriptor, but the admin sets a maximum timeout allowed. If the requested timeout is larger than the maximum allowed, installation fails with an error… but this is all still nascent thinking.)
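To make the handshake idea concrete, here is a sketch of the check (names made up on the spot; nothing here is a committed API):

```java
// Made-up names only -- a sketch of the handshake, not a committed API.
public final class TimeoutHandshake {

    public static class InstallationException extends Exception {
        public InstallationException(String message) { super(message); }
    }

    /**
     * @param requestedMillis  timeout asked for in the deployment descriptor
     * @param adminMaxMillis   maximum the administrator allows
     * @return the timeout the task watchdog would actually enforce
     */
    public static long negotiate(long requestedMillis, long adminMaxMillis)
            throws InstallationException {
        if (requestedMillis > adminMaxMillis) {
            // Refuse to install rather than silently clamping the value.
            throw new InstallationException("Requested timeout "
                    + requestedMillis + " ms exceeds the administrative maximum of "
                    + adminMaxMillis + " ms");
        }
        return requestedMillis;
    }
}
```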

Handshake sounds good to me. Any idea how long it might be? I mean, are we talking about hundreds of milliseconds here, or just a few of them?

Ragosch

I imagine the team is going to need to do some profiling and benchmarking before they can really start committing to specifics like that. Even then, until they have a few “real” games running in the wild, there will be some tweaking. I think it’s safe to say they will probably err on the side of caution, though, and set the caps closer to the upper expected limit rather than risk a lot of complaints early on about aborted tasks.

Then again, maybe they’ll just use it as a Survival of the Fittest challenge where everyone must finely tune their apps before they are worthy of wearing the DS label. :wink:

Nothing against that; I just want to know whether I need to tune for, let's say, 200 ms, or 5 ms, or maybe up to 1 second.

That makes a big difference in how processes and game logic get broken down into GLOs, and until I know such simple facts I cannot really work on the design of those GLOs. I really don't want to waste my time on things I may have to throw away in a few weeks or months. Sure, they can't provide exact numbers yet; no one demands that - just a statement giving an idea of what the developers have in mind for these numbers.

Ragosch

For practical reasons, the shorter your task the better: the fewer objects you lock, the less likely you are to run into an abort.

However, the real issue is “what needs to be atomic”, because the unit of atomicity is a task.

So I'd say: as short as possible while preserving the necessary level of guaranteed atomicity.

OK, then let me be more concrete. I am currently trying to figure out how to store our really huge terrain efficiently: either inside the SGS, using a set of (terrain chunk descriptor, terrain mesh/texture info, terrain decoration info) per chunk, which would add up to about 1.05 million GLOs, or storing only the terrain chunk descriptors in the SGS and using an external database on a different server for the rest.

If I need to extract the requested data out of several GLOs and send it back to the client, I need to know whether I have enough time to do this in a single task per terrain chunk. That is the atomic operation here: our terrain is terraformable, and information that is not transaction-safe could lead to visible cracks in the mesh, flying objects, or objects stuck in the ground.
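To make the shape of that task concrete, here is a rough sketch; the GLO reference and channel types are stand-ins, since I don't know what the SGS will actually provide:

```java
import java.io.Serializable;

// Stand-in interfaces -- not the real SGS API, just the shape of the idea.
interface GLORef<T extends Serializable> { T get(); }   // locking GET
interface ClientChannel { void send(byte[] data); }

class ChunkDescriptor implements Serializable { /* chunk metadata */ }
class MeshInfo        implements Serializable { byte[] payload; }
class DecorationInfo  implements Serializable { byte[] payload; }

class SendChunkTask {
    // All three GETs happen inside ONE task, so the client receives either a
    // consistent (descriptor, mesh, decoration) triple or nothing at all --
    // no cracks in the mesh, no floating or buried objects.
    void run(GLORef<ChunkDescriptor> desc, GLORef<MeshInfo> mesh,
             GLORef<DecorationInfo> deco, ClientChannel client) {
        ChunkDescriptor d = desc.get();
        MeshInfo m = mesh.get();
        DecorationInfo x = deco.get();
        client.send(encode(d, m, x));
    }

    private byte[] encode(ChunkDescriptor d, MeshInfo m, DecorationInfo x) {
        // Serialization format is game-specific; omitted here.
        return new byte[0];
    }
}
```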

But again, if I am more or less forced to use the out-of-SGS solution and only operate on terrain chunk descriptors inside the SGS - which would require communication between the SGS and the external database - then the question arises: why should I use the SGS at all, if the answer to every vital question for our game so far is “use an external database instead”?

So far we would need to externalize the economic database and the terrain database. Now to the next point: calculating the flow network for our terrain. Currently this is organized as a huge flow network graph with several “roots”, which is traversed down one level per idle cycle of the server, with the results stored in the associated terrain decoration info. As it is now, we need about 3 to 5 seconds to recalculate the whole flow network using only the server's idle cycles (not real-time, but good enough to render realistic rivers, flooding, and dry periods on the terrain). I wonder what the answer to these requirements will be … again “use an external solution for this”?
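For illustration, the per-cycle step is essentially this (plain Java with toy node types; the real flow rule is of course more involved):

```java
import java.util.ArrayList;
import java.util.List;

// Toy version of the idle-cycle traversal: each call processes exactly one
// level of the flow network and remembers the frontier for the next cycle.
class FlowNode {
    double flow;
    List<FlowNode> children = new ArrayList<FlowNode>();
}

class FlowNetworkWalker {
    private List<FlowNode> frontier;          // nodes of the current level

    FlowNetworkWalker(List<FlowNode> roots) { frontier = roots; }

    /** Called once per idle cycle; returns false when the pass is complete. */
    boolean stepOneLevel() {
        List<FlowNode> next = new ArrayList<FlowNode>();
        for (FlowNode node : frontier) {
            for (FlowNode child : node.children) {
                child.flow += node.flow / node.children.size(); // toy flow rule
                next.add(child);
            }
        }
        frontier = next;
        return !frontier.isEmpty();
    }
}
```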

I am really starting to think that the SGS might be the wrong technology for our game. I can imagine a lot of games that could run on the SGS, but I seriously doubt we can implement our requirements given its current abilities. That is why I am asking for more precise information about task timing. How much will I be able to do in a task? How many GLOs will I be able to access with GET before a task might be discarded, assuming each is only a few kB?

These are the things I am struggling with … any ideas or comments?

Ragosch

So my answer would be this:

Use the internal GLO storage. Where it makes sense, break your terrain up into separate GLOs (maybe in a quadtree?) so you aren't having to load and deserialize the entire world for every operation. Finally, if you are still finding that it takes too long to load, then there is something we need to fix in the SGS.
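Something like the following sketch is what I mean; GLORef stands in for a GLO reference, and the point is that a task only GETs the handful of nodes on one root-to-leaf path, never the whole world:

```java
import java.io.Serializable;

// Sketch only: GLORef stands in for whatever reference type the SGS provides.
interface GLORef<T extends Serializable> { T get(); }

class TerrainChunk implements Serializable {
    byte[] meshData;          // a few kB per chunk, per the numbers above
    byte[] decorationData;
}

class QuadNode implements Serializable {
    // Child references are lazy: GETting this node does not lock the children.
    GLORef<QuadNode> nw, ne, sw, se;
    GLORef<TerrainChunk> chunk;   // non-null only at leaf nodes

    /** Walk root-to-leaf; locks O(depth) GLOs instead of the whole terrain. */
    static TerrainChunk lookup(GLORef<QuadNode> root,
                               double x, double z,      // query point
                               double cx, double cz,    // center of root cell
                               double half) {           // half-width of root cell
        QuadNode node = root.get();
        while (node.chunk == null) {
            GLORef<QuadNode> next = (x < cx) ? (z < cz ? node.sw : node.nw)
                                             : (z < cz ? node.se : node.ne);
            cx += (x < cx ? -half : half) / 2;
            cz += (z < cz ? -half : half) / 2;
            half /= 2;
            node = next.get();
        }
        return node.chunk.get();
    }
}
```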

If you are having to go outside the SGS, then I agree the model is broken and needs fixing. 8)

“Maybe in a quadtree”? I am disappointed - do you think I am a complete 3D noob?

We use a modification of the terrain rendering algorithms presented in this terrain.pdf document. Terrain chunks are requested from the server only if the local database does not yet have a valid copy, i.e. one that is neither expired nor invalidated. We also use that local database to provide a special feature we call “3D automapping”: players can access their memories and “walk” through the terrain as it was when they were last there (we simply stop updating the local database while in remembrance mode). In this mode there are no moving objects, and we use only ambient lighting to make the scene look somewhat unreal (because it is just a retrospective, not reality).

This quadtree-based terrain is overlaid with an octree for all the objects. Of course we use those data structures to make fast software culling possible. Local structures that are combined objects are basically represented by scene-graph branches, currently stored in the database using a modified subset of X3D. The quadtree and octree are client-side representations, while the scene-graph branches are used both client- and server-side.

The terrain's ground structure is an 8,193x8,193 heightmap where each tile is enhanced with a JIT-generated 33x33 overlay heightmap (which need not be stored in any database; 8,192 tiles x 32 overlay intervals x 0.625 m works out to 163.84 km per side). Thanks to the algorithm used, a terrain chunk is very memory-efficient, can be rendered very fast at different LOD levels (with an asynchronous loading process), and features detail enhancement up close. This is a quite fast and sophisticated method of rendering terrain: the grid distance between vertices is 62.5 cm, the map spans 163.84x163.84 km², and 99.9% of the terrain detail is JIT-generated only when needed.

I guess this is sophisticated enough, isn't it? Unfortunately, we need to store terrain and object information in a way that the game logic can access easily, because our terrain is not static but terraformable (much in the way wurmonline.com has implemented this feature). Some alterations of the terrain may cause a larger part of a quadtree branch to be recalculated (basically, a new screen pixel error is applied to some nodes of the branch). That is one of the places where I ask myself whether we could do this recalculation in a single task or would need to break it into several. It is also one of the reasons I keep asking for a more precise answer to “how long may a task run before it is aborted”.
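For illustration, the recalculation itself is cheap per node; the open question is only how many nodes one task may touch. Roughly, with made-up types:

```java
// Illustration with made-up types: after terraforming a leaf, the screen
// pixel error bound is recomputed bottom-up along one branch, so the work
// per update is proportional to the tree depth, not the tree size.
class ErrorNode {
    ErrorNode parent;
    ErrorNode[] children;     // null at leaf nodes
    double pixelError;        // conservative error bound for this subtree

    static void propagateUp(ErrorNode dirtyLeaf, double newLeafError) {
        dirtyLeaf.pixelError = newLeafError;
        for (ErrorNode n = dirtyLeaf.parent; n != null; n = n.parent) {
            double worst = 0.0;
            for (ErrorNode child : n.children) {
                worst = Math.max(worst, child.pixelError);
            }
            n.pixelError = worst;  // a parent is as wrong as its worst child
        }
    }
}
```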

Ragosch

This is all interesting stuff. To be honest, it's data like this that we actually need to tune the system. We haven't had a lot yet; we're just starting relationships with some folks doing actual projects who can give us that data. (I'm going to forward yours to the team, as it's actually very useful for us - thanks.)

The restriction on task length really exists for two reasons:
(a) To prevent a lock from being held on an object for an inordinate amount of time.
(b) To prevent endlessly looping tasks which never release their locks.

Now, for (a), it is looking like we might be able to relax that constraint somewhat with some optimistic-computing work we've been getting from another research group. If we can get tasks to run in parallel all the way to commit time even when there are potential lock conflicts, it may more or less eliminate the issues in (a).

(b) will remain an issue, but without (a) it could be a VERY long timeout (on the order of minutes, perhaps).
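For what it's worth, the (b) case is conceptually just a watchdog like the toy sketch below; actually killing the task is the hard part (more on that further down):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Toy watchdog for case (b): give the task a wall-clock budget and abandon it
// on overrun. Note cancel(true) only *interrupts* -- Java offers no safe way
// to forcibly stop a thread, which is exactly the problem discussed below.
class TaskWatchdog {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    /** @return true if the task finished within its budget. */
    boolean run(Runnable task, long budgetMillis) {
        Future<?> future = pool.submit(task);
        try {
            future.get(budgetMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException overrun) {
            future.cancel(true);   // best effort; a looping task may ignore it
            return false;          // task is discarded, not re-queued
        } catch (Exception other) {
            return false;
        }
    }
}
```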

I rather suspect that ultimately that number will have to be tuned a bit on a per-app basis. In any event, everything you are doing sounds eminently reasonable for the server to be doing, so if it IS taking too long in a task, I'd say it's incumbent on us to find a way to allow that task to complete.

This is a good example of why I call the APIs “90% done.” Features may need to be added to support various real-world issues as they arise. It's possible we may need to add a way of marking a “long task”, though I'm hoping it doesn't come to that.

I was not thinking of tasks running for minutes, or even for seconds. I would be totally satisfied with a solution that gives me about 100 ms of pure execution time (i.e. not counting the time to GET and save GLOs).

Of course most tasks should be much shorter. Longer execution is only needed for those special tasks that have to update a larger structure in a transactionally safe way (like the recalculation of the screen pixel error metric in the quadtree nodes I mentioned in my last post). I agree that marking those tasks would be good, giving the SGS the ability to grant them some extra execution time. That extra time is cheaper than a child task, which would need to load the GLOs again before it could continue (not very efficient), not to mention that a child task breaks transactional safety where a marked “long task” would not. I am starting to like your idea of marking tasks.
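Purely as a thought experiment, the marking could be as simple as an annotation that the scheduler inspects; nothing like this exists in the SGS today:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Thought experiment only -- no such annotation exists in the SGS.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface LongTask {
    /** Extra pure execution time requested, in milliseconds. */
    long extraMillis() default 100;
}

// A structural update like the pixel-error recalculation could then opt in:
@LongTask(extraMillis = 100)
class RecalculateErrorMetricTask implements Runnable {
    public void run() {
        // GET the affected quadtree nodes, recompute the error metric,
        // and commit -- all within the one extended transaction.
    }
}
```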

Ragosch

Case (b) is a bug issue, and it can afford a very long timeout, because there is no real use in dealing with such bugged tasks over and over again. A long timeout causes lag big enough to force the staff to work on the buggy tasks (complaining players notice such lag very quickly), whereas a too-short timeout would abort those tasks so early that the team might never notice them, because they cause no visible trouble.

When calculating the timeout you should think of it from the programmer's point of view. He has no control over the loading of GLOs and does not know how long accessing a GLO may take in practice, but he can estimate the time needed to do the task itself, not counting access time. So maybe your timeout mechanism should take only pure execution time into account; then you would be able to give a precise hint on this issue.

Ragosch

That would definitely be preferable. There are some open issues about how we get at data like pure execution time, but if we had to, we could escape below Java to talk to the OS. We may need to do that anyway, because it turns out Java has no (safe) way to stop a thread from outside its own context. MVM has some controls, but it is unclear when that will be available. :confused:
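One avenue for the measurement question that stays inside Java: since J2SE 5.0 the management API exposes per-thread CPU time, which comes fairly close to “pure execution time” (a thread that is blocked accumulates none). A minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Per-thread CPU time via the standard management API (J2SE 5.0+). Time spent
// blocked -- waiting on locks or GLO loads -- accumulates no CPU time, so this
// approximates the "pure execution time" notion above.
final class TaskClock {
    private static final ThreadMXBean MX = ManagementFactory.getThreadMXBean();

    /** CPU nanoseconds used so far by the calling thread, or -1 if unsupported. */
    static long cpuNanos() {
        return MX.isCurrentThreadCpuTimeSupported()
                ? MX.getCurrentThreadCpuTime() : -1L;
    }

    public static void main(String[] args) {
        long start = cpuNanos();
        // ... task body would run here ...
        System.out.println("pure execution time: "
                + (cpuNanos() - start) / 1000000 + " ms");
    }
}
```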

Hm, I think you can forget about the MVM in the context of J2SE. From what I have read, it is planned to bring the MVM to J2EE and J2ME, but AFAIK no decision has been made for J2SE yet … though my information is about a year old; maybe you have something newer.

Ragosch

The SGS has a lot of similarities to J2EE. If the MVM ever reaches J2EE, I don't think it will be any problem to adapt it to other container-based technologies.

J2EE actually implies J2SE, as J2EE has no VM technology as part of its definition: it builds on top of J2SE.