Inquiries about Darkstar

Hey,

First, perhaps some of the stickies could be combined into one larger FAQ sticky? :slight_smile: Over half the first page is static topics…

Hello! My nick is ‘dormando’, and I work for a large website which makes interactive Flash games as part of its business model :slight_smile: Our primary platform is PHP, but the game server platform is Java. We’ve evaluated a large number of Java ‘game’ server products, and all have been pretty darn miserable. We even tried developing one in-house, but that did not pan out as expected.

I’ve been playing with darkstar for a few days. Congrats on pushing this interesting piece of software from within Sun! I do have a number of concerns and questions though.

  1. Feel free to tell me to FOAD and show up at JavaOne, since I plan on being at the darkstar sessions there.

  2. Is there any measure, or public idea at all, of just how far along darkstar is development-wise? The site is very threadbare, and the most detail I’ve seen out of the project was in an interview on LWN. None of the features discussed in the LWN interview are actually in the current 0.9.1 release.

  3. It’s very interesting how the development model of Darkstar appears to be based around the threaded Software Transactional Memory ideal of atomicity. While this can make development easier, it’s not exactly proven yet. Under some circumstances it might be more efficient to declare task locks or other similar constructs, versus frequently tripping an STM-style transaction rollback. Would this be possible with darkstar?

  4. The way the storage system works tells me I’ll need to re-implement the whole thing myself, for a few reasons:

    • How would one do reporting on stored objects? The current options for migration also appear painful.
    • Our production servers don’t have local hard drives. Storage needs to use a remote system. I have hundreds of servers, and ditching hard drives makes my life easier :slight_smile:
    • We’ll probably want to integrate a memcached cluster for higher speed distributed access to objects on read-only tasks.
  5. The site and interview mention being able to ‘magically’ start and stop game applications within a cluster. Since this is not a feature in 0.9.1, can we expect it to be usable after JavaOne? Will we be able to rely on the cluster not collapsing if someone uploads a faulty set of Java classes? (This is a commonly claimed feature, but never seems to actually work!)

  6. I’m trying my darndest to not launch into implementation-specific questions! This isn’t a question, but a statement replacing one! Several!

  7. Another issue with the STM-task oriented development model fitting into ours is the need to make many external calls. While a lot of work can be done by ManagedObjects, the games augment the rest of the site, meaning we need to talk back to our PHP datastores to persist updates to characters and to load information related to them. Doing something like this from within a task poses two risks:

    • Blocking IO!
    • Since the STM machine can demolish a task at any point, a series of external calls could be aborted in the middle. Or a change could happen on the external end, and then the task gets aborted before the local-side ManagedObject gets updated. We would have to create a whole new system on our side to deal with that. Is there a better way to handle any of this? A service or somesuch?
  8. The task-based development model bugs me a little. It’s too close to the awful way PHP handles requests: load all your info, set up your environment, handle the request, then throw everything away. This makes embedding a scripting engine perform unusably, since the interpreters aren’t Serializable (and even if they were, it’d be a TON of data). It’d be interesting to have a way to ‘cache’ certain things local to the server, so you only need to do heavyweight initialization once (or a few times) per game server, then load your scripts per-task. I’m left wondering how performance is affected in a lot of other areas as well…

  9. Async IO support! You have a highly threaded task-based model, but no way to do async IO of any kind from within tasks, in any way? :slight_smile:

  10. Crosstalking. Channels and resource migration all sound great until you get past a few dozen servers for one game instance, or you need to link together multiple datacenters. What can be done, if anything, to direct traffic a little and avoid a mesh-network scale-out issue, where all nodes get chatter from all other nodes and ultimately reduce effectiveness as you scale?

Sorry for the long post. I’m very interested in seeing where Darkstar goes, and plan on showing up to JavaOne to see exactly what it has to offer.

Hi,

So let me see what I can answer for you. Definitely come to the j1 sessions too…

(2) What is up on the site today is the SDK version. The goal of the SDK version was to nail down the APIs and coding model so that developers could begin to code.

One of the costs of going open source is that we’ve had to put a fair bit of time into the existing code in order to get it ready for public consumption, as opposed to blazing ahead with the more interesting parts of the job. What we release at JavaOne will be the open source version of the SDK stack. It will expose most of the intended extension mechanisms and facilities but will still be a ways from the full multi-node stack, which I suspect is what you are referring to.

One of the nice things about open source, though, is that it allows us to be more transparent in our development process. With the JavaOne release will also come our road map for the features necessary to get us all the way to the multi-node system we intend for production use.

As for “how far along” we are, it’s kind of a hard question to answer. Most of the APIs and system design are done, though there are a few areas that we may still be finalizing post-JavaOne. Those are among our highest priorities, and there is ongoing work on them. As you may have noticed, we are also growing the team; Sun has opened requisitions for a number of new team members. That’s part of what makes it hard to talk in terms of timetables. More people means less time 8)

(3) As Darkstar’s daddy I can fill you in a bit on that background. Darkstar borrows from a number of places, including tuple-spaces and some past transaction-based multi-processing systems. I designed the basics about 5 years ago and only this year came across the idea of transactional memory. There are definite similarities, but it’s more, I suspect, a matter of common influences and similar goals than any direct inspiration in this case.

A primary goal of Darkstar was to make parallel processing manageable by game coders. Game coders live in a mono-threaded world. This meant a requirement that all the hairy issues of multi-processing, such as races and deadlocks, be handled for them. An apparently mono-threaded, event-driven model was an ideal choice for the “face” to present to game developers.

We give the average game developer the simple rule that they want to avoid object contention as much as possible. We also tell them to lock objects early (“getForUpdate”/“markForUpdate”) when they know they are going to modify them. All of this is to avoid deadlocks and rollbacks and, where unavoidable, to make them happen as early as possible. We provide feedback on such things as the percentage of tasks aborted, and access to which specific objects are triggering the aborts, to help the developer find points of contention and fix them.

For someone more savvy, it’s a fairly straightforward process to do things like taking your object locks in a consistent order and pre-locking what you intend to modify.
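For concreteness, here’s a minimal sketch of that rule against the app API as it stands in the current SDK, as best I recall the exact signatures; `Player` and `damage` are made-up names for illustration:

```java
import java.io.Serializable;

import com.sun.sgs.app.AppContext;
import com.sun.sgs.app.ManagedObject;

// Made-up game object; any ManagedObject must also be Serializable.
public class Player implements ManagedObject, Serializable {
    private static final long serialVersionUID = 1L;
    private int hitPoints;

    public void damage(int amount) {
        // Declare write intent at the top, so the lock is taken before
        // doing work that a later rollback would throw away.
        AppContext.getDataManager().markForUpdate(this);
        hitPoints -= amount;
    }
}
```

The same idea applies when you fetch through a `ManagedReference`: use the for-update variant of the getter when you already know you’ll modify the object.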

My sample game that I am working on now has very few tasks that get abandoned and restarted, actually. Decoupling the one area where I had contention pretty much fixed any issues.

Having said that, Darkstar services live totally outside of the transactional task execution context. If you have your own way of managing data, you can certainly hang it into the environment as a service.

(4) You’re certainly free to implement your own data service. Or you can put a secondary database service alongside the existing services for things you need to store relationally. There is a current known limitation in that space, as discussed in one of the threads, but it’s a limit we very much intend to remove.

As for network connected storage, we’re doing some looking at that right now so stay tuned…

(5) Heh. You’re right, it’s hard to make that work right. It’s certainly a goal, but it’s a major development effort in and of itself. Deployment is an important part of the management of this system, and how you handle upgrading is a part of that. There are a number of issues involved, from just how you roll the classes forward and backward in an organized way to what you do about non-compatible class changes.

The interview was with me and, I’ll be honest here, at the time I thought the solution was fairly straightforward. I’ve since become more aware of the issues involved and it’s not the slam dunk I thought it was then, but it’s still an issue we have to address in some way for you.

Something we are never going to be able to do, though, is protect you from uploading new classes that are broken. That’s what staging and QA are for. 8)

( 6 ) There is Nooooo… question 6.

(7) We’re starting to get into stuff from the extension manual on the next question but briefly…

The system handles this by allowing services to be participants in the transactional context of a task. The channel service for instance does not actually send the data you put to a channel until the commit phase of the task. Your own services can also participate in the transaction.
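As a rough illustration of that pattern, here’s a participant-style buffer that holds writes until commit. I’m sketching the shape from memory, so don’t take these method names as the real `com.sun.sgs.service` SPI; they’re assumptions about the usual two-phase participant shape:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: buffer outbound messages during the task and only flush them
// if the transaction commits. The prepare/commit/abort trio mirrors the
// standard two-phase participant shape; the real SPI types are elided.
public class BufferedSender {
    private final List<byte[]> pending = new ArrayList<byte[]>();

    public void send(byte[] message) {
        pending.add(message);        // nothing touches the network yet
    }

    public boolean prepare() {
        return pending.isEmpty();    // true == read-only, nothing to commit
    }

    public void commit() {
        for (byte[] message : pending) {
            // actually push the bytes out to the network here
        }
        pending.clear();
    }

    public void abort() {
        pending.clear();             // task rolled back: drop everything
    }
}
```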

Services can also queue tasks. The solution for blocking IO is indeed not to block, but rather to let the service handle the IO asynchronously and queue a new task to handle the results. A number of our own services do this, and the extension manual shows a toy example of doing it as well.
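A sketch of that shape, with the blocking call pushed onto the service’s own pool; `scheduleResultTask` and `doBlockingCall` here are hypothetical stand-ins for your task-queueing hook and your external call, not real API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: the service owns a thread pool for blocking IO, so no task
// thread ever blocks; results come back into the game as new tasks.
public class ExternalCallService {
    private final ExecutorService ioPool = Executors.newFixedThreadPool(8);

    public void persist(final String characterId, final byte[] payload) {
        ioPool.submit(new Runnable() {
            public void run() {
                byte[] reply = doBlockingCall(characterId, payload);
                scheduleResultTask(characterId, reply); // queue a new task
            }
        });
    }

    private byte[] doBlockingCall(String id, byte[] body) {
        return new byte[0]; // stand-in for the real HTTP/DB round trip
    }

    private void scheduleResultTask(String id, byte[] reply) {
        // stand-in: hand the result to the task queue as a fresh task
    }
}
```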

( 8 ) That’s why we have our own custom object store: you are right, its performance is key to everything.
But you don’t have to tear everything down. Again, a scripting engine can and probably should run as a service.
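To sketch what I mean, using the stock javax.script API rather than anything Darkstar-specific (a real version would also have to mind the engine’s thread-safety):

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

// Sketch: the heavyweight interpreter is built once per server process
// at the service level; short-lived tasks just call into it.
public class ScriptService {
    // May be null if the JRE ships no JavaScript engine.
    private final ScriptEngine engine =
        new ScriptEngineManager().getEngineByName("javascript");

    public Object run(String script) throws ScriptException {
        return engine.eval(script);  // per-task cost is just the eval
    }
}
```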

( 9 ) Again, async IO is handled by writing a service, same as blocking IO. You get the asynchronous event inside of the service and generate a task to handle it. This is in fact exactly how the channel manager communicates with the system.

( 10 ) You won’t start to see this until the multi-node versions start rolling out, but because processing nodes are symmetrical, we are free to move user connections between them. One reason this is done is for fail-over, but it can also be instigated by the system for load balancing purposes. Within the system itself we can track communication groups as well as communication partners (co-channel members) and shuffle you around the data center as we need to.

We are also frankly counting on a serious data center. This isn’t something you build with white boxes and Linksys routers from CompUSA… We’re talking more like gigabit and InfiniBand switches in the back-end. The flip side, however, is that you only need one data center to host many games, so what it costs you in professional infrastructure you get back in efficiency of operation. We expect this to create a market in Darkstar hosting, actually.

Wow. A bunch of great questions. Some of these will be answered in more detail when the extension manual is released, but I hope that at least gives you the flavor.

Hey,

Thanks for the detailed responses! Very much appreciated!

The system makes it unwieldy for me to quote, so I’ll respond using the same numbering.

(EDIT) - oh, I should also clarify that our clients are all Flash. Only the backend is java. Not that this should matter much as the client protocol’s pretty simple.

(3) Gave this a little more thought, and it doesn’t seem as big of a deal as before. The only big reason I could think of for task-specific locks would be to prevent the overhead of a task even firing if it knows it’s going to blow up at the top. In practice this shouldn’t be a huge deal, and if the performance counters work out it should be easy to spot and deal with.

(5) Heh, all good stuff to know. I’m mostly referring to a busted runtime. Even good testing schemas can miss obscure runtime bugs that grind the system to a halt.

(7-9) Sounds like a lot hinges on how services are designed and provisioned. I do feel a little better knowing how widely they’re used, though; it makes good sense.

(10) Ahh, I removed the last CompUSA Linksys switch from our datacenter a year ago. Nothing but multi-gigabit backbone trunks from gigabit edge switches for me :slight_smile: Although InfiniBand might be pushing it a little.

To put a little perspective on this, I plan to launch our next major game service on a dedicated set of 25 servers with at least eight total cores and four gigabytes of RAM apiece. If the system is popular, there’s a chance it will balloon past 50 within a few months, then hopefully grow with the company.

No tower whiteboxes stacked on their side, or bottom-dollar ev1servers networking here.

This specific point poses one of the bigger issues for me; currently we plan on having one shard per server. Upon connect, the user will be given a list of servers with empty user slots, ordered by the user load on each. This is an easy approach that is difficult to argue around. Personally I’m all for the magic that is auto resource balancing, but in practice it rarely exists, if at all, in the enterprise!

I’ll list out a few obnoxious issues that I’ll run into personally, or that I see others hitting pretty soon:

  • Hosted applications aren’t always possible if you have investors who’re paranoid about control of their major products. I might be in a situation where I’d say “please!” to the offer of outsourced game hosting, but due to concerns about the longevity of that hosting, legal issues, costs, etc., it might not be plausible.
  • In a custom racked setup, crosstalk between racks can be an issue. Not everyone wants 5,000 cable trunks going back to a c6509 or hp5505zl or whatever they have. Some smaller companies won’t have the option, as individual racks in large ‘bulk’ datacenters only allow single gigabit crossconnects between racks. This isn’t a major problem, usually, but it is real.
  • The mesh scaleout isn’t solely a datacenter or hardware issue; it’s a law-of-diminishing-returns problem. For example, take the much simpler case of someone scaling out a cluster of MySQL servers:
    • Person has one “master” server, and N slaves. As they add slaves, read traffic is spread evenly among the N slaves.
      • For each slave added, X more read queries can be run against the service; this initially scales linearly with the number of servers.
      • Over time, the MySQL master usually gets more writes to it, which each slave must also execute. Since almost no one runs a service which is purely read growth with no write growth…
      • If for every N slaves you add, the write load on the master has also increased 5-10%, the actual effectiveness of your entire cluster drops. Past a certain number of slaves, the amount of read oomph you get by adding a slave is completely negligible (see the sketch just after this list). This number tends to be a lot smaller than you’d expect; I’d say 15 nodes at most for the exceptionally little-written, much-read website. Of course, now we all know that if you have that much read load, you really ought to be using a dozen cheap memcached machines :wink:
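To put rough numbers on that hand-wave, here’s a toy model of the replication tax; every figure in it is invented for illustration, not measured:

```java
// Toy model: each slave spends `writeFraction` of its capacity replaying
// the master's writes, and that fraction creeps up as the service grows.
// The 10% starting write share and 3-points-per-slave creep are made-up
// numbers, chosen only to show the shape of the curve.
public class SlaveScaling {
    public static void main(String[] args) {
        double prev = 0;
        for (int n = 1; n <= 20; n++) {
            double writeFraction = Math.min(1.0, 0.10 + 0.03 * (n - 1));
            double effectiveReads = n * (1 - writeFraction); // "slave units" of reads
            System.out.printf("%2d slaves -> %5.2f read capacity (marginal %+5.2f)%n",
                              n, effectiveReads, effectiveReads - prev);
            prev = effectiveReads;
        }
    }
}
```

With those made-up numbers, the marginal gain from each new slave hits zero around 16 nodes and goes negative after that; that’s the knee I’m talking about.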

I believe most mesh setups suffer from the same ordeal in a way I can’t adequately describe right now. As one adds more servers, each server becomes likely to get more load than it had when there were fewer servers. Eventually adding more machines to the mesh becomes pointless. Depending wildly on the design, that problem can come after 12 servers, 50 servers, or 100 if you’re lucky.

I’ll stick in the point here that Sun engineers probably outclass me, as the largest DC setup I’ve managed was 30-some-odd racks across two datacenters. If it turns out the design can deal with the mesh expansion, that’s fantastic! If not, we’ll still have to shard, so long as I can figure out where the diminishing returns hit so we get the best balance of performance vs scale.

Ah yes.

Our scheduler is actually pretty smart and has the hooks in place to hold a task back until another task it previously contended with has finished. Having said that, not all the plumbing between it and the data manager needed to make it aware of that is done yet, but we’re working on it 8)

Well, there are advantages to working for a hardware company… you get cool toys to play with :wink:

I can’t imagine a good gigabit back end being a problem for anything but maybe the most massive deployment. Having said that, we’re still working on getting quantified numbers and tightening up our protocols, so I can’t give you any real hard numbers yet… but hopefully soon.

Uh huh. And if you’re not setting out specifically to solve this problem, then why waste bandwidth on it?

But for us it’s a core area. One of the key problems we took on was how to do this without sharding. The worst case of sharding you see, which is the MMO space, sits on top of “geographical” load balancing, and the resulting combination creates massive waste of system resources. By being able to tear down those walls of pre-allocated resources, we bring down the total amount of hardware needed to run the game and get much better efficiency in the back-end.

If you’ll allow me a fast example, even games that do organize themselves traditionally can benefit from this. An MMO might still break itself down into “zones” for ease of world construction and manageability of play. But the way that’s been done up till now, fixed memory and CPU resources had to be allocated to each zone just to hold state. With Darkstar, only zones in use actively take resources, and a single machine naturally and dynamically splits itself up to handle multiple zones according to the current usage of each. The result is that the size of your world is no longer limited by your machine count; the only thing machine count affects is how many users the total world will support.

And you get durable persistence of zone state thrown into the mix “for free”.

The first step of this load balancing is pretty simple… users create load. Spread the users, spread the load. The second step, though, to make it really scale out, as you observed requires a bit more intelligence in the back end. All the clues needed to create affinity groups are provided to us in the patterns of object access and channel joining. Making the best use of that information, though, is sure to be an area of ongoing research, which is one of a number of reasons this is a technology out of Sun Labs.

Agreed. To some very large companies we’ve already talked to and continue to talk with, control of their own destiny is critical. We expect those “customers” to build their own data centers. Still, those customers often have more than one online product, so an in-house “hosting” arrangement where they share equipment and operating expenses across all their products, as opposed to one-off racks of equipment for each new game, has real value to them.

For those not in quite so extreme a place, we hope the fact that Darkstar is open source will attract a reasonably sized playing field of hosting providers. Darkstar app and service code should move freely from one to another should the need arise.

And obviously there will be some trade-offs between how “big” a back-end someone builds and their scaling limits. Part of the point of the playground is for us to be able to build out and test “reference implementations” of the back-end.

Oh sure. To over-simplify… nothing ever scales linearly.
That being said, we’ve really designed to exploit certain inherent limitations in the class of applications we’re targeting.

It’s all well and good to talk about a giant melee of 5,000 characters at once, but no one’s screen has either the size or resolution, nor their graphics card the ability, to render and display all that. And if they could, the human being would be in total sensory overload anyway.

Play naturally splits itself up into groups. It’s the social instinct of humans to form these groups. Even the simplest and best-designed chat room has a real limit of probably somewhere between 20 and 30 people all talking at once before it becomes totally unmanageable and unreadable.

People will naturally divide themselves up, but what we are really reaching to do is allow the game designer to let their users make those decisions dynamically and have the resources follow them, rather than arm-twisting the users into pre-designed clusters that are tied to resources.

This way, wherever the users are, that’s where the processing power goes. As opposed to today where, particularly in the MMO space, you see one of only two things: geographically assigned hardware (a great strategy if people were Gaussian, but they ain’t) or “instanced missions” (which build hard boundaries around very small groups and can’t have long-term persistent state).

Well, we’re a lot more than just a dumb grid of hardware 8) We are depending on transient affinity groups to form naturally, but given the subject matter we’re addressing, I think it’s a pretty safe bet.

Well, I’m hoping you never have to… that’s my goal as lead architect… but if you only have to shard at a scale two orders of magnitude greater than you would today, I’ll still consider that a win 8)

Aww, have to one-up me by splitting your quoted replies :stuck_out_tongue:

I misspoke here. I desperately wish to have a scalable architecture instead of a bunch of uneven shards, heh. It just sounds too good to be true.

Heh. I’m glad someone else sees the insanity in the secondlife model. My girlfriend’s sick of hearing me rag on them :slight_smile:

Nice plug :wink:

I really hope to do this myself, but I do need some ammunition for selling it to my coworkers. Looks like we can’t test our applications on it until the rest of the docs come out, but then…

Internally to a shard, all the games already do this. Even the new ones planned have limits on the number of users in each “zone”, for gameplay and server reasons. This is all stuff we’ve figured out already, and from your answers it looks about in line with your eventual development goals. Our internal project (which sadly I wasn’t able to be involved in) tried a few of the same design goals for scalability (at a much smaller scale). However, the chat channels would break down the scalability there…

In the case of, say, guilds and chat groups, the crosstalk gets heinous. While you’d want users who are grouped and playing together migrated to the same server, you can’t both have that and keep their constant guild chatter localized.

However, I believe that specific situation can be remedied by using channels for local chat and a service gateway to a jabber cluster for more global chat and crosstalk. The more we can ensure servers isolate their resources, the more independent they are, and the further they scale. I’d be thoroughly impressed if you guys can pull off a scalable, open-sourced game server which actually uses all of these advanced scaling concepts. Good luck! See you at JavaOne :slight_smile:
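P.S. To make that split concrete, the shape I’m picturing is roughly the following; `LocalChannel` and `XmppGateway` are hypothetical stand-ins, not real Darkstar or XMPP-library types:

```java
// Sketch: channels carry room-local chat on the node; a long-lived
// service-level gateway carries guild/global chat out to the jabber
// cluster, keeping that crosstalk off the game servers' mesh.
public class ChatRouter {
    public interface LocalChannel { void send(String from, String text); }
    public interface XmppGateway  { void broadcast(String from, String text); }

    private final XmppGateway gateway;  // lives outside tasks

    public ChatRouter(XmppGateway gateway) { this.gateway = gateway; }

    public void say(LocalChannel room, String from, String text,
                    boolean guildWide) {
        if (guildWide) {
            gateway.broadcast(from, text);  // crosses servers via jabber
        } else {
            room.send(from, text);          // stays local to this node
        }
    }
}
```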

Thanks, these are all great questions that clearly come from experience.

Something I’ve been very impressed with is the level and quality of the community we have forming around this… no, please, keep forming while I finish getting the community resources ready!