n00b Qs: Startup, shutdown, debugging, etc.

chumDS · April 14, 2007, 7:21pm

A few SGS n00b Qs:

Given that I can’t have a non-final static variable in a managed object, how would I implement the Singleton design pattern?
Is named-bindings the “best practice” for that?
How can I recognize when the server first starts up?
I don’t mean the AppListener.initialize() method; I mean when I’ve shut down the server and am starting it up AGAIN. I’d like to do some prep-work when this happens (simple example: send myself email “Server started at %TIME”), but don’t see a good interface in the API for this. I have the rather lame “if CurrentTime-LastTime > 30 seconds”, but that’s a bit hackish and unreliable (in both directions.)

Is there “an approved way” to do this?

How to do a “graceful” shutdown?
Right now, I just ^C (or kill) the java process. I’d like there to be a way for the administrator to log in and initiate a task to shut down the server, which could affect the game logic (i.e., no new logins, broadcast msg “The server is going down in 90/60/30/15/10/5 seconds”, etc.), then eventually shut down programmatically with something a bit less harsh than “System.exit(0)”.

Is there “an approved way” to do graceful shut downs?

Do I have to run my AppListener under SGS? Are there any source-level debugging tools I can use on my AppListener?
Is there a way to ask the system “give me a list of all managed objects in the database”?
Currently, I keep a masterObject which contains a list of managed references. The problem I’m anticipating with this is: once my world has a million trees and a million birds and a million fish and a million tigers and a million whatevers – every time I want to generate a new tree, I have to add a ManagedReference to the end of this list, and then this gi-normous object has to be written out to the database.

In addition, it has the problem that if I ever accidentally make a ManagedObject but forget to put it on the list, I “leak” objects into the database, and there’s no good way to get them out without erasing & starting over.

How do you guys handle large object counts?

Jeff · April 16, 2007, 1:04am

[messed up post removed]

Jeff · April 16, 2007, 1:14am

Actually, you can do it the way you think you would, with a static field, but you don’t WANT to do it that way.

The reason is that such a “singleton” is only single with respect to one VM, but Darkstar is a distributed processing system with N VMs. Your objects may get processed on any of them at any point in time.

A “Darkstar Singleton” is simply a managed object with a known name. You attempt to retrieve it by that name. Should that fail, you create it.

That simple.
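
A minimal sketch of that get-or-create pattern. `BindingStore` is a hypothetical in-memory stand-in for the name-binding service so the example runs without the SDK; its `getBinding`/`setBinding` methods only imitate the idea of a lookup by name and are not the SDK's signatures. `WorldState` and the name `world.state` are likewise made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a name-binding service: maps well-known names to objects.
class BindingStore {
    private final Map<String, Object> bindings = new HashMap<>();

    // Returns the object bound to the name, or throws if nothing is bound.
    Object getBinding(String name) {
        Object value = bindings.get(name);
        if (value == null) {
            throw new NoSuchElementException("no binding for " + name);
        }
        return value;
    }

    void setBinding(String name, Object value) {
        bindings.put(name, value);
    }
}

// The "singleton": an ordinary object that is only ever reached through its known name.
class WorldState {
    static final String NAME = "world.state"; // a final static constant is fine

    int loginCount;

    // Look the object up by name; if that fails, create it and bind it.
    static WorldState getOrCreate(BindingStore store) {
        try {
            return (WorldState) store.getBinding(NAME);
        } catch (NoSuchElementException notBound) {
            WorldState created = new WorldState();
            store.setBinding(NAME, created);
            return created;
        }
    }
}
```

Every caller goes through `WorldState.getOrCreate(store)`, so the instance lives in the store rather than in a static field of one VM.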

You can’t and shouldn’t. Remember, Darkstar is a system that fails over and does its best to make it appear to your code like it is constantly running. If you shut the whole thing down, that’s effectively just a long pause in operation. When it restarts, it restarts exactly where it left off.

This is really desirable behavior. It just takes some getting used to.

If you want a “reset” then write reset logic into your app in response to a reset packet you send.

Again, a full Darkstar back end gives 5-9s uptime. What is it you are trying to accomplish? Be more specific and maybe we can help more.

Administration features are being designed with the “big backend” fully distributed system we’re implementing now. We’d be interested in hearing thoughts on what features you might like to see.

For the 0.9 SDK stack, just kill it. The whole point is that it recovers from such things 8)

Yes and yes. I debug right out of Eclipse. Just set the SGS kernel as your main class and put breakpoints in your code.

There is a way to iterate over all named objects. See the DataManager’s API.

Don’t do that 8)

Seriously, we can’t protect you from every bug you might create. Having said that, we are examining what sort of secondary database tools might be useful. At a guess, this probably falls into that space.

Very well? grin I’m afraid I need a more specific question. Very large is very relative.

chumDS · April 16, 2007, 2:13pm

I follow what you’re saying – I really do! – and I get that this is desirable behaviour most of the time. However, for my particular application, it is desirable for the simulator to have a chance to do some “catch up work” when it’s been “paused for a long time”, before it allows anyone to log in, for example.

So I have different states for my sim:

Initial run – caught by initialize(), lots of prep work, then you can log in.
Normally running – the sim proceeds at its normal pace.
Restarted – the sim would like to prohibit logins while it does some long-running catch-up tasks, then enable logins.

I’d like to take the server down for an upgrade (software or hardware.) This will take a couple of hours. When the server comes back up, I’d like to catch the “you’ve been down for a long time” event.

Again, I can do this with “if it’s been 30 seconds since your last ‘frame’…” thing – that just seems hackish, and the sort of thing that it would be nice to catch as an event or with an API callback.

So, is there no support for the “small backend”? I understand that BB is how you guys make your money, and the thrust of the project, but it might be worthwhile to support bootstrap operations that begin with a small cult following on a single personally-owned server before making the leap to “the big time”…

/shrug ok. Just seems heavy-handed, is all. Just to be clear, though, 0.9 & beyond is designed to recover from System.exit(0)?

Of course, but the docs also say that too much named-binding is a performance issue and, to be honest, it seems like a hassle for run-of-the-mill objects. My question was about non-named objects.

Of course, I don’t expect SGS to “fix all my bugs” – I was just asking for a bit of help with some tools to help me find/fix my own bugs

Are the 2ndary DB tools slated in the javaOne timeframe, or are they “later”?

By “you guys”, I meant the other developers. As in: “what do you guys do when you want to iterate over all the not-named MOs?” Maybe a better way to ask the same question is: I’m currently keeping a “master list” of Managed References, and this strikes me as rather heavy-weight since, every time I add or remove an item to/from the list, it needs to be re-streamed to the DB. As my list of MRs grows, this list may become quite substantial. So the Q is: do folks have a better way to manage their long list of (unnamed) MRs, so that they can be iterated under the current SDK?

Thanks!

sethp · April 16, 2007, 3:11pm

Hi chumDS. I wanted to respond to a couple of these points en masse. Let me know if this helps…

To the point of restart and failure, this is more complex than it may seem at first. Also, when Jeff talks about design with the “big backend” in mind, it’s not that this is all we will support, it’s that we need to completely understand a big system to fully understand how the design should work from one node to N nodes. We provide the same model regardless of your system’s size, and so we want to get that model right. Part of the complexity is that, in an enterprise cluster, the system may literally never go down, but individual nodes will go down or come up. Why should your application notice this behavior unless it is doing something specific to a given node, which it shouldn’t be doing?

We have talked about several possible models for notifying the application in these cases. Thus far, none of them seem quite right. It may seem like a hack to you, but really doing some time-delayed check is probably the best thing. Note that we have discussed the possibility of a notification if there was system-wide failure, but thus far we don’t have any concrete plans to notify applications of this case. For your system, I would schedule a periodic task that is supposed to run every 30 seconds (if that’s your rough window), and have it check that it’s not more than N seconds off from when it should be running.
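
A rough sketch of that time-delayed check, under some assumptions: `GapDetector` and its method names are illustrative, and timestamps are passed in so the logic can be exercised on its own. In a real application the last-run timestamp would have to live in persisted state so that it survives a restart, and `onTick` would be called from a periodic task scheduled through the SDK; neither part is shown here.

```java
// Detects long gaps between runs of a task that is supposed to fire at a fixed period.
class GapDetector {
    private final long periodMillis;
    private final long toleranceMillis;
    private long lastRunMillis;

    GapDetector(long periodMillis, long toleranceMillis, long startMillis) {
        this.periodMillis = periodMillis;
        this.toleranceMillis = toleranceMillis;
        this.lastRunMillis = startMillis;
    }

    // Call from the periodic task. Returns how late this run is in milliseconds,
    // or 0 if it arrived within the tolerance. Always records the current run.
    long onTick(long nowMillis) {
        long expected = lastRunMillis + periodMillis;
        long late = nowMillis - expected;
        lastRunMillis = nowMillis;
        return late > toleranceMillis ? late : 0L;
    }
}
```

A non-zero result is the cue to run catch-up work before accepting logins. The check cannot tell why the task ran late, so a stalled host and a deliberate shutdown look the same to it, and a gap shorter than the tolerance goes unnoticed.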

Note that some of what I said above applies to shutdown too. It’s not that we won’t do it, or that we will only support it at the high end. It’s really just that understanding the complex cases of a large system affects the design that will be in place for a system of any size. We want to finish this design before we push any real shutdown tools. I know it seems wrong (or, at least, I hope that it seems wrong :)) but for now just ^C is the right way to go. Since the system is durable, this is actually safe to do.

Finally, to name-bindings and iteration. We have discussed this issue here and with customers in great detail, and indeed we added the ability to iterate over names after some of these design discussions. The problem is that being able to iterate over every object in the store doesn’t really buy you anything. What are you going to do with those objects? In practice, there has to be some reason you’re interacting with a given object, and usually that has to do with its relation to other objects in the system. Please don’t take this the wrong way, but a well-designed application shouldn’t need to maintain a collection of all the objects it has managed. You should be able to start with some smaller set, either name-bound or installed as call-back handlers, and then build out graphs from there. If you have some specific use cases for why you can’t do this, and why you really need one big collection of references, then please let us know!

seth

Jeff · April 16, 2007, 3:50pm

Then you’re doing exactly the right thing. Set up a periodic task that checks how long the system has been down.

Or send it a packet from your own admin client that tells it to cleanup.

Well, “hackish” is a relative term. It fits the model we are designing to, which is that there may be processing delays but processing never gets dropped. Delays can occur for a lot of reasons, not just because a host goes down. They can occur because of overloaded systems, from hiccups in your back end, from loss of supporting hardware such as switches, and so on.

These are things you need to handle anyway to be robust, so it’s really a requirement to watch this.

Actually, that’s not the issue at all. The “big backend” is also slated to be open sourced.

We make money from the whole suite, by encouraging the building of SGS applications that run in SGS back-ends.

What is important to understand, however, is that the “little server” is an SDK, not a production server. It has one goal: to present, as effectively as possible in a single-server environment, the execution environment your code will encounter in the multi-server back end. Things that help the developer to simulate that in their own environment are definitely part of the goals for the SDK. Things that might be specific to the SDK server by and large are not.

As we finish work on the production servers you will see many features flow back into the SDK. But it is not our intention to have anyone trying to run production services based on it.

Does that make more sense?

Our focus is very much on small developers and we do want to see you “bootstrap.” This is how we see the progression:

(1) You learn, develop and prove basic code-correctness on the SDK
(2) You go to the Playground or similar service run by an SGS hosting provider to scale test.
(3) You deploy to a hosting provider who splits revenue with you

Trying to build the kind of robust and performant system we want the SGS to be just isn’t really practical on a single-server platform. If you find that for some limited application you can actually deploy production on the SDK and it works out for you, that’s great, but it’s not our focus or intent.

Yup we had issues with the EA stack but I’ve been doing this many times nightly for a month on my current project and haven’t seen any problems 8)

Of course and I apologize if that was a bit flip.

As I say, we are looking at secondary tools to do things like introspect and edit the object store. Letting us know about actual development issues encountered in the field will definitely help with that and we appreciate that very much.

Probably later. It’s not set in stone yet, but we do have our hands full right now dealing with finishing the full production system. One possibility is that, after the code goes open source, if we can’t get to it fast enough the community could put something together.

True. But AIUI from our DB guy we’d have to do effectively the same thing under the hood anyway and this way its cost is under your control.

Part of the concern coming from our DB side is that such iteration is risky in our multi-processed environment. Objects are potentially being created and deleted in parallel so you start looking at race issues, which is part of what we’ve tried to keep hidden through the rest of the system. Such iteration also could easily outstrip the time allowed for a task so you would have to monitor your actual execution time and spawn child tasks as necessary to complete things. Finally, you could potentially lock large numbers of objects in a single task by iterating this way which could absolutely kill the server performance. Your iterating task would run into potential deadlocks often, causing a great many task aborts and re-tries.

For all these reasons, external inspection on a locked DB (either by halting the back end or by working on a snapshot) seemed a much safer option.

Ah, this gets into the area of parallel data structure which is also on our “utilities” list… to implement some things like distributed lists, and maps and such that divide their data up into usefully parallel chunks.
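
One way to picture "dividing data up into chunks" is sketched below. This is not an SGS utility: `ChunkedList` is a hypothetical class in which each inner list stands in for a separately stored object, so that appending an item only touches the last chunk instead of rewriting one enormous list.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a list split into fixed-size chunks. Each chunk stands in for a
// separately stored object, so an append only dirties the last chunk.
class ChunkedList<T> {
    private final int chunkSize;
    private final List<List<T>> chunks = new ArrayList<List<T>>();
    private int lastDirtyChunk = -1; // index of the chunk touched by the latest add

    ChunkedList(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    void add(T item) {
        // Start a new chunk when there is none yet or the tail chunk is full.
        if (chunks.isEmpty() || chunks.get(chunks.size() - 1).size() == chunkSize) {
            chunks.add(new ArrayList<T>(chunkSize));
        }
        lastDirtyChunk = chunks.size() - 1;
        chunks.get(lastDirtyChunk).add(item);
    }

    int size() {
        int total = 0;
        for (List<T> chunk : chunks) {
            total += chunk.size();
        }
        return total;
    }

    int chunkCount() {
        return chunks.size();
    }

    int lastDirtyChunk() {
        return lastDirtyChunk;
    }
}
```

In a stored version the directory of chunks would also change whenever a new chunk is started, but that happens once per `chunkSize` appends rather than on every append.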

The SDK is very much on the bleeding edge right now. There is a lot of stuff we still want to provide to you. It’s just a question of time and priorities.

tigeba · April 16, 2007, 6:35pm

I was thinking it might be nice to add true namespace support to the DataManager API, so that it would be easier to search over the repository for a specific class of object. What I mean is that now, I can define arbitrary namespaces as strings ( player., item.), but my only mechanism to search a namespace is to iterate over the entire list of all bound managed objects with DataManager.nextBoundName(), and do string comparisons on the results. For example, in my little chat simulation, I want to look for all the connected players for a “who” command. Right now I iterate over all the bound managed objects looking for names that start with “player.”, where it would be nicer to just iterate over a namespace that I already knew was just players.

Perhaps the addition of

DataManager.setBinding(Namespace namespace, String name, ManagedObject object)
DataManager.getBinding(Namespace namespace, String name, Class type)
DataManager.nextBoundName(Namespace name)

DataManager.getBinding(String name, Class type) could just default to the root namespace.

I guess I was expecting something a little more like JNDI, except maybe not as complex as JNDI. Did I just contradict myself?

Jeff · April 16, 2007, 6:53pm

I thought we had a way to start the lexicographic search at a known name.

Let me look again…

tigeba · April 16, 2007, 6:55pm

It is entirely possible I completely missed it. It would not be the first time

Jeff · April 16, 2007, 6:57pm

Yeah, search is in lexicographic order.

So if you say nextBoundName(“player.”) it will return the first object in lexicographic order that begins with “player.”, or else the next one, period, if there are none that begin with “player.”

So to be 100% robust you do have to either store the name on the object, or have a known prefix/class 1:1 relationship so you can test with instanceof, to see if you’ve reached the end of everything that, say, begins with “player.”

Does that help?

tigeba · April 16, 2007, 8:29pm

Yep, that makes sense. I looked back at the docs and it appears that I didn’t read carefully enough the section where it said that the order the names were returned was based on the name encoding. I think I assumed they were returned in some arbitrary order, or the order they were entered into the datamanager.

Jeff · April 16, 2007, 8:39pm

S’okay, I read too quickly too…

Correction: you don’t need to store the name, because nextBoundName(…) returns the string, not the object.

You can just check that 8)
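
A sketch of that prefix scan. `NameIndex` is a hypothetical stand-in backed by a sorted set, so the example runs without the SDK; its `nextBoundName` returns the next name strictly after the one passed in, or null when there are no more, which is an assumption modeled on the behavior described above rather than a copy of the real API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for the bound-name index: names are kept in sorted order.
class NameIndex {
    private final TreeSet<String> names = new TreeSet<String>();

    void bind(String name) {
        names.add(name);
    }

    // Returns the first bound name strictly after the given one, or null if none.
    String nextBoundName(String name) {
        return names.higher(name);
    }

    // Collects every bound name that starts with the prefix, stopping at the
    // first name that no longer matches (names are visited in sorted order).
    // A name exactly equal to the prefix is not returned by this sketch.
    List<String> namesWithPrefix(String prefix) {
        List<String> result = new ArrayList<String>();
        String name = nextBoundName(prefix);
        while (name != null && name.startsWith(prefix)) {
            result.add(name);
            name = nextBoundName(name);
        }
        return result;
    }
}
```

Because names sharing a prefix are contiguous in sorted order, the loop can stop at the first non-matching name instead of walking every binding.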

chumDS · April 17, 2007, 3:10am

(Bulk responses)

Btw, I hope that my answers/questions weren’t “too flip”, either.

I sort-of assume that you guys all have plenty-strong egos, and understand that you’ve got a great thing happening and it’s much appreciated, so my …uh… “gentle critiques” are intended as either suggestions or questions, and nothing more. I certainly don’t mean things like “seems hackish” to imply any sort of derision toward your system.

(To clarify that example, the problem with that particular “hack” is that it’s inaccurate. If I ^C & restart the server immediately, the 30 sec timer misses the event. If I do something that bogs my system for 30 sec (I actually had an IO (non-SGS) block across that timer, once!), then I get a false-positive. That’s what I mean by “hack”, as opposed to the “clean”-ness of an API.)

Yeah, I get the “cluster never goes down” thing, although it might. What about when Vsn 2.0 (of my game) comes out, and I have to upgrade everything?

Uh… shoot, I had a good example of the “iterate” thing, but can’t think of it. A lousy example is: I want to give all players an extra day of play-time, so need to iterate over them. Of course, my players are named, which is why this is a lousy example, but maybe I decide that all my …uh… bears need an extra 200 Hit Points or something. I don’t keep named bears in my world, as there are millions of them. (Not to mention the salmon that they eat! ;)) Hmmm, ok, not exactly the best example – but you get the idea!

Oh! How about object migration. Like, all my millions of (unnamed) trees are going to become treeVsn2-s. It’d be nice to ask the server:

myTreeObject [] allTrees = myAppListener.getAllObjects (myTreeObject.class);

And then migrate them. Btw, this is a good reason for the AppListener.serverStarted() callback, too.

I don’t have any objection to ^C, SIGKILL or System.exit(0), just wanted to make sure that that’s the sort of “robustness” you guys intended, and that I wasn’t “pushing too hard.”

Anyway, thanks for all your help – I’m well on my way. I’ve got another n00b-Q, but I’ll start a new topic for it.

Jeff · April 17, 2007, 2:50pm

Well, I guess I’m trying then to understand the difference in your design between a 30-second delay and a 30-second downtime. What do you see as the functional difference? Our initial thinking here is that they are the same thing to the code and would have to be handled in the same manner.

A very good question 8) We’d like to make that upgrade as seamless as possible. Having said that, there are numerous issues with “rolling updates” that make it a difficult problem to solve. It’s something we’re actively thinking about right now, but implementation is scheduled for after we have a basic multi-node system up and running correctly.

You shouldn’t be iterating over the whole database then. That would be “A Bad Thing”. You should be either using a naming convention or keeping your own list of bears, depending on how many you expect in game and how fast the list changes.
As I mentioned, we intend to provide some data structures for doing large lists and such in an SGS-friendly manner.

Well, if they are serialization-compatible changes, you just change the code.
Otherwise, this can be done with some serialization tricks (readResolve/writeReplace). Done this way, you can let it happen in a lazy fashion. Finally, if you want to do it actively, again, you should keep your own list.

We are thinking at the moment about what tools might be needed to help you deal with serialization-incompatible changes. That’s part of the whole upgrade issue mentioned above.

Except server start isn’t the issue here. Reset of your world is. So build a reset command. That’s a lot less “hackish” than making it a side effect of starting the server.

chumDS · April 19, 2007, 4:58am

Ah, that’s easy.

In a 30 second delay, the game-server has been running the whole time, but it had a really long frame (assuming a game that has frames measured in FPS where F >= 1).

In the 2nd example, the admins took the server down intentionally, probably changed something, then brought it back up again.

I suppose this probably comes up more during early development than during “everything is running smoothly” post-delivery time. Still, helper-APIs for early-development are nice to have! Again, I think my migration examples are pretty good.

Ok, that’s really all I was asking – if something like that was already in.

Also, could you please clarify the issues around “too many named items”? I don’t remember where I saw it, but I thought one of the docs told me not to go and name too many things, because that was a performance issue.

…Or did I dream that?

Well, sure, except that, let’s say I’m in middle-development (vice early-), so I’d rather not reset my ENTIRE world – I just want to upgrade the trees?

(Ok, so that was a tad abbreviated… 8) )

Jeff · April 19, 2007, 2:31pm

Ah, so change is really the issue, not the downtime. Good. I was just trying to clarify that.
I realize that today you have to take the SDK server all the way down to get classes to unload. I don’t expect this to be true forever. The EA2 actually had basic commands for shutting down and restarting individual apps, but we haven’t gotten back there yet with the 0.9 stack.

Well, in early development I usually wipe the object store between each run because I want to see what the stack does from a clean start-up anyway. Leaving the results of previous versions around at that stage seems dangerous to me.
YMMV.

One thing that I think is emerging from this discussion is that maybe, as a provided extension, I ought to think about throwing together a JMX manager that makes it easy for you to wire such admin controls into your app.

Does that sound useful to the user-base?

No, but the data structures are on our list and I may start trying to develop some as parts of demo applications.
Just a question of bandwidth and resources. We’re a serious Sun project (after years of working with management to get there) but we’re still a lean project. And given the cutting edge research nature of some of the big problems we have to have solved for you, it really is a good way to operate. As those Big Issues get nailed down it will be possible to start parallel tracking more of these details and getting more people to assist with them.

Welcome to the bleeding edge. The bright side, FWIW, is that as someone who worked coding on the cutting edge of platforms such as the SEGA Saturn, I can confidently say that, as a platform, we’re in better shape than most new platforms were in their start-up period. 8)

Had a talk with our resident database expert and we may have over-stated this a bit somewhere (probably my fault.)

There is a cost associated in a few places with naming things, but we’re working to keep it low and reasonably scalable with the understanding that you will potentially have a lot of named objects. The concern is basically: don’t go silly with it. In addition to the costs of the storage and such, and potential scaling issues if you massively overuse it, every name lookup is an extra indirection, and this costs.

So the advice is this. Where you reasonably need name-resolvable roots of object graphs, go ahead and use it and don’t be afraid of it. But if you’re designing things like data structures, referencing through names is a good deal less efficient than managed references, so don’t use names where you could use a managed reference. An example might be a hash table built out of managed objects-- you would want to find nodes by managed reference and not by some naming scheme.

Well see my earlier comments. One of a few things I think will apply:
(1) The code change is a serialization-compatible change. In this case you can just change the code and, as long as you have built appropriate defaults (static or calculated) for the new fields, it will “just work.”

(2) The code change is serialization incompatible, in which case there will need to be a conversion. This can be done today by adding readResolve methods to the affected classes and defining the new code in a new class. We are also looking at what other tools may be useful in this case.
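
A small sketch of case (2) using plain Java serialization, whose standard hook for substituting an object as it is read back in is `readResolve`. The class names `TreeV1`, `TreeV2` and `MigrationDemo` and the height fields are made up for illustration, and the in-memory round trip only imitates what an object store does; whether the SDK's store behaves the same way for managed objects is an assumption to verify.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// New representation, defined in a new class.
class TreeV2 implements Serializable {
    private static final long serialVersionUID = 1L;
    final int heightCm;

    TreeV2(int heightCm) {
        this.heightCm = heightCm;
    }
}

// Old representation. Stored instances still deserialize as TreeV1, but
// readResolve swaps each one for a TreeV2 as it is read back in.
class TreeV1 implements Serializable {
    private static final long serialVersionUID = 1L;
    final int heightMeters;

    TreeV1(int heightMeters) {
        this.heightMeters = heightMeters;
    }

    private Object readResolve() {
        return new TreeV2(heightMeters * 100);
    }
}

class MigrationDemo {
    // Serializes and deserializes an object in memory, as an object store would.
    static Object roundTrip(Object value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }
}
```

Conversion happens lazily, one object at a time, whenever an old instance is next read; nothing has to iterate over every stored tree up front.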

Really, we need to drill down further here, I think. If it’s just a code change to how trees behave, the above should work. If it’s a change, say, to how trees are placed that requires removing all the trees, then you want a “tree manager” that tracks those managed objects through managed references, can remove them, and also probably contains the code to generate them.

As I mentioned above, one thing this is suggesting to me is that some generic “control panel” functionality might be useful which is why I’m thinking about the idea of enabling easy use of JMX for basic app administration. That way, you can easily wire in a “clear trees” button on the fly if you need it.

Is there some part of the problem I’m still not following?

Anyway… as you point out, there are ways around this. I just thought that, since stuff is all neatly in a database, anyway, it’d be handy to have some DB-esque access to it. It’s not that the things I want to do “can’t be done”, it’s just that there are things that the API could do to make it easier.

The issue here is that we aren’t just a standard DB underneath. A standard DB could never scale to the kind of read/write loads we need at the speeds we need.

By limiting the scope of the Object Store we’re able to reach for the run-time performance we need. Nothing comes in comp sci without a cost somewhere.

The bottom-most layer of the Object Store in the SDK is BDB, but that’s just an implementation detail and a convenience for us so we aren’t ALSO trying to write reliable/recoverable disk storage right now. If we expose that, though, then all of a sudden we’re back in standard DB land.

This is why, for RDBMS stuff, the right answer is a manager that talks to the RDBMS out-of-band with the task system.

Heh, as much as reasonable. I know you AREN’T saying this… but as an extreme, I have a friend who teaches game coding for a living. (Shawn Kendall, he’s on these boards on and off.) And Prof. Kendall talks about the kids who come into his class and want to know where the “make my MMO” button is on the tools 8)

Clearly we’re all more sophisticated than that here, but the idea carries through. We can’t solve every problem you’re going to have building a game… we’ve tried to bite off what seem to us to be the biggest problems and make them more manageable. Having said that, this feedback DOES help us focus on what the real “big issues” are, so I really appreciate it.

chumDS · April 20, 2007, 5:02am

Oh, agreed, and I hope I don’t seem unappreciative – but now you’re in a weird position where (I’m using gross shorthand, here, “work with me”) “you’ve lowered the bar so far that any idiot can tackle the MMO problem”, and so now you have idiots like me looking for the “write my MMO” button

Ok, not QUITE – but there is that weird balancing act, there. It’s a little like when people hear “write once, run anywhere” and think that means that they only have to test on their one development machine, and everything else will work identically. Then we wind up explaining to them: “well, sort-of, in theory, not really…”

Anywho… you’ve been great answering my Qs, and much appreciated. I certainly don’t expect SGS to “write my MMO”, but I do so appreciate great tools, which SGS is shaping up to be, so can’t help but think of how it could approach the “just add game-logic and content, then bake at 350 for 25 minutes…” model.

That’s more of an ideal than an actual goal, though – a sort of philosophy that I see benefit to striving for – even if it’s never fully achievable, asymptotically approaching it is A Good Thing.

Jeff · April 20, 2007, 2:45pm

Heh yeah. A wiseman once said “never automate sharp objects.” To some extent, we are breaking this rule.

I’ve experienced this first hand before. My first job at Sun was doing performance tuning on the JDK and helping Sun customers tune their apps. RMI makes it so easy on the surface to write networked apps that “any fool” might think they could write one. And they could… but not one that performed well. To do that still took understanding the basics of the networking problem.

So I agree with you that there is a danger here and we need to educate early and well if we’re going to avoid it 8)

And asking the question “what COULD we do to make this better/easier?” is always a great one to be asking. We know we’re just at the start of something. What you see today won’t be the final feature set of this version, nor the last version we ever do 8) And the questions/requests/use-cases really help, so I do really appreciate them.