That seems reasonable to me. Getting a baseline like this is always a good idea. As Jeff suggested, there's a fair amount of work being done for each Task. Note that a big-T Task is managed through the DataService and, if it's not also a ManagedObject, removed after it runs successfully. In other words, your code above does the following work:
- Get the next task from the scheduler
- Create a transactional context
- Get the Task object from the DataService (see note below)
- Invoke the Task, which in turn (your code) invokes the TaskService, creating a new ManagedObject
- Remove the managed Task
- Commit the Transaction, which causes the next Task to get scheduled
- Repeat…
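The lock-step effect of those steps can be sketched in plain Java. This is a simulation with `java.util.concurrent`, not the Darkstar API: `runSerial`, the thread pool, and the comments standing in for the transactional work are all illustrative. The key point it shows is that when each "commit" is what schedules the next task, at most one task is ever in flight, no matter how many threads the pool has.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class LockStepDemo {
    // Runs 'total' tasks where each task schedules the next one only
    // after it completes -- a stand-in for "commit schedules the next Task".
    static int runSerial(int total) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4); // 3 of these sit idle
        AtomicInteger completed = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(1);

        Runnable[] task = new Runnable[1]; // holder so the lambda can reference itself
        task[0] = () -> {
            // Real code would: begin a transaction, fetch the Task from the
            // DataService, invoke it, remove it, commit -- and only then
            // would the next task get scheduled.
            if (completed.incrementAndGet() < total) {
                pool.submit(task[0]); // next task starts only after this one finishes
            } else {
                done.countDown();
            }
        };

        pool.submit(task[0]);
        done.await();
        pool.shutdown();
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("completed=" + runSerial(1000) + " tasks in lock-step");
    }
}
```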
Note that tasks are managed by the TaskManager because, in the event of a failure, we need to make sure that they still get run in the future if the current Transaction commits. So, in the case of your Task (which is not a ManagedObject and therefore not already persisted), the TaskManager is managing a new object that keeps a serialized copy of your Task. This is part of the "fairly un-optimized" code that I mentioned in my previous message. In particular, there's a lot done here with name-mapping that's going away. Still, some of this work is unavoidable if you want to guarantee durability.
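To make the "serialized copy" cost concrete, here is a sketch using only `java.io` (not Darkstar internals; `MyTask` and `persistCopy` are hypothetical names). Tasks handed to the TaskManager have to be serializable precisely so that a durable copy like this can be written and re-run after a crash:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TaskSerializationSketch {
    // A hypothetical application task; not a ManagedObject, so the
    // TaskManager has no persisted form of it until it makes one itself.
    static class MyTask implements Serializable {
        private static final long serialVersionUID = 1L;
        int counter;
        MyTask(int counter) { this.counter = counter; }
    }

    // Stand-in for the durable copy the TaskManager keeps: serialize the
    // task to bytes that can be written to the data store.
    static byte[] persistCopy(Serializable task) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(task);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] copy = persistCopy(new MyTask(42));
        System.out.println("serialized task is " + copy.length + " bytes");
    }
}
```

Even for a trivial task, that serialization (plus the name-mapping bookkeeping mentioned above) happens on every schedule/run cycle, which is where much of the per-task overhead goes.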
You're absolutely right to be digging in at this level and asking questions about throughput. In fact, I'm glad to see people doing this, because data can only help us improve the system! I just wanted to be clear about what the expected model is for the stack, and also that what you're investigating right now is a version of the system that tries to meet its functional contract, not necessarily one that performs where we'd like it to. We're working on that now, and I expect the system you see in a few months to have very different performance characteristics.
The 1600 number is about what I expected, but the slow-down is curious.
The reason I expect the 1600 number is that you're running in lock-step. You schedule one task, and then run another only once the first one is finished. This means that only one scheduler thread is ever running, so there's no parallel execution. It also means that you're seeing well under 1ms of overhead for each Transaction. When I cited 15k tasks, I was talking about parallel execution, which is what an application normally looks like. Now, it turns out that you can't just run with 10 threads and get a 10x improvement; it's sad, but true.
You should see a much better result, however, and once I address the earlier comment about name bindings and management in the DataManager, you should see closer to a 10x improvement here. Note that by default there are only 4 threads running in the scheduler; if you want, I can show you how to tune that.
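For contrast with the lock-step case, here is the parallel shape sketched the same way (again plain `java.util.concurrent`, not the Darkstar scheduler; `runParallel` is an illustrative name). When independent tasks are queued up front, all the worker threads stay busy at once, so throughput is bounded by the thread count (4 by default, per the note above) and contention rather than by one serial chain of transactions:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelDemo {
    // Submits 'total' independent tasks to a pool of 'threads' workers.
    // Unlike the lock-step case, no task waits for another to finish
    // before being scheduled, so all threads can run tasks concurrently.
    static int runParallel(int total, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger completed = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(total);

        for (int i = 0; i < total; i++) {
            pool.submit(() -> {
                completed.incrementAndGet(); // stand-in for the per-task work
                done.countDown();
            });
        }

        done.await();
        pool.shutdown();
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("completed=" + runParallel(1000, 4) + " tasks in parallel");
    }
}
```

In the real system the tasks also contend on the data store, which is why 10 threads don't buy a clean 10x; but keeping the scheduler's queue full is the difference between the ~1600/sec you measured and the parallel numbers I quoted.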
The slow-down is a little strange. Nothing you're doing should cause a significant backlog of work in the system. Have you tried running with profiling (as discussed in a previous thread)? It might show a little of what's going on. If you post your AppListener code, I'll try running your example and see if I can replicate this behavior. If I can figure out what's going on, I'll let you know. Thanks again for helping to dig into this!
seth