advice for debugging weird error

One of our (server-side) apps runs perfectly for anything up to a couple of weeks, and then will suddenly start running at 100% of the CPU. The problem is completely intermittent. When it started happening a couple of months ago, it was during peak usage, then happened a week or so later at the lowest point of usage during the day, then happened 2 days later, then ran fine for another 3 weeks.

After the latest occurrence, top shows 3 threads that are causing the problem, but as far as I know, there’s no way to map a PID to a Java thread. The stack dump produced by kill -3 shows me nothing obvious, although perhaps I don’t know what I should be looking for.
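
One idea I’m toying with is getting the app to tell me itself which threads are chewing the CPU, by periodically dumping per-thread CPU times. Something like the rough sketch below, though it assumes a JDK 5 JVM for java.lang.management, so treat it as an idea rather than something I’ve actually run:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Rough sketch only: dump per-thread CPU time so the hot threads can be
// matched up (by name/id) against the kill -3 stack dump.
// Assumes a JDK 5+ JVM where ThreadMXBean supports CPU timing.
public class ThreadCpuDump {
    public static void dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            System.err.println("Per-thread CPU time not supported on this JVM");
            return;
        }
        if (!mx.isThreadCpuTimeEnabled()) {
            mx.setThreadCpuTimeEnabled(true);
        }
        long[] ids = mx.getAllThreadIds();
        for (int i = 0; i < ids.length; i++) {
            ThreadInfo info = mx.getThreadInfo(ids[i]);
            long cpuNanos = mx.getThreadCpuTime(ids[i]);
            if (info != null && cpuNanos >= 0) {
                System.out.println(info.getThreadName() + " (id=" + ids[i] + "): "
                        + (cpuNanos / 1000000L) + " ms CPU");
            }
        }
    }

    public static void main(String[] args) {
        dump();
    }
}
```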

The worst part about this problem is that I’ve been completely unable to reproduce it during testing. Frustrated hair pulling, ahoy!

Anyone have any thoughts/advice…?

Thx
J

  1. perhaps this is better in networking, or is it nothing to do with games? :slight_smile:

  2. Could be a hacker. Never underestimate the number of strange problems they can cause, often through acts of random incompetence rather than insightful brilliance.

  3. Which stress testing framework are you using?

  1. not really games. sort of entertainment-related though.

  2. can’t see anything obvious in terms of security breaches. my sys admin is pretty damn on to it, in regard to security. in fact he’s anal retentive, so I’m doubtful it’s a hacker. might be wrong of course, so I’ll get him to take another look.

  3. using Jakarta JMeter to stress test, and I hate it. Currently hunting around for a console-based, scripted test tool; worst case I’ll knock together something dumb myself, along the lines of the sketch below.
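
If I do end up rolling my own, it’d probably be something as crude as this (just a throwaway sketch with made-up URL and numbers, nothing like a real JMeter replacement):

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Dumb console load driver -- a placeholder sketch, not a JMeter replacement.
// The URL, thread count and request count are made up; point it at a test box only.
public class SimpleLoadTest {
    public static void main(String[] args) throws Exception {
        final String target = args.length > 0 ? args[0] : "http://localhost:8080/test";
        final int threads = 10;             // concurrent "users"
        final int requestsPerThread = 100;  // requests each user fires

        for (int t = 0; t < threads; t++) {
            new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < requestsPerThread; i++) {
                        try {
                            long start = System.currentTimeMillis();
                            HttpURLConnection conn =
                                (HttpURLConnection) new URL(target).openConnection();
                            InputStream in = conn.getInputStream();
                            byte[] buf = new byte[4096];
                            while (in.read(buf) != -1) { /* drain the response */ }
                            in.close();
                            System.out.println(conn.getResponseCode() + " in "
                                    + (System.currentTimeMillis() - start) + " ms");
                        } catch (Exception e) {
                            System.out.println("request failed: " + e);
                        }
                    }
                }
            }).start();
        }
    }
}
```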

Probably an obvious suggestion, but do your app logs point to anything out of the ordinary?

When the app goes to 100% CPU, is it still functioning, or does it seem to be caught in a bit of repeating code somewhere?

Does the CPU usage ever go back down? After 5 minutes? After 2 hours?

What sorts of issues do you see on the client side when the server app goes to 100%?

What’s different about your test environment vs. your real-world environment? Of those things, which one of the seemingly small and unrelated things have you not tested against? For example, I once tracked down a printing problem to a change from standard time to daylight savings time - a PRINTING problem of all things. Get as far outside the box as you can.

100% cpu… hm.

There was a way to do that by sending weird paths to the servlet.

Something like http:///…/…/…/…/…//…/…/…/

But that was fixed ages ago. However, it might be a good idea to update everything.

That’s my problem. We’re logging pretty comprehensively, but there’s nothing obvious in the log files.

When the CPU hits 100%, everything still works, but -extremely- slowly. Slowly enough to cause timeouts in the client browsers (mobile phones).

It certainly doesn’t seem to recover after (up to) an hour. But it has never been left longer than that, for obvious reasons.

Differences between the test and production environment…? Well I’ve got no budget to duplicate production unfortunately, so we’ve mimicked it as best we can, on the hardware we’ve got to hand. Operating System wise, they’re both running Fedora and I’m pretty sure the same version. Exactly the same version of JDK and same version of Jetty, Apache & mod_jk (Thought about running with tomcat for a while, but that caused a few minor issues that need to be resolved first). We’re running the latest versions of everything we can.

In terms of some part of the system that isn’t ‘exercised’ other than when the problem occurs – there isn’t one. Everything is pretty much slammed every day.

Well, what is your application solving?

Some of your algorithms might be superpolynomial (2^n, where n is the input size) without you realising it. That means that as the data grows, the server will slowly kill itself. That could be what happened.

Don’t think it has to be a very advanced algorithm that’s superpolynomial. A famous NP-hard problem is subset sum, and it goes like this:

“given a list of numbers, find out whether some subset of them sums to k”
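
Just to show how harmless that can look in code, a straight brute-force check is already 2^n in the length of the list (a made-up illustration, obviously not your code):

```java
// Brute-force subset sum: tries every subset, so it's O(2^n) in the length
// of the list -- fine for 20 numbers, hopeless for 40.
// Made-up illustration only.
public class SubsetSum {
    public static boolean hasSubsetWithSum(int[] numbers, int k) {
        return search(numbers, 0, 0, k);
    }

    private static boolean search(int[] numbers, int index, int sumSoFar, int k) {
        if (sumSoFar == k) {
            return true;
        }
        if (index == numbers.length) {
            return false;
        }
        // Either include numbers[index] in the subset, or skip it.
        return search(numbers, index + 1, sumSoFar + numbers[index], k)
            || search(numbers, index + 1, sumSoFar, k);
    }

    public static void main(String[] args) {
        int[] data = { 3, 34, 4, 12, 5, 2 };
        System.out.println(hasSubsetWithSum(data, 9));  // true (4 + 5)
    }
}
```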

So try to debug with the same data as you have in your server now.

Tried that. Tested with live data a number of times. The problem isn’t easily reproducible. Plus we’ve done various optimisations in recent releases which have seen performance improvements without a commensurate increase in CPU usage. Data isn’t the issue, otherwise restarting the server wouldn’t fix the problem – and it does. When I restart, CPU usage goes back to somewhere between 1% and 8%, depending on the load at the time.

This is why I’m drawing a blank.

re: hacking, I’ve encountered numerous situations where the security is flawless and nothing is compromised, but the weird and wonderful side-effects of e.g. a “crack Apache” script being run against a non-HTTP server exercise some very strange protocol situations that weren’t covered by unit tests.

e.g. one server had a race condition where, if the connection was dropped at exactly the right time while a 1-byte server-side read buffer was full, that thread would get stuck in a while loop. Or another time, I had a server which assumed (reasonably enough) that 32kb was enough data for a simple text messaging protocol…until a hacker tried to send a multi-megabyte Windows rootkit (to a Linux server ;)). The security system worked fine, and nothing would have happened even on Windows, but it uncovered a subtle bug in the “buffer overflow” handling code that our unit tests hadn’t covered.
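
That kind of loop tends to look roughly like this (a simplified sketch, not the actual code): a non-blocking read that never treats -1 as “connection gone”, so the selector keeps reporting the channel as readable and the thread spins at 100% CPU.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Simplified sketch of the general shape of that bug -- not the real code.
// A non-blocking read can return 0 (nothing to read, or no room left in the
// buffer) or -1 (peer closed the connection); a loop that only expects "> 0"
// will happily spin on a dead connection.
public class ReadLoopSketch {

    static void handleReadable(SocketChannel channel, ByteBuffer buffer) throws IOException {
        int n;
        while ((n = channel.read(buffer)) > 0) {
            buffer.flip();
            // ... process the bytes ...
            buffer.clear();
        }
        if (n == -1) {
            // The peer has disconnected: without this, the selector keeps
            // reporting the channel as readable and the thread never blocks.
            channel.close();
        }
        // n == 0 simply means "nothing more right now" -- return to the selector.
    }
}
```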

Anyway, what I would do first of all is grab ethereal and start doing packet sniffs on the incoming data. That should give you enough to do semi-perfect replays of the condition. There’s a lot of things that won’t show up in logfiles (trivial example, not particularly relevant for java: weird packet fragmentation) but will show up in packet sniffs.

Beyond that, if this server is using NIO, email me at ceo @ grexengine.com and I’ll try and help some more.

heh, well it’s hard to know with so little info.

What about bottlenecks?

Do you have a database running?

Number of transactions, number of connections, whatever?

Perhaps it’s something on the server. Transfer the application to another server, or replace the server with a backup.

Hardware problem?

Perhaps you have a piece of code that’s failing, but rather than throwing an exception, or putting something in the log file, it’s simply trying over and over and over, because of the way that code handles errors. Perhaps trying to write something to disk or RAM, and the chip or the drive is starting to fail? Maybe trying to read from an input device (network, hard drive, database)?
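
Something along these lines, purely hypothetical (I obviously have no idea what your error handling actually looks like):

```java
// Hypothetical example of error handling that turns a failing resource into
// a 100% CPU problem: the operation fails, nothing is logged at a level
// anyone reads, and the code just retries in a tight loop.
public class RetryForever {

    interface Storage {
        void write(byte[] data) throws java.io.IOException;
    }

    static void save(Storage storage, byte[] data) {
        boolean written = false;
        while (!written) {
            try {
                storage.write(data);
                written = true;
            } catch (java.io.IOException e) {
                // Swallowed: no log entry, no back-off, no retry limit.
                // If the disk/NIC/database never recovers, this thread
                // spins here forever and the CPU pegs at 100%.
            }
        }
    }
}
```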

Do you have only one server?

If you have multiple servers, is it only one particular server?

here’s a basic rundown of the application:

Apache running on one server connects via mod_jk to two other servers, each running Jetty + our own application server. We have multiple connections (using Jakarta DBCP) to a Postgres database running on another server. 99% of traffic goes to Jetty instance #1. Server #2 is reserved for infrequent events which have a significantly higher load than our typical daily 1 million+ hits. We control access using rewrite rules, so if you ain’t on our IP list, you ain’t getting in (so to speak). #1 is the one with the problem.

That’s a good point regarding h/w failure. Something I hadn’t considered actually (and should have). We’ll have to replace another server shortly, so we can probably roll this problematic one out and use the new one and see if the problem still occurs. Thx kul_th_las,

J