Train collector on 1.4.2 running way too often

Not really game related, but it is about a soft real-time server, so I thought this is the best forum to ask for possible solutions (best brains here ;))

We have a server which is started every morning and shut down in the evening. By evening it is using up to 800MB of live memory. Memory usage starts at 100MB and grows constantly during the day (it keeps a cache of everything that has happened that day). At the same time, the server creates an incredible amount of short-lived garbage (we are in the process of writing a new version, but for now we have to support the old one, and fixing this is a bit too dangerous). By incredible I mean 1GB every few seconds in some cases. The server is an 8-CPU Solaris SPARC machine (unfortunately shared with some other processes and a database, but ours is the biggest).
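To make the workload concrete, here is a minimal, hypothetical harness with roughly the same shape - a slowly growing live cache plus a flood of short-lived garbage. Every name and size in it is made up, but running something like it under the flags discussed below is a cheap way to reproduce collector behaviour:

// Hypothetical sketch of the allocation pattern, not the real server.
import java.util.ArrayList;
import java.util.List;

public class ChurnSketch {
    static int sink; // keeps the allocations from looking dead to the JIT

    public static void main(String[] args) throws InterruptedException {
        List cache = new ArrayList();           // grows all day, like our event cache
        while (true) {
            cache.add(new byte[64 * 1024]);     // a little more live data every second
            for (int i = 0; i < 16384; i++) {   // 16384 x 64KB = ~1GB of garbage
                byte[] junk = new byte[64 * 1024];
                sink += junk.length;
            }
            Thread.sleep(1000);
        }
    }
}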
The normal GC was performing quite well - finding almost all garbage in every run and taking tens of ms per run. Even in the hard cases of 1GB of trash per 5s, it was running once per second for 50ms, which was absolutely acceptable. Unfortunately, every half an hour or so it ran a full GC, which stopped the world for 3-10 seconds (depending on the time of day). 3s is already very long, but 10s was actually causing a lot of trouble in the systems around the server (we are soft real-time in the range of single seconds - any request may be processed with a delay of 1-4 seconds, but we need to process hundreds of them at once and none can be delayed more than that).
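For reference, all the logs quoted in this post are -verbose:gc output; a launch line along these lines matches the setup described so far (the class name is made up and the exact sizes are illustrative):

java -server -Xms100m -Xmx1280m -verbose:gc com.example.Server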

The first fix was to increase the initial heap to 1.2GB. This reduced the number of full GC runs considerably (to be honest, we can probably avoid running one at all on some days). Unfortunately, when one does happen (late in the day), it still takes 10s+. To avoid that, we decided to use -Xincgc (this is Java 1.4.2 and we don't fully trust the CMS collector there). We tested -Xincgc on a smaller process, where it worked perfectly, so we put it into the main server.
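Concretely, that just meant adding -Xincgc to the (illustrative) launch line above, keeping the bigger initial heap from the first fix:

java -server -Xms1200m -Xmx1280m -Xincgc -verbose:gc com.example.Server

…and this is what we got: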


7133.041: [GC 523380K->521678K(1310656K), 0.0370789 secs]
7133.082: [Inc GC 521678K->523001K(1310656K), 0.0235397 secs]
7133.128: [GC 523513K->521715K(1310656K), 0.0303082 secs]
7133.162: [Inc GC 521715K->523013K(1310656K), 0.0201618 secs]
7133.206: [GC 523525K->521576K(1310656K), 0.0278879 secs]
7133.237: [Inc GC 521576K->522875K(1310656K), 0.0228183 secs]
7133.284: [GC 523387K->521589K(1310656K), 0.0317932 secs]
7133.319: [Inc GC 521589K->522887K(1310656K), 0.0189042 secs]
7133.368: [GC 523399K->521383K(1310656K), 0.0288767 secs]
7133.400: [Inc GC 521383K->522879K(1310656K), 0.0204636 secs]
7133.450: [GC 523583K->521797K(1310656K), 0.0320104 secs]
7133.486: [Inc GC 521797K->523123K(1310656K), 0.0210482 secs]
7133.534: [GC 523635K->521801K(1310656K), 0.0280706 secs]
7133.565: [Inc GC 521801K->522899K(1310656K), 0.0200594 secs]
7133.609: [GC 523411K->521640K(1310656K), 0.0280528 secs]
7133.640: [Inc GC 521640K->522923K(1310656K), 0.0188250 secs]
7133.688: [GC 523435K->521740K(1310656K), 0.0305865 secs]
7133.721: [Inc GC 521740K->523011K(1310656K), 0.0188402 secs]
7133.769: [GC 523907K->521919K(1310656K), 0.0294120 secs]
7133.801: [Inc GC 521919K->523205K(1310656K), 0.0199029 secs]
7133.848: [GC 523909K->521969K(1310656K), 0.0321136 secs]
7133.883: [Inc GC 521969K->523080K(1310656K), 0.0217210 secs]
7133.937: [GC 523784K->521962K(1310656K), 0.0329194 secs]
7133.973: [Inc GC 521962K->523242K(1310656K), 0.0207843 secs]
7134.020: [GC 523946K->521936K(1310656K), 0.0292769 secs]
7134.053: [Inc GC 521936K->523082K(1310656K), 0.0197298 secs]
7134.101: [GC 523786K->521770K(1310656K), 0.0292355 secs]
7134.133: [Inc GC 521770K->523056K(1310656K), 0.0199293 secs]
7134.180: [GC 523760K->521928K(1310656K), 0.0287748 secs]
7134.212: [Inc GC 521928K->523233K(1310656K), 0.1615441 secs]
7134.400: [GC 523745K->521713K(1310656K), 0.0267714 secs]
7134.429: [Inc GC 521713K->523037K(1310656K), 0.0211824 secs]
7134.472: [GC 523549K->521740K(1310656K), 0.0287111 secs]
7134.504: [Inc GC 521740K->523056K(1310656K), 0.0225123 secs]
7134.556: [GC 523568K->521765K(1310656K), 0.0300917 secs]
7134.589: [Inc GC 521765K->523065K(1310656K), 0.0203878 secs]
7134.638: [GC 523769K->521763K(1310656K), 0.0292120 secs]
7134.670: [Inc GC 521763K->523046K(1310656K), 0.0203244 secs]
7134.717: [GC 523750K->521779K(1310656K), 0.0263912 secs]
7134.747: [Inc GC 521779K->523103K(1310656K), 0.0175217 secs]
7134.787: [GC 523807K->522035K(1310656K), 0.0277603 secs]
7134.818: [Inc GC 522035K->523318K(1310656K), 0.0184335 secs]
7134.863: [GC 523822K->522042K(1310656K), 0.0325576 secs]
7134.899: [Inc GC 522042K->523171K(1310656K), 0.0235218 secs]
7134.950: [GC 523875K->522091K(1310656K), 0.0289250 secs]
7134.982: [Inc GC 522091K->523416K(1310656K), 0.0211100 secs]
7135.030: [GC 523928K->522149K(1310656K), 0.0289845 secs]
7135.062: [Inc GC 522149K->523253K(1310656K), 0.0224358 secs] 

As you can see, the GC seems to run in an almost infinite loop. There is still plenty of memory to grow into, but it just keeps spinning. I thought for a moment that some kind of magic was happening near the half-way point of free memory, but the same thing happens even at startup:


8.016: [GC 8244K->6237K(1310656K), 0.0045076 secs]
8.021: [Inc GC 6237K->7602K(1310656K), 0.0051462 secs]
8.041: [GC 8285K->6265K(1310656K), 0.0038853 secs]
8.045: [Inc GC 6265K->7801K(1310656K), 0.0057120 secs]
8.063: [GC 8313K->6285K(1310656K), 0.0034518 secs]
8.067: [Inc GC 6285K->7821K(1310656K), 0.0045019 secs]
8.083: [GC 8333K->6306K(1310656K), 0.0038941 secs]
8.087: [Inc GC 6306K->7842K(1310656K), 0.0053213 secs]
8.105: [GC 8354K->6326K(1310656K), 0.0036972 secs]
8.108: [Inc GC 6326K->7862K(1310656K), 0.0052712 secs]
8.129: [GC 8374K->6347K(1310656K), 0.0035272 secs]
8.132: [Inc GC 6347K->7883K(1310656K), 0.0046338 secs]
8.149: [GC 8395K->6368K(1310656K), 0.0035588 secs]
8.153: [Inc GC 6368K->7904K(1310656K), 0.0045717 secs]
8.169: [GC 8416K->6388K(1310656K), 0.0036467 secs]
8.173: [Inc GC 6388K->7924K(1310656K), 0.0055688 secs]
8.191: [GC 8436K->6409K(1310656K), 0.0054081 secs]
8.196: [Inc GC 6409K->7945K(1310656K), 0.0051368 secs]
8.214: [GC 8457K->6430K(1310656K), 0.0036832 secs]
8.217: [Inc GC 6430K->7795K(1310656K), 0.0044594 secs]
8.238: [GC 8478K->6457K(1310656K), 0.0036165 secs]
8.241: [Inc GC 6457K->7481K(1310656K), 0.0049430 secs]
8.270: [GC 8504K->6498K(1310656K), 0.0038750 secs]
8.274: [Inc GC 6498K->6498K(1310656K), 0.0051421 secs]
8.327: [GC 8546K->6580K(1310656K), 0.0049546 secs]
8.332: [Inc GC 6580K->6580K(1310656K), 0.0055412 secs]
8.383: [GC 8628K->6663K(1310656K), 0.0046790 secs]
8.434: [GC 8711K->6745K(1310656K), 0.0052698 secs]
8.439: [Inc GC 6745K->6745K(1310656K), 0.0054981 secs]
8.502: [GC 8793K->6828K(1310656K), 0.0047852 secs]
8.507: [Inc GC 6828K->7851K(1310656K), 0.0049552 secs]
8.539: [GC 8875K->6869K(1310656K), 0.0060865 secs]
8.546: [Inc GC 6869K->8234K(1310656K), 0.0054154 secs]
8.570: [GC 8917K->6896K(1310656K), 0.0043734 secs]
8.574: [Inc GC 6896K->8432K(1310656K), 0.0052009 secs]
8.593: [GC 8944K->6917K(1310656K), 0.0033186 secs]
8.597: [Inc GC 6917K->8453K(1310656K), 0.0062499 secs]
8.617: [GC 8965K->6937K(1310656K), 0.0038922 secs]
8.621: [Inc GC 6937K->8474K(1310656K), 0.0049791 secs]
8.640: [GC 8986K->6959K(1310656K), 0.0035988 secs]
8.644: [Inc GC 6959K->8495K(1310656K), 0.0058623 secs]
8.665: [GC 9007K->6980K(1310656K), 0.0037146 secs]
8.669: [Inc GC 6980K->8515K(1310656K), 0.0048813 secs]
8.689: [GC 9027K->7000K(1310656K), 0.0035126 secs]
8.692: [Inc GC 7000K->8536K(1310656K), 0.0046235 secs]
8.712: [GC 9048K->7021K(1310656K), 0.0035688 secs]
8.715: [Inc GC 7021K->8557K(1310656K), 0.0046934 secs]
8.735: [GC 9069K->7042K(1310656K), 0.0037387 secs]
8.739: [Inc GC 7042K->8577K(1310656K), 0.0049509 secs]
8.759: [GC 9089K->7062K(1310656K), 0.0040764 secs]
8.763: [Inc GC 7062K->8427K(1310656K), 0.0047845 secs]
8.787: [GC 9110K->7089K(1310656K), 0.0036977 secs]
8.791: [Inc GC 7089K->8113K(1310656K), 0.0050502 secs]
8.824: [GC 9137K->7130K(1310656K), 0.0040338 secs]
8.828: [Inc GC 7130K->7130K(1310656K), 0.0048565 secs]
8.888: [GC 9178K->7214K(1310656K), 0.0050150 secs]
8.893: [Inc GC 7214K->7213K(1310656K), 0.0047898 secs]
8.953: [GC 9261K->7296K(1310656K), 0.0052650 secs]
9.011: [GC 9344K->7379K(1310656K), 0.0055470 secs]
9.017: [Inc GC 7379K->7378K(1310656K), 0.0053565 secs] 

With an 8-CPU machine, these delays are really killing us. We fell back to the normal GC with a big initial heap and hope that we can manage to avoid a full GC during the day - but I'm still curious: what is wrong with -Xincgc?

Just to show you a similar run with the normal GC:


2902.556: [GC 819818K->409387K(1298048K), 0.1768601 secs]
2942.078: [GC 820779K->409940K(1298048K), 0.1774365 secs]
2978.235: [GC 821332K->411218K(1298048K), 0.1793135 secs]
3021.251: [GC 822610K->413365K(1298048K), 0.2061550 secs]
3060.037: [GC 824757K->414599K(1298048K), 0.1884546 secs]
3098.130: [GC 825991K->416048K(1298048K), 0.1852688 secs]
3139.267: [GC 827440K->417545K(1298048K), 0.1656779 secs]
3174.744: [GC 828937K->418040K(1298048K), 0.1843382 secs]
3212.421: [GC 829432K->418667K(1298048K), 0.1700896 secs]
3251.689: [GC 830059K->419690K(1298048K), 0.1686255 secs]
3292.449: [GC 831082K->420551K(1298048K), 0.1926219 secs]
3336.919: [GC 831943K->421289K(1298048K), 0.1883368 secs]
3376.109: [GC 832681K->422074K(1298048K), 0.2008794 secs]
3421.384: [GC 833466K->422708K(1298048K), 0.1797199 secs]
3460.110: [GC 834100K->423588K(1298048K), 0.1664174 secs]

Hey look at this:


7133.041: [GC 523380K->521678K(1310656K), 0.0370789 secs]
7133.082: [Inc GC 521678K->523001K(1310656K), 0.0235397 secs]

The first GC frees about 1.7MB of heap. The incremental GC moments later LOSES about 1.3MB of it again! And it just repeats ad infinitum. Looks like you have found a bug. I can't imagine any circumstances in which a GC causes the heap to fill up…

Cas :slight_smile:

Well, if the incremental GC ran concurrently it could be explained - it wasn't able to collect anything (because no garbage was found in the old generation), and background threads that were already running managed to allocate that much garbage in the meantime. The only problem is that -Xincgc is not supposed to run in parallel with the mutator threads, AFAIK… (I'm talking about 1.4.2, which means -Xincgc == the train collector.)

Exactly. -Xincgc on 1.4.2 is a stop-the-world collector - while it's collecting, no new() can take place. Or so I thought.

Cas :slight_smile:

I found the CMS collector worked well for me on 1.4.2. I have a video streaming application that benefitted from it. I was pleased with the results, and I didn't do any tuning other than switching to the CMS collector. It might be worth doing some long-term tests with it so you can build up some trust :slight_smile:
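For anyone who wants to try it, the switch is a single flag on 1.4.2; the launch line below is illustrative, just reusing the sizes from earlier in the thread:

java -server -Xms1200m -Xmx1280m -XX:+UseConcMarkSweepGC -verbose:gc com.example.Server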