I’ll be speaking on Wednesday night (Sept 30) in the Sun offices in NY and am looking forward to meeting up with a number of people. If you’re around, please do join in. After that I'll be off to JAOO in Aarhus, Denmark, where I’ll be offering my performance seminar on the 4th of Oct (a late addition to the schedule). The all-day tutorial includes a meaty problem to sink your profiler into. Afterwards I'm happy to have people join in for a beer!
When blog-city was in its infancy, Alan Williamson wasn’t using a lot of hardware to support the system (new stuff was in the mail). In fact, it was running on a single PIII 850MHz PC with 512M of RAM and a single disk. Running on this bountiful basket of hardware was BlueDragon (JVM), Alan’s own CFML engine, along with an instance of MySQL. All of this supported a user base of between 40,000 and 50,000 users. While the system was bursting at the seams, it did manage to cope remarkably well. A truly amazing story and a credit to Alan’s talents!
My small part in this story was to lend a hand tuning GC. Alan had just released a new version and there were a few issues with it running out of memory. I happened to make some observations that pretty much confirmed what Alan already suspected: keeping session state in the JVM was detrimental to system stability. Session state was keeping objects alive, yet we couldn’t simply dump it. It had to go, but where?
Garbage collectors have been tuned to eliminate short lived objects very efficiently. The difficulties come when applications start holding onto data for a long time. While the collectors won’t fret over small amounts of data being retained, if that volume starts to climb, then it gets very difficult to tune GC so that your end users don’t start feeling its effect on system performance. The current strategies for managing short and long lived objects are completely orthogonal.
To tell whether your system is retaining a lot of objects, take a look at the GC logs. You will need to turn on -XX:+PrintTenuringDistribution. The output will look something like this.
- age 1: 2151744736 bytes, 2151744736 total
- age 2: 897330448 bytes, 3049075184 total
- age 3: 1274314280 bytes, 4323389464 total
- age 4: 1351603024 bytes, 5674992488 total
- age 5: 1529394376 bytes, 7204386864 total
- age 6: 1219001160 bytes, 8423388024 total
This is an extreme example that I took out of a recent email discussion on tuning GC. It is extreme in that it’s not often that this amount of data survives over time. In all there is about 8G of data in the survivor space, with roughly 0.9G to 2.2G at each age.
To give you an idea of the stress this places on the system, on each young generational GC, about 1.2G of data needs to be copied into a survivor space, 1.2G of data needs to be copied into old space, and about 6G of data needs to be shuffled about in young space. This can only result in one thing: a very long GC pause.
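If you want to do this arithmetic on your own logs, here is a minimal sketch in Java. It assumes the tenuring lines look like the sample above; the class name TenuringReport and its methods are made up for illustration and are not part of any JVM tooling. Whether the oldest age bucket really gets promoted on the next collection depends on the tenuring threshold in effect, so treat that figure as an estimate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: sums up the per-age sizes from tenuring distribution lines.
final class TenuringReport {
    // Matches lines such as "- age 1: 2151744736 bytes, 2151744736 total".
    private static final Pattern AGE_LINE =
            Pattern.compile("age\\s+(\\d+):\\s+(\\d+)\\s+bytes,\\s+(\\d+)\\s+total");

    /** Returns the bytes reported at each age, in the order the lines appear. */
    static List<Long> bytesPerAge(List<String> logLines) {
        List<Long> bytes = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = AGE_LINE.matcher(line);
            if (m.find()) {
                bytes.add(Long.parseLong(m.group(2)));
            }
        }
        return bytes;
    }

    /** Sum over all ages: the live data that has to be copied around in young space. */
    static long totalSurvivorBytes(List<Long> bytesPerAge) {
        long total = 0;
        for (long b : bytesPerAge) {
            total += b;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "- age 1: 2151744736 bytes, 2151744736 total",
                "- age 2: 897330448 bytes, 3049075184 total",
                "- age 3: 1274314280 bytes, 4323389464 total",
                "- age 4: 1351603024 bytes, 5674992488 total",
                "- age 5: 1529394376 bytes, 7204386864 total",
                "- age 6: 1219001160 bytes, 8423388024 total");
        List<Long> perAge = bytesPerAge(log);
        long oldest = perAge.get(perAge.size() - 1);
        System.out.println("bytes held in survivor space: " + totalSurvivorBytes(perAge));
        // Assumption: the oldest bucket has reached the tenuring threshold.
        System.out.println("bytes likely promoted next GC: " + oldest);
    }
}
```

The sum of the per-age figures should match the last "total" column in the log, which makes for a handy sanity check on the parsing.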
The solution is obvious: if copying is a problem, don’t copy! We could tune the system to tenure right away, but then the mass of short lived objects would be prematurely tenured and chew up old space, leaving us with the same problem: a long GC pause or worse, an OutOfMemoryError being thrown.
Getting back to Alan, this was the problem he faced, and it was made worse by the lack of hardware supporting the system. Alan was facing OOMEs caused by session state data. This is not a memory leak but more something I call object loitering. Session data will go away, but in the meantime it acts like a leak.
The solution to Alan’s problem was to save the session state off in a database. He chose to use MySQL. Since each session had a unique key (jsessionid), he knew there was no possibility of a transactional conflict. Consequently he configured MySQL to run without the transactional engine, avoiding that expensive operation.
In implementing this “fix”, Alan was in effect selectively tenuring long lived objects to an alternate memory space. Since this space was outside of the JVM, he turned long lived objects into short lived objects. Another way to selectively tenure is to put these objects into some sort of distributed cache, pulling session objects into the mainline VM only as needed. In either case, moving toward short lived objects plays right into the GC’s strengths.
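To make the idea a bit more concrete, here is a rough sketch in Java of what pushing session state out of the heap might look like. This is not Alan's code. The names ExternalSessionStore, BlobStore and MapBlobStore are invented for illustration, and the map-backed store is only a stand-in so the sketch runs without a database. In a real system the BlobStore would write to something outside the JVM, such as a MySQL table keyed by jsessionid or a distributed cache.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: session state lives in an external store, keyed by session id.
final class ExternalSessionStore {
    /** Stand-in for whatever sits outside the JVM (a database table, a cache). */
    interface BlobStore {
        void put(String key, byte[] value);
        byte[] get(String key);
    }

    /** In-memory stand-in only; a real store would keep the bytes off the Java heap. */
    static final class MapBlobStore implements BlobStore {
        private final Map<String, byte[]> rows = new HashMap<>();
        public void put(String key, byte[] value) { rows.put(key, value); }
        public byte[] get(String key) { return rows.get(key); }
    }

    private final BlobStore store;

    ExternalSessionStore(BlobStore store) {
        this.store = store;
    }

    /** Serializes the state and hands it to the store; the caller can then drop its reference. */
    void save(String sessionId, Serializable state) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(state);
        }
        store.put(sessionId, buffer.toByteArray());
    }

    /** Rebuilds a fresh copy for the duration of a request; returns null for unknown ids. */
    Object load(String sessionId) throws IOException, ClassNotFoundException {
        byte[] bytes = store.get(sessionId);
        if (bytes == null) {
            return null;
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ExternalSessionStore sessions = new ExternalSessionStore(new MapBlobStore());
        HashMap<String, String> state = new HashMap<>();
        state.put("user", "alan");
        sessions.save("abc123", state);
        System.out.println(sessions.load("abc123"));
    }
}
```

The point of the design is that the object graph handed back by load only needs to live for as long as the request that asked for it, so it can die young instead of loitering through collection after collection.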
Would G1 help in this case? I really don’t have an answer to this question; however, I have a guess. The fact that regions with high levels of liveness will not be collected would seem to break the current equation that the cost of a GC is dominated by the number of objects that survive. That said, this positive contribution would seem to be offset by the fact that it’s still generational, which means in this case it would *not* avoid the copying. To avoid copying means that we’d have to accept the premature promotion of long lived objects. While a possible solution might be to provide some facility to hint to the collector that it should always promote this type of object, I’m not sure I want to give developers this option. Maybe a better way would be to extend HotSpot to recognize that these types of objects coming from this section of the code should be promoted or even directly created in old space. But for now, we will need to architect alternate solutions.
tags: jvm mysql garbage collection gc g1 tuning performance blue dragon