Thursday, October 9, 2008

SystemTap SIGUSR2 tracing

A story of how people can get all excited about "performance issues". And how simple and easy tools can help figure out the root cause.

I was involved recently in an issue with Java performance where a system had "flipped out" and gone to 100% CPU busy across all of the processor cores after a software upgrade. Numerous technical people got involved in a flurry of activity, and of course performance teams were involved because systems tend to perform poorly when the CPUs are busy 100% of the time.

Various people dug into things and determined that something was generating way-too-many SIGUSR2 signals to Java, which apparently was driving the Java engine into constant garbage collection. Many many many Java threads running. It was a sight to behold.

Naturally, everyone involved claimed that *they* were not responsible for the spurious SIGUSR2 signals. So clearly this was a hardware issue. As hardware engineers were being brought in, some peers in the SystemTap team quietly suggested that we use the SystemTap tool to help figure out who was sending the signals. Ahh, a note of reason.

Turns out there's an off-the-shelf SystemTap script available which snags the Signal.send entry point, and is already instrumented to track who's triggering the signal, and then count up the number of times each source triggers a signal.

The script showed who was triggerig the signals, which was exactly the ah-hah!" clue the technical teams wanted and needed. Shortly there-after a fix was found and applied, and the 100% CPU busy condition was solved. With the CPUs no longer 100% busy for no particular good reason, our performance work was done. If only these were all this easy.

We documented the SystemTap steps over on a wiki page in DeveloperWorks.
My only wish out of this exercise was that the SystemTap community had even more "off-the-shelf" scripts which were certified as "safe to use" on customer real-life systems. I have to admit I get a little leary of popping scripts like this on a customer system - so it's something we carefully tested first on a local system. A quick matrix of scripts which are tested and safe on Distro versions / platform combinations would be very nice to have.

2 comments:

Anonymous said...

Awww, aren't you going to tell us who the culprit was?!

Bill Buros said...

At one point I heard that a simple "fix" was applied to the Java release to prevent the runaway signals and there were several temporary work-arounds that also helped. I'm not sure what the real underlying problem turned out to be once it left our realm of performance considerations - and I've developed this habit of laying low when convenient..

One of many bloggers.

Bill Buros

Bill leads an IBM Linux performance team in Austin Tx (the only place really to live in Texas). The team is focused on IBM's Power offerings (old, new, and future) working with IBM's Linux Technology Center (the LTC). While the focus is primarily on Power systems, the team also analyzes and improves overall Linux performance for IBM's xSeries products (both Intel and AMD) , driving performance improvements which are both common for Linux and occasionally unique to the hardware offerings.

Performance analysis techniques, tools, and approaches are nicely common across Linux. Having worked for years in performance, there are still daily reminders of how much there is to learn in this space, so in many ways this blog is simply another vehicle in the continuing journey to becoming a more experienced "performance professional". One of several journeys in life.

The Usual Notice

The postings on this site are my own and don't necessarily represent IBM's positions, strategies, or opinions, try as I might to influence them.