When we do performance work in the Linux labs, we usually have an interesting mix of systems, some newer than new (also known as engineering models), some customer brand-new, some fairly ancient. For the systems we use directly, we have a good feel for the general ongoing performance characteristics of that system. We regularly run performance regressions of standard workloads watching for changes in the Linux software stack. Since we do this on a regular basis, any performance blip (sometimes an unexpected software regression, but usually a nice Linux improvement which is working its way through the process) is seen and we can go look at it. The side effect of this is we also can catch the rare occasion when something has happened on the hardware side.
One of the interesting aspects of the POWER line is the continued strong focus and evolution of the RAS characteristics of the systems. "RAS" can often get lost in the marketing collateral of new systems, and is usually ignored by the programmer community, but it really is pretty important to customers. For us, what this means on a practical level is that on rare occasions, the POWER systems will automagically detect possibly failing components and will disable those components if and when necessary, logging the errors.
- For those who are constantly searching for more materials to read, a comprehensive description of the latest POWER RAS characteristics is available at http://www-03.ibm.com/systems/p/hardware/whitepapers/ras.html.
- The Appendix A of the White Paper lists the Linux support provided by the SUSE and Red Hat Linux Distributions. A lot of people in the Linux teams around the world have worked on RAS support and have tested this thoroughly. The breadth of the coverage is pretty impressive and continues to be improved.
- This coverage is key because it nicely highlights the "enterprise readiness" of Linux for customers.
Hardware failure examples are rare. So it's something we usually discover after spending way too much time doing software analysis, when we should have checked the service/error logs first. Eventually we get the right performance data which highlights the problem, and THEN we go check the service logs, which confirms what we determined. After we check the service log, we usually look at each other and state the obvious: "Hey, YOU should've checked the service log".
An example over the last year was with an older Power 5 system which was being used by a software vendor to do software performance testing. Unbeknownst to us, the system had cleverly detected that one of the L3 caches was failing, logged it appropriately, and then disabled the failing L3 cache so that the system could continue operating. The performance aspects were REALLY strange. We were running a multi-process multi-threaded workload that max'ed out the whole system, but the results were not balanced which made it look like a scheduler bug.
It was a long story, but eventually we checked the L3 cache sizes and confirmed that one of the L3 caches was "missing". We should've seen the following.... which shows the two 36MB L3-caches on this system...
cd /proc/device-tree/cpus/
find . -wholename "./l3-cache*/i-cache-size" -exec hexdump '{}' \;
0000000 0240 0000
0000004
0000000 0240 0000
0000004
but instead we saw something like the following (this being from memory)... the L3 cache size of one of the caches was zero.
cd /proc/device-tree/cpus/
find . -wholename "./l3-cache*/i-cache-size" -exec hexdump '{}' \;
0000000 0240 0000
0000004
0000000 0000 0000
0000004
The system config confirmed the behavior we were seeing. Once we got the piece fixed, the strange results were taken care of.
Anyway, circling back to Mike's servicelog post, that looks like an easy piece of software to try out. I popped it onto one of our SLES 10 sp1 servers in the lab, but it needs a Berkeley database library to link in - the website blithely says copy libdb-4.2.a into the source directory - and I don't have time this week to hunt that part down. I've got the .so file, and there's probably some easy way to bridge the two, but I'll try it out early in the new year. If the serviceable event process works as advertised, this would be nice to pop on our systems, and use it to highlight serviceable events on the system. Even better, what would be really cool is for it to send me a polite email that informs me about the error, but I suspect this needs some wrapper systems management software.
The key message for emerging performance analysts is to make sure the hardware is doing what it's supposed to be doing. A related topic in the future will cover the occasional disconnects between what someone tells you was ordered for the hardware you're testing on, and what is really in the hardware system you're testing on. Same story... you need to be sure you know what you're really working with.