Improving performance on Linux (mostly Power)

Various thoughts on the process of improving performance on a Linux system - in a mode of discovering just how much there is to learn. Customers use their systems uniquely - some care passionately about performance, some just want and expect the best "out-of-the-box" experience with no tweaking. I have observed that people in search of performance answers generally want the simple answer, but the practiced answer to any real performance question is: "Well, it depends..." - Bill Buros

Tuesday, February 12, 2008

RHEL5.1 exploits POWER6

Whew. The months tick by quickly these days. Back in October 2007, Red Hat released a service update which officially supported and exploited the IBM POWER6 systems. The Red Hat exploitation code had been worked on by LTC programmers from around the world - providing a Linux OS version that handily and easily provides very nice performance for enterprise customers in production mode.

The performance teams submitted a series of SPEC.org (see the Appendix in the white paper link below for the list) and single-system Linpack publishes at the time, which have since been reviewed, approved, and published. One of the things we really like to focus on is the "whole software stack" when performance is looked at. So a short white paper was written up to explain the various software pieces and how they were tuned. To me this is far more interesting than the bottom-line metrics and the various sparring leadership claims that bounce around the industry. 'Course, it's always fun to have those leadership claims while they last.

The paper - an interesting mix of marketing introductions and then some technical details - really is intended to focus on the basics of leveraging the platform specific compilers (IBM's XL C/C++ and Fortran compilers), platform specific tuned math libraries (ESSL), a Linux open source project called libhugetlbfs, MicroQuill's clever SmartHeap product, FDPR-pro - which I've posted on earlier, IBM's Java, all running examples on the IBM System p 570 system.

Following these activities and working with various customers and interested programmers, there's some interesting work going on with using the oprofile analysis tools on Linux - especially as a non-root user - and working on understanding edge cases where the performance of a workload isn't purely predictable and deterministic. What's interesting is the possible performance variability of a workload when memory being used by a program isn't regularly placed local to each executing core (a classic NUMA-type issue), especially in the case of varying page sizes supported by the Linux operating systems.

I'll post some examples of using oprofile in different ways this month, and also an example of a Fortran program which can show varying run-times and how to address that.

Friday, February 1, 2008

Winning with Linux on new Power systems

IBM just announced two new POWER6 systems, the IBM System p 520 and IBM System p 550. In essence, p 520 has up to 4 cores of a 4.2GHz POWER6 processor and up to 64GB memory, while the p 550 has up to 8 cores and up to 256GB memory. The two systems are sweet little systems, and I recommend checking them out.

Elisabeth Stahl has a good summary blog post on the leadership publishes submitted earlier this week for these new systems. I point to her blog since she nicely has all of the requisite disclaimers and pointers to the data points submitted to SPEC.org and with SAP this week. It takes a couple of weeks for the various review processes to complete, so it'll be easier to comment on these once they've been reviewed, approved, and published. I have found that Linux programmers like to see the published files and walk through the specific details. Show me.

The cool part is Linux on Power continues to be a parity player in the leadership performance metrics for POWER6 customers (in this case using examples for Linpack, SPEC CPU2006, SPECjbb2005, and SAP). There are summaries of some of the submitted bottom-line numbers for AIX and Linux for the p 520 and the p 550 on IBM's website.

There's a short paper which should be out soon that discusses the simple steps and software products that can be used on Linux on Power to achieve the best performance for POWER6 using Linux. It's based on similar publishes done last October and uses actual results that are published on the SPEC.org website.

Monday, January 21, 2008

Workloads: Standards? Customer? Proxy?

So what's the "Best Workload" to measure performance with? Industry standard workloads? Real-life customer applications? Proxy workloads developed to mimic customer applications? Micro-benchmarks? Marketing inspired metrics?

Well, it depends. Each has a purpose. Each has its drawbacks.

Let's review.

Industry standard workloads, typified by the wide breadth of SPEC.org workloads, are benchmarks painstakingly agreed-to by industry committees. The workloads are specifically intended for cross-industry comparisons of systems and software stacks, which can make the committees an "interesting" position to be on. I'm not sure I would have the patience and tact needed to function effectively in that realm.

In the Linux community, I have occasionally found some resistance to the world of industry standard workloads, with quick examples of things like SPECcpu2006, SPECjbb2005, SPECweb2005, and many others. Trying not to over-generalize, but sometimes the community views unique performance improvements targeted at these workloads as "benchmark specials". Ok, sometimes I have seen some members of the community get more than a little excited about a change being proposed which were presented as "we gotta have this for xxx benchmark".

I find this attitude frustrating at times, since the workloads usually were developed with the specific intent of addressing how customers use their systems. But note to self, I've discovered that when I can find a real-life customer example of the same workload, the proposed change (fix or improvement) is enthusiastically embraced. With that small change in how a problem is proposed, I've been very impressed with how well the broader Linux community wants to improve performance for real customers.

(by the way, quick aside, some industry standard benchmarks do not pan out because really clever "benchmark specials" DO emerge, and the comparisons then can lose any customer relevant importance)

Real-life customer applications. So why not test with "real" applications more often? While some customers do have specialized example applications that they can share with vendors, usually the customer application represents the "jewels" of their business, so it's not at all surprising that they have no interest in sharing. So some time and energy is spent working to understand the workload characteristics, which can then be leveraged in identifying proxy workloads. Most performance teams have a good set of sample workloads which can serve as examples to look at improvements and bottlenecks.

On occasion, a team can run into a customer application which is unique and not nicely covered by their standard suite. Valerie Henson ran into this a while back, and created a nice little program called eBizzy, which eventually ended up on SourceForge with a third release just made available in January 2008. Numerous performance problems have been uncovered with programs like this, with fixes flowing back into the mainline development trees for Linux and the software stacks. There are many other examples floating around the web.

Proxy Workloads. Creating workloads to mimic customer applications is what I loosely call proxy workloads. These are handy when they are available as source code and unencumbered with various specialized licenses. The workloads can be modified to address what the performance team is focusing on, and more importantly, they're available for the programmers in the community who are trying to reproduce the problem and prototype the fix. Using proxy workloads is a good example of where the combination of customer, performance teams, and programmers can effectively address performance issues with application programs..

Micro-benchmarks. Some people use very small, simplistic, programs for performance assessments. These targeted programs are often packaged together and are fairly quick and easy to execute and get simplistic answers from. These are practically more useful for digging into specific areas of a system implementation, but when used to generalize an overall system's performance, can cause headaches and mis-understandings.

I have seen several over-zealous system admins at some companies take a simple single-thread single-process integer micro-benchmark, run it on his new 1U x86 server with 1GB memory, and then run it on a new very large system (anyone's, this is not vendor exclusive), and make the obvious conclusion that new little 1U server is the new preferred server system for massively complex workloads planned for the datacenter. Somehow they completely miss the point that only one thread is running on the larger system with many cores and lots of memory, thus 98% idle. It'd be funny if these people were just sitting in labs, but it's amazing how an authoritatively written email inside a company can create so much extra work in refuting and explaining why the person who wrote it, perhaps, just perhaps, was an idiot.

That said, micro-benchmarks do play a critical role with performance teams, simply because they do in fact highlight particular aspects of a system, which allows some very focused testing and improvement activity. Improvements have to be made in a bigger context though, since you may tweak one part of the system, only to hurt another part.

Marketing inspired metrics. Finally, I'll close with a quick discussion on what I term the "marketing inspired metrics". While some companies in the marketplace are really good at this, I have to chuckle and admit I'm not convinced this is one of IBM's core strengths, or for that matter, Linux.

The ability to make up new performance metrics, highlighting vague workloads in the absence of any concrete comparable system criteria, and formulate an entire marketing campaign around new "performance terminology" is an art form. I shudder when marketing teams get that gleam in their eye and start talking about performance. Uh-oh.

It's so much safer when we tell them "This is Leadership", or they simply task us with "Make it Leadership". Leadership of course is always how you phrase it, and what you choose to compare to, but at least that's usually a facts-based discussion.

Focus on repeat'able data. So performance teams focus on real data, which is supported by analysis and assessments of the data, watching for and correcting the common mistakes in gathering data from a system. In Linux performance teams I've been involved with, there is a fanatical dedication to automating the process of gathering and scrubbing the data, building in safe-guards to catch the common mistakes, so that when work is done on a workload, be it industry standard, a customer application, a proxy workload, or a micro-benchmark, we're focusing on finding and fixing the problems, not on the steps of gathering the data.

And I always try to avoid those marketing inspired metrics, that's for the marketing teams.

Thursday, January 17, 2008

Performance Plus

“Higher system performance—the traditional hallmark of UNIX evolution—is still critical, but no longer sufficient. New systems must deliver higher levels of performance plus availability plus efficiency. We call this Performance plus.” —Ross A. Mauri, general manager, IBM System p

While more AIX-centric than I would've hoped for from an article for System p, I did see an interesting article from an IBM Systems Magazine which touches on some of the challenges emerging for performance teams in today's marketplace. The article focuses on application availability and energy efficiencies as appropriate parallel focus items in addition to performance metrics and benchmarks. No surprise there, but it is interesting to see IBM's senior executives emerging with a new term - "Performance Plus" - which generally means we'll be living that as a mantra within a month or two.

The challenge comes in emerging with new metrics to numerically quantify the balancing act of system/application availability, energy usage (across cooling, power draw, and peak energy demands), increasingly virtualized servers, and the classic how fast does my application run?

If we could figure out how to make "metrics" a series of coding requests in open-source projects, we could get things going across various mailing lists and get more people cranking out code, ideas, and brand-new metrics. In the meantime, I guess we'll start getting more creative with our Linux brethren world-wide who are working on exactly all of these issues in real-life scenarios. The balancing act is already in progress, the metrics will need to emerge over time.

Friday, January 11, 2008

Executable magic - FDPR-pro

One of the products we use quite frequently is the FDPR-pro product available from IBM . It's one of those "magic" products which allow you to easily optimize a program after it has been compiled and linked.

The program, Feedback Directed Program Restructuring, is available for Linux on Power programs from IBM's alphaWorks web site. Last week the Haifa development team updated the download'able version with a new version - Version 5.4.0.17 - which is what caught my eye.

The official name is Post-Link Optimization for Linux on POWER, which makes for an acronym that has never really caught on - PLOLOP - and various other strange permutations. I seem to recall that the FDPR-pro name had a name conflict somewhere in the world, but that's the name I've been familiar with for years.

The post-link optimization has been used with database engine products, big ISV products, CPU intensive benchmarks and customer workloads. Borrowing shamelessly from the alphaWorks web site, this product is a performance tuning utility which can improve the execution time of user-level application programs. It does this with a multi-step process of running the program in a commonly used mode, where FDPR gathers run-time profiles and uses those profiles to modify/optimize the executable image.

The clever part is that FDPR can be made aware of cache line sizes and specific system characteristics and optimize (if desired) for execution on a particular set of systems. The daunting aspect of FDPR is the number of possible optimizations. As happens with compilers, there are lots of optimizations and areas to focus on.

Using FDPR can become a natural part of the build process of an application. Build, link, then run the extra steps needed by FDPR. FDPR is a supported product and re-structured executables can be supported in normal business commercial environments. Sometimes the performance gains are minimal, and in many cases we'll see performance gains in the 10% to 12% range. More is possible, but it really depends on the application. In rare cases the changes will cause an executable to core-dump, but we've found the team in Haifa extremely helpful and they usually are able to fix problems quickly.

A product to help with the daunting number of FDPR options is another alphaWorks product called ESTO - Expert System for Tuning Optimizations. ESTO automates the process of identifying options and tuning with FDPR for Linux on Power. As explained on the ESTO project page, the approach leverages "iterative compilation" and uses an "adaptive genetic algorithm with exponential population reduction". I have a feeling that the team that built the product description had fun listing the technologies and the approaches used.

In the near future, we're going to get some examples posted up here and on some LoP wikis.

Wednesday, December 12, 2007

servicelog and hardware performance

Mike Strosaker put up a interesting blog over the weekend covering Servicelog which I caught on Planet-LTC. As I was reading it, it reminded me of just how critically important easy service'ability is for performance work.

When we do performance work in the Linux labs, we usually have an interesting mix of systems, some newer than new (also known as engineering models), some customer brand-new, some fairly ancient. For the systems we use directly, we have a good feel for the general ongoing performance characteristics of that system. We regularly run performance regressions of standard workloads watching for changes in the Linux software stack. Since we do this on a regular basis, any performance blip (sometimes an unexpected software regression, but usually a nice Linux improvement which is working its way through the process) is seen and we can go look at it. The side effect of this is we also can catch the rare occasion when something has happened on the hardware side.

One of the interesting aspects of the POWER line is the continued strong focus and evolution of the RAS characteristics of the systems. "RAS" can often get lost in the marketing collateral of new systems, and is usually ignored by the programmer community, but it really is pretty important to customers. For us, what this means on a practical level is that on rare occasions, the POWER systems will automagically detect possibly failing components and will disable those components if and when necessary, logging the errors.

For those who are constantly searching for more materials to read, a comprehensive description of the latest POWER RAS characteristics is available at http://www-03.ibm.com/systems/p/hardware/whitepapers/ras.html.
The Appendix A of the White Paper lists the Linux support provided by the SUSE and Red Hat Linux Distributions. A lot of people in the Linux teams around the world have worked on RAS support and have tested this thoroughly. The breadth of the coverage is pretty impressive and continues to be improved.
This coverage is key because it nicely highlights the "enterprise readiness" of Linux for customers.

So why is this important to performance?

Hardware failure examples are rare. So it's something we usually discover after spending way too much time doing software analysis, when we should have checked the service/error logs first. Eventually we get the right performance data which highlights the problem, and THEN we go check the service logs, which confirms what we determined. After we check the service log, we usually look at each other and state the obvious: "Hey, YOU should've checked the service log".

An example over the last year was with an older Power 5 system which was being used by a software vendor to do software performance testing. Unbeknownst to us, the system had cleverly detected that one of the L3 caches was failing, logged it appropriately, and then disabled the failing L3 cache so that the system could continue operating. The performance aspects were REALLY strange. We were running a multi-process multi-threaded workload that max'ed out the whole system, but the results were not balanced which made it look like a scheduler bug.

It was a long story, but eventually we checked the L3 cache sizes and confirmed that one of the L3 caches was "missing". We should've seen the following.... which shows the two 36MB L3-caches on this system...

cd /proc/device-tree/cpus/ find . -wholename "./l3-cache*/i-cache-size" -exec hexdump '{}' \; 0000000 0240 0000 0000004 0000000 0240 0000 0000004

but instead we saw something like the following (this being from memory)... the L3 cache size of one of the caches was zero.

cd /proc/device-tree/cpus/ find . -wholename "./l3-cache*/i-cache-size" -exec hexdump '{}' \; 0000000 0240 0000 0000004 0000000 0000 0000 0000004

The system config confirmed the behavior we were seeing. Once we got the piece fixed, the strange results were taken care of.

Anyway, circling back to Mike's servicelog post, that looks like an easy piece of software to try out. I popped it onto one of our SLES 10 sp1 servers in the lab, but it needs a Berkeley database library to link in - the website blithely says copy libdb-4.2.a into the source directory - and I don't have time this week to hunt that part down. I've got the .so file, and there's probably some easy way to bridge the two, but I'll try it out early in the new year. If the serviceable event process works as advertised, this would be nice to pop on our systems, and use it to highlight serviceable events on the system. Even better, what would be really cool is for it to send me a polite email that informs me about the error, but I suspect this needs some wrapper systems management software.

The key message for emerging performance analysts is to make sure the hardware is doing what it's supposed to be doing. A related topic in the future will cover the occasional disconnects between what someone tells you was ordered for the hardware you're testing on, and what is really in the hardware system you're testing on. Same story... you need to be sure you know what you're really working with.

Monday, November 26, 2007

Green 500 ?!

Interesting. Flippingly remarked in my last post that I wondered what the energy/thermal ratings were for the Top 500 clustered systems, and then this morning stumbled on the Green 500 list at http://green500.org.

My first thought was "how in the world do you measure power consumption across so many clustered systems??" And it appears that they calculate the overall power consumption depending on what can be measured, then the overall metric is calculated. The web site even nicely provides a tutorial paper on how to calculate / determine your power consumption.

http://green500.org/docs/tutorials/tutorial.pdf

More stuff to read and try this week. Time to pull the power meter over to a small Linpack test system running the latest RHEL 5.1 release where we recently pushed a number of publishes out and see what's happening. We regularly play with Linpack on the single SMP servers to make sure they scale fairly linearly (ie: 4-core to 8-core to 16-core) and this may be a good way to apply some power consumption metrics to the performance metrics. Linpack is good because it's fairly steady-state for a relatively long period (depending of course on how big you define the problem size to be solved).

Naturally... what, when, and how to measure watts and thermal and energy consumption for systems under test are going through many debates and discussions these days in the industry these days. A whole new dimension of being able to say "well, it depends" when asked about performance and the trade-offs. If you get a chance, watch what happens with SPEC.org's SPECpower initial benchmark (at http://www.spec.org/specpower/ ). This initial benchmark is focused on CPU centric workloads, but more dimensions are undoubtedly coming.

Bill Buros

Bill leads an IBM Linux performance team in Austin Tx (the only place really to live in Texas). The team is focused on IBM's Power offerings (old, new, and future) working with IBM's Linux Technology Center (the LTC). While the focus is primarily on Power systems, the team also analyzes and improves overall Linux performance for IBM's xSeries products (both Intel and AMD) , driving performance improvements which are both common for Linux and occasionally unique to the hardware offerings.

Performance analysis techniques, tools, and approaches are nicely common across Linux. Having worked for years in performance, there are still daily reminders of how much there is to learn in this space, so in many ways this blog is simply another vehicle in the continuing journey to becoming a more experienced "performance professional". One of several journeys in life.