Monday, April 7, 2008

What "one thing" do you want in the Linux kernel?

An interesting question came by this week. If I could have one free wish, one thoughtful choice, one wise selection, what would I want in the Linux kernel?

I was of course assured that my choice didn't mean I'd get it, but it was an interesting hallway poll for the day. I suspect the answers were being pulled together as input for the Linux Foundation meetings happening this week in Austin, TX... lots of people in town.

The mind races.

So back to the Linux Weather Forecast. Whew. What to choose, what to choose.
In no particular order,
  • The new Completely Fair Scheduler (CFS) - the promise of an updated scheduler is intriguing for performance teams. Of course, on a practical level, the thought of rather extensive regression testing keeps popping up in my mind. Watching the progress of that effort is reassuring though, so this will be cool in the next revs of the distros.

  • Kernel markers - cool technology which will make it easier for tools to hook into the kernel, which of course a performance team is always interested in. We need to make it absolutely seamless and safe for a customer, on a production system, as a protected but definitely non-root user, to gather system metric information (a small sketch of that kind of non-root /proc gathering follows this list).

  • Memory use profiling - now this will be really nice - way too many times we're asked about the memory usage of an application, which is particularly dependent on other things happening in the operating system.

  • Real-time enhancements - continued integration of the real-time work happening in parallel in the community. This is proving particularly helpful in the HPC cluster space as work continues to improve the deterministic behavior of the operating system.

  • Memory fragmentation avoidance - another longer-term project which positions the kernel for far more aggressive memory management and control of varying page sizes.

  • And numerous other choices... better filesystems in ext4 and btrfs, better virtualization and container support, better energy management, improvements in glibc, etc., etc.
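
As a quick illustration of the non-root metric gathering mentioned in the kernel markers item (and the kind of memory questions in the profiling item), here's a minimal sketch in C that pulls a few system-wide numbers from /proc/meminfo and per-process numbers from /proc/self/status, all as an ordinary unprivileged user. The specific fields are just for illustration.

    /* metrics.c - minimal sketch: gather a few system-wide and
     * per-process metrics as an ordinary (non-root) user from /proc.
     * Build: gcc -Wall -o metrics metrics.c
     */
    #include <stdio.h>
    #include <string.h>

    /* Print any line from 'path' whose key matches one of 'keys'. */
    static void show(const char *path, const char *keys[], int nkeys)
    {
        char line[256];
        int i;
        FILE *fp = fopen(path, "r");

        if (!fp) {
            perror(path);
            return;
        }
        while (fgets(line, sizeof(line), fp)) {
            for (i = 0; i < nkeys; i++)
                if (strncmp(line, keys[i], strlen(keys[i])) == 0)
                    printf("%-18s %s", path, line);
        }
        fclose(fp);
    }

    int main(void)
    {
        /* System-wide memory picture - readable by any user. */
        const char *sys[]  = { "MemTotal", "MemFree", "Cached" };
        /* Memory usage of this process. */
        const char *self[] = { "VmPeak", "VmRSS", "VmData" };

        show("/proc/meminfo", sys, 3);
        show("/proc/self/status", self, 3);
        return 0;
    }
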
But really? All of these are in play today. All of them are being polished and updated, with many creative minds at work.

So what I asked for was for the kernel and Linux community to continue strengthening the ability to help customers easily understand and improve the performance of their applications and the system. Out of the box. Across all of the components. No special add-ons, RPMs, or kernels on their installed system.

The nice part is that the Linux programmers I work with are all committed to this. So the work continues - fit, finish, and polish - along with the longer-term changes being developed.

Tuesday, March 18, 2008

Reboot? Real customers don't reboot.

To get clear, clean, and repeatable performance results, performance teams in the labs generally don't think twice about rebooting the system and starting fresh (after all, who knows what someone has done on the system - particularly those kernel programmers).

In my experience though, to balance this automatic tendency, it's important to note that rebooting a system isn't normal behavior for real-life customers.

For example, when I suggest that a customer reboot his/her system, there's usually a perceptible pause, sometimes even a quiet chuckle. It turns out most customers are quite happy with their systems running Linux, and simply don't consider rebooting part of normal operations. I'm not surprised, but it does mean that the tuning options practically available to customers have to be applied dynamically, not as kernel boot options.
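
To make that concrete, here's a minimal sketch of a runtime tuning change - it writes a new value to /proc/sys/vm/swappiness, the same knob sysctl adjusts, with no reboot and no boot parameter involved. The particular tunable is just an illustration, and this one does need root.

    /* set_swappiness.c - minimal sketch of a dynamic (no-reboot) tuning
     * change: write a new value to /proc/sys/vm/swappiness, the same
     * knob 'sysctl vm.swappiness=N' adjusts.
     * Build: gcc -Wall -o set_swappiness set_swappiness.c
     * Run (as root): ./set_swappiness 10
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        const char *path = "/proc/sys/vm/swappiness";
        FILE *fp;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <0-100>\n", argv[0]);
            return 1;
        }
        fp = fopen(path, "w");
        if (!fp) {
            perror(path);               /* writing here requires root */
            return 1;
        }
        fprintf(fp, "%d\n", atoi(argv[1]));
        fclose(fp);
        printf("%s set to %s\n", path, argv[1]);
        return 0;
    }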

This aspect continues to be improved in Linux with work across the operating system. The ability to control energy consumption, system resource usage, adding/removing CPUs, and adding/removing memory are all examples of cool things being worked on in the Linux community.

For the performance team, we're particularly interested in the ability to toggle SMT on and off (something needed on Power systems), control the number of CPUs running, and minimize kernel memory fragmentation.

Some pieces are there, some are emerging, and some are being invested in. I'll hunt some examples down and post them here in the coming days.
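
As a starting point, here's a minimal sketch of the sysfs CPU hotplug interface that underpins some of this dynamic control. It takes a logical CPU offline or brings it back (as root); on Power systems, offlining the secondary hardware threads of each core is one way to approximate turning SMT off, and I believe the ppc64_cpu utility in powerpc-utils wraps this more conveniently.

    /* cpu_online.c - minimal sketch of dynamic CPU control via sysfs.
     * Usage (as root): ./cpu_online <cpu#> <0|1>
     *   0 takes the logical CPU offline, 1 brings it back online.
     * Build: gcc -Wall -o cpu_online cpu_online.c
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        char path[64];
        FILE *fp;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <cpu#> <0|1>\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/online", atoi(argv[1]));

        fp = fopen(path, "w");
        if (!fp) {
            perror(path);   /* needs root; cpu0 often can't be offlined */
            return 1;
        }
        fprintf(fp, "%d\n", atoi(argv[2]) ? 1 : 0);
        fclose(fp);

        printf("cpu%d online = %d\n", atoi(argv[1]), atoi(argv[2]) ? 1 : 0);
        return 0;
    }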

Tuesday, February 12, 2008

RHEL5.1 exploits POWER6

Whew. The months tick by quickly these days. Back in October 2007, Red Hat released a service update which officially supported and exploited IBM's POWER6 systems. The exploitation code had been worked on by LTC programmers from around the world, providing a Linux OS version that easily delivers very nice performance for enterprise customers in production mode.

The performance teams submitted a series of SPEC.org and single-system Linpack publishes at the time (see the Appendix in the white paper linked below for the SPEC list), which have since been reviewed, approved, and published. One of the things we really like to focus on when looking at performance is the "whole software stack". So a short white paper was written up to explain the various software pieces and how they were tuned. To me this is far more interesting than the bottom-line metrics and the various sparring leadership claims that bounce around the industry. 'Course, it's always fun to have those leadership claims while they last.

The paper - an interesting mix of marketing introductions and then some technical details - really is intended to focus on the basics of leveraging the platform-specific compilers (IBM's XL C/C++ and Fortran compilers), platform-specific tuned math libraries (ESSL), the Linux open source project libhugetlbfs, MicroQuill's clever SmartHeap product, FDPR-pro (which I've posted on earlier), and IBM's Java, with all of the examples run on an IBM System p 570.

Following these activities and working with various customers and interested programmers, there's some interesting work going on with using the oprofile analysis tools on Linux - especially as a non-root user - and with understanding edge cases where the performance of a workload isn't purely predictable and deterministic. What's interesting is the performance variability a workload can show when the memory a program uses isn't consistently placed local to each executing core (a classic NUMA-type issue), especially with the varying page sizes supported by the Linux operating system.
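
To make the NUMA point a bit more concrete, here's a minimal sketch (my own illustration, not something from the white paper) using libnuma to run on one node and allocate the working set on that same node - exactly the kind of local placement that keeps run-times predictable. Build with -lnuma; node 0 is just an example.

    /* numa_local.c - minimal sketch of keeping memory local to the
     * core that uses it, via libnuma.  Illustrative only.
     * Build: gcc -Wall -o numa_local numa_local.c -lnuma
     */
    #include <stdio.h>
    #include <string.h>
    #include <numa.h>

    #define BUF_SIZE (64UL * 1024 * 1024)   /* 64 MB working set */

    int main(void)
    {
        int node = 0;                       /* example node */
        char *buf;

        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }

        /* Run this process on node 0 and allocate memory there too,
         * so every access is node-local rather than remote. */
        if (numa_run_on_node(node) != 0) {
            perror("numa_run_on_node");
            return 1;
        }
        buf = numa_alloc_onnode(BUF_SIZE, node);
        if (!buf) {
            fprintf(stderr, "numa_alloc_onnode failed\n");
            return 1;
        }

        memset(buf, 0, BUF_SIZE);           /* touch the pages */
        printf("64 MB allocated and touched on node %d\n", node);

        numa_free(buf, BUF_SIZE);
        return 0;
    }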

I'll post some examples of using oprofile in different ways this month, and also an example of a Fortran program which can show varying run-times and how to address that.

Friday, February 1, 2008

Winning with Linux on new Power systems

IBM just announced two new POWER6 systems, the IBM System p 520 and IBM System p 550. In essence, the p 520 has up to 4 cores of a 4.2GHz POWER6 processor and up to 64GB of memory, while the p 550 has up to 8 cores and up to 256GB of memory. Both are sweet little systems, and I recommend checking them out.

Elisabeth Stahl has a good summary blog post on the leadership publishes submitted earlier this week for these new systems. I point to her blog since she nicely includes all of the requisite disclaimers and pointers to the data points submitted to SPEC.org and to SAP this week. It takes a couple of weeks for the various review processes to complete, so it'll be easier to comment on these once they've been reviewed, approved, and published. I have found that Linux programmers like to see the published files and walk through the specific details. Show me.

The cool part is Linux on Power continues to be a parity player in the leadership performance metrics for POWER6 customers (in this case using examples for Linpack, SPEC CPU2006, SPECjbb2005, and SAP). There are summaries of some of the submitted bottom-line numbers for AIX and Linux for the p 520 and the p 550 on IBM's website.

There's a short paper which should be out soon that discusses the simple steps and software products that can be used to achieve the best performance from POWER6 with Linux on Power. It's based on similar publishes done last October and uses actual results published on the SPEC.org website.

Monday, January 21, 2008

Workloads: Standards? Customer? Proxy?

So what's the "Best Workload" to measure performance with? Industry standard workloads? Real-life customer applications? Proxy workloads developed to mimic customer applications? Micro-benchmarks? Marketing inspired metrics?

Well, it depends. Each has a purpose. Each has its drawbacks.

Let's review.

Industry standard workloads, typified by the wide breadth of SPEC.org workloads, are benchmarks painstakingly agreed to by industry committees. The workloads are specifically intended for cross-industry comparisons of systems and software stacks, which can make the committees an "interesting" place to be. I'm not sure I would have the patience and tact needed to function effectively in that realm.

In the Linux community, I have occasionally found some resistance to the world of industry standard workloads - quick examples being SPEC CPU2006, SPECjbb2005, SPECweb2005, and many others. Trying not to over-generalize, but sometimes the community views unique performance improvements targeted at these workloads as "benchmark specials". Ok, sometimes I have seen some members of the community get more than a little excited about proposed changes which were presented as "we gotta have this for xxx benchmark".

I find this attitude frustrating at times, since the workloads usually were developed with the specific intent of addressing how customers use their systems. But note to self, I've discovered that when I can find a real-life customer example of the same workload, the proposed change (fix or improvement) is enthusiastically embraced. With that small change in how a problem is proposed, I've been very impressed with how well the broader Linux community wants to improve performance for real customers.
  • (By the way, a quick aside: some industry standard benchmarks do not pan out because really clever "benchmark specials" DO emerge, and the comparisons can then lose any customer-relevant importance.)
Real-life customer applications. So why not test with "real" applications more often? While some customers do have specialized example applications that they can share with vendors, usually the customer application represents the "jewels" of their business, so it's not at all surprising that they have no interest in sharing. So some time and energy is spent working to understand the workload characteristics, which can then be leveraged in identifying proxy workloads. Most performance teams have a good set of sample workloads which can serve as examples to look at improvements and bottlenecks.

On occasion, a team can run into a customer application which is unique and not nicely covered by their standard suite. Valerie Henson ran into this a while back, and created a nice little program called eBizzy, which eventually ended up on SourceForge with a third release just made available in January 2008. Numerous performance problems have been uncovered with programs like this, with fixes flowing back into the mainline development trees for Linux and the software stacks. There are many other examples floating around the web.

Proxy Workloads. Creating workloads to mimic customer applications is what I loosely call proxy workloads. These are handy when they are available as source code and unencumbered by various specialized licenses. The workloads can be modified to address what the performance team is focusing on, and more importantly, they're available to the programmers in the community who are trying to reproduce the problem and prototype the fix. Using proxy workloads is a good example of where the combination of customers, performance teams, and programmers can effectively address performance issues with application programs.

Micro-benchmarks. Some people use very small, simplistic programs for performance assessments. These targeted programs are often packaged together and are fairly quick and easy to execute and get simple answers from. They are most useful in practice for digging into specific areas of a system implementation, but when used to generalize an overall system's performance, they can cause headaches and misunderstandings.

I have seen several over-zealous system admins at some companies take a simple single-thread, single-process integer micro-benchmark, run it on their new 1U x86 server with 1GB of memory, then run it on a new very large system (anyone's, this is not vendor exclusive), and draw the obvious conclusion that the little 1U server is the new preferred server for the massively complex workloads planned for the datacenter. Somehow they completely miss the point that only one thread is running on the larger system with its many cores and lots of memory, leaving it 98% idle. It'd be funny if these people were just sitting in labs, but it's amazing how an authoritatively written email inside a company can create so much extra work in refuting and explaining why the person who wrote it, perhaps, just perhaps, was an idiot.
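
For the curious, here's a minimal sketch of the kind of single-thread integer micro-benchmark I'm describing; run it on a 1U box or on a many-core server and it keeps exactly one hardware thread busy either way, which is the whole point of the caution above.

    /* intloop.c - minimal sketch of a single-thread integer micro-benchmark.
     * It busies exactly one hardware thread no matter how many cores
     * and how much memory the system has.
     * Build: gcc -O2 -Wall -o intloop intloop.c
     */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        volatile unsigned long acc = 0;   /* volatile keeps the loop honest */
        unsigned long i;
        clock_t start = clock();

        for (i = 0; i < 400000000UL; i++)
            acc += i ^ (i >> 3);          /* simple integer work */

        printf("acc=%lu  cpu time=%.2fs\n", (unsigned long)acc,
               (double)(clock() - start) / CLOCKS_PER_SEC);
        return 0;
    }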

That said, micro-benchmarks do play a critical role for performance teams, simply because they do in fact highlight particular aspects of a system, which allows some very focused testing and improvement activity. Improvements have to be made in a bigger context though, since you may tweak one part of the system only to hurt another.

Marketing inspired metrics. Finally, I'll close with a quick discussion on what I term the "marketing inspired metrics". While some companies in the marketplace are really good at this, I have to chuckle and admit I'm not convinced this is one of IBM's core strengths, or for that matter, Linux.

The ability to make up new performance metrics, highlight vague workloads in the absence of any concrete comparable system criteria, and formulate an entire marketing campaign around new "performance terminology" is an art form. I shudder when marketing teams get that gleam in their eye and start talking about performance. Uh-oh.

It's so much safer when we tell them "This is Leadership", or they simply task us with "Make it Leadership". Leadership of course is always how you phrase it, and what you choose to compare to, but at least that's usually a facts-based discussion.

Focus on repeatable data. So performance teams focus on real data, supported by analysis and assessment, watching for and correcting the common mistakes in gathering data from a system. In the Linux performance teams I've been involved with, there is a fanatical dedication to automating the process of gathering and scrubbing the data, building in safeguards to catch the common mistakes, so that when work is done on a workload - be it an industry standard, a customer application, a proxy workload, or a micro-benchmark - we're focusing on finding and fixing the problems, not on the steps of gathering the data.

And I always try to avoid those marketing inspired metrics - that's for the marketing teams.

Thursday, January 17, 2008

Performance Plus

“Higher system performance—the traditional hallmark of UNIX evolution—is still critical, but no longer sufficient. New systems must deliver higher levels of performance plus availability plus efficiency. We call this Performance plus.” —Ross A. Mauri, general manager, IBM System p

While more AIX-centric than I would've hoped for in a System p article, I did see an interesting piece in IBM Systems Magazine which touches on some of the challenges emerging for performance teams in today's marketplace. The article focuses on application availability and energy efficiency as appropriate parallel focus items alongside performance metrics and benchmarks. No surprise there, but it is interesting to see IBM's senior executives emerging with a new term - "Performance Plus" - which generally means we'll be living it as a mantra within a month or two.

The challenge comes in emerging with new metrics to numerically quantify the balancing act of system/application availability, energy usage (across cooling, power draw, and peak energy demands), increasingly virtualized servers, and the classic "how fast does my application run?"

If we could figure out how to make "metrics" a series of coding requests in open-source projects, we could get things going across various mailing lists and get more people cranking out code, ideas, and brand-new metrics. In the meantime, I guess we'll start getting more creative with our Linux brethren world-wide who are working on exactly all of these issues in real-life scenarios. The balancing act is already in progress, the metrics will need to emerge over time.

Friday, January 11, 2008

Executable magic - FDPR-pro

One of the products we use quite frequently is the FDPR-pro product available from IBM. It's one of those "magic" products which allows you to easily optimize a program after it has been compiled and linked.

The program, Feedback Directed Program Restructuring, is available for Linux on Power programs from IBM's alphaWorks web site. Last week the Haifa development team refreshed the downloadable package with a new version - Version 5.4.0.17 - which is what caught my eye.

The official name is Post-Link Optimization for Linux on POWER, which makes for an acronym that has never really caught on - PLOLOP - and various other strange permutations. I seem to recall that the FDPR-pro name had a name conflict somewhere in the world, but that's the name I've been familiar with for years.

The post-link optimization has been used with database engine products, big ISV products, CPU intensive benchmarks, and customer workloads. Borrowing shamelessly from the alphaWorks web site, this product is a performance tuning utility which can improve the execution time of user-level application programs. It does this with a multi-step process: the program is run in a commonly used mode while FDPR gathers run-time profiles, and those profiles are then used to modify and optimize the executable image.

The clever part is that FDPR can be made aware of cache line sizes and specific system characteristics and optimize (if desired) for execution on a particular set of systems. The daunting aspect of FDPR is the number of possible optimizations. As happens with compilers, there are lots of optimizations and areas to focus on.

Using FDPR can become a natural part of the build process of an application: build, link, then run the extra steps needed by FDPR. FDPR is a supported product, and restructured executables can be supported in normal commercial business environments. Sometimes the performance gains are minimal, and in many cases we'll see gains in the 10% to 12% range. More is possible, but it really depends on the application. In rare cases the changes will cause an executable to core-dump, but we've found the team in Haifa extremely helpful, and they're usually able to fix problems quickly.

A product to help with the daunting number of FDPR options is another alphaWorks product called ESTO - Expert System for Tuning Optimizations. ESTO automates the process of identifying options and tuning with FDPR for Linux on Power. As explained on the ESTO project page, the approach leverages "iterative compilation" and uses an "adaptive genetic algorithm with exponential population reduction". I have a feeling that the team that built the product description had fun listing the technologies and the approaches used.

In the near future, we're going to get some examples posted up here and on some LoP wikis.

One of many bloggers.

Bill Buros

Bill leads an IBM Linux performance team in Austin, TX (the only place really to live in Texas). The team is focused on IBM's Power offerings (old, new, and future), working with IBM's Linux Technology Center (the LTC). While the focus is primarily on Power systems, the team also analyzes and improves overall Linux performance for IBM's xSeries products (both Intel and AMD), driving performance improvements which are both common to Linux and occasionally unique to the hardware offerings.

Performance analysis techniques, tools, and approaches are nicely common across Linux. Having worked for years in performance, there are still daily reminders of how much there is to learn in this space, so in many ways this blog is simply another vehicle in the continuing journey to becoming a more experienced "performance professional". One of several journeys in life.

The Usual Notice

The postings on this site are my own and don't necessarily represent IBM's positions, strategies, or opinions, try as I might to influence them.