Wednesday, October 22, 2008

Hey! Who's stealing my CPU cycles?!

I hear this every now and then on the Power systems from customers, programmers, and even peers. In the more recent distro versions, there's a new "st" column in the CPU metrics which tracks the usage of "stolen" CPU cycles, from the perspective of the CPU being measured. This "steal" column has been around for a while, but the latest releases - RHEL 5.2 and SLES 10 SP2 - include the fixes which display the intended values, so the values are getting noticed more.

I believe this "cpu cycle stealing" all came into being when things like Xen were being developed and the programmers wanted a way to account for the CPU cycles which were allocated to another partition. I suspect the programmers were looking at it from the perspective of "my partition", where something devious and nefarious was daring to steal my CPU cycles. Thus the term "stolen CPU cycles". Just guessing though.

This "steal" term is a tad unfortunate. It's been suggested that a more gentle term of "sharing" would be preferred for customers. But digging around the source code I found the term "steal" is fairly pervasive. And what's in the code, tends to end up in the man pages. Ah well.

With Power hardware, there's a mode where the two hardware threads are juggled by the Linux scheduler. This is implemented via cpu pairs (for example, cpu0 and cpu1) which represent the schedule'able individual hardware threads running on a single processor core. This is the SMT (simultaneous multi-threading) mode on Power.
  • The term "hardware thread" is with respect to the processor core. Each processor core can have two active hardware threads. Software threads and software processes are scheduled on the processor cores by the operating system via the schedule'able CPUs which correspond to the two hardware threads.
In the SMT realm, the two hardware threads running on a processor core can be considered siblings (brother or sister) of each other. So if the two hardware threads are flat-out busy with work from the operating system and evenly balanced, then each of the corresponding CPUs being scheduled is generally getting 50% of the processor core's cycles.
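If you want to see which logical CPUs are sibling threads on the same core, the kernel exposes the topology in sysfs, and on Power the ppc64_cpu utility (from the powerpc-utils package) reports the SMT mode directly. A quick sketch, assuming those interfaces are present at your distro level:

# List each logical CPU and the sibling(s) it shares a core with
# (assumes the sysfs topology files are present on your kernel level)
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    echo "$(basename $cpu): siblings $(cat $cpu/topology/thread_siblings_list)"
done

# On Power, report the current SMT mode (powerpc-utils package)
ppc64_cpu --smt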

From a performance perspective, this has tremendous advantages because the processor core can flip between the hardware threads as soon as one thread hits a short wait for things like memory accesses. Essentially the processor core can fetch instructions and service memory accesses simultaneously for the two hardware threads, which improves the efficiency of the core.

In days of old, each CPU's metrics were generally based on the premise that a CPU could get to 100% user busy. Now, the new steal column can account for the processor cycles being shared by the two SMT sibling threads, not to mention additional CPU cycles being shared with other partitions. It's still possible for an individual CPU to go to 100% user busy, while the SMT sibling thread is idle.
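If you want to peek at the raw counters vmstat works from, the steal time lives in /proc/stat; on kernels new enough to report it, it's the eighth value after the cpu label on each line. A quick way to eyeball it per CPU:

# Print the per-CPU steal counters (in USER_HZ ticks); older kernels
# simply won't have this column in /proc/stat
awk '/^cpu[0-9]/ { print $1, "steal ticks:", $9 }' /proc/stat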

For example, in the vmstat output below, the rightmost CPU column is the steal column. On an idle system, this value isn't very meaningful.

# vmstat 1
procs ---- -------memory------- ---swap-- ---io--- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 14578432 408768 943616 0 0 0 0 2 5 0 0 100 0 0
0 0 0 14578368 408768 943616 0 0 0 0 25 44 0 0 100 0 0
0 0 0 14578432 408768 943616 0 0 0 32 12 44 0 0 100 0 0
0 0 0 14578432 408768 943616 0 0 0 0 21 45 0 0 100 0 0

In the next example, pushing do-nothing work on every CPU... (in this case a four-core system, SMT was on, so 8 CPUs were available...), we'll see the vmstat "st" column quickly get to the point where the CPU cycles on average are 50% user and 50% steal.
  • Try using "top", then press the "1" key to see more easily what's happening on a per-CPU basis.

while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
# vmstat 1
procs ---- -------memory------- ---swap-- ---io--- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 0 0 14574400 408704 943488 0 0 0 0 26 42 50 0 0 0 50
8 0 0 14574400 408704 943488 0 0 0 0 11 34 50 0 0 0 50
8 0 0 14574400 408704 943488 0 0 0 0 26 42 50 0 0 0 50
8 0 0 14574656 408704 943488 0 0 0 0 10 34 50 0 0 0 50
For customers and technical people who were used to seeing their CPUs up to 100% user busy, this can be... disconcerting... but it's now perfectly normal.. even expected..
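If the sysstat package happens to be installed (and is recent enough to report %steal), mpstat gives the same per-CPU breakdown that top's "1" key shows, which makes the 50/50 split between the SMT siblings easy to watch. And don't forget to clean up the do-nothing loops when you're done:

# Per-CPU utilization at one-second intervals - watch the %steal column
mpstat -P ALL 1

# Kill off the eight background busy-loops started above (same shell)
kill $(jobs -p)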

I just wish we could distinguish between the SMT sharing of CPU cycles and the CPU cycles being shared with other partitions.

For more details on the process of sharing the CPU cycles, especially when the CPU cycles are being shared between partitions, check out this page where we dive into more (but not yet all) of the gory details...

Wednesday, October 15, 2008

Linux on Power.. links and portals..

Several of us were playing around recently counting up how many "portals" we could find for information in and around Linux for Power systems. On the performance side, we were specifically interested in seeing whether there was information "out there" that we could leverage that we weren't really aware of. We actually found a lot of performance information which I'll try to highlight more in the coming weeks.

For the portals, we did find an amazing assortment of web pages available. We hit classic marketing portals, hardware and performance information, generic Linux information, technical portals, a couple of old and outdated portals, various download sites for added-value items, lots of IBM forums on developerWorks, IBM Redbooks (always good information), and pointers to wiki pages spanning a number of subjects.

The marketing and customer teams generally point to the IBM Linux page as the primary entry point (portal). The five web tabs out there (Overview, Getting Started, Solutions, About Linux, and Resources) can get the reader to all sorts of official information.

For our list of web sites and pages, rather than just file the list in another email bucket never to be seen again, we created an index page called Quick Links to keep track of what we wanted to hunt down and bring up to date. We naturally didn't want to call it another portal. 'Course, now we're hunting down subject-matter experts (aka volunteers) to help update the various wiki pages, especially under the developerWorks Linux for Power architecture wiki. We're particularly interested in providing more of the practical details, one example being the HPC Central - Red Hat page where a series of technical wiki pages are available.

Another interesting observation is seeing IBM's classic reliance on the developerWorks forums which we listed on our Quick Links index page. The Linux community is far more used to mailing lists for interactions, questions, and development issues. Forums are fine for questions and answers, but in our mind many of the forums are rarely used, even if the technology or product covered by each forum is helpful and useful. I would expect that we'll start seeding the forums with answers to questions we get from customers, developers, and peers. Help nudge things along. Which will give us more places to link to and get the practical questions answered.

[edit'ed 10/30/2008 - we made the Quick Links page the LinuxP home page]

Thursday, October 9, 2008

SystemTap SIGUSR2 tracing

A story of how people can get all excited about "performance issues". And how simple and easy tools can help figure out the root cause.

I was involved recently in an issue with Java performance where a system had "flipped out" and gone to 100% CPU busy across all of the processor cores after a software upgrade. Numerous technical people got involved in a flurry of activity, and of course performance teams were involved because systems tend to perform poorly when the CPUs are busy 100% of the time.

Various people dug into things and determined that something was generating way-too-many SIGUSR2 signals to Java, which apparently was driving the Java engine into constant garbage collection. Many many many Java threads running. It was a sight to behold.

Naturally, everyone involved claimed that *they* were not responsible for the spurious SIGUSR2 signals. So clearly this was a hardware issue. As hardware engineers were being brought in, some peers in the SystemTap team quietly suggested that we use the SystemTap tool to help figure out who was sending the signals. Ahh, a note of reason.

Turns out there's an off-the-shelf SystemTap script available which snags the Signal.send entry point, is already instrumented to track who's triggering the signal, and then counts up the number of times each source triggers a signal.

The script showed who was triggering the signals, which was exactly the "ah-hah!" clue the technical teams wanted and needed. Shortly thereafter a fix was found and applied, and the 100% CPU busy condition was solved. With the CPUs no longer 100% busy for no particular good reason, our performance work was done. If only they were all this easy.
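The actual script is linked from the wiki page mentioned below, but for flavor, a minimal sketch along the same lines - counting who sends SIGUSR2 to whom - might look roughly like this (assuming the signal tapset on your SystemTap level exposes these variables):

# A sketch only, not the off-the-shelf script we used: count SIGUSR2
# sends by (sender, sender pid, target) until interrupted with Ctrl-C
stap -e '
global sends
probe signal.send {
    if (sig_name == "SIGUSR2")
        sends[execname(), pid(), pid_name]++
}
probe end {
    foreach ([sender, spid, target] in sends-)
        printf("%s(%d) -> %s : %d\n", sender, spid, target,
               sends[sender, spid, target])
}'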

We documented the SystemTap steps over on a wiki page in DeveloperWorks.
My only wish out of this exercise was that the SystemTap community had even more "off-the-shelf" scripts which were certified as "safe to use" on real-life customer systems. I have to admit I get a little leery of popping scripts like this on a customer system - so it's something we carefully tested first on a local system. A quick matrix of scripts which are tested and safe across distro version / platform combinations would be very nice to have.

Thursday, July 17, 2008

RHEL 5.2 and HPC performance hints

Building on the SLES 10 sp2 kernel build post from a couple of weeks ago, we got the equivalent RHEL 5.2 page posted under the developerWorks umbrella. Mostly the same conceptual steps, but a little different in the specifics. And of course, in the RHEL 5.2 example, we reverse the SLES 10 example by building a 4KB kernel, where the RHEL 5.2 kernel is "normally" based on 64KB pages. It's a good experiment to play with when you want to see the performance gains that emerge from leveraging larger page sizes.
We linked this in under the HPC Central wiki page where several of us are playing around with adding descriptive how-to's for HPC workloads based on practical experience.

See HPC Central, follow the link to the Red Hat Enterprise Linux page, which is where the kernel page is linked in. We plan to replicate these pieces for SUSE Linux Enterprise Server next month.

Tuesday, July 15, 2008

The building blocks of HPC

Top 500 again. Linpack HPL. Hitting half a teraflop on a single system.

Using RHEL 5.2 on a single IBM Power 575 system, Linux was able to hit half a teraflop with Linpack. These water-cooled systems are pretty nice. Thirty-two POWER6 cores packed into a fairly dense 2U rack form factor. These systems are designed for clusters, so 14 nodes (14 single systems) can be loaded into a single rack. Water piping winds its way into each system and over the cores (we of course had to pop one open to see how things looked and worked). The systems can be loaded with 128GB or 256GB of memory. A colleague provided a nice summary of the Linpack result over on IBM's developerWorks.

For Linux, there are several interesting pieces, especially as we look at Linpack as one of the key workloads that takes advantage of easy HPC building blocks. RHEL 5.2 comes with 64KB pages, which provide easy performance gains out of the box. The commercial compilers and math libraries provide the tailored and easy exploitation of the POWER6 systems. Running Linpack on clusters is the whole basis for the Top 500 workloads.

It's easy to take advantage of the building blocks in RHEL 5.2. OpenMPI in particular, along with the InfiniBand stack and libraries tuned for the POWER hardware, are all included. When we fire up a cluster node, we automatically install these components (a sample install command follows the list):
  • openmpi including -devel packages
  • openib
  • libehca
  • libibverbs-utils
  • openssl
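On a RHEL 5.2 node this typically boils down to a single yum line; the exact package names below are from memory and can shift a bit from release to release, so treat this as a sketch:

# Pull in the cluster building blocks on a RHEL 5.2 node
# (package names may differ slightly across releases)
yum install openmpi openmpi-devel openib libehca libibverbs-utils openssl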
These building blocks allow us to take the half-a-teraflop single-system Linpack result and begin running it "out-of-the-box" on multiple nodes. There are cluster experts around that I'm learning from. Lots of interesting new challenges in the interconnect technologies and configurations. In this realm, I'm learning that one of the technology shifts emerging is the 10GbE (10Gb Ethernet) interconnect vs InfiniBand. InfiniBand has all sorts of learning curves associated with it. Every time I try to do something with InfiniBand, I'm finding another thing to learn. It'll be interesting to see whether the 10GbE technology will be more like simply plugging in an Ethernet cable and off we go. A good summer project...

Sunday, June 29, 2008

Building a distro kernel on Power - not so bad

This should be simple. And when you know all the steps, it is. But I was surprised how challenging it's been to find easy examples of the steps to re-build a commercially shipping "distro" kernel, in this case the SLES 10 sp1 kernel.

It's probably documented cleverly in the end user documentation - but I'm far too addicted to the ease of googling compared to the inevitable drudgery of digging through user documentation. I always wonder when the "documentation community" will simply shift to wiki pages to document - and, more importantly, maintain the correctness and accessibility of - end user documentation.

For this exercise, it turns out we wanted to do something simple to a SLES 10 kernel shipping on Power. In our case, we wanted to see if we could re-build the distro kernel to support the 64KB pages available in the POWER6 hardware systems. For the performance angle, 64KB pages can often significantly improve the performance of applications. Normally, when working with the Linux community, we simply snag the latest mainline kernel and work with that, but in this case, we were really interested in the specific performance differences between the 4KB pages used today and the expected performance of 64KB pages on the same base.
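As an aside, a quick way to confirm which base page size a given kernel is running with is to ask via getconf:

# Prints 4096 on a 4KB-page kernel, 65536 on a 64KB-page kernel
getconf PAGESIZE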

Out of that exercise, we created a new wiki page which documented the steps to re-build the SLES 10 kernel. A peer, Peter Wong, has already documented the RHEL 5.2 steps; we're just waiting for some web site maintenance to complete on the IBM developerWorks infrastructure to get that page posted as well.

For the SLES 10 sp1 (and sp2) kernel re-build instructions, see the wiki page mentioned above.
Recently Jon Tollefson was playing around with the SLES 10 sp2 kernel and found that there's a file missing from the SLES 10 sp2 kernel package, so we had to comment out a line in the kernel-ppc64.spec file (modprobe.d-qeth).

One interesting aspect is I had thought the kernel re-build process would be precise and seamless. But there were a few tricks that had to be done to make it work.

One of them was adding a control so that the make would run on all of the CPUs seen by Linux.
%define jobs %(cat /proc/cpuinfo | grep processor | wc -l)
We've been playing recently on some of the sweet top-of-the-line POWER6 systems, in one case the Power 575 system with 32 cores. When running with SMT enabled, that's 64 CPUs that Linux controls. The kernel build goes very fast on that system.
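Outside of the rpm spec file, the equivalent trick for a plain kernel build is simply passing -j to make:

# One make job per logical CPU Linux sees (64 on the 32-core SMT system above)
make -j$(grep -c '^processor' /proc/cpuinfo)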

Second - and there's probably a more clever way to do this - we ended up having to unpack, modify, and re-pack the config.tar.bz2 file for the platform.
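Conceptually the steps were along these lines; the directory layout inside config.tar.bz2 differs between releases, so the path below is a placeholder rather than gospel:

# Unpack the per-arch kernel config files, tweak the ppc64 one, re-pack
tar xjf config.tar.bz2
vi config/ppc64/default     # placeholder path - adjust to the actual layout
tar cjf config.tar.bz2 config/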

The last interesting aspect was the built-in "kabi" protections. When we first re-built the kernel, the build failed because our change exceeded the kABI tolerance level. Very clever. I assume various kernel interfaces are flagged with kABI values which, when changed, cause the build to fail. In our case, we knew it would change things in the kernel, so we modified the tolerance value to allow for the kernel re-build.

So. Easy to do, easy to make changes, and for a performance team, easy to minimize how much is changing from one step to the next. By starting with a known entity in the distro kernel, we make one change, verify the performance differences, and then proceed to the next change. Simple. Methodical. Straight-forward.

Monday, May 5, 2008

PowerVM Lx86. Technology that works.

PowerVM Lx86. Running Linux 32-bit x86 apps on Power systems. Slick. Super-easy with Linux. And does it work? Absolutely.

Now, as a performance analyst, I'm often asked: "Is it a performance play?" My quick answer: "Nah - not usually..."   But for everything I've tried: "It just works" - which in and of itself is pretty cool. You really want the best performance for your app? Re-compile and run it natively. Duh. You want easy access to existing x86 compiled apps? Give this product a shot. And in some cases, the performance of the translated product is just fine for the user's needs.

In essence, this product is the flip-side of Transitive's translator technology for Apple which translates older Apple Power applications to run on the new x86-based Apple systems. Check out these web sites if you missed the technology introduction several years ago:


IBM and Transitive (http://transitive.com/customers/ibm) have already introduced the second release (Ver 1.2) of the IBM PowerVM Lx86 product.

Originally discussed in the press as p-AVE (for example, see an article from http://www.it-analysis.com/), IBM's product naming wizards must have been at work with the preliminary name of IBM System p Application Virtual Environment (System p AVE). Later they followed it with a newer official name under the IBM PowerVM umbrella: PowerVM Lx86 for x86 Linux applications. "p-AVE" certainly rolled off the tongue far more easily than the PowerVM Lx86 name. But the PowerVM naming admittedly fits better with the overall virtualization strengths of the Power line.

For a page full of pointers and interesting helpful hints, check out http://www-128.ibm.com/developerworks/linux/lx86/.

  • For example, the product is download'able from IBM's web site... gotta dig through the DeveloperWorks various registration pages - but you're looking for: IBM PowerVM Lx86 V1.2 (formerly System p Application Virtual environment or p AVE) p-ave-1.2.0.0-1.tar (8,294,400 bytes).

For a clever approach to using PowerVM Lx86, a nice demo was created which you can see on YouTube.

  • I really like this demo since it highlights one of several cases where the translation overhead of the product isn't perceptible to an end user. The demo is an actual run, no tricks. As the demo narrative says, when you try to execute an x86 application on the Power system, the x86 executable is automatically recognized and the translation environment invoked.

  • The video clip goes on to show some of the highlights of the Power line where Power logical partitions can be migrated from one physical system to another. More cool stuff.


Another example of common product usage is in the world of graphing performance results. Users can check out a really nice set of charting libraries from Advanced Software Engineering (http://www.advsofteng.com/) available with the ChartDirector product. The executable run-time libraries are available for a variety of platforms, including Linux on i386, but alas, not for Linux on ppc64 systems. But when the i386 libraries are installed on a Power system running Linux with the additional PowerVM Lx86 product, Power users can use the graphing routines directly. Again, the perceptible performance differences are minimal, and the full function of the i386 routines is available to the Power users.

The IBM web site for PowerVM Virtualization Software offerings has a good description of the capabilities of the Linux product, and of the services available to help software vendors enable their apps for native execution while still exploiting the Power systems running Linux with their existing applications.

Keep in mind there are the normal obligatory footnotes and qualifications on what i386 applications can function under this product - check out the product web sites for that information.

Finally, as a performance team, we always tend to agonize over the corner cases which highlight the performance challenges of translating an application from one system platform base to another, and there certainly are some areas where performance can be a challenge. Java is a good example. There are too many steps of translating byte codes to executables, and then those executables are translated again to execute on the Power platform, which can make for a rather poor execution path. If your Java app is a minor piece of a bigger application (the prime example is as an application installer), shrug. But whew, if you're thinking about snagging a full, comprehensive Java-based product and running it in translation mode - as opposed to verifying that the Java code runs on the Power platform - I can anticipate you may be disappointed with the performance. One would've hoped that the Java world of "write once, run anywhere" would've panned out better than the "write once, test everywhere" implementation.

In the meantime, if you need easy access to x86 executables and applications on your Power systems, give this a shot.

Monday, April 7, 2008

What "one thing" do you want in the Linux kernel?

An interesting question came by this week. If I could have one free wish, one thoughtful choice, one wise selection, what would I want in the Linux kernel?

I was of course assured that my choice didn't mean I'd get it, but it was an interesting hallway poll for the day. I suspect the answers were being pulled together as input for the Linux Foundation meetings happening this week in Austin, TX... lots of people in town.

The mind races.

So back to the Linux Weather Forecast. Whew. What to choose, what to choose.
In no particular order,
  • The new Completely Fair Scheduler - the promise of an updated scheduler is intriguing for performance teams. 'Course, on a practical level, the thought of rather extensive regression testing keeps popping up in my mind. Watching the progress of that effort is reassuring though, so this will be cool in the next revs of the distros.

  • Kernel markers - cool technology which will make it easier for tools to hook into the kernel, which of course a performance team is always interested in. We need to make it absolutely seamless and safe for a customer, on a production system, as a protected but definitely non-root user, to gather system metric information.

  • Memory use profiling - now this will be really nice - way too many times we're asked about memory usage of an application - which is particularly dependent on other things happening in the operating system.

  • Real-time enhancements - continued integration of the real-time work happening in parallel in the community. This is proving particularly helpful in the HPC cluster space as work continues to improve the deterministic behavior of the operating system.

  • Memory fragmentation avoidance - another longer-term project which positions the kernel for far more aggressive memory management and control of varying page sizes.

  • And numerous other choices... better filesystems in ext4 and btrfs, better virtualization and container support, better energy management, improvements in glibc, etc etc
But really? All of these are in play today. All of them are being polished, updated, with many creative minds at work.

So what I asked for was for the kernel and Linux community to continue strengthening the ability to help customers easily understand and improve the performance of their applications and the system. Out of the box. Across all of the components. No special adds, rpms, or kernels on their installed system.

The nice part is the Linux programmers I work with are all committed to this. So the work continues - fit, finish, and polish - and continue working with the longer term changes which are being developed.

Tuesday, March 18, 2008

Reboot? Real customers don't reboot.

To get clear, clean, and repeat'able performance results, performance teams in the labs generally don't think twice about rebooting the system and starting fresh (after all, who knows what someone has done on the system - particularly those kernel programmers).

In my experience though, to balance this automatic tendency, it's important to note that re-booting a system isn't normal behavior for real-life customers.

For example, when I suggest that a customer re-boot his/her system, there's usually a perceptible pause, sometimes even a quiet chuckle. Turns out most customers are quite happy with their systems running Linux, and simply don't consider the process of re-booting as anything normal. I'm not surprised, but it does mean that the tuning options practically available to customers have to be applied dynamically, not as kernel boot options.

This aspect continues to be improved in Linux, with work across the operating system. The ability to control energy consumption, system resource usage, adding/removing CPUs, and adding/removing memory are all examples of cool things being worked on in the Linux community.

For the performance team, we're particularly interested in the ability to control things like SMT on and off (something needed on Power systems), the number of CPUs running, and minimizing kernel memory fragmentation.
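To give a flavor of what "dynamic" looks like today (the ppc64_cpu utility comes from the powerpc-utils package, so this sketch assumes it's installed, and all of it needs root):

# Toggle SMT on a Power system without a reboot
ppc64_cpu --smt=off
ppc64_cpu --smt=on

# Take a logical CPU offline and bring it back via CPU hotplug
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online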

Some pieces are there, some are emerging, and some are being invested in. I'll hunt some examples down and post them here in the coming days.

Tuesday, February 12, 2008

RHEL5.1 exploits POWER6

Whew. The months tick by quickly these days. Back in October 2007, Red Hat released a service update which officially supported and exploited the IBM POWER6 systems. The Red Hat exploitation code had been worked on by LTC programmers from around the world, delivering a Linux OS version that handily and easily provides very nice performance for enterprise customers in production mode.

The performance teams submitted a series of SPEC.org (see the Appendix in the white paper link below for the list) and single-system Linpack publishes at the time, which have since been reviewed, approved, and published. One of the things we really like to focus on when looking at performance is the "whole software stack". So a short white paper was written up to explain the various software pieces and how they were tuned. To me this is far more interesting than the bottom-line metrics and the various sparring leadership claims that bounce around the industry. 'Course, it's always fun to have those leadership claims while they last.

The paper - an interesting mix of marketing introductions and then some technical details - really is intended to focus on the basics of leveraging the platform-specific compilers (IBM's XL C/C++ and Fortran compilers), platform-specific tuned math libraries (ESSL), a Linux open source project called libhugetlbfs, MicroQuill's clever SmartHeap product, FDPR-pro (which I've posted on earlier), and IBM's Java, with all of the examples running on the IBM System p 570 system.

Following these activities and working with various customers and interested programmers, there's some interesting work going on with using the oprofile analysis tools on Linux - especially as a non-root user - and working on understanding edge cases where the performance of a workload isn't purely predictable and deterministic. What's interesting is the possible performance variability of a workload when memory being used by a program isn't regularly placed local to each executing core (a classic NUMA-type issue), especially in the case of varying page sizes supported by the Linux operating systems.
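For the NUMA-placement piece of that variability, numactl is the usual quick check; a sketch, with ./my_app standing in for whatever workload is being measured:

# Show the node topology and how much memory is free on each node
numactl --hardware

# Pin the run to node 0's CPUs and memory so the pages stay local
# to the executing cores
numactl --cpunodebind=0 --membind=0 ./my_app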

I'll post some examples of using oprofile in different ways this month, and also an example of a Fortran program which can show varying run-times and how to address that.

Friday, February 1, 2008

Winning with Linux on new Power systems

IBM just announced two new POWER6 systems, the IBM System p 520 and IBM System p 550. In essence, the p 520 has up to 4 cores of 4.2GHz POWER6 processors and up to 64GB of memory, while the p 550 has up to 8 cores and up to 256GB of memory. They're sweet little systems, and I recommend checking them out.

Elisabeth Stahl has a good summary blog post on the leadership publishes submitted earlier this week for these new systems. I point to her blog since she nicely has all of the requisite disclaimers and pointers to the data points submitted to SPEC.org and to SAP this week. It takes a couple of weeks for the various review processes to complete, so it'll be easier to comment on these once they've been reviewed, approved, and published. I have found that Linux programmers like to see the published files and walk through the specific details. Show me.

The cool part is Linux on Power continues to be a parity player in the leadership performance metrics for POWER6 customers (in this case using examples for Linpack, SPEC CPU2006, SPECjbb2005, and SAP). There are summaries of some of the submitted bottom-line numbers for AIX and Linux for the p 520 and the p 550 on IBM's website.

There's a short paper which should be out soon that discusses the simple steps and software products that can be used on Linux on Power to achieve the best performance for POWER6 using Linux. It's based on similar publishes done last October and uses actual results that are published on the SPEC.org website.

Monday, January 21, 2008

Workloads: Standards? Customer? Proxy?

So what's the "Best Workload" to measure performance with? Industry standard workloads? Real-life customer applications? Proxy workloads developed to mimic customer applications? Micro-benchmarks? Marketing inspired metrics?

Well, it depends. Each has a purpose. Each has its drawbacks.

Let's review.

Industry standard workloads, typified by the wide breadth of SPEC.org workloads, are benchmarks painstakingly agreed to by industry committees. The workloads are specifically intended for cross-industry comparisons of systems and software stacks, which can make the committees an "interesting" place to be. I'm not sure I would have the patience and tact needed to function effectively in that realm.

In the Linux community, I have occasionally found some resistance to the world of industry standard workloads, with quick examples being things like SPECcpu2006, SPECjbb2005, SPECweb2005, and many others. Trying not to over-generalize, but sometimes the community views unique performance improvements targeted at these workloads as "benchmark specials". Ok, sometimes I have seen some members of the community get more than a little excited about a proposed change which was presented as "we gotta have this for xxx benchmark".

I find this attitude frustrating at times, since the workloads usually were developed with the specific intent of addressing how customers use their systems. But note to self, I've discovered that when I can find a real-life customer example of the same workload, the proposed change (fix or improvement) is enthusiastically embraced. With that small change in how a problem is proposed, I've been very impressed with how well the broader Linux community wants to improve performance for real customers.
  • (by the way, quick aside, some industry standard benchmarks do not pan out because really clever "benchmark specials" DO emerge, and the comparisons then can lose any customer relevant importance)
Real-life customer applications. So why not test with "real" applications more often? While some customers do have specialized example applications that they can share with vendors, usually the customer application represents the "jewels" of their business, so it's not at all surprising that they have no interest in sharing. So some time and energy is spent working to understand the workload characteristics, which can then be leveraged in identifying proxy workloads. Most performance teams have a good set of sample workloads which can serve as examples to look at improvements and bottlenecks.

On occasion, a team can run into a customer application which is unique and not nicely covered by their standard suite. Valerie Henson ran into this a while back, and created a nice little program called eBizzy, which eventually ended up on SourceForge with a third release just made available in January 2008. Numerous performance problems have been uncovered with programs like this, with fixes flowing back into the mainline development trees for Linux and the software stacks. There are many other examples floating around the web.

Proxy Workloads. Creating workloads to mimic customer applications is what I loosely call proxy workloads. These are handy when they are available as source code and unencumbered with various specialized licenses. The workloads can be modified to address what the performance team is focusing on, and more importantly, they're available to the programmers in the community who are trying to reproduce the problem and prototype the fix. Using proxy workloads is a good example of where the combination of customers, performance teams, and programmers can effectively address performance issues with application programs.

Micro-benchmarks. Some people use very small, simplistic programs for performance assessments. These targeted programs are often packaged together and are fairly quick and easy to execute and get simplistic answers from. In practice they are most useful for digging into specific areas of a system implementation, but when used to generalize an overall system's performance, they can cause headaches and misunderstandings.

I have seen several over-zealous system admins at some companies take a simple single-thread, single-process integer micro-benchmark, run it on their new 1U x86 server with 1GB of memory, then run it on a new very large system (anyone's, this is not vendor exclusive), and make the obvious conclusion that the new little 1U server is the preferred server system for the massively complex workloads planned for the datacenter. Somehow they completely miss the point that only one thread is running on the larger system with many cores and lots of memory, leaving it 98% idle. It'd be funny if these people were just sitting in labs, but it's amazing how an authoritatively written email inside a company can create so much extra work in refuting and explaining why the person who wrote it, perhaps, just perhaps, was an idiot.

That said, micro-benchmarks do play a critical role with performance teams, simply because they do in fact highlight particular aspects of a system, which allows some very focused testing and improvement activity. Improvements have to be made in a bigger context though, since you may tweak one part of the system, only to hurt another part.

Marketing inspired metrics. Finally, I'll close with a quick discussion on what I term the "marketing inspired metrics". While some companies in the marketplace are really good at this, I have to chuckle and admit I'm not convinced this is one of IBM's core strengths, or for that matter, Linux.

The ability to make up new performance metrics, highlighting vague workloads in the absence of any concrete comparable system criteria, and formulate an entire marketing campaign around new "performance terminology" is an art form. I shudder when marketing teams get that gleam in their eye and start talking about performance. Uh-oh.

It's so much safer when we tell them "This is Leadership", or they simply task us with "Make it Leadership". Leadership of course is always how you phrase it, and what you choose to compare to, but at least that's usually a facts-based discussion.

Focus on repeat'able data. So performance teams focus on real data, which is supported by analysis and assessments of the data, watching for and correcting the common mistakes in gathering data from a system. In Linux performance teams I've been involved with, there is a fanatical dedication to automating the process of gathering and scrubbing the data, building in safe-guards to catch the common mistakes, so that when work is done on a workload, be it industry standard, a customer application, a proxy workload, or a micro-benchmark, we're focusing on finding and fixing the problems, not on the steps of gathering the data.

And I always try to avoid those marketing inspired metrics, that's for the marketing teams.

Thursday, January 17, 2008

Performance Plus

“Higher system performance—the traditional hallmark of UNIX evolution—is still critical, but no longer sufficient. New systems must deliver higher levels of performance plus availability plus efficiency. We call this Performance plus.” —Ross A. Mauri, general manager, IBM System p

While more AIX-centric than I would've hoped for from a System p article, I did see an interesting piece in IBM Systems Magazine which touches on some of the challenges emerging for performance teams in today's marketplace. The article focuses on application availability and energy efficiency as appropriate parallel focus items in addition to performance metrics and benchmarks. No surprise there, but it is interesting to see IBM's senior executives emerging with a new term - "Performance Plus" - which generally means we'll be living that as a mantra within a month or two.

The challenge comes in coming up with new metrics to numerically quantify the balancing act of system/application availability, energy usage (across cooling, power draw, and peak energy demands), increasingly virtualized servers, and the classic "how fast does my application run?"

If we could figure out how to make "metrics" a series of coding requests in open-source projects, we could get things going across various mailing lists and get more people cranking out code, ideas, and brand-new metrics. In the meantime, I guess we'll start getting more creative with our Linux brethren world-wide who are working on exactly all of these issues in real-life scenarios. The balancing act is already in progress, the metrics will need to emerge over time.

Friday, January 11, 2008

Executable magic - FDPR-pro

One of the tools we use quite frequently is the FDPR-pro product available from IBM. It's one of those "magic" products which allow you to easily optimize a program after it has been compiled and linked.

The program, Feedback Directed Program Restructuring, is available for Linux on Power from IBM's alphaWorks web site. Last week the Haifa development team updated the download'able version to a new release - Version 5.4.0.17 - which is what caught my eye.

The official name is Post-Link Optimization for Linux on POWER, which makes for an acronym that has never really caught on - PLOLOP - and various other strange permutations. I seem to recall that the FDPR-pro name had a name conflict somewhere in the world, but that's the name I've been familiar with for years.

The post-link optimization has been used with database engine products, big ISV products, CPU intensive benchmarks and customer workloads. Borrowing shamelessly from the alphaWorks web site, this product is a performance tuning utility which can improve the execution time of user-level application programs. It does this with a multi-step process of running the program in a commonly used mode, where FDPR gathers run-time profiles and uses those profiles to modify/optimize the executable image.
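The flow looks roughly like the sketch below; the exact fdprpro flags vary by release, and my_app and its input are just placeholders, so treat these as approximations from memory and check the product documentation before relying on them:

# 1. Instrument the linked executable (flag names approximate)
fdprpro -a instr -p my_app -o my_app.instr

# 2. Run the instrumented binary under a representative workload;
#    this produces a run-time profile alongside the executable
./my_app.instr typical-input

# 3. Feed the profile back in to produce the optimized executable
fdprpro -a opt -p my_app -f my_app.nprof -o my_app.fdpr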

The clever part is that FDPR can be made aware of cache line sizes and specific system characteristics and optimize (if desired) for execution on a particular set of systems. The daunting aspect of FDPR is the number of possible optimizations. As happens with compilers, there are lots of optimizations and areas to focus on.

Using FDPR can become a natural part of the build process of an application. Build, link, then run the extra steps needed by FDPR. FDPR is a supported product, and re-structured executables can be supported in normal commercial business environments. Sometimes the performance gains are minimal, and in many cases we'll see gains in the 10% to 12% range. More is possible, but it really depends on the application. In rare cases the changes will cause an executable to core-dump, but we've found the team in Haifa extremely helpful and they are usually able to fix problems quickly.

To help with the daunting number of FDPR options, there's another alphaWorks product called ESTO - Expert System for Tuning Optimizations. ESTO automates the process of identifying options and tuning with FDPR for Linux on Power. As explained on the ESTO project page, the approach leverages "iterative compilation" and uses an "adaptive genetic algorithm with exponential population reduction". I have a feeling that the team that wrote the product description had fun listing the technologies and the approaches used.

In the near future, we're going to get some examples posted up here and on some LoP wikis.

One of many bloggers.

Bill Buros

Bill leads an IBM Linux performance team in Austin, TX (the only place really to live in Texas). The team is focused on IBM's Power offerings (old, new, and future), working with IBM's Linux Technology Center (the LTC). While the focus is primarily on Power systems, the team also analyzes and improves overall Linux performance for IBM's xSeries products (both Intel and AMD), driving performance improvements which are both common for Linux and occasionally unique to the hardware offerings.

Performance analysis techniques, tools, and approaches are nicely common across Linux. Having worked for years in performance, there are still daily reminders of how much there is to learn in this space, so in many ways this blog is simply another vehicle in the continuing journey to becoming a more experienced "performance professional". One of several journeys in life.

The Usual Notice

The postings on this site are my own and don't necessarily represent IBM's positions, strategies, or opinions, try as I might to influence them.